Searching API URLs and Extracting Objects with Puppeteer and Playwright

Stop manually looking through network responses

Intercepting JSON API responses is a great strategy in web scraping and (to a lesser extent) testing. However, manually searching through network responses in the inspector can get annoying.

That's why in this article I'll look at using a library such as Playwright or Puppeteer to automate the process. I'll build up a series of scripts that:

  • Log the URLs of JSON responses containing a target string
  • Locate the precise values within those JSON responses
  • Traverse all sibling objects and extract a full array
  • Operate on the data programmatically or write it to disk

The code examples are in Puppeteer but can be adapted to Playwright, and rely primarily on vanilla JS data structure manipulation.

I'll assume you have worked with nested data structures in JavaScript and have exposure to Puppeteer or Playwright.

Capturing JSON responses

Let's build from the ground up, starting with a simple program that visits a site and logs all of the JSON responses containing a target substring.

The site under automation is a tiny page that makes two simple JSON API requests. Although this scenario is heavily simplified for reproducibility, the code to automate it should be equally applicable to realistic use cases.


<!doctype html>
<html lang="en">
  <body>
    <script>
      fetch("https://jsonplaceholder.typicode.com/comments");
      fetch("https://jsonplaceholder.typicode.com/users");
    </script>
  </body>
</html>

Take a minute to check out the jsonplaceholder mocks to see what the structure looks like. We're mainly interested in the /users response. The /comments response will be ignored, which helps ensure that our interception logic only matches what it's supposed to.

Note that you'll typically need to perform clicking and typing actions to trigger requests you can intercept. I'll skip these interactions for brevity in this article. In the real world, adding actions won't change the fundamental interception ideas described here.

To run this site, save it in a directory as index.html, then start a web server using (for example) python -m http.server 8000. The site will be live at http://localhost:8000.

Here's the first Node Puppeteer script, which navigates to the site running on localhost and logs response URLs with JSON containing the target substring. You can save it as scrape.js and run it with node scrape. I used Node 22 and Puppeteer 23.3.0.


import puppeteer from "puppeteer";

let browser;
(async () => {
  const url = "http://localhost:8000";
  const searchTerm = "bifurcated";

  browser = await puppeteer.launch();
  const [page] = await browser.pages();

  const handleResponse = async response => {
    const contentType = response.headers()["content-type"];

    // Only consider JSON responses
    if (
      !contentType ||
      !contentType.includes("application/json")
    ) {
      return;
    }

    // Skip responses whose body doesn't parse as JSON
    const data = await response.json().catch(() => false);

    if (!data) {
      return;
    }

    // Log the URL if the response text contains the search term
    const text = await response.text();

    if (text.toLowerCase().includes(searchTerm.toLowerCase())) {
      console.log(response.url());
    }
  };

  page.on("response", handleResponse);
  await page.goto(url, {waitUntil: "networkidle0"});
  page.off("response", handleResponse);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
  

This code should log https://jsonplaceholder.typicode.com/users since that's the URL of the JSON response that includes the substring "bifurcated" within its response text.

Note that networkidle0 isn't the ideal waitUntil value for page.goto in most cases. It's used here to keep the script alive until the requests resolve; a more precise approach is shown later in this article.

This code block is the basis for the versions that follow, which omit the imports, error handling, and IIFE for brevity.

Although this code checks JSON responses in particular, it works on non-JSON text responses as well with slight modifications. Remove the content type checks and JSON parse attempt.
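
As a rough sketch, a handler for arbitrary text responses could look like this (it simply drops the content type and JSON checks from the handler above):

const handleResponse = async response => {
  // Some responses (redirects, for example) have no retrievable body
  const text = await response.text().catch(() => "");

  if (text.toLowerCase().includes(searchTerm.toLowerCase())) {
    console.log(response.url());
  }
};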

Locating a value in a JSON response

The basic code above is already useful. It programmatically detects which URL contains a particular piece of data we're interested in, which is normally done manually in the network tab in the browser developer tools.

But it's a rough approximation: the substring search might identify a key rather than a value in the structure, or confuse characters you're searching for with JSON structure characters that aren't part of the data. Traversing and identifying a precise value in the parsed data structure avoids these issues:


const containsRecursive = (data, searchTerm) => {
  const normalizedSearchTerm = searchTerm.toLowerCase();

  const search = obj => {
    if (Array.isArray(obj)) {
      for (const e of obj) {
        const found = search(e);

        if (found) {
          return true;
        }
      }
    } else if (typeof obj === "object" && obj !== null) {
      for (const key in obj) {
        const found = search(obj[key]);

        if (found) {
          return true;
        }
      }
    } else if (typeof obj === "string") {
      if (obj.toLowerCase().includes(normalizedSearchTerm)) {
        return true;
      }
    }

    return false;
  };

  return search(data);
};

// ...

const url = "http://localhost:8000";
const searchTerm = "bifurcated";

browser = await puppeteer.launch();
const [page] = await browser.pages();

const handleResponse = async response => {
  const contentType = response.headers()["content-type"];

  if (
    !contentType ||
    !contentType.includes("application/json")
  ) {
    return;
  }

  const data = await response.json().catch(() => false);

  if (!data) {
    return;
  }

  if (containsRecursive(data, searchTerm)) {
    console.log(response.url());
  }
};

page.on("response", handleResponse);
await page.goto(url, {waitUntil: "networkidle0"});
page.off("response", handleResponse);

// ...


For the price of more code and a minor performance hit, this approach precisely determines if a particular value exists in a JSON response. Adjust case sensitivity and exact matching in the comparison as fits your needs.
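
For example, the string check in containsRecursive could be factored into a small predicate and swapped out as needed. A minimal sketch of two variants:

// Substring, case-insensitive (the behavior used above)
const fuzzyMatch = (value, searchTerm) =>
  typeof value === "string" &&
  value.toLowerCase().includes(searchTerm.toLowerCase());

// Exact, case-sensitive
const exactMatch = (value, searchTerm) =>
  typeof value === "string" && value === searchTerm;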

Extracting an array of objects from a JSON response

The search target isn't usually the only piece of data to extract. Rather, it's typically one easily-identifiable item in an array of similar items. The next step is to traverse all sibling objects.

Doing this involves determining the path to the target data, which can be done with modifications to the containsRecursive function:


const findPaths = (data, searchTerm) => {
  const paths = [];
  const normalizedSearchTerm = searchTerm.toLowerCase();

  const search = (obj, path) => {
    if (Array.isArray(obj)) {
      for (const [i, e] of obj.entries()) {
        search(e, [...path, i]);
      }
    } else if (typeof obj === "object" && obj !== null) {
      for (const key in obj) {
        search(obj[key], [...path, key]);
      }
    } else if (typeof obj === "string") {
      if (obj.toLowerCase().includes(normalizedSearchTerm)) {
        paths.push(path);
      }
    }
  };

  search(data, []);
  return paths;
};


This complexity can be abstracted away using the object-scan package:


import objectScan from "object-scan"; // ^19.0.5

const findPaths = (data, searchTerm) => {
  return objectScan(["**"], {
    filterFn: ({value}) =>
      typeof value === "string" &&
      value.toLowerCase().includes(searchTerm.toLowerCase()),
  })(data);
};


Since multiple paths may lead to matching values, this function returns all such paths. It could be modified to return a generator or return the first matching path, as desired.
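
For example, a first-match variant of the recursive findPaths might look like this (a sketch that returns as soon as a single matching path is found):

const findFirstPath = (data, searchTerm) => {
  const normalizedSearchTerm = searchTerm.toLowerCase();

  const search = (obj, path) => {
    if (Array.isArray(obj)) {
      for (const [i, e] of obj.entries()) {
        const found = search(e, [...path, i]);

        if (found) {
          return found;
        }
      }
    } else if (typeof obj === "object" && obj !== null) {
      for (const key in obj) {
        const found = search(obj[key], [...path, key]);

        if (found) {
          return found;
        }
      }
    } else if (typeof obj === "string") {
      if (obj.toLowerCase().includes(normalizedSearchTerm)) {
        return path;
      }
    }

    return null;
  };

  // Returns the first matching path, or null if there's no match
  return search(data, []);
};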

Once a path is found, it can be expanded to retrieve the object it leads to. Slicing off the final key returns the object that encloses the matched value, rather than the string itself:


const expandPath = (data, path) => {
  let value = data;

  for (const key of path.slice(0, -1)) {
    value = value[key];
  }

  return value;
};
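
For instance, with the /users payload and the path found for "bifurcated" (both shown in the sample output later in this article), expandPath returns the enclosing company object rather than the catchPhrase string:

// Path to the matching string in the /users response
const path = [2, "company", "catchPhrase"];

// Dropping the final key yields the enclosing object:
// { name: 'Romaguera-Jacobson', catchPhrase: 'Face to face bifurcated interface', ... }
const company = expandPath(data, path);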


But more importantly, the path can be analyzed to identify an enclosing array and traverse that array:


const extractDataByPath = (data, path) => {
  const flattened = [];

  // Number of path segments remaining after the deepest numeric index;
  // flattening stops once only this many segments are left in the path
  const lastIndex =
    path.length -
    path.findLastIndex(e => typeof e === "number") -
    1;

  const flatten = (obj, path) => {
    if (path.length === lastIndex) {
      flattened.push(obj);
    } else if (Array.isArray(obj)) {
      if (typeof path[0] !== "number") {
        throw Error("Malformed path");
      }

      for (const chunk of obj) {
        flatten(chunk, path.slice(1));
      }
    } else {
      flatten(obj[path[0]], path.slice(1));
    }
  };

  flatten(data, path);
  return flattened;
};


lastIndex can be modified to choose a parent array other than the deepest. This is useful for handling response arrays like the following.


[
  {names: ["Amy", "Bob"], id: 1},
  {names: ["Chris", "David"], id: 2},
];


Given a search target of "Bob", the original code would flatten out the individual name strings rather than the objects that contain them. To capture the objects instead, you can use a function to find the second-to-last numeric index in the path and use it as the basis for looping:


function findNthLastNumberIndex(arr, n = 1) {
  let count = 0;

  for (let i = arr.length - 1; i >= 0; i--) {
    if (typeof arr[i] === "number") {
      if (++count === n) {
        return i;
      }
    }
  }

  return -1;
}
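
As a sketch, the helper can then feed into extractDataByPath by deriving lastIndex from the second-to-last numeric index instead of the last one. With the names example above, this collects the two objects rather than the four name strings:

// Derive the stopping depth from the second-to-last numeric index in the path
const secondLastNumberIndex = findNthLastNumberIndex(path, 2);

// Note: findNthLastNumberIndex returns -1 when the path contains fewer than
// two numeric indices, so guard for that case in real use
const lastIndex = path.length - secondLastNumberIndex - 1;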


Making this truly dynamic is use case specific and left as an exercise.

Here's how these functions can be used to improve the ongoing example:


// ...
const url = "http://localhost:8000";
const searchTerm = "bifurcated";

browser = await puppeteer.launch();
const [page] = await browser.pages();

const handleResponse = async response => {
  const contentType = response.headers()["content-type"];

  if (
    !contentType ||
    !contentType.includes("application/json")
  ) {
    return;
  }

  const data = await response.json().catch(() => false);

  if (!data) {
    return;
  }

  const matchingPaths = findPaths(data, searchTerm);
  if (matchingPaths.length === 0) {
    return;
  }

  console.log("_".repeat(60));
  console.log(response.url().slice(0, 60));

  for (const path of matchingPaths) {
    const value = expandPath(data, path);
    const flattened = extractDataByPath(data, path);
    console.log({
      path,
      value,
      extractedData: flattened.slice(0, 3),
    });
  }
};

page.on("response", handleResponse);
await page.goto(url, {waitUntil: "networkidle0"});
page.off("response", handleResponse);
// ...


Here's the output:


https://jsonplaceholder.typicode.com/users
{
  path: [ 2, 'company', 'catchPhrase' ],
  value: {
    name: 'Romaguera-Jacobson',
    catchPhrase: 'Face to face bifurcated interface',
    bs: 'e-enable strategic applications'
  },
  extractedData: [
    {
      id: 1,
      name: 'Leanne Graham',
      username: 'Bret',
      email: 'Sincere@april.biz',
      address: {
        street: 'Kulas Light',
        suite: 'Apt. 556',
        city: 'Gwenborough',
        zipcode: '92998-3874',
        geo: { lat: '-37.3159', lng: '81.1496' }
      },
      phone: '1-770-736-8031 x56442',
      website: 'hildegard.org',
      company: {
        name: 'Romaguera-Crona',
        catchPhrase: 'Multi-layered client-server neural-net',
        bs: 'harness real-time e-markets'
      }
    },
    {
      id: 2,
      name: 'Ervin Howell',
      username: 'Antonette',
      email: 'Shanna@melissa.tv',
      address: {
        street: 'Victor Plains',
        suite: 'Suite 879',
        city: 'Wisokyburgh',
        zipcode: '90566-7771',
        geo: { lat: '-43.9509', lng: '-34.4618' }
      },
      phone: '010-692-6593 x09125',
      website: 'anastasia.net',
      company: {
        name: 'Deckow-Crist',
        catchPhrase: 'Proactive didactic contingency',
        bs: 'synergize scalable supply-chains'
      }
    },
    {
      id: 3,
      name: 'Clementine Bauch',
      username: 'Samantha',
      email: 'Nathan@yesenia.net',
      address: {
        street: 'Douglas Extension',
        suite: 'Suite 847',
        city: 'McKenziehaven',
        zipcode: '59590-4157',
        geo: { lat: '-68.6102', lng: '-47.0653' }
      },
      phone: '1-463-123-4447',
      website: 'ramiro.info',
      company: {
        name: 'Romaguera-Jacobson',
        catchPhrase: 'Face to face bifurcated interface',
        bs: 'e-enable strategic applications'
      }
    }
  ]
}


Results have been truncated to 3 elements, but we're now extracting the full array of siblings based on a search term. This is usually the target output of a scrape.

Consuming the data

The examples so far only fire off logs. Let's look at how you can operate on the data programmatically or write it to disk for further analysis.

It's possible to promisify page.on("response", handler) so you can await it from the main promise chain, but this functionality is already available as page.waitForResponse(predicate):


// ...
const responsePredicate = async response => {
  const contentType = response.headers()["content-type"];

  if (
    !contentType ||
    !contentType.includes("application/json")
  ) {
    return false;
  }

  const data = await response.json().catch(() => false);

  if (!data) {
    return false;
  }

  const text = await response.text();
  return text.toLowerCase().includes(searchTerm.toLowerCase());
};
const responsePromise = page.waitForResponse(responsePredicate);
await page.goto(url, {waitUntil: "domcontentloaded"});
const response = await responsePromise;
// process response using the extraction functions in the previous sections
console.log(response.url());


The above can be modified to use any arbitrary predicate. If you want to wait for multiple responses, you can use a counter with a promisified page.on("response", handler) or use Promise.all(), depending on your needs.

Note that the page.waitForResponse listener is set up before the event that triggers the target response. In most real-world cases, this trigger will be a click, as sketched below.
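
Putting these pieces together, here's a rough sketch that registers the listener, triggers the request with a hypothetical click (the #load-users selector is illustrative; the demo page fires its requests on load), and writes the extracted array to disk using findPaths and extractDataByPath from the earlier sections:

import {writeFile} from "node:fs/promises";

// ...

// Register the listener first, then fire the triggering action
const [response] = await Promise.all([
  page.waitForResponse(responsePredicate),
  page.click("#load-users"), // hypothetical trigger element
]);

// Reuse the extraction helpers from the earlier sections
const data = await response.json();
const [path] = findPaths(data, searchTerm);
const records = extractDataByPath(data, path);

// Persist the extracted array for further analysis
await writeFile("records.json", JSON.stringify(records, null, 2));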

Converting analysis into code

An interesting extension of network response analysis is using GPTs to extract the JSON payload, and even generate code to traverse the nested structure.

It's tempting to use GPTs to automate websites in real time, but JSON structure processing is far simpler than live page automation. Timing, visibility, and messy document structures can easily thwart GPT automation attempts at the time of writing, though this may change as GPT capabilities improve.

Using a GPT for response JSON processing offers significant effort and code savings, but comes at the cost of speed and reliability, and likely introduces dependencies. Large response payloads can pose problems for GPT analysis.

GPTs can be effective for prototyping, but for scaling up, it's probably best to convert to traditional, deterministic processing code.

You can also use traditional code to convert the paths gathered throughout this article into extraction code, but that's out of scope for this post.

Conclusion

In this post, I've built up a series of techniques that assist in programmatically extracting data from payloads during Puppeteer or Playwright automation. This code can be used with modifications for various use cases, automating some of the drudgery in web scraping tasks and helping create more reliable scraping scripts that intercept data before it reaches the document.

There are no silver bullets in web scraping that can handle any page, but hopefully this provides a fairly general approach that can help extract data from many single-page applications and serve as a foundation for other high-level scraping tricks, tools, and techniques.
