Stop manually looking through network responses
Intercepting JSON API responses is a great strategy in web scraping and (to a lesser extent) testing. However, manually searching through network responses in the inspector can get annoying.
That's why in this article I'll look at using a library such as Playwright or Puppeteer to automate the process. I'll work through scripts that:
- Log the URLs of JSON responses containing a target string
- Locate the precise values within those JSON responses
- Traverse all sibling objects and extract a full array
- Operate on the data programmatically or write it to disk
The code examples are in Puppeteer but can be adapted to Playwright, and rely primarily on vanilla JS data structure manipulation.
I'll assume you have worked with nested data structures in JavaScript and have exposure to Puppeteer or Playwright.
Capturing JSON responses
Let's build from the ground up, starting with a simple program that visits a site and logs all of the JSON responses containing a target substring.
The site under automation is a tiny page that makes two simple JSON API requests. Although this scenario is heavily simplified for reproducibility, the code to automate it should be equally applicable to realistic use cases.
Take a minute to check out the jsonplaceholder mocks to see what the structure looks like. We're mainly interested in the /users response. The /comments response will be ignored, which helps ensure that our interception logic only matches what it's supposed to.
Note that you'll typically need to perform clicking and typing actions to trigger requests you can intercept. I'll skip these interactions for brevity in this article. In the real world, adding actions won't change the fundamental interception ideas described here.
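Here's a minimal sketch of what such a page could look like. It simply fires the two fetch calls on load; the markup itself doesn't matter for interception purposes:

```html
<!DOCTYPE html>
<html>
<body>
<h1>JSON interception demo</h1>
<script>
// Two JSON API requests fire on load; only /users contains our target string
fetch("https://jsonplaceholder.typicode.com/users")
  .then((res) => res.json())
  .then((data) => console.log(data));
fetch("https://jsonplaceholder.typicode.com/comments")
  .then((res) => res.json())
  .then((data) => console.log(data));
</script>
</body>
</html>
```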
To run this site, save it in a directory as `index.html`, then start a web server using (for example) `python -m http.server 8000`. The site will be live at http://localhost:8000.
Here's the first Node Puppeteer script, which navigates to the site running on localhost and logs response URLs with JSON containing the target substring. You can save it as `scrape.js` and run it with `node scrape`. I used Node 22 and Puppeteer 23.3.0.
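A sketch of what this script might look like; the content-type check and `JSON.parse` guard are one reasonable way to restrict matching to JSON bodies:

```js
const puppeteer = require("puppeteer");

const TARGET = "bifurcated";

(async () => {
  const browser = await puppeteer.launch();

  try {
    const [page] = await browser.pages();

    page.on("response", async (res) => {
      try {
        // Only consider responses that declare a JSON content type
        const contentType = res.headers()["content-type"] ?? "";
        if (!contentType.includes("application/json")) {
          return;
        }

        const text = await res.text();
        JSON.parse(text); // skip bodies that aren't valid JSON

        if (text.includes(TARGET)) {
          console.log(res.url());
        }
      } catch (err) {
        // Ignore bodies that fail to load or parse
      }
    });

    // networkidle0 keeps the script alive until outstanding requests settle
    await page.goto("http://localhost:8000", { waitUntil: "networkidle0" });
  } finally {
    await browser.close();
  }
})();
```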
This code should log https://jsonplaceholder.typicode.com/users, since that's the URL of the JSON response that includes the substring `"bifurcated"` within its response text.
Note that `networkidle0` isn't the optimal `goto` predicate for most cases. It's used here to keep the script alive until requests resolve, but a more precise approach will be revealed later in this article.
This code block will be the basis for later versions, which will omit the imports, error handling, and IIFE for brevity.
Although this code checks JSON responses in particular, it works on non-JSON text responses as well with slight modifications: remove the content-type check and the JSON parse attempt.
Locating a value in a JSON response
The basic code above is already useful. It programmatically detects which URL contains a particular piece of data we're interested in, which is normally done manually in the network tab in the browser developer tools.
But it's a rough approximation: the substring search might identify a key rather than a value in the structure, or confuse characters you're searching for with JSON structure characters that aren't part of the data. Traversing and identifying a precise value in the parsed data structure avoids these issues:
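Here's a sketch of that traversal. The helper (call it `containsRecursive`, the name used in the rest of this article) walks the parsed structure and compares only leaf values against the target:

```js
// Walk a parsed JSON structure and check leaf values only, ignoring
// keys and structural characters
const containsRecursive = (data, target) => {
  if (typeof data === "string") {
    return data.includes(target); // substring match; use === for exact
  }
  if (Array.isArray(data)) {
    return data.some((item) => containsRecursive(item, target));
  }
  if (data !== null && typeof data === "object") {
    return Object.values(data).some((value) =>
      containsRecursive(value, target)
    );
  }
  return false;
};

page.on("response", async (res) => {
  const contentType = res.headers()["content-type"] ?? "";
  if (!contentType.includes("application/json")) {
    return;
  }
  const data = await res.json().catch(() => null);
  if (data && containsRecursive(data, "bifurcated")) {
    console.log(res.url());
  }
});
```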
For the price of more code and a minor performance hit, this approach precisely determines if a particular value exists in a JSON response. Adjust case sensitivity and exact matching in the comparison as fits your needs.
Extracting an array of objects from a JSON response
The search target isn't usually the only piece of data to extract. Rather, it's typically one easily-identifiable item in an array of similar items. The next step is to traverse all sibling objects.
Doing this involves determining the path to the target data, which can be done with modifications to the `containsRecursive` function:
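One possible modification accumulates the path of keys and indices taken to reach each match, rather than returning a boolean (the function name here is my own):

```js
// Collect the path of keys/indices to every leaf value matching the target
const findPathsRecursive = (data, target, path = [], paths = []) => {
  if (typeof data === "string") {
    if (data.includes(target)) {
      paths.push(path);
    }
  } else if (data !== null && typeof data === "object") {
    // Object.entries works for both arrays and plain objects;
    // array indices arrive as strings, which is fine for lookups
    for (const [key, value] of Object.entries(data)) {
      findPathsRecursive(value, target, [...path, key], paths);
    }
  }
  return paths;
};
```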
This complexity can be abstracted away using the `object-scan` package:
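Something like the following should work, assuming `object-scan`'s `filterFn` option (consult the package docs for the exact API):

```js
const objectScan = require("object-scan");

// Return the paths (arrays of keys/indices) of all leaf strings
// containing the target
const findPaths = (data, target) =>
  objectScan(["**"], {
    filterFn: ({ value }) =>
      typeof value === "string" && value.includes(target),
  })(data);
```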
Since multiple paths may lead to matching values, this function returns all such paths. It could be modified to return a generator or return the first matching path, as desired.
Once a path is found, it can be expanded to identify the value it leads to:
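A reduce over the path keys does the trick; `valueAtPath` is a name I've invented here:

```js
// Follow a path of keys/indices down to the value it points to
const valueAtPath = (data, path) =>
  path.reduce((node, key) => node[key], data);

// e.g. valueAtPath(users, ["2", "company", "catchPhrase"])
```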
But more importantly, the path can be analyzed to identify an enclosing array and traverse that array:
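Here's one way that might look: track where along the path the deepest array was entered, then return that array so its elements (the target's siblings) can be looped over:

```js
// Find the deepest enclosing array along a path and return it
const extractEnclosingArray = (data, path) => {
  // Position in the path where the deepest array was entered
  let lastIndex = -1;
  let node = data;
  for (let i = 0; i < path.length; i++) {
    if (Array.isArray(node)) {
      lastIndex = i;
    }
    node = node[path[i]];
  }
  if (lastIndex === -1) {
    return null; // no array anywhere on this path
  }
  // Re-walk the path up to, but not including, the array element index
  return path.slice(0, lastIndex).reduce((node, key) => node[key], data);
};
```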
`lastIndex` can be modified to choose a parent array other than the deepest. This is useful for handling response arrays like the following.
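For example, consider a payload shaped like this (the field names are invented for illustration):

```json
{
  "records": [
    { "id": 1, "names": ["Amy", "Bob"] },
    { "id": 2, "names": ["Cam", "Dee"] }
  ]
}
```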
Rather than capturing `["Amy", "Bob"]` as the original code would if given a search target of `"Bob"`, you can use a function to find the second-to-last array index in the path to use as the basis for looping:
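A sketch: collect every position where an array is entered, then pick the second-to-last one, falling back to the deepest when there's only one:

```js
// Like extractEnclosingArray, but target the second-to-last array on
// the path so whole records are captured rather than inner name arrays
const extractParentArray = (data, path) => {
  const arrayIndices = [];
  let node = data;
  for (let i = 0; i < path.length; i++) {
    if (Array.isArray(node)) {
      arrayIndices.push(i);
    }
    node = node[path[i]];
  }
  // Prefer the second-to-last array; fall back to the deepest
  const lastIndex =
    arrayIndices.length > 1
      ? arrayIndices.at(-2)
      : arrayIndices.at(-1) ?? -1;
  if (lastIndex === -1) {
    return null;
  }
  return path.slice(0, lastIndex).reduce((node, key) => node[key], data);
};

// With a search target of "Bob", this captures the whole records array
// rather than ["Amy", "Bob"]
```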
Making this truly dynamic is use case specific and left as an exercise.
Here's how these functions can be used to improve the ongoing example:
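A sketch combining the pieces above, using the `findPaths` and `extractEnclosingArray` helpers:

```js
page.on("response", async (res) => {
  const contentType = res.headers()["content-type"] ?? "";
  if (!contentType.includes("application/json")) {
    return;
  }
  const data = await res.json().catch(() => null);
  if (!data) {
    return;
  }
  for (const path of findPaths(data, "bifurcated")) {
    // Log the full array of siblings enclosing the matched value
    console.log(extractEnclosingArray(data, path));
  }
});
```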
Here's the output, abbreviated to a few fields per element (the data is jsonplaceholder's /users fixture):
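```json
[
  {
    "id": 1,
    "name": "Leanne Graham",
    "username": "Bret",
    "email": "Sincere@april.biz"
  },
  {
    "id": 2,
    "name": "Ervin Howell",
    "username": "Antonette",
    "email": "Shanna@melissa.tv"
  },
  {
    "id": 3,
    "name": "Clementine Bauch",
    "username": "Samantha",
    "email": "Nathan@yesenia.net"
  }
]
```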
Results have been truncated to 3 elements, but we're now extracting the full array of siblings based on a search term. This is usually the target output from a scrape.
Consuming the data
The examples so far only fire off logs. Let's look at how you can operate on the data programmatically or write it to disk for further analysis.
It's possible to promisify `page.on("response", handler)` so you can invoke it from the main promise chain, but this is already available with `page.waitForResponse(predicate)`:
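A sketch of the pattern, here writing the matched payload to disk (the filename and substring predicate are illustrative; the predicate could be swapped for `containsRecursive` from earlier):

```js
const fs = require("node:fs/promises");
const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();

  try {
    const [page] = await browser.pages();

    // Register the waiter before triggering navigation so the
    // response can't slip past unobserved
    const responsePromise = page.waitForResponse(async (res) => {
      const contentType = res.headers()["content-type"] ?? "";
      if (!contentType.includes("application/json")) {
        return false;
      }
      const text = await res.text().catch(() => "");
      return text.includes("bifurcated");
    });

    await page.goto("http://localhost:8000");
    const data = await (await responsePromise).json();

    // Operate on the data programmatically, or persist it for later
    console.log(data.length);
    await fs.writeFile("users.json", JSON.stringify(data, null, 2));
  } finally {
    await browser.close();
  }
})();
```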
The above can be modified to use any arbitrary predicate. If you want to wait for multiple responses, you can use a counter with a promisified `page.on("response", handler)` or use `Promise.all()`, depending on your needs.
Note that the `page.waitForResponse` listener is set prior to the event that triggers the target response to be captured. In most cases, this will be a click.
Converting analysis into code
An interesting extension of network response analysis is using GPTs to extract the JSON payload, and even generate code to traverse the nested structure.
It's tempting to use GPTs to automate websites in real time, but JSON structure processing is far simpler than live page automation. Timing, visibility, and messy document structures can easily thwart GPT automation attempts at the time of writing, though this may change as GPT capabilities improve.
Using a GPT for response JSON processing offers significant effort and code savings, but comes at the cost of speed and reliability, and likely introduces dependencies. Large response payloads can pose problems for GPT analysis.
GPTs can be effective for prototyping, but for scaling up, it's probably best to convert to traditional, deterministic processing code.
You can also use traditional code to convert the paths used throughout this article into code, but that's out of scope for this post.
Conclusion
In this post, I've built up a series of techniques that assist in programmatically extracting data from payloads during Puppeteer or Playwright automation. This code can be used with modifications for various use cases, automating some of the drudgery in web scraping tasks and helping create more reliable scraping scripts that intercept data before it reaches the document.
There are no silver bullets in web scraping that can handle any page, but hopefully this provides a surprisingly general approach that can help extract data from many single-page applications and provide the foundation for other high-level scraping tricks, tools, and techniques.