The following article is inspired by the talk above by Greg Gorlen, delivered as part of The Browser Conference we hosted last month.
Improving the performance of web scraping with Puppeteer and Playwright
When scraping websites, it's easy to fall into suboptimal patterns. They may not break anything right away, but they're inefficient ways of doing things.
They can cause scripts to interact with UI elements before the corresponding API responses are ready, slow scraping down, make scripts more likely to break, and make debugging harder.
In this article you'll learn about these antipatterns, especially in event handling and promises in web scraping using Puppeteer or Playwright.
#1: Avoid Using page.on for Single Responses; Use waitForResponse Instead
When dealing with network events in web scraping, many developers use the page.on method to listen for responses. In both Playwright and Puppeteer, page.on attaches a listener for events the page emits, including network responses, so your code can decide when the site is ready for further processing.
While this is useful for handling multiple events or continuous monitoring, using it for single responses can introduce unnecessary complexity and inefficiency.
Common Usage of page.on Method
The page.on method handles events the page emits. For example, you might use it to listen for network requests, responses, console messages, or errors.
Here's an example showing how to use page.on to listen for network responses.
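As a minimal sketch of this pattern, the helper below attaches a listener that logs every response. It assumes `page` is a Puppeteer or Playwright Page object; `logResponses` and its `log` parameter are illustrative names, not part of either library.

```javascript
// Sketch: log the status and URL of every response the page emits.
// `page` is assumed to be a Puppeteer or Playwright Page.
function logResponses(page, log = console.log) {
  const onResponse = (response) => {
    // response.status() and response.url() are available in both libraries
    log(`${response.status()} ${response.url()}`);
  };

  page.on("response", onResponse);

  // Return the handler so the caller can detach it later with page.off
  return onResponse;
}
```

The listener stays attached until `page.off("response", handler)` is called, so this shape suits continuous monitoring rather than waiting for one response.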
The next section covers the pitfalls of using page.on for single responses.
Pitfalls with page.on for Single Responses
Event Listener Management
When using page.on for single responses, you need to manage event listeners manually. You have to add each listener and remember to remove it, which makes the code harder to maintain and debug.
If not properly managed, event listeners can accumulate and cause memory leaks. See this example below:
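The sketch below shows one way listeners can pile up. It assumes `page` is a Puppeteer or Playwright Page object; the function name `scrapeLeaky` and the `/api/` filter are hypothetical and only serve the illustration.

```javascript
// Anti-pattern sketch: a new "response" listener is attached for every URL
// and never removed, so listeners accumulate as the loop runs.
async function scrapeLeaky(page, urls) {
  const seen = [];

  for (const url of urls) {
    // BUG: this listener is never detached with page.off
    page.on("response", (response) => {
      if (response.url().includes("/api/")) {
        seen.push(response.url());
      }
    });

    await page.goto(url);
  }

  return seen;
}
```

After visiting three URLs, three listeners are still attached. Each later response is handled once per attached listener, so matches are recorded more than once and the closures stay in memory for as long as the page lives.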
Unnecessary Complexity
Using page.on for single responses adds unnecessary complexity to your code. You need additional logic to ensure the listener fires only once and is then removed, which makes your codebase harder to read and maintain.
Common Usage of waitForResponse Method
The waitForResponse method in Playwright and Puppeteer is designed to wait for a specific network response that matches a given condition.
The waitForResponse method helps sync user actions on the page with API responses. It waits until a response matching the given pattern arrives, or throws an error if that takes too long. This simplifies waiting for a single response by removing the need for manual event listener management.
For example:
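Here is a hedged sketch of the idea, assuming `page` is a Puppeteer or Playwright Page object. The helper name `clickAndGetJson`, the selector, and the URL fragment are placeholders chosen for illustration.

```javascript
// Sketch: click an element and return the JSON body of the API response
// that the click triggers.
async function clickAndGetJson(page, selector, urlPart) {
  // Start waiting BEFORE the click so a fast response cannot be missed
  const responsePromise = page.waitForResponse(
    (response) => response.url().includes(urlPart) && response.ok()
  );

  await page.click(selector);

  // Resolves with the first matching response, or rejects on timeout
  const response = await responsePromise;
  return response.json();
}

// Possible usage (hypothetical selector and endpoint):
// const data = await clickAndGetJson(page, "#search", "/api/search");
```

Creating the promise first and awaiting it after the action is the key design choice: awaiting waitForResponse before clicking would wait for a response that has not been requested yet.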
Benefits of Using waitForResponse for Single Responses
Using waitForResponse provides several advantages over page.on for handling single network responses:
- Simplicity - The code is more straightforward to understand.
- Improved Performance - Reduces the overhead of managing event listeners.
- Reduced Memory Usage - Eliminates the risk of memory leaks associated with lingering event listeners.
- Listener removed automatically - waitForResponse returns a promise that resolves with the response, and the event listener is automatically removed when the promise is settled.
Difference Between page.on and waitForResponse
Using page.on for a response would be:
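A rough sketch of the manual approach, assuming `page` is a Puppeteer or Playwright Page object (the helper name `waitForApiResponseWithOn` is made up for this example):

```javascript
// Sketch: wait for one matching response using page.on.
// The listener has to be wrapped in a promise and removed by hand.
function waitForApiResponseWithOn(page, urlPart) {
  return new Promise((resolve) => {
    const onResponse = (response) => {
      if (!response.url().includes(urlPart)) return;

      // Manual cleanup: without this line the listener would linger
      page.off("response", onResponse);
      resolve(response);
    };

    page.on("response", onResponse);
  });
}
```

Note that this version has no timeout, so if no matching response ever arrives the promise never settles and the listener is never removed.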
This can be refactored using waitForResponse:
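A sketch of the waitForResponse form, again assuming `page` is a Puppeteer or Playwright Page object and using a hypothetical helper name, `waitForApiResponse`:

```javascript
// Sketch: the same single-response wait, expressed with waitForResponse.
// The library adds and removes the underlying listener for us.
function waitForApiResponse(page, urlPart) {
  return page.waitForResponse((response) =>
    response.url().includes(urlPart)
  );
}

// Possible usage (hypothetical endpoint):
// const response = await waitForApiResponse(page, "/api/items");
```

The promise wrapper, the early-exit check, and the manual page.off call all disappear, and the library's default timeout applies.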
Best Practices for Using waitForResponse
- Define Clear Conditions - Ensure the condition used in waitForResponse accurately identifies the desired response.
- Handle Timeouts - Use appropriate timeout values to avoid indefinite waiting.
- Error Handling - Implement error handling to manage cases where the expected response is not received.
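The practices above could be combined roughly as follows. This is a sketch, not a definitive implementation: `waitForResponseOrNull` is a hypothetical helper, the 5000 ms default is arbitrary, and the code assumes the library rejects with an error whose `name` is "TimeoutError" when the wait times out.

```javascript
// Sketch: wait for a matching response with an explicit timeout and
// return null instead of throwing when the response never arrives.
async function waitForResponseOrNull(page, urlPart, timeout = 5000) {
  try {
    return await page.waitForResponse(
      // Clear condition: match on a URL fragment and a successful status
      (response) => response.url().includes(urlPart) && response.ok(),
      { timeout }
    );
  } catch (error) {
    // Assumption: timeouts are reported with error.name === "TimeoutError"
    if (error.name === "TimeoutError") return null;

    // Anything else is unexpected, so let the caller see it
    throw error;
  }
}
```

Returning null for timeouts while rethrowing other errors keeps "the response did not arrive" separate from real failures such as a closed page.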
#2: Keep Code Flat and Avoid Nesting for Better Readability
Code nesting is when functions, loops, or conditionals are placed inside one another. While some nesting is often necessary, excessive nesting can significantly impact code readability. Deeply nested code can obscure the logical flow of your program, making it difficult to follow and understand.
Common Pitfalls of Deeply Nested Code
Deeply nested code can create a convoluted and tangled flow of execution. When multiple levels of nesting are involved, it becomes challenging to trace the path of execution, which can hinder comprehension and collaboration among developers.
Nesting also introduces more points of failure in your code. Each additional level of nesting increases the likelihood of bugs and makes it harder to debug and maintain your code. Lastly, mistakes such as misplaced braces or statements placed at the wrong nesting level are more common in nested structures.
How to Keep Your Code Flat
Break Down Tasks into Smaller Functions
One effective strategy to avoid deep nesting is to break down complex tasks into smaller, more manageable functions. By encapsulating specific functionality into separate functions, you can maintain a flat structure and improve code modularity.
The code snippet below demonstrates the difference between nested code and flat code:
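As an illustrative sketch, the two versions below collect page titles for a list of URLs. They assume `page` is a Puppeteer or Playwright Page object; all function names (`scrapeNested`, `loadOk`, `scrapeOne`, `scrapeFlat`) are invented for this example.

```javascript
// Nested version: every check adds another level of indentation.
async function scrapeNested(page, urls) {
  const results = [];
  for (const url of urls) {
    const response = await page.goto(url);
    if (response) {
      if (response.ok()) {
        const title = await page.title();
        if (title) {
          results.push({ url, title });
        }
      }
    }
  }
  return results;
}

// Flat version: each step lives in its own small function.

// Navigate and report whether the page loaded successfully.
async function loadOk(page, url) {
  const response = await page.goto(url);
  return Boolean(response) && response.ok();
}

// Scrape one URL; returns null when there is nothing to record.
async function scrapeOne(page, url) {
  const loaded = await loadOk(page, url);
  const title = loaded ? await page.title() : "";
  return title ? { url, title } : null;
}

// The loop now reads as a short list of steps.
async function scrapeFlat(page, urls) {
  const results = [];
  for (const url of urls) {
    const item = await scrapeOne(page, url);
    if (item) results.push(item);
  }
  return results;
}
```

Both versions return the same results. The flat one never goes more than two levels deep, and each helper can be tested on its own.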
Using Early Returns
Early returns can help simplify conditional logic by exiting a function early when a certain condition is met. This reduces the need for nested conditionals and helps keep the code flat.
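A small sketch of the difference, using a hypothetical helper that turns scraped price text into a number. The function names and the parsing rules are assumptions made for the example.

```javascript
// Nested conditionals: the happy path is buried three levels deep.
function parsePriceNested(text) {
  if (text) {
    const match = text.match(/(\d+(?:\.\d+)?)/);
    if (match) {
      const value = Number(match[1]);
      if (value > 0) {
        return value;
      }
    }
  }
  return null;
}

// Early returns: each guard exits as soon as the input is unusable.
function parsePrice(text) {
  if (!text) return null; // nothing was scraped

  const match = text.match(/(\d+(?:\.\d+)?)/);
  if (!match) return null; // no number in the text

  const value = Number(match[1]);
  return value > 0 ? value : null; // reject zero prices
}
```

Each guard clause handles one failure case and gets it out of the way, so the remaining code can assume valid input.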
#3: Use await Instead of .then for Promise Handling
Promises provide a cleaner alternative to callbacks by allowing you to chain operations and handle errors more effectively.
The .then method allows you to specify what should be done when a promise resolves or rejects.
While functional, it can lead to nested and complex code structures. On the other hand, the await keyword, used inside an async function, pauses the function's execution until the promise resolves or rejects, providing a more linear and readable flow.
A key benefit of await is that it allows asynchronous code to be written in a style that resembles synchronous code.
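To make the contrast concrete, here is a sketch of the same task written both ways. It assumes `page` is a Puppeteer or Playwright Page object; the function names and the `h1` selector are placeholders.

```javascript
// With .then: each step that needs the previous result nests one level deeper.
function getHeadingWithThen(page, url) {
  return page.goto(url).then(() =>
    page.waitForSelector("h1").then((heading) =>
      heading.evaluate((el) => el.textContent.trim())
    )
  );
}

// With await: the same steps read top to bottom like synchronous code.
async function getHeading(page, url) {
  await page.goto(url);
  const heading = await page.waitForSelector("h1");
  return heading.evaluate((el) => el.textContent.trim());
}
```

Both functions return a promise for the heading text, so callers can switch from one to the other without changes.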
When using .then, error handling requires an explicit .catch call at the end of the promise chain. In contrast, await integrates seamlessly with try...catch blocks, enabling more consistent and centralized error management.
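The two error-handling styles might look like this in practice. This is a sketch assuming `page` is a Puppeteer or Playwright Page object; the helper names and the choice to return null on failure are illustrative.

```javascript
// With .then: errors are handled by a .catch at the end of the chain.
function getTitleWithThen(page, url) {
  return page
    .goto(url)
    .then(() => page.title())
    .catch((error) => {
      console.error(`Failed to scrape ${url}: ${error.message}`);
      return null;
    });
}

// With await: one try...catch covers every awaited step inside it.
async function getTitle(page, url) {
  try {
    await page.goto(url);
    // "return await" keeps a rejection from page.title() inside this try block
    return await page.title();
  } catch (error) {
    console.error(`Failed to scrape ${url}: ${error.message}`);
    return null;
  }
}
```

With try...catch, synchronous exceptions and rejected promises are handled in the same place, so adding more steps inside the block does not require touching the error handling.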
Want An Easy Way to Deploy Playwright or Puppeteer?
I hope this article gave you ideas about writing more efficient code for your web scraping.
If you also want an easy way to host and run your scripts, try out Browserless. Just connect your scripts to our pool of managed browsers, and we'll take care of scaling your scraping.
Try it today with a 7-day trial.