Event Handling and Promises in Web Scraping

August 8, 2024


This article is inspired by a talk by Greg Gorlen, delivered as part of The Browser Conference we hosted last month.

Improving the performance of web scraping with Puppeteer and Playwright

When scraping websites, it's easy to fall into suboptimal patterns. They usually won't break a script outright, but they are inefficient ways of doing things.

They can lead to interacting with UI elements before the corresponding API responses are ready, slower performance, more fragile scripts, and harder debugging.

In this article you'll learn about these antipatterns, especially in event handling and promises in web scraping using Puppeteer or Playwright.

#1: Avoid Using `page.on` for Single Responses; Use `waitForResponse` Instead

When dealing with network events in web scraping, many developers use the page.on method to listen for responses. In both Playwright and Puppeteer, page.on attaches a listener for events the page emits, including network responses, so your script can decide when the site is ready for further processing.

While this is useful for handling multiple events or continuous monitoring, using it for single responses can introduce unnecessary complexity and inefficiency.

Common Usage of `page.on` Method

The page.on method handles events the page emits. For example, you might use it to listen for network requests, responses, console messages, or errors.

Here's an example showing how to use page.on to listen for network responses.


page.on("response", response => {
  if (response.url().includes("api/data")) {
    //  Handle the response
  }
});

The next section covers the pitfalls of using page.on for single responses.

Pitfalls with `page.on` for Single Responses

Event Listener Management

When using page.on for single responses, you need to manage event listeners manually. You have to add each listener and remove it yourself, which makes the code harder to maintain and debug.

If not properly managed, event listeners can accumulate and cause memory leaks. The example below adds a listener and later removes it explicitly:


function logRequest(interceptedRequest) {
  console.log("A request was made:", interceptedRequest.url());
}
page.on("request", logRequest);
// Sometime later...
page.removeListener("request", logRequest);
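
To make the accumulation concrete, here is a minimal sketch. The helper names `watchForData` and `watchForDataWithCleanup` are hypothetical, and `page` is assumed to be a Puppeteer or Playwright page that exposes `on` and `off`. The first helper adds a new listener on every call and never removes it; the second returns a function the caller can use to detach the listener.

```javascript
// Anti-pattern: every call adds another "response" listener that is never removed.
function watchForData(page, onData) {
  page.on("response", response => {
    if (response.url().includes("api/data")) onData(response);
  });
}

// Safer variant: keep a reference to the handler and return a cleanup function.
function watchForDataWithCleanup(page, onData) {
  const handler = response => {
    if (response.url().includes("api/data")) onData(response);
  };
  page.on("response", handler);
  return () => page.off("response", handler);
}
```

Calling `watchForData` once per scraped URL on a long-lived page would leave one extra listener behind each time, whereas the second variant lets you call the returned function as soon as you are done.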

Unnecessary Complexity

Using page.on for single responses adds unnecessary complexity to your code. It would entail including additional logic to ensure the listener is only triggered once and removed. This can make your codebase more difficult to read and maintain.
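
As a rough illustration of that extra logic, the sketch below hand-rolls a one-shot wait on top of page.on. The helper name `firstMatchingResponse` and its parameters are hypothetical, and `page` is assumed to expose `on` and `off`. Note how much bookkeeping (a timer, a cleanup step, a promise wrapper) is needed just to react to one response.

```javascript
// Hypothetical helper: resolves with the first response matching `predicate`,
// or rejects if none arrives within `timeoutMs`.
function firstMatchingResponse(page, predicate, timeoutMs = 30000) {
  return new Promise((resolve, reject) => {
    // Always clear the timer and detach the listener, whichever path finishes first.
    const cleanup = () => {
      clearTimeout(timer);
      page.off("response", handler);
    };
    const handler = response => {
      if (!predicate(response)) return; // Ignore unrelated responses.
      cleanup();
      resolve(response);
    };
    const timer = setTimeout(() => {
      cleanup();
      reject(new Error(`No matching response within ${timeoutMs} ms`));
    }, timeoutMs);
    page.on("response", handler);
  });
}
```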

Common Usage of `waitForResponse` Method

The waitForResponse method in Playwright and Puppeteer is designed to wait for a specific network response that matches a given condition.

The waitForResponse method helps sync user actions on the page with API responses. It waits until a matching response arrives, or throws an error if that takes too long. This simplifies waiting for a single response because no manual event listener management is needed.

For example:


const response = await page.waitForResponse(response => response.url().includes("api/data"));
// Handle the response
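
When the response is triggered by a user action, a common approach is to start waiting before performing the action, so a fast response is not missed. The following is a minimal sketch under that assumption; the helper name `clickAndWaitForResponse` and its arguments are hypothetical.

```javascript
// Hypothetical helper: clicks `selector` and returns the API response it triggers.
async function clickAndWaitForResponse(page, selector, urlFragment) {
  // The wait is registered first (array items are evaluated in order),
  // then the click fires the request.
  const [response] = await Promise.all([
    page.waitForResponse(res => res.url().includes(urlFragment)),
    page.click(selector),
  ]);
  return response;
}

// Example usage (selector and URL fragment are placeholders):
// const response = await clickAndWaitForResponse(page, "#load-more", "api/data");
```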

Benefits of Using `waitForResponse` for Single Responses

Using waitForResponse provides several advantages over page.on for handling single network responses:

Simplicity - The code is more straightforward to understand.
Improved Performance - Reduces the overhead of managing event listeners.
Reduced Memory Usage - Eliminates the risk of memory leaks associated with lingering event listeners.
Listener removed automatically - waitForResponse returns a promise that resolves with the response, and the event listener is automatically removed when the promise is settled.

Difference Between `page.on` and `waitForResponse`

Using page.on for a response would be:


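// Use a named handler so it can detach itself after the first match
const handleResponse = response => {
  if (response.url().includes("api/data")) {
    // Process the response
    // Remove the event listener to avoid memory leaks
    page.off("response", handleResponse);
  }
};
page.on("response", handleResponse);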

which can be refactored using waitForResponse:


const response = await page.waitForResponse(
  response => response.url().includes("api/data") && response.status() === 200,
  { timeout: 60000 }
);
// Process the response

Best Practices for Using `waitForResponse`

Define Clear Conditions - Ensure the condition used in waitForResponse accurately identifies the desired response.
Handle Timeouts - Use appropriate timeout values to avoid indefinite waiting.
Error Handling - Implement error handling to manage cases where the expected response is not received.
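
The three practices above could be combined roughly as follows. This is a sketch, not a definitive implementation: the helper name `fetchApiData` is hypothetical, and it assumes the timeout error thrown by the library has the name `TimeoutError`.

```javascript
// Hypothetical helper: returns the parsed JSON body of the matching response,
// or null if no matching response arrives in time.
async function fetchApiData(page, urlFragment, timeoutMs = 10000) {
  try {
    const response = await page.waitForResponse(
      // Clear condition: match both the URL and a successful status.
      res => res.url().includes(urlFragment) && res.status() === 200,
      { timeout: timeoutMs }
    );
    return await response.json();
  } catch (error) {
    // Assumption: the library's timeout error is named "TimeoutError".
    if (error.name === "TimeoutError") {
      console.warn(`No response matching "${urlFragment}" within ${timeoutMs} ms`);
      return null;
    }
    throw error; // Anything else is unexpected, so surface it to the caller.
  }
}
```

Returning null on a timeout is only one possible design; depending on the script, retrying or rethrowing may be more appropriate.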

#2: Keep Code Flat and Avoid Nesting for Better Readability

Code nesting is when functions, loops, or conditionals are placed inside one another. While some nesting is often necessary, excessive nesting can significantly impact code readability. Deeply nested code can obscure the logical flow of your program, making it difficult to follow and understand.

Common Pitfalls of Deeply Nested Code

Deeply nested code can create a convoluted and tangled flow of execution. When multiple levels of nesting are involved, it becomes challenging to trace the path of execution, which can hinder comprehension and collaboration among developers.

Nesting also introduces more points of failure in your code. Each additional level of nesting increases the likelihood of bugs and makes it harder to debug and maintain your code. Lastly, logical errors, such as incorrect indentation or misplaced braces, are more common in nested structures.

How to Keep Your Code Flat

Break Down Tasks into Smaller Functions

One effective strategy to avoid deep nesting is to break down complex tasks into smaller, more manageable functions. By encapsulating specific functionality into separate functions, you can maintain a flat structure and improve code modularity.

The code snippet below demonstrates the difference between nested code and flat code:


// Nested Code Example
async function processPage(page) {
  await page.goto("https://browserless.io/");
  const data = await page.evaluate(() => {
    const elements = document.querySelectorAll(".item");
    const results = [];
    elements.forEach(element => {
      const text = element.textContent;
      if (text.includes("keyword")) {
        results.push(text);
      }
    });
    return results;
  });
  return data;
}

// Refactored Flat Code Example
async function processPage(page) {
  await page.goto("https://browserless.io/");
  const data = await extractData(page);
  return data;
}

async function extractData(page) {
  return await page.evaluate(() => {
    const elements = document.querySelectorAll(".item");
    return Array.from(elements)
      .filter(element => element.textContent.includes("keyword"))
      .map(element => element.textContent);
  });
}

Using Early Returns

Early returns can help simplify conditional logic by exiting a function early when a certain condition is met. This reduces the need for nested conditionals and helps keep the code flat.


// Nested Code Example
function validateUser(user) {
  if (user) {
    if (user.isActive) {
      if (user.age > 18) {
        return true;
      }
    }
  }
  return false;
}

// Refactored Flat Code Example
function validateUser(user) {
  if (!user) return false;
  if (!user.isActive) return false;
  if (user.age <= 18) return false; // Mirrors the nested check: only ages above 18 pass
  return true;
}

#3: Use `await` Instead of `.then` for Promise Handling

Promises provide a cleaner alternative to callbacks by allowing you to chain operations and handle errors more effectively.

The .then method allows you to specify what should be done when a promise resolves or rejects.

While functional, .then chains can lead to nested and complex code structures. The await keyword, used inside an async function, instead pauses the function's execution until the promise resolves or rejects, giving a more linear and readable flow.

A key benefit of await is that it allows asynchronous code to be written in a style that resembles synchronous code.


// Using .then
promiseFunction()
  .then(result => {
    console.log(result);
    return anotherPromise(result);
  })
  .then(anotherResult => {
    console.log(anotherResult);
  })
  .catch(error => {
    console.error(error);
  });

// Using await
async function asyncFunction() {
  try {
    const result = await promiseFunction();
    console.log(result);
    const anotherResult = await anotherPromise(result);
    console.log(anotherResult);
  } catch (error) {
    console.error(error);
  }
}

When using .then, error handling requires an explicit .catch call at the end of the promise chain. In contrast, await integrates seamlessly with try...catch blocks, enabling more consistent and centralized error management.
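
Applied to a scraping flow, centralized error handling might look like the sketch below. The helper name `scrapeTitle` and its parameters are hypothetical; `page` is assumed to be a Puppeteer or Playwright page.

```javascript
// Hypothetical helper: loads a page, waits for an element, and returns the
// page title, or null if any step fails.
async function scrapeTitle(page, url, selector) {
  try {
    await page.goto(url);
    await page.waitForSelector(selector);
    return await page.title();
  } catch (error) {
    // A failure in any of the three awaited steps above lands here.
    console.error(`Failed to scrape ${url}:`, error.message);
    return null;
  }
}
```

A single try...catch covers navigation, waiting, and extraction, so there is one place to log, retry, or clean up.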

Want An Easy Way to Deploy Playwright or Puppeteer?

I hope this article gave you ideas about writing more efficient code for your web scraping.

If you also want an easy way to host and run your scripts, try out Browserless. Just connect your scripts to our pool of managed browsers, and we'll take care of scaling your scraping.

Try it today with a 7-day trial.
