Puppeteer is a great tool for web scraping, data extraction and automated testing. It uses headless Chrome to perform tasks efficiently.
However, complex tasks can make your execution times creep up. This post will look at key reasons behind slow automations and how to speed them back up.
Why does automation speed matter?
Optimizing your automations has two key benefits:
- If you're running on-demand automations, it gives the user a faster response time.
- Even if you're only running scheduled automations, it reduces resource consumption.
We've seen developers wrestle with growing hosting bills caused by slow automations, either on their self-built browser hosting or when using a slow competitor.
Check out this case study of getting 5x faster scraping with Browserless.
What's slowing down your Puppeteer scripts?
Before diving into the solutions, it's crucial to identify what might be holding up your Puppeteer scripts. Several factors can contribute to the sluggishness:
- Network Latency in Loading Resources: Web pages often contain numerous resources like images, CSS, and JavaScript files. Loading all these resources can significantly increase the time your automation script takes to complete a task.
- Proxies: While proxies are invaluable for tasks like web scraping and bypassing bot detection mechanisms, they can introduce latency, especially if the proxy server is slow or geographically distant from the target website's server.
- Headful Chrome: Running Chrome in headless mode (without a GUI) is generally faster and consumes less memory, but it can be a giveaway that you're a bot and can render pages inconsistently. Using a headful browser often gets better results, but it requires more processing.
- Geolocation: If you're running your automations through a proxy or on a server, the geolocation can introduce a delay for loading up webpages.
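To see which of these factors applies to a given page, it can help to count what the page actually requests. The following is a minimal sketch, assuming `page` is a Puppeteer Page; the helper name `trackRequests` is hypothetical and not part of Puppeteer. It relies only on the `request` event and `resourceType()`.

```javascript
// Attach to a page before navigating. Returns a function that reports
// how many requests of each resource type have been seen so far and
// how much wall-clock time has passed since tracking started.
function trackRequests(page) {
  const countsByType = {};
  const startedAt = Date.now();

  page.on('request', (request) => {
    // resourceType() returns values such as 'image', 'stylesheet' or 'script'.
    const type = request.resourceType();
    countsByType[type] = (countsByType[type] || 0) + 1;
  });

  return function summary() {
    return {
      elapsedMs: Date.now() - startedAt,
      countsByType: { ...countsByType },
    };
  };
}
```

Calling `const summary = trackRequests(page)` before `page.goto(url)` and logging `summary()` afterwards gives a rough picture of whether images, stylesheets or scripts dominate the load.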
Turbocharging Your Puppeteer Scripts: Practical Solutions
Now that we're familiar with the common speed bumps, let's explore the practical solutions to supercharge your Puppeteer scripts:
- Reusing Browser Instances: Launching a new browser instance for every task can be time-consuming. Instead, reuse browser instances whenever possible to skip the startup cost. On Browserless, you can do this with our keepalive flag.
- Reusing Cache and Cookies: Puppeteer allows you to specify a user data directory, enabling the reuse of cache and cookies. This means that resources loaded in previous sessions can be served from the cache, significantly reducing load time. You can use the --user-data-dir flag if you're on our Dedicated cloud.
- Going Headless for Text-based Tasks: If your task doesn't require a GUI or need to avoid bot detectors, running Chrome in headless mode can lead to better performance.
- Intercepting the Network to Skip Unnecessary Resources: Puppeteer lets you intercept network requests to block requests for resources you don't need, like images or CSS. This can drastically reduce the load time, especially for resource-heavy websites. If you're using our REST APIs you can define the resource types to exclude by using this object
"rejectResourceTypes": []
as detailed in our Swagger docs. If you're using the Puppeteer library, here's how we implement that for our REST APIs in our GitHub repo.
- GPU Hardware Acceleration: If running in headful mode, GPU acceleration can improve loading times for image-intensive websites. As an added bonus, it is also a sign to bot detectors that you're human, since most scrapers run only on CPUs. If you're using Browserless, you can enable it with the --enable-unsafe-webgpu flag.
- Server and Proxy Location: Try to optimize the geolocation of both the servers your scripts run on and the proxies you're using, so that they are as close as possible to your target website. This can also be important for GDPR compliance. See how to use geolocated proxies with Browserless.
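Reusing a browser instance can be sketched in plain Puppeteer as follows. This is an illustration under assumptions rather than how the Browserless keepalive flag works: the helper names `createSharedBrowser` and `withPage` are hypothetical, and `launch` is whatever function you use to start or connect to a browser, for example `() => puppeteer.launch({ headless: true, userDataDir: './profile' })` to also reuse cache and cookies.

```javascript
// Returns a getter that launches the browser on first use and hands
// back the same instance on every later call.
function createSharedBrowser(launch) {
  let browserPromise = null;
  return function getBrowser() {
    if (!browserPromise) {
      browserPromise = Promise.resolve(launch()).catch((err) => {
        // Forget a failed launch so the next call can try again.
        browserPromise = null;
        throw err;
      });
    }
    return browserPromise;
  };
}

// Runs one task in a new page of the shared browser, then closes
// only the page so the browser stays warm for the next task.
async function withPage(getBrowser, task) {
  const browser = await getBrowser();
  const page = await browser.newPage();
  try {
    return await task(page);
  } finally {
    await page.close();
  }
}
```

Each task then pays only for opening a page, for instance `await withPage(getBrowser, (page) => page.goto(url))`, instead of paying for a full browser startup.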
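For the Puppeteer library, skipping unnecessary resources could look roughly like the sketch below. It is not the Browserless implementation; the helper name `blockResourceTypes` and the default list of types are assumptions chosen for illustration. It uses Puppeteer's `page.setRequestInterception`, the `request` event, and the request's `resourceType()`, `abort()` and `continue()` methods.

```javascript
// Abort requests whose resource type is in `blockedTypes` and let
// everything else through. Call this before page.goto().
async function blockResourceTypes(
  page,
  blockedTypes = ['image', 'stylesheet', 'font', 'media']
) {
  const blocked = new Set(blockedTypes);

  // Interception must be enabled before requests can be aborted.
  await page.setRequestInterception(true);

  page.on('request', (request) => {
    if (blocked.has(request.resourceType())) {
      request.abort();
    } else {
      request.continue();
    }
  });
}
```

Blocking stylesheets can change how a page lays out, so for tasks that depend on rendering it may be safer to pass a narrower list such as `['image', 'media']`.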
Conclusion
In the world of automation, efficiency is key. By understanding the factors that can slow down your Puppeteer scripts and implementing the strategies outlined in this post, you can ensure that your headless Chrome automation is lightning-fast.
Remember, the goal is not just to automate but to automate efficiently, making the most out of every second your script runs.
It's why we have prioritised speed in our setup at Browserless. If your automations are getting sluggish, check out our trial to get started.