We’re excited to announce that we’ve crossed over 2 million sessions served! That’s millions of screenshots generated, PDFs printed, and websites tested. We’ve done just about everything you can think of with a headless browser.
While we’re excited to have reached this milestone, there have definitely been a lot of hiccups and issues along the way. Given the amount of traffic we’ve seen, we’d like to take some time to outline common best practices for running headless browsers (and puppeteer) in a production environment.
1. Don’t run a headless browser
[Chart: volatile resource usage of Headless Chrome]
By all accounts, if at all possible, just don’t run a headless browser. Especially not on the same infrastructure as your app (see above). Headless browsers are unpredictable, hungry, and are the process version of a Rick and Morty “Meeseeks.” Almost everything you can do with a browser (save for interpolating and running JavaScript) can be done with simple Linux tools. Cheerio and other libraries offer elegant Node APIs for fetching data via HTTP requests and scraping, if that’s your end goal.
For example, you can fetch a page (assuming it produces useful HTML) and scrape it with something as simple as this sketch, which uses the cheerio and node-fetch packages:
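```js
// A sketch using cheerio and node-fetch (v2, for CommonJS);
// the URL and selector are placeholders for your own target.
const cheerio = require('cheerio');
const fetch = require('node-fetch');

const getTitle = async (url) => {
  const res = await fetch(url);
  const html = await res.text();
  const $ = cheerio.load(html);
  return $('title').text();
};

getTitle('https://example.com').then(console.log);
```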
Obviously this doesn’t cover each and every use-case, and if you’re reading this then chances are you have to use a headless browser, so let’s press on.
2. Don’t run a headless browser when you don’t need to
We’ve run into numerous users who attempt to keep the browser open, even when not in use, so that it’s always available for connections. While this might seem like a good strategy to help expedite session launch, it’ll only end in misery after a few hours. This is largely because browsers like to cache things and slowly consume more memory. Any time you’re not actively using the browser, close it!
At browserless, we generally try to cover this error internally by always having some kind of session timer and closing the browser whenever the WebSocket is disconnected. However, if you’re not using the service or the backing Docker image, be sure that something is closing the browser, otherwise you’ll have a pretty awful time debugging the issue when you’re paged in the middle of the night.
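If you’re rolling your own, a minimal sketch of that failsafe (the one-minute timeout is an arbitrary choice) might look like:

```js
const puppeteer = require('puppeteer');

// Run `work` with a browser that's guaranteed to be closed afterwards,
// even if the job hangs past the timeout.
const withBrowser = async (work, timeoutMs = 60 * 1000) => {
  const browser = await puppeteer.launch();
  const timer = setTimeout(() => browser.close().catch(() => {}), timeoutMs);
  try {
    return await work(browser);
  } finally {
    clearTimeout(timer);
    await browser.close().catch(() => {});
  }
};
```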
3. page.evaluate is your friend
Be careful with transpilers like Babel or TypeScript, as they like to create helper functions and assume they’re available via closures, meaning your .evaluate callback might not work properly.
Puppeteer has a lot of cool sugar methods that let you do things like save DOM selectors or other state in the Node runtime. While this is of great convenience, you can easily shoot yourself in the foot if something happens on the page that mutates that DOM node in some fashion. As much as it feels “hacky,” it’s actually just better to do all of your browser-side work in the context of the browser. This generally means loading up page.evaluate with all the work that needs to be done.
For instance, instead of doing something like this (which requires several async actions; the a.buy-now selector is just an illustration):
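```js
// Each await here is a round-trip between Node and the browser
const $anchor = await page.$('a.buy-now');
const linkHandle = await $anchor.getProperty('href');
const link = await linkHandle.jsonValue();
await $anchor.click();
```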
Do this instead (a single async action):
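```js
// One round-trip: the querying, reading, and clicking all
// happen inside the browser context.
const link = await page.evaluate(() => {
  const $anchor = document.querySelector('a.buy-now');
  const href = $anchor.href;
  $anchor.click();
  return href;
});
```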
The other benefit of doing work in an evaluate call is that it’s portable: you can run the same code inside a browser console to test it, versus trying to rewrite the Node code. Of course, you should always use the debugger where possible to shorten development time.
A quick rule of thumb is to count the number of awaits or thens happening in your code: if there’s more than one, then you’re probably better off running the code inside a page.evaluate call. The reason is that all async actions have to go back and forth between Node’s runtime and the browser’s, which means plenty of JSON serialization and deserialization. While it’s not a huge amount of parsing (since it’s all backed by WebSockets), it still takes up time that could be better spent doing something else.
4. Parallelize with browsers, not pages
Since we’ve determined that it’s best not to run a browser at all, and to only run one when absolutely necessary, the next best practice is to run only one session through each browser. While you actually might save some overhead by parallelizing work through pages, if one page crashes it can bring down the entire browser with it. That, plus each page isn’t guaranteed to be totally clean (cookies and storage might bleed through, as seen here). The sketches below are illustrative; the runJob functions and their work are stand-ins for whatever your jobs actually do.
Instead of this:
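```js
const puppeteer = require('puppeteer');

// One browser, many pages: a crash in any one page can take
// down every job sharing that browser.
const runAllJobs = async (urls) => {
  const browser = await puppeteer.launch();
  await Promise.all(urls.map(async (url) => {
    const page = await browser.newPage();
    await page.goto(url);
    // ... the actual work (screenshots, scraping, etc.)
    await page.close();
  }));
  await browser.close();
};
```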
Do this:
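```js
const puppeteer = require('puppeteer');

// One browser per job: every session starts clean, and a crash
// only takes down the job that owns that browser.
const runJob = async (url) => {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    // ... the actual work
  } finally {
    await browser.close();
  }
};
```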
Each new browser instance gets a clean --user-data-dir (unless otherwise specified), which means it’s treated as an entirely fresh session. If Chrome crashes for whatever reason, it won’t bring down any other sessions that happen to be running.
5. Queue and limit concurrent work
One of the core features of browserless is its ability to limit parallelization and queue work in a seamless way. This means that consuming applications can simply puppeteer.connect without having to implement a queue themselves. This prevents a huge host of issues, mostly around concurrent Chrome instances exhausting your app’s available resources.
The best and easiest way is to pull our docker image and run it with the parameters you’d like:
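```sh
# MAX_CONCURRENT_SESSIONS caps how many sessions run at once
docker run -p 3000:3000 -e "MAX_CONCURRENT_SESSIONS=10" browserless/chrome
```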
This limits the number of concurrent requests that can run to 10 (including debug sessions and more). You can also configure how much you’d like to queue with the MAX_QUEUE_LENGTH variable. As a general rule, you can typically run roughly 10 concurrent requests per GB of memory. CPU usage can spike for various reasons, but for the most part you’ll need lots and lots of RAM.
6. Don’t forget about page.waitForNavigation
One of the most common issues we’ve seen is an action that triggers a pageload, followed by the sudden loss of your script’s execution. This is because actions that trigger a pageload can often cause subsequent work to get swallowed. To get around this issue, you’ll generally have to invoke the page-loading action and immediately wait for the next pageload.
For instance, this console.log won’t work here (see a demo):
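```js
await page.goto('https://example.com');

// This click kicks off a navigation, and the script races the pageload:
await page.click('a');
console.log('Clicked!'); // may never fire: the execution context gets torn down mid-navigation
```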
But it will here (see the demo):
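```js
await page.goto('https://example.com');

// Start waiting for the navigation *before* triggering it
await Promise.all([
  page.waitForNavigation(),
  page.click('a'),
]);
console.log('Clicked!'); // runs reliably once the new page has loaded
```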
You can read more about waitForNavigation here; it accepts roughly the same interface options that page.goto has, just only the “wait” part.
7. Use Docker to contain it all
Chrome requires a lot of dependencies to get running properly. A lot. Even after all of that’s complete, there are things like fonts and phantom processes to worry about, so it’s ideal to use some sort of container to, well, contain it. Docker is almost custom-built for this task since you can limit the amount of resources available and sandbox it. If you want to create your own Dockerfile, look below for the required deps (this sketch tracks Puppeteer’s own troubleshooting list, so treat it as a starting point):
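```dockerfile
# Pick whatever Node base image you're already using
FROM node:slim

# Dependencies Chrome needs to run headlessly
RUN apt-get update && apt-get install -y \
  gconf-service libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 \
  libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 \
  libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libpangocairo-1.0-0 \
  libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 \
  libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 \
  libxtst6 ca-certificates fonts-liberation libappindicator1 libnss3 \
  lsb-release xdg-utils wget \
  && rm -rf /var/lib/apt/lists/*
```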
And to avoid running into zombie processes (which commonly happen with Chrome), you’ll want to use something like dumb-init to properly start up:
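```dockerfile
# Grab dumb-init (pin whichever release you've vetted) and make it PID 1
# so it can reap Chrome's orphaned child processes.
ADD https://github.com/Yelp/dumb-init/releases/download/v1.2.0/dumb-init_1.2.0_amd64 /usr/local/bin/dumb-init
RUN chmod +x /usr/local/bin/dumb-init

ENTRYPOINT ["dumb-init", "--"]

# Placeholder for your app's actual start command
CMD ["node", "index.js"]
```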
If you’re interested in seeing more about this, take a look at our Dockerfile for more details.
8. Remember: there’s two different runtimes goin’ on
It’s helpful to remember that there’s two JavaScript runtimes going on (Node and the browser). This is great for the purposes of shareability, but it comes at the cost of confusion since some page methods will require you to explicitly pass in references (versus doing so with closures or hoisting).
Let’s take page.evaluate as an example. Deep down in the bowels of the protocol, this literally stringifies the function and passes it into Chrome, so things like closures and hoisting won’t work at all. If you need to pass references or values into an evaluate call, simply append them as arguments, which get properly serialized and handled.
So, instead of referencing a selector via closures (the a.buy-now selector in these sketches is just a placeholder):
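```js
const selector = 'a.buy-now';

const link = await page.evaluate(() => {
  // ReferenceError: `selector` was defined in Node and doesn't
  // survive the stringification into the browser context.
  return document.querySelector(selector).href;
});
```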
Pass the parameter in:
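```js
const selector = 'a.buy-now';

const link = await page.evaluate((selector) => {
  // `selector` now arrives as a properly-serialized argument
  return document.querySelector(selector).href;
}, selector);
```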
You can append one or more arguments to page.evaluate since it’s variadic in what it accepts. Be sure to use this to your advantage!
The Future
We’re incredibly excited about the future of headless browsers and all the automation they unlock. With powerful tools like puppeteer and browserless, we’re hopeful that debugging and running headless work in production becomes easier and faster. We’ll be launching functions soon, so be sure to check back when those go live to better run your headless work!