We’re excited to announce that we’ve crossed over 2 million sessions served! That’s millions of screenshots generated, PDFs printed, and websites tested. We’ve done just about everything you can think of with a headless browser.
While we’re excited to have reached this milestone, there have definitely been a lot of hiccups and issues along the way. Given the amount of traffic we’ve seen, we’d like to take some time to outline common best practices for running headless browsers (and puppeteer) in a production environment.
1. Don’t run a headless browser
[Chart: volatile resource usage of Headless Chrome]
By all accounts, if at all possible, just don’t run a headless browser. Especially not on the same infrastructure as your app (see above). Headless browsers are unpredictable, hungry, and are the process version of a Rick and Morty “Meeseeks.” Almost everything you can do with a browser (save for interpolating and running JavaScript) can be done with simple Linux tools. Cheerio and other libraries offer elegant Node APIs for fetching data via HTTP requests and scraping, if that’s your end goal.
For example, you can fetch a page (assuming it produces useful HTML) and scrape it with something as simple as this sketch, which uses the cheerio and node-fetch packages:
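```js
// A sketch using cheerio and node-fetch (v2, for CommonJS);
// the URL and selector are placeholders for your own target.
const cheerio = require('cheerio');
const fetch = require('node-fetch');

const getTitle = async (url) => {
  const res = await fetch(url);
  const html = await res.text();
  const $ = cheerio.load(html);
  return $('title').text();
};

getTitle('https://example.com').then(console.log);
```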
Obviously this doesn’t cover each and every use-case, and if you’re reading this then chances are you have to use a headless browser, so let’s press on.
2. Don’t run a headless browser when you don’t need to
We’ve run into numerous users who attempt to keep the browser open, even when not in use, so that it’s always available for connections. While this might seem like a good strategy to help expedite session launch, it’ll only end in misery after a few hours. This is largely because browsers like to cache things and slowly consume more memory. Any time you’re not actively using the browser, close it!
At browserless, we generally try to cover this error internally by always having some kind of session timer and closing the browser whenever the WebSocket is disconnected. However, if you’re not using the service or the backing Docker image, be sure that something is closing the browser, otherwise you’ll have a pretty awful time debugging the issue when you’re paged in the middle of the night.
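If you’re rolling your own, a minimal sketch of that failsafe (the one-minute timeout is an arbitrary choice) might look like:

```js
const puppeteer = require('puppeteer');

// Run `work` with a browser that's guaranteed to be closed afterwards,
// even if the job hangs past the timeout.
const withBrowser = async (work, timeoutMs = 60 * 1000) => {
  const browser = await puppeteer.launch();
  const timer = setTimeout(() => browser.close().catch(() => {}), timeoutMs);
  try {
    return await work(browser);
  } finally {
    clearTimeout(timer);
    await browser.close().catch(() => {});
  }
};
```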
3. page.evaluate is your friend
Be careful with transpilers like Babel or TypeScript, as they like to create helper functions and assume they’re available via closures, meaning your .evaluate callback might not work properly.
Puppeteer has a lot of cool sugar methods that let you do things like save DOM selectors or other state in the Node runtime. While this is of great convenience, you can easily shoot yourself in the foot if something happens on the page that mutates that DOM node in some fashion. As much as it feels “hacky,” it’s actually just better to do all of your browser-side work in the context of the browser. This generally means loading up page.evaluate with all the work that needs to be done.
For instance, instead of doing something like this (which requires several async actions; the a.buy-now selector is just an illustration):
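```js
// Each await here is a round-trip between Node and the browser
const $anchor = await page.$('a.buy-now');
const linkHandle = await $anchor.getProperty('href');
const link = await linkHandle.jsonValue();
await $anchor.click();
```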
Do this instead (a single async action):
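```js
// One round-trip: the querying, reading, and clicking all
// happen inside the browser context.
const link = await page.evaluate(() => {
  const $anchor = document.querySelector('a.buy-now');
  const href = $anchor.href;
  $anchor.click();
  return href;
});
```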
The other benefit of doing work in an evaluate call is that it’s portable: you can run the same code inside a browser console to test it, versus trying to rewrite the Node code. Of course, you should always use the debugger where possible to shorten development time.
A quick rule of thumb is to count the number of awaits or thens happening in your code: if there’s more than one, then you’re probably better off running the code inside a page.evaluate call. The reason is that all async actions have to go back and forth between Node’s runtime and the browser’s, which means plenty of JSON serialization and deserialization. While it’s not a huge amount of parsing (since it’s all backed by WebSockets), it still takes up time that could be better spent doing something else.
4. Parallelize with browsers, not pages
Since we’ve determined that it’s best not to run a browser at all, and to only run one when absolutely necessary, the next best practice is to run only one session through each browser. While you actually might save some overhead by parallelizing work through pages, if one page crashes it can bring down the entire browser with it. That, plus each page isn’t guaranteed to be totally clean (cookies and storage might bleed through, as seen here). The sketches below are illustrative; the runJob functions and their work are stand-ins for whatever your jobs actually do.
Instead of this:
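```js
const puppeteer = require('puppeteer');

// One browser, many pages: a crash in any one page can take
// down every job sharing that browser.
const runAllJobs = async (urls) => {
  const browser = await puppeteer.launch();
  await Promise.all(urls.map(async (url) => {
    const page = await browser.newPage();
    await page.goto(url);
    // ... the actual work (screenshots, scraping, etc.)
    await page.close();
  }));
  await browser.close();
};
```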
Do this:
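```js
const puppeteer = require('puppeteer');

// One browser per job: every session starts clean, and a crash
// only takes down the job that owns that browser.
const runJob = async (url) => {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    // ... the actual work
  } finally {
    await browser.close();
  }
};
```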
Each new browser instance gets a clean --user-data-dir (unless otherwise specified), which means it’s treated as an entirely fresh session. If Chrome crashes for whatever reason, it won’t bring down any other sessions that happen to be running.
5. Queue and limit concurrent work
One of the core features of browserless is its ability to limit parallelization and queue work in a seamless way. This means that consuming applications can simply puppeteer.connect without having to implement a queue themselves. This prevents a huge host of issues, mostly around concurrent Chrome instances exhausting your app’s available resources.
The best and easiest way is to pull our docker image and run it with the parameters you’d like:
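```sh
# MAX_CONCURRENT_SESSIONS caps how many sessions run at once
docker run -p 3000:3000 -e "MAX_CONCURRENT_SESSIONS=10" browserless/chrome
```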
This limits the number of concurrent requests that can run to 10 (including debug sessions and more). You can also configure how much you’d like to queue with the MAX_QUEUE_LENGTH variable. As a general rule, you can typically run roughly 10 concurrent requests per GB of memory. CPU usage can spike for various reasons, but for the most part you’ll need lots and lots of RAM.
6. Don’t forget about page.waitForNavigation
One of the most common issues we’ve seen is an action that triggers a pageload, followed by the sudden loss of your script’s execution. This is because actions that trigger a pageload can often cause subsequent work to get swallowed. To get around this issue, you’ll generally have to invoke the page-loading action and immediately wait for the next pageload.
For instance, this console.log won’t work here (see a demo):
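```js
await page.goto('https://example.com');

// This click kicks off a navigation, and the script races the pageload:
await page.click('a');
console.log('Clicked!'); // may never fire: the execution context gets torn down mid-navigation
```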
But it will here (see the demo):
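```js
await page.goto('https://example.com');

// Start waiting for the navigation *before* triggering it
await Promise.all([
  page.waitForNavigation(),
  page.click('a'),
]);
console.log('Clicked!'); // runs reliably once the new page has loaded
```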
You can read more about waitForNavigation here; it accepts roughly the same interface options that page.goto has, just only the “wait” part.
7. Use Docker to contain it all
Chrome requires a lot of dependencies to get running properly. A lot. Even after all of that’s complete, there are things like fonts and phantom processes to worry about, so it’s ideal to use some sort of container to, well, contain it. Docker is almost custom-built for this task since you can limit the amount of resources available and sandbox it. If you want to create your own Dockerfile, look below for the required deps (this sketch tracks Puppeteer’s own troubleshooting list, so treat it as a starting point):
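```dockerfile
# Pick whatever Node base image you're already using
FROM node:slim

# Dependencies Chrome needs to run headlessly
RUN apt-get update && apt-get install -y \
  gconf-service libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 \
  libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 \
  libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libpangocairo-1.0-0 \
  libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 \
  libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 \
  libxtst6 ca-certificates fonts-liberation libappindicator1 libnss3 \
  lsb-release xdg-utils wget \
  && rm -rf /var/lib/apt/lists/*
```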
And to avoid running into zombie processes (which commonly happen with Chrome), you’ll want to use something like dumb-init to properly start up:
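```dockerfile
# Grab dumb-init (pin whichever release you've vetted) and make it PID 1
# so it can reap Chrome's orphaned child processes.
ADD https://github.com/Yelp/dumb-init/releases/download/v1.2.0/dumb-init_1.2.0_amd64 /usr/local/bin/dumb-init
RUN chmod +x /usr/local/bin/dumb-init

ENTRYPOINT ["dumb-init", "--"]

# Placeholder for your app's actual start command
CMD ["node", "index.js"]
```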
If you’re interested in seeing more about this, take a look at our Dockerfile for more details.
8. Remember: there’s two different runtimes goin’ on
It’s helpful to remember that there’s two JavaScript runtimes going on (Node and the browser). This is great for the purposes of shareability, but it comes at the cost of confusion since some page methods will require you to explicitly pass in references (versus doing so with closures or hoisting).
Let’s take page.evaluate as an example. Deep down in the bowels of the protocol, this literally stringifies the function and passes it into Chrome, so things like closures and hoisting won’t work at all. If you need to pass references or values into an evaluate call, simply append them as arguments, which get properly serialized and handled.
So, instead of referencing a selector via closures (the a.buy-now selector in these sketches is just a placeholder):
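```js
const selector = 'a.buy-now';

const link = await page.evaluate(() => {
  // ReferenceError: `selector` was defined in Node and doesn't
  // survive the stringification into the browser context.
  return document.querySelector(selector).href;
});
```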
Pass the parameter in:
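```js
const selector = 'a.buy-now';

const link = await page.evaluate((selector) => {
  // `selector` now arrives as a properly-serialized argument
  return document.querySelector(selector).href;
}, selector);
```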
You can append one or more arguments to page.evaluate since it’s variadic in what it accepts. Be sure to use this to your advantage!
The Future
We’re incredibly excited about the future of headless browsers and all the automation they unlock. With powerful tools like puppeteer and browserless, we’re hopeful that debugging and running headless work in production becomes easier and faster. We’ll be launching functions soon, so be sure to check back when those go live to better run your headless work!