Step back with me almost two years, to when Google announced that Chrome would support a first-class headless mode. Even though it was only April, it felt like Christmas had arrived early! After years of personally dealing with projects that attempted to automate and scale a web browser (Selenium, PhantomJS, and so on), the timing couldn't have been better, as I was in desperate need of a performant solution. As much as I wanted to believe that headless Chrome would solve all of our collective development woes (and it does solve a good chunk, mind you), the shocking truth is that there's still quite a bit that needs to be done.
Fast-forward to today, and browserless has just had its first birthday! What originally started as an attempt at containing, managing, and debugging headless work has now blossomed into the biggest development effort of my life. Along the way, there have been some pretty incredible findings, some even harsher realities, and an incredible amount of time invested. Though there is so much to write about, today I want to go over the core of what browserless does and pass on our biggest findings, whether you're just beginning your headless Chrome journey or have been on it for a while.
Let's jump in.
Sandbox everything
Chrome comes with built-in sandboxing ... if your Linux distro supports it. I highly recommend enabling it, as does all the documentation I've read, since sandboxing can help contain malicious attacks if something terrible were to happen (you can read more about this here). In order to get sandboxing to work, your kernel needs to support unprivileged user namespace cloning -- because you're not going to run this as root, right? Enabling this is a relatively simple task in most cases.
However, inside of Docker this can be relatively tricky, since it depends largely on the host machine supporting it rather than on your base image. Distributions like Ubuntu have no problem with this, but be sure to refer to your distro's documentation before doing so.
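As a rough sketch of how you might check this from Node before launching Chrome: the snippet below assumes a Debian-style kernel that exposes the kernel.unprivileged_userns_clone setting as a file under /proc/sys. Many kernels don't have this file at all, and the helper name is made up for illustration.

```javascript
const fs = require('fs');

// Assumption: Debian-style kernels expose this sysctl as a file; many others don't.
const USERNS_CLONE_PATH = '/proc/sys/kernel/unprivileged_userns_clone';

// Hypothetical helper: returns 'enabled', 'disabled', or 'unknown'.
// `readFile` is injectable so the check can be exercised without touching /proc.
function usernsCloneStatus(readFile = (p) => fs.readFileSync(p, 'utf8')) {
  let raw;
  try {
    raw = readFile(USERNS_CLONE_PATH);
  } catch (err) {
    // File missing or unreadable: this kernel doesn't expose the setting.
    return 'unknown';
  }
  const value = raw.trim();
  if (value === '1') return 'enabled';
  if (value === '0') return 'disabled';
  return 'unknown';
}

console.log(`unprivileged user namespace cloning: ${usernsCloneStatus()}`);
```

On kernels that do expose the setting, it's typically switched on as root with sysctl (kernel.unprivileged_userns_clone=1); an 'unknown' result just means you'll need to check your distro's documentation for its equivalent.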
Speaking of Docker, you have another task to keep in mind, which is the storage-driver parameter. You'll need to ensure that you're using overlay2, as overlay doesn't quite work well with sandboxing and other aspects of Chromium, like running in "head-full" mode (which we'll talk about later), and will result in puppeteer simply locking up without any hints as to why. docker-machine can be particularly susceptible, as up until version 0.16.0 the default driver was overlay.
Lastly, when running untrusted code on the NodeJS side, which is something we do in our interactive debugger, you'll want to sandbox that as well with something like vm2. This helps prevent scripts from gaining access to globals like process, if you so desire, and can help prevent prototype chain vulnerabilities (read more about that here). Now you might think that this is enough, and in most cases it is, but your service can still get killed by a simple while (true) {} being executed, so you'll have to go even further by spawning a child process to handle the actual running of the sandboxed script. That way your parent process can safely kill the child, which runs in its own process, and your application remains performant.
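To make the child-process idea concrete, here's a minimal sketch using only Node's built-in child_process module. It isn't how browserless does it: runUntrusted is a hypothetical helper, and the vm2 layer you'd normally add inside the child is left out to keep the example self-contained.

```javascript
const { spawn } = require('child_process');

// Hypothetical helper: runs `source` in a separate Node process and kills that
// process if it hasn't exited within `timeoutMs`. The parent's event loop stays free.
function runUntrusted(source, timeoutMs) {
  return new Promise((resolve) => {
    const child = spawn(process.execPath, ['-e', source], {
      stdio: ['ignore', 'pipe', 'ignore'],
    });
    let stdout = '';
    let timedOut = false;
    let settled = false;

    const finish = (code) => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      resolve({ timedOut, code, stdout });
    };

    // A busy loop in the child can't block this timer, since it lives in the parent.
    const timer = setTimeout(() => {
      timedOut = true;
      child.kill('SIGKILL');
    }, timeoutMs);

    child.stdout.on('data', (chunk) => { stdout += chunk; });
    child.on('error', () => finish(null)); // e.g. the process could not be spawned
    child.on('close', (code) => finish(code));
  });
}

// An infinite loop is terminated by the parent after 300ms.
runUntrusted('while (true) {}', 300).then((result) => {
  console.log('timed out:', result.timedOut);
});
```

In a real service the child would also run the script inside a sandbox such as vm2 and have its memory and permissions restricted. The timeout only covers the "runs forever" case.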
If you're curious as to how we run debugger code, check out our source here.
Skip the bundled Chromium for google-chrome-stable
One of the great things about puppeteer is its bundled copy of Chrome. It had what nearly every other library didn't: a guarantee that it would work as expected when installed. Other libraries (including my own) didn't have this as a feature, since it was time-consuming to check or back-port changes in the remote interface. The only caveat is that it's practically impossible to use the bundled copy of Chromium on Linux (where your application is likely going to be deployed anyway) without installing a bunch of additional software. Even the example Dockerfile in puppeteer uses google-chrome-stable rather than getting their own copy to work, due in part to the sheer number of external dependencies you'll need to download.
The other benefit of using google-chrome-stable is that it works with practically every site out there. It'll play mp4 videos, it'll operate more sanely, and it's a single RUN command in most cases. This results in faster builds, fewer layers, and just better maintainability -- and most package managers (including alpine's) have it available to install without much fuss.
Of course, if you need the bleeding edge then you'll have to opt into managing all those dependencies yourself. You can use our Dockerfile as a reference, or just pull the image and use it, since we maintain versions that line up exactly with puppeteer as well as a chrome-stable release.
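On the puppeteer side, pointing at a system Chrome comes down to the executablePath launch option. The sketch below assumes the binary lives at /usr/bin/google-chrome-stable (the usual location for the Debian/Ubuntu package, but verify it on your image), and chromeLaunchOptions is a hypothetical helper name.

```javascript
const fs = require('fs');

// Assumption: the usual install location of the google-chrome-stable package.
const DEFAULT_CHROME_PATH = '/usr/bin/google-chrome-stable';

// Hypothetical helper: builds puppeteer launch options that point at a system
// Chrome, or returns no override (so puppeteer uses its bundled Chromium) when
// the binary isn't present. `exists` is injectable for testing.
function chromeLaunchOptions(chromePath = DEFAULT_CHROME_PATH, exists = fs.existsSync) {
  return exists(chromePath) ? { executablePath: chromePath } : {};
}

// Usage (requires puppeteer to be installed):
//   const browser = await require('puppeteer').launch(chromeLaunchOptions());
console.log(chromeLaunchOptions());
```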
Isolate Chrome at a hardware level
Hopefully this is not too novel a concept, but you'd be surprised at the number of applications out there that run alongside Chromium. To those who consider this gratuitous, let me ask you a question: would you run your database on the same hardware as your app? Maybe at the start for simplicity's sake, but certainly not at any meaningful scale. This simple coupling can have you up late at night when one process decides to take more than its share.
Given that, I'd go so far as to argue that it's even more crucial to run Chrome elsewhere than it is a database. It's harder to provision for in general, has the ability to consume 100% of your machine's resources, and opens up your application to more security vulnerabilities as well. When it locks up your machine, your app unfortunately will go with it. Databases, on the other hand, have a much longer history and are often easier to plan for, whereas most headless workloads can be at the mercy of the open internet.
Similarly, having your application bundled alongside Chrome can back you into a corner when it comes to deployment and scaling. Want to use serverless or deploy your app elsewhere? Having Chrome as a dependency will definitely limit your possibilities. Need to go from one to thousands of headless Chrome sessions? If you bundled your app and Chrome together, you're going to over-provision.
As with any well-thought-out system, separation of concerns goes a long way and pays dividends later.
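In puppeteer terms, running Chrome elsewhere means connecting to a remote browser rather than launching a local one. Here's a small sketch using puppeteer.connect with its browserWSEndpoint option. The CHROME_WS_ENDPOINT variable, the default URL, and the helper names are placeholders for illustration.

```javascript
// Assumption: Chrome runs on a separate host and exposes a WebSocket endpoint.
// The environment variable name and the fallback URL are placeholders.
function browserEndpoint(env = process.env) {
  return env.CHROME_WS_ENDPOINT || 'ws://localhost:3000';
}

// Hypothetical helper: connects to the remote browser instead of launching one.
// puppeteer is required lazily so this file loads even where it isn't installed.
async function withRemoteBrowser(task) {
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.connect({ browserWSEndpoint: browserEndpoint() });
  try {
    return await task(browser);
  } finally {
    // Drop our connection only; the remote Chrome keeps running.
    await browser.disconnect();
  }
}

console.log('would connect to', browserEndpoint());
```

Because the app only holds a URL, it can be deployed and scaled on its own, and the Chrome machines can be sized for browser work.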
Know your limits and queue the rest
Internally, browserless leverages the incredible queue npm module for, well, queuing overflow requests. While not an ideal solution, especially when you really just want work to be done, it's a much, much better alternative than dealing with gridlocked infrastructure. Queueing is also a pretty cost-effective way to ensure that sessions aren't denied while still maintaining availability.
In browserless this happens seamlessly. At its core, the system listens for HTTP Upgrade events, which are how WebSockets initiate connections and are fundamental to puppeteer, and delays the upgrade when the concurrency limit has already been reached. Obviously this comes at the cost of a longer overall session, but all things considered it's a tradeoff most folks will make.
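The shape of that idea can be sketched with Node's built-in http module. This is not the browserless implementation: it uses a small hand-rolled queue instead of the queue npm module, and SessionQueue, createQueuedServer, and the startSession callback (which would proxy the socket to Chrome) are all hypothetical names.

```javascript
const http = require('http');

// Minimal in-memory limiter: at most `maxConcurrent` sessions run at once,
// and the rest wait in FIFO order.
class SessionQueue {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.running = 0;
    this.waiting = [];
  }

  // Starts the session now if a slot is free, otherwise parks it.
  acquire(start) {
    if (this.running < this.maxConcurrent) {
      this.running += 1;
      start();
    } else {
      this.waiting.push(start);
    }
  }

  // Frees a slot; if someone is waiting, the slot is handed straight to them.
  release() {
    const next = this.waiting.shift();
    if (next) {
      next();
    } else if (this.running > 0) {
      this.running -= 1;
    }
  }
}

// Delays WebSocket upgrades until a slot is available.
function createQueuedServer(maxConcurrent, startSession) {
  const queue = new SessionQueue(maxConcurrent);
  const server = http.createServer();
  server.on('upgrade', (req, socket, head) => {
    queue.acquire(() => {
      if (socket.destroyed) {
        queue.release(); // client gave up while waiting in the queue
        return;
      }
      socket.once('close', () => queue.release());
      startSession(req, socket, head);
    });
  });
  return { server, queue };
}
```

Calling createQueuedServer(10, startSession).server.listen(3000) would then hold the eleventh connection open, un-upgraded, until one of the first ten closes. A production version would also want a cap on queue length and a timeout for waiting clients.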
But how and when do you decide to queue? Does infrastructure play a hand in this? The answer to the latter is "most certainly," and for the former this is our best recommendation: only run 10-20 concurrent browser sessions on one machine. Depending on the type of workload, this number can skew heavily one way or the other.
Let's say, for instance, you want to generate a PDF of your site, and that PDF produces 20 pages. While not a huge resulting asset by any means, the amount of processing power and memory required can be fairly high. For something in that realm, a machine with at least 4GB/2CPU can handle about 12 or so concurrent sessions. Alternatively, if you're just scraping HTML from a single-page application (say for SEO), then you'll likely only need a 1GB/1CPU machine and can likely run north of 15 concurrent sessions. Take all of this with a grain of salt, as planning for this properly takes time and testing, and doing due diligence here will result in fewer surprises later.
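As a back-of-the-envelope example, those per-machine figures turn into a machine count like so. The workload names and the 100-session peak are made up for illustration, and the per-machine numbers are the rough estimates above, not measurements of your workload.

```javascript
// Rough per-machine figures from the text; starting points, not guarantees.
const SESSIONS_PER_MACHINE = {
  pdf: 12,    // ~20-page PDFs on a 4GB/2CPU machine
  scrape: 15, // HTML scraping on a 1GB/1CPU machine
};

// Machines needed to serve `peakSessions` at once without queueing.
function machinesNeeded(peakSessions, workload) {
  const perMachine = SESSIONS_PER_MACHINE[workload];
  if (!perMachine) throw new Error(`Unknown workload: ${workload}`);
  return Math.ceil(peakSessions / perMachine);
}

console.log(machinesNeeded(100, 'pdf'));    // 9
console.log(machinesNeeded(100, 'scrape')); // 7
```

Anything beyond that capacity either waits in the queue or needs another machine, which is the tradeoff to test against real traffic.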
Of course all of this can easily get thrown out the window once WebGL, screen-casting or canvas elements enter the picture. Sessions of this nature will require larger machines regardless, especially if high frame-rates and availability are prioritized.
Remember, Xvfb is still there
While likely not a requirement for the majority of headless work out there, there are going to be times when you'll have to run Chrome in non-headless mode. For cases like these, and we've definitely seen and support them, you'll have to rely on Xvfb, a virtual framebuffer. Why, you might ask?
First: --headless doesn't, and likely never will, support running actual Chrome extensions. This means that automating your tests for extensions is going to require running in "head-full" mode, and the only reasonable way to do so is via Xvfb. Flash and others are in a similar boat, where they will likely never be available (and shouldn't be) in headless mode. The solution here is to use Xvfb and drop the --headless switch.
Second: many anti-automation platforms can detect that Chrome is headless regardless of the User-Agent being sent. If you want to ensure your crawler, scraper, or screenshots work well, then you'll have to fall back to running in a head-full context. Well... and using google-chrome-stable, of course!
Finally, user settings don't yet work in headless mode. According to the headless-dev Google group, and more specifically this issue here, most user-settings aren't planned to be supported. The solution, you guessed it, is not running in headless.
However, remember that things like pdf generation don't work in head-full mode, so keep in mind that there are tradeoffs either way depending on what you're trying to accomplish.
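For the extension case, a head-full launch under Xvfb might look something like the sketch below. It assumes Xvfb (or a wrapper such as xvfb-run) has already set DISPLAY, and it uses the --disable-extensions-except and --load-extension flags that puppeteer's documentation has used for loading extensions. headfulLaunchOptions and the example path are hypothetical.

```javascript
// Hypothetical helper: puppeteer launch options for running Chrome "head-full".
// Assumes a virtual display is already running and exported as DISPLAY.
function headfulLaunchOptions(env = process.env, extensionPath) {
  if (!env.DISPLAY) {
    throw new Error('No DISPLAY set: start Xvfb first, or run under xvfb-run');
  }
  const args = [];
  if (extensionPath) {
    // Load only the extension under test.
    args.push(
      `--disable-extensions-except=${extensionPath}`,
      `--load-extension=${extensionPath}`
    );
  }
  return { headless: false, args };
}

// Usage (requires puppeteer and a running virtual display):
//   const options = headfulLaunchOptions(process.env, '/path/to/extension');
//   const browser = await require('puppeteer').launch(options);
```

Failing fast on a missing DISPLAY is a deliberate choice here, since a head-full Chrome with no display otherwise tends to fail with a much less obvious error.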
Final words
This list isn't by any means exhaustive, but hopefully it will help you in your efforts to run Chrome at scale. Obviously, if you want to avoid some of this pain and move forward with your efforts altogether, I'd invite you to check out our GitHub repository. We've got many years of historical knowledge and time baked into browserless, and hope it can help you move forward in your ambitions. If you're looking for more tips on running puppeteer in production, be sure to check out our other articles on the subject. Thanks!