Large language models, or LLMs, are a great way to make data from various sources accessible to end users in a variety of ways. How you train your LLM is particularly important. Projects like Hugging Face and others do a great job of providing starting datasets that make it easy to get these models off the ground. But what if you want to start competing with the likes of OpenAI and ingest even more data? What if the data you're accessing is dynamically generated with JavaScript, or uses other sophisticated technologies that require a web browser to render? With browserless, you can craft a simple API call to do just that.
This guide assumes some familiarity with LLMs and how to use them, and focuses on the data side of training these models. Feel free to read more about how to train a large language model with the framework you're using.
About browserless
browserless is a service that manages Chrome and makes it programmatically accessible from a developer's standpoint. In most cases, you need a library like puppeteer or playwright to get Chrome to do whatever it is you need it to do. That works well for certain projects, but since most LLMs are only interested in raw data, driving a full programmatic API just to get that data can be heavy-handed. This is where browserless shines: it offers REST-based APIs for common use cases across tech.
In particular, we want to highlight two APIs that make it extremely easy to fetch website data: the Scrape and Content APIs.
Using Scrape to train LLMs
Our Scrape API is well suited for fetching website data after JavaScript has been parsed and run, returning the website's content back to you. Like most REST-based APIs, the Scrape API takes a JSON body describing the nature of your request and what it should look for. A simple example looks like this:
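Here is a minimal sketch in Python. The endpoint URL and `YOUR_API_KEY` token are placeholders you would swap for your own browserless instance, and the exact body fields should be checked against the Scrape API docs for your version:

```python
import json

# Placeholder endpoint and token -- substitute your own browserless URL and key.
SCRAPE_URL = "https://chrome.browserless.io/scrape?token=YOUR_API_KEY"


def build_scrape_payload(url: str, selector: str = "body") -> dict:
    """Build the JSON body for the Scrape API: the page to visit and
    the elements (by CSS selector) whose data you want back."""
    return {
        "url": url,
        "elements": [{"selector": selector}],
    }


payload = build_scrape_payload("https://cnn.com")
print(json.dumps(payload, indent=2))

# To actually send the request (needs the third-party `requests` package
# and a valid token):
#
# import requests
# res = requests.post(SCRAPE_URL, json=payload, timeout=60)
# data = res.json()
```

Keeping the payload construction in its own function makes it easy to queue up many URLs for a training pipeline with the same request shape.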
The request above navigates to CNN.com, waits for JavaScript to parse and run, gets data for the body of the document, and returns the following (note that this is truncated for brevity's sake):
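An illustrative sketch of the response shape (field names follow the Scrape API docs; the exact values and attributes will vary with the page and the selectors you request):

```json
{
  "data": [
    {
      "selector": "body",
      "results": [
        {
          "text": "Breaking news, latest news and videos ...",
          "width": 800,
          "height": 5400,
          "top": 0,
          "left": 0,
          "attributes": [{ "name": "class", "value": "..." }]
        }
      ]
    }
  ]
}
```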
LLMs, in particular, are mostly interested in the "text" of a website, which this API returns inside that JSON structure. Furthermore, you also get metadata about the content: things like size (in pixels) and positioning. These elements can further enhance your model's knowledge of the data and add another dimension for weighing potential importance.
You can learn more about our Scrape API, including all the additional options, here.
Content API to train your LLM
The Content API is similar to the Scrape API in that it returns content after JavaScript has been parsed and executed. It differs in that it returns only the HTML content of the site itself, with no additional parsing. Using it is similar to Scrape: you POST a JSON body containing details about the URL you care about.
Below is an example of what this looks like:
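A minimal sketch, again with a placeholder endpoint and token to replace with your own; the Content API body here assumes only the URL is required, per the docs:

```python
import json

# Placeholder endpoint and token -- substitute your own browserless URL and key.
CONTENT_URL = "https://chrome.browserless.io/content?token=YOUR_API_KEY"


def build_content_payload(url: str) -> dict:
    """The Content API needs just the URL of the page to render."""
    return {"url": url}


print(json.dumps(build_content_payload("https://cnn.com")))

# Posting this body returns the fully rendered HTML as text:
#
# import requests
# html = requests.post(CONTENT_URL, json=build_content_payload("https://cnn.com")).text
```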
Doing so returns purely the HTML of the page. Other libraries can help extract content further if you wish, but some LLMs can parse raw HTML just fine.
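If you do want to strip the markup down to plain text before feeding it to a model, a small sketch using only Python's standard-library `html.parser` (no third-party dependency assumed) looks like this:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> blocks."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0  # nested <script>/<style> tracking

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Flatten an HTML document into whitespace-joined visible text."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)


sample = "<html><body><h1>Hello</h1><script>var x=1;</script><p>World</p></body></html>"
print(html_to_text(sample))  # -> Hello World
```

For production-scale extraction you would likely reach for a dedicated library, but this shows how little is needed to turn the Content API's HTML into model-ready text.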