Large language models, or LLMs, are a great way to make data from various sources accessible to end users in a variety of ways. How you train your LLM is particularly important. Projects like Hugging Face and others do a great job of providing starting datasets that make it easy to get these models off the ground. But what if you want to start competing with the likes of OpenAI and ingest even more data? What if the data you're accessing is dynamically generated with JavaScript, or uses other sophisticated technologies that require a web browser to render? With browserless, you can craft a simple API call to do just that.
This guide assumes some familiarity with LLMs and how to use them, and focuses on the data side of training these models. Feel free to read more about how to train a large language model with the framework you're using.
About browserless
browserless is a service that manages Chrome and makes it programmatically accessible from a developer's standpoint. In most cases, you need a library like puppeteer or playwright to get Chrome to do whatever it is you need it to do. That works well for certain projects, but since most LLMs are only interested in raw data, driving a full programmatic API just to get that data can be heavy-handed. This is where browserless shines: it offers REST-based APIs for common use cases across tech.
In particular, we want to highlight two APIs that make it extremely easy to fetch website data: the Scrape and Content APIs.
Using Scrape to train LLMs
Our Scrape API is well suited for fetching website data after JavaScript has been parsed and run, returning the website's content back to you. Like most REST-based APIs, the Scrape API takes a JSON body describing the nature of your request and what it should look for. A simple example looks like this:
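Here is a minimal sketch in Python. The endpoint URL and `YOUR_API_KEY` token are placeholders you would swap for your own browserless instance, and the exact body fields should be checked against the Scrape API docs for your version:

```python
import json

# Placeholder endpoint and token -- substitute your own browserless URL and key.
SCRAPE_URL = "https://chrome.browserless.io/scrape?token=YOUR_API_KEY"


def build_scrape_payload(url: str, selector: str = "body") -> dict:
    """Build the JSON body for the Scrape API: the page to visit and
    the elements (by CSS selector) whose data you want back."""
    return {
        "url": url,
        "elements": [{"selector": selector}],
    }


payload = build_scrape_payload("https://cnn.com")
print(json.dumps(payload, indent=2))

# To actually send the request (needs the third-party `requests` package
# and a valid token):
#
# import requests
# res = requests.post(SCRAPE_URL, json=payload, timeout=60)
# data = res.json()
```

Keeping the payload construction in its own function makes it easy to queue up many URLs for a training pipeline with the same request shape.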
The request above navigates to CNN.com, waits for JavaScript to parse and run, gets data for the body of the document, and returns the following (note that this is truncated for brevity's sake):
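An illustrative sketch of the response shape (field names follow the Scrape API docs; the exact values and attributes will vary with the page and the selectors you request):

```json
{
  "data": [
    {
      "selector": "body",
      "results": [
        {
          "text": "Breaking news, latest news and videos ...",
          "width": 800,
          "height": 5400,
          "top": 0,
          "left": 0,
          "attributes": [{ "name": "class", "value": "..." }]
        }
      ]
    }
  ]
}
```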
LLMs, in particular, are mostly interested in the "text" of a website, which this API returns inside that JSON structure. Furthermore, you also get metadata about the content: things like size (in pixels) and positioning. These elements can further enhance your model's knowledge of the data and add another dimension for weighing potential importance.
You can learn more about our Scrape API, including all the additional options, here.
Content API to train your LLM
The Content API is similar to the Scrape API in that it returns content after JavaScript has been parsed and executed. It differs in that it returns only the HTML content of the site itself, with no additional parsing. Using it is similar to Scrape: you POST a JSON body containing details about the URL you care about.
Below is an example of what this looks like:
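A minimal sketch, again with a placeholder endpoint and token to replace with your own; the Content API body here assumes only the URL is required, per the docs:

```python
import json

# Placeholder endpoint and token -- substitute your own browserless URL and key.
CONTENT_URL = "https://chrome.browserless.io/content?token=YOUR_API_KEY"


def build_content_payload(url: str) -> dict:
    """The Content API needs just the URL of the page to render."""
    return {"url": url}


print(json.dumps(build_content_payload("https://cnn.com")))

# Posting this body returns the fully rendered HTML as text:
#
# import requests
# html = requests.post(CONTENT_URL, json=build_content_payload("https://cnn.com")).text
```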
Doing so returns purely the HTML of the page. Other libraries can help extract content further if you wish, but some LLMs can parse raw HTML just fine.
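If you do want to strip the markup down to plain text before feeding it to a model, a small sketch using only Python's standard-library `html.parser` (no third-party dependency assumed) looks like this:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> blocks."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0  # nested <script>/<style> tracking

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Flatten an HTML document into whitespace-joined visible text."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)


sample = "<html><body><h1>Hello</h1><script>var x=1;</script><p>World</p></body></html>"
print(html_to_text(sample))  # -> Hello World
```

For production-scale extraction you would likely reach for a dedicated library, but this shows how little is needed to turn the Content API's HTML into model-ready text.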