Bonus: if you like our content and this “How to scrape Twitter” article, you can join our web automation Slack community.
Let’s talk about Twitter! It is one of the largest social media platforms, where people can post short pieces of text and reach millions of readers across the world. It is so influential that someone can alter the price of a crypto coin just by tweeting an image. Apart from individuals looking to share and gain information, companies, political institutions, and governments maintain accounts. The ubiquitous nature of Twitter has fascinated data analysts, who look into account activity to analyze and gain insights about various trends and social phenomena, and even to research how to maximize their clients' profits.
In this article, we will learn how to retrieve helpful information about a Twitter account’s activity by employing web scraping techniques using a free automation platform like Browserless and the flexibility of a scripting programming language like JavaScript.
So let's dive deeper into a Twitter scraper guide:
Twitter scraper step #1 - Get a Browserless account
Browserless is a headless automation platform that provides fast, scalable, and reliable web browser automation, ideal for data analysis assignments. It’s an open-source platform with more than 7.2K stars on GitHub. It also has a hosted SaaS platform. Some of the largest companies worldwide use the platform daily to conduct QA testing and data collection tasks.
To get started, we first have to create an account.
The hosted SaaS platform offers a free tier, plus paid plans if we need more processing power. The free tier offers up to 6 hours of usage, which is more than enough for our case.
After completing the registration process, the platform supplies us with an API key. We will use this key to access the Browserless services later on.
Twitter scraper step #2 - Set up a Node script with Puppeteer
The next step is to set up our project. While Browserless has excellent support across programming languages and platforms, we will use JavaScript on Node.js due to its simplicity and robust environment.
First, let's initialize a new Node project and install the puppeteer-core package.
$ npm init -y && npm i puppeteer-core
In case you didn’t know, puppeteer is a popular JavaScript library used for web scraping. It counts more than 78K stars on GitHub and is actively maintained. The puppeteer-core package provides all the functionality of the main puppeteer package without downloading the browser, resulting in reduced dependency artifacts. By the way, if you like puppeteer-core, check out our "How to do web automation with Puppeteer-core & Browserless [3 code examples]" article.
Once we have installed our dependency, we can create the script's structure.
There are a couple of things to notice here, so let's make a quick walk through the code:
- First, we import the puppeteer-core module.
- We declare a variable BROWSERLESS_API_KEY, whose value is the Browserless API key we retrieved from the dashboard earlier.
- Then, we declare an asynchronous function getTwitterData, which accepts the profile URL as a parameter, e.g., “https://twitter.com/NASA”.
- We call getTwitterData and print the results to the terminal. Note that we use top-level await syntax, supported in ESM from Node version 14 onward.
Inside the getTwitterData function, we connect to the Browserless service by calling the connect method of the puppeteer module, using the browserWSEndpoint property to indicate the connection URI, which consists of two parts:
- The base URI wss://chrome.browserless.io
- The token query-string parameter, whose value is the API key we retrieved from the dashboard.
Then we instantiate a new browser page and navigate to the desired Twitter account using the value of the url parameter. The following statement is critical: we call waitForSelector on the page instance to instruct the underlying puppeteer engine to wait until the tweets are loaded. Twitter’s browser-based UI is built with React as a SPA (Single Page Application), so the tweets are fetched after the page is initially loaded. If we did not use the waitForSelector method, we would not be able to retrieve the available tweets. Each tweet is represented by an <article /> element, so we use that as a query selector. Finally, we disconnect from the remote browser instance before returning the results.
Twitter scraper step #3 - Retrieve profile info
The first information we will retrieve concerns the profile itself: the profile name, username, and numbers of followers and following all provide helpful insight into the account's performance.
At the time of this writing, this is what the Twitter profile page looks like on desktop computers:
We can use the highlighted div element to get the text content of its children. The resulting values will contain the profile name and username. We will use the same tactic to retrieve the numbers of followers and following by accessing the corresponding href attributes. We encapsulate this logic into a function getProfileInfo that we can later call from inside getTwitterData.
The call to the evaluate method executes the provided callback from within the browser instance.
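The original snippet is not reproduced here; a sketch of getProfileInfo might look like the following. The selectors are assumptions about Twitter's markup at the time of writing (the data-testid value and the /followers and /following href suffixes may have changed since), so treat them as starting points, not a definitive implementation.

```javascript
// Sketch: getProfileInfo runs in Node, but the evaluate callback is executed
// inside the remote browser page, where `document` is available.
async function getProfileInfo(page) {
  return page.evaluate(() => {
    // Assumption: the highlighted container carries data-testid="UserName"
    // and its text holds the profile name and the @username on separate lines.
    const nameBlock = document.querySelector('div[data-testid="UserName"]');
    const [profileName, username] = nameBlock
      ? nameBlock.innerText.split('\n')
      : [null, null];

    // Locate the followers/following links via their href attributes.
    const followersEl = document.querySelector('a[href$="/followers"]');
    const followingEl = document.querySelector('a[href$="/following"]');

    return {
      profileName,
      username,
      followers: followersEl ? followersEl.innerText : null,
      following: followingEl ? followingEl.innerText : null,
    };
  });
}
```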
Twitter scraper step #4 - Retrieve tweets statistics
Now that we have gathered basic profile details, we can retrieve some metrics for the latest tweets. The most common statistics we want to know about a tweet are the post time and the numbers of likes, retweets, and replies. Recall that we mentioned each tweet being an <article/> DOM element. It turns out that tweets are the only components that use the <article/> tag. This makes our job easier because we can use querySelectorAll to gather all the articles and then use the appropriate selectors to retrieve each desired metric. We'll also encapsulate this functionality into its own function, getTweetMetrics.
As we did when retrieving the profile info, we will call the appropriate selectors on each element and get the inner text. For each tweet, we can access the post time by retrieving the value of the datetime attribute of the <time/> element. We can target the corresponding DOM elements for likes, retweets, and replies using the data-testid attribute.
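A sketch of getTweetMetrics under the same assumptions might look like this. The data-testid values (reply, retweet, like) are assumptions about Twitter's markup, and parseCount is a hypothetical helper, added here to normalize abbreviated counts like "1.2K" into plain numbers; it is not part of the original article.

```javascript
// Hypothetical helper: convert "1.2K" → 1200, "2M" → 2000000, "345" → 345.
function parseCount(text) {
  if (!text) return 0;
  const match = text.replace(/,/g, '').match(/^([\d.]+)([KM]?)$/i);
  if (!match) return 0;
  const multiplier = { '': 1, K: 1e3, M: 1e6 }[match[2].toUpperCase()];
  return Math.round(parseFloat(match[1]) * multiplier);
}

async function getTweetMetrics(page) {
  // The evaluate callback runs in the browser; parseCount is applied afterwards
  // in Node, since functions are not serialized into the page context.
  const rawTweets = await page.evaluate(() =>
    [...document.querySelectorAll('article')].map((tweet) => ({
      time: tweet.querySelector('time')?.getAttribute('datetime') ?? null,
      replies: tweet.querySelector('[data-testid="reply"]')?.innerText ?? '',
      retweets: tweet.querySelector('[data-testid="retweet"]')?.innerText ?? '',
      likes: tweet.querySelector('[data-testid="like"]')?.innerText ?? '',
    }))
  );
  return rawTweets.map((t) => ({
    time: t.time,
    replies: parseCount(t.replies),
    retweets: parseCount(t.retweets),
    likes: parseCount(t.likes),
  }));
}
```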
Executing the Twitter scraping script
Here is the complete script:
Running the above, we get output similar to the following:
Epilogue
In this article, we learned how to leverage an automation platform like Browserless together with JavaScript, through Node.js, to gather statistics about a Twitter profile's activity. We hope you learned something interesting today to improve your workflow. As always, stay tuned for more educational articles.
If you like our content, you can check out how our clients use Browserless for different use cases:
- @IrishEnergyBot used web scraping to help create awareness around Green Energy
- Dropdeck automated slide deck exporting to PDF and generation of PNG thumbnails
- BigBlueButton, an open-source project, runs automated E2E tests with Browserless
__
George Gkasdrogkas,