Yelp is a go-to platform for discovering businesses, from restaurants to service providers. It offers a wealth of information, such as reviews, ratings, addresses, and contact details. This data is invaluable for competitor analysis, market research, and lead generation tasks.
However, scraping Yelp is challenging: the site deploys advanced bot detection, including CAPTCHAs and IP blocking, and loads much of its content dynamically.
In this article, we’ll look at why general-purpose libraries such as Puppeteer struggle with these constraints, and how BrowserQL handles them.
Understanding Yelp’s Page Structure
Page Layout Overview
Yelp’s business listing pages are structured to provide users with detailed information about restaurants, service providers, and local businesses.
When scraping Yelp, it’s helpful to know where specific data points are on the page so you can target them efficiently.
Below are the key areas to focus on:
- Business Name: Typically displayed prominently at the top of the listing.
- Address: Found just below the business name, showing the physical location.
- Phone Number: Usually displayed in the contact information section.
- Ratings and Reviews: Located near the top, often alongside a star rating and a total review count.
- Business Hours: Found further down, providing details about opening and closing times.
- Categories and Services Offered: Displayed as tags or descriptors near the name or in the details section.
Limitations of Puppeteer with Stealth Libraries
If you’ve tried using Puppeteer or Playwright to scrape Yelp, you’ve likely encountered some frustrating roadblocks. While these tools are great for general automation tasks, Yelp’s advanced detection systems are built to sniff out even the smallest signs of automation.
Repeated browser fingerprints, unusual headers, or repetitive browsing patterns are all red flags that can quickly lead to IP bans or CAPTCHAs. Even stealth plugins that disguise automation often get caught, as they’re rarely updated and can’t keep up with Yelp’s constantly improving detection methods.
Over time, Yelp’s systems have become highly sophisticated, capable of identifying subtle inconsistencies, such as the DevTools protocol connection Puppeteer keeps active in the background, or identical browsing behaviors repeated across sessions.
Even with tactics like rotating proxies and delays, scraping large amounts of data often leads to blocks or incomplete results. This can be frustrating when you just want reliable access to public information.
The Need for Specialized Scraping Tools
For Yelp scraping, you need a tool that doesn’t just work but genuinely understands how to bypass modern bot detection. BrowserQL was built for exactly this kind of challenge.
What sets BrowserQL apart is how it avoids leaving behind obvious automation traces. Instead of piling on extra plugins and trying to clean up Puppeteer’s mess, it uses efficient commands and humanized browsers to blend into the site’s normal traffic. If you’re trying to scrape Yelp data reliably and avoid constant roadblocks, BrowserQL gives you the tools to make it happen smoothly.
What is BrowserQL?
BrowserQL is a modern GraphQL-based language that controls browsers through the Chrome DevTools Protocol (CDP). Instead of relying on general automation tools like Puppeteer or Playwright, BrowserQL provides precise control over browser actions.
How To Get Started with BrowserQL
Getting started is simple: sign up for the 7-day free trial to test BrowserQL's features without any upfront cost, then explore the documentation to start building.
How BrowserQL Avoids Yelp's Bot Detection
Minimal Fingerprints
BrowserQL avoids the common automation footprints left behind by libraries like Puppeteer or Playwright. Yelp’s detection systems often look for header patterns or the overuse of specific browser commands. BrowserQL’s streamlined approach eliminates unnecessary interactions, making it less likely to be flagged.
Humanized Interactions
Yelp’s systems don’t just watch what you do—they evaluate how you do it. BrowserQL mimics realistic human actions, such as slight delays between typing, variable scrolling speeds, and natural mouse movements. These behaviors make automated sessions feel indistinguishable from real users, improving reliability when scraping Yelp data.
Runs on Real Hardware
Some of Yelp’s detection measures go beyond software, checking for signs of real hardware like GPUs or other device-specific identifiers. BrowserQL supports running on actual consumer hardware, which provides authentic device fingerprints. This feature is especially useful for enterprise-level projects where higher success rates are critical.
How to Start Scraping Yelp with BrowserQL
Yelp’s Use of Hardware Fingerprinting
At Browserless, we’ve closely monitored how platforms like Yelp evolve their bot detection strategies. One of the standout techniques they’ve implemented is hardware fingerprinting. This goes beyond standard browser-based detection and actively analyzes the physical attributes of the device accessing their platform, such as the CPU, GPU, and other hardware-level characteristics.
Why Does Hardware Fingerprinting Matter?
Hardware fingerprinting makes it significantly harder for virtualized environments, like shared cloud instances or emulated hardware, to go undetected. Even if your scraping behavior is indistinguishable from that of a human user—thanks to realistic typing, scrolling, and clicking—the lack of a unique hardware profile can still flag your activity as suspicious.
This challenge requires real hardware solutions, and we’re excited to offer a cutting-edge approach.
Real Hardware with Browserless
Browserless provides access to dedicated hardware to overcome Yelp's hardware fingerprinting. Running your scripts on real hardware presents the exact profile of a legitimate device, giving you a massive edge in bypassing even the most advanced detection systems. This feature is currently available exclusively through our Enterprise Plan, but we are working on extending it to shared accounts in the future.
With our Enterprise Plan, you’ll be able to:
- Leverage real hardware for scraping tasks.
- Seamlessly integrate this capability into your BrowserQL workflows.
- Access dedicated resources for the most challenging scraping projects.
What’s Next for Shared Accounts?
We’re actively working to bring this feature to our shared accounts, making it more accessible for users who don’t need the scale of an enterprise setup. While it’s only available for Enterprise users, we’re committed to expanding access soon.
Step 1: Setting Up Your Environment
Before writing any queries, set up a Node.js project with the libraries the script depends on: node-fetch for HTTP requests, cheerio for HTML parsing, and csv-writer for CSV output. You'll also configure the BrowserQL endpoint with your API key and decide where the extracted data should be saved.
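A minimal sketch of that configuration; the endpoint path, token, and file name below are placeholders to replace with values from your account:

```javascript
// Install the dependencies first with:
//   npm install node-fetch cheerio csv-writer
// Endpoint and token are placeholders -- substitute your own from the
// Browserless account page.
const browserQLUrl =
  "https://production-sfo.browserless.io/chromium/bql?token=YOUR_API_KEY";

// Where the extracted restaurant data will be written.
const outputCsv = "yelp_sushi_data.csv";

module.exports = { browserQLUrl, outputCsv };
```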
What’s Happening Here
- Import Libraries: We include node-fetch for making HTTP requests, cheerio for parsing HTML, and csv-writer for writing the output data into a CSV file.
- Set BrowserQL URL: The browserQLUrl is configured with the API endpoint and your unique API key to communicate with the Browserless platform.
- Define Output File Path: The outputCsv specifies where the extracted data will be stored as a CSV file.
Step 2: Define the BrowserQL Query
Next, we use BrowserQL to interact with Yelp, perform a search for "Sushi," and retrieve the HTML content of the results page. BrowserQL lets us simulate real human actions like typing and clicking, avoiding detection by anti-bot systems.
Here’s how we define a structured BrowserQL query to handle this:
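A sketch of such a query, wrapped as a string for the Node.js script. The goto, type, click, and html commands are standard BrowserQL mutations, but the Yelp CSS selectors here are assumptions you should verify in the BrowserQL IDE against the live page:

```javascript
// A BrowserQL mutation expressed as a plain string, ready to POST as GraphQL.
// The selectors for Yelp's search box and button are assumptions -- confirm
// them in the BrowserQL IDE before relying on this.
const searchQuery = `
mutation SearchYelp {
  goto(url: "https://www.yelp.com", waitUntil: networkIdle) {
    status
  }
  type(selector: "input#search_description", text: "Sushi") {
    time
  }
  click(selector: "button[type='submit']") {
    time
  }
  html {
    html
  }
}`;

module.exports = { searchQuery };
```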
What’s Happening Here
- Navigate to Yelp: The goto command opens Yelp and waits until all resources are fetched and the page becomes idle.
- Type Query: The type command enters "Sushi" into the search field, simulating a real user's typing rhythm with slight delays.
- Click Search: The click command simulates clicking the search button, initiating the search.
- Extract HTML: Captures the full HTML content of the search results page.
Step 3: Fetch HTML Using Node.js
With the BrowserQL query defined, we use a Node.js script to send the query to the BrowserQL API and retrieve the HTML content for parsing.
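A sketch of that request, assuming Node 18+ with its built-in global fetch (the article's imports use node-fetch, which has the same call shape); buildBqlRequest is a helper name introduced here for clarity, not from the article:

```javascript
// Build the POST request that carries the GraphQL query to BrowserQL.
function buildBqlRequest(endpoint, query) {
  return {
    url: endpoint,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ query }),
    },
  };
}

// Send the query and return the page HTML from the response.
async function fetchYelpHtml(endpoint, query) {
  const { url, options } = buildBqlRequest(endpoint, query);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`BrowserQL request failed: ${response.status}`);
  }
  const { data } = await response.json();
  // Mirrors the query's final `html { html }` block.
  return data.html.html;
}

module.exports = { buildBqlRequest, fetchYelpHtml };
```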
What’s Happening Here
- Send Query: Sends the BrowserQL query to the API endpoint using a POST request.
- Validate Response: Checks the HTTP response and handles any errors.
- Extract HTML: Returns the HTML content if the query executes successfully.
Step 4: Parse Restaurant Data from HTML
With the HTML content retrieved, we use Cheerio to parse and extract details like restaurant names, ratings, and review counts.
What’s Happening Here
- Load HTML: Cheerio loads the HTML content for easy traversal.
- Find Elements: CSS selectors are used to locate restaurant names, ratings, and review counts.
- Structure Data: The extracted details are stored as objects in an array.
Step 5: Write Data to a CSV
We save the structured restaurant data into a CSV file for easy sharing and analysis.
What’s Happening Here
- Define Output File: Sets the file path and column headers for the CSV.
- Write Data: Writes the structured restaurant data into the CSV file.
- Confirm Completion: Logs a message once the file is created successfully.
Step 6: Run the Script
Once all steps are implemented, you can execute the script directly from the terminal. The script will:
- Fetch the HTML data from Yelp using BrowserQL.
- Parse the HTML to extract restaurant details like name, rating, and the number of reviews.
- Save the structured data to a CSV file named yelp_sushi_data.csv.
Simply run the script with the following command in your terminal:
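Assuming you saved the script as scrape-yelp.js (the filename here is just an example):

```shell
node scrape-yelp.js
```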
Conclusion
BrowserQL transforms how you scrape Yelp by addressing modern anti-bot systems with intelligent features that enhance efficiency and reliability. To elevate your scraping projects, sign up for a free trial and explore the BrowserQL IDE. Dive into the documentation, test your scripts, and experience how BrowserQL simplifies data extraction while following best practices.
FAQs
Is it legal to scrape data from Yelp?
Scraping publicly available information on Yelp is generally allowed, but it's important to review Yelp's terms of service carefully. Some restrictions may apply depending on your intended use of the data. For added clarity and compliance, consult legal professionals to align your project with applicable regulations.
How does BrowserQL differ from Puppeteer and Playwright?
BrowserQL is purpose-built to handle advanced bot detection. It minimizes automation fingerprints and integrates human-like actions such as natural scrolling, realistic typing, and mouse movements. These features make it far more effective for scaling large scraping projects on platforms like Yelp, which have sophisticated anti-bot measures.
Can I use BrowserQL with my existing scraping projects?
Absolutely. BrowserQL scripts can be exported as cURL commands or JSON objects, making it easy to integrate them into your current workflows or tech stack. Whether you use Python, JavaScript, or another language, BrowserQL fits in seamlessly.
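For example, an exported query can be replayed from any stack with a plain HTTP call; the endpoint shape and token below are placeholders to adapt from your account page:

```shell
curl -X POST \
  'https://production-sfo.browserless.io/chromium/bql?token=YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{"query": "mutation { goto(url: \"https://www.yelp.com\") { status } }"}'
```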
What if BrowserQL doesn’t bypass a site’s bot detection?
If you encounter challenges with a specific bot detection system, Browserless offers dedicated support to help you troubleshoot and find solutions. The BrowserQL team actively monitors changes in detection methods to provide updated strategies and unblock challenging sites like Yelp.
How do I get started with BrowserQL?
Getting started is simple. Head to the Browserless website, sign up for a free trial, and download the BrowserQL IDE from your account page. You’ll have access to all the tools you need to begin scraping Yelp efficiently and effectively.