Web Crawling & Scraping with Puppeteer in Node.js
July 3, 2025
Ever wished you could pull data from websites without the hassle of copying and pasting it by hand? While the data is right there on the page, accessing it programmatically can be tricky. Maybe you're trying to:
- Track product prices from e-commerce platforms
- Collect news headlines for sentiment analysis
- Gather job listings for your own custom aggregator
This is where web crawling and scraping come in. In this blog, we'll explore both concepts and, step by step, build a web scraper and a basic crawler that can handle dynamic content using Puppeteer, a powerful Node.js library for controlling headless Chrome.
A Brief Introduction to Web Scraping and Crawling in Node.js
Web Crawling in Node.js
Web crawling is an automated process where specialized software programs, called crawlers or spiders, systematically navigate websites by following links from page to page. Starting from initial seed URLs, these crawlers discover new pages and collect content, similar to how search engines like Google explore and index the web.
- “Think of it like a robot that surfs the internet, hopping from one link to another, building a map of the web.”
Web Scraping in Node.js
Web scraping, on the other hand, is about extracting specific information from web pages, like grabbing product prices, article titles, or job listings. It involves parsing the HTML of a page and selecting the relevant data using tools or libraries (e.g., Puppeteer, Cheerio, BeautifulSoup).
- “It's like copying data from a webpage — but doing it with code instead of by hand.”
Why Is Puppeteer a Must for Web Scraping and Crawling?
When it comes to scraping and crawling modern, JavaScript-heavy websites, Puppeteer is a game-changer.
Puppeteer is a Node library developed by Google. It provides a high-level API for controlling Chrome or Chromium browsers through the DevTools Protocol. Unlike traditional scraping tools (like Cheerio or Axios) that only fetch static HTML, Puppeteer gives you full browser control, letting you interact with pages just like a real user would.
Here’s why it stands out:
- Handling JavaScript-heavy websites
- Full Browser Automation
- Running in both headless and non-headless modes
- Taking screenshots or PDFs
- Clicking buttons, filling forms
- Full-page rendering (good for SPAs)
Prerequisites:
For performing web scraping and web crawling using Puppeteer, you’ll need:
- Node.js Installed (v14+ recommended)
- Basic Knowledge of JavaScript & Async/Await
- Familiarity with HTML/CSS Selectors
- Node packages:
  - express
  - puppeteer
Building a Web Scraper & Crawler with Express and Puppeteer
Ever wanted your own mini API that scrapes live data from websites with just a single endpoint? Let’s build one.
In this blog, we’ll build an Express server that spins up a browser using Puppeteer, navigates to a target site, extracts data, and returns it through an API call, all in real time. For demonstration, we’ll be scraping and crawling data from https://books.toscrape.com, a beginner-friendly website built specifically for practising scraping.
Step 1: Set Up the Scraper Server and Endpoint
Once you've installed all the necessary Node packages and met the prerequisites, go ahead and create an index.js file. This will set up your Express server and neatly encapsulate all the scraping logic inside a single /get API route.
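The full listing isn't reproduced here, but a minimal sketch of that index.js might look like the following. The port number and the response shape are illustrative; the /get route is where the scraping logic from the next steps will live:

```js
const express = require("express");
const puppeteer = require("puppeteer");

const app = express();
const PORT = 3000; // illustrative port

// All scraping logic lives inside this single route.
app.get("/get", async (req, res) => {
  try {
    // Steps 2-5: launch the browser, navigate, scrape, and close.
    res.json({ message: "Scraper is ready" });
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});

app.listen(PORT, () => {
  console.log(`Scraper API running at http://localhost:${PORT}`);
});
```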
Step 2: Set up the Puppeteer and Launch the Browser
Inside the /get route, use Puppeteer to launch a browser instance and navigate to the target website — https://books.toscrape.com. This is where the actual scraping begins.
We begin with puppeteer.launch() to start a new browser session. This returns a browser instance.
- Setting headless: false opens a visible browser window for debugging purposes. Since this is our first website to scrape, we'll keep it visible to better understand what's happening in real time.
- defaultViewport: null lets the page use the full size of the browser window instead of Puppeteer's small default viewport.
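Based on the options described above, the launch call inside the /get route might look like this sketch:

```js
// Start a new browser session.
// headless: false shows the browser window (handy for debugging),
// and defaultViewport: null lets pages use the full window size.
const browser = await puppeteer.launch({
  headless: false,
  defaultViewport: null,
});
```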
Step 3: Open a new tab and navigate to the target site
We'll use the newPage() function to open a new page. It’ll give us a page object that we'll use to interact with the website.
The target page https://books.toscrape.com is loaded using page.goto().
waitUntil: "domcontentloaded" makes sure the page's DOM is fully available before we start scraping.
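A sketch of those two calls:

```js
// Open a new tab and navigate to the target site.
const page = await browser.newPage();
await page.goto("https://books.toscrape.com", {
  waitUntil: "domcontentloaded", // resolve once the DOM has been parsed
});
```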
This opens the books.toscrape.com homepage in a new browser window.

Step 4: Scrape the data
Web scraping requires direct interaction with the HTML DOM elements. A good starting point is to inspect the page and explore the elements to locate the data you want.
On inspecting the page structure, you'll notice that each book is wrapped inside an <article> tag with the class product_pod. This detail is crucial, as web scraping relies heavily on CSS selectors — in this case, .product_pod helps us target individual book elements.
In this example, we’ll extract the book’s title and price. To locate the title, we can inspect the page and observe that it's nested inside the .product_pod class, specifically within the <h3> tag and its child <a> element.
Inside page.evaluate(), we execute browser-side JavaScript directly within the page context to extract the first book’s title and price. We first locate the .product_pod element, then access its child elements to retrieve the title and price text, and finally return the extracted data.
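A sketch of that extraction; the .price_color selector is what books.toscrape.com uses for the price element at the time of writing:

```js
// Runs in the browser context, so document APIs are available here.
const firstBook = await page.evaluate(() => {
  const book = document.querySelector(".product_pod");
  return {
    title: book.querySelector("h3 a").getAttribute("title"),
    price: book.querySelector(".price_color").textContent,
  };
});
```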
Step 5: Log the result and close the browser
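A sketch of this final step, assuming the firstBook object from the previous sketch and the Express res object from Step 1:

```js
// Log the result, close the browser, and return the data from the API.
console.log("Scraped data:", firstBook);
await browser.close();
res.json(firstBook);
```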
We print the scraped data to the terminal for confirmation, which looks like the following:
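At the time of writing, the first book on the demo site yields output along these lines (exact values may differ if the site changes):

```
Scraped data: { title: 'A Light in the Attic', price: '£51.77' }
```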
Finally, the browser window is closed once scraping completes.
Note: If you want to extract all the book titles from the first page rather than just the first one, you can modify the page.evaluate() callback to loop through every .product_pod element using document.querySelectorAll. This returns an array of all book titles on the first page in a clean and efficient way — perfect when you only need the titles without additional details like price or author.
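A sketch of that variation:

```js
// Collect every book title on the current page.
const titles = await page.evaluate(() =>
  Array.from(document.querySelectorAll(".product_pod h3 a")).map((a) =>
    a.getAttribute("title")
  )
);
```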
Add Crawling: Navigate to Next Pages:
Web crawling involves navigating through multiple pages (or links) automatically, like a bot exploring the website. Here's a simple yet solid example:
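The sketch below reuses the page and browser objects from the earlier steps; the /catalogue/page-N.html URL pattern is how the demo site paginates its listings at the time of writing:

```js
const allTitles = [];

// Visit the first 3 listing pages and collect every book title on each.
for (let pageNum = 1; pageNum <= 3; pageNum++) {
  await page.goto(`https://books.toscrape.com/catalogue/page-${pageNum}.html`, {
    waitUntil: "domcontentloaded",
  });

  const titles = await page.evaluate(() =>
    Array.from(document.querySelectorAll(".product_pod h3 a")).map((a) =>
      a.getAttribute("title")
    )
  );
  allTitles.push(...titles);
}

console.log(`Collected ${allTitles.length} titles`);
await browser.close();
```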
In the above code, we didn’t change much; we simply added a loop to visit multiple pages and collect book titles from each. In this example, we loop through 3 pages, but you can extend it to as many as needed. This approach forms the basis of a simple web crawler that navigates through paginated content and extracts data.
Bonus: Screenshot Scraped Page:
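A sketch of that single call (run it before browser.close(), while the page is still open):

```js
// Save a full-page screenshot of the current page.
await page.screenshot({ path: "books_page.png", fullPage: true });
```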
The above line captures a screenshot of the current page and saves it as books_page.png. The fullPage: true option ensures that the entire scrollable page is captured, not just the part visible in the viewport.
Tips for Real-World Scraping & Crawling:
- While headless: false is helpful for debugging, switch to headless: true for better performance in production.
- Add a delay or setTimeout between page visits or requests to prevent sending too many hits in a short time.
- Always check the site’s robots.txt file to understand which pages or paths are allowed or disallowed for crawling.
- Instead of hardcoding page numbers, dynamically detect and follow the “Next” button for scalable crawling (see the sketch after this list).
- Add small, randomised delays between actions to mimic human behaviour. Note that page.waitForTimeout(ms) has been deprecated in recent Puppeteer versions, so a simple Promise-based sleep is a safer choice.
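A sketch of the “Next” button approach mentioned above, reusing the page object and allTitles array from the crawler; li.next a is the pagination link books.toscrape.com uses at the time of writing:

```js
// Keep scraping and following the "next" link until it disappears.
let hasNext = true;
while (hasNext) {
  const titles = await page.evaluate(() =>
    Array.from(document.querySelectorAll(".product_pod h3 a")).map((a) =>
      a.getAttribute("title")
    )
  );
  allTitles.push(...titles);

  // href on an anchor element resolves to an absolute URL in the browser.
  const nextHref = await page.evaluate(() => {
    const next = document.querySelector("li.next a");
    return next ? next.href : null;
  });

  if (nextHref) {
    await page.goto(nextHref, { waitUntil: "domcontentloaded" });
  } else {
    hasNext = false;
  }
}
```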
Conclusion:
You’ve just built a real-world web scraper with Puppeteer! Now you can:
- Track prices (e.g., Amazon, eBay).
- Build datasets for ML projects.
Web crawling and scraping are powerful tools to automate data collection. With Puppeteer, you get full control over a browser, making it easier to deal with modern, JavaScript-heavy sites.
Whether you want to scrape job listings, stock prices, or e-commerce data, Puppeteer gives you the precision you need.
Note: “Always scrape ethically and respect site policies to avoid legal or ethical issues.”