Build an Asynchronous Web Scraping Tool with Node.js and Cheerio.js

This guide will walk you through building an asynchronous web scraper for any website using Node.js, Axios and Cheerio.js. In this tutorial, we are going to create an asynchronous Node.js script that gathers a list of Top Rated Movies and their details from the IMDb website and stores the data as JSON.

By definition, web scraping is the process of collecting useful information from web pages through automated scripts. It simplifies gathering large amounts of data from websites that offer no API, eliminates the hassle of browsing each page manually, and lets you structure the collected data the way you want.

Node.js is a good choice for building an asynchronous web scraping tool. The open source modules provided by npm - the Node Package Manager - simplify building a web scraping application in a couple of lines of code.

To summarize, web scraping replaces the manual browsing done by a human with automated browsing and data gathering, which reduces the time needed enormously.

The Main Steps of Web Scraping

Getting started with Node.js web scraping is simple, and the method can be broken down into 3 main steps:

  • Fetch the HTML source code from the website using an HTTP request.
  • Analyze the HTML content, finding the data we want, and extracting it.
  • Store the extracted and structured data in the form of your choice. (text file, database, etc.)


How to Scrape It Manually?

Let’s see first what we will be scraping exactly from the website.

We will generate a list of top rated movies along with details about each one, such as the rating, the release year, and the movie poster. Then we will export the results as a JSON file.

This is how we will proceed manually:

  • Go to the page with the Top Rated Movies list.
  • Store the name and the rating of the first movie.
  • Store the link to the movie and follow it for more details.
  • Collect the other details about the movie and store them.
  • Repeat these steps for each movie to build the full list of top rated movies with their details.

Automating Web Scraping Using Axios and Cheerio.js

Requirements

Before starting the tutorial, make sure you have Node.js (version 8.x or later) and npm installed on your machine.

Getting started

Now, let’s create a new directory called imdbScraper and initialize it with a package.json file by running the npm init -y command from inside the newly created directory.

After that, let’s use npm to install the required modules that will help us create our web scraper.

npm install axios cheerio --save

Running this command line will install the required libraries in the node_modules directory and save them as dependencies in the package.json file.

  • Axios: Promise-based HTTP client library for Node.js and the browser.
  • Cheerio: jQuery implementation for Node.js. Cheerio simplifies the parsing of markup and offers an API for traversing/manipulating the data structure.

Extracting Information

Now, we are ready to start the implementation of our asynchronous Node.js web scraping tool.

Create a new trm-scraper.js file in the root of your project and open it using your favorite text editor. This file will contain the logic for scraping the top rated movies from the IMDb website.

Copy and paste the following code.

const cheerio = require("cheerio");
const axios = require("axios");
const fs = require("fs");

const baseUrl = "https://www.imdb.com";
const trmUrl = baseUrl + "/chart/top?ref_=nv_mv_250";

// Resolves after the given number of milliseconds; used to pause between retries.
const waitFor = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const fetchPage = async (url, n) => {
    try {
        const result = await axios.get(url);
        // console.log(result.data);
        return result.data;
    } catch (err) {
        if (n === 0) throw err;

        console.log("fetchPage(): Waiting for 3 seconds before retrying the request.");
        await waitFor(3000);
        console.log(`Retrying request, ${n - 1} attempts left ====> URL: ${url}`);
        return fetchPage(url, n - 1);
    }
};

The fetchPage function fetches the page with Axios and returns the HTML document as a string (uncomment the console.log line to print the raw HTML to your console). If anything goes wrong while fetching the page, the function waits three seconds and retries the request, up to n times.
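The retry logic does not depend on Axios itself; the same pattern works with any fetcher function. The sketch below (retryFetch and flakyFetcher are illustrative names, not part of the scraper) exercises it with a stand-in that fails twice before succeeding:

```javascript
// Resolves after ms milliseconds; used to pause between retries.
const waitFor = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Same shape as fetchPage, but the fetcher is injectable for testing.
const retryFetch = async (fetcher, url, retriesLeft, delayMs) => {
    try {
        return await fetcher(url);
    } catch (err) {
        if (retriesLeft === 0) throw err;
        await waitFor(delayMs);
        return retryFetch(fetcher, url, retriesLeft - 1, delayMs);
    }
};

// A stand-in fetcher that fails twice, then succeeds on the third attempt.
let attempts = 0;
const flakyFetcher = async (url) => {
    attempts += 1;
    if (attempts < 3) throw new Error("temporary failure");
    return "<html>fetched " + url + "</html>";
};

retryFetch(flakyFetcher, "https://example.com", 5, 10)
    .then((html) => console.log(html + " after " + attempts + " attempts"));
```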

The remaining question is how can we parse the HTML document to extract the data we are interested in? This is where Cheerio comes in.

Cheerio exposes jQuery-style methods that let us parse an HTML string and retrieve the data we want. However, before jumping into the code, we need to use the Chrome DevTools to find the HTML elements that contain the information we are interested in.

Using Chrome DevTools

Right-click the element you want to scrape and choose the Inspect option, or open the browser DevTools and use the inspector tool to highlight the body of the top rated movies table. The process is the same in every major browser.

(Screenshot: the browser DevTools inspector highlighting the table of top rated movies.)

Parsing HTML using Cheerio.js

As shown above, the table body has the class lister-list. Let’s use Cheerio.js to select all of its rows with $('.lister-list > tr'). Update the trm-scraper.js file by adding the following function.

const topRatedMovies = async () => {
    const html = await fetchPage(trmUrl, 6);
    const $ = cheerio.load(html);

    const topRatedMoviesMap = $(".lister-list > tr").map(async (index, element) => {
        const moviePoster = $(element).find(".posterColumn > a > img").attr("src");
        const movieName = $(element).find(".titleColumn > a").text();
        const movieDate = $(element).find(".titleColumn > span").text();
        const movieRating = $(element).find(".imdbRating > strong").text();

        console.log("Created Promise for movie: " + movieName);

        return {
            moviePoster,
            movieName,
            movieDate,
            movieRating
        };
    }).get();

    return Promise.all(topRatedMoviesMap);
};

To be able to parse the HTML document, we have to pass it as a parameter to the load function of Cheerio.js.

After loading the HTML string, we select the table rows inside .lister-list and map each row, using an asynchronous arrow function, into an object with the properties moviePoster, movieName, movieDate and movieRating, each extracted with the find() method.

Then we store the resulting array of Promises in the topRatedMoviesMap variable and pass it to Promise.all(), which returns a Promise that resolves into an array of plain JavaScript objects.

Storing Data as JSON

After extracting the data we are interested in, we can export it as a JSON file using the Node.js fs module.

Copy and paste the following function into the trm-scraper.js file. This function will store the extracted array of objects into a JSON file.

const exportResults = (results, outputFile) => {
    fs.writeFile(outputFile, JSON.stringify(results, null, 4), (err) => {
        if (err) {
            console.log(err);
            return;
        }
        console.log("\n" + results.length + " results exported successfully to " + outputFile);
    });
};

The final touch is to call the topRatedMovies function like this.

topRatedMovies()
    .then(results => {
        console.log("Number of results: " + results.length);
        exportResults(results, "top-rated-movies.json");
        console.log(results);
    })
    .catch(err => {
        console.log("Error while fetching top rated movies :::: " + err);
    });

The topRatedMovies function returns a Promise that resolves into an array of objects, which is handled in the then block: its arrow function prints the results and their number, and exports the produced data into a JSON file.

If any error occurs within the Promise while it is pending, we can handle the error in the catch block.
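How a rejection travels through the chain can be seen with a small stand-in for topRatedMovies (failingScrape is an illustrative name, not part of the scraper):

```javascript
// A stand-in that always rejects, like topRatedMovies would after a
// network failure once all retries are exhausted.
const failingScrape = async () => {
    throw new Error("network is down");
};

let captured = null;
const run = failingScrape()
    .then((results) => console.log("never reached: " + results.length))
    .catch((err) => {
        // The rejection skips the then handler and lands here.
        captured = err.message;
        console.log("handled: " + err.message);
    });
```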

Now open your terminal and run the command node trm-scraper.js.

You should get a top-rated-movies.json file in the root of your working directory. Open it and you will find the list of the top rated movies from imdb.com, as shown below.

(Screenshot: the generated top-rated-movies.json file.)

Takeaway

With that, our Node.js web scraping tutorial comes to an end. We defined what a web scraping tool is and built an asynchronous one with Node.js to extract useful information from a website and store it as JSON.

Even though asynchronous tools built with Node.js let us automate the data extraction process and save considerable time, we should always keep in mind that our IP address may be blocked or banned by the website we are scraping. So be careful when using these tools.

I hope that you have learned something new today and we will catch up soon for another tutorial, until then stay tuned.
