How to Build a Web Scraper with JavaScript Instead of Python or PHP – A Practical Guide

June 24, 2019

If you want to collect data from the web, you will find plenty of resources explaining how to build a web scraper with established tools such as Python or PHP.

But there are far fewer guides for another important, emerging tool: Node.js.

Thanks to Node.js, JavaScript has become an excellent language for web scraping.

Not only is Node fast, but many of the methods used in front-end JavaScript can also be used to query the DOM.

Node.js has tools for querying both static and dynamic web pages, and it integrates well with many useful APIs and Node modules.

This article explores a powerful way to use JavaScript to build a web scraper.

We will also look at one of the key concepts for writing robust data-fetching code: asynchronous code.

Asynchronous code

Fetching data is often one of the first tasks for which beginners need asynchronous code.

JavaScript is synchronous by default, which means statements run line by line.

Whenever a function is called, the program waits until the function returns before moving on to the next line of code.

But fetching data generally depends on asynchronous code. Such code is taken out of the synchronous flow of execution, so the synchronous code can carry on with its work while the asynchronous code waits for an operation to complete, such as fetching data from a website.
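To make the difference concrete, here is a minimal sketch of synchronous code carrying on while an asynchronous operation is still pending. The slowOperation function is a hypothetical stand-in for a real network request; it simply resolves after a short timer.

```javascript
const log = [];

// Hypothetical slow operation: resolves with a value after `ms` milliseconds.
function slowOperation(ms) {
  return new Promise((resolve) => setTimeout(() => resolve('data'), ms));
}

log.push('1: start');

// The callback passed to then() runs later, once the timer has fired.
const pending = slowOperation(50).then((value) => {
  log.push(`3: received ${value}`);
  return log;
});

// This line runs immediately; it does not wait for slowOperation to finish.
log.push('2: synchronous code keeps running');

pending.then((entries) => console.log(entries.join('\n')));
```

The entries are logged in the order 1, 2, 3, even though the call to slowOperation appears before the second push in the source.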

Mixing synchronous and asynchronous execution produces structures that can confuse beginners. In this article we use the async and await keywords, which were introduced in ES2017 (often loosely called ES7).

These keywords are syntactic sugar over the Promise structure introduced in ES6, and Promises are in turn a cleaner alternative to JavaScript's older callback system.


Callbacks

In the old days, JavaScript relied on callbacks: each asynchronous function was nested inside the callback of the previous one, producing what became known as the “pyramid of doom” or “callback hell”. Here is a simple example:

/* Passed-in Callbacks */
doSomething(function(result) {
  doSomethingElse(result, function(newResult) {
    doThirdThing(newResult, function(finalResult) {
      console.log(finalResult);
    }, failureCallback);
  }, failureCallback);
}, failureCallback);

 

Promises, then and catch

ES6 introduced a new structure that made asynchronous code much easier to write and debug. It is based on an object called Promise and its then and catch methods:

/* "Vanilla" Promise Syntax */
doSomething()
  .then(result => doSomethingElse(result))
  .then(newResult => doThirdThing(newResult))
  .then(finalResult => {
    console.log(finalResult);
  })
  .catch(failureCallback);
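To show how the two styles relate, the sketch below wraps a callback-style function in a Promise so that it can be chained with then and catch. The doubleLater function is hypothetical, and it is assumed to follow the Node convention of calling back with (error, result).

```javascript
// Hypothetical callback-style function: calls back with (error, result).
function doubleLater(n, callback) {
  setTimeout(() => {
    if (typeof n !== 'number') callback(new TypeError('n must be a number'));
    else callback(null, n * 2);
  }, 10);
}

// Wrap it so that it returns a Promise instead of taking a callback.
function doubleLaterAsync(n) {
  return new Promise((resolve, reject) => {
    doubleLater(n, (err, result) => (err ? reject(err) : resolve(result)));
  });
}

// The chain reads top to bottom instead of nesting deeper at each step.
const chained = doubleLaterAsync(1)
  .then((two) => doubleLaterAsync(two))
  .then((four) => doubleLaterAsync(four));

// A rejected Promise skips the then handlers and lands in catch.
const failed = doubleLaterAsync('oops').catch((err) => err.message);

chained.then((value) => console.log(value));
failed.then((message) => console.log(message));
```

For functions that already follow the (error, result) convention, Node's built-in util.promisify performs the same wrapping.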

Async and Await

Finally, ES2017 (often loosely called ES7) introduced two keywords, async and await, which let us write asynchronous code that reads as much like ordinary synchronous JavaScript as possible, as in the example below.

This latest development is widely regarded as the most readable way to run asynchronous tasks in JavaScript and, compared with plain Promise chains, it may even improve memory efficiency.

/* Async/Await Syntax */
(async () => {
  try {
    const result = await doSomething();
    const newResult = await doSomethingElse(result);
    const finalResult = await doThirdThing(newResult);
    console.log(finalResult);
  } catch(err) {
    console.log(err);
  }
})();

 

Static websites

In the past, retrieving data from another domain required XMLHttpRequest, or XHR, objects. Today we can use JavaScript's Fetch API instead.

The fetch() method takes one mandatory argument, the path of the resource we want to fetch, and returns a Promise.

To use fetch() in Node.js, you need to import a fetch implementation (Node.js 18 and later also provide fetch as a built-in global). Isomorphic Fetch is a common choice. Install it by entering the following command in the terminal:

npm install isomorphic-fetch es6-promise

Then, require it at the top of your file:

const fetch = require('isomorphic-fetch');

JSON

If you are fetching JSON data, call the json() method on the response before processing it:

(async () => {
  const response = await fetch('https://wordpress.org/wp-json');
  const json = await response.json();
  console.log(JSON.stringify(json));
})();
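In practice it is worth checking the response status before parsing the body. The following is a minimal sketch of that pattern, not a definitive implementation. So that it runs without relying on an external site, it starts a throwaway local server; that server, the /posts route and the fetchJson helper are all hypothetical. It also assumes Node.js 18 or later, where fetch is a built-in global; on older versions the isomorphic-fetch require would be needed.

```javascript
const http = require('node:http');

// Hypothetical stand-in for a real JSON API, so the sketch runs offline.
const server = http.createServer((req, res) => {
  res.setHeader('Connection', 'close'); // let the process exit promptly afterwards
  if (req.url === '/posts') {
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify([{ id: 1, title: 'Hello' }, { id: 2, title: 'World' }]));
  } else {
    res.writeHead(404);
    res.end();
  }
});

// Fetch a URL and parse the body as JSON, rejecting on non-2xx statuses.
async function fetchJson(url) {
  const response = await fetch(url); // global fetch: assumes Node.js 18+
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}

// Start the server on a free port, make two requests, then shut it down.
const demo = new Promise((resolve) => server.listen(0, '127.0.0.1', resolve))
  .then(async () => {
    const base = `http://127.0.0.1:${server.address().port}`;
    const titles = (await fetchJson(`${base}/posts`)).map((post) => post.title);
    const missing = await fetchJson(`${base}/missing`).catch((err) => err.message);
    return { titles, missing };
  })
  .finally(() => server.close());

demo.then((result) => console.log(result));
```

Against a real API, only fetchJson would be needed; the server exists purely to make the example self-contained.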

 

JSON makes getting and processing the data you need relatively straightforward. But what if the data is not in JSON format?

HTML

For most websites, we need to extract the data we want from the HTML markup. On static websites, there are two ways to do this:

First method: regular expressions

If your needs are simple and you are comfortable writing regex, you can simply use the text() method and then extract the data you need using the match method. For example, the following code extracts the contents of the first h1 tag on a page:

(async () => {
  const response = await fetch('https://example.com');
  const text = await response.text();
  // Lazy match of everything between the first <h1> and its closing tag
  console.log(text.match(/(?<=<h1>).*?(?=<\/h1>)/));
})();
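When several values are needed, capture groups and matchAll tend to be easier to work with than lookarounds. Here is a small sketch that runs on an inline HTML string standing in for the result of response.text(); the sample markup and the helper names firstHeading and allLinks are made up for illustration.

```javascript
// Sample HTML standing in for the text returned by response.text().
const html = `
  <h1>Main title</h1>
  <ul>
    <li><a href="/first">First post</a></li>
    <li><a href="/second">Second post</a></li>
  </ul>`;

// A capture group replaces the lookbehind; the lazy quantifier stops at
// the first closing tag, and the s flag lets "." match line breaks.
function firstHeading(text) {
  const match = text.match(/<h1[^>]*>(.*?)<\/h1>/s);
  return match ? match[1].trim() : null;
}

// matchAll with the g flag collects every link as { href, label }.
function allLinks(text) {
  const pattern = /<a\s+href="([^"]*)"[^>]*>(.*?)<\/a>/gs;
  return Array.from(text.matchAll(pattern), (m) => ({ href: m[1], label: m[2] }));
}

console.log(firstHeading(html));
console.log(allLinks(html));
```

Patterns like these only hold up while the markup stays simple and predictable, which is why the next method is usually the safer choice.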

 

Second method: DOM parser

If you are dealing with more complex documents, it is best to use JavaScript's array of built-in methods for querying the DOM, such as getElementById and querySelector.

If we were writing front-end code, we could use the DOMParser interface. Because we are using Node.js, we can use a Node module instead. A popular option is jsdom, which you can install with the following command:

npm i jsdom

Then require it with the following lines:

const jsdom = require("jsdom");
const { JSDOM } = jsdom;

 

Using jsdom, we can parse the fetched HTML into a DOM object and query it with querySelector and related methods:

(async () => {
  const response = await fetch('https://example.com');
  const text = await response.text();
  // The JSDOM constructor is synchronous, so no await is needed here
  const dom = new JSDOM(text);
  console.log(dom.window.document.querySelector("h1").textContent);
})();

Dynamic websites

What if we want to get data from a dynamic website, such as a social network? The content of such websites is generated on the fly, so the process is quite different.

In this case, a fetch request will not work, because it returns the site's static code rather than the dynamic content we are probably after.

Here, the best Node module to use is Puppeteer, since its main alternative, PhantomJS, is no longer being developed.

Puppeteer lets you control Chrome or Chromium over the DevTools Protocol, with features such as automated page navigation and screenshots.

By default Puppeteer launches a headless browser, but changing this setting is useful for debugging.

Getting started

To install Puppeteer, go to the project directory in the terminal and enter the following command:

npm i puppeteer

Here is some boilerplate code to get started:

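const puppeteer = require('puppeteer');

(async () => {
  // Launch with a visible window (headless: false) so we can watch what happens
  const browser = await puppeteer.launch({
    headless: false,
  });
  const page = await browser.newPage();
  // Optional: enables req.abort(), req.continue() and req.respond().
  // Once enabled, every request must be handled or the page will stall.
  await page.setRequestInterception(true);
  page.on('request', (req) => req.continue());
  await page.goto('http://www.example.com/');
})();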

We first launch Puppeteer, disabling headless mode so that we can see what it is doing. Then we open a new tab. The call to page.setRequestInterception(true) is optional; it allows us to use the abort, continue, and respond methods on each request. Finally, we go to the selected page.

As in the DOM parser example above, we can now query elements with document.querySelector and related methods, which Puppeteer runs inside the page (for example through page.evaluate).
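One way to run such queries is through Puppeteer's page.$eval and page.$$eval methods, which find elements by selector and pass them to a function that runs inside the page. The sketch below assumes page is a Puppeteer Page that has already navigated somewhere; the helper names getText and getLinks are made up for illustration.

```javascript
// `page` is assumed to be a Puppeteer Page that has already loaded a URL.

// Text of the first element matching `selector`.
// page.$eval rejects if nothing matches, so callers may want a try/catch.
async function getText(page, selector) {
  return page.$eval(selector, (el) => el.textContent.trim());
}

// The href attribute of every element matching `selector`.
async function getLinks(page, selector) {
  return page.$$eval(selector, (els) => els.map((el) => el.getAttribute('href')));
}

// Usage inside an async function, after page.goto(...):
//   const title = await getText(page, 'h1');
//   const links = await getLinks(page, 'a');
```

Because the callback runs in the browser, it can only return values that can be serialized back to Node, such as strings, numbers and plain arrays.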

Login

If you need to log in to a website, you can do so simply with the type and click methods, which identify DOM elements using the same selector syntax as querySelector:

await page.type('#username', 'UsernameGoesHere');
await page.type('#password', 'PasswordGoesHere');
// Start waiting for the navigation before clicking, so it cannot be missed
await Promise.all([
  page.waitForNavigation(),
  page.click('button'),
]);

 

Handling infinite scroll

Dynamic websites often load more content as you scroll down the page. To deal with this, we need to make Puppeteer scroll according to some criterion.

The following simple example scrolls five times, waiting one second between scrolls for content to load.

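for (let j = 0; j < 5; j++) {
  await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
  // Plain timer, since page.waitFor is not available in recent Puppeteer versions
  await new Promise((resolve) => setTimeout(resolve, 1000));
}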

Because loading times vary, the code above will not necessarily load the same number of results each time. If this is a problem, you can instead scroll until a certain number of elements have been found, or choose some other criterion.
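A rough sketch of the "scroll until enough elements are found" approach might look like this. It assumes page is a Puppeteer Page; the function name scrollUntilCount and its parameters are hypothetical, and the cap on the number of scrolls is there so that the loop always ends.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Scroll until at least `minCount` elements match `selector`, or until
// `maxScrolls` scrolls have been made. Returns the number of matches found.
async function scrollUntilCount(page, selector, minCount, maxScrolls = 20, delayMs = 1000) {
  let count = await page.$$eval(selector, (els) => els.length);
  for (let i = 0; i < maxScrolls && count < minCount; i++) {
    await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
    await sleep(delayMs); // give the site time to load the next batch
    count = await page.$$eval(selector, (els) => els.length);
  }
  return count;
}

// Usage inside an async function, with a made-up selector:
//   const found = await scrollUntilCount(page, '.post', 50);
```

The returned count can be lower than minCount if the site runs out of content or the scroll limit is reached first, so it is worth checking before relying on it.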

Optimization

Finally, there are several ways to optimize our code so that it runs as quickly and smoothly as possible. For example, here is a way to prevent Puppeteer from loading fonts and images:

await page.setRequestInterception(true);
page.on('request', (req) => {
  // Abort requests for fonts and images; let everything else through
  if (req.resourceType() === 'font' || req.resourceType() === 'image') {
    req.abort();
  } else {
    req.continue();
  }
});

 

You can also disable CSS in a similar way, but in some cases CSS is integral to the dynamic data you want to load, so be careful when doing so.
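One way to keep this choice easy to change is to hold the blocked resource types in a Set, as in the sketch below. The names BLOCKED_TYPES and handleRequest are made up for illustration; resourceType, abort and continue are the Puppeteer request methods used above.

```javascript
// Resource types to skip; add 'stylesheet' only if the site works without CSS.
const BLOCKED_TYPES = new Set(['font', 'image']);

// Handler for Puppeteer's 'request' event: abort blocked types, allow the rest.
function handleRequest(req, blocked = BLOCKED_TYPES) {
  if (blocked.has(req.resourceType())) {
    return req.abort();
  }
  return req.continue();
}

// Usage with a real page (request interception must be enabled first):
//   await page.setRequestInterception(true);
//   page.on('request', (req) => handleRequest(req));
//
// To block CSS as well, pass a different set:
//   const noCss = new Set(['font', 'image', 'stylesheet']);
//   page.on('request', (req) => handleRequest(req, noCss));
```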

Conclusion

This article has covered almost everything you need to build an efficient web scraper. Once you have collected the data in memory, you can save it to a local file using the fs module, upload it to a database, or send it straight to a document using an API such as the Google Sheets API.
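As a minimal sketch of the first option, the code below writes a list of records to a JSON file with Node's built-in fs module and reads it back. The sample records, the file name scrape-results.json and the helper saveResults are made up for illustration.

```javascript
const fs = require('node:fs/promises');
const os = require('node:os');
const path = require('node:path');

// Write scraped records to a JSON file and return the path that was written.
async function saveResults(records, filePath) {
  await fs.writeFile(filePath, JSON.stringify(records, null, 2), 'utf8');
  return filePath;
}

// Hypothetical sample data standing in for scraped results.
const sample = [{ title: 'Example Domain', url: 'https://example.com' }];
const target = path.join(os.tmpdir(), 'scrape-results.json');

// Save the data, then read it back to confirm that it round-trips.
const saved = saveResults(sample, target).then(async (file) => {
  const roundTrip = JSON.parse(await fs.readFile(file, 'utf8'));
  console.log(`Saved ${roundTrip.length} record(s) to ${file}`);
  return roundTrip;
});
```

The file goes to the system's temporary directory here only to keep the example tidy; a real scraper would more likely write into the project directory.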

Whether you are new to web scraping or already have some experience with it but are a beginner with Node.js, we hope this article has been helpful and has introduced you to some of the powerful Node.js tools that make web scraping possible.