Parsing sites with Puppeteer: collecting price data for a specific book in a bookstore

Automated data collection, better known as scraping, is a common task for many online businesses, and even for regular users who want to get data conveniently and on a regular basis in a handy format such as CSV, PDF, or DOCX. For example, we have already run into tasks of collecting airplane delay data, local market prices, blog excerpts, and so on. In this article I'll focus on web pages; we're not considering Telegram, WhatsApp, or other sources here. Usually the first thing the coder needs to do is parse a web page, and there are multiple tools for that. If you're using Python, it could be BeautifulSoup; if you're using JavaScript (more specifically, Node.js), it will be Playwright. Until recently, the JavaScript tool of choice was Puppeteer, Playwright's predecessor. We'll use Puppeteer here and leave Playwright for later articles, so the comparison will be more vivid.

We'll write a short tool that parses price data for books from the No Starch Press website.

When preparing a parsing script, the first thing to do is usually to reconnoiter the page: locate all the necessary elements and extract their corresponding selectors. Just use the Developer Tools in your favorite browser. More advanced cases involve countermeasures from the web resource: the site might use obfuscation, IP blocking, CAPTCHA protection, bot detection, and others. We'll be using some naive bot-detection bypass methods; the more advanced ones will be described in upcoming articles. The most basic bot-detection methods include checking the user agent, testing for webdriver presence, and inspecting plugins and languages. Our code includes an approach found on the internet that helps Puppeteer bypass these tests; for details, see the preparePageForTests() function in the code below.

More advanced counter-bot approaches include IP restrictions on the site's side. In that case you'll need a good source of proxies/SOCKS. There are also multiple ways to check the quality of your proxies; they will be described in upcoming articles, but a minimal check is sketched below.
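For instance, here is a minimal sketch of such a check: route Puppeteer through the proxy, fetch an IP-echo service, and time the round trip. The helper name checkProxy and the https://api.ipify.org endpoint are assumptions for illustration, not part of the original code.

const puppeteer = require('puppeteer');

// Launch a throwaway browser through the proxy, fetch an IP-echo
// service, and report the observed exit IP and round-trip time.
const checkProxy = async (proxyUrl) => {
    const browser = await puppeteer.launch({
        headless: true,
        args: [`--proxy-server=${proxyUrl}`]
    });
    const page = await browser.newPage();
    const start = Date.now();
    try {
        await page.goto('https://api.ipify.org', { timeout: 15000 });
        const ip = await page.evaluate(() => document.body.innerText.trim());
        console.log(`${proxyUrl}: OK, exit IP ${ip}, ${Date.now() - start} ms`);
    } catch (err) {
        console.log(`${proxyUrl}: failed (${err.message})`);
    } finally {
        await browser.close();
    }
};

checkProxy('socks5://localhost:9150');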

If you don't have funds to buy good proxies/SOCKS, you can use the Tor Browser as a SOCKS provider. The Tor Browser can be downloaded from https://www.torproject.org/download/. Once it's installed, you need to generate a password that will be used to control Tor. Go to Tor Browser\Browser\TorBrowser\Tor, where Tor Browser is the directory you installed Tor into, and run the following command:

tor.exe --hash-password password > yourtorpassword.txt
yourtorpassword.txt will contain the generated hash of the password we specified; it will look as follows:
16:843444F9DD902387606A6AA04DB497680613712805E609D03BD3037792
In our case we used 'password' as the password. Next, you'll need to put the hash into the Tor config file, which can be found at Tor Browser\Browser\TorBrowser\Data\Tor\torrc. The file will contain something like the following:
# This file was generated by Tor; if you edit it, comments will not be preserved 
# The old torrc file was renamed to torrc.orig.1 or similar, and Tor will ignore it

ClientOnionAuthDir C:\Tor Browser\Browser\TorBrowser\Data\Tor\onion-auth
DataDirectory C:\Tor Browser\Browser\TorBrowser\Data\Tor
GeoIPFile C:\Tor Browser\Browser\TorBrowser\Data\Tor\geoip
GeoIPv6File C:\Tor Browser\Browser\TorBrowser\Data\Tor\geoip6
HashedControlPassword 16:843444F9DD902387606A6AA04DB497680613712805E609D03BD3037792

HashedControlPassword is where we specify the hash generated in the previous step.

In order to control Tor we'll need netcat, which can be downloaded from https://eternallybored.org/misc/netcat/. The MD5 hash of the nc64 file is 523613A7B9DFA398CBD5EBD2DD0F4F38.

Attention! It's wise not to trust executable files downloaded from the internet. I recommend at least checking them with your antivirus, and if you don't trust the file, using it in a virtual machine only. In the case of netcat, some antiviruses will flag it as a hack tool: https://www.virustotal.com/gui/file/3e59379f585ebf0becb6b4e06d0fbbf806de28a4bb256e837b4555f1b4245571/detection
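To compare the published hash against your download, you can compute the MD5 locally. On Windows, for example (assuming the file is saved as nc64.exe):

certutil -hashfile nc64.exe MD5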

Once Tor is started, the control port is served by default at localhost:9151. We'll need a file with the following content:

AUTHENTICATE "password"
SETEVENTS SIGNAL
SIGNAL NEWNYM
QUIT

In our case we chose to name the file tor-change.txt. As you can see, it authenticates, requests a new identity (exit node), and quits. The following command can be used to test the node change:

nc64 localhost 9151 <tor-change.txt

As you can see, the IP changes:

[Screenshots: IP before the SOCKS change / IP after the SOCKS change]
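As an alternative to depending on netcat, the same control commands can be sent from Node.js directly with the built-in net module. A minimal sketch (the helper name requestNewIdentity is an assumption; 'password' is the control password we hashed earlier):

const net = require('net');

// Open a raw TCP connection to Tor's control port and replay the
// same commands we put in tor-change.txt.
const requestNewIdentity = () => {
    const socket = net.connect(9151, 'localhost', () => {
        socket.write('AUTHENTICATE "password"\r\n');
        socket.write('SETEVENTS SIGNAL\r\n');
        socket.write('SIGNAL NEWNYM\r\n');
        socket.write('QUIT\r\n');
    });
    // Tor answers each command with a status line such as "250 OK".
    socket.on('data', (data) => console.log(data.toString().trim()));
    socket.on('error', (err) => console.log(`control port error: ${err.message}`));
};

requestNewIdentity();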

The Tor configuration part is complete, so we're ready to write our first lines of code. After creating our project

npm init -y
we'll install the required dependency
npm i puppeteer
Our index.js will look as follows:

const puppeteer = require('puppeteer');
const util = require('util');
const exec = util.promisify(require('child_process').exec);


// Ask Tor for a fresh exit node by replaying the control commands
// from tor-change.txt through netcat.
const changeProxy = async () => {
    const { stdout, stderr } = await exec("nc64 localhost 9151 <tor-change.txt");
    console.log('stdout:', stdout);
};


const preparePageForTests = async (page) => {
    // Pass the User-Agent Test.
    const userAgent = 'Mozilla/5.0 (X11; Linux x86_64) ' +
        'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.39 Safari/537.36';
    await page.setUserAgent(userAgent);

    // Pass the Webdriver Test.
    await page.evaluateOnNewDocument(() => {
        Object.defineProperty(navigator, 'webdriver', {
            get: () => false,
        });
    });

    // Pass the Chrome Test.
    await page.evaluateOnNewDocument(() => {
        // We can mock this in as much depth as we need for the test.
        window.navigator.chrome = {
            runtime: {},
            // etc.
        };
    });

    // Pass the Permissions Test.
    await page.evaluateOnNewDocument(() => {
        const originalQuery = window.navigator.permissions.query;
        return window.navigator.permissions.query = (parameters) => (
            parameters.name === 'notifications' ?
                Promise.resolve({ state: Notification.permission }) :
                originalQuery(parameters)
        );
    });

    // Pass the Plugins Length Test.
    await page.evaluateOnNewDocument(() => {
        // Overwrite the `plugins` property to use a custom getter.
        Object.defineProperty(navigator, 'plugins', {
            // This just needs to have `length > 0` for the current test,
            // but we could mock the plugins too if necessary.
            get: () => [1, 2, 3, 4, 5],
        });
    });

    // Pass the Languages Test.
    await page.evaluateOnNewDocument(() => {
        // Overwrite the `languages` property to use a custom getter.
        Object.defineProperty(navigator, 'languages', {
            get: () => ['en-US', 'en'],
        });
    });
}


const scrape = async (pageURL) => {
    // Route the browser through the Tor SOCKS proxy we configured earlier.
    const browser = await puppeteer.launch({ headless: false, args: ['--proxy-server=socks5://localhost:9150'] });
}


scrape('https://nostarch.com/algorithmic-thinking');

Please notice the following part:

const browser = await puppeteer.launch({ headless: false, args: ['--proxy-server=socks5://localhost:9150'] });
The Tor instance we configured and started earlier serves SOCKS on localhost, port 9150. By specifying
headless: false
we configure Puppeteer to show the actual browser window. When the script is launched with
node index.js
we can browse to whatismyip.com and make sure that we're using the SOCKS proxy as intended.
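We can also perform this check from code. A minimal sketch to drop in right after the browser is launched, assuming https://api.ipify.org as an IP-echo endpoint (not part of the original code):

const page = await browser.newPage();
await page.goto('https://api.ipify.org');
// The response body is just the caller's public IP address.
const exitIP = await page.evaluate(() => document.body.innerText.trim());
console.log(`current exit IP: ${exitIP}`);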

As part of our naive anti-bot-detection routine we might want to set the window size at start-up. StatCounter shows that the most common screen resolutions are 1920x1080, 1366x768, and 1536x864. We'll use this data to pick one of those three resolutions at random:

const resolutions = [[1366, 768], [1920, 1080], [1536, 864]];
let rndRes = Math.floor(Math.random() * resolutions.length);
rndRes = resolutions[rndRes];
await page.setViewport({ width: rndRes[0], height: rndRes[1] });

We'll be parsing this page, https://nostarch.com/algorithmic-thinking, trying to collect the book price. As you can see, the price string, whose selector is .form-type-radio, contains some unnecessary description that we'll have to filter out. So we'll wait until the selector appears on the page, then get its content and filter out the extra text. The code is as follows:

const priceStringSelector = ".form-type-radio";

await page.goto(pageURL, { waitUntil: 'domcontentloaded', referer: "" });

// Wait until the price element is rendered; the timeout throws if it never appears.
await page.waitForSelector(priceStringSelector, {
    timeout: 3000
});

// Run in the page context and pull the raw text out of the price element.
const priceString = await page.evaluate((selector) => {
    const price = document.querySelector(selector);
    return price.innerText;
}, priceStringSelector);

As you can see, we're using Puppeteer's page.evaluate to run our JavaScript in the page context and get the price string out of it. Now we can use a regular expression to extract the necessary data, and then we can close the Puppeteer browser.

const priceRegEx = /(\$)(\d{1,4}\.\d{1,2})$/;
const matchedPrice = priceString.match(priceRegEx);
await browser.close();
if (!matchedPrice) {
    console.log(`price matching went wrong, alarm! The string was ${priceString}`);
    return;
}
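For clarity, here is what the match looks like on a sample string (the sample itself is an assumption about the page's wording):

const sample = 'Print Book and FREE Ebook, $39.95';
const m = sample.match(/(\$)(\d{1,4}\.\d{1,2})$/);
console.log(m[0]); // "$39.95" - the full match
console.log(m[2]); // "39.95"  - the numeric price we want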

If the match succeeds, the second capture group, matchedPrice[2], will contain the required book price. The complete code, which parses two books one after another, also contains a blockImages() function that prevents images on a page from loading; in some cases this speeds up parsing considerably. Here at webdataparsing we use the most advanced solutions for data parsing and web scraping, so if you were looking for professionals, drop us a line at support@webdataparsing.com
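The complete code isn't reproduced here, but a minimal sketch of how the pieces might fit together is below, in place of the single scrape() call at the bottom of index.js. blockImages() is implemented with Puppeteer's request interception; the second book URL is a hypothetical placeholder:

// Abort image requests so pages load faster; everything else goes through.
// Inside scrape(), call blockImages(page) right after the page is created.
const blockImages = async (page) => {
    await page.setRequestInterception(true);
    page.on('request', (request) => {
        if (request.resourceType() === 'image') {
            request.abort();
        } else {
            request.continue();
        }
    });
};

// Parse two books one after another, changing the Tor exit node in between.
// Note that Tor rate-limits NEWNYM signals, so rapid changes may be ignored.
(async () => {
    const bookURLs = [
        'https://nostarch.com/algorithmic-thinking',
        'https://nostarch.com/some-other-book' // hypothetical second URL
    ];
    for (const url of bookURLs) {
        await scrape(url);
        await changeProxy();
    }
})();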