Valuable competition data made right


Scraping Reddit post data

Many modern SPA applications store their data in a JSON object, usually found near the bottom of the page source. Over our long web scraping experience at webdataparsing we have found that these JSON objects are, along with official APIs, the most convenient way to extract data from a site. Let's consider scraping a Reddit page. For a Reddit post, the JSON object is named window.___r.
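Often no browser is needed at all: since the object is serialized straight into the HTML, it can be cut out of the raw page source. Below is a minimal sketch of that idea. The sample HTML, the marker string, and the extractEmbeddedJson helper are our own illustrative assumptions, and the naive search for the closing `};` would break if the JSON itself contained that sequence inside a string value.

```javascript
// Sketch: pulling an embedded "window.___r = {...};" blob out of raw HTML.
// The sample HTML below is a stand-in, not real Reddit markup; on the live
// site the object is far larger.
const html = `<html><body><script>
window.___r = {"post":{"id":"rb9ej8","title":"Buy the dip"}};</script></body></html>`;

function extractEmbeddedJson(source, marker = 'window.___r = ') {
    const start = source.indexOf(marker);
    if (start === -1) return null;
    const from = start + marker.length;            // points at the opening "{"
    const end = source.indexOf('};', from);        // naive: first "};" ends it
    if (end === -1) return null;
    return JSON.parse(source.slice(from, end + 1));
}

const data = extractEmbeddedJson(html);
console.log(data.post.id); // "rb9ej8"
```

This works for quick experiments; the interception approach below is sturdier when the page builds or mutates the object at runtime.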

"Reddit scraping", the art by AI called rudalle: 

As an example we'll be using this post: https://www.reddit.com/r/wallstreetbets/comments/rb9ej8/buy_the_dip_with_cat_as_requested/ Just open the page in your favorite browser and view its source (usually via the Ctrl+U hotkey). Near the bottom of the source we can see the following:

However, when using Puppeteer we get undefined: the object is apparently deleted somewhere along the way. That's not a problem, and there are several ways to work around it. This time we'll use request interception. When we intercept the request for the target page, we inject a single line that copies the required JSON object into our own variable, ___r2:

const ourScript = `<script>window.___r2 = {...window.___r}</script>`;

The function itself intercepts the request, uses fetch to retrieve the original page, and appends our one-liner to the response body:

const addOurScript = async (page, address) => {
    await page.setRequestInterception(true);
    page.on('request', interceptedRequest => {
        if (interceptedRequest.url() === address) {
            // Re-fetch the page ourselves (the global fetch needs Node 18+;
            // on older Node, use the node-fetch package) and respond with
            // the original HTML plus our injected script.
            fetch(interceptedRequest.url())
                .then(e => e.text())
                .then(res => {
                    interceptedRequest.respond({
                        status: 200,
                        contentType: 'text/html',
                        body: res + ourScript
                    });
                });
        } else {
            interceptedRequest.continue();
        }
    });
};
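For completeness, here is a sketch of how addOurScript could be driven end to end. The naming (scrapeRedditPost, postUrl) is our own and purely illustrative; it assumes Puppeteer is installed and Node 18+ for the global fetch, and repeats addOurScript so the sketch is self-contained.

```javascript
// Illustrative end-to-end driver (our own naming, not from any library).
// Assumes: puppeteer is installed, Node 18+ (for the global fetch).
const ourScript = `<script>window.___r2 = {...window.___r}</script>`;

const addOurScript = async (page, address) => {
    await page.setRequestInterception(true);
    page.on('request', interceptedRequest => {
        if (interceptedRequest.url() === address) {
            fetch(interceptedRequest.url())
                .then(e => e.text())
                .then(res => interceptedRequest.respond({
                    status: 200,
                    contentType: 'text/html',
                    body: res + ourScript
                }));
        } else interceptedRequest.continue();
    });
};

async function scrapeRedditPost(postUrl) {
    const puppeteer = require('puppeteer');
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // The interception must be registered BEFORE navigation, so the
    // document request for postUrl is the one we rewrite.
    await addOurScript(page, postUrl);
    await page.goto(postUrl, { waitUntil: 'networkidle2' });

    // window.___r gets deleted by the page, but our copy survives.
    const data = await page.evaluate(() => window.___r2);
    await browser.close();
    return data;
}

// Example call (requires network access and a bundled Chromium):
// scrapeRedditPost('https://www.reddit.com/r/wallstreetbets/comments/rb9ej8/buy_the_dip_with_cat_as_requested/')
//     .then(data => console.log(Object.keys(data)));
```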

Now that we have access to the JSON object in the ___r2 variable, we can use it to extract data. It's a big JSON object (almost 500 KB) containing all the required data. Let's save it as a text file and look inside.

There are two approaches to figuring out the interesting parts of the structure. The first is to inspect it manually with a convenient tool such as https://jsonlint.com/. The other is to write a function that shows the path to a required key or value inside the JSON object. The latter is preferable. The function also needs to accept which occurrence of the match we are after: we can see that the original poster is someone with the nickname "Rapid Response Meme Strike Force", and this nick appears in our text file at least twice, so let's find the path to the second one. The function traverses the object tree and returns the required path. Here is the function:

const foundPath = [];
// Countdown of matches: it hits zero on the occurrence we want
// (2 here, since we are after the second occurrence of the nickname).
let numberOfOccurrence = 2;

function findDeepObjectPropertyValue(theObject, propertyToFind) {
    let result = null;

    if (theObject instanceof Array) {
        // Arrays: recurse into every element, tracking the index in foundPath.
        for (let i = 0; i < theObject.length; i++) {
            foundPath.push(i);
            result = findDeepObjectPropertyValue(theObject[i], propertyToFind);
            if (result) break;
            foundPath.pop(); // dead end, backtrack
        }
    } else {
        for (const prop in theObject) {
            // Match on the key itself...
            if (prop == propertyToFind) {
                if (!--numberOfOccurrence) {
                    result = theObject;
                    break;
                }
            // ...or on the value stored under that key.
            } else if (theObject[prop] == propertyToFind) {
                if (!--numberOfOccurrence) {
                    foundPath.push(prop);
                    result = prop;
                    break;
                }
            }
            // Recurse into nested objects and arrays.
            if (theObject[prop] instanceof Object || theObject[prop] instanceof Array) {
                foundPath.push(prop);
                result = findDeepObjectPropertyValue(theObject[prop], propertyToFind);
                if (result) break;
                foundPath.pop(); // dead end, backtrack
            }
        }
    }
    return result;
}

We get the result in foundPath. Here it is:

___r2.authorFlair.models.t5_2th52.pittluke.richtext[0].t

So we can use this function to find and understand the paths useful for scraping the current page, for example the authors and texts in the comments section. Let's say we need to scrape Reddit comments. We can see that there is a comment by dirkve33, so let's find it in our data with the same function:

findDeepObjectPropertyValue(___r2, "dirkve33");

And we get:

[ 'features', 'comments', 'models', 't1_hnn8fql', 'author' ]

That's what we need to get the IDs of the comments. We can scrape Reddit data and much more. Just contact us at support@webdataparsing.com
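Once a path has been found, it can be turned back into a value with a tiny helper: a reduce over the keys. The getByPath helper and the sample object below are our own, purely illustrative, shaped after the comment path above:

```javascript
// Walk an object along a path (array of keys/indices) produced by the
// traversal function above. The sample data mimics the comment structure
// but is made up for illustration.
const getByPath = (obj, path) => path.reduce((node, key) => node?.[key], obj);

const sample = {
    features: {
        comments: {
            models: {
                t1_hnn8fql: { author: 'dirkve33' },
            },
        },
    },
};

const path = ['features', 'comments', 'models', 't1_hnn8fql', 'author'];
console.log(getByPath(sample, path)); // "dirkve33"
```

The optional chaining (node?.[key]) makes the lookup return undefined instead of throwing when a path no longer matches, which happens often as Reddit changes its payload.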