Website Scraping Using HAR File

All new employees at Serpstat get a lot of information in the first few weeks of their work. We prefer to structure all this information and write it down to have a source of truth that you can reference at any time.

As a knowledge base and onboarding tool we use AcademyOcean. It allows you to create an academy as a container for courses. These courses might be useful for your employees (internal use case) or clients (external use case). Check their website for more details.

As the company grows, some academy content becomes obsolete. So from time to time I have to update the academy for our product managers. The amount of these updates varies a lot, but eventually the day comes for a total cleanup and restructuring.

It turned out that AcademyOcean had no way to export all the content in a single file. Editing a few lessons in a course is easy and convenient, but if you have to rebuild everything from scratch it becomes a pain. So it was a good time to unpack my programming skills and scrape everything :)

The obvious solution was to use PhantomJS or headless Chrome to get past the authorisation, walk through all the pages and save the content to a database. Easy, right? As mestre Rui (my Capoeira master) says when he asks whether we got what he is saying: “I don’t think so”.

Playing with a headless browser is a good exercise, but I was sure it would shift my focus from the main problem of editing the courses. So I had to find a faster way to achieve my goal. For the same reason I didn’t use any of the scraping software I was aware of.

One of the cool features of AcademyOcean is that it autosaves the content while you edit, just like Google Docs does. So my assumption was that there was a mechanism that transfers data from my browser to their servers with these autosaves.

In the Network panel of Chrome you can find all the requests that your browser sends to the server. I usually use the CTRL+SHIFT+J hotkey to open the JavaScript console on my Windows machine and then jump to the Network tab:

Google Chrome Network tab

So I checked whether there was any network activity related to autosaving. It turned out that with every autosave the browser sends a POST request with all the HTML of the edited content. So the next step was to get this data out of the Network tab:

POST request payload

There is a magic button in the Chrome console that lets you export a HAR file. This file contains all the network activity in JSON format. You can filter the requests in the Network tab to reduce the amount of data in the exported HAR file. So I set up a filter and walked through all the lessons to capture the requests in a single file:

Filter requests and export HAR file from Chrome developer tools
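Before parsing anything, it helps to see that a HAR file is just JSON with the captured requests sitting under log$entries. A minimal sketch of that structure (the inline HAR snippet and the example.com URLs here are made up for illustration; in practice you would read your exported file instead):

```r
library(jsonlite)

# A tiny hand-written HAR document: log$entries holds one record per request
har_json <- '{
  "log": {
    "entries": [
      {"request":  {"method": "GET",
                    "url": "https://example.com/assets/app.js"},
       "response": {"status": 200}},
      {"request":  {"method": "POST",
                    "url": "https://example.com/account/courses/edit/42"},
       "response": {"status": 200}}
    ]
  }
}'

entries <- fromJSON(har_json)$log$entries

# Quick overview of what was captured: one row per request
overview <- data.frame(
    method = entries$request$method,
    url    = entries$request$url,
    status = entries$response$status
)
print(overview)
```

With a real export you would replace the inline string with fromJSON('your-file.har'), and the same log$entries path applies.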

You can use any tool that reads JSON to get all the content out of a HAR file. Let’s add some code here to solve this task. The programming language I use all the time is R, so here is a piece of code that parses the JSON from the HAR:

# read all captured requests from the exported HAR file
# ('requests.har' stands for whatever name you gave the export)
data <- jsonlite::fromJSON('requests.har')$log$entries

# keep only the autosave POSTs sent by the lesson editor
lessons <- data$request$postData$params[
    data$request$method == 'POST' &
    grepl(pattern = '/account/courses/edit/',
          x       = data$request$url,
          fixed   = TRUE)
]

# the second parameter of each request holds the URL-encoded lesson HTML
texts <- unlist(
    lapply(lessons,
           function(x) iconv(URLdecode(x$value[2]),
                             from = "UTF-8", to = "1251"))
)

# in form data '+' encodes a space, so restore it
texts <- gsub('+', ' ', texts, fixed = TRUE)
write(texts, 'result.html')

The first line of code reads the JSON from the HAR file to get all the captured requests. Then you have to look through the JSON tree to find the nested elements holding the content of each lesson. You can do that by filtering the POST requests by the requested URL.
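That filtering step can be sketched on a toy data frame standing in for data$request (the URLs below are made up; with a real HAR the same logical index picks out the editor requests):

```r
# Toy stand-in for data$request: one asset GET, one lesson-editor POST
req <- data.frame(
    method = c('GET', 'POST'),
    url    = c('https://example.com/assets/app.js',
               'https://example.com/account/courses/edit/42'),
    stringsAsFactors = FALSE
)

# the same condition as in the script: POSTs hitting the editor URL
keep <- req$method == 'POST' &
        grepl('/account/courses/edit/', req$url, fixed = TRUE)

print(req$url[keep])  # only the editor request survives
```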

We have some Cyrillic symbols in our academy, so some necessary encoding manipulation comes next. And finally you do a little cleanup to drop redundant symbols and save the results to an HTML file with all the markup from the academy’s text editor.
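To see what that decoding and cleanup actually does, here is a minimal sketch with a made-up URL-encoded payload of the kind that sits in the HAR:

```r
# A made-up URL-encoded payload, the way form data appears in the HAR file
raw <- '%3Cp%3EHello+world%3C%2Fp%3E'

html <- URLdecode(raw)                      # %3C -> '<', %2F -> '/', ...
html <- gsub('+', ' ', html, fixed = TRUE)  # '+' encodes a space in form data
print(html)  # "<p>Hello world</p>"
```

Note that URLdecode() leaves the '+' characters alone, which is why the extra gsub() pass is needed before writing the HTML out.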

And that’s it. All the text content looks like a regular HTML page now. Editing HTML files is beyond the scope of this article :)

Side note. Vladimir Polo, I told you that I’d scrape the academy. This approach is quite lazy, but it took me less time than writing this article.
