mala::home Davide “+mala” Eynard’s website

6 Feb 2011

Javascript scraper basics

As you already know, I am very much into scraping. The main reason is that I think what is available on the Internet (and in particular on the Web) should be consumed not only in the way it is provided, but also in a more customized one. We should be able to automatically gather information from different sources, integrate it, and obtain new information by processing what we found.

Of course, this is closely related to my research topic, the Semantic Web. In the SW we suppose we already have data in formats that are easy for a machine to consume automatically, and a good part of the SW research community's effort is directed towards enabling standards and technologies that allow us to publish information this way. Even though we have made some progress on this, there are still a couple of problems:

  1. the casual Internet user does not (and does not want to) know about these technologies and how to use them, so the whole "semantic publishing" process should be made totally invisible to her;
  2. most of the data that has already been published on the Web in the last 20 years does not follow these standards. It has been produced for human consumption and cannot be easily parsed in an automatic way.

Fortunately, some structure in data exists anyway: information written in tables, infoboxes, or generated by software from structured data (e.g. data saved in a database) typically shows a structure that, albeit informal (that is, not following a formal standard), can be exploited to get the original structured data back. Scrapers rely on this principle to extract relevant information from Web pages and, if provided with a correct "crawling plan" (I'm makin' up this name, but I think you can easily understand what I mean ;)), they can more or less easily gather all the contents we need, reconstructing the knowledge we are interested in and allowing us to perform new operations on it.

There are, however, cases in which this operation is not trivial. For instance, when the crawling plan is not easy to reproduce (e.g. contents belong to pages which do not follow a well defined structure). Different problems arise when the page is generated on the fly by some Javascript code, which makes parsing the HTML source code useless. This reminds me a lot of dynamically generated code as a software protection: when we moved from statically compiled code to code that updated itself at runtime (such as with packed or encrypted executables), studying a protection on its "dead listing" became more difficult and so we had to change our tools (from simple disassemblers to debuggers or interactive disassemblers) and approaches... but that is, probably, another story ;-) Finally, we might want to provide an easier, more interactive approach to Web scraping that allows users to dynamically choose the page they want to analyze while they are browsing it, and that does not require them to switch to another application to do this.

So, how can we run scrapers dynamically, on the page we are viewing, even if part of its contents has been generated on the fly? Well, we can write a few lines of Javascript code and run them from our own browser. Of course there are already apps that allow you to do similar things very easily (did anyone say Greasemonkey?), but starting from scratch is a good way to get some practice and learn new things... then you'll always be able to go back to GM later ;-) So, I am going to show you some very easy examples of Javascript code that I have tested on Firefox. I think they should be simple enough to run in other browsers too... well, if you have any problem with them just let me know.

First things first: as usual I'll be playing with regular expressions, so check the Javascript regexp references at regular-expressions.info and javascriptkit.com (here and here) and use the online regex tester tool at regexpal.com. And, of course, to learn the basics of bookmarklets and get an idea of what you can do with them, check Fravia's and Ritz's tutorials. Also, check when these tutorials were written and realize how early their authors understood that Javascript would change the way we use our browsers...
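One detail worth a quick sketch before the examples: they all build their expressions with the RegExp constructor, where the pattern is a string, so every backslash has to be written twice. A regexp literal needs only one. The sample text and variable names below are made up for illustration.

```javascript
// The same pattern written in two ways (sample pattern and text are made up).
// As a string, each backslash must be doubled; the flags go in a second string.
var fromString = new RegExp("(\\d+)\\s+years", "i");
// As a literal, backslashes are written once and the flags follow the slash.
var fromLiteral = /(\d+)\s+years/i;

var text = "Age: 31 Years";
var a = fromString.exec(text);
var b = fromLiteral.exec(text);

// Both calls return an array: index 0 is the whole match,
// index 1 is the content of the first group in parentheses.
console.log(a[1], b[1]);
```

Forgetting to double the backslash in the string form does not raise an error: "\s" inside a string is just "s", so the expression silently matches something else.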

Second things second: how can you test your Javascript code easily? Well, apart from pasting it inside your location bar as a "javascript:" one-liner (we'll get to that later when we build our bookmarklets), I found the "jsenv" bookmarklet at squarefree very useful: just drag the jsenv link to your bookmark toolbar, click on the new bookmark (or just click on the link appearing in the page) and see what happens. From this very rudimentary development environment we'll be able to access the page's contents and test our scripts. Remember that this script can only access the contents of the page that is displayed when you run it, so if you want it to work with some other page's contents you'll need to open it again.
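To give an idea of what a "javascript:" one-liner looks like, here is a minimal sketch that turns a script into a bookmarklet URL. The helper name toBookmarklet is hypothetical, and wrapping the code in a function is just one common convention, not the only way to do it.

```javascript
// Hypothetical helper: turns some Javascript source into a "javascript:" URL.
function toBookmarklet(source) {
  // Wrapping the code in a function keeps its variables out of the page's
  // global scope; the trailing "void 0" makes the URL evaluate to undefined,
  // so the browser does not replace the page with a returned value.
  var wrapped = "(function(){" + source + "})();void 0;";
  // Encode characters that are not safe inside a URL (spaces, quotes, braces...)
  return "javascript:" + encodeURIComponent(wrapped);
}

var url = toBookmarklet("alert(document.title);");
console.log(url);
```

The resulting string can be saved as the address of a new bookmark: clicking it runs the script on whatever page is currently displayed.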

Now we should be ready to test the examples. For each of them I will provide both some "human readable" source code and a bookmarklet you can "install" as you did with jsenv.

Example 1: MySwitzerland.com city name extractor

This use case is a very practical one: for my work I needed to get a list of the tourist destinations in Switzerland as described on the MySwitzerland.com website. I knew the pages I wanted to parse (one for every region in Switzerland, like this one) but I did not want to waste time collecting the names by hand. So, the following quick script helped me:

var results = "", array;
var content = document.documentElement.innerHTML;
var re = new RegExp ("<div style=\"border-bottom:.*?>([^<]+)</div>","gi");
while (array = re.exec(content)){
  results += array[1]+"<br/>";
}
document.documentElement.innerHTML=results;

Just a few quick comments about the code:

  • line 2 gets the page content and saves it in the "content" variable
  • line 3 initializes a new regular expression (whose content and modifiers are defined as strings) and saves it in the "re" variable
  • line 4 executes the regexp on the content variable and, as long as it returns matches, saves them in the "array" variable. As there is only one matching group (the part enclosed in parentheses inside the regexp), the matched text will be saved inside array[1]
  • in line 5 the result of the match, plus "<br/>", is appended to the "results" variable, which was initialized as an empty string in line 1
  • in the last line, the value of "results" is printed as the new Web page content

The bookmarklet is here: MySwitzerland City Extractor.
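If you want to try the extraction loop without opening the site, the same logic can be sketched as a function that takes the HTML as a string. The helper name extractAll and the sample markup are made up: the markup only imitates the structure the regular expression looks for, it is not copied from the real pages.

```javascript
// Hypothetical helper: returns the first matching group of every match.
function extractAll(html, pattern) {
  // The "g" flag makes exec() resume after the previous match on each call
  var re = new RegExp(pattern, "gi");
  var found = [];
  var match;
  // exec() returns null when there are no more matches
  while ((match = re.exec(html)) !== null) {
    found.push(match[1]);
  }
  return found;
}

// Made-up markup imitating the structure the script relies on
var sample =
  '<div style="border-bottom: 1px solid #ccc">Lugano</div>' +
  '<div style="border-bottom: 1px solid #ccc">Locarno</div>';

var cities = extractAll(sample, '<div style="border-bottom:.*?>([^<]+)</div>');
console.log(cities.join("\n"));
```

In the browser, passing document.documentElement.innerHTML as the first argument and joining the result with "<br/>" gives the same output as the script above.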

Example 2: Facebook's phonebook conversion tool

This example shows how to quickly create a CSV file out of Facebook's Phonebook page. It is a simplified version that only extracts the first phone number for every listed contact, and the regular expression is kind of redundant but is a little more readable and should allow you, if you want, to easily find where it matches within the HTML source.

var output = "", array;
var content = document.documentElement.innerHTML;
var re = new RegExp ("<div class=\"fsl fwb fcb\">.*?<a href=\"[^\"]+\">([^<]+)<.*?<div class=\"fsl\">([^<]+)<span class=\"pls fss fcg\">([^<]+)</span>", "gi");
while (array = re.exec(content)){
 output += array[1]+";"+array[2]+";"+array[3]+"<br/>";
}
document.documentElement.innerHTML = output;

As you can see, the structure is the very same as in the previous example. The only difference is in the regular expression (which has three matching groups: one for the name, one for the phone number, and one for the type, e.g. mobile, home, etc.) and, obviously, in the output. You can get the bookmarklet here: Facebook Phonebook to CSV.
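One thing this simplified script does not handle is a field that itself contains a semicolon or a double quote, which would break the columns of the resulting file. The following is a minimal sketch of the usual CSV convention (wrap the field in double quotes and double any quote inside it). The helper names toCsvField and toCsvRow and the contact data are made up for illustration.

```javascript
// Hypothetical helper: quotes a single field only when it is needed.
function toCsvField(value) {
  var text = String(value).trim();
  // A field containing the separator, a quote or a newline must be quoted
  if (/[";\n]/.test(text)) {
    // Inner double quotes are escaped by doubling them
    return '"' + text.replace(/"/g, '""') + '"';
  }
  return text;
}

// Hypothetical helper: joins the fields with the same ";" separator
// used by the script above.
function toCsvRow(fields) {
  return fields.map(toCsvField).join(";");
}

// Made-up contact: the name contains the separator character
var row = toCsvRow(["Doe; John", "555 0100", "Mobile"]);
console.log(row);
```

Inside the while loop, the three matching groups could be passed to such a function instead of being concatenated directly.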

Example 3: Baseball stats

This last example shows how you can get new information from a Web page just by running a specific tool on it: data from the Web page, once extracted from the HTML code, becomes the input for an analysis tool (that can be more or less advanced - in our case it is going to be trivial) that returns its output to the user. For the example I decided to calculate an average value from the ones shown in a baseball statistics table at baseball-reference.com. Of course I know nothing about baseball so I just picked a value even I could understand, that is the team members' age :) Yes, the script runs on a generic team roster page and returns the average age of its members... probably not very useful, but it should give you the idea. Ah, and speaking of this, there are already much more powerful tools that perform similar tasks, like statcrunch, which automatically detects tables within Web pages and imports them so you can calculate statistics over their values. The advantage of our approach, however, is that it is completely general: any operation can be performed on any piece of data found within a Web page.

The code for this script follows. I have tried to be as verbose as I could, dividing data extraction into three steps (get team name/year, extract team members' list, get their ages). As the final message is just a string and does not require a page of its own, the calculated average is shown in an alert box.

var team, count, sum, re, array;
var payrolls = "";
var content = document.documentElement.innerHTML;

// get team name/year
re = new RegExp ("<h1>([^<]+)<", "gi");
if (array = re.exec(content)){
 team = array[1];
}

// extract "team future payrolls" to get team members' list
re = new RegExp ("div_payroll([\\s\\S]+?)appearances_table", "i");
if (array = re.exec(content)){
 payrolls = array[1];
}

sum=0;count=0;
// extract members' ages and sum them (radix 10 makes the parsing explicit)
re = new RegExp ("</a></td>\\s+<td align=\"right\">([^<]+)</td>", "gi");
while (array = re.exec(payrolls)){
 sum += parseInt(array[1], 10);
 count++;
}

// avoid dividing by zero when no age has been found
if (count > 0){
 alert("The average age of "+team+" is "+sum/count+" out of "+count+" team members");
} else {
 alert("No ages found in this page");
}

The bookmarklet for this example is available here: Baseball stats - Age.
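The last step (extracting the ages and averaging them) can also be sketched as a function working on a string, so that it can be tried outside the browser. The helper name averageAge and the sample rows are made up: the rows only imitate the cell layout the regular expression expects, they are not taken from baseball-reference.com.

```javascript
// Hypothetical helper: averages the numbers found in right-aligned cells
// that directly follow a cell ending with a link.
function averageAge(html) {
  var re = /<\/a><\/td>\s+<td align="right">([^<]+)<\/td>/gi;
  var sum = 0, count = 0, match;
  while ((match = re.exec(html)) !== null) {
    var age = parseInt(match[1], 10);
    // Skip cells that do not contain a number
    if (!isNaN(age)) {
      sum += age;
      count++;
    }
  }
  // Return null instead of NaN when nothing has been found
  return count > 0 ? { average: sum / count, count: count } : null;
}

// Made-up rows imitating the layout the expression looks for
var sample =
  '<tr><td><a href="#">Player A</a></td>\n<td align="right">30</td></tr>' +
  '<tr><td><a href="#">Player B</a></td>\n<td align="right">25</td></tr>';

var stats = averageAge(sample);
console.log(stats.average + " out of " + stats.count + " team members");
```

Replacing the average with any other calculation (a maximum, a median, a count above some threshold) only requires changing the body of the loop and the returned value.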

Conclusions

In this article I have shown some very simple examples of Javascript scrapers that can be saved as bookmarklets and run on the fly while you are browsing specific Web pages. They should both give you an idea of the potential of this approach and teach you the basics of Javascript regular expressions so that you can apply them in other contexts. If you want to delve deeper into this topic, however, I suggest you take a look at Greasemonkey and its huge repository of user-provided scripts: that is probably the best way to go if you want to easily develop and share your powerbrowsing tools.