mala::home Davide “+mala” Eynard’s website

2 May 2011

The importance of detecting patterns

A student whose M.S. thesis project I have followed over the last few months recently handed in his final thesis report. We spent some time talking about his work, which parts of it he liked most, and what he had learned from this experience. The project turned out to be quite appreciated (this is one of the papers that we published with him) and I was glad to hear that he particularly liked the part involving scraping. He also gained a good deal of experience with the topic, enough to find its weakest parts (e.g. relying only on regexps to extract page contents is bad, as some kinds of pages change pretty often and you might need to rewrite your expressions each time) and its most interesting generalizations. During our talk, he told me that one of the most impressive things for him was that websites belong to families, and if you can spot the common patterns within a family you can write general scrapers that work well with many sites.

Actually, this is not such an amazing piece of news, but it is SO true that I felt it was worth spending some time on it. I experienced this concept first-hand while I was writing one of my first crawlers: it was called TWO (The Working Offline forum reader, a name which says a lot about my previous successes in writing scrapers ;)) and its purpose was to download the full contents of a forum into a DB, allowing users to read them offline. This tool relied on a set of plugins which allowed it to work with different forum technologies (I was supporting phpBB, vBulletin, and a much less standard package used for some very specific forums about reversing and searching). Of course, what I noticed while writing it is that all forums could be roughly described in the same way: every forum has some threads, every thread has different messages, and every message has specific properties (Author, Date, Subject, Message body) that are repeated throughout the system. Having different systems conform to the same patterns provides some advantages:

  1. crawlers are usually built over two main components: a "browsing component" that deals with finding the right starting pages and following the right links in them, and a "scraping component" that deals with extracting relevant page contents. We can exploit browsing patterns to use the same browsing component for crawling different websites/systems (a minimal sketch of both components follows this list). For instance, we know that in every main forum page we will follow some links (with a format that can be specified with a regexp) to access different sub-forums containing lists of threads, and then we will access one or more pages with the typical "list" pattern (go to first, last, previous, or next page);
  2. similarly, we can exploit scraping patterns to use the same data structures for objects retrieved from different sources. For instance, we can create a "forumMessage" object whose attributes are the message content, the subject, the date, and the author, and these will hold for all the messages we get from any website;
  3. being able to bring together, in the same objects, homogeneous types of information from heterogeneous sources allows us to analyze it automatically and extract new knowledge. For instance, if we know we have a "rating" field in two websites which is comparable (i.e. uses a compatible range of values, or can be normalized), then we can calculate an average of this rating over different services (a value that most websites, competing with each other, will never give you).
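
To make this a bit more concrete, here is a minimal Perl sketch of the idea. It is not the actual TWO code: the regular expressions, URLs, and field names are invented placeholders, not the real phpBB/vBulletin markup. One generic routine plays both the browsing and the scraping component, and each forum "family" is just a small description of its patterns:

#!/usr/bin/perl
# Minimal sketch, NOT the original TWO code: one generic crawler driven by
# per-family pattern descriptions. The regexps and URLs are invented
# placeholders, not the real phpBB/vBulletin markup.
use strict;
use warnings;
use LWP::UserAgent;
use URI;

my %families = (
    phpbb_like => {
        message   => qr{<div class="postbody">(.*?)</div>}s,
        next_page => qr{<a href="([^"]+)">Next</a>},
    },
    # supporting another forum software only means adding a new entry here
);

sub crawl_thread {
    my ($start_url, $family) = @_;
    my $ua = LWP::UserAgent->new(agent => 'TWO-sketch/0.1');
    my @messages;
    my $page = $start_url;
    while (defined $page) {
        my $res = $ua->get($page);              # browsing component: fetch...
        last unless $res->is_success;
        my $html = $res->decoded_content;
        push @messages, $html =~ /$family->{message}/g;   # scraping component
        my ($next) = $html =~ /$family->{next_page}/;     # ...and follow "next"
        $page = defined $next ? URI->new_abs($next, $page)->as_string : undef;
    }
    return @messages;
}

my @msgs = crawl_thread('http://forum.example.com/viewtopic.php?t=42',
                        $families{phpbb_like});
print scalar(@msgs), " messages scraped\n";

Supporting a new forum software then boils down to writing one more family description, which is exactly the kind of generalization my student found so useful.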

To give you a practical example, I have applied this concept to search engines to write a simple but very flexible search engine aggregator. I needed to get results from different search engines, merge them, and calculate, for instance, how many different domains the results could be grouped into, which engine returned more results, and so on. I made some example searches with Google, Yahoo, Bing, and others, and described the general access pattern as follows:

  1. first of all, a user agent connects to the search engine at a specified address (in most cases this is the engine's default domain, but sometimes you might want to force a different URL, e.g. to get results in a given language rather than the default one for your location);
  2. then, the main search form is filled with the search string and the parameters you want your results to comply with (e.g. number of results per page, language, date restrictions, and so on);
  3. when the results are returned, they are usually shown in a well-defined way (i.e. easily parsable with a given regular expression), and if they do not fit in one page it is possible to follow links to the next ones. If you want all the results, you just need to keep loading the next results page until no more "next" link appears.

The result is a (well, yet another) scraper. Its input is a JSON file like the following:

{
  "engines": {
    "google": {
      "fields": {
        "num": 10,
        "q": "$search"
      },
      "id": 1,
      "next": "Next",
      "nextURL": "google.com/search",
      "regexp": "<a href=\"([^\"]+)\" class=l>",
      "sleep": 2,
      "url": "http://www.google.com/ncr"
    },
    "bing": {
      "fields": {
        "q": "$search"
      },
      "id": 2,
      "next": "^Next$",
      "regexp": "<div class=\"sb_tlst\"><h3><a href=\"([^\"]+)\"",
      "sleep": 5,
      "url": "http://www.bing.com/?scope=web&mkt=it-IT&FORM=W0LH"
    },
    "yahoo": {
      "fields": {
        "n": 10,
        "p": "$search"
      },
      "id": 3,
      "next": "^Next",
      "regexp": "<a class=\"yschttl spt\" href=\"([^\"]+)\"",
      "sleep": 5,
      "url": "http://www.yahoo.com"
    }
  },
  "search": {
    "1": "\"Davide Eynard\"",
    "2": "aittalam",
    "3": "3564020356"
  }
}

As you can see, all the search engines are described in the same way (miracles of patterns :)). Each one of them has a name and an id (used later to match it with its own results), a URL and a set of input fields, a sleep time (useful to avoid being banned for hammering a site too much), plus regular expressions to parse both the single URLs returned as results and the "Next" links inside result pages. To test the scraper, download it here. Pass it the text above as input (either by pasting it or by saving it in a file and redirecting its contents, e.g. perl se.pl <testsearch.json). The scraper is going to take some time, as it queries all three search engines with all three search strings (I left $DEBUG=1 so you can see what is happening while it runs). At the end you will get another JSON document containing the results of your combined search: a list of all the matching URLs grouped by domain, together with the search engine(s) that returned them and their rankings.
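
If you are curious about what the core of such a scraper might look like, here is a heavily simplified sketch. It is not the actual se.pl: it assumes the fields can simply be submitted as GET parameters, it stops at the first result page (no "Next" link following), and it skips the grouping by domain that the real script performs.

#!/usr/bin/perl
# Simplified sketch of the aggregator's core loop, NOT the actual se.pl.
# Assumptions: fields can be submitted as GET parameters, only the first
# result page is parsed (no "Next" following), no grouping by domain.
use strict;
use warnings;
use JSON::PP qw(decode_json encode_json);
use LWP::UserAgent;
use URI;

my $conf = decode_json(do { local $/; <STDIN> });   # slurp the JSON config
my $ua   = LWP::UserAgent->new(agent => 'se-sketch/0.1');
my %results;                                        # url => list of engine ids

for my $sid (sort keys %{ $conf->{search} }) {
    my $query = $conf->{search}{$sid};
    for my $name (keys %{ $conf->{engines} }) {
        my $e      = $conf->{engines}{$name};
        my %fields = %{ $e->{fields} };
        for (values %fields) { $_ = $query if $_ eq '$search' }
        # guess: use nextURL (the results page) if present, else the start URL
        my $uri = URI->new($e->{nextURL} ? "http://$e->{nextURL}" : $e->{url});
        $uri->query_form(%fields);
        my $res = $ua->get($uri);
        next unless $res->is_success;
        my $html = $res->decoded_content;
        while ($html =~ /$e->{regexp}/g) {
            push @{ $results{$1} }, $e->{id};       # remember who returned it
        }
        sleep $e->{sleep};                          # be polite, as configured
    }
}
print encode_json(\%results), "\n";

The point is not the details but the shape: a single loop driven entirely by the per-engine descriptions in the JSON.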

Of course, this is just a skeleton app over which you can build something much more useful. Some examples? I have used a variation of this script to build a backlink analysis tool, returning useful statistics about who is linking to a given URL, and another one to get blog posts about a given topic that are later mined for opinions. I hope you got the main lesson of this story: patterns are everywhere, it's up to you to exploit them to make something new and useful ;)

6 Feb 2011

Javascript scraper basics

As you already know, I am quite into scraping. The main reason is that I think that what is available on the Internet (and in particular on the Web) should be consumed not only in the way it is provided, but also in more customized ways. We should be able to automatically gather information from different sources, integrate it, and obtain new information as the result of processing what we found.

Of course, this is highly related to my research topic, that is the Semantic Web. In the SW we suppose we already have data in formats that are easy for a machine to consume automatically, and a good part of the efforts of the SW research community are directed towards enabling standards and technologies that allow us to publish information this way. Of course, even if we have made some progress in this, there are still a couple of problems:

  1. the casual Internet user does not (and does not want to) know about these technologies and how to use them, so the whole "semantic publishing" process should be made totally invisible to her;
  2. most of the data published on the Web in the last 20 years does not follow these standards: it was produced for human consumption and cannot easily be parsed automatically.

Fortunately, some structure exists in the data anyway: information written in tables or infoboxes, or generated by software from structured data (i.e. data saved in a database), typically shows a structure that, albeit informal (that is, not following a formal standard), can be exploited to get the original structured data back. Scrapers rely on this principle to extract relevant information from Web pages and, if provided with a correct "crawling plan" (I'm making up this name, but I think you can easily understand what I mean ;)), they can more or less easily gather all the contents we need, reconstructing the knowledge we are interested in and allowing us to perform new operations on it.

There are, however, cases in which this operation is not trivial. For instance, when the crawling plan is not easy to reproduce (i.e. contents belong to pages which do not follow a well-defined structure). A different problem arises when the page is generated on the fly by some javascript code, which makes parsing the HTML source code useless. This reminds me a lot of dynamically generated code used as a software protection: when we moved from statically compiled code to code that updates itself at runtime (such as packed or encrypted executables), studying a protection on its "dead listing" became more difficult, so we had to change our tools (from simple disassemblers to debuggers or interactive disassemblers) and approaches... but that is, probably, another story ;-) Finally, we might want to provide an easier, more interactive approach to Web scraping, one that allows users to dynamically choose the page they want to analyze while they are browsing it, and that does not require them to switch to another application to do so.

So, how can we run scrapers dynamically, on the page we are viewing, even if part of its contents has been generated on the fly? Well, we can write a few lines of Javascript code and run them from our own browser. Of course there are already apps that allow you to do similar things very easily (anyone said Greasemonkey?), but starting from scratch is a good way to get some practice and learn new things... then you'll always be able to revert to GM later ;-) So, I am going to show you some very easy examples of Javascript code that I have tested on Firefox. I think they should be simple enough to run from other browsers too... well, if you have any problem with them just let me know.

First things first: as usual I'll be playing with regular expressions, so check the Javascript regexp references at regular-expressions.info and javascriptkit.com (here and here) and use the online regex tester tool at regexpal.com. And, of course, to learn the basics of bookmarklets and get an idea of what you can do with them, check Fravia's and Ritz's tutorials. Also, check when those tutorials were written and realize how early they understood that Javascript would change the way we use our browsers...

Second things second: how can you test your javascript code easily? Well, apart from pasting it into your location bar as a "javascript:" oneliner (we'll get to that later when we build our bookmarklets), I found the "jsenv" bookmarklet at squarefree very useful: just drag the jsenv link to your bookmark toolbar, click on the new bookmark (or just click on the link appearing in the page) and see what happens. From this very rudimentary development environment we'll be able to access the page's contents and test our scripts. Remember that the script can only access the contents of the page that is displayed when you run it, so if you want it to work with some other page's contents you'll need to open jsenv again from that page.

Now we should be ready to test the examples. For each of them I will provide both some "human readable" source code and a bookmarklet you can "install" as you did with jsenv.

Example 1: MySwitzerland.com city name extractor

This use case is a very practical one: for my work I needed to get a list of the tourist destinations in Switzerland as described within MySwitzerland.com website. I knew the pages I wanted to parse (one for every region in Switzerland, like this one) but I did not want to waste time collecting them. So, the following quick script helped me:

var results="";
var content = document.documentElement.innerHTML;
var re = new RegExp ("<div style=\"border-bottom:.*?>([^<]+)</div>","gi");
while (array = re.exec(content)){
  results += array[1]+"<br/>";
}
document.documentElement.innerHTML=results;

Just few quick comments about the code:

  • line 2 gets the page content and saves it in the "content" variable
  • line 3 initializes a new regular expression (whose content and modifiers are defined as strings) and saves it in the "re" variable
  • line 4 executes the regexp on the content variable and, as long as it returns matches, saves them in the "array" variable. As there is only one matching group (the ones defined within parentheses inside the regexp), the result of each match is saved in array[1]
  • in line 5 the "results" variable, which was initialized as an empty string in line 1, has the result of the match plus "<br/>" appended to it
  • in the last line, the value of "results" is printed as the new Web page content

The bookmarklet is here: MySwitzerland City Extractor.

Example 2: Facebook's phonebook conversion tool

This example shows how to quickly create a CSV file out of Facebook's Phonebook page. It is a simplified version that only extracts the first phone number for every listed contact, and the regular expression is kind of redundant but is a little more readable and should allow you, if you want, to easily find where it matches within the HTML source.

var output="";
content = document.documentElement.innerHTML;
var re = new RegExp ("<div class=\"fsl fwb fcb\">.*?<a href=\"[^\"]+\">([^<]+)<.*?<div class=\"fsl\">([^<]+)<span class=\"pls fss fcg\">([^<]+)</span>", "gi");
while (array = re.exec(content)){
 output += array[1]+";"+array[2]+";"+array[3]+"<br/>";
}
document.documentElement.innerHTML = output;

As you can see, the structure is the very same as in the previous example. The only differences are in the regular expression (which has three matching groups: one for the name, one for the phone number, and one for the type, e.g. mobile, home, etc.) and, obviously, in the output. You can get the bookmarklet here: Facebook Phonebook to CSV.

Example 3: Baseball stats

This last example shows how you can get new information from a Web page just by running a specific tool on it: data from the Web page, once extracted from the HTML code, becomes the input for an analysis tool (which can be more or less advanced; in our case it is going to be trivial) that returns its output to the user. For the example I decided to calculate an average value from the ones shown in a baseball statistics table at baseball-reference.com. Of course I know nothing about baseball, so I just picked a value even I could understand, that is the team members' age :) Yes, the script runs on a generic team roster page and returns the average age of its members... probably not very useful, but it should give you the idea. Ah, and speaking of this, there are already much more powerful tools that perform similar tasks, like StatCrunch, which automatically detects tables within Web pages and imports them so you can calculate statistics over their values. The advantage of our approach, however, is that it is completely general: any operation can be performed on any piece of data found within a Web page.

The code for this script follows. I have tried to be as verbose as I could, dividing data extraction into three steps (get team name/year, extract the team members' list, get their ages). As the final message is just a string and does not require a page of its own, the calculated average is returned within an alert box.

var content, payrolls, team, count, sum, avg, re;
content = document.documentElement.innerHTML;

// get team name/year
re = new RegExp ("<h1>([^<]+)<", "gi");
if (array = re.exec(content)){
 team = array[1];
}

// extract "team future payrolls" to get team members' list
re = new RegExp ("div_payroll([\\s\\S]+?)appearances_table", "i");
if (array = re.exec(content)){
 payrolls = array[1];
}

sum=0;count=0;
// extract members' ages and sum them
re = new RegExp ("</a></td>\\s+<td align=\"right\">([^<]+)</td>", "gi");
while (array = re.exec(payrolls)){
 sum += parseInt(array[1]);
 count++;
}

alert("The average age of "+team+" is "+sum/count+" out of "+count+" team members");

The bookmarklet for this example is available here: Baseball stats - Age.

Conclusions

In this article I have shown some very simple examples of Javascript scrapers that can be saved as bookmarklets and run on the fly while you are browsing specific Web pages. The examples are very simple and should both give you an idea of the potential of this approach and teach you the basics of javascript regular expressions, so that you can apply them in other contexts. If you want to delve deeper into this topic, however, I suggest you take a look at Greasemonkey and its huge repository of user-provided scripts: that is probably the best way to go if you want to easily develop and share your powerbrowsing tools.

21 Mar 2007

Perl Hacks: a bot for Google Scholar

Lately I was asked to write a bot which allows people to easily query Google Scholar and get citations for a person/paper. I wrote this little perl script which requires two parameters: the name of the author and the paper title. It then queries Scholar and returns the number of citations for that paper. Quick and easy! Unfortunately Scholar data are not always consistent, but they are still helpful in some way... And well, of course Publish or Perish is a better tool if you want to know statistics about your publications ;-)
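
If you want an idea of the approach without downloading the script, here is a rough sketch of it. It is not the original code: the "Cited by" regexp is only a guess at Scholar's markup (which changes over time), and Scholar actively dislikes automated clients, so take it as an illustration rather than working code.

#!/usr/bin/perl
# Rough sketch of the Scholar citation bot, NOT the original script: the
# "Cited by" regexp is a guess at Scholar's markup, which changes over time.
use strict;
use warnings;
use LWP::UserAgent;
use URI;

my ($author, $title) = @ARGV;
die "usage: $0 <author> <paper title>\n" unless defined $title;

my $uri = URI->new('http://scholar.google.com/scholar');
$uri->query_form(q => qq{author:"$author" "$title"});

my $ua  = LWP::UserAgent->new(agent => 'Mozilla/5.0 (citation bot sketch)');
my $res = $ua->get($uri);
die "request failed: ", $res->status_line, "\n" unless $res->is_success;

if ($res->decoded_content =~ /Cited by (\d+)/) {
    print qq{"$title": cited $1 times\n};
} else {
    print qq{"$title": no citation count found\n};
}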

19 Feb 2007

Perl Hacks: del.icio.us scraper

At last, I built it: a working, quite stable del.icio.us scraper. I needed a dataset big enough to run some experiments on (for a research project I'll tell you about sooner or later), so I had to create something which would allow me to download not only my own stuff (as the del.icio.us API does), but also data from other users connected with me.

Even if it's a first release, I have tested the script quite a lot over the last few days and it's stable enough to let you back up your data and get some more if you're doing research on this topic (BTW, if so, let me know, we might exchange some ideas ;-) Here are some of its advantages:

  • it just needs a list of users to start and then downloads all their bookmarks
  • it saves data inside a DB, so you can query it, export it in any format, do some data mining, and so on
  • it runs politely, with a 5-second sleep between page downloads, to avoid bombing the del.icio.us website with requests
  • it supports the use of a proxy
  • it's very tweakable: most of its parameters can be easily changed
  • it's almost ready for a distributed version (that is, it supports table locking so you can run many clients which connect to a centralized database)

Of course, it's far from being perfect:

  • code is still quite messy: probably a more modular version would be easier to update (perl coders willing to give a hand are welcome, of course!)
  • I haven't tried the "distributed version" yet, so it just works in theory ;-)
  • it's sloooow, especially compared to the huge size of del.icio.us: at the beginning of this month they said they had about 1.5 million users, and I don't believe a single client will be able to get much more than a few thousand users per day (but do you need more?)
  • the way it is designed, the database grows quite quickly and interesting queries won't be very fast if you download many users (DB specialists willing to give a hand are welcome, of course!)
  • the program lacks a function to harvest users, so you have to provide the list of users you want to download manually. Actually, I made mine with another scraper, but I did not want to provide, well, both the gun and the bullets to everyone. I'm sure someone will have something to say about this, but hey, it takes you less time to write your own ad-hoc scraper than to add an angry comment here, so don't ask me to give you mine
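
To give you an idea of what the polite download loop looks like, here is a bare-bones sketch. The real script does much more (proxy support, table locking, paging through each user's bookmarks, configurable parameters), the del.icio.us URL format and the bookmark regexp below are just placeholders, and SQLite stands in for whatever DB the real script actually uses.

#!/usr/bin/perl
# Bare-bones sketch of the polite download loop, NOT the released scraper.
# The del.icio.us URL format and the bookmark regexp are placeholders, and
# SQLite (via DBD::SQLite) stands in for the real script's database.
use strict;
use warnings;
use LWP::UserAgent;
use DBI;

my $ua  = LWP::UserAgent->new(agent => 'delicious-scraper sketch');
my $dbh = DBI->connect('dbi:SQLite:dbname=delicious.db', '', '',
                       { RaiseError => 1 });
$dbh->do('CREATE TABLE IF NOT EXISTS bookmarks (user TEXT, url TEXT, title TEXT)');
my $ins = $dbh->prepare('INSERT INTO bookmarks (user, url, title) VALUES (?, ?, ?)');

my @users = @ARGV;                  # the list of users to start from
for my $user (@users) {
    my $res = $ua->get("http://del.icio.us/$user");
    next unless $res->is_success;
    my $html = $res->decoded_content;
    # placeholder pattern: one capture for the URL, one for the title
    while ($html =~ /<a class="taggedlink" href="([^"]+)"[^>]*>([^<]+)</g) {
        $ins->execute($user, $1, $2);
    }
    sleep 5;                        # politeness: one page every 5 seconds
}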

That's all. You'll find the source code here, have phun ;)

15 Jan 2007

Perl Hacks: more sudokus

Hi all,

incredibly, I got messages asking for more sudokus: aren't you satisfied with the ones harvested from Daily Sudoku? Well... once a script like SukaSudoku is ready, the only work left to do is to create a new wrapper for another website! Let's take, for instance, the Number Logic site: here's a new perl script which inherits most of its code from SukaSudoku and extracts information from Number Logic. Just copy it over the old "suka.pl" file and double the number of sudokus to play with.
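
The idea of a "wrapper", roughly, is that everything site-specific is confined to one routine that fetches a page and returns the puzzle in a common format, while the rest of the code never changes. Here is a purely hypothetical sketch of what such a routine might look like: the URL, the regexp, and the 81-character grid format are all made up for illustration, so check the actual script for the real interface.

#!/usr/bin/perl
# Hypothetical sketch of a site-specific "wrapper": only this sub should
# change when moving from Daily Sudoku to Number Logic. URL, regexp and the
# 81-character grid format are all invented for illustration.
use strict;
use warnings;
use LWP::UserAgent;

sub get_puzzle {
    my ($url) = @_;
    my $ua  = LWP::UserAgent->new(agent => 'suka-sketch/0.1');
    my $res = $ua->get($url);
    return unless $res->is_success;
    # placeholder: collect the 81 cell values (a digit or &nbsp;) from the page
    my @cells = $res->decoded_content =~ /<td class="cell">\s*([1-9]|&nbsp;)\s*</g;
    return unless @cells == 81;
    return join '', map { $_ eq '&nbsp;' ? '.' : $_ } @cells;
}

my $grid = get_puzzle('http://www.numberlogic.example/daily');
print defined $grid ? "$grid\n" : "no puzzle found\n";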

8 Jan 2007

Perl Hacks: quotiki bot

I like quotiki. I like fortunes. Why not make a bot which downloads a random quote from quotiki and shows it on my computer? Here's a little perl bot which does exactly this. It's very short and easy, so I guess this could be a good starting point for beginners.
And, by the way:

"Age is an issue of mind over matter. If you don't mind, it doesn't matter."
-- Mark Twain

;)
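
For the beginners mentioned above, the skeleton of such a bot really is tiny: fetch a page, match the quote, print it. In this sketch the quotiki URL and the quote/author regexp are placeholders rather than the real ones, but the structure is the whole trick.

#!/usr/bin/perl
# Skeleton of the quote bot, NOT the script linked above: the quotiki URL
# and the quote/author regexp are placeholders for the real markup.
use strict;
use warnings;
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new(agent => 'quotiki-bot sketch');
my $res = $ua->get('http://www.quotiki.com/random');    # hypothetical address
die "request failed: ", $res->status_line, "\n" unless $res->is_success;

# placeholder pattern: one capture for the quote, one for the author
if ($res->decoded_content =~ /<p class="quote">([^<]+)<.*?<a[^>]*>([^<]+)</s) {
    print qq{"$1"\n-- $2\n};
} else {
    print "no quote found (the regexp probably needs updating)\n";
}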

18 Dec 2006

Perl Hacks: Google search

I'm on page 14 of Google's results for the search string "perl hacks". Well, at least on google.it, which is where I'm automatically redirected...

Who cares? Well, not even me, really, I was just curious about it. How did I find out? Well, automatically of course, with a perl bot!

Here it is. You call it passing two parameters from the command line: the first one is the search string (e.g. "\"perl hacks\""), the second one is (part of) the title of the page (e.g. "mala::home"). The program then connects to Google, searches for the search string, and finds on which results page the second string occurs.
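
A minimal sketch of the same idea follows. It is not the original script: the result-title regexp is a guess at Google's old markup, and the 20-page limit and 3-second sleep are arbitrary choices of mine.

#!/usr/bin/perl
# Minimal sketch of the rank finder, NOT the original script: the result-title
# regexp is a guess at Google's old markup; page limit and sleep are arbitrary.
use strict;
use warnings;
use LWP::UserAgent;
use URI;

my ($search, $title) = @ARGV;
die "usage: $0 <search string> <page title>\n" unless defined $title;

my $ua = LWP::UserAgent->new(agent => 'Mozilla/5.0 (rank finder sketch)');

for my $page (1 .. 20) {                        # give up after 20 result pages
    my $uri = URI->new('http://www.google.com/search');
    $uri->query_form(q => $search, start => ($page - 1) * 10);
    my $res = $ua->get($uri);
    die "request failed: ", $res->status_line, "\n" unless $res->is_success;
    my $html = $res->decoded_content;
    # placeholder pattern for one result title
    while ($html =~ /<h3[^>]*><a[^>]*>([^<]+)</g) {
        if (index($1, $title) >= 0) {
            print qq{Found "$title" on result page $page\n};
            exit;
        }
    }
    sleep 3;                                    # don't hammer Google
}
print qq{"$title" not found in the first 20 result pages\n};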

Probably someone noticed the filename ends with "02". Is there a "01" out there too? Of course there is: it's a very similar version which searches not only inside titles but also in the whole snippet of text related to each search result. Less precise, but a little more flexible.

11 Dec 2006

Perl Hacks: Googleicious!

At last, here's my little googleicious, my first attempt at merging Google results with del.icio.us tags. I did it some months ago to see if I could automatically retrieve information from del.icio.us. Well... it's possible :)

What does this script do? It takes a word (or a collection of words) from the command line and searches for it on Google; then it takes the search results, extracts the URLs from them, and looks them up on del.icio.us, showing the tags that have been used to classify them. Even if the script is quite simple, there are many details you can infer from its output:

  • you can see how many top results in Google are actually interesting for people
  • you can use tags to give meaning to some obscure words (try for instance ESWC and you'll see it's a conference about semantic web, or screener and you'll learn it's not a term used only by movie rippers)
  • starting from the most used tags returned by del.icio.us, you can search for similar URLs which haven't been returned by Google

Now some notes:

  • I believe this is very near to a limit both Google and del.icio.us don't want you to cross, so please behave politely and avoid bombing them with requests
  • For the very same reason, this version is quite limited (it just takes the first page of results) and lacks some of the controls I put in later versions, such as a sleep time to avoid request bursts. So, again, be polite please ^__^
  • I know that probably one zillion people have already used the term googleicious and I don't want to be considered its inventor. I just invented the script, so if you don't like the name just call it foobar or whatever you like: it will run anyway.

Ah, yes, the code is here!
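
For reference, here is a minimal sketch of the pipeline, not the script linked above: one Google query, URL extraction, one del.icio.us lookup per URL. The Google result regexp, the del.icio.us lookup URL, and the tag regexp are all guesses at the two sites' markup, so treat them purely as placeholders.

#!/usr/bin/perl
# Minimal sketch of the googleicious pipeline, NOT the script linked above:
# the Google result regexp, the del.icio.us lookup URL and the tag regexp are
# all placeholders for the real markup.
use strict;
use warnings;
use LWP::UserAgent;
use URI;
use URI::Escape qw(uri_escape);

my $query = join ' ', @ARGV or die "usage: $0 <search terms>\n";
my $ua = LWP::UserAgent->new(agent => 'Mozilla/5.0 (googleicious sketch)');

# step 1: one page of Google results
my $g = URI->new('http://www.google.com/search');
$g->query_form(q => $query);
my $res = $ua->get($g);
die "google request failed\n" unless $res->is_success;

# step 2: extract the result URLs (placeholder pattern)
my @urls = $res->decoded_content =~ /<h3[^>]*><a href="(http[^"]+)"/g;

# step 3: look each URL up on del.icio.us and list its tags
# (placeholder lookup URL and tag pattern)
for my $url (@urls) {
    my $d = $ua->get('http://del.icio.us/url?url=' . uri_escape($url));
    next unless $d->is_success;
    my @tags = $d->decoded_content =~ /class="tag"[^>]*>([^<]+)</g;
    print "$url\n  tags: ", (join(', ', @tags) || '(none found)'), "\n";
    sleep 3;    # later versions add a sleep like this to avoid request bursts
}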

4 Dec 2006

Perl Hacks: SukaSudoku!

Hi everybody :)

Since I decided to post one "perl hack" each week, and since after two posts I lost my inspiration, I've decided to recycle an old project which hasn't been officially published yet. Its name is sukasudoku and, as the name implies, it sucks sudokus from a website (http://www.dailysudoku.com) and saves them on your disk.

BUT...

Hah! It doesn't just "save them on your disk", but creates a nice book in PDF with all the sudokus it has downloaded. Then you can print the book, bind it and either play your beloved sudokus wherever you want or give it to your beloved ones as a beautiful, zero-cost Xmas gift.
Of course, the version I publish here is limited, just to avoid having every script kiddie download the full archive without effort. But you have the source, and you have the knowledge. So what's the problem?

The source file is here for online viewing, but to have a working version you should download the full package. In the package you will find:

  • book and clean, two shell scripts to create the book from the LaTeX source (note: of course, you need LaTeX installed on your system to make it work!) and to remove all the junk files from the current dir
  • sudoku.sty, the LaTeX package used to create sudoku grids
  • suka.pl, the main script
  • tpl_sudoku.tex, the template that will be used to create the latex document containing sudokus
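
To give you an idea of how the pieces fit together, here is a hypothetical sketch of the template-filling step. The template name (tpl_sudoku.tex) is real, but the %%GRID%% placeholder and the row format I generate are invented, so check the real suka.pl to see how it actually builds the LaTeX document.

#!/usr/bin/perl
# Hypothetical sketch of the template-filling step: the %%GRID%% placeholder
# and the row format are invented; see the real suka.pl for the actual logic.
use strict;
use warnings;

sub write_sudoku_tex {
    my ($grid, $outfile) = @_;          # $grid: 81 chars, '.' for empty cells
    open my $tpl, '<', 'tpl_sudoku.tex' or die "cannot read template: $!";
    my $tex = do { local $/; <$tpl> };
    close $tpl;

    my @rows = $grid =~ /.{9}/g;        # split into 9 rows of 9 cells
    my @lines;
    for my $row (@rows) {
        my @cells = map { $_ eq '.' ? ' ' : $_ } split //, $row;
        push @lines, '|' . join('|', @cells) . '|.';
    }
    my $body = join "\n", @lines;
    $tex =~ s/%%GRID%%/$body/;

    open my $out, '>', $outfile or die "cannot write $outfile: $!";
    print {$out} $tex;
    close $out;
}

my $example = join '',
    '53..7....', '6..195...', '.98....6.',
    '8...6...3', '4..8.3..1', '7...2...6',
    '.6....28.', '...419..5', '....8..79';
write_sudoku_tex($example, 'sudoku001.tex');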