
2 May 2011

The importance of detecting patterns

A student whose M.S. thesis project I have followed over the last months recently handed in his final thesis report. We spent some time talking about his work, which parts of it he liked most, and what he had learned from the experience. The project turned out to be quite appreciated (this is one of the papers we published with him) and I was glad to hear that he particularly liked the part involving scraping. He also gained quite a lot of experience with the topic, enough to find its weakest parts (e.g. relying only on regexps to extract page contents is bad, as some kinds of pages change pretty often and you might need to rewrite the expressions each time) and its most interesting generalizations. During our talk, he told me that one of the most impressive things for him was that websites belong to families, and if you can spot the common patterns within a family you can write general scrapers that work well with many sites.

Actually, this is not such an amazing piece of news, but it is SO true that I felt it was worth spending some time on. I experienced this concept first-hand while writing one of my first crawlers: it was called TWO (The Working Offline forum reader, a name which tells a lot about my previous successes in writing scrapers ;)) and its purpose was to download the full contents of a forum into a DB, allowing users to read them offline. The tool relied on a set of plugins which allowed it to work with different forum technologies (I supported phpBB, vBulletin, and a much less standard software used by some very specific forums about reversing and searching). Of course, what I noticed while writing it is that all forums could roughly be described in the same way: every forum has some threads, every thread has different messages, and every message has specific properties (author, date, subject, message body) that are repeated throughout the system. Having different systems comply with the same standards provides some advantages:

  1. crawlers are usually built around two main components: a "browsing component" that deals with finding the right starting pages and following the right links within them, and a "scraping component" that deals with extracting relevant page contents. We can exploit browsing patterns to use the same browsing component for crawling different websites/systems. For instance, we know that from every main forum page we will follow some links (with a format that can be specified with a regexp) to access different sub-forums containing lists of threads, and then we will access one or more pages with the typical "list" pattern (go to first, last, previous, or next page);
  2. similarly, we can exploit scraping patterns to use the same data structures for objects retrieved from different sources. For instance, we can create a "forumMessage" object whose attributes are the message content, the subject, the date, and the author, and these will hold for all the messages we get from any website (see the sketch after this list);
  3. being able to bring together, in the same objects, homogeneous types of information from heterogeneous sources allows us to analyze it automatically and extract some new knowledge. For instance, if we know we have a "rating" field in two websites which is comparable (i.e. it uses a compatible range of values, or can be normalized), then we can calculate an average of this rating over different services (a value that most websites, competing with one another, will never give you).
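
Here is a minimal sketch of this idea in Perl. It is not the original TWO code, and the regexps below are just placeholders: the point is that each supported forum engine gets its own browsing/scraping patterns, while every message, whatever its source, ends up in the same "forumMessage" structure, so the rest of the crawler never needs to know which engine it came from.

#!/usr/bin/perl
use strict;
use warnings;

# Per-engine patterns (placeholders, not real TWO regexps): the browsing
# component uses thread_link to follow links, the scraping component uses
# message to extract contents.
my %engines = (
    phpbb => {
        thread_link => qr/viewtopic\.php\?t=\d+/,
        message     => qr/<div class="postbody">(.*?)<\/div>/s,
    },
    vbulletin => {
        thread_link => qr/showthread\.php\?t=\d+/,
        message     => qr/<div id="post_message_\d+">(.*?)<\/div>/s,
    },
);

# Every scraped message is stored in the same structure, regardless of
# which engine (and which set of regexps) produced it.
sub new_forum_message {
    my ($author, $date, $subject, $body) = @_;
    return { author => $author, date => $date,
             subject => $subject, body => $body };
}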

To give you a practical example, I applied this concept to search engines to write a simple but very flexible search engine aggregator. I needed to get results from different search engines, merge them, and calculate, for instance, how many different domains the results could be grouped into, which engine returned the most results, and so on. I made some example searches with Google, Yahoo, Bing, and others, and described the general access pattern as follows:

  1. first of all, a user agent connects to the search engine at a specified address (in most cases it is the engine's default domain, but sometimes you might want to force a different URL, e.g. to get results in a given language instead of the one chosen by default according to your location);
  2. then, the main search form is filled with the search string and some parameters you want your results to comply with (e.g. number of results per page, language, date restrictions, and so on);
  3. when the results are returned, they are usually shown in a well-defined way (i.e. easily parsable with a given regular expression) and, if they do not fit in one page, it is possible to follow links to the next pages. If you want all the results, you just need to keep loading the next results page until no "next" link appears. A sketch of this loop is shown right after this list.
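
Here is a minimal sketch of these three steps in Perl, using WWW::Mechanize. It is not the actual se.pl code (which, for instance, also handles the optional "nextURL" field you will see in the configuration below), but the overall loop is the same:

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

sub scrape_engine {
    my ($engine, $query) = @_;   # $engine: one entry of the JSON config below
    my $mech = WWW::Mechanize->new(autocheck => 0);
    my @urls;

    # 1. connect to the search engine at the specified address
    $mech->get($engine->{url});

    # 2. fill the main search form, replacing the $search placeholder
    my %fields = %{ $engine->{fields} };
    for my $k (keys %fields) {
        $fields{$k} = $query if $fields{$k} eq '$search';
    }
    $mech->submit_form(fields => \%fields);

    # 3. scrape result URLs, then follow the "next" link until none is left
    while (1) {
        push @urls, $mech->content =~ /$engine->{regexp}/g;
        my $next = $mech->find_link(text_regex => qr/$engine->{next}/);
        last unless $next;
        sleep $engine->{sleep};        # be gentle, avoid getting banned
        $mech->get($next->url_abs);
    }
    return @urls;
}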

The result is a (well, yet another) scraper. Its input is a JSON file like the following one:

{
  "engines": {
    "google": {
      "fields": {
        "num": 10,
        "q": "$search"
      },
      "id": 1,
      "next": "Next",
      "nextURL": "google.com/search",
      "regexp": "<a href=\"([^\"]+)\" class=l>",
      "sleep": 2,
      "url": "http://www.google.com/ncr"
    },
    "bing": {
      "fields": {
        "q": "$search"
      },
      "id": 2,
      "next": "^Next$",
      "regexp": "<div class=\"sb_tlst\"><h3><a href=\"([^\"]+)\"",
      "sleep": 5,
      "url": "http://www.bing.com/?scope=web&mkt=it-IT&FORM=W0LH"
    },
    "yahoo": {
      "fields": {
        "n": 10,
        "p": "$search"
      },
      "id": 3,
      "next": "^Next",
      "regexp": "<a class=\"yschttl spt\" href=\"([^\"]+)\"",
      "sleep": 5,
      "url": "http://www.yahoo.com"
    }
  },
  "search": {
    "1": "\"Davide Eynard\"",
    "2": "aittalam",
    "3": "3564020356"
  }
}

As you can see, all the search engines are described in the same way (miracles of patterns :)). Each one of them has a name and an id (used later to match it with its own results), a URL and a set of input fields, a sleep time (useful to avoid being banned for hammering a site too much), plus regular expressions to parse both the single URLs returned as results and the "Next" links inside result pages. To test the scraper, download it here and pass it the text above as input (either by pasting it or by saving it in a file and redirecting its contents, e.g. perl se.pl <testsearch.json). The scraper is going to take some time, as it queries all three search engines with all three search strings (I left $DEBUG=1 to let you see what is happening while it runs). At the end you will get another JSON document containing the results of your combined search: a list of all the matching URLs grouped by domain, together with the search engine(s) that returned them and their ranking.
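
The grouping step takes only a few lines once all results share the same shape. The sketch below assumes the scraped results have been collected into an array of hashes with url, engine, and rank keys (which is not necessarily how se.pl stores them internally):

# Group result URLs by domain, remembering which engine returned each one
# and at which position. @results is a hypothetical array of hashes like
# { url => ..., engine => 'google', rank => 3 }.
my %by_domain;
for my $r (@results) {
    # search engines return absolute URLs, so a simple regexp is enough
    # to extract the domain
    my ($domain) = $r->{url} =~ m{^https?://([^/]+)};
    next unless $domain;
    push @{ $by_domain{$domain} },
         { engine => $r->{engine}, rank => $r->{rank}, url => $r->{url} };
}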

Of course, this is just a skeleton app over which you can build something much more useful. Some examples? I have used a variation of this script to build a backlink analysis tool, returning useful statistics about who is linking to a given URL, and another one to get blog posts about a given topic which are later mined for opinions. I hope you got the main lesson of this story: patterns are everywhere, and it's up to you to exploit them to make something new and useful ;)
