Gopher is dead, long live gopher

Some months ago, while preparing a lesson for my Internet Technology class, I was doing some research on old protocols, just to give my students a feeling for how the Internet used to work some years ago, and how it is much, much more than just the Web. During this research I found some interesting information about good old Gopher and how to run it from Android devices… Hey, wait, did I say Android devices? Yep! The Overbite project aims at bringing Gopher to the most recent devices and browsers. As Firefox & co. have stopped supporting Gopher in their latest versions (shame on you), the guys from the Overbite project have also decided to develop a browser extension to let you continue browsing it (well, if you ever did it before ;)).

What struck me was not the piece of news per se, but the fact that there was a community (I thought a very small one) that was still interested in letting people access the gopherspace… To find what? So I spent some time (probably not enough, but still more than I planned) browsing it myself and checking what is available right now…

What I found is that the gopherspace, or at least the small part of it I was able to read in that limited time, is surprisingly up-to-date: there are news feeds, weather forecasts, reddit, xkcd, and even twitter (here called twitpher :)). Of course, there are also those files I used to find 15 years ago when browsing the early web from a text terminal with lynx: guides for n00bs about the Internet, hacking tutorials, ebooks I did not know about (the one I like most is Albert Einstein’s theory of relativity in words of four letters or less, that you can find here in HTML). Moreover, there are still people willing to use Gopher as a means to share their stuff, or to provide useful services for other users: FTP servers (built on a cluster of Playstation 3 consoles… awesome!) with collections of rare operating systems, LUG servers with mirrors of all the main Linux distros, pages distributing podcasts and blogs (well, phlogs!). Ah, and for all those who don’t want to install anything to access Gopher there’s always the GopherProxy service, which can be accessed using any browser.

After seeing all of this, one word came to my mind: WHY? Don’t misunderstand me, I think all of this is really cool and an interesting phenomenon to follow, and I really love to see all of these people still having incentives to use this old technology. And it is great to see that the incentives, in this case, are not the usual ones you might find in a participative system. I mean, what’s one of the main incentives for using Wikipedia? Well, the fact that lots of people will read what you have written (if you don’t agree, think about how many people would create new articles in a wiki which is not already as famous as Wikipedia). And how many readers is a page from the gopherspace going to have? Well, probably not as many as any popular site you can find around the Web. But Gopher, mainly relying on text files, has a very light protocol which is superfast (and cheap!) on a mobile phone. It has no ads. It adds no fuss to the real, interesting content you want to convey to your readers. And quoting the words of lostnbronx from Information Underground:

“… I tell you, there’s something very liberating about not having to worry over “themes” or “Web formatting” or whatever. When you use gopher, you drop your files onto the server, maybe add a notation to a gophermap if you’re using one (which is purely optional), and…that’s it. No muss, no fuss, no dicking around with CMS, CSS, stylesheets, or even HTML. Unless you want to. Which I don’t. It defeats the purpose, see?”

Aaahh… so much time has passed since I last heard such wise words… It is like coming back to my good old 356* and listening to its +players! Let me tell you this, I like these ideas and I am so happy to see that this new old Gopher still looks so far from being trendy… Because this means that a lot of time will need to pass before commercial idiots start polluting it! And in the meantime, it will be nice to have a place where information can be exchanged in a simple and inexpensive way. Maybe we in the richest part of the world do not realize it, but there are still many places where older but effective technologies are widely used (some examples? Check this one about Nokia’s most popular phone, and read why we still have USENET), and if something like Gopher could be a solution in this case, well… long live Gopher :-)

2 comments June 6th, 2011

Another year passed

Dear +friend,

fortunately, from time to time, your words come back to me:

“There’s no state of sadness that cannot be cured with some good prosciutto crudo”

… still, I miss you a lot.

+mala

1 comment May 3rd, 2011

The importance of detecting patterns

A student whose M.S. thesis project I have followed over the last months has recently handed in his final thesis report. We spent some time talking about his work, which parts of it he liked most, and what he had learned from this experience. The project turned out to be quite appreciated (this is one of the papers that we published with him) and I was glad to hear that he really liked the part involving scraping. He has also gained quite a lot of experience on the topic, enough to find its weakest parts (e.g. relying only on regexps to extract page contents is bad, as some kinds of pages change pretty often and you might need to rewrite the regexps each time) and its most interesting generalizations. During our talk, he told me that one of the most impressive things for him was that websites belong to families, and if you can spot the common patterns within a family you can write general scrapers that work well with many sites.

Actually, this is not such an amazing piece of news, but it is SO true that I felt it was worth spending some time on it. I experienced this concept first-hand while I was writing one of my first crawlers: it was called TWO (The Working Offline forum reader, a name which tells much about my previous successes in writing scrapers ;)) and its purpose was to fully download the contents of a forum into a DB, allowing users to read them when offline. This tool relied on a set of plugins which allowed it to work with different forum technologies (I was supporting phpBB, vBulletin, and a much less standard software used for some very specific forums about reversing and searching). Of course, what I noticed while writing it is that all forums could be roughly described in the same way: every forum has some threads, every thread has different messages, and every message has specific properties (Author, Date, Subject, Message body) that are repeated throughout the system. Having different systems complying with the same standards provides some advantages:

  1. crawlers are usually built over two main components: a “browsing component” that deals with finding the right starting pages and following the right links in them, and a “scraping component” that deals with extracting relevant page contents. We can exploit browsing patterns to use the same browsing component for crawling different websites/systems. For instance, we know that in every main forum page we will follow some links (with a format that can be specified with a regexp) to access different sub-forums containing lists of threads, and then we will access one or more pages with the typical “list” pattern (go to first, last, previous, or next page);
  2. similarly, we can exploit scraping patterns to use the same data structures for objects retrieved from different sources. For instance, we can create a “forumMessage” object whose attributes are the message content, the subject, the date, and the author, and these will hold for all the messages we get from any website;
  3. being able to bring together, in the same objects, homogeneous types of information from heterogeneous sources allows us to automatically analyze it, bringing some new knowledge out. For instance, if we know we have a “rating” field in two websites which is comparable (i.e. uses a compatible range of values, or can be normalized), then we can calculate an average of this rating over different services (a value that most websites, competing with each other, will never give you).
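
Points 2 and 3 above can be sketched in a few lines of Python. This is only an illustration, not code from TWO: the class name ForumMessage follows the “forumMessage” object mentioned above, while normalize and average_rating are hypothetical helpers, and the assumption is that each source declares the range its ratings use.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ForumMessage:
    """One message, whatever forum software it was scraped from."""
    author: str
    date: datetime
    subject: str
    body: str


def normalize(value: float, low: float, high: float) -> float:
    """Map a rating from its native [low, high] range onto [0, 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return (value - low) / (high - low)


def average_rating(ratings: list[tuple[float, float, float]]) -> float:
    """Average ratings given as (value, low, high), one tuple per source."""
    if not ratings:
        raise ValueError("no ratings to average")
    return sum(normalize(v, lo, hi) for v, lo, hi in ratings) / len(ratings)
```

For instance, average_rating([(4, 1, 5), (8, 0, 10)]) puts a 1-to-5 score and a 0-to-10 score on the same scale before averaging them.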

To give you a practical example, I have applied this concept to search engines to write a simple but very flexible search engine aggregator. I needed to get results from different search engines, merge them, and calculate, for instance, how many different domains the results could be grouped into, which engine returned more results, and so on. I made some example searches with Google, Yahoo, Bing, and others, and described the general access patterns as follows:

  1. first of all, a user agent connects to the search engine at a specified address (in most cases it is the default search engine domain, but sometimes you might want to force a different URL, e.g. to get results in a given language and not the default one according to your location);
  2. then, the main search form is filled with the search string and some parameters you want your results to comply with (e.g. number of results per page, language, date restrictions, and so on);
  3. when the results are returned, they are usually shown in a well defined way (i.e. easily parsable with a given regular expression) and if they do not fit in one page it is possible to follow links to the following ones. If you want all the results, you just need to load the next results page until no more “next” link appears.
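
Step 3 above boils down to a small loop. The following Python sketch is not the scraper described in this post: collect_results, its parameters, and the max_pages guard are all hypothetical, and it assumes both regular expressions capture the wanted URL in their first group. The page download is passed in as a function, so the loop can be tried without touching the network.

```python
import re
from typing import Callable, Optional


def collect_results(
    start_url: str,
    fetch: Callable[[str], str],
    result_re: str,
    next_re: str,
    max_pages: int = 50,
) -> list[str]:
    """Follow "next" links from start_url and gather every result URL.

    fetch is any function returning the HTML of a URL; result_re and
    next_re must each capture the wanted URL in their first group.
    """
    results: list[str] = []
    url: Optional[str] = start_url
    pages = 0
    while url is not None and pages < max_pages:
        html = fetch(url)
        # One capturing group, so findall returns the captured URLs.
        results.extend(re.findall(result_re, html))
        match = re.search(next_re, html)
        url = match.group(1) if match else None  # stop when no "next" link
        pages += 1
    return results
```

The max_pages limit is just a safety net against result pages that keep linking to each other.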

The result is a (well, yet another) scraper. Its input is a JSON like the following one:

{
  "engines": {
    "google": {
      "fields": {
        "num": 10,
        "q": "$search"
      },
      "id": 1,
      "next": "Next",
      "nextURL": "google.com/search",
      "regexp": "<a href=\"([^\"]+)\" class=l>",
      "sleep": 2,
      "url": "http://www.google.com/ncr"
    },
    "bing": {
      "fields": {
        "q": "$search"
      },
      "id": 2,
      "next": "^Next$",
      "regexp": "<div class=\"sb_tlst\"><h3><a href=\"([^\"]+)\"",
      "sleep": 5,
      "url": "http://www.bing.com/?scope=web&mkt=it-IT&FORM=W0LH"
    },
    "yahoo": {
      "fields": {
        "n": 10,
        "p": "$search"
      },
      "id": 3,
      "next": "^Next",
      "regexp": "<a class=\"yschttl spt\" href=\"([^\"]+)\"",
      "sleep": 5,
      "url": "http://www.yahoo.com"
    }
  },
  "search": {
    "1": "\"Davide Eynard\"",
    "2": "aittalam",
    "3": "3564020356"
  }
}

As you can see, all the search engines are described in the same way (miracles of patterns :)). Each one of them has a name and an id (used later to match it with its own results), a URL and a set of input fields, a sleep time (useful if anyone tries to ban you because you are hammering it too much), plus regular expressions to parse both the single URLs returned as results and the “Next” links inside result pages. To test the scraper, download it here. Pass it the text above as input (either by pasting it or by saving it in a file and redirecting its contents, e.g. perl se.pl <testsearch.json). The scraper is going to take some time, as it is querying all three search engines with all three search strings (I left $DEBUG=1 to allow you to see what is happening while it runs). At the end you’ll get another JSON file, containing the results of your combined search: a list of all the matching URLs grouped by domain, together with the search engine(s) that returned them and their ranking.
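
The grouping step described above is easy to picture with a few lines of Python. This is a minimal sketch and not the code of se.pl: the function name group_by_domain and the (engine, rank, url) input format are assumptions made for the example, and the exact layout of the script’s real output may differ.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_domain(hits):
    """Group (engine, rank, url) hits by the domain of each URL.

    Returns {domain: {url: [(engine, rank), ...]}}.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for engine, rank, url in hits:
        domain = urlsplit(url).netloc.lower()
        grouped[domain][url].append((engine, rank))
    # Convert to plain dicts so the result serializes cleanly to JSON.
    return {domain: dict(urls) for domain, urls in grouped.items()}
```

Counting the keys of the returned dictionary tells you over how many domains the results are spread, and the length of each inner list tells you how many engines returned a given URL.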

Of course, this is just a skeleton app over which you can build something much more useful. Some examples? I have used a variation of this script to build a backlink analysis tool, returning useful statistics about who is linking to a given URL, and another one to get blog posts about a given topic that are later mined for opinions. I hope you got the main lesson of this story: patterns are everywhere, it’s up to you to exploit them to make something new and useful ;)

Add comment May 2nd, 2011

Slides for “An integrated approach to discover tag semantics”

The slides of my presentation at SAC 2011 are available on SlideShare:



To get an idea of what the presentation is about, here’s an excerpt of the paper’s abstract and the link to the paper itself.

Add comment April 1st, 2011

New paper: An integrated approach to discover tag semantics

Antonina Dattolo, Davide Eynard, and Luca Mazzola. An Integrated Approach to Discover Tag Semantics. 26th Annual ACM Symposium on Applied Computing, vol. 1, pp. 814-820. Taichung, Taiwan, March 2011. From the abstract:

“Tag-based systems have become very common for online classification thanks to their intrinsic advantages such as self-organization and rapid evolution. However, they are still affected by some issues that limit their utility, mainly due to the inherent ambiguity in the semantics of tags. Synonyms, homonyms, and polysemous words, while not harmful for the casual user, strongly affect the quality of search results and the performances of tag-based recommendation systems. In this paper we rely on the concept of tag relatedness in order to study small groups of similar tags and detect relationships between them. This approach is grounded on a model that builds upon an edge-colored multigraph of users, tags, and resources. To put our thoughts in practice, we present a modular and extensible framework of analysis for discovering synonyms, homonyms and hierarchical relationships amongst sets of tags. Some initial results of its application to the delicious database are presented, showing that such an approach could be useful to solve some of the well known problems of folksonomies”.

Paper is available here. Enjoy! ;)

Add comment March 25th, 2011

Harvesting Online Content: An Analysis of Hotel Review Websites

A new paper is out:

Marchiori, E., Eynard, D., Inversini, A., Cantoni, L., Cerretti, F. (2011) Harvesting Online Content: An Analysis of Hotel Review Websites. In R. Law, M. Fuchs & Francesco Ricci (Eds.), Information and Communication Technologies in Tourism 2011 – Proceedings of the International Conference in Innsbruck, Austria (pp. 101-112). Wien: Springer.

Find the paper here ;)

Add comment March 11th, 2011

WOEID to Wikipedia reconciliation

For a project we are developing at PoliMI/USI, we are using Yahoo! APIs to get data (photos and the tags associated with these photos) about a city. We thought it would be nice to also provide, together with this information, a link to or an excerpt from the Wikipedia page that matches the specific city. However, we found that the matching between Yahoo’s WOEIDs and Wikipedia articles is far from trivial…

First of all, just two words on WOEIDs: they are unique, 32-bit identifiers used within Yahoo! GeoPlanet to refer to all geo-permanent named places on Earth. WOEIDs can be used to refer to differently sized places, from towns to countries or even continents (e.g. Europe is 24865675). A more in-depth explanation of this can be found in the Key Concepts page within the GeoPlanet documentation, and an interesting introductory blog post with examples to play with is available here. Note, however, that you now need a valid Yahoo! application id to test these APIs (which means you should be registered in the Yahoo! developer network and then get a new appid by creating a new project).

One cool aspect of WOEIDs (as for other geographical ids such as the GeoNames ones) is that you can use them to disambiguate the name of a city you are referring to: for instance, you have Milan and you want to make sure you are referring to Milano, Italy and not to the city of Milan, Michigan. The two cities have two different WOEIDs, so when you are using one of them you know exactly which one of the two you are talking about. A similar thing happens when you search for Milan (or any other ambiguous city name) on Wikipedia: most of the time you will be automatically redirected to the most popular article, but you can always search for its disambiguation page (here is the example for Milan) and choose between the different articles that are listed inside it.

Of course, the whole idea of having standard, global, unique identifiers for things in the real world is a great one per se, and being able to use it for disambiguation is only one aspect of it. While disambiguation can be (often, but not always!) easy at the human level, where the context and the background of the people who communicate help them in understanding which entity a particular name refers to, this does not hold for machines. Having unique identifiers saves machines from the need of disambiguating, but also allows them to easily link data between different sources, provided they all use the same standard for identification. And linking data, that is making connections between things that were not connected before, is a first form of inference, a very simple but also a very useful one that allows us to get new knowledge from the one we originally had. Thus, what makes these unique identifiers really useful is not only the fact that they are unique. Uniqueness allows for disambiguating, but is not sufficient to link a data source to others. To do this, identifiers also need to be shared between different systems and knowledge repositories: the more the same id is used across knowledge bases, the easier it is to make connections between them.

What happens when two systems, instead, use different ids? Well, unless somebody decides to map the ids between the two systems, there are few possibilities of getting something useful out of them. This is the reason why the reconciliation of objects across different systems is so useful: once you state that their two ids are equivalent, then you can perform all the connections that you would do if the objects were using the same id. This is the main reason why matching WOEIDs for cities with their Wikipedia pages would be nice, as I wrote at the beginning of this post.
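
To make the point concrete, here is a minimal Python sketch of what a reconciliation table buys you. Everything in it is hypothetical: link_records is an illustrative helper, and the two tables are meant to be filled with whatever records the two systems hold under their own ids.

```python
def link_records(left, right, same_as):
    """Join two {id: record} tables through an id-equivalence mapping.

    same_as maps ids of the left system to ids of the right one; only
    pairs present in the mapping (and in both tables) can be linked.
    """
    linked = {}
    for left_id, right_id in same_as.items():
        if left_id in left and right_id in right:
            # Merge the two views of the same real-world object.
            linked[left_id] = {**left[left_id], **right[right_id]}
    return linked
```

Without an entry in same_as, two records describing the same place simply never meet, however rich each of them is on its own.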

Wikipedia articles are already disambiguated (except, of course, for disambiguation pages) and their names can be used as unique identifiers. For instance, DBPedia uses article names as a part of its URIs (see, for instance, information about the entities Milan and Milan (disambiguation)). However, what we found is that there is no trivial way to match Wikipedia articles with WOEIDs: despite what others say on the Web, we found no 100% working solution. Actually, the ones that at least return something are pretty far from that 100% too: Wikilocation works fine with monuments or geographical elements but not with large cities, while Yahoo! APIs themselves have a direct concordance with Wikipedia pages, but according to the documentation this is limited to airports and towns within the US.

The solution to this problem is a mashup approach, feeding the information returned by a Yahoo! WOEID-based query to another data source capable of dealing with Wikipedia pages. The first experiment I tried was to query DBPedia, searching for articles matching Places with the same name and a geolocation contained in the boundingBox. The script I built is available here (remember: to make it work, you need to change it entering a valid Yahoo! appid) and performs the following SPARQL query on DBPedia:

SELECT DISTINCT ?page ?fbase WHERE {
 ?city a <http://dbpedia.org/ontology/Place> .
 ?city foaf:page ?page .
 ?city <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?lat .
 ?city <http://www.w3.org/2003/01/geo/wgs84_pos#long> ?long .
 ?city rdfs:label ?label .
 ?city owl:sameAs ?fbase .
 FILTER (?lat > "45.40736"^^xsd:float) .
 FILTER (?lat < "45.547058"^^xsd:float) .
 FILTER (?long > "9.07683"^^xsd:float) .
 FILTER (?long < "9.2763"^^xsd:float) .
 FILTER (regex(str(?label), "^Milan($|,.*)")) .
 FILTER (regex(str(?fbase), "http://rdf.freebase.com/ns/")) .
}

Basically, what it gets are the Wikipedia page and the Freebase URI for a place called “like” the one we are searching, where “like” means either exactly the same name (“Milan”) or one which still begins with the specified name but is followed by a comma and some additional text (e.g. “Milan, Italy”). This is to take into account cities whose Wikipedia page name also contains the country they belong to. Some more notes are required to better understand how this works:

  • I am querying for articles matching “Places” and not “Cities” because on DBPedia not all the cities are categorized as such (data is still not very consistent);
  • I am matching rdfs:label for the name of the City, but unfortunately not all cities have such a property;
  • requiring the Wikipedia article to have equivalent URIs linked through the owl:sameAs property is kind of strict, but I saw that most of the cities had at least one such URI, and most of the time it included the one from Freebase I was searching for.
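
The name matching and the bounding box constraints used in the query can be tried out in isolation. The Python sketch below is not taken from woe2wp.pl: label_pattern and bbox_filters are hypothetical helpers, the pattern is checked with Python’s own regex engine rather than DBPedia’s, and a name containing characters that re.escape rewrites (spaces, for instance) would need extra care before being pasted into a SPARQL string.

```python
import re


def label_pattern(name: str) -> str:
    """Regex accepting "Name" or "Name, anything", like the label filter."""
    return "^" + re.escape(name) + "($|,.*)"


def bbox_filters(south: float, north: float, west: float, east: float) -> str:
    """FILTER clauses keeping ?lat/?long inside a bounding box."""
    return "\n".join([
        f' FILTER (?lat > "{south}"^^xsd:float) .',
        f' FILTER (?lat < "{north}"^^xsd:float) .',
        f' FILTER (?long > "{west}"^^xsd:float) .',
        f' FILTER (?long < "{east}"^^xsd:float) .',
    ])


if __name__ == "__main__":
    pattern = label_pattern("Milan")
    for label in ("Milan", "Milan, Italy", "Milano"):
        print(label, bool(re.search(pattern, label)))
    print(bbox_filters(45.40736, 45.547058, 9.07683, 9.2763))
```

The pattern accepts “Milan” and “Milan, Italy” but rejects “Milano”, which is exactly the kind of alternative spelling that makes a label-based match fragile.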

This solution, of course, is still kind of naive. I have tested it with a list of WOEIDs for the top 233 cities around the world and its recall is pretty bad: out of 233 cities, 96 returned empty results, which corresponds to a recall lower than 60%. The reasons for this are many: sometimes the geographic coordinates of the cities in Wikipedia are just out of the bounding box provided by GeoPlanet; other times the city name returned by Yahoo! does not match any of the labels provided by DBPedia, or no rdfs:label property is present at all; some cities are not even categorized as Places; very often accents or alternative spellings make the city name (which is usually returned by Yahoo! without special characters) untraceable within DBPedia; and so on.

Trying to find an alternative approach, I reverted to good old Freebase. Its api/service/search API allows you to query the full text index of Metaweb’s content base for a city name or part of it, returning all the topics whose name or alias matches it and ranking them according to different parameters, including their popularity in Freebase and Wikipedia. This is a really powerful and versatile tool, and I suggest that everyone who is interested in it check its online documentation to get an idea of its potential. The script I built is very similar to the previous one: the only difference is that, after the query to Yahoo! APIs, it queries Freebase instead of DBPedia. The request it sends to the search API is like the following one:

where (like in the previous script) city name and bounding box coordinates are provided by Yahoo! APIs. Here are some notes to better understand the API call:

  • the city name is provided as the query parameter, while type is set to /location/citytown to get only the cities from Freebase. In this case, I found that every city I was querying for was correctly assigned this type;
  • the mql_output parameter specifies what you want in Freebase’s response. In my case, I just asked for the Wikipedia ID (asking for the “key” whose “namespace” was /wikipedia/en_id). Speaking about IDs, Metaweb has done a great job in reconciling entities from different sources and already provides plenty of unique identifiers for its topics. For instance, we could get not only Wikipedia’s and Freebase’s own IDs here, but also the ones from Geonames if we wanted to (this is left to the reader as an exercise ;)). If you want to know more about this, just check the Id documentation page on the Freebase wiki;
  • the mql_filter parameter allows you to specify some constraints to filter data before they are returned by the system. This is very useful for us, as we can put our constraints on geographic coordinates here. I also specified the type /location/location to “cast” results on it, as it is the one which has the geolocation property. Finally, I repeated the constraint on the Wikipedia key which is also present in the output, as not all the topics have this kind of key and the API wants us to filter them away in advance.
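
As a rough illustration of how the four parameters described above fit together, here is a Python sketch that assembles such a request. It is a guess at the shape of the call, not the request sent by woe2wpFB.pl: the host is a placeholder (only the api/service/search path comes from this post), build_search_url is a hypothetical helper, and the exact MQL layout of mql_output and mql_filter is an assumption.

```python
import json
from urllib.parse import urlencode

# Placeholder host: only the "api/service/search" path comes from the post.
SEARCH_ENDPOINT = "http://freebase.invalid/api/service/search"


def build_search_url(city, south, north, west, east):
    """Assemble a search URL from a city name and a bounding box."""
    # Assumed MQL shapes; the real script's clauses may look different.
    wikipedia_key = {"namespace": "/wikipedia/en_id", "value": None}
    mql_filter = [{
        "type": "/location/location",
        "geolocation": {
            "latitude>": south, "latitude<": north,
            "longitude>": west, "longitude<": east,
        },
        "key": {"namespace": "/wikipedia/en_id"},
    }]
    params = {
        "query": city,                    # the city name from Yahoo!
        "type": "/location/citytown",     # only cities
        "mql_output": json.dumps([{"key": wikipedia_key}]),
        "mql_filter": json.dumps(mql_filter),
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)
```

Keeping the filter as a Python structure and serializing it with json.dumps avoids quoting mistakes when the coordinates change from city to city.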

Luckily, in this case the results were much more satisfying: only 9 out of 233 cities were not found, giving us a recall higher than 96%. The reasons why those cities were missing follow:

  • three cities did not have the specified name as one of their alternative spellings;
  • four cities had non-matching coordinates (this can be due either to Metaweb’s data or to Yahoo’s bounding boxes; however, after a quick check it seems that Metaweb’s are fine);
  • two cities (Buzios and Singapore) just did not exist as cities in Freebase.
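
The two recall figures quoted in this post follow directly from the counts of empty results (96 for DBPedia, 9 for Freebase, out of 233 cities). A few lines of Python make the arithmetic explicit; recall here is a hypothetical helper that simply means “share of cities for which something was returned”, as in the text.

```python
def recall(total: int, missing: int) -> float:
    """Fraction of cities for which a non-empty result was returned."""
    return (total - missing) / total


dbpedia_recall = recall(233, 96)    # 137 of 233 cities found
freebase_recall = recall(233, 9)    # 224 of 233 cities found

if __name__ == "__main__":
    print(f"DBPedia:  {dbpedia_recall:.1%}")
    print(f"Freebase: {freebase_recall:.1%}")
```

That is roughly 58.8% for the DBPedia approach against 96.1% for the Freebase one.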

The good news is that, apart from the last case, the other ones can be easily fixed just by updating Freebase topics: for instance, one city (Benidorm) just did not have any geographic coordinates, so (bow to the mighty power of the crowd, and of Freebase that supports it!) I just added them, taking the values from Wikipedia, and now the tool works fine with it. Of course, I would not suggest that anybody run my 74-line script now to reconcile the WOEIDs of all the cities in the world and then manually fix the empty results; however, this gives us hope that, with some more programming effort, this reconciliation could be possible without too much human involvement.

So, I’m pretty satisfied right now. At least for our small project (which will probably become the subject of one blog post sooner or later ;)) we got what we needed, and then who knows, maybe with the help of someone we could make the script better and start adding WOEIDs to cities in Freebase… what do you think about this?

I have prepared a zip file with all the material I talked about in this post, so you don’t have to follow too many links to get all you need. In the zip you will find:

  • woe2wp.pl and woe2wpFB.pl, the two perl scripts;
  • test*.pl, the two test scripts that run woe2wp or woe2wpFB over the list of WOEIDs provided in the following file;
  • woeids.txt, the list of 233 WOEIDs I tested the scripts with;
  • output*.txt, the (commented) outputs of the two test scripts.

Here is the zip package. Have fun ;)

3 comments February 28th, 2011

Semantic Annotations Part 1: an introduction

This week’s post is about a topic I’m really interested in, that is semantic annotations. It is so interesting for me that in the last two years, despite having jobs not directly involving this subject, I have tried to learn more and work on this topic anyway, using my spare time. Why is that so important for me? Well, it might become the same for you if you like its basic concept, that is allowing anyone to write anything about anything else. Moreover, in a perfect (or at least well designed ;)) world semantic annotations would also allow anyone to only get the information written by people they trust/like (or that authorized them), without being overwhelmed by useless, wrong, or bad data.

A big problem with semantic annotations is that the subject itself, from a researcher’s point of view, is really broad, and any kind of breadth-first approach to the topic tends to leave you with pretty shallow concepts to deal with. At the same time, going depth-first while ignoring some aspects of annotations will only make your approach seem too simplistic to people who are experts in other aspects (or who already tried the breadth-first approach ;)). I think this is the main reason why my work on semantic annotations has become more and more like the development of Duke Nukem Forever… Which, in case you don’t know, is a great example of how trying to reach absolute perfection, especially in a field where everything evolves so quickly, keeps you further and further from having something simply done. For anyone who wants to read something about this, I’d really suggest having a look at this article, which I found really enlightening.

So, trying to follow the call to “release early, release often”, I’ll post a series of articles about semantic annotations here. I have decided to skip scientific venues for a while, at least till I have something that is at the same time deep and broad enough. And if I never reach that… Well, you will have read everything I’ve done in the meanwhile and I hope something good will come out of there anyway.

What is an annotation?

To start understanding semantic annotations, I guess I should first clarify what an annotation is. Annotations are notes about something: if you are reading a book, you can write them in the page margins; if someone parks a car in front of your garage, you can leave one on her windshield (well, better if you directly write that on the windshield, so she’ll remember it next time ;)); or, for example, you can add an annotation to food in the fridge with its expiration date. What these annotations have in common is that they are all written on a medium (paper, windshield, whatever!) and they are physically placed somewhere. Moreover, they have been written by someone at a specific moment in time, and they comment on something specific (some text within the book, the act of leaving a car in the wrong place, the duration of some food).

What is a computer annotation?

This is what happens “in real life”, but what about computer annotations? As everything is data (some time ago I would have said “Everything is byte”), annotations become metadata, that is data about data. For them we would like to be able to maintain some of the characteristics of the “physical” annotations. They are useful if we can see them in the context of the piece of data they are annotating (what use is an expiration date if we don’t know which food it refers to?) and if we can know their authors and dates of creation. There is no “physical medium” for them, but nothing prevents us from adding some other meta-meta-data (that is, data about the annotation itself) that customizes it to become some sort of electronic post-it, a formal note, an audio file, or whatever else we can imagine.

Computer annotation systems are far from new: think, for instance, about the concept of annotations in documents. However, they get a completely new meaning when a medium like the Web becomes available. In this case we talk about Web annotations: every resource with a URL can be uniquely identified, and using XPath it is also possible to access specific parts of a Web page. Collecting Web annotations makes it possible, whenever a web page is opened, to check whether metadata exist for it and display them contextually. Systems like Google Sidewiki allow exactly this kind of operation, but they are not the only ones available: tags are nothing more than simple annotations added to generic URLs (such as in delicious), photos (Flickr), and so on; ratings are typically associated with products, which in turn often have unique URLs within a website, and systems like Revyu allow you to rate basically anything that has a URI. Finally, there are even games exploiting the concept of Web annotations, like The Nethernet.

What is a semantic annotation?

A semantic annotation is a computer annotation that relies on semantics for its definition. And here’s the rub: this definition is so wide that it can actually cover many different families of annotations. For instance:

  • semantics can be used to define information about the annotation itself in a structured way (i.e. the author, the date, and so on). An example of such an annotation system is Annotea;
  • semantics can be used to univocally define the meaning of the content of the annotation. For instance, if you tag something “Turkey” nobody will ever be able to know whether you were talking about the animal or the country, while if you tag it with one of the WordNet synsets 01794158 or 09039411 you’ll be able to disambiguate;
  • semantics can be used to (also) define the format of the contents of the annotation, meaning that the “body” of the annotation is not a simple unstructured text, but it contains RDF triples or some kind of structured information (the semantic annotation ontology I co-developed last year at Hypios follows this idea).
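
A rough sketch of how the three ideas above could live together in one object follows. The structure and the field names are my own assumptions for the sake of the example (they are not taken from Annotea nor from any published ontology), and I am assuming that the two synset identifiers follow the same order used in the text, animal first and country second.

```javascript
// Illustrative only: field names are assumptions, not a real annotation vocabulary.
var SYNSETS = {
  // WordNet synset identifiers quoted in the text, assumed to be in the
  // order "animal, country".
  turkeyAnimal: "01794158",
  turkeyCountry: "09039411"
};

// Build an annotation with structured metadata (who, when) and a structured
// body (a human-readable label plus the synset that fixes its meaning).
function makeAnnotation(target, author, created, label, synset) {
  return {
    target: target,
    meta: { author: author, created: created },
    body: { label: label, synset: synset }
  };
}

// Two tags mean the same thing only if they point to the same synset,
// no matter how their labels are spelled.
function sameMeaning(a, b) {
  return a.body.synset === b.body.synset;
}

var recipe = makeAnnotation("http://example.org/recipe", "alice", "2011-02-20",
                            "Turkey", SYNSETS.turkeyAnimal);
var travel = makeAnnotation("http://example.org/trip", "bob", "2011-02-20",
                            "Turkey", SYNSETS.turkeyCountry);
```

The two annotations carry the same label, but comparing them with sameMeaning tells them apart, which is exactly what a plain-text tag cannot do.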

In the next episodes…

Ehmm… I guess I might have lost someone here, but trust me… there’s nothing too difficult, it is just a matter of entering a little more into the details. As the “semantics” part requires a more in-depth description, I’ll leave it to the next “episode” of this series. My idea, at least for now, is the following:

  • Semantic Annotations Part 2: where’s the semantics?
  • Semantic Annotations Part 3: early prototypes for a semantic annotation system
  • Semantic Annotations Part 4: the SAnno project

I’m pretty sure there will be changes in this list, but I’ll make sure they will be reflected here so you will always be able to access all the other articles from every post belonging to this series.

1 comment February 20th, 2011

Android apps you can live without (but that I like)

I had decided to write a post about SL4A (Scripting Layer for Android) this week, but I have just started learning it and do not have much to say yet. What I learned, however, is that the space on my Hero is quite limited, especially now that more and more applications well above 1MB in size are available.

I have always tried to keep my list of apps as clean as I can, but of course many times I have forgotten something clearly useless among them, or left something that I thought would be useful sooner or later… Now, trying to free some space on my phone, I have started to classify apps as “removable” for the next time I run out of space, “unremovable”, “probably going to be useful sooner or later” and so on (I don’t know how many of these classes one should have, I just made some up in my mind – let me know if you think I should add more or remove some). Here are some of the applications I have now on my Hero and the categories I have associated with them.

Apps I’d remove if I could

These are apps that I cannot (easily) remove because they come with the firmware, but I’d really like to remove them:

  • Stocks and Footprints: I am not using them at all;
  • Peep: I found it nice at the beginning, but later realised it is pretty limited especially if compared to other Twitter clients for Android;
  • Teeter: this is a really nice app, but I have already finished it plenty of times and got bored. It would be so much better if it had more levels…

Unremovable

Well, at least for now… This is the list of apps that I use regularly, but I might change them if I find better alternatives.

  • Dropbox: I find it very useful whenever I want to bring something with me (like a pdf that I want to read later while I’m commuting) or, conversely, when I want to copy something to my PC without using the USB cable;
  • Game Dev Story: this is a game that is totally addictive for me. The same night I tried the free version I decided to buy the full one. Only now, after plenty of hours of playing and twenty years in the videogame, I am starting to get bored (as now it starts to get a little too repetitive);
  • Lazy Geek: I am not using it that much, but when I did it proved to be very useful, allowing me to run batch scripts to quickly check server-related stuff from my mobile;
  • Of course Perl for Android, even if I still haven’t used it extensively;
  • Scan2PDF mobile: ok, this is not the kind of app you use every day but I found it useful in many circumstances, from scanning documents I had to email to quickly PDFfing notes taken on a whiteboard during meetings;
  • Sipdroid, to connect my phone to my VOIP account. I like it much more than Skype, it works perfectly with my Messagenet account and all in just 860KB;
  • TED Mobile: I like to watch TED talks and this is my favorite way to see them while I’m travelling. I think I will also install RSA Vision soon for the same reason;
  • Tubemate: I use it to download YouTube videos for my son and watch them while offline;
  • TuneIn Radio, to listen to streaming radios (and in particular The Bone!) from wherever I am;
  • Train timetables for Italy (Orari Trenitalia) and Switzerland (Timetable Switzerland), absolutely indispensable for me :)

Nice, but only for a while

  • Air control lite: a cool game that I played a lot… after a while it gets a little too repetitive but I’d suggest it for a try;
  • Overbite Android: actually you must be a little more into gopher than I am to really appreciate it, but I think it is great to know not only that gopher is still alive, but also that you can access it from mobile devices. Indispensable if you want to access online resources in a very cheap way, but they need to be there and you need to know where they are. By the way, I think this is a great topic for a new blog post, so stay tuned;
  • Skype: one word: bah! Of course some might need it at all costs, but fortunately I can use something else for cheap calls (see Sipdroid above) and avoid having people bug me all day long with IM. And the size… That’s what I call an overbloated app! Next one I’ll remove when I need some space;
  • SMS Backup and restore: simple, free, and works perfectly. Of course, unless you’re paranoid about losing your text messages, you won’t need it very often. I used it once when I was upgrading Android on my phone and it worked fine… even if next time I’ll probably just back up messages without restoring them ;)

Useful, but substitutable

These are apps that I consider useful, but that probably can (or sometimes need to) be substituted by other applications: any suggestion about this is welcome!

  • Aldiko: this is a really cool ebook reader. I liked it so much that I decided to support it and buy the Premium version. But damn… it is SO huge for an ebook reader! Should I switch to something smaller?
  • Droid Comic Viewer: this is a nice app to read comics in different formats (such as CBR, CBZ, JPG, and so on). I found it nice, albeit not perfect… So I am open to alternatives if you think there are better ones;
  • ES File Explorer: it looks cool but I think I am not using it to its full potential. I guess there might be another, less powerful, lighter app that’s probably better for me;
  • Note Everything: a very powerful note-taking application. I use it very much and like it, but again I think I’m not exploiting its full potential;
  • Taskiller: the first task killer I have installed. I don’t know if it is better or worse than others, but it works fine for me;
  • Shuffle: it’s a free GTD todo list manager. It is quite customizable and the nice thing is that it allows you to check your list according to different facets: for instance you can see all the tasks you have to do today, or the ones related to a specific context or a given project. This app is at the same time one I would not live without (I always keep many todo lists), something you might get bored of (I am currently not using it much) and something that can be substituted (in my case, lately, by lists on paper!). The main reason I stopped using it is that I’m experimenting with another, more physical, medium, and I kind of like it, but apart from this I still think it’s a very good app.

Conclusions

Well, this is more or less the list of apps I have now on my Android. Of course there were many others I have already uninstalled, but I thought it wouldn’t be very useful to describe them too… ;) I hope that in this list you can find something that might be useful to you too. However, I always fear that this list is too biased by the fact that most of these apps are the first ones that appear in the market when sorting by popularity: this always leaves me with the doubt that there might be something very good in the long tail that I’ll never have a chance to try. So, please, if you have other good apps to suggest, especially lighter substitutes for the ones I already have, but also new ones that you think I might like to try, just let me know: I’ll be happy to try them.

4 comments February 13th, 2011

Javascript scraper basics

As you already know, I am quite into scraping. The main reason is that I think that what is available on the Internet (and in particular on the Web) should be consumed not only in the way it is provided, but also in a more customized one. We should be able to automatically gather information from different sources, integrate it, and obtain new information as the result of the elaboration of what we found.

Of course, this is highly related to my research topic, which is the Semantic Web. In the SW we suppose we already have data in formats that are easy for a machine to automatically consume, and a good part of the efforts of the SW research community are directed towards enabling standards and technologies that allow us to publish information this way. Of course, even if we have made some progress in this, there are still a couple of problems:

  1. the casual Internet user does not (and does not want to) know about these technologies and how to use them, so the whole “semantic publishing” process should be made totally invisible to her;
  2. most of the data that has been already provided on the Web in the last 20 years does not follow these standards. It has been produced for human consumption and cannot be easily parsed in an automatic way.

Fortunately, some structure in data exists anyway: information written in tables, infoboxes, or generated by software from structured data (e.g. data saved in a database), typically shows a structure that, albeit informal (that is, not following a formal standard), can be exploited to get the original structured data back. Scrapers rely on this principle to extract relevant information from Web pages and, if provided with a correct “crawling plan” (I’m makin’ up this name, but I think you can easily understand what I mean ;)), they can more or less easily gather all the contents we need, reconstructing the knowledge we are interested in and allowing us to perform new operations on it.
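
To show what “getting the original structured data back” can look like in practice, here is a tiny sketch. The table markup, the column meaning (city and canton) and the tableToRecords name are all invented for this example; a real page would need its own regular expressions.

```javascript
// Sample markup invented for this example.
var html =
  "<table>" +
  "<tr><td>Zurich</td><td>ZH</td></tr>" +
  "<tr><td>Lugano</td><td>TI</td></tr>" +
  "</table>";

// Turn every table row into a record, relying only on the informal
// structure of the markup (rows made of two cells).
function tableToRecords(source) {
  var rowRe = /<tr>([\s\S]*?)<\/tr>/gi;
  var records = [];
  var row;
  while ((row = rowRe.exec(source)) !== null) {
    var cellRe = /<td>([^<]*)<\/td>/gi;
    var cells = [];
    var cell;
    while ((cell = cellRe.exec(row[1])) !== null) {
      cells.push(cell[1]);
    }
    records.push({ city: cells[0], canton: cells[1] });
  }
  return records;
}

var records = tableToRecords(html);
```

Once the rows are objects again, sorting, filtering or merging them with data from another source is ordinary programming rather than text processing.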

There are, however, cases in which this operation is not trivial: for instance, when the crawling plan is not easy to reproduce (e.g. contents belong to pages which do not follow a well defined structure). Different problems happen when the page is generated on the fly by some Javascript code, thus making HTML source code parsing useless. This reminds me a lot of dynamically generated code as a software protection: when we moved from statically compiled code to code that was updating itself at runtime (such as with packed or encrypted executables), studying a protection on its “dead listing” became more difficult and so we had to change our tools (from simple disassemblers to debuggers or interactive disassemblers) and approaches… but that is, probably, another story ;-) Finally, we might want to provide an easier, more interactive approach to Web scraping that allows users to dynamically choose the page they want to analyze while they are browsing it, and that does not require them to switch to another application to do this.

So, how can we run scrapers dynamically, on the page we are viewing, even if part of its contents has been generated on the fly? Well, we can write a few lines of Javascript code and run them from our own browser. Of course there are already apps that allow you to do similar things very easily (anyone said Greasemonkey?), but starting from scratch is a good way to do some practice and learn new things… then you’ll always be able to revert to GM later ;-) So, I am going to show you some very easy examples of Javascript code that I have tested on Firefox. I think they should be simple enough to run from other browsers too… well, if you have any problem with them just let me know.

First things first: as usual I’ll be playing with regular expressions, so check the Javascript regexp references at regular-expressions.info and javascriptkit.com (here and here) and use the online regex tester tool at regexpal.com. And, of course, to learn the basics about bookmarklets and get an idea about what you can do with them check Fravia’s and Ritz’s tutorials. Also, check when these tutorials were written and realize how early their authors understood that Javascript would change the way we use our browsers…

Second things second: how can you test your Javascript code easily? Well, apart from pasting it inside your location bar as a “javascript:” one-liner (we’ll get to that later when we build our bookmarklets), I found the “jsenv” bookmarklet at squarefree very useful: just drag the jsenv link to your bookmark toolbar, click on the new bookmark (or just click on the link appearing in the page) and see what happens. From this very rudimentary development environment we’ll be able to access the page’s contents and test our scripts. Remember that this script is able to access only the contents of the page that is displayed when you run it, so if you want it to work with some other page’s contents you’ll need to open it again.
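
If you have never seen a “javascript:” one-liner, the following sketch shows one possible way to build it from a piece of source code. The toBookmarklet helper is a name I made up; wrapping the code in an anonymous function is a common convention (it keeps variables out of the page and makes the whole expression evaluate to undefined), not a requirement.

```javascript
// Hypothetical helper: turn a snippet of Javascript into a "javascript:" URL
// that can be saved as a bookmark.
function toBookmarklet(source) {
  // Wrap the code in an immediately invoked function so that its variables
  // stay local and the expression yields undefined.
  var wrapped = "(function(){" + source + "})();";
  // Escape characters that are not safe inside a URL.
  return "javascript:" + encodeURIComponent(wrapped);
}

var link = toBookmarklet("alert(document.title);");
```

The resulting string can be pasted as the address of a new bookmark; clicking it runs the snippet on whatever page is currently displayed.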

Now we should be ready to test the examples. For each of them I will provide both some “human readable” source code and a bookmarklet you can “install” as you did with jsenv.

Example 1: MySwitzerland.com city name extractor

This use case is a very practical one: for my work I needed to get a list of the tourist destinations in Switzerland as described on the MySwitzerland.com website. I knew the pages I wanted to parse (one for every region in Switzerland, like this one) but I did not want to waste time collecting the names by hand. So, the following quick script helped me:

var results = "", array;
var content = document.documentElement.innerHTML;
var re = new RegExp ("<div style=\"border-bottom:.*?>([^<]+)</div>","gi");
while (array = re.exec(content)){
  results += array[1]+"<br/>";
}
document.documentElement.innerHTML=results;

Just a few quick comments about the code:

  • line 2 gets the page content and saves it in the “content” variable
  • line 3 initializes a new regular expression (whose content and modifiers are defined as strings) and saves it in the “re” variable
  • line 4 executes the regexp on the content variable and, as long as it returns matches, it saves them in the “array” variable. As there is only one capturing group (the part enclosed in parentheses inside the regexp), the text it captures is saved inside array[1]
  • in line 5 the result of the match, plus “<br/>”, is appended to the “results” variable, which was initialized as an empty string in line 1
  • in the last line, the value of “results” is printed as the new Web page content
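
If you want to play with the matching loop outside the browser, the same logic can be packed into a small function that works on any string. This is just a sketch: extractAll is a name I made up, and the sample markup only imitates the kind of div the regexp above looks for (the real page may differ).

```javascript
// Collect the first capturing group of every match of `pattern` in `source`.
function extractAll(source, pattern) {
  // The "g" flag makes exec() resume from re.lastIndex at every call;
  // without it the loop would find the first match over and over again.
  var re = new RegExp(pattern, "gi");
  var found = [];
  var match;
  while ((match = re.exec(source)) !== null) {
    found.push(match[1]);
  }
  return found;
}

// Markup invented for this example.
var sample =
  '<div style="border-bottom: 1px solid">Bern</div>' +
  '<div style="border-bottom: 1px solid">Basel</div>';

var cities = extractAll(sample, '<div style="border-bottom:.*?>([^<]+)</div>');
```

In the browser you would pass document.documentElement.innerHTML as the first argument and then print the resulting array however you like.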

The bookmarklet is here: MySwitzerland City Extractor.

Example 2: Facebook’s phonebook conversion tool

This example shows how to quickly create a CSV file out of Facebook’s Phonebook page. It is a simplified version that only extracts the first phone number for every listed contact, and the regular expression is kind of redundant but is a little more readable and should allow you, if you want, to easily find where it matches within the HTML source.

var output = "", array;
var content = document.documentElement.innerHTML;
var re = new RegExp ("<div class=\"fsl fwb fcb\">.*?<a href=\"[^\"]+\">([^<]+)<.*?<div class=\"fsl\">([^<]+)<span class=\"pls fss fcg\">([^<]+)</span>", "gi");
while (array = re.exec(content)){
 output += array[1]+";"+array[2]+";"+array[3]+"<br/>";
}
document.documentElement.innerHTML = output;

As you can see, the structure is the very same as in the previous example. The only difference is in the regular expression (that has three matching groups: one for the name, one for the phone number, and one for the type, i.e. mobile, home, etc.) and, obviously, in the output. You can get the bookmarklet here: Facebook Phonebook to CSV.
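
One thing the script above does not handle is a name or a phone type that itself contains a semicolon or a quote, which would break the columns. A possible way around it, sketched here with made-up helper names (csvField, csvRow) and a made-up contact, is to quote such fields following the usual CSV convention of doubling the quotes inside them.

```javascript
// Quote a field only when it contains the separator, a quote or a line break.
function csvField(value) {
  var text = String(value);
  if (/[";\r\n]/.test(text)) {
    return '"' + text.replace(/"/g, '""') + '"';
  }
  return text;
}

// Join already extracted values into one semicolon-separated row.
function csvRow(fields) {
  return fields.map(csvField).join(";");
}

// Contact data invented for this example.
var row = csvRow(["Doe; John", "+41 00 000 00 00", "Mobile"]);
```

Inside the loop of the script, array[1], array[2] and array[3] would take the place of the three sample values.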

Example 3: Baseball stats

This last example shows you how you can get new information from a Web page just by running a specific tool on it: data from the Web page, once extracted from the HTML code, becomes the input for an analysis tool (that can be more or less advanced – in our case it is going to be trivial) that returns its output to the user. For the example I decided to calculate an average value from the ones shown in a baseball statistics table at baseball-reference.com. Of course I know nothing about baseball so I just picked a value even I could understand, that is the team members’ age :) Yes, the script runs on a generic team roster page and returns the average age of its members… probably not very useful, but it should give you the idea. Ah, and speaking of this, there are already much more powerful tools that perform similar tasks, like statcrunch, which automatically detects tables within Web pages and imports them so you can calculate statistics over their values. The advantage of our approach, however, is that it is completely general: any operation can be performed on any piece of data found within a Web page.

The code for this script follows. I have tried to be as verbose as I could, dividing data extraction into three steps (get team name/year, extract team members’ list, get their ages). As the final message is just a string and does not require a page on its own, the calculated average is returned within an alert box.

var content, payrolls, team, count, sum, re, array;
content = document.documentElement.innerHTML;

// get team name/year
re = new RegExp ("<h1>([^<]+)<", "gi");
if (array = re.exec(content)){
 team = array[1];
}

// extract "team future payrolls" to get team members' list
re = new RegExp ("div_payroll([\\s\\S]+?)appearances_table", "i");
if (array = re.exec(content)){
 payrolls = array[1];
}

sum=0;count=0;
// extract members' ages and sum them
re = new RegExp ("</a></td>\\s+<td align=\"right\">([^<]+)</td>", "gi");
while (array = re.exec(payrolls)){
 sum += parseInt(array[1], 10);
 count++;
}

alert("The average age of "+team+" is "+sum/count+" out of "+count+" team members");

The bookmarklet for this example is available here: Baseball stats – Age.
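
The script above assumes that at least one age is found: if the regexp matches nothing, the alert shows NaN. A slightly more defensive version of the “analysis” step could look like the sketch below, where averageOf is a name I made up and the sample values are invented.

```javascript
// Average the values that can be read as integers; return null when
// there is nothing to average instead of producing NaN.
function averageOf(values) {
  var numbers = values
    .map(function (v) { return parseInt(v, 10); })
    .filter(function (n) { return !isNaN(n); });
  if (numbers.length === 0) {
    return null;
  }
  var sum = numbers.reduce(function (a, b) { return a + b; }, 0);
  return sum / numbers.length;
}

// Sample values invented for this example, including one that is not a number.
var avg = averageOf(["25", "31", "n/a", "28"]);
```

Keeping extraction and calculation apart also means that the same function can be reused on any other column you manage to pull out of a page.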

Conclusions

In this article I have shown some very simple examples of Javascript scrapers that can be saved as bookmarklets and run on the fly while you are browsing specific Web pages. The examples are very simple and should both give you an idea of the potential of this approach and teach you the basics of Javascript regular expressions so that you can apply them in other contexts. If you want to delve deeper into this topic, however, I suggest you take a look at Greasemonkey and its huge repository of user-provided scripts: that is probably the best way to go if you want to easily develop and share your powerbrowsing tools.

2 comments February 6th, 2011
