mala::home Davide “+mala” Eynard’s website

17 Feb 2012

Internet Technology (2011-2012) assignments are online

After one year, here is another update regarding my Internet Technology class (see here for last year's update). Unfortunately it will also be the last one, at least for the class as it is now, because the master's program I was teaching it for has been closed :-/. But hey, there are many ways in which knowledge can be shared, and that program was only one of them, right?

So here they are, the new papers written by my dear students! This year fewer of them have been shared, but I think their quality kind of makes up for the quantity. So do not worry if you cannot access all of them, and enjoy the fact that the ones you can read are willingly shared by the students under a CC BY-NC-SA license :-) If you are interested in any of the topics, let me know and I might try to put you in contact with the authors.

Filed under: papers, teaching, web
6 Jun 2011

Gopher is dead, long live gopher

Some months ago, while preparing a lesson for my Internet Technology class, I was doing some research on old protocols, just to give my students a feeling for how the Internet used to work some years ago, and for how it is much, much more than just the Web. During this research I found some interesting information about good old Gopher and how to run it from Android devices... Hey, wait, did I say Android devices? Yep! The Overbite project aims at bringing Gopher to the most recent devices and browsers. As Firefox & co. have dropped Gopher support in their latest versions (shame on you), the folks from the Overbite project have also decided to develop a browser extension that lets you keep browsing it (well, if you ever did it before ;)).

What struck me was not the piece of news per se, but the fact that there was a community (a very small one, I thought) still interested in letting people access the gopherspace... To find what? So I spent some time (probably not enough, but still more than I had planned) browsing it myself and checking what is available right now...

What I found is that the gopherspace, or at least the small part of it I was able to read in that limited time, is surprisingly up to date: there are news feeds, weather forecasts, reddit, xkcd, and even twitter (here called twitpher :)). Of course, there are also those files I used to find 15 years ago, when I browsed the early web from a text terminal with lynx: guides for n00bs about the Internet, hacking tutorials, ebooks I did not know about (the one I like most is Albert Einstein's theory of relativity in words of four letters or less, which you can find here in HTML). Moreover, there are still people willing to use Gopher as a means to share their stuff, or to provide useful services for other users: FTP servers (built on a cluster of PlayStation 3 consoles... awesome!) with collections of rare operating systems, LUG servers with mirrors of all the main Linux distros, pages distributing podcasts and blogs (well, phlogs!). Ah, and for all those who don't want to install anything to access Gopher, there's always the GopherProxy service, which can be accessed using any browser.

After seeing all of this, one word came to my mind: WHY? Don't misunderstand me: I think all of this is really cool and an interesting phenomenon to follow, and I really love to see all of these people still having incentives to use this old technology. And it is great to see that the incentives, in this case, are not the usual ones you might find in a participative system. I mean, what is one of the main incentives for contributing to Wikipedia? Well, the fact that lots of people will read what you have written (if you don't agree, think about how many people would create new articles in a wiki which is not already as famous as Wikipedia). And how many readers is a page from the gopherspace going to have? Well, probably not as many as any popular site you can find around the Web. But Gopher, relying mainly on text files, has a very light protocol which is super fast (and cheap!) on a mobile phone. It has no ads. It adds no fuss to the real, interesting content you want to convey to your readers. And, quoting the words of lostnbronx from Information Underground:

"... I tell you, there's something very liberating about not having to worry over "themes" or "Web formatting" or whatever. When you use gopher, you drop your files onto the server, maybe add a notation to a gophermap if you're using one (which is purely optional), and...that's it. No muss, no fuss, no dicking around with CMS, CSS, stylesheets, or even HTML. Unless you want to. Which I don't. It defeats the purpose, see?"

Aaahh... so much time has passed since the last time I heard such wise words... It is like coming back to my good old 356* and listening to its +players! Let me tell you this: I like these ideas, and I am so happy to see that this new old Gopher still looks so far from being trendy... because this means that a lot of time will have to pass before commercial idiots start polluting it! And in the meanwhile, it will be nice to have a place where information can be exchanged in a simple and inexpensive way. Maybe we in the richest part of the world do not realize it, but there are still many places where older but effective technologies are widely used (some examples? Check this one about Nokia's most popular phone, and read why we still have USENET), and if something like Gopher could be a solution in those cases, well... long live Gopher :-)

11 Mar 2011

Harvesting Online Content: An Analysis of Hotel Review Websites

A new paper is out:

Marchiori, E., Eynard, D., Inversini, A., Cantoni, L., Cerretti, F. (2011). Harvesting Online Content: An Analysis of Hotel Review Websites. In R. Law, M. Fuchs & F. Ricci (Eds.), Information and Communication Technologies in Tourism 2011 – Proceedings of the International Conference in Innsbruck, Austria (pp. 101-112). Wien: Springer.

Find the paper here ;)

Filed under: papers, research, web
6 Feb 2011

Javascript scraper basics

As you already know, I am quite into scraping. The main reason is that I think that what is available on the Internet (and in particular on the Web) should be consumed not only in the way it is provided, but also in more customized ones. We should be able to automatically gather information from different sources, integrate it, and obtain new information as the result of processing what we have found.

Of course, this is highly related to my research topic, that is, the Semantic Web. In the SW we assume we already have data in formats that are easy for a machine to consume automatically, and a good part of the efforts of the SW research community is directed towards enabling standards and technologies that allow us to publish information this way. Even if we have made some progress on this, however, there are still a couple of problems:

  1. the casual Internet user does not know (and does not want to know) about these technologies and how to use them, so the whole "semantic publishing" process should be made totally invisible to her;
  2. most of the data that has already been published on the Web in the last 20 years does not follow these standards: it has been produced for human consumption and cannot be easily parsed automatically.

Fortunately, some structure exists in the data anyway: information written in tables or infoboxes, or generated by software from structured data (i.e. data saved in a database), typically shows a structure that, albeit informal (that is, not following a formal standard), can be exploited to get the original structured data back. Scrapers rely on this principle to extract relevant information from Web pages and, if provided with a correct "crawling plan" (I'm makin' up this name, but I think you can easily understand what I mean ;)), they can more or less easily gather all the contents we need, reconstructing the knowledge we are interested in and allowing us to perform new operations on it.
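
To make the idea a bit more concrete, here is a minimal sketch of the principle. It is not tied to any particular site: it just assumes the page you are viewing contains a plain HTML table with text-only cells, and pulls the cell values back out of the markup:

var html = document.documentElement.innerHTML;
// one match per table row, one match per (plain text) cell inside it
var rowRe = new RegExp("<tr[^>]*>([\\s\\S]*?)</tr>", "gi");
var cellRe = new RegExp("<td[^>]*>([^<]*)</td>", "gi");
var row, cell, lines = [];
while (row = rowRe.exec(html)) {
  var cells = [];
  while (cell = cellRe.exec(row[1])) {
    cells.push(cell[1]);
  }
  if (cells.length > 0) lines.push(cells.join(" | "));
}
// the "structured data" we got back, one line per table row
alert(lines.join("\n"));

Nothing more than regular expressions and string concatenation, but the output is already something a program (or a spreadsheet) can work with.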

There are, however, cases in which this operation is not trivial, for instance when the crawling plan is not easy to reproduce (i.e. contents belong to pages which do not follow a well-defined structure). Different problems arise when the page is generated on the fly by some Javascript code, which makes parsing the HTML source code useless. This reminds me a lot of dynamically generated code used as a software protection: when we moved from statically compiled code to code that updated itself at runtime (such as packed or encrypted executables), studying a protection on its "dead listing" became more difficult, so we had to change our tools (from simple disassemblers to debuggers or interactive disassemblers) and our approaches... but that is, probably, another story ;-) Finally, we might want to provide an easier, more interactive approach to Web scraping, one that allows users to dynamically choose the page they want to analyze while they are browsing it, and that does not require them to switch to another application to do so.
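
This, by the way, is also why running the scraper inside the browser helps: document.documentElement.innerHTML gives you the serialization of the live DOM, scripts' output included, not the original HTML source. A tiny, purely illustrative check:

// add an element at runtime, as a page script would do
var div = document.createElement("div");
div.id = "generated-on-the-fly";
div.appendChild(document.createTextNode("I was not in the source"));
document.body.appendChild(div);
// the serialized live DOM now contains the new element (alerts "true")
alert(document.documentElement.innerHTML.indexOf("generated-on-the-fly") >= 0);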

So, how can we run scrapers dynamically, on the page we are viewing, even if part of its contents has been generated on the fly? Well, we can write a few lines of Javascript code and run them from our own browser. Of course there are already apps that allow you to do similar things very easily (anyone said Greasemonkey?), but starting from scratch is a good way to get some practice and learn new things... then you'll always be able to revert to GM later ;-) So, I am going to show you some very easy examples of Javascript code that I have tested on Firefox. I think they should be simple enough to run from other browsers too... well, if you have any problem with them just let me know.

First things first: as usual I'll be playing with regular expressions, so check the Javascript regexp references at regular-expressions.info and javascriptkit.com (here and here) and use the online regex tester tool at regexpal.com. And, of course, to learn the basics about bookmarklets and get an idea of what you can do with them, check Fravia's and Ritz's tutorials. Also, check when these tutorials were written and realize how early they understood that Javascript would change the way we use our browsers...

Second things second: how can you test your Javascript code easily? Well, apart from pasting it into your location bar as a "javascript:" one-liner (we'll get to that later when we build our bookmarklets), I found the "jsenv" bookmarklet at squarefree very useful: just drag the jsenv link to your bookmark toolbar, click on the new bookmark (or just click on the link appearing in the page) and see what happens. From this very rudimentary development environment we'll be able to access the page's contents and test our scripts. Remember that this script can only access the contents of the page that is displayed when you run it, so if you want it to work with some other page's contents you'll need to open it again there.
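
Just to show how low the barrier is: even something as trivial as this, typed in the location bar, already runs in the context of the page you are currently viewing (alert returns nothing, so the page itself is left untouched):

javascript:alert(document.title+" - "+document.links.length+" links on this page");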

Now we should be ready to test the examples. For each of them I will provide both some "human readable" source code and a bookmarklet you can "install" as you did with jsenv.

Example 1: MySwitzerland.com city name extractor

This use case is a very practical one: for my work I needed to get a list of the tourist destinations in Switzerland as described on the MySwitzerland.com website. I knew which pages I wanted to parse (one for every region of Switzerland, like this one), but I did not want to waste time collecting the names by hand. So, the following quick script helped me:

var results="";
var content = document.documentElement.innerHTML;
var re = new RegExp ("<div style=\"border-bottom:.*?>([^<]+)</div>","gi");
while (array = re.exec(content)){
  results += array[1]+"<br/>";
}
document.documentElement.innerHTML=results;

Just a few quick comments about the code:

  • line 2 gets the page content and saves it in the "content" variable
  • line 3 initializes a new regular expression (whose content and modifiers are defined as strings) and saves it in the "re" variable
  • line 4 executes the regexp on the content variable and, as long as it returns matches, saves each of them in the "array" variable. As there is only one matching group (the groups are the parts defined within parentheses inside the regexp), the result of each match will be found in array[1]
  • in line 5 the result of the match, followed by "<br/>", is appended to the "results" variable, which was initialized as an empty string in line 1
  • in the last line, the value of "results" is printed as the new Web page content

The bookmarklet is here: MySwitzerland City Extractor.
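
In case you are curious about what hides behind such a link: a bookmarklet is just the same script squeezed into a "javascript:" URI and wrapped in a function, so that it can be saved as a regular bookmark. A sketch of what the one above boils down to (same code as the listing, whitespace removed):

javascript:(function(){var results="";var content=document.documentElement.innerHTML;var re=new RegExp("<div style=\"border-bottom:.*?>([^<]+)</div>","gi");while(array=re.exec(content)){results+=array[1]+"<br/>";}document.documentElement.innerHTML=results;})();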

Example 2: Facebook's phonebook conversion tool

This example shows how to quickly create a CSV file out of Facebook's Phonebook page. It is a simplified version that only extracts the first phone number for every listed contact, and the regular expression is kind of redundant, but this way it is a little more readable and should allow you, if you want, to easily find where it matches within the HTML source.

var output="";
var content = document.documentElement.innerHTML;
var re = new RegExp ("<div class=\"fsl fwb fcb\">.*?<a href=\"[^\"]+\">([^<]+)<.*?<div class=\"fsl\">([^<]+)<span class=\"pls fss fcg\">([^<]+)</span>", "gi");
while (array = re.exec(content)){
 output += array[1]+";"+array[2]+";"+array[3]+"<br/>";
}
document.documentElement.innerHTML = output;

As you can see, the structure is the very same as in the previous example. The only differences are in the regular expression (which has three matching groups: one for the name, one for the phone number, and one for its type, i.e. mobile, home, etc.) and, obviously, in the output. You can get the bookmarklet here: Facebook Phonebook to CSV.
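
If you prefer an output you can paste straight into a .csv file, a small variation of the same script (same regular expression, just a different output format) prints one plain-text record per line instead of rebuilding an HTML page:

var output = "";
var html = document.documentElement.innerHTML;
var re = new RegExp ("<div class=\"fsl fwb fcb\">.*?<a href=\"[^\"]+\">([^<]+)<.*?<div class=\"fsl\">([^<]+)<span class=\"pls fss fcg\">([^<]+)</span>", "gi");
var array;
while (array = re.exec(html)){
  // one record per contact: name;number;type
  output += array[1]+";"+array[2]+";"+array[3]+"\n";
}
// show the records as plain text, ready to be copied into a .csv file
document.documentElement.innerHTML = "<pre>"+output+"</pre>";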

Example 3: Baseball stats

This last example shows how you can get new information from a Web page just by running a specific tool on it: data from the Web page, once extracted from the HTML code, becomes the input of an analysis tool (which can be more or less advanced; in our case it is going to be trivial) that returns its output to the user. For this example I decided to calculate an average of the values shown in a baseball statistics table at baseball-reference.com. Of course I know nothing about baseball, so I just picked a value even I could understand, that is, the team members' age :) Yes, the script runs on a generic team roster page and returns the average age of its members... probably not very useful, but it should give you the idea. Ah, and speaking of this, there are already much more powerful tools that perform similar tasks, like statcrunch, which automatically detects tables within Web pages and imports them so you can calculate statistics over their values. The advantage of our approach, however, is that it is completely general: any operation can be performed on any piece of data found within a Web page.

The code for this script follows. I have tried to be as verbose as I could, dividing the data extraction into three steps (get the team name/year, extract the team members' list, get their ages). As the final message is just a string and does not require a page of its own, the calculated average is returned within an alert box.

var content, payrolls, team, count, sum, re, array;
content = document.documentElement.innerHTML;

// get team name/year
re = new RegExp ("<h1>([^<]+)<", "gi");
if (array = re.exec(content)){
 team = array[1];
}

// extract "team future payrolls" to get team members' list
re = new RegExp ("div_payroll([\\s\\S]+?)appearances_table", "i");
if (array = re.exec(content)){
 payrolls = array[1];
}

sum=0;count=0;
// extract members' ages and sum them
re = new RegExp ("</a></td>\\s+<td align=\"right\">([^<]+)</td>", "gi");
while (array = re.exec(payrolls)){
 sum += parseInt(array[1], 10);
 count++;
}

alert("The average age of "+team+" is "+sum/count+" out of "+count+" team members");

The bookmarklet for this example is available here: Baseball stats - Age.

Conclusions

In this article I have shown some very simple examples of Javascript scrapers that can be saved as bookmarklets and run on the fly while you are browsing specific Web pages. The examples are very basic, and they should both give you an idea of the potential of this approach and teach you the basics of Javascript regular expressions, so that you can apply them in other contexts. If you want to delve deeper into this topic, however, I suggest you take a look at Greasemonkey and its huge repository of user-provided scripts: that is probably the best way to go if you want to easily develop and share your powerbrowsing tools.
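
To give you a taste of how the same logic would fit into Greasemonkey, here is a minimal sketch of Example 1 packaged as a user script. Note that the @include pattern is just a guess at the site's URL scheme, and the @name/@namespace values are made up for the example, so adjust them to your needs:

// ==UserScript==
// @name        MySwitzerland city extractor
// @namespace   malahome-examples
// @include     http://www.myswitzerland.com/*
// ==/UserScript==

// Same scraping logic as Example 1, run automatically on every matching page.
var results = "";
var content = document.documentElement.innerHTML;
var re = new RegExp("<div style=\"border-bottom:.*?>([^<]+)</div>", "gi");
var array;
while (array = re.exec(content)) {
  results += array[1] + "<br/>";
}
document.documentElement.innerHTML = results;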

5 Feb 2011

The importance of being serendipitous

Ok, ok, I know this looks just like an excuse for the delay in my weekly post, but here is what happened:

  • I was writing a beautiful post on Javascript scrapers and bookmarklets, when I started doing some experiments on a facebook page (you'll see more in the next post...)
  • on facebook I saw some updates from friends, crossposted from twitter, that reminded me that I had to check recent twits
  • after a while spent reading twits, I decided to check the list of people that a few selected friends of mine were following (which is a good way to increase the chances of finding something interesting), so I stumbled upon Silvia's twits and found her recent interview online (btw Silvia, that's cool... congrats!)
  • at the end of the interview, I saw that it had originally appeared in Issue 45 of fullcircle magazine, so I went to check it - I opened the pdf file as I thought that the "top 5 music notation apps" article might be interesting for a friend of mine
  • on page 5 my attention was caught by an article about Conky... wait... what is that? Just a matter of minutes: google search, official website, screenshots, aptitude install...

Now I have a beautiful conky panel on my desktop (and yes, I have also solved the "conky disappears" issue... thanks Google!), I have discovered an ezine I did not know and a twitter account I was not following. And the best thing is that these streams of consciousness happen pretty often in my life: I still remember how I discovered the existence of The Time Traveler's Wife by reading the blog of the wife of a researcher who was publishing stuff I was interested in. Sometimes, instead, I just happen to find things without exactly remembering how I got to them, as with Jesus Christ Vampire Hunter.

I kind of like my serendipitous attitude, as most of the time the things I find are actually interesting... but it kind of scares my friends when I tell them about the stuff I find. :) It is hard to make them understand that I was not searching for it, but rather that it found me, and that it is not an attention problem: I can stay concentrated when I need to, but when I leave my mind open to serendipity these things just happen very easily.

Btw, the Javascript article is coming soon... sorry for making you wait ;)

Filed under: serendipity, web
27 Jan 2011

Internet Technology assignments are online

As some of you might already know, last semester I taught a class called Internet Technology at the University of Lugano (USI). This has been a great chance not only to teach students something I am very passionate about (I was lucky enough to be allowed to choose the topic and the contents of the class, so I basically put everything I liked into it), but also to clear up in my own mind some concepts and links about the Internet in general, plus Web 1.0, 2.0, 3.0, and the following ones ;-).

At the end of the course, I decided not to evaluate my students with a classical exam. I thought that, for a subject that evolves so quickly, it would not have been useful to ask for notions that will become outdated, or in the worst case false, within a few months. I rather wanted to be sure they had grasped the main concepts and acquired the right attitude to deal with Internet-related topics and technologies now and in the future. So I asked them to write a short paper (around 10 pages) about a topic related to the ones we discussed in class. I suggested some (you can find them here) just in case there were students without ideas, asked them to communicate their choice to me in advance so I could check that everyone was working on a different topic, and finally gave them enough time to complete their work (and enjoy their Christmas holidays in between!).

I corrected the students' assignments about a week ago and I have to say that I am pretty satisfied with them. As "everything is already online" nowadays, it is very difficult to write something really new, but I think that most of these works provide both online references and some personal insights that, together, make those documents useful to interested readers. I asked the students if they wanted to share them on the course wiki under a CC license and most of them agreed (yeeeh!). You can find their documents here: their English is not great, but I think you might appreciate the contents anyway ;-)

Filed under: teaching, web
5 Apr 2007

Lunch Seminar

Lunch Seminar: Research on collaborative information sharing systems

Slides: Research on collaborative information sharing systems (Davide Eynard).
6 Dec 2006

Bibsonomy

As you have probably already noticed just by taking a look at my About page, I am using Bibsonomy as an online tool to manage the bibliographies for my (erm... future?) papers. If you use BibTeX (or EndNote) for your bibliographies, this is quite a useful tool. With it you can:

  • Save references to the papers/websites you have read all in one place
  • Save all the bibliography-related information in a standard, widely used format
  • Tag publications: this doesn't only mean you can easily find them later, but also that you can share them with others, find new ones, find new people interested in the same stuff and so on (soon: an article about tags and folksonomies, I promise)
  • Easily export publication lists in BibTeX format, ready to include in your papers

This export feature is particularly useful: you can manage your bibliographies directly on the website and then export them (or a part of them) in a few clicks. Even better: you can do that without clicking at all! Try this:

wget "http://www.bibsonomy.org/bib/user/deynard?items=100" -O bibs.bib

You've just downloaded my whole collection into a single "bibs.bib" file, ready to be included in your LaTeX document. The items parameter allows you to specify how many publications you want to download: yes, it sounds silly (why would you want only the first n items?), but the point is simply to set it high enough to get the whole collection.
Of course, if you like, you can download only the items tagged as "tagging" with this URL:

http://www.bibsonomy.org/bib/user/deynard/tagging

You can combine tags, as in this URL:

http://www.bibsonomy.org/bib/user/deynard/tagging+ontology

And of course, you can replace wget with lynx:

lynx -dont_wrap_pre -dump "http://www.bibsonomy.org/bib/user/deynard?items=100" > bibs2.bib

Well, that's all for now. If you happen to find any interesting hacks lemme know ;)

Filed under: research, tagging, web