Paper review: A Tutorial on Spectral Clustering
Lately I am experimenting with spectral clustering. I find it a very interesting approach (well, family of) to clustering and I think that the paper that most helped me to have a grasp of it was "A Tutorial on Spectral Clustering", by Ulrike von Luxburg. In case you are curious about it, here is its abstract:
"In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. On the first glance spectral clustering appears slightly mysterious, and it is not obvious to see why it works at all and what it really does. The goal of this tutorial is to give some intuition on those questions. We describe different graph Laplacians and their basic properties, present the most common spectral clustering algorithms, and derive those algorithms from scratch by several different approaches. Advantages and disadvantages of the different spectral clustering algorithms are discussed".
Of course this is just a beginning, and if you are interested in the topic I would also suggest you to read at least another couple of papers such as Ng, Jordan, and Weiss: "On spectral clustering: analysis and an algorithm" or "Laplacian eigenmaps for dimensionality reduction and data representation" by Belkin and Niyogi.
While reading von Luxburg's paper, I took some notes that might be handy if you want to have a brief summary of the main concepts or explain them to somebody else. I actually reused them for a class in PoliMI, empirically demonstrating that in research nothing is unuseful ;-) Here are the slides I made, enjoy!
Octave clustering demo part 2: cluster evaluation
[This post is part of the Octave clustering demo series]
So, if you read this you probably have already unpacked the tgz file containing the demo, read the previous article, and ran demos 1 and 2. This post describes in detail the third demo, devoted to cluster evaluation.
In the previous article you saw how you can use an internal evaluation index to get better results from k-means. The main difference between internal and external indices is that the former (e.g. SSE) do not require additional information (such as the original labeling of data points), while the latter (e.g. accuracy) typically rely on that. SSE was used to evaluate the quality of different k-means executions, then the result providing the lowest SSE was considered the best one.
All of this was done supposing we already knew the correct number of clusters in the dataset (the variable clusters we used was actually provided, together with the data points and their classes). In the real world, however, this is not always true: you will typically lack not only the original classification (which is obvious, otherwise you would not need to run a clustering algorithm!), but also the best number of groups you want to divide your data into. So, you will have to find it... But what does best mean, in this case?
In kMeansDemo3 we employ the same SSE function to compare different executions of k-means. These are the main steps performed by the script:
- for k varying between 2 and 10
- run k-means 10 times (to address the random initialization problem, by choosing the lowest-SSE clustering)
- save the best (lowest) SSE and the matching clustering result
- plot the best SSE values for k=2, 3, ..., 10
- allow the user to decide which k is best, then plot that result and show both its SSE and its accuracy.
… Yes, there is nothing automatic in this :-). The reason is that I wanted you to get a grasp of how SSE changes with k and how to interpret it. This is strongly connected to what I wrote early about what best means in this case.
If you look at the SSE plot, you will see that it always decreases with increasing values of k. Which is reasonable, if you think that by increasing the number of clusters you will have points which are closer and closer to their centroids, up to the case when every point is the centroid of its own 1-point cluster and SSE becomes zero. This is totally different from the case in which the number of clusters was given and we just had to take the lowest-SSE results. How can we choose, then, the best k?
By looking again at the SSE plot, you will notice that the value of SSE always decreases, but for some values of k the change is bigger than for others. To find the best k, what we should look for is where the decrease changes most, that is the "elbow" inside the plot where the difference in change is maximum. Of course, this is quite easy to spot visually but not always trivial in terms of calculations. Some code that roughly does the trick follows:
diffs = [sse(2:end) 0]-sse; % calculates differences in SSE between each (k,k-1) pair ratios = [diffs(2:end) 0]./diffs; % calculate ratios between differences k = find(ratios == min(ratios(ratios>0))) + 1; % get lowest ratio (constrained to be >0 to ignore case k=1 and k=max)
... however, there are still cases (even in the provided datasets) where it does not perform well, especially when k-means is not the best algorithm for a given dataset. Run this demo both with blobs and circles/crescents datasets and try to answer the following questions:
- were you always able to calculate the correct number of clusters using the code provided above?
- were you always able to visually detect the correct number of clusters by looking at the SSE plot?
- was the value of k you estimated confirmed by the final plot/the accuracy calculated against the ground truth?
- how would you fix the code provided above to automatically estimate k even in the "blobs03" example? (note: if the code already gives you the correct k do not worry, just skip this question (see comments below ;)).
- how would you fix the code provided above to automatically estimate k in the circles/crescents examples?
This time I really wrote a lot of questions :) Don't worry, some of them will be (partially) answered in the next tutorial!
Octave clustering demo part 1: k-means
[This post is part of the Octave clustering demo series]
So, if you are here you probably have already unpacked the tgz file containing the demo and done your first experiments trying to run Octave scripts. Great! This post describes in detail the first two demo files: kMeansDemo1 and kMeansDemo2.
Try to run kMeansDemo1. You should execute it as kMeansDemo1(dataset), where dataset is one of the .mat files saved in the data directory. For instance, supposing your current working directory is the one where you unpacked the .tgz file, you could run:
kMeansDemo1('./data/blobs01.mat')
Now, all the demos are somehow interactive, that is they will plot stuff, comment the results in the main window and ask you to press a key to continue. All the demos have the possibility of becoming non-interactive: this is done by setting the variable "interactive" in the scripts to 0. Here are the main operations that the first demo performs:
- it plots data, so that "real" clusters (i.e. the ground truth, not the detected ones) are shown. Note that for all the datasets provided you have points (variable X), ground truth (variable classes), and number of clusters (variable clusters);
- for three times, it runs the k-means algorithm trying to identify clusters in data. This is the original page where k-means code comes from, and here is an example of how it can be called (thanks Dan for sharing!);
- every time the algorithm ends, its accuracy is calculated and shown together with the clustering results. Note that bad behaviors are foreseen and in some cases you just cannot reach 100% accuracy with a simple k-means.
After you run the full example, try to answer the following questions:
- Were the results the same? If not, can you tell why? Feel free to open myKmeans.m to better understand how it works.
- During the class we said that you can initialize k-means by just choosing random positions for centroids or by using some smarter heuristics (e.g. using actual data points for initial centroids, choosing them as far as possible from each other, and so on). Does myKmeans.m use these kind of heuristics?
- How would you modify the algorithm to obtain better/more stable results?
The second demo, kMeansDemo2, partially answers the last question, providing a way to address the random centroid initialization problem. Still, this remains non-deterministic, however there are many more chances to have better results by following its approach. The main idea is: find a way to evaluate how well k-means performed, then run it many times and keep only the best results.
A pretty common way to evaluate clustering algorithm performances is to use SSE (Sum of Squared Errors), that is the sum of all the (squared) distances between points in the dataset and the centroids of the clusters they have been assigned to. What we want are clusters which are, at the same time, tight (i.e. where points are near to their centroids) and far apart from each other. SSE mainly deals with the first of these two characteristic, returning results which should be as low as possible for a clustering to be good. So, the demo does the following:
- it runs k-means 10 times
- each time, it calculates the SSE for the whole clustering
- it keeps the lowest SSE (and matching cluster labels) as the best result, then shows it to the user
After running the demo with different datasets, try to answer the following questions:
- Does the quality of the algorithm only depend on the number of repetitions, i.e. is it always possible to get 100% accuracy by running k-means a sufficiently high number of times?
- Is the initialization of the bestSSE variable correct in kMeansDemo2 code? Does the code work fine for all the examples?
- Why didn't we directly use the accuracy function to evaluate the algorithm performances, checking the correctness of the assigned labels with respect to ground truth?
- Now that we get stabler results, compare the plots of two clusterings obtained in different executions. Are labels (i.e. colors in the plot) equal from one execution to the other? If not, does that matter? How are they evaluated against ground truth?
Octave clustering demo part 0: introduction and setup
As promised to my PAMI students, I have prepared a couple of demos to get a better grasp of how different clustering algorithms work. Hoping they could be useful for somebody else (and sure that this way I will not lose them so easily ;-)) I have decided to post more information about them here.
All the demos are in one single file and can be downloaded from the course page together with slides and other material (or if you only want the demo file follow this direct link). Once unpacked, you will find the following:
- data/*.mat (some example datasets)
- accuracy.m (calculates clustering accuracy)
- adjacency_weighted.m (calculates the weighted adjacency matrix used by spectral clustering)
- kMeansDemo*.m (k-means demos)
- L2_distance.m (calculates L2 distance between two vectors/matrices)
- laplacian.m (builds the Laplacian used by spectral clustering)
- myKmeans.m (performs k-means clustering)
- plotClustering.m (given clustered data and ground truth, plots clustering results)
- repKmeans.m (automatically repeats k-means ntimes and keeps the result with lowest SSE)
- spectralDemo.m (spectral clustering demo)
- SSE.m (given clustered data and centroids, calculates total SSE)
- uAveragedAccuracy.m (another way to calculate clustering accuracy)
To run the code just open Octave, cd into the spectralClustering directory, and call the different functions. The demo files are executed in the following ways:
kMeansDemo1(dataset); - or - spectralDemo(dataset,nn,t);
where dataset is one of the files in the data directory, nn is the number of nearest neighbors for the calculation of the adjacency matrix, and t is the parameter for the Gaussian kernel used to calculate similarities in the weighted adjacency matrix (0 can be used for auto-tuning and usually works fine enough). For example:
kMeansDemo1('./data/blobs03.mat'); - or - spectralDemo('./data/circles02.mat',10,0);
Good values of nn for these examples range between 10 and 40... I will let you experiment which ones are better for each dataset ;-). Now feel free to play with the demos or read the following tutorials:
- Octave clustering demo part 1: k-means
- Octave clustering demo part 2: cluster evaluation
- Octave clustering demo part 3: spectral clustering
- Octave clustering demo part 4: k-medoids
- Octave clustering demo part 5: hierarchical clustering
- Octave clustering demo part 6: (more) evaluation
New (old) paper: Destinations Similarity Based on User Generated Pictures’ Tags
[This is post number 1 of the "2012 publications" series. Read here if you want to know more about this]
I have posted a new publication in the Research page:
Alessandro Inversini, Davide Eynard, Leonardo Gentile, and Marchiori Elena (2012). Destinations Similarity Based on User Generated Pictures' Tags.
"Pictures about tourism destinations are part of the contents shared online through social media
by travelers. Additional picture information, such as geo-tags and user description of a place,
can be used to create groups of similar destinations. This paper investigates the possibility of
defining destination similarities based on implicit information already shared on the Web.
Flickr.com was used as a case study as it represents the most popular picture sharing website.
Results show the possibility to group similar destinations based on visual components,
represented by the contents of the pictures, and the related tag descriptions".
2012 publications series
I have recently attended the "Promoting your academic profile on the Web" at USI. It was a good workshop (thanks Lorenzo and Nadzeya!) and allowed me to get an idea about how well I am communicating who I am and what I do to the world. As you might know already, I think this is pretty important if your aim is to make your research (or in general your work) open and accessible to everyone.
The introspection that followed the workshop made me realize how often I put doing stuff before communicating it. I mean, of course I do that when I publish a paper, but too often I do not let others know that the paper I wrote exists! For instance, I realized that as of today I had not updated my publication list with any of the work published in 2012... D'oh :-)
For this reason I have decided not only to update that list, but also to write one post for each new paper, making the abstract and additional material available. In this post, instead, I will keep an updated list of links to this year's publications, so you can just watch this page and see when something new has been posted.
2012 publication series:
- Alessandro Inversini, Davide Eynard, Leonardo Gentile, and Marchiori Elena (2012). Destinations Similarity Based on User Generated Pictures' Tags.
- Davide Eynard, Matteo Matteucci, and Fabio Marfia (2012).A Modular Framework to Learn Seed Ontologies from Text.
- Davide Eynard, Alessandro Inversini, and Leonardo Gentile (2012). Finding similar destinations with Flickr Geotags.
- Alessandro Inversini, Davide Eynard. Harvesting User Generated Picture Metadata To Understand Destination Similarity.
- Davide Eynard, Luca Mazzola, and Antonina Dattolo. Exploiting tag similarities to discover synonyms and homonyms in folksonomies.
Internet Technology (2011-2012) assignments are online
After one year here is another update regarding my Internet Technology class (see here for last year's update). Unfortunately it will also be the last one, at least for the class as it is now, because the master I was teaching this class for has been closed :-/. But hey, there are many ways in which knowledge can be shared and that master was only one, right?
So here they are, the new papers written by my dear students! This year fewer have been shared, but I think their quality kind of compensates the amount. So do not worry if you cannot access all of them and enjoy the fact that the ones you can read are willingly shared by students with a CC BY-NC-SA license :-) If you are interested in any of the topics let me know and I might try to put you in contact with the authors.
New year, new you
... starting from the blog theme. Of course I have just downloaded a ready made one, otherwise with my taste you would have probably gotten something painful for your eyes ;-)
New year's resolutions? Plenty. But after last year's ones, my main resolution is no promises :-). And no creativity-killer posts: I'll try to stay far away from those topics I know will stop me from writing instead of incentivating me. I'll try to make this fun and useful, first of all for me. And if you find something useful here too, well, good for you ;-)
Fist post of the year, first after a long while... And to leave you with some more food for thought than the one you would have just by reading news about my wordpress themes, here you are:
Alon, Uri: "How To Choose a Good Scientific Problem". Molecular cell doi:10.1016/j.molcel.2009.09.013 (volume 35 issue 6 pp.726 - 728).
Here's the abstract:
"Choosing good problems is essential for being a good scientist. But what is a good problem, and how do you choose one? The subject is not usually discussed explicitly within our profession. Scientists are expected to be smart enough to figure it out on their own and through the observation of their teachers. This lack of explicit discussion leaves a vacuum that can lead to approaches such as choosing problems that can give results that merit publication in valued journals, resulting in a job and tenure."
I found the paper very inspiring and I agreed with most of it. Here are few sentences I particularly liked:
- "A lab is a nurturing environment that aims to maximize the potential of students as scientists and as human beings."
- "The projects that a particular researcher finds interesting are an expression of a personal filter, a way of perceiving the world. This filter is associated with a set of values: the beliefs of what is good, beautiful, and true versus what is bad, ugly, and false."
- "... when one can achieve self-expression in science, work becomes revitalizing, self- driven, and laden with personal meaning."
What do you think about it? I think that this self-expression, this possibility of projecting my personal values in my work is one of the main reasons I have chosen to do it. Of course, this is also constraining me somehow: what happens when I work with others? What if there is a clash of values between me and my collaborators? Finally, one last big question arises: how much is this applicable for other job? Is there a chance for everyone to achieve this self-expression or only for someone? What about those who can't?
Ok, enough food for today ;-) One last link, which you might find interesting if you liked this paper too: Uri Alon Lab homepage, where you can find more materials for nurturing scientists.
Take care, have a great 2012!
Gopher is dead, long live gopher
Some months ago, while preparing a lesson for my Internet Technology class, I was doing some research on old protocols just to give my students the feeling about how the Internet used to work some years ago, and how it is much, much more than just the Web. During this research I found some interesting information about good old Gopher and how to run it from android devices... Hey, wait, did I say android devices? Yep! The Overbite project aims at bringing gopher to the most recent devices and browsers. As Firefox &co. from their latest versions have stopped supporting Gopher (shame on you), guys from the Overbite project have also decided to develop a browser extension to let you continue browsing it (well, if you ever did it before ;)).
What struck my mind is not the piece of news per se, but the fact that there was a community (I thought a very small one) that was still interested in letting people access the gopherspace... To find what? So I spent some time (probably not enough, but still more than I planned) browsing it myself and checking what is available right now...
What I found is that the gopherspace, or at least the small part of it I was able to read in that limited time, is surprisingly up-to-date: there are news feeds, weather forecasts, reddit, xkcd, and even twitter (here called twitpher :)). Of course, however, there are also those files I used to find 15 years ago when browsing the early web from a text terminal with lynx: guides for n00bs about the Internet, hacking tutorials, ebooks I did not know about (the one I like most is Albert Einstein's theory of relativity in words of four letters or less, that you can find here in HTML). Moreover, there's still people willing to use Gopher as a mean to share their stuff, or to provide useful services for other users: FTP servers (built on a cluster of Playstation 3 consoles... awesome!) with collections of rare operating systems, LUG servers with mirrors of all the main Linux distros, pages distributing podcasts and blogs (well, phlogs!). Ah, and for all those who don't want to install anything to access gopher there's always the GopherProxy service that can be accessed using any browser.
After seeing all of this, one word came into my mind: WHY? Don't misunderstand me, I think all of this is really cool and an interesting phenomenon to follow and I really love to see all of these people still having incentives in using this old technology. And it is great to see that the incentives, in this case, are not the usual ones you might find in a participative system. I mean, what's one of the main incentives in using Wikipedia? Well, the fact that lots of people will read what you have written (if you don't agree, think about how many people would create new articles in a wiki which is not already as famous as Wikipedia). And how many readers is a page from the Gopherspace going to have? Well, probably not as many as any popular site you can find around the Web. But Gopher, mainly relying on text files, has a very light protocol which is superfast (and cheap!) on a mobile phone. It has no ads. It adds no fuss to the real, interesting content you want to convey to your readers. And quoting the words of lostnbronx from Information Underground:
"... I tell you, there's something very liberating about not having to worry over "themes" or "Web formatting" or whatever. When you use gopher, you drop your files onto the server, maybe add a notation to a gophermap if you're using one (which is purely optional), and...that's it. No muss, no fuss, no dicking around with CMS, CSS, stylesheets, or even HTML. Unless you want to. Which I don't. It defeats the purpose, see?"
Aaahh... so much time passed since the last time I have heard such wise words... It is like coming back to my good old 356* and listening to its +players! Let me tell you this, I like these ideas and I am so happy to see this new old Gopher still looks so far from being trendy... Because this means that a lot of time will need to pass before commercial idiots start polluting it! And in the meanwhile, it will be nice to have a place where information can be exchanged in a simple and unexpensive way. Maybe we in the richest part of the world do not realize it, but there are still many places where older but effective technologies are widely used (some examples? Check this one about Nokia most popular phone, and read why we still have USENET), and if something like Gopher could be a solution in this case, well... long live Gopher :-)
Another year passed
Dear +friend,
fortunately, from time to time, your words come back to me:
"There's no state of sadness that cannot be cured with some good prosciutto crudo"
... still, I miss you a lot.
+mala