
5 Jun 2012

Octave clustering demo part 1: k-means

[This post is part of the Octave clustering demo series]

So, if you are here, you have probably already unpacked the tgz file containing the demo and done your first experiments running Octave scripts. Great! This post describes in detail the first two demo files: kMeansDemo1 and kMeansDemo2.

Try to run kMeansDemo1. You should execute it as kMeansDemo1(dataset), where dataset is one of the .mat files saved in the data directory. For instance, supposing your current working directory is the one where you unpacked the .tgz file, you could run:

kMeansDemo1('./data/blobs01.mat')

Now, all the demos are interactive: they plot data, comment on the results in the main window, and ask you to press a key to continue. You can make any of them non-interactive by setting the variable "interactive" in the scripts to 0. Here are the main operations that the first demo performs:

  • it plots the data, showing the "real" clusters (i.e. the ground truth, not the detected ones). Note that every dataset provided contains the points (variable X), the ground truth (variable classes), and the number of clusters (variable clusters);
  • it runs the k-means algorithm three times, trying to identify clusters in the data. This is the original page the k-means code comes from, and here is an example of how it can be called (thanks Dan for sharing!);
  • every time the algorithm ends, its accuracy is calculated and shown together with the clustering results. Note that bad outcomes are expected: in some cases you simply cannot reach 100% accuracy with plain k-means. A minimal sketch of this flow follows the list.
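Here is that sketch, assuming myKmeans returns the computed labels and centroids, accuracy compares detected labels against the ground truth, and the points are 2-D; the actual signatures in the demo files may differ:

    % load a dataset: provides X (points), classes (ground truth), clusters (k)
    load('./data/blobs01.mat');
    scatter(X(:,1), X(:,2), 10, classes);            % plot the "real" clusters (2-D data assumed)
    for run = 1:3
      [labels, centroids] = myKmeans(X, clusters);   % assumed signature
      acc = accuracy(classes, labels);               % assumed signature
      printf('run %d: accuracy = %.2f%%\n', run, 100 * acc);
    end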

After you run the full example, try to answer the following questions:

  1. Were the results of the three runs the same? If not, can you tell why? Feel free to open myKmeans.m to better understand how it works.
  2. During the class we said that you can initialize k-means by just choosing random positions for the centroids or by using some smarter heuristic (e.g. using actual data points as initial centroids, choosing them as far as possible from each other, and so on). Does myKmeans.m use any such heuristic? (A sketch of one appears right after this list.)
  3. How would you modify the algorithm to obtain better/more stable results?
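Regarding question 2, here is a hedged sketch of one such heuristic, farthest-first initialization: every new centroid is an actual data point, chosen as far as possible from the centroids picked so far. This is not necessarily what myKmeans.m does, and it assumes the bundled L2_distance takes matrices with one column per point:

    % farthest-first initialization (a hypothetical helper, for illustration)
    function centroids = farthestFirstInit(X, k)
      centroids = X(randi(rows(X)), :);     % first centroid: a random data point
      for j = 2:k
        D = L2_distance(X', centroids');    % n x (j-1) point-to-centroid distances
        [~, idx] = max(min(D, [], 2));      % point farthest from its nearest centroid
        centroids(j, :) = X(idx, :);
      end
    end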

The second demo, kMeansDemo2, partially answers the last question, providing a way to address the random centroid initialization problem. The approach is still non-deterministic, but it greatly increases the chances of getting good results. The main idea is: find a way to evaluate how well k-means performed, then run it many times and keep only the best result.

A pretty common way to evaluate the performance of a clustering algorithm is SSE (Sum of Squared Errors), that is the sum of all the squared distances between the points in the dataset and the centroids of the clusters they have been assigned to. What we want are clusters which are, at the same time, tight (i.e. where points are near their centroids) and far apart from each other. SSE mainly captures the first of these two characteristics: the lower it is, the better the clustering. So, the demo does the following:

  • it runs k-means 10 times
  • each time, it calculates the SSE for the whole clustering
  • it keeps the lowest SSE (and the matching cluster labels) as the best result, then shows it to the user (a sketch of this procedure follows the list)
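Here is that sketch, under the same assumed signatures as before; the bundled SSE.m and repKmeans.m do these jobs, possibly with different interfaces:

    % total SSE of a clustering: sum of squared point-to-centroid distances
    function total = clusteringSSE(X, labels, centroids)
      total = 0;
      for k = 1:rows(centroids)
        diffs = X(labels == k, :) - centroids(k, :);   % broadcasting
        total += sum(sumsq(diffs, 2));
      end
    end

    bestSSE = Inf;    % initialize so that any real SSE will beat it
    for run = 1:10
      [labels, centroids] = myKmeans(X, clusters);
      s = clusteringSSE(X, labels, centroids);
      if s < bestSSE
        bestSSE = s;
        bestLabels = labels;
      end
    end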

After running the demo with different datasets, try to answer the following questions:

  1. Does the quality of the algorithm only depend on the number of repetitions, i.e. is it always possible to get 100% accuracy by running k-means a sufficiently high number of times?
  2. Is the initialization of the bestSSE variable correct in kMeansDemo2 code? Does the code work fine for all the examples?
  3. Why didn't we directly use the accuracy function to evaluate the algorithm's performance, checking the correctness of the assigned labels with respect to the ground truth?
  4. Now that we get more stable results, compare the plots of two clusterings obtained in different executions. Are the labels (i.e. the colors in the plot) the same from one execution to the other? If not, does that matter? How are they evaluated against the ground truth?
5 Jun 2012

Octave clustering demo part 0: introduction and setup

As promised to my PAMI students, I have prepared a couple of demos to help get a better grasp of how different clustering algorithms work. Hoping they could be useful to somebody else (and sure that this way I will not lose them so easily ;-)), I have decided to post more information about them here.

All the demos are in one single archive and can be downloaded from the course page together with slides and other material (or, if you only want the demo file, follow this direct link). Once unpacked, you will find the following:

  • data/*.mat (some example datasets)
  • accuracy.m (calculates clustering accuracy)
  • adjacency_weighted.m (calculates the weighted adjacency matrix used by spectral clustering)
  • kMeansDemo*.m (k-means demos)
  • L2_distance.m (calculates L2 distance between two vectors/matrices)
  • laplacian.m (builds the Laplacian used by spectral clustering)
  • myKmeans.m (performs k-means clustering)
  • plotClustering.m (given clustered data and ground truth, plots clustering results)
  • repKmeans.m (automatically repeats k-means ntimes and keeps the result with lowest SSE)
  • spectralDemo.m (spectral clustering demo)
  • SSE.m (given clustered data and centroids, calculates total SSE)
  • uAveragedAccuracy.m (another way to calculate clustering accuracy)
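To give an idea of how these pieces fit together, here is a hedged sketch of the typical spectral clustering pipeline they implement; the actual scripts may differ in signatures and normalization details:

    load(dataset);                       % provides X, classes, clusters
    W = adjacency_weighted(X, nn, t);    % assumed signature: weighted kNN graph
    L = laplacian(W);                    % assumed signature: graph Laplacian
    [V, D] = eig(L);                     % eigendecomposition of the Laplacian
    [~, order] = sort(diag(D));          % smallest eigenvalues first
    U = V(:, order(1:clusters));         % spectral embedding of the points
    labels = myKmeans(U, clusters);      % cluster in the embedded space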

To run the code just open Octave, cd into the spectralClustering directory, and call the different functions. The demo files are executed in the following ways:

    kMeansDemo1(dataset);
    - or -
    spectralDemo(dataset,nn,t);

where dataset is one of the files in the data directory, nn is the number of nearest neighbors used to build the adjacency matrix, and t is the parameter of the Gaussian kernel used to calculate similarities in the weighted adjacency matrix (0 enables auto-tuning and usually works well enough). For example:

    kMeansDemo1('./data/blobs03.mat');
    - or -
    spectralDemo('./data/circles02.mat',10,0);
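To make the roles of nn and t more concrete, here is a rough sketch of how a weighted adjacency matrix of this kind is typically built; the bundled adjacency_weighted.m may differ, for instance in how auto-tuning replaces t = 0:

    D = L2_distance(X', X');             % n x n pairwise distances
    W = exp(-D.^2 ./ (2 * t^2));         % Gaussian similarities (assumes t > 0;
                                         % auto-tuning would pick a value here)
    [~, order] = sort(D, 2);             % each row: neighbors sorted by distance
    mask = zeros(size(W));
    for i = 1:rows(W)
      mask(i, order(i, 2:nn+1)) = 1;     % keep the nn nearest (skip self)
    end
    mask = max(mask, mask');             % symmetrize the graph
    W = W .* mask;                       % sparsified weighted adjacency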

Good values of nn for these examples range between 10 and 40... I will let you experiment to find which ones work better for each dataset ;-). Now feel free to play with the demos or read the following tutorials:

17 Feb 2012

Internet Technology (2011-2012) assignments are online

After one year, here is another update regarding my Internet Technology class (see here for last year's update). Unfortunately it will also be the last one, at least for the class as it is now, because the master's program I was teaching this class for has been closed :-/. But hey, there are many ways in which knowledge can be shared, and that program was only one of them, right?

So here they are, the new papers written by my dear students! This year fewer have been shared, but I think their quality kind of compensates for the amount. So do not worry if you cannot access all of them, and enjoy the fact that the ones you can read are willingly shared by the students under a CC BY-NC-SA license :-) If you are interested in any of the topics, let me know and I might try to put you in contact with the authors.

6 Jun 2011

Gopher is dead, long live gopher

Some months ago, while preparing a lesson for my Internet Technology class, I was doing some research on old protocols just to give my students a feeling for how the Internet used to work some years ago, and how it is much, much more than just the Web. During this research I found some interesting information about good old Gopher and how to run it from Android devices... Hey, wait, did I say Android devices? Yep! The Overbite project aims at bringing Gopher to the most recent devices and browsers. As Firefox & co. have stopped supporting Gopher in their latest versions (shame on you), the folks from the Overbite project have also decided to develop a browser extension to let you continue browsing it (well, if you ever did it before ;)).

What struck me is not the piece of news per se, but the fact that there was a community (a very small one, I thought) that was still interested in letting people access the gopherspace... To find what? So I spent some time (probably not enough, but still more than I had planned) browsing it myself and checking what is available right now...

What I found is that the gopherspace, or at least the small part of it I was able to read in that limited time, is surprisingly up-to-date: there are news feeds, weather forecasts, reddit, xkcd, and even twitter (here called twitpher :)). Of course, however, there are also those files I used to find 15 years ago when browsing the early web from a text terminal with lynx: guides for n00bs about the Internet, hacking tutorials, ebooks I did not know about (the one I like most is Albert Einstein's theory of relativity in words of four letters or less, which you can find here in HTML). Moreover, there are still people willing to use Gopher as a means to share their stuff, or to provide useful services for other users: FTP servers (built on a cluster of Playstation 3 consoles... awesome!) with collections of rare operating systems, LUG servers with mirrors of all the main Linux distros, pages distributing podcasts and blogs (well, phlogs!). Ah, and for all those who don't want to install anything to access Gopher, there's always the GopherProxy service that can be accessed using any browser.

After seeing all of this, one word came into my mind: WHY? Don't misunderstand me, I think all of this is really cool and an interesting phenomenon to follow, and I really love to see all of these people still having incentives to use this old technology. And it is great to see that the incentives, in this case, are not the usual ones you might find in a participative system. I mean, what is one of the main incentives to use Wikipedia? Well, the fact that lots of people will read what you have written (if you don't agree, think about how many people would create new articles in a wiki which is not already as famous as Wikipedia). And how many readers is a page from the gopherspace going to have? Well, probably not as many as any popular site you can find around the Web. But Gopher, relying mainly on text files, has a very light protocol which is superfast (and cheap!) on a mobile phone. It has no ads. It adds no fuss to the real, interesting content you want to convey to your readers. And quoting the words of lostnbronx from Information Underground:

"... I tell you, there's something very liberating about not having to worry over "themes" or "Web formatting" or whatever. When you use gopher, you drop your files onto the server, maybe add a notation to a gophermap if you're using one (which is purely optional), and...that's it. No muss, no fuss, no dicking around with CMS, CSS, stylesheets, or even HTML. Unless you want to. Which I don't. It defeats the purpose, see?"

Aaahh... so much time has passed since the last time I heard such wise words... It is like coming back to my good old 356* and listening to its +players! Let me tell you this, I like these ideas and I am so happy to see that this new old Gopher still looks so far from being trendy... because this means that a lot of time will have to pass before commercial idiots start polluting it! And in the meanwhile, it will be nice to have a place where information can be exchanged in a simple and inexpensive way. Maybe we in the richest part of the world do not realize it, but there are still many places where older but effective technologies are widely used (some examples? Check this one about Nokia's most popular phone, and read why we still have USENET), and if something like Gopher could be a solution in this case, well... long live Gopher :-)

27 Jan 2011

Internet Technology assignments are online

As some of you might already know, last semester I taught a class called Internet Technology at the University of Lugano (USI). This was a great chance not only to teach students something I was very passionate about (I have been lucky enough to be allowed to choose the topic and the contents of the class, so I basically put everything I liked inside it), but also to clarify in my own mind some concepts and links about the Internet in general, plus Web 1.0, 2.0, 3.0, and following ones ;-).

At the end of the course, I decided not to evaluate my students with a classical exam. I thought that, for a subject that evolves so quickly, it would not have been useful to ask for notions that will become outdated, or in the worst case false, in a few months. I rather wanted to be sure they had grasped the main concepts and acquired the right attitude to deal with Internet-related topics and technologies now and in the future. So I asked them to write a short paper (around 10 pages) about a topic related to the ones we discussed in class. I suggested some topics (you can find them here) just in case there were students without ideas, asked them to communicate their choice to me in advance so I could check that everyone was working on a different topic, and finally gave them enough time to complete their work (and enjoy their Christmas holidays in between!).

I corrected the students' assignments about a week ago and I have to say that I am pretty satisfied with them. As "everything is already online" nowadays, it is very difficult to write something really new, but I think that most of these works provide both online references and some personal insights that, together, help make these documents useful to interested readers. I asked the students if they wanted to share them on the course wiki with a CC license and most of them agreed (yeeeh!). You can find their documents here: their English is not great, but I think you might appreciate the contents anyway ;-)
