
5 Jun 2012

Octave clustering demo part 1: k-means

[This post is part of the Octave clustering demo series]

So, if you are here you have probably already unpacked the .tgz file containing the demo and run your first experiments with Octave scripts. Great! This post describes the first two demo files in detail: kMeansDemo1 and kMeansDemo2.

Try to run kMeansDemo1. You should execute it as kMeansDemo1(dataset), where dataset is the path to one of the .mat files saved in the data directory. For instance, supposing your current working directory is the one where you unpacked the .tgz file, you could run:

kMeansDemo1('./data/blobs01.mat')

Now, all the demos are interactive to some extent: they plot stuff, comment on the results in the main window, and ask you to press a key to continue. Every demo can also be made non-interactive by setting the variable "interactive" in the script to 0. Here are the main operations the first demo performs:

  • it plots the data, showing the "real" clusters (i.e. the ground truth, not the detected ones). Note that every dataset provided contains the points (variable X), the ground truth (variable classes), and the number of clusters (variable clusters);
  • it runs the k-means algorithm three times, trying to identify clusters in the data. The k-means code comes from an external page, which also shows an example of how it can be called (thanks Dan for sharing!); a minimal sketch of the algorithm follows this list;
  • every time the algorithm ends, its accuracy is computed and shown together with the clustering results. Note that bad runs are expected: in some cases you simply cannot reach 100% accuracy with plain k-means.
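
For readers who want to see the moving parts, here is a minimal k-means sketch in Octave. It is not the myKmeans.m shipped with the demo (the function name and signature are mine), just an illustration of the basic loop with purely random centroid initialization:

  function [labels, centroids] = simpleKmeans(X, k, maxIter)
    % X: n-by-d data matrix; k: number of clusters to find
    if nargin < 3, maxIter = 100; end
    n = rows(X);
    % purely random initialization inside the bounding box of the data
    mins = min(X); maxs = max(X);
    centroids = rand(k, columns(X)) .* (maxs - mins) + mins;
    labels = zeros(n, 1);
    for iter = 1:maxIter
      % assignment step: each point goes to its nearest centroid
      for i = 1:n
        [~, labels(i)] = min(sumsq(centroids - X(i,:), 2));
      end
      % update step: move each centroid to the mean of its points
      oldCentroids = centroids;
      for j = 1:k
        pts = X(labels == j, :);
        if ~isempty(pts)
          centroids(j,:) = mean(pts, 1);
        end
      end
      if isequal(oldCentroids, centroids), break; end  % converged
    end
  end

Running it twice on the same dataset, e.g. as [labels, centroids] = simpleKmeans(X, clusters), will often give different results: a bad random initialization can trap the algorithm in a poor local minimum, which is exactly what the demo lets you observe.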

After you run the full example, try to answer the following questions:

  1. Were the results the same? If not, can you tell why? Feel free to open myKmeans.m to better understand how it works.
  2. During the class we said that you can initialize k-means by just choosing random positions for the centroids, or by using some smarter heuristic (e.g. using actual data points as initial centroids, choosing them as far apart from each other as possible, and so on). Does myKmeans.m use any of these heuristics? (A sketch of one such heuristic follows this list.)
  3. How would you modify the algorithm to obtain better/more stable results?
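
As an illustration of the second heuristic mentioned above, here is a farthest-first initialization sketch (my own code, not part of the demo): the first centroid is a random data point, and each following centroid is the point farthest from all the centroids chosen so far.

  function centroids = farthestFirstInit(X, k)
    % Farthest-first traversal: pick initial centroids among actual data points.
    centroids = X(randi(rows(X)), :);      % first centroid: a random point
    for j = 2:k
      % distance of every point to its nearest already-chosen centroid
      d = inf(rows(X), 1);
      for c = 1:rows(centroids)
        d = min(d, sumsq(X - centroids(c,:), 2));
      end
      [~, idx] = max(d);                   % farthest point becomes the next centroid
      centroids = [centroids; X(idx, :)];
    end
  end

Note that this particular heuristic is sensitive to outliers; k-means++ softens it by sampling the next centroid with probability proportional to the squared distance instead of always taking the farthest point.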

The second demo, kMeansDemo2, partially answers the last question by providing a way to address the random centroid initialization problem. The approach is still non-deterministic, but it gives you much better chances of getting good results. The main idea is: find a way to evaluate how well k-means performed, then run it many times and keep only the best result.

A pretty common way to evaluate the performance of a clustering algorithm is the SSE (Sum of Squared Errors), that is the sum of the squared distances between each point in the dataset and the centroid of the cluster it has been assigned to. What we want are clusters which are, at the same time, tight (i.e. points are close to their centroids) and far apart from each other. SSE mainly captures the first of these two characteristics: the lower the SSE, the better the clustering.
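
In Octave, the SSE of a clustering can be computed along these lines (again a sketch with my own names, not the demo's code):

  function sse = clusteringSSE(X, labels, centroids)
    % Sum, over all points, of the squared distance to the assigned centroid.
    sse = 0;
    for j = 1:rows(centroids)
      pts = X(labels == j, :);             % points assigned to cluster j
      sse = sse + sum(sumsq(pts - centroids(j,:), 2));
    end
  end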

So, the demo does the following:

  • it runs k-means 10 times
  • each time, it calculates the SSE for the whole clustering
  • it keeps the lowest SSE (and matching cluster labels) as the best result, then shows it to the user
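
Put together, the best-of-N restart pattern looks roughly like this (a sketch assuming the simpleKmeans and clusteringSSE functions above, with X and clusters loaded from one of the .mat files):

  bestSSE = Inf;                           % any real SSE will beat this
  for run = 1:10
    [labels, centroids] = simpleKmeans(X, clusters);
    sse = clusteringSSE(X, labels, centroids);
    if sse < bestSSE                       % keep the lowest-SSE run so far
      bestSSE = sse;
      bestLabels = labels;
    end
  end

Initializing bestSSE to Inf guarantees that the very first run is always accepted, a detail worth keeping in mind for question 2 below.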

After running the demo with different datasets, try to answer the following questions:

  1. Does the quality of the algorithm only depend on the number of repetitions, i.e. is it always possible to get 100% accuracy by running k-means a sufficiently high number of times?
  2. Is the initialization of the bestSSE variable correct in kMeansDemo2 code? Does the code work fine for all the examples?
  3. Why didn't we directly use the accuracy function to evaluate the algorithm's performance, checking the correctness of the assigned labels with respect to the ground truth?
  4. Now that we get more stable results, compare the plots of two clusterings obtained in different executions. Are the labels (i.e. the colors in the plot) the same from one execution to the other? If not, does that matter? How are they evaluated against the ground truth?