mala::home Davide “+mala” Eynard’s website

30Dec/130

Octave clustering demo part 6: (more) evaluation

[This post is part of the Octave clustering demo series]

This new clustering demo goes deeper into evaluation, showing some techniques to get an idea about the clustering tendency of a dataset, find visual clues about the clusters, and get a hint about the "best" number of clusters. If you have attended the PAMI class this year you probably know what I am talking about. If instead you are new to these demos please (1) check the page linked above and (2) set up your Octave environment (feel free to download the old package and play with the old demos too ;-)). Then download this new package.

If you want to check out the older Octave clustering articles, here they are: part 0 - part 1 - part 2 - part 3 - part 4 - part 5. I strongly suggest you to run at least parts 0-3 before this demo, as they provide you all the basics you need to get the best out of this exercise.

Note that some of the functions available in previous packages are also present in this one. While it is ok to run the previous examples with those functions, make sure you are using the most up-to-date ones for the current experiments, as I debugged and optimized them for this specific demo.

Run the evaluationDemo1 script. This will generate a random dataset and walk you through a sequence of tests to evaluate the clustering tendency and ultimately perform clustering on it. As random clusters might be more or less nasty to cluster, I suggest you to try running the demo few times and see how it behaves in general. Note that part of the code is missing and you will have to provide it for the demo to be complete.

When you feel confident enough with evaluating the random dataset, run the evaluationDemo2 script. This will be a little more interactive, asking you to choose, between three datasets, the one which has the most "interesting" content, and requiring you to write small pieces of code to complete your evaluation.

At the end of your experiment, answer the following questions:

  1. comment the different executions of evaluationDemo1. How sensitive to "bad" overlapping clusters are SSE elbow and distance matrix plot?  Does the presence of overlapping clusters affect the clustering tendency test? How do you think it would be possible (and in this case, would it be meaningful) to distinguish between two completely overlapping clusters?
  2. comment the different executions of evaluationDemo2. Which is the "interesting" dataset? Why? Is the SSE elbow method useful to automatically detect the number of clusters? Why? What additional information does the distance matrix convey? Is Spectral clustering better than k-Means? Did you happen to find parameters that give better accuracy than the default ones?

If you are a PAMI student, please write your answers in a pdf file, motivating them and providing all the material (images, code, numerical results) needed to support them.

 

Hints:

  • the demo stops at each step waiting for you to press a key. You can disable this feature by setting the "interactive" variable to zero at the beginning of the script;
  • the second demo file has some "return" commands purposely left in the code to stop the execution at given points (typically when you need to add some code before proceeding, or to increase the suspance ;-));
  • I tested the code on Octave (3.6.4) and MATLAB (2011a) and it runs on both. If you still have problems please let me know ASAP and we will sort them out.