## Octave clustering demo part 2: cluster evaluation

[This post is part of the Octave clustering demo series]

So, if you read this you probably have already unpacked the tgz file containing the demo, read the previous article, and ran demos 1 and 2. This post describes in detail the third demo, devoted to cluster evaluation.

In the previous article you saw how you can use an *internal* evaluation index to get better results from k-means. The main difference between internal and external indices is that the former (e.g. SSE) do not require additional information (such as the original labeling of data points), while the latter (e.g. accuracy) typically rely on that. SSE was used to evaluate the quality of different k-means executions, then the result providing the lowest SSE was considered the best one.

All of this was done supposing we already knew the correct number of clusters in the dataset (the variable *clusters* we used was actually provided, together with the data points and their classes). In the real world, however, this is not always true: you will typically lack not only the original classification (which is obvious, otherwise you would not need to run a clustering algorithm!), but also the best number of groups you want to divide your data into. So, you will have to find it... But what does *best* mean, in this case?

In kMeansDemo3 we employ the same SSE function to compare different executions of k-means. These are the main steps performed by the script:

- for k varying between 2 and 10
- run k-means 10 times (to address the random initialization problem, by choosing the lowest-SSE clustering)
- save the best (lowest) SSE and the matching clustering result
- plot the best SSE values for k=2, 3, ..., 10
- allow the user to decide which k is best, then plot that result and show both its SSE and its accuracy.

… Yes, there is nothing automatic in this :-). The reason is that I wanted you to get a grasp of how SSE changes with k and how to interpret it. This is strongly connected to what I wrote early about what *best* means in this case.

If you look at the SSE plot, you will see that *it always decreases with increasing values of k*. Which is reasonable, if you think that by increasing the number of clusters you will have points which are closer and closer to their centroids, up to the case when every point is the centroid of its own 1-point cluster and SSE becomes zero. This is totally different from the case in which the number of clusters was given and we just had to take the lowest-SSE results. How can we choose, then, the best k?

By looking again at the SSE plot, you will notice that the value of SSE always decreases, but for some values of k the change is bigger than for others. To find the best *k*, what we should look for is where the decrease changes most, that is the "elbow" inside the plot where the difference in change is maximum. Of course, this is quite easy to spot visually but not always trivial in terms of calculations. Some code that roughly does the trick follows:

diffs = [sse(2:end) 0]-sse; % calculates differences in SSE between each (k,k-1) pair ratios = [diffs(2:end) 0]./diffs; % calculate ratios between differences k = find(ratios == min(ratios(ratios>0))) + 1; % get lowest ratio (constrained to be >0 to ignore case k=1 and k=max)

... however, there are still cases (even in the provided datasets) where it does not perform well, especially when k-means is not the best algorithm for a given dataset. Run this demo both with blobs and circles/crescents datasets and try to answer the following questions:

- were you always able to
*calculate*the correct number of clusters using the code provided above? - were you always able to
*visually detect*the correct number of clusters by looking at the SSE plot? - was the value of k you estimated confirmed by the final plot/the accuracy calculated against the ground truth?
- how would you fix the code provided above to automatically estimate k even in the "blobs03" example? (note: if the code already gives you the correct k do not worry, just skip this question (see comments below ;)).
- how would you fix the code provided above to automatically estimate k in the circles/crescents examples?

This time I really wrote a lot of questions :) Don't worry, some of them will be (partially) answered in the next tutorial!

Luca VandroJune 17th, 2012 - 18:54

wrt question n.3 What do you mean by “fix the code provided above to automatically estimate k even in the “blobs03″ example”. By running the code I get k=5 that is the number of clusters that we have in blobs03, so I don’t understand why do we need to fix it

+malaJune 18th, 2012 - 14:43

Erm… my fault, I guess I already provided the fixed code instead of the faulty one :) You can just ignore the question, I will also update the post!