## Octave clustering demo part 2: cluster evaluation

[This post is part of the Octave clustering demo series]

So, if you read this you probably have already unpacked the tgz file containing the demo, read the previous article, and ran demos 1 and 2. This post describes in detail the third demo, devoted to cluster evaluation.

In the previous article you saw how you can use an *internal* evaluation index to get better results from k-means. The main difference between internal and external indices is that the former (e.g. SSE) do not require additional information (such as the original labeling of data points), while the latter (e.g. accuracy) typically rely on that. SSE was used to evaluate the quality of different k-means executions, then the result providing the lowest SSE was considered the best one.

All of this was done supposing we already knew the correct number of clusters in the dataset (the variable *clusters* we used was actually provided, together with the data points and their classes). In the real world, however, this is not always true: you will typically lack not only the original classification (which is obvious, otherwise you would not need to run a clustering algorithm!), but also the best number of groups you want to divide your data into. So, you will have to find it... But what does *best* mean, in this case?

In kMeansDemo3 we employ the same SSE function to compare different executions of k-means. These are the main steps performed by the script:

- for k varying between 2 and 10
- run k-means 10 times (to address the random initialization problem, by choosing the lowest-SSE clustering)
- save the best (lowest) SSE and the matching clustering result
- plot the best SSE values for k=2, 3, ..., 10
- allow the user to decide which k is best, then plot that result and show both its SSE and its accuracy.

… Yes, there is nothing automatic in this :-). The reason is that I wanted you to get a grasp of how SSE changes with k and how to interpret it. This is strongly connected to what I wrote early about what *best* means in this case.

If you look at the SSE plot, you will see that *it always decreases with increasing values of k*. Which is reasonable, if you think that by increasing the number of clusters you will have points which are closer and closer to their centroids, up to the case when every point is the centroid of its own 1-point cluster and SSE becomes zero. This is totally different from the case in which the number of clusters was given and we just had to take the lowest-SSE results. How can we choose, then, the best k?

By looking again at the SSE plot, you will notice that the value of SSE always decreases, but for some values of k the change is bigger than for others. To find the best *k*, what we should look for is where the decrease changes most, that is the "elbow" inside the plot where the difference in change is maximum. Of course, this is quite easy to spot visually but not always trivial in terms of calculations. Some code that roughly does the trick follows:

diffs = [sse(2:end) 0]-sse; % calculates differences in SSE between each (k,k-1) pair ratios = [diffs(2:end) 0]./diffs; % calculate ratios between differences k = find(ratios == min(ratios(ratios>0))) + 1; % get lowest ratio (constrained to be >0 to ignore case k=1 and k=max)

... however, there are still cases (even in the provided datasets) where it does not perform well, especially when k-means is not the best algorithm for a given dataset. Run this demo both with blobs and circles/crescents datasets and try to answer the following questions:

- were you always able to
*calculate*the correct number of clusters using the code provided above? - were you always able to
*visually detect*the correct number of clusters by looking at the SSE plot? - was the value of k you estimated confirmed by the final plot/the accuracy calculated against the ground truth?
- how would you fix the code provided above to automatically estimate k even in the "blobs03" example? (note: if the code already gives you the correct k do not worry, just skip this question (see comments below ;)).
- how would you fix the code provided above to automatically estimate k in the circles/crescents examples?

This time I really wrote a lot of questions :) Don't worry, some of them will be (partially) answered in the next tutorial!

## Octave clustering demo part 1: k-means

[This post is part of the Octave clustering demo series]

So, if you are here you probably have already unpacked the tgz file containing the demo and done your first experiments trying to run Octave scripts. Great! This post describes in detail the first two demo files: kMeansDemo1 and kMeansDemo2.

Try to run **kMeansDemo1**. You should execute it as kMeansDemo1(*dataset*), where *dataset* is one of the .mat files saved in the *data* directory. For instance, supposing your current working directory is the one where you unpacked the .tgz file, you could run:

kMeansDemo1('./data/blobs01.mat')

Now, all the demos are somehow interactive, that is they will plot stuff, comment the results in the main window and ask you to press a key to continue. All the demos have the possibility of becoming non-interactive: this is done by setting the variable "interactive" in the scripts to 0. Here are the main operations that the first demo performs:

- it plots data, so that "real" clusters (i.e. the ground truth, not the detected ones) are shown. Note that for all the datasets provided you have points (variable
*X*), ground truth (variable*classes*), and number of clusters (variable*clusters*); - for three times, it runs the k-means algorithm trying to identify clusters in data. This is the original page where k-means code comes from, and here is an example of how it can be called (thanks Dan for sharing!);
- every time the algorithm ends, its
*accuracy*is calculated and shown together with the clustering results. Note that bad behaviors are foreseen and in some cases you just cannot reach 100% accuracy with a simple k-means.

After you run the full example, try to answer the following questions:

- Were the results the same? If not, can you tell why? Feel free to open myKmeans.m to better understand how it works.
- During the class we said that you can initialize k-means by just choosing random positions for centroids or by using some smarter heuristics (e.g. using actual data points for initial centroids, choosing them as far as possible from each other, and so on). Does myKmeans.m use these kind of heuristics?
- How would you modify the algorithm to obtain better/more stable results?

The second demo, kMeansDemo2, partially answers the last question, providing a way to address the random centroid initialization problem. Still, this remains non-deterministic, however there are many more chances to have better results by following its approach. The main idea is: find a way to evaluate how well k-means performed, then run it many times and keep only the best results.

A pretty common way to evaluate clustering algorithm performances is to use SSE (Sum of Squared Errors), that is the sum of all the (squared) distances between points in the dataset and the centroids of the clusters they have been assigned to. What we want are clusters which are, at the same time, tight (i.e. where points are near to their centroids) and far apart from each other. SSE mainly deals with the first of these two characteristic, returning results which should be as low as possible for a clustering to be good. So, the demo does the following:

- it runs k-means 10 times
- each time, it calculates the SSE for the whole clustering
- it keeps the lowest SSE (and matching cluster labels) as the best result, then shows it to the user

After running the demo with different datasets, try to answer the following questions:

- Does the quality of the algorithm only depend on the number of repetitions, i.e. is it always possible to get 100% accuracy by running k-means a sufficiently high number of times?
- Is the initialization of the bestSSE variable correct in kMeansDemo2 code? Does the code work fine for all the examples?
- Why didn't we directly use the
*accuracy*function to evaluate the algorithm performances, checking the correctness of the assigned labels with respect to ground truth? - Now that we get stabler results, compare the plots of two clusterings obtained in different executions. Are labels (i.e. colors in the plot) equal from one execution to the other? If not, does that matter? How are they evaluated against ground truth?

## Octave clustering demo part 0: introduction and setup

As promised to my PAMI students, I have prepared a couple of demos to get a better grasp of how different clustering algorithms work. Hoping they could be useful for somebody else (and sure that this way I will not lose them *so* easily ;-)) I have decided to post more information about them here.

All the demos are in one single file and can be downloaded from the course page together with slides and other material (or if you only want the demo file follow this direct link). Once unpacked, you will find the following:

**data/*.mat**(some example datasets)**accuracy.m**(calculates clustering accuracy)**adjacency_weighted.m**(calculates the weighted adjacency matrix used by spectral clustering)**kMeansDemo*.m**(k-means demos)**L2_distance.m**(calculates L2 distance between two vectors/matrices)**laplacian.m**(builds the Laplacian used by spectral clustering)**myKmeans.m**(performs k-means clustering)**plotClustering.m**(given clustered data and ground truth, plots clustering results)**repKmeans.m**(automatically repeats k-means*n*times and keeps the result with lowest SSE)**spectralDemo.m**(spectral clustering demo)**SSE.m**(given clustered data and centroids, calculates total SSE)**uAveragedAccuracy.m**(another way to calculate clustering accuracy)

To run the code just open Octave, *cd* into the spectralClustering directory, and call the different functions. The demo files are executed in the following ways:

kMeansDemo1(dataset); - or - spectralDemo(dataset,nn,t);

where *dataset* is one of the files in the *data* directory, *nn* is the number of nearest neighbors for the calculation of the adjacency matrix, and *t* is the parameter for the Gaussian kernel used to calculate similarities in the weighted adjacency matrix (0 can be used for auto-tuning and usually works fine enough). For example:

kMeansDemo1('./data/blobs03.mat'); - or - spectralDemo('./data/circles02.mat',10,0);

Good values of *nn* for these examples range between 10 and 40... I will let you experiment which ones are better for each dataset ;-). Now feel free to play with the demos or read the following tutorials:

- Octave clustering demo part 1: k-means
- Octave clustering demo part 2: cluster evaluation
- Octave clustering demo part 3: spectral clustering
- Octave clustering demo part 4: k-medoids
- Octave clustering demo part 5: hierarchical clustering
- Octave clustering demo part 6: (more) evaluation