## Statistical learning with R part 4: Clustering

[This post is part of the Statistical learning with R series]

If you are here you probably have already unpacked the tgz file containing the demos and read the previous articles (part 1: Overfitting and part 2: Linear regression, and part 3: Classification). If not, please check this page before starting.

Try to run clusteringDemo.R: supposing your current working directory is the one where you unpacked the R files, type

source("clusteringDemo.R",print.eval=TRUE)

The print.eval parameter is needed to show you the output of some commands such as *summary* in the context of the *source* command. Also remember that the demo generates some random data and before running it you should edit it and set up your own random seed.

This demo tests different clustering algorithms (K-Means, Hierarchical, and K-Medoids) on the Iris flower dataset.

After you run the full demo, try to answer the following questions:

- What can you say about the WSS vs K plot? Can you spot an elbow? How can you interpret this behaviour, knowing the correct K and having looked at how the dataset looks like?
- What kind of hierarchical clustering is ran by default? Top down or bottom up? Single, complete, or average linkage? Can you modify the method (just look at the parameters of the hclust function) to obtain better results in terms of NMI?
- Try to play with the
*fuzziness coefficient*parameter (m) of Fuzzy C-Means. We know that its value is strictly greater than 1 but with no upper bound. What happens if its value is very close to 1? What happens if it is very big? And what if the value is the default (2)? - What is the best algorithm in terms of NMI (also take into account the different methods of Hierarchical)? Which one do you consider the best in terms of interpretability of the results?
- Finally, take one sample whose Fuzzy C-Means membership does not clearly put it in one cluster (you can look at the "Memberships" matrix printed after FCM execution, check e.g. sample 51). What makes it so ambiguous? Try to check its features (
**iris[51,]**) and the ones of the cluster centers (i.e. the representative points of each cluster,**result$centers**), and comment what you found.

## Statistical learning with R part 3: Classification

[This post is part of the Statistical learning with R series]

If you are here you probably have already unpacked the tgz file containing the demos and read the previous articles (part 1: Overfitting and part 2: Linear regression). If not, please check this page before starting.

Try to run classificationDemo.R: supposing your current working directory is the one where you unpacked the R files, type

source("classificationDemo.R",print.eval=TRUE)

The print.eval parameter is needed to show you the output of some commands such as *summary* in the context of the *source* command. Also remember that the demo generates some random data and before running it you should edit it and set up your own random seed.

This demo shows you two different classification experiments. In the former, you will run your own spam filter on the HP Labs SPAM E-mail Database. In the latter, you will use classification to recognise handwritten digits from the USPS ZIP code dataset (Le Cun et al., 1990). You can find more information on both of the datasets here, in the "Data" section.

After you run the full demo, try to answer the following questions:

- Look at the confusion matrix from logistic regression and interpret it. How many messages have been correctly classified as spam? How many as non-spam? How many false positives you have (e.g. messages classified as spam that were not)? And how many false negatives?
- The spam experiment runs by default with a given training set (provided by the dataset creators for an easier comparison with other methods). Try to change it with a random one (all the code is ready for you to use within the script). Does anything significative change if you do this? What if you change the default size of the set?
- If you think about the spam experiment as a *real* problem, you might not just settle with a solution which gives you the smallest error, but you'd probably want to also minimize the *false positives* (i.e. you do not want to classify regular messages as spam). Play with the threshold to reduce the number of false positives and discuss the results: how does the general error change? How does the amount of spam you cannot recognize anymore increase?
- In the digits classification experiment, what is the digit that is misclassified the most by QDA? Try to display some examples of this misclassified digit (all the code you need is available in the script)
- Compare the results given by the different classification approaches in *both experiments*. What are the best ones? Relying on the methods comparison in your textbook, can you provide hypotheses explaining this behaviour?

## Statistical learning with R part 2: Linear regression

[This post is part of the Statistical learning with R series]

So, if you are here you probably have already unpacked the tgz file containing the demos and read the previous article about overfitting. If not, please check this page before starting.

Try to run linearRegressionDemo.R: supposing your current working directory is the one where you unpacked the R files, type

source("linearRegressionDemo.R",print.eval=TRUE)

The print.eval parameter is needed to show you the output of some commands such as *summary* in the context of the *source* command. Also remember that the demo generates some random data and before running it you should edit it and set up your own random seed.

This demo performs different linear regressions on the Los Angeles ozone dataset (simple, multivariate, with non-linear extensions, etc.). After you run the full demo, try to answer the following questions:

- The variable selection results we obtain with the
*regsubsets*command are consistent with the first three steps we performed manually. Can you tell which approach we used for variable selection (both in*regsubsets*and in manual selection)? Would the sets of selected variables be the same if we chose another approach in*regsubsets*?(NOTE: you can try that! Just type*?regsubsets*in R to see which methods are allowed, and change the current one in the code - Which is the best variable selection method in qualitative terms? Choose e.g. the sets of 4 variables you get with the different methods provided by
*regsubsets*, fit linear models using them, and show which one has the lowest MSE. Can you also explain why that method is better? - Just by introducing a very simple non-linearity in our model we were able to get a better fit both in the training and in the test sets. Can you do better? Try to modify the code to include other types of non-linearities (in the simplest case, more complex polynomials) and verify whether the model

gets better only in the training set or also in test. - What can you say about the final results, comparing the evaluations of the training and test sets? Do you think there is actually overfitting? Do things change if you choose different values for the training and test sets?

## Statistical learning with R part 1: Overfitting

[This post is part of the Statistical learning with R series]

So, if you are here you probably have already unpacked the tgz file containing the demos and done your first experiments trying to run R scripts. If not, please check this page before starting.

Try to run overfittingDemo.R: supposing your current working directory is the one where you unpacked the R files, you can just type

source("overfittingDemo.R",print.eval=TRUE)

(note that each demo generates some random data and before running one you should edit it and set up your own random seed).

In the overfitting demo you will generate different datasets, try to fit more and more flexible functions to the data, and evaluate how good your functions are in terms of how well they fit both a training and a test dataset. After you run the full demo, try to answer the following questions:

- We have seen that increasing the flexibility we tend to overfit the data, i.e. while the training MSE gets smaller, the test MSE increases. Is the opposite also true for our datasets, i.e. is the simplest, less flexible function taken into consideration the one that also provides the smallest test MSE? If not, can you explain why?
- How good are our errors? Throughout the whole demo, we have talked about errors without ever referring to the smallest possible one that we can get, that is the irreducible error. How is it calculated and what is its size in each of the generated datasets?
- The default datasets are quite small, and in some cases (e.g. try with seed=789) the results are not exactly the expected ones. Do results change sensibly if you increase the number of points? Can you tell why?
*(NOTE: you can do that just by changing the dataset_size variable at the beginning of the script)*

## Statistical learning with R – Introduction and setup

New year, new PAMI class: as the format changed with respect to previous years, my homeworks/demos are also different, and you will need R to run them. If you have not played with R yet, the notes I have attached to the introductory Lab might be of help.

If you attended the class, you probably know what to do with the next posts. After you have ran each demo, answer the related questions you will find both in the blog post and in the demo itself. Add whatever is necessary (screenshots, code, text, links) to motivate your answers and convince me you actually ran the demos and understood them. Finally send me everything in a pdf file.

To run each demo, just open the R file you will find in each post with the **source** command in R, for example:

source("/whatever/your/path/is/demofilename.R",print.eval = TRUE)

For your convenience, here is a package containing all of the source and data files you need for your homework (the package will be updated every time a new demo is added). Remember that while you will not be asked to add much new code to the demos, you should at least be able to understand what it does and modify some parameters to produce different results. Now feel free to play with the following demos:

## Octave clustering demo part 5: hierarchical clustering

[This post is part of the Octave clustering demo series]

So, if you read this you probably have already unpacked the tgz file containing the demo and read the previous article about k-medoids clustering. If you want to check out the older Octave clustering articles, here they are: part 0 - part 1 - part 2 - part 3.

Run the *hierarchicalDemo* script (if launched without input parameters, it will open the "crescents.mat" data file by default). The script will plot the dataset, then it will try to perform clustering using first k-means and then an agglomerative hierarchical clustering algorithm. Run the algorithm with all the datasets available (note that you will need to modify the code to make it work with different numbers of clusters).

Questions:

- how does k-means perform on the non-globular datasets (crescents, circles01, and circles02)? What is the reason of this performance? how does hierarchical algorithm perform instead?
- now let us play with the "agglomerative" dataset. Information is missing so you know neither the number of clusters in advance nor the ground truth about points (i.e. which class they belong to). First of all, run the hclust function with different numbers of clusters and plot the results. Are you satisfied with them? What do you think is the reason of this "chaining" phenomenon?
- complete the code in hclust.m so it also calculates complete and average linkage, and run the algorithm again. How do clustering results change? Plot and comment them.

## Octave clustering demo part 4: k-Medoids

[This post is part of the Octave clustering demo series]

Here follows the description of the first of two new Octave clustering demos. If you have attended the PAMI class this year you probably know what I am talking about. If instead you are new to these demos please (1) check the page linked above and (2) set up your Octave environment (feel free to download the old package and play with the old demos too ;-)). Then download this new package.

Run the *kMedoidsDemo* script (if launched without input parameters, it will open the "noisyBlobs01.mat" data file by default). The script will plot the dataset, then it will try to perform clustering using first k-means and then k-medoids. Run the experiment many times (at least 4-5 times) for each of the three noisyBlobs datasets (you can pass the dataset path and name as a parameter to the kMedoidsDemo function).

Questions:

- comment the "worst case" results you get from k-means: what happens and how does noise influence this behavior?
- compare the results you get with k-means with the ones you get with k-medoids: how does k-medoids deal with the presence of noise?
- what would happen if noise points were much farther from the original clusters? Try to get few (1-2) of them and bring them very far from the rest of the data… what are the results you get from k-means and k-medoids?

Hints:

- the demo stops at each step waiting for you to press a key. You can disable this feature by setting the "interactive" variable to zero at the beginning of the script;
- there are two different k-means implementations you can run: feel free to try both just by uncommenting their respective code.

## Octave clustering demo part 3: spectral clustering

[This post is part of the Octave clustering demo series]

This is (currently) the last of my posts devoted to clustering demos. If you haven't downloaded the file containing Octave code yet, you can get it here. If you want to read the previous articles, here they are: part 0 - part 1 - part 2.

The last demo in the package shows you how spectral clustering works. Spectral Clustering is a relatively recent approach to clustering that I found very interesting, especially because it addresses some of the limitations of k-means but, at the same time, it relies on k-means itself! What it does in practice is finding a new space where localities are preserved, emphasizing both the cohesion and the separation of clusters, thus allowing k-means to work at its best. If you are curious and want to delve deeper into this topic, you will find more material in the following page: Paper Review: A Tutorial on Spectral Clustering.

I was quite surprised not to find ready code for spectral clustering in Octave, but probably I just had not searched (well) enough. By the way, if you came here searching for it here is an(other?) implementation, together with some examples and data to play with: feel free to try it and to leave your feedback!

The *spectralDemo* function can be called as follows:

spectralDemo(dataset,nn,t)

where *dataset* is the usual dataset path+name, while *nn* and *t* are parameters used to build the (weighted) adjacency matrix describing similarity relationships between data points. Respectively, *nn* specifies the number of nearest neighbors for a point (a point is connected to another if it appears as one of its top nn nearest neighbors) and *t* is the parameter for the Gaussian similarity function s(xi,xj)=exp(-(xi-xj)^2 / t). My suggestion is to play with both parameters, but once you understood how they work you can just keep *t=0* for the automatic tuning of the kernel and focus only on *nn*. Good values of nn range between 10 and 40, but of course you are free to test others (but keep in mind that the bigger the value, the slower the whole algorithm is! Can you say why?). The script performs the following tasks:

- it calculates the weighted adjacency matrix and plots both the distances between points and the weights between nodes in the adjacency graph (this allows one to inspect whether parameters need to be trimmed to have a more/less connected graph, and also to understand when the graph approach might work better than a classical euclidean distance approach);
- it builds the graph Laplacian and calculates its eigenvalues and eigenvectors. Then, it plots eigenvalues for inspection (this allows one to understand --if the dataset is clean enough-- how many connected components/natural clusters are present in the graph);
- to compare results between k-means and spectral clustering, it runs k-means both in the euclidean space and in the eigenspace built out of the top k eigenvectors, then shows both clusterings with related SSEs and accuracies.

And now, here are the usual maieutical questions:

- what are the datasets in which spectral clustering performs definitely better than k-means?
- how much do the results of spectral clustering depend on the input parameters?
- how would you evaluate the quality of a labeling done with spectral clustering? How would you apply that to find which is the best number of nearest neighbors to take into account?
- how often (according to your tests with different datasets) does the rule "the number of 0-valued eigenvalues matches the number of clusters" holds? How ofted does its "relaxed version" hold? (if the question is not clear check spectral clustering notes/papers)
- how worth would be to use algorithms other than k-means to do clustering in the eigenspace?

That's all... have fun with the demos and with spectral clustering :-)

## Octave clustering demo part 2: cluster evaluation

[This post is part of the Octave clustering demo series]

So, if you read this you probably have already unpacked the tgz file containing the demo, read the previous article, and ran demos 1 and 2. This post describes in detail the third demo, devoted to cluster evaluation.

In the previous article you saw how you can use an *internal* evaluation index to get better results from k-means. The main difference between internal and external indices is that the former (e.g. SSE) do not require additional information (such as the original labeling of data points), while the latter (e.g. accuracy) typically rely on that. SSE was used to evaluate the quality of different k-means executions, then the result providing the lowest SSE was considered the best one.

All of this was done supposing we already knew the correct number of clusters in the dataset (the variable *clusters* we used was actually provided, together with the data points and their classes). In the real world, however, this is not always true: you will typically lack not only the original classification (which is obvious, otherwise you would not need to run a clustering algorithm!), but also the best number of groups you want to divide your data into. So, you will have to find it... But what does *best* mean, in this case?

In kMeansDemo3 we employ the same SSE function to compare different executions of k-means. These are the main steps performed by the script:

- for k varying between 2 and 10
- run k-means 10 times (to address the random initialization problem, by choosing the lowest-SSE clustering)
- save the best (lowest) SSE and the matching clustering result
- plot the best SSE values for k=2, 3, ..., 10
- allow the user to decide which k is best, then plot that result and show both its SSE and its accuracy.

… Yes, there is nothing automatic in this :-). The reason is that I wanted you to get a grasp of how SSE changes with k and how to interpret it. This is strongly connected to what I wrote early about what *best* means in this case.

If you look at the SSE plot, you will see that *it always decreases with increasing values of k*. Which is reasonable, if you think that by increasing the number of clusters you will have points which are closer and closer to their centroids, up to the case when every point is the centroid of its own 1-point cluster and SSE becomes zero. This is totally different from the case in which the number of clusters was given and we just had to take the lowest-SSE results. How can we choose, then, the best k?

By looking again at the SSE plot, you will notice that the value of SSE always decreases, but for some values of k the change is bigger than for others. To find the best *k*, what we should look for is where the decrease changes most, that is the "elbow" inside the plot where the difference in change is maximum. Of course, this is quite easy to spot visually but not always trivial in terms of calculations. Some code that roughly does the trick follows:

diffs = [sse(2:end) 0]-sse; % calculates differences in SSE between each (k,k-1) pair ratios = [diffs(2:end) 0]./diffs; % calculate ratios between differences k = find(ratios == min(ratios(ratios>0))) + 1; % get lowest ratio (constrained to be >0 to ignore case k=1 and k=max)

... however, there are still cases (even in the provided datasets) where it does not perform well, especially when k-means is not the best algorithm for a given dataset. Run this demo both with blobs and circles/crescents datasets and try to answer the following questions:

- were you always able to
*calculate*the correct number of clusters using the code provided above? - were you always able to
*visually detect*the correct number of clusters by looking at the SSE plot? - was the value of k you estimated confirmed by the final plot/the accuracy calculated against the ground truth?
- how would you fix the code provided above to automatically estimate k even in the "blobs03" example? (note: if the code already gives you the correct k do not worry, just skip this question (see comments below ;)).
- how would you fix the code provided above to automatically estimate k in the circles/crescents examples?

This time I really wrote a lot of questions :) Don't worry, some of them will be (partially) answered in the next tutorial!

## Octave clustering demo part 1: k-means

[This post is part of the Octave clustering demo series]

So, if you are here you probably have already unpacked the tgz file containing the demo and done your first experiments trying to run Octave scripts. Great! This post describes in detail the first two demo files: kMeansDemo1 and kMeansDemo2.

Try to run **kMeansDemo1**. You should execute it as kMeansDemo1(*dataset*), where *dataset* is one of the .mat files saved in the *data* directory. For instance, supposing your current working directory is the one where you unpacked the .tgz file, you could run:

kMeansDemo1('./data/blobs01.mat')

Now, all the demos are somehow interactive, that is they will plot stuff, comment the results in the main window and ask you to press a key to continue. All the demos have the possibility of becoming non-interactive: this is done by setting the variable "interactive" in the scripts to 0. Here are the main operations that the first demo performs:

- it plots data, so that "real" clusters (i.e. the ground truth, not the detected ones) are shown. Note that for all the datasets provided you have points (variable
*X*), ground truth (variable*classes*), and number of clusters (variable*clusters*); - for three times, it runs the k-means algorithm trying to identify clusters in data. This is the original page where k-means code comes from, and here is an example of how it can be called (thanks Dan for sharing!);
- every time the algorithm ends, its
*accuracy*is calculated and shown together with the clustering results. Note that bad behaviors are foreseen and in some cases you just cannot reach 100% accuracy with a simple k-means.

After you run the full example, try to answer the following questions:

- Were the results the same? If not, can you tell why? Feel free to open myKmeans.m to better understand how it works.
- During the class we said that you can initialize k-means by just choosing random positions for centroids or by using some smarter heuristics (e.g. using actual data points for initial centroids, choosing them as far as possible from each other, and so on). Does myKmeans.m use these kind of heuristics?
- How would you modify the algorithm to obtain better/more stable results?

The second demo, kMeansDemo2, partially answers the last question, providing a way to address the random centroid initialization problem. Still, this remains non-deterministic, however there are many more chances to have better results by following its approach. The main idea is: find a way to evaluate how well k-means performed, then run it many times and keep only the best results.

A pretty common way to evaluate clustering algorithm performances is to use SSE (Sum of Squared Errors), that is the sum of all the (squared) distances between points in the dataset and the centroids of the clusters they have been assigned to. What we want are clusters which are, at the same time, tight (i.e. where points are near to their centroids) and far apart from each other. SSE mainly deals with the first of these two characteristic, returning results which should be as low as possible for a clustering to be good. So, the demo does the following:

- it runs k-means 10 times
- each time, it calculates the SSE for the whole clustering
- it keeps the lowest SSE (and matching cluster labels) as the best result, then shows it to the user

After running the demo with different datasets, try to answer the following questions:

- Does the quality of the algorithm only depend on the number of repetitions, i.e. is it always possible to get 100% accuracy by running k-means a sufficiently high number of times?
- Is the initialization of the bestSSE variable correct in kMeansDemo2 code? Does the code work fine for all the examples?
- Why didn't we directly use the
*accuracy*function to evaluate the algorithm performances, checking the correctness of the assigned labels with respect to ground truth? - Now that we get stabler results, compare the plots of two clusterings obtained in different executions. Are labels (i.e. colors in the plot) equal from one execution to the other? If not, does that matter? How are they evaluated against ground truth?