## Statistical learning with R part 4: Clustering

[This post is part of the Statistical learning with R series]

If you are here you have probably already unpacked the tgz file containing the demos and read the previous articles (part 1: Overfitting, part 2: Linear regression, and part 3: Classification). If not, please check this page before starting.

Try to run clusteringDemo.R: supposing your current working directory is the one where you unpacked the R files, type

source("clusteringDemo.R",print.eval=TRUE)

The print.eval parameter is needed to show you the output of some commands such as *summary* in the context of the *source* command. Also remember that the demo generates some random data and before running it you should edit it and set up your own random seed.
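
As a minimal sketch of both points, the snippet below writes a tiny throwaway script to a temporary file (it is not one of the demos; its contents are made up for illustration), sources it with and without print.eval, and then checks that fixing the seed makes the random data repeatable.

```r
# Hypothetical stand-in for a demo script: three top-level expressions
script <- tempfile(fileext = ".R")
writeLines(c("set.seed(42)   # fixed seed: same random data at every run",
             "x <- rnorm(100)",
             "summary(x)"), script)

# Without print.eval the value of summary(x) is computed but not shown
quiet <- capture.output(source(script))

# With print.eval = TRUE the summary is printed while sourcing
shown <- capture.output(source(script, print.eval = TRUE))
print(shown)

# Same seed, same random numbers
set.seed(42); a <- rnorm(5)
set.seed(42); b <- rnorm(5)
print(identical(a, b))
```

Changing 42 to a number of your choice is all it takes to get your own version of the random data.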

This demo tests different clustering algorithms (K-Means, Hierarchical, and K-Medoids) on the Iris flower dataset.
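
To give an idea of the kind of computation involved, here is a rough sketch (not the code in clusteringDemo.R) that runs K-Means on the four numeric Iris features for several values of K and collects the total within-cluster sum of squares (WSS). The seed, the range of K, and the number of restarts are arbitrary choices made for this example.

```r
# Numeric features only; the Species column is the ground truth
x <- iris[, 1:4]

set.seed(1)   # k-means starts from random centers
ks <- 1:10
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)

print(data.frame(K = ks, WSS = round(wss, 2)))
# plot(ks, wss, type = "b", xlab = "K", ylab = "WSS") draws the curve
```

The nstart argument repeats each run from several random initializations and keeps the best one, which makes the curve less dependent on a single unlucky start.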

After you run the full demo, try to answer the following questions:

- What can you say about the WSS vs K plot? Can you spot an elbow? How can you interpret this behaviour, knowing the correct K and having seen what the dataset looks like?
- What kind of hierarchical clustering is run by default? Top-down or bottom-up? Single, complete, or average linkage? Can you modify the method (just look at the parameters of the hclust function) to obtain better results in terms of NMI?
- Try to play with the *fuzziness coefficient* parameter (m) of Fuzzy C-Means. We know that its value is strictly greater than 1, with no upper bound. What happens if its value is very close to 1? What happens if it is very large? And what if the value is the default (2)?
- What is the best algorithm in terms of NMI (also take into account the different methods of Hierarchical)? Which one do you consider the best in terms of interpretability of the results?
- Finally, take one sample whose Fuzzy C-Means membership does not clearly put it in one cluster (you can look at the "Memberships" matrix printed after FCM execution; check e.g. sample 51). What makes it so ambiguous? Try to check its features (**iris[51,]**) and those of the cluster centers (i.e. the representative points of each cluster, **result$centers**), and comment on what you find.
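
If you want to experiment with linkage methods outside the demo, the following sketch may help. It assumes one common definition of NMI (mutual information divided by the geometric mean of the two entropies); the demo script may normalize differently, so absolute values are not guaranteed to match. The helper name `nmi` is made up for this example.

```r
# NMI between two labelings, normalized by sqrt(H(a) * H(b)) (assumed variant)
nmi <- function(a, b) {
  joint <- unclass(table(a, b)) / length(a)   # joint distribution of labels
  pa <- rowSums(joint)
  pb <- colSums(joint)
  nz <- joint > 0                             # skip empty cells: 0 * log(0) = 0
  mi <- sum(joint[nz] * log(joint[nz] / outer(pa, pb)[nz]))
  ha <- -sum(pa[pa > 0] * log(pa[pa > 0]))
  hb <- -sum(pb[pb > 0] * log(pb[pb > 0]))
  mi / sqrt(ha * hb)
}

d <- dist(iris[, 1:4])
methods <- c("single", "complete", "average")
scores <- sapply(methods, function(m) {
  clusters <- cutree(hclust(d, method = m), k = 3)   # cut the tree at 3 clusters
  nmi(clusters, iris$Species)
})
print(round(scores, 3))
```

The same `nmi` helper can be applied to any cluster assignment vector, for instance the one returned by kmeans.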

## Statistical learning with R part 3: Classification

[This post is part of the Statistical learning with R series]

If you are here you probably have already unpacked the tgz file containing the demos and read the previous articles (part 1: Overfitting and part 2: Linear regression). If not, please check this page before starting.

Try to run classificationDemo.R: supposing your current working directory is the one where you unpacked the R files, type

source("classificationDemo.R",print.eval=TRUE)

The print.eval parameter is needed to show you the output of some commands such as *summary* in the context of the *source* command. Also remember that the demo generates some random data and before running it you should edit it and set up your own random seed.

This demo shows you two different classification experiments. In the former, you will run your own spam filter on the HP Labs SPAM E-mail Database. In the latter, you will use classification to recognise handwritten digits from the USPS ZIP code dataset (Le Cun et al., 1990). You can find more information on both of the datasets here, in the "Data" section.

After you run the full demo, try to answer the following questions:

- Look at the confusion matrix from logistic regression and interpret it. How many messages have been correctly classified as spam? How many as non-spam? How many false positives do you have (i.e. messages classified as spam that were not)? And how many false negatives?
- The spam experiment runs by default with a given training set (provided by the dataset creators for an easier comparison with other methods). Try to replace it with a random one (all the code is ready for you to use within the script). Does anything significant change if you do this? What if you change the default size of the set?
- If you think about the spam experiment as a *real* problem, you might not just settle for the solution that gives you the smallest error, but you would probably also want to minimize the *false positives* (i.e. you do not want to classify regular messages as spam). Play with the threshold to reduce the number of false positives and discuss the results: how does the overall error change? How much does the amount of spam you can no longer recognize increase?
- In the digits classification experiment, what is the digit that is misclassified the most by QDA? Try to display some examples of this misclassified digit (all the code you need is available in the script).
- Compare the results given by the different classification approaches in *both experiments*. What are the best ones? Relying on the methods comparison in your textbook, can you provide hypotheses explaining this behaviour?
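
To make the threshold question more concrete, here is a small sketch on synthetic data (not the HP Labs spam dataset, and not the code in classificationDemo.R). It fits a logistic regression with glm and builds the confusion matrix for two thresholds; the single feature `score`, the sample size, and the thresholds are all invented for illustration.

```r
set.seed(1)
n <- 1000
score <- rnorm(n)                           # one synthetic feature
spam  <- rbinom(n, 1, plogis(2 * score))    # 1 = spam, 0 = regular mail
d <- data.frame(score, spam)

fit  <- glm(spam ~ score, data = d, family = binomial)
prob <- predict(fit, type = "response")     # estimated P(spam) per message

# Confusion matrix for a given decision threshold
confusion <- function(threshold) {
  pred <- factor(as.integer(prob > threshold), levels = 0:1)
  table(predicted = pred, actual = factor(d$spam, levels = 0:1))
}

for (th in c(0.5, 0.9)) {
  cm <- confusion(th)
  cat("threshold", th,
      "| false positives:", cm["1", "0"],
      "| false negatives:", cm["0", "1"],
      "| error rate:", 1 - sum(diag(cm)) / n, "\n")
}
```

Raising the threshold can only move messages from the "predicted spam" row to the "predicted regular" row, which is why the two kinds of error trade off against each other.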

## Statistical learning with R part 2: Linear regression

[This post is part of the Statistical learning with R series]

So, if you are here you probably have already unpacked the tgz file containing the demos and read the previous article about overfitting. If not, please check this page before starting.

Try to run linearRegressionDemo.R: supposing your current working directory is the one where you unpacked the R files, type

source("linearRegressionDemo.R",print.eval=TRUE)

The print.eval parameter is needed to show you the output of some commands such as *summary* in the context of the *source* command. Also remember that the demo generates some random data and before running it you should edit it and set up your own random seed.

This demo performs different linear regressions on the Los Angeles ozone dataset (simple, multivariate, with non-linear extensions, etc.). After you run the full demo, try to answer the following questions:

- The variable selection results we obtain with the *regsubsets* command are consistent with the first three steps we performed manually. Can you tell which approach we used for variable selection (both in *regsubsets* and in manual selection)? Would the sets of selected variables be the same if we chose another approach in *regsubsets*? (NOTE: you can try that! Just type *?regsubsets* in R to see which methods are allowed, and change the current one in the code.)
- Which is the best variable selection method in qualitative terms? Choose e.g. the sets of 4 variables you get with the different methods provided by *regsubsets*, fit linear models using them, and show which one has the lowest MSE. Can you also explain why that method is better?
- Just by introducing a very simple non-linearity in our model we were able to get a better fit both in the training and in the test sets. Can you do better? Try to modify the code to include other types of non-linearities (in the simplest case, more complex polynomials) and verify whether the model gets better only in the training set or also in the test set.
- What can you say about the final results, comparing the evaluations of the training and test sets? Do you think there is actually overfitting? Do things change if you choose different values for the training and test sets?
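
The demo relies on *regsubsets* (from the leaps package). As a rough base-R illustration of how different search directions can be compared, the sketch below uses the built-in *step* function instead. Note that this is a different technique: step selects by AIC, and the data are R's bundled airquality dataset (New York), used here only as a stand-in for the Los Angeles ozone data.

```r
aq <- na.omit(airquality)   # step() needs the same rows at every stage

null <- lm(Ozone ~ 1, data = aq)
full <- lm(Ozone ~ Solar.R + Wind + Temp + Month + Day, data = aq)

# Forward: start from the intercept-only model and add variables
fwd <- step(null,
            scope = list(lower = ~ 1,
                         upper = ~ Solar.R + Wind + Temp + Month + Day),
            direction = "forward", trace = 0)

# Backward: start from the full model and drop variables
bwd <- step(full, direction = "backward", trace = 0)

cat("forward: ", names(coef(fwd))[-1], "\n")
cat("backward:", names(coef(bwd))[-1], "\n")
cat("training MSE (forward, backward, full):",
    mean(residuals(fwd)^2), mean(residuals(bwd)^2), mean(residuals(full)^2), "\n")
```

Keep in mind that the training MSE of the full model can never be higher than that of any of its sub-models, so a fair comparison between selected sets needs a test set.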

## Statistical learning with R part 1: Overfitting

[This post is part of the Statistical learning with R series]

So, if you are here you probably have already unpacked the tgz file containing the demos and done your first experiments trying to run R scripts. If not, please check this page before starting.

Try to run overfittingDemo.R: supposing your current working directory is the one where you unpacked the R files, you can just type

source("overfittingDemo.R",print.eval=TRUE)

(note that each demo generates some random data and before running one you should edit it and set up your own random seed).

In the overfitting demo you will generate different datasets, try to fit more and more flexible functions to the data, and evaluate how good your functions are in terms of how well they fit both a training and a test dataset. After you run the full demo, try to answer the following questions:

- We have seen that by increasing the flexibility we tend to overfit the data, i.e. while the training MSE gets smaller, the test MSE increases. Is the opposite also true for our datasets, i.e. is the simplest, least flexible function taken into consideration the one that also provides the smallest test MSE? If not, can you explain why?
- How good are our errors? Throughout the whole demo, we have talked about errors without ever referring to the smallest possible one that we can get, that is the irreducible error. How is it calculated and what is its size in each of the generated datasets?
- The default datasets are quite small, and in some cases (e.g. try with seed=789) the results are not exactly the expected ones. Do results change noticeably if you increase the number of points? Can you tell why?
*(NOTE: you can do that just by changing the dataset_size variable at the beginning of the script)*
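
For reference, the general setup behind these questions can be sketched as follows. This is not the code in overfittingDemo.R: the "true" function, the noise level, and the dataset sizes are assumptions chosen for illustration. With squared loss, the irreducible error is the variance of the noise added to the true function, which is known here because we generate the data ourselves.

```r
set.seed(123)
f <- function(x) sin(2 * pi * x)   # assumed "true" function
noise_sd <- 0.3                    # assumed noise standard deviation

make_data <- function(n) {
  x <- runif(n)
  data.frame(x = x, y = f(x) + rnorm(n, sd = noise_sd))
}
train <- make_data(30)
test  <- make_data(1000)

# Fit polynomials of increasing degree and record both errors
degrees <- 1:10
mse <- t(sapply(degrees, function(dg) {
  fit <- lm(y ~ poly(x, dg), data = train)
  c(degree = dg,
    train  = mean((train$y - predict(fit))^2),
    test   = mean((test$y - predict(fit, newdata = test))^2))
}))
print(round(mse, 3))
cat("irreducible error (noise variance):", noise_sd^2, "\n")
```

Try changing the training set size from 30 to something much larger and compare the two error columns again.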

## Statistical learning with R – Introduction and setup

New year, new PAMI class: as the format changed with respect to previous years, my homeworks/demos are also different, and you will need R to run them. If you have not played with R yet, the notes I have attached to the introductory Lab might be of help.

If you attended the class, you probably know what to do with the next posts. After you have run each demo, answer the related questions you will find both in the blog post and in the demo itself. Add whatever is necessary (screenshots, code, text, links) to motivate your answers and convince me you actually ran the demos and understood them. Finally, send me everything in a pdf file.

To run each demo, just open the R file you will find in each post with the **source** command in R, for example:

source("/whatever/your/path/is/demofilename.R",print.eval = TRUE)

For your convenience, here is a package containing all of the source and data files you need for your homework (the package will be updated every time a new demo is added). Remember that while you will not be asked to add much new code to the demos, you should at least be able to understand what it does and modify some parameters to produce different results. Now feel free to play with the following demos:

## New paper: Laplacian colormaps: a framework for structure-preserving color transformations

Next week I will present *Laplacian colormaps* at EG2014: here are (drafts of) the paper, the supplementary material, and the slides:

## Octave clustering demo part 6: (more) evaluation

[This post is part of the Octave clustering demo series]

This new clustering demo goes deeper into evaluation, showing some techniques to get an idea about the *clustering tendency* of a dataset, find visual clues about the clusters, and get a hint about the "best" number of clusters. If you have attended the PAMI class this year you probably know what I am talking about. If instead you are new to these demos please (1) check the page linked above and (2) set up your Octave environment (feel free to download the old package and play with the old demos too ;-)). Then download this new package.

If you want to check out the older Octave clustering articles, here they are: part 0 - part 1 - part 2 - part 3 - part 4 - part 5. I **strongly suggest** that you run at least parts 0-3 before this demo, as they provide you with all the basics you need to get the best out of this exercise.

*Note that some of the functions available in previous packages are also present in this one. While it is ok to run the previous examples with those functions, make sure you are using the most up-to-date ones for the current experiments, as I debugged and optimized them for this specific demo.*

Run the *evaluationDemo1* script. This will generate a random dataset and walk you through a sequence of tests to evaluate the clustering tendency and ultimately perform clustering on it. As random clusters might be more or less nasty to cluster, I suggest that you run the demo a few times and see how it behaves in general. Note that part of the code is missing and you will have to provide it for the demo to be complete.

When you feel confident enough with evaluating the random dataset, run the *evaluationDemo2* script. This will be a little more interactive, asking you to choose, among three datasets, the one which has the most "interesting" content, and requiring you to write small pieces of code to complete your evaluation.

At the end of your experiment, answer the following questions:

- comment on the different executions of *evaluationDemo1*. How sensitive to "bad" overlapping clusters are the SSE elbow and the distance matrix plot? Does the presence of overlapping clusters affect the clustering tendency test? How do you think it would be possible (and in this case, would it be meaningful) to distinguish between two completely overlapping clusters?
- comment on the different executions of *evaluationDemo2*. Which is the "interesting" dataset? Why? Is the SSE elbow method useful to automatically detect the number of clusters? Why? What additional information does the distance matrix convey? Is Spectral clustering better than k-Means? Did you happen to find parameters that give better accuracy than the default ones?
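
The idea behind the distance matrix plot can also be sketched in a few lines of R (the demo itself is written for Octave, and this is not a port of it). The snippet generates two synthetic, well separated blobs, reorders the points by cluster, and compares distances inside the diagonal blocks with those outside; all data and parameters are made up for the example.

```r
set.seed(7)
# Two synthetic 2-D blobs, then shuffled so the order carries no information
x <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
           matrix(rnorm(100, mean = 10), ncol = 2))
x <- x[sample(nrow(x)), ]

cl  <- kmeans(x, centers = 2, nstart = 10)$cluster
ord <- order(cl)                      # group points of the same cluster together
d   <- as.matrix(dist(x[ord, ]))      # reordered pairwise distance matrix
same <- outer(cl[ord], cl[ord], "==") # TRUE inside the diagonal blocks

within  <- mean(d[same & upper.tri(d)])
between <- mean(d[!same])
cat("mean within-cluster distance: ", round(within, 2), "\n")
cat("mean between-cluster distance:", round(between, 2), "\n")
# image(d) would display the block structure visually
```

When clusters overlap heavily the two averages get closer and the blocks in the image fade, which is one way to relate the plot to the questions above.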

If you are a PAMI student, please write your answers in a pdf file, motivating them and providing all the material (images, code, numerical results) needed to support them.

Hints:

- the demo stops at each step waiting for you to press a key. You can disable this feature by setting the "interactive" variable to zero at the beginning of the script;
- the second demo file has some "return" commands purposely left in the code to stop the execution at given points (typically when you need to add some code before proceeding, or to increase the suspense ;-));
- I tested the code on Octave (3.6.4) and MATLAB (2011a) and it runs on both. If you still have problems please let me know ASAP and we will sort them out.

## New TR: Structure-preserving color transformations using Laplacian commutativity

*"Mappings between color spaces are ubiquitous in image processing problems such as gamut mapping, decolorization, and image optimization for color-blind people. Simple color transformations often result in information loss and ambiguities (for example, when mapping from RGB to grayscale), and one wishes to find an image-specific transformation that would preserve as much as possible the structure of the original image in the target color space. In this paper, we propose Laplacian colormaps, a generic framework for structure-preserving color transformations between images. We use the image Laplacian to capture the structural information, and show that if the color transformation between two images preserves the structure, the respective Laplacians have similar eigenvectors, or in other words, are approximately jointly diagonalizable. Employing the relation between joint diagonalizability and commutativity of matrices, we use Laplacians commutativity as a criterion of color mapping quality and minimize it w.r.t. the parameters of a color transformation to achieve optimal structure preservation. We show numerous applications of our approach, including color-to-gray conversion, gamut mapping, multispectral image fusion, and image optimization for color deficient viewers."*

Full text is available on ArXiv. Enjoy! :-)

## Octave clustering demo part 5: hierarchical clustering

[This post is part of the Octave clustering demo series]

So, if you read this you probably have already unpacked the tgz file containing the demo and read the previous article about k-medoids clustering. If you want to check out the older Octave clustering articles, here they are: part 0 - part 1 - part 2 - part 3.

Run the *hierarchicalDemo* script (if launched without input parameters, it will open the "crescents.mat" data file by default). The script will plot the dataset, then it will try to perform clustering using first k-means and then an agglomerative hierarchical clustering algorithm. Run the algorithm with all the datasets available (note that you will need to modify the code to make it work with different numbers of clusters).

Questions:

- how does k-means perform on the non-globular datasets (crescents, circles01, and circles02)? What is the reason for this performance? How does the hierarchical algorithm perform instead?
- now let us play with the "agglomerative" dataset. Information is missing, so you know neither the number of clusters in advance nor the ground truth about points (i.e. which class they belong to). First of all, run the hclust function with different numbers of clusters and plot the results. Are you satisfied with them? What do you think is the reason for this "chaining" phenomenon?
- complete the code in hclust.m so it also calculates complete and average linkage, and run the algorithm again. How do clustering results change? Plot them and comment on them.
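
If you would like a reference to compare your Octave results against, the same kind of experiment can be sketched in R, whose built-in hclust function (unrelated to the hclust.m file of the demo) already supports several linkage methods. The dataset below is a synthetic, noise-free pair of concentric rings invented for this example, not one of the .mat files in the package.

```r
# Two concentric rings, 100 evenly spaced points each (synthetic stand-in)
theta <- seq(0, 2 * pi, length.out = 101)[-101]
ring  <- function(r) cbind(r * cos(theta), r * sin(theta))
x     <- rbind(ring(1), ring(5))
truth <- rep(1:2, each = 100)

set.seed(3)
km <- kmeans(x, centers = 2, nstart = 10)$cluster

# Agglomerative clustering with three linkage criteria, cut at 2 clusters
d  <- dist(x)
hc <- sapply(c("single", "complete", "average"),
             function(m) cutree(hclust(d, method = m), k = 2))

print(table(kmeans = km, truth))
for (m in colnames(hc)) print(table(hc[, m], truth, dnn = c(m, "truth")))
```

Each table counts how many points of each true ring end up in each cluster, so a perfect result has non-zero counts only on one diagonal.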

## Octave clustering demo part 4: k-Medoids

[This post is part of the Octave clustering demo series]

Here follows the description of the first of two new Octave clustering demos. If you have attended the PAMI class this year you probably know what I am talking about. If instead you are new to these demos please (1) check the page linked above and (2) set up your Octave environment (feel free to download the old package and play with the old demos too ;-)). Then download this new package.

Run the *kMedoidsDemo* script (if launched without input parameters, it will open the "noisyBlobs01.mat" data file by default). The script will plot the dataset, then it will try to perform clustering using first k-means and then k-medoids. Run the experiment many times (at least 4-5 times) for each of the three noisyBlobs datasets (you can pass the dataset path and name as a parameter to the kMedoidsDemo function).

Questions:

- comment on the "worst case" results you get from k-means: what happens and how does noise influence this behavior?
- compare the results you get with k-means with the ones you get with k-medoids: how does k-medoids deal with the presence of noise?
- what would happen if noise points were much farther from the original clusters? Try taking a few (1-2) of them and moving them very far from the rest of the data… what are the results you get from k-means and k-medoids?
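
The difference between the two kinds of representative point can be seen even on a single cluster. The following R sketch (not part of the Octave package, and not a full k-medoids implementation) compares the centroid used by k-means, i.e. the mean of the points, with the medoid, i.e. the data point with the smallest total distance to all the others. The six points are invented for the example.

```r
pts <- rbind(c(0, 0), c(1, 0), c(0, 1), c(1, 1), c(0.5, 0.5),
             c(100, 100))               # last row: one far-away noise point

# k-means style representative: the mean, which need not be a data point
centroid <- colMeans(pts)

# k-medoids style representative: the point minimizing the sum of distances
d <- as.matrix(dist(pts))
medoid_idx <- which.min(rowSums(d))
medoid <- pts[medoid_idx, ]

cat("centroid:", round(centroid, 2), "\n")
cat("medoid:  ", medoid, "(row", medoid_idx, ")\n")
```

You can move the last point closer or farther away and watch how each representative reacts.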

Hints:

- the demo stops at each step waiting for you to press a key. You can disable this feature by setting the "interactive" variable to zero at the beginning of the script;
- there are two different k-means implementations you can run: feel free to try both just by uncommenting their respective code.