Statistical learning with R part 4: Clustering
[This post is part of the Statistical learning with R series]
If you are here you probably have already unpacked the tgz file containing the demos and read the previous articles (part 1: Overfitting and part 2: Linear regression, and part 3: Classification). If not, please check this page before starting.
Try to run clusteringDemo.R: supposing your current working directory is the one where you unpacked the R files, type
source("clusteringDemo.R",print.eval=TRUE)
The print.eval parameter is needed to show you the output of some commands such as summary in the context of the source command. Also remember that the demo generates some random data and before running it you should edit it and set up your own random seed.
This demo tests different clustering algorithms (K-Means, Hierarchical, and K-Medoids) on the Iris flower dataset.
After you run the full demo, try to answer the following questions:
- What can you say about the WSS vs K plot? Can you spot an elbow? How can you interpret this behaviour, knowing the correct K and having looked at how the dataset looks like?
- What kind of hierarchical clustering is ran by default? Top down or bottom up? Single, complete, or average linkage? Can you modify the method (just look at the parameters of the hclust function) to obtain better results in terms of NMI?
- Try to play with the fuzziness coefficient parameter (m) of Fuzzy C-Means. We know that its value is strictly greater than 1 but with no upper bound. What happens if its value is very close to 1? What happens if it is very big? And what if the value is the default (2)?
- What is the best algorithm in terms of NMI (also take into account the different methods of Hierarchical)? Which one do you consider the best in terms of interpretability of the results?
- Finally, take one sample whose Fuzzy C-Means membership does not clearly put it in one cluster (you can look at the "Memberships" matrix printed after FCM execution, check e.g. sample 51). What makes it so ambiguous? Try to check its features (iris[51,]) and the ones of the cluster centers (i.e. the representative points of each cluster, result$centers), and comment what you found.
Statistical learning with R part 3: Classification
[This post is part of the Statistical learning with R series]
If you are here you probably have already unpacked the tgz file containing the demos and read the previous articles (part 1: Overfitting and part 2: Linear regression). If not, please check this page before starting.
Try to run classificationDemo.R: supposing your current working directory is the one where you unpacked the R files, type
source("classificationDemo.R",print.eval=TRUE)
The print.eval parameter is needed to show you the output of some commands such as summary in the context of the source command. Also remember that the demo generates some random data and before running it you should edit it and set up your own random seed.
This demo shows you two different classification experiments. In the former, you will run your own spam filter on the HP Labs SPAM E-mail Database. In the latter, you will use classification to recognise handwritten digits from the USPS ZIP code dataset (Le Cun et al., 1990). You can find more information on both of the datasets here, in the "Data" section.
After you run the full demo, try to answer the following questions:
- Look at the confusion matrix from logistic regression and interpret it. How many messages have been correctly classified as spam? How many as non-spam? How many false positives you have (e.g. messages classified as spam that were not)? And how many false negatives?
- The spam experiment runs by default with a given training set (provided by the dataset creators for an easier comparison with other methods). Try to change it with a random one (all the code is ready for you to use within the script). Does anything significative change if you do this? What if you change the default size of the set?
- If you think about the spam experiment as a *real* problem, you might not just settle with a solution which gives you the smallest error, but you'd probably want to also minimize the *false positives* (i.e. you do not want to classify regular messages as spam). Play with the threshold to reduce the number of false positives and discuss the results: how does the general error change? How does the amount of spam you cannot recognize anymore increase?
- In the digits classification experiment, what is the digit that is misclassified the most by QDA? Try to display some examples of this misclassified digit (all the code you need is available in the script)
- Compare the results given by the different classification approaches in *both experiments*. What are the best ones? Relying on the methods comparison in your textbook, can you provide hypotheses explaining this behaviour?
Statistical learning with R part 2: Linear regression
[This post is part of the Statistical learning with R series]
So, if you are here you probably have already unpacked the tgz file containing the demos and read the previous article about overfitting. If not, please check this page before starting.
Try to run linearRegressionDemo.R: supposing your current working directory is the one where you unpacked the R files, type
source("linearRegressionDemo.R",print.eval=TRUE)
The print.eval parameter is needed to show you the output of some commands such as summary in the context of the source command. Also remember that the demo generates some random data and before running it you should edit it and set up your own random seed.
This demo performs different linear regressions on the Los Angeles ozone dataset (simple, multivariate, with non-linear extensions, etc.). After you run the full demo, try to answer the following questions:
- The variable selection results we obtain with the regsubsets command are consistent with the first three steps we performed manually. Can you tell which approach we used for variable selection (both in regsubsets and in manual selection)? Would the sets of selected variables be the same if we chose another approach in regsubsets?(NOTE: you can try that! Just type ?regsubsets in R to see which methods are allowed, and change the current one in the code
- Which is the best variable selection method in qualitative terms? Choose e.g. the sets of 4 variables you get with the different methods provided by regsubsets, fit linear models using them, and show which one has the lowest MSE. Can you also explain why that method is better?
- Just by introducing a very simple non-linearity in our model we were able to get a better fit both in the training and in the test sets. Can you do better? Try to modify the code to include other types of non-linearities (in the simplest case, more complex polynomials) and verify whether the model
gets better only in the training set or also in test. - What can you say about the final results, comparing the evaluations of the training and test sets? Do you think there is actually overfitting? Do things change if you choose different values for the training and test sets?
Statistical learning with R part 1: Overfitting
[This post is part of the Statistical learning with R series]
So, if you are here you probably have already unpacked the tgz file containing the demos and done your first experiments trying to run R scripts. If not, please check this page before starting.
Try to run overfittingDemo.R: supposing your current working directory is the one where you unpacked the R files, you can just type
source("overfittingDemo.R",print.eval=TRUE)
(note that each demo generates some random data and before running one you should edit it and set up your own random seed).
In the overfitting demo you will generate different datasets, try to fit more and more flexible functions to the data, and evaluate how good your functions are in terms of how well they fit both a training and a test dataset. After you run the full demo, try to answer the following questions:
- We have seen that increasing the flexibility we tend to overfit the data, i.e. while the training MSE gets smaller, the test MSE increases. Is the opposite also true for our datasets, i.e. is the simplest, less flexible function taken into consideration the one that also provides the smallest test MSE? If not, can you explain why?
- How good are our errors? Throughout the whole demo, we have talked about errors without ever referring to the smallest possible one that we can get, that is the irreducible error. How is it calculated and what is its size in each of the generated datasets?
- The default datasets are quite small, and in some cases (e.g. try with seed=789) the results are not exactly the expected ones. Do results change sensibly if you increase the number of points? Can you tell why? (NOTE: you can do that just by changing the dataset_size variable at the beginning of the script)
Statistical learning with R – Introduction and setup
New year, new PAMI class: as the format changed with respect to previous years, my homeworks/demos are also different, and you will need R to run them. If you have not played with R yet, the notes I have attached to the introductory Lab might be of help.
If you attended the class, you probably know what to do with the next posts. After you have ran each demo, answer the related questions you will find both in the blog post and in the demo itself. Add whatever is necessary (screenshots, code, text, links) to motivate your answers and convince me you actually ran the demos and understood them. Finally send me everything in a pdf file.
To run each demo, just open the R file you will find in each post with the source command in R, for example:
source("/whatever/your/path/is/demofilename.R",print.eval = TRUE)
For your convenience, here is a package containing all of the source and data files you need for your homework (the package will be updated every time a new demo is added). Remember that while you will not be asked to add much new code to the demos, you should at least be able to understand what it does and modify some parameters to produce different results. Now feel free to play with the following demos: