mala::home Davide “+mala” Eynard’s website

19 Jan 2017

Statistical learning with R part 1 (2017): Overfitting

[This post is part of the Statistical learning with R - 2017 edition series. You might want to check out the previous editions too: 2016, 2015]

So, if you are here you probably have already unpacked the zip file. If not, please check this page before starting.

Try to run overfitting.R: supposing your current working directory is the one where you unpacked the R files, type

source("overfitting.R",print.eval=TRUE)

The print.eval parameter is needed to show you the output of some commands such as summary in the context of the source command.
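
If you want to see the difference for yourself, here is a minimal sketch. The two-line script it writes to a temporary file is made up for illustration and is not one of the demos; capture.output is only used to collect what would otherwise be printed.

```r
# Write a tiny throwaway script to a temporary file (hypothetical content).
demo_file <- tempfile(fileext = ".R")
writeLines(c("x <- c(1, 2, 3, 10)", "summary(x)"), demo_file)

# Without print.eval, top-level values in the sourced file are not printed.
silent <- capture.output(source(demo_file))

# With print.eval = TRUE, the summary is printed as it would be at the console.
shown <- capture.output(source(demo_file, print.eval = TRUE))

cat("lines printed without print.eval:", length(silent), "\n")
cat(shown, sep = "\n")
```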

Run the demo and try answering the questions you find there. In some cases you should be able to do that immediately after looking at the results; in others you will first need to add a few lines of code to get any result. If you find yourself stuck anywhere, all the material you should need is either in the script itself or in the lab notes.

19 Jan 2017

Statistical learning with R: 2017 edition

New year, new class (with a brand new name!), and a whole bunch of new R demos. If you have not played with R yet, the notes I have attached to the introductory Lab might be of help.

If you attended the class, you probably know what to do with the next posts. After you have run each demo, answer the related questions you find both in the blog post and in the demo itself. Add whatever is necessary (screenshots, code, text, links) to motivate your answers and convince me you actually ran the demos and understood their contents. Finally, send me everything in a PDF file.

To run each demo, just open the R file you will find in each post with the source command in R, for example:


source("/whatever/your/path/is/demofilename.R",print.eval = TRUE)
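
One thing to watch for when you source a file by its full path: if the script reads data files through relative paths (an assumption here, check the demo you are running), they are resolved against your current working directory, not the script's folder. The sketch below uses a made-up folder, data file, and script to show how the chdir argument of source deals with that.

```r
# Hypothetical folder standing in for wherever you unpacked the package.
demo_dir <- file.path(tempdir(), "statlearn-demos")
dir.create(demo_dir, showWarnings = FALSE)

# A stand-in demo script that reads a data file sitting next to it.
write.csv(data.frame(v = 1:5), file.path(demo_dir, "data.csv"),
          row.names = FALSE)
writeLines('d <- read.csv("data.csv"); summary(d$v)',
           file.path(demo_dir, "demofilename.R"))

# chdir = TRUE temporarily switches the working directory to the script's
# folder, so relative paths inside the script resolve from anywhere.
source(file.path(demo_dir, "demofilename.R"), print.eval = TRUE, chdir = TRUE)
```

The working directory is restored once source returns, so this does not affect the rest of your session.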

For your convenience, here is a package containing all of the source and data files you need for your homework (the package will be updated every time a new demo is added). Remember that while you will not be asked to add much new code to the demos, you should at least be able to understand what the existing code does and modify some parameters to produce different results. Now feel free to play with the following demos:

20 Jan 2016

Statistical learning with R part 4 (2016): Color quantization with K-means

[This post is part of the Statistical learning with R - 2016 edition series. You might want to check out the 2015 edition too]

So, if you are here you probably have already unpacked the zip file. If not, please check this page before starting.

Try to run ClusteringDemo.R: supposing your current working directory is the one where you unpacked the R files, type

source("ClusteringDemo.R")

Run the demo and try answering the questions you find there. In some cases you should be able to do that immediately after looking at the results; in others you will first need to add a few lines of code to get any result. If you find yourself stuck anywhere, all the material you should need is either in the script itself or in the lab notes.

19 Jan 2016

Statistical learning with R part 3 (2016): Classification with LDA

[This post is part of the Statistical learning with R - 2016 edition series. You might want to check out the 2015 edition too]

So, if you are here you probably have already unpacked the zip file. If not, please check this page before starting.

Try to run ClassificationDemo.R: supposing your current working directory is the one where you unpacked the R files, type

source("ClassificationDemo.R")

Run the demo and try answering the questions you find there. In some cases you should be able to do that immediately after looking at the results; in others you will first need to add a few lines of code to get any result. If you find yourself stuck anywhere, all the material you should need is either in the script itself or in the lab notes.

17 Jan 2016

Statistical learning with R part 2 (2016): The curse of dimensionality

[This post is part of the Statistical learning with R - 2016 edition series. You might want to check out the 2015 edition too]

So, if you are here you probably have already unpacked the zip file. If not, please check this page before starting.

Try to run CurseDimDemo.r: supposing your current working directory is the one where you unpacked the R files, type

source("CurseDimDemo.r",print.eval=TRUE)

The print.eval parameter is needed to show you the output of some commands such as summary in the context of the source command.

Run the demo and try answering the questions you find there. In some cases you should be able to do that immediately after looking at the results; in others you will first need to add a few lines of code to get any result. If you find yourself stuck anywhere, all the material you should need is either in the script itself or in the lab notes.

11 Jan 2016

Statistical learning with R part 1 (2016): Linear regression

[This post is part of the Statistical learning with R - 2016 edition series. You might want to check out the 2015 edition too]

So, if you are here you probably have already unpacked the zip file. If not, please check this page before starting.

Try to run LinearRegressionDemo.R: supposing your current working directory is the one where you unpacked the R files, type

source("LinearRegressionDemo.R",print.eval=TRUE)

The print.eval parameter is needed to show you the output of some commands such as summary in the context of the source command.

Run the demo and try answering the questions you find there. In some cases you should be able to do that immediately after looking at the results; in others you will first need to add a few lines of code to get any result. If you find yourself stuck anywhere, all the material you should need is either in the script itself or in the lab notes.

11 Jan 2016

Statistical learning with R – 2016 edition

New year, new PAMI class, and a whole bunch of new R demos. If you have not played with R yet, the notes I have attached to the introductory Lab might be of help.

If you attended the class, you probably know what to do with the next posts. After you have run each demo, answer the related questions you find both in the blog post and in the demo itself. Add whatever is necessary (screenshots, code, text, links) to motivate your answers and convince me you actually ran the demos and understood their contents. Finally, send me everything in a PDF file.

To run each demo, just open the R file you will find in each post with the source command in R, for example:


source("/whatever/your/path/is/demofilename.R",print.eval = TRUE)

For your convenience, here is a package containing all of the source and data files you need for your homework (the package will be updated every time a new demo is added). Remember that while you will not be asked to add much new code to the demos, you should at least be able to understand what the existing code does and modify some parameters to produce different results. Now feel free to play with the following demos:

19 Jan 2015

Statistical learning with R part 4: Clustering

[This post is part of the Statistical learning with R series]

If you are here you probably have already unpacked the tgz file containing the demos and read the previous articles (part 1: Overfitting, part 2: Linear regression, and part 3: Classification). If not, please check this page before starting.

Try to run clusteringDemo.R: supposing your current working directory is the one where you unpacked the R files, type

source("clusteringDemo.R",print.eval=TRUE)

The print.eval parameter is needed to show you the output of some commands such as summary in the context of the source command. Also remember that the demo generates some random data and before running it you should edit it and set up your own random seed.
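
In case you wonder why the seed matters, here is a minimal sketch. The value 1234 is just a placeholder; use your own.

```r
set.seed(1234)        # hypothetical seed: replace it with your own value
first <- rnorm(3)

set.seed(1234)        # resetting the same seed...
second <- rnorm(3)    # ...reproduces exactly the same "random" numbers

print(first)
print(identical(first, second))
```

With your own seed the generated data will differ from everybody else's, but your results stay reproducible every time you rerun the script.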

This demo tests different clustering algorithms (K-Means, Hierarchical, and K-Medoids) on the Iris flower dataset.
After you run the full demo, try to answer the following questions:

  1. What can you say about the WSS vs K plot? Can you spot an elbow? How can you interpret this behaviour, knowing the correct K and having seen what the dataset looks like?
  2. What kind of hierarchical clustering is run by default? Top down or bottom up? Single, complete, or average linkage? Can you modify the method (just look at the parameters of the hclust function) to obtain better results in terms of NMI?
  3. Try to play with the fuzziness coefficient parameter (m) of Fuzzy C-Means. We know that its value is strictly greater than 1 but with no upper bound. What happens if its value is very close to 1? What happens if it is very big? And what if the value is the default (2)?
  4. What is the best algorithm in terms of NMI (also take into account the different methods of Hierarchical)? Which one do you consider the best in terms of interpretability of the results?
  5. Finally, take one sample whose Fuzzy C-Means membership does not clearly put it in one cluster (you can look at the "Memberships" matrix printed after FCM execution, check e.g. sample 51). What makes it so ambiguous? Try to check its features (iris[51,]) and those of the cluster centers (i.e. the representative points of each cluster, result$centers), and comment on what you find.
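
As a starting point for questions 1 and 2, here is a rough sketch that is independent of the demo script. It uses only base R on the built-in iris data; the nmi helper, the seed, and the sqrt(H(a) * H(b)) normalization are my assumptions for illustration, and the demo may compute NMI differently.

```r
x <- iris[, 1:4]                     # the four numeric features
truth <- as.integer(iris$Species)    # known classes, used only for evaluation

# Total within-cluster sum of squares (WSS) for K = 1..8.
# nstart restarts k-means several times to limit the effect of bad starts.
set.seed(42)                         # hypothetical seed
ks <- 1:8
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)
plot(ks, wss, type = "b", xlab = "K", ylab = "Total within-cluster SS")

# Hypothetical helper: normalized mutual information between two labelings,
# normalized by sqrt(H(a) * H(b)), which is one common convention.
nmi <- function(a, b) {
  joint <- table(a, b) / length(a)   # joint distribution of the two labelings
  pa <- rowSums(joint)
  pb <- colSums(joint)
  nz <- joint > 0                    # skip empty cells to avoid log(0)
  mi <- sum(joint[nz] * log(joint[nz] / outer(pa, pb)[nz]))
  ha <- -sum(pa * log(pa))
  hb <- -sum(pb * log(pb))
  mi / sqrt(ha * hb)
}

# Bottom-up hierarchical clustering with three linkage criteria,
# each tree cut into 3 clusters and compared with the true species.
d <- dist(x)
linkages <- c("single", "complete", "average")
scores <- sapply(linkages, function(m) {
  tree <- hclust(d, method = m)
  nmi(cutree(tree, k = 3), truth)
})
print(round(scores, 3))
```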

12 Jan 2015

Statistical learning with R part 3: Classification

[This post is part of the Statistical learning with R series]

If you are here you probably have already unpacked the tgz file containing the demos and read the previous articles (part 1: Overfitting and part 2: Linear regression). If not, please check this page before starting.

Try to run classificationDemo.R: supposing your current working directory is the one where you unpacked the R files, type

source("classificationDemo.R",print.eval=TRUE)

The print.eval parameter is needed to show you the output of some commands such as summary in the context of the source command. Also remember that the demo generates some random data and before running it you should edit it and set up your own random seed.

This demo shows you two different classification experiments. In the former, you will run your own spam filter on the HP Labs SPAM E-mail Database. In the latter, you will use classification to recognise handwritten digits from the USPS ZIP code dataset (Le Cun et al., 1990). You can find more information on both of the datasets here, in the "Data" section.
After you run the full demo, try to answer the following questions:

  1. Look at the confusion matrix from logistic regression and interpret it. How many messages have been correctly classified as spam? How many as non-spam? How many false positives do you have (i.e. messages classified as spam that were not)? And how many false negatives?
  2. The spam experiment runs by default with a given training set (provided by the dataset creators for an easier comparison with other methods). Try to replace it with a random one (all the code is ready for you to use within the script). Does anything significant change if you do this? What if you change the default size of the set?
  3. If you think about the spam experiment as a *real* problem, you might not just settle for the solution that gives you the smallest error: you'd probably also want to minimize the *false positives* (i.e. you do not want to classify regular messages as spam). Play with the threshold to reduce the number of false positives and discuss the results: how does the general error change? How much does the amount of spam you can no longer recognize increase?
  4. In the digits classification experiment, what is the digit that is misclassified the most by QDA? Try to display some examples of this misclassified digit (all the code you need is available in the script).
  5. Compare the results given by the different classification approaches in *both experiments*. What are the best ones? Relying on the methods comparison in your textbook, can you provide hypotheses explaining this behaviour?
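
To get a feel for questions 1 and 3 before touching the real data, here is a small sketch on synthetic data. Nothing in it comes from the demo script: the single feature, the seed, and the thresholds are made up, and for brevity the model is evaluated on its own training data.

```r
set.seed(7)                                  # hypothetical seed
n <- 400
score <- rnorm(n)                            # a single made-up feature
spam <- rbinom(n, 1, plogis(2 * score))      # 1 = spam, 0 = regular mail
toy <- data.frame(score = score, spam = spam)

# Logistic regression: estimates P(spam) for each message.
fit <- glm(spam ~ score, data = toy, family = binomial)
p_spam <- predict(fit, type = "response")

# Confusion matrix for a given decision threshold
# (rows: actual class, columns: predicted class).
confusion <- function(threshold) {
  predicted <- factor(as.integer(p_spam > threshold), levels = 0:1)
  table(actual = factor(toy$spam, levels = 0:1), predicted = predicted)
}

# Raising the threshold trades false positives for false negatives.
for (th in c(0.5, 0.7, 0.9)) {
  cm <- confusion(th)
  cat("threshold", th,
      "| false positives:", cm["0", "1"],
      "| false negatives:", cm["1", "0"],
      "| error rate:", round(1 - sum(diag(cm)) / sum(cm), 3), "\n")
}
```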

9 Jan 2015

Statistical learning with R part 2: Linear regression

[This post is part of the Statistical learning with R series]

So, if you are here you probably have already unpacked the tgz file containing the demos and read the previous article about overfitting. If not, please check this page before starting.

Try to run linearRegressionDemo.R: supposing your current working directory is the one where you unpacked the R files, type

source("linearRegressionDemo.R",print.eval=TRUE)

The print.eval parameter is needed to show you the output of some commands such as summary in the context of the source command. Also remember that the demo generates some random data and before running it you should edit it and set up your own random seed.

This demo performs different linear regressions on the Los Angeles ozone dataset (simple, multivariate, with non-linear extensions, etc.). After you run the full demo, try to answer the following questions:

  1. The variable selection results we obtain with the regsubsets command are consistent with the first three steps we performed manually. Can you tell which approach we used for variable selection (both in regsubsets and in manual selection)? Would the sets of selected variables be the same if we chose another approach in regsubsets? (NOTE: you can try that! Just type ?regsubsets in R to see which methods are allowed, and change the current one in the code.)
  2. Which is the best variable selection method in qualitative terms? Choose e.g. the sets of 4 variables you get with the different methods provided by regsubsets, fit linear models using them, and show which one has the lowest MSE. Can you also explain why that method is better?
  3. Just by introducing a very simple non-linearity in our model we were able to get a better fit both in the training and in the test sets. Can you do better? Try to modify the code to include other types of non-linearities (in the simplest case, more complex polynomials) and verify whether the model gets better only in the training set or also in test.
  4. What can you say about the final results, comparing the evaluations of the training and test sets? Do you think there is actually overfitting? Do things change if you choose different values for the training and test sets?
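
For questions 3 and 4, the kind of comparison you are after can be sketched as follows. This is not the demo code and does not use the ozone data: the sine-shaped relationship, the seed, and the 70/30 split are all assumptions chosen for illustration.

```r
set.seed(99)                                  # hypothetical seed
n <- 200
x <- runif(n, -2, 2)
y <- sin(2 * x) + rnorm(n, sd = 0.3)          # made-up non-linear relationship
dat <- data.frame(x = x, y = y)

train_idx <- sample(n, size = round(0.7 * n)) # 70/30 split, an arbitrary choice
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Mean squared error of a fitted model on a given data set.
mse <- function(model, data) {
  mean((data$y - predict(model, newdata = data))^2)
}

# Fit polynomials of increasing degree and compare training vs test MSE.
degrees <- 1:8
results <- t(sapply(degrees, function(deg) {
  fit <- lm(y ~ poly(x, deg), data = train)
  c(degree = deg, train = mse(fit, train), test = mse(fit, test))
}))
print(round(results, 4))
```

Training MSE cannot get worse as the degree grows, because each model contains the previous one. The test column is the one that tells you whether the extra flexibility actually helps.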