Statistical learning with R part 1: Overfitting
[This post is part of the Statistical learning with R series]
So, if you are here you have probably already unpacked the tgz file containing the demos and run your first experiments with R scripts. If not, please check this page before starting.
Try to run overfittingDemo.R: assuming your current working directory is the one where you unpacked the R files, you can just type
source("overfittingDemo.R", print.eval=TRUE)
(note that each demo generates some random data; before running one, you should edit the script and set your own random seed).
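Seeds in R are set with set.seed(); the demo presumably has a call like the one below near the top of the script (the value 123 here is just a placeholder, replace it with your own):
set.seed(123)  # placeholder seed: pick your own value so your runs are reproducible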
In the overfitting demo you will generate different datasets, fit more and more flexible functions to the data, and evaluate how well each function fits both a training and a test dataset. After you run the full demo, try to answer the following questions (a self-contained sketch of the train/test comparison follows the list):
- We have seen that by increasing the flexibility we tend to overfit the data, i.e. while the training MSE gets smaller, the test MSE increases. Is the opposite also true for our datasets, i.e. is the simplest, least flexible function taken into consideration also the one that provides the smallest test MSE? If not, can you explain why?
- How good are our errors? Throughout the whole demo we have talked about errors without ever referring to the smallest one we can possibly get, that is, the irreducible error. How is it calculated, and what is its size in each of the generated datasets?
- The default datasets are quite small, and in some cases (e.g. try with seed=789) the results are not exactly the expected ones. Do results change noticeably if you increase the number of points? Can you tell why? (NOTE: you can do that just by changing the dataset_size variable at the beginning of the script)
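If you want to see the mechanics outside the demo, here is a minimal sketch, not the demo's actual code: it generates noisy data, fits polynomials of increasing degree, and prints training and test MSE side by side. All names and values (n, sigma, the true function, the degrees) are made up for illustration:
set.seed(42)                     # any seed works; fixed here for reproducibility
n     <- 50                      # sample size (plays the role of dataset_size in the demo)
sigma <- 0.5                     # noise sd: the irreducible error is sigma^2
x      <- runif(n, 0, 1)
y      <- sin(2 * pi * x) + rnorm(n, sd = sigma)
x_test <- runif(n, 0, 1)
y_test <- sin(2 * pi * x_test) + rnorm(n, sd = sigma)
for (d in c(1, 3, 9, 15)) {      # increasing flexibility
  fit       <- lm(y ~ poly(x, d))
  mse_train <- mean((y - fitted(fit))^2)
  mse_test  <- mean((y_test - predict(fit, newdata = data.frame(x = x_test)))^2)
  cat(sprintf("degree %2d: train MSE %.3f, test MSE %.3f\n", d, mse_train, mse_test))
}
Typically the training MSE keeps shrinking as the degree grows while the test MSE eventually turns back up; try raising n to see how the picture stabilizes, just as changing dataset_size does in the demo.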