mala::home Davide “+mala” Eynard’s website


Statistical learning with R part 2: Linear regression

[This post is part of the Statistical learning with R series]

So, if you are here you probably have already unpacked the tgz file containing the demos and read the previous article about overfitting. If not, please check this page before starting.

Try to run linearRegressionDemo.R: supposing your current working directory is the one where you unpacked the R files, type


The print.eval parameter is needed to show you the output of some commands such as summary in the context of the source command. Also remember that the demo generates some random data and before running it you should edit it and set up your own random seed.

This demo performs different linear regressions on the Los Angeles ozone dataset (simple, multivariate, with non-linear extensions, etc.). After you run the full demo, try to answer the following questions:

  1. The variable selection results we obtain with the regsubsets command are consistent with the first three steps we performed manually. Can you tell which approach we used for variable selection (both in regsubsets and in manual selection)? Would the sets of selected variables be the same if we chose another approach in regsubsets?(NOTE: you can try that! Just type ?regsubsets in R to see which methods are allowed, and change the current one in the code
  2. Which is the best variable selection method in qualitative terms? Choose e.g. the sets of 4 variables you get with the different methods provided by regsubsets, fit linear models using them, and show which one has the lowest MSE. Can you also explain why that method is better?
  3. Just by introducing a very simple non-linearity in our model we were able to get a better fit both in the training and in the test sets. Can you do better? Try to modify the code to include other types of non-linearities (in the simplest case, more complex polynomials) and verify whether the model
    gets better only in the training set or also in test.
  4. What can you say about the final results, comparing the evaluations of the training and test sets? Do you think there is actually overfitting? Do things change if you choose different values for the training and test sets?