library(broom)
library(ISLR)
library(leaps)
library(tidyverse)
We wish to predict a baseball player’s salary on the basis of various statistics associated with their performance in the previous year. Let’s remove the players for which the salary is missing from the dataset.
hitters <- Hitters %>% na.omit() %>% as.tibble()
## Warning: `as.tibble()` is deprecated, use `as_tibble()` (but mind the new semantics).
## This warning is displayed once per session.
We use the regsubsets() function to perform best subset selection, identifying the best model that contains a given number of predictors.
hitter_regsub <- regsubsets(Salary ~ ., hitters)
summary(hitter_regsub)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., hitters)
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " "
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " "
## 7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*"
## CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) "*" " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" " " " " " " "*" " " " " " "
## 4 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 5 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 6 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 7 ( 1 ) " " " " " " "*" "*" " " " " " "
## 8 ( 1 ) " " "*" " " "*" "*" " " " " " "
An asterisk indicates that a given variable is included in the corresponding model. By default regsubsets() only reports models of up to eight variables; the nvmax argument can be used to increase or decrease this.
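For example, to consider models containing up to all 19 predictors (the same call we make further below), we could run:
regsubsets(Salary ~ ., hitters, nvmax = 19)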
The summary function also returns the \(R^2\), RSS, adjusted \(R^2\), \(C_p\) and BIC. Let’s extract these values into a tibble.
(
hitters %>%
regsubsets(Salary ~ ., .) %>%
summary() %>%
rbind() %>%
as.tibble() %>%
dplyr::select(rsq, rss, adjr2, cp, bic) %>%
unnest() %>%
mutate(nvar = row_number())
)
## # A tibble: 8 x 6
## rsq rss adjr2 cp bic nvar
## <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 0.321 36179679. 0.319 104. -90.8 1
## 2 0.425 30646560. 0.421 50.7 -129. 2
## 3 0.451 29249297. 0.445 38.7 -136. 3
## 4 0.475 27970852. 0.467 27.9 -142. 4
## 5 0.491 27149899. 0.481 21.6 -144. 5
## 6 0.509 26194904. 0.497 14.0 -148. 6
## 7 0.514 25906548. 0.501 13.1 -145. 7
## 8 0.529 25136930. 0.514 7.40 -148. 8
Let’s graph the RSS, adjusted \(R^2\), \(C_p\) and BIC to help us select a model. We’ll also increase nvmax to include all 19 variables other than Salary.
hitters %>%
regsubsets(Salary ~ ., ., nvmax = 19) %>%
summary() %>%
rbind() %>%
as.tibble() %>%
dplyr::select(rsq, rss, adjr2, cp, bic) %>%
unnest() %>%
dplyr::mutate(nvar = row_number()) %>%
gather(func, 'value', -nvar) %>%
ggplot() +
geom_line(aes(nvar, value)) +
facet_wrap(~func, scales = 'free')
The regsubsets() function can also be used to perform forward and backward stepwise selection.
hitters %>%
  regsubsets(Salary ~ ., ., nvmax = 19, method = 'forward') %>%
summary() %>%
rbind() %>%
as.tibble() %>%
dplyr::select(rsq, rss, adjr2, cp, bic) %>%
unnest() %>%
dplyr::mutate(nvar = row_number()) %>%
gather(func, 'value', -nvar) %>%
ggplot() +
geom_line(aes(nvar, value)) +
facet_wrap(~func, scales = 'free')
hitters %>%
regsubsets(Salary ~ ., ., nvmax = 19, method = 'backward') %>%
summary() %>%
rbind() %>%
as.tibble() %>%
dplyr::select(rsq, rss, adjr2, cp, bic) %>%
unnest() %>%
dplyr::mutate(nvar = row_number()) %>%
gather(func, 'value', -nvar) %>%
ggplot() +
geom_line(aes(nvar, value)) +
facet_wrap(~func, scales = 'free')
We can use the validation set and cross-validation approaches to choose the best model.
In order for the approaches to yield accurate estimates of the test error, we must use only the training observations to perform all aspects of model fitting, including variable selection. If the full data set is used for the best subset selection step, the validation set errors and cross-validation errors obtained will not be accurate estimates of the test error.
There is no built-in predict() method for the regsubsets class, so we first write our own. This takes the regsubsets object, the data, and ncoefs, the number of coefficients to predict on.
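Here is a minimal sketch of such a method, following the approach used in the ISLR lab: rebuild the model matrix from the formula stored in the regsubsets call, pull out the coefficients of the best model with the requested number of predictors, and multiply. The argument names (data, ncoefs) and the helper names below (train_idx, fit_train, val_errors) are our own.
predict.regsubsets <- function(object, data, ncoefs, ...) {
  form <- as.formula(object$call[[2]])  # the formula passed to regsubsets()
  mat <- model.matrix(form, data)       # design matrix for the new data
  coefs <- coef(object, id = ncoefs)    # best model with ncoefs predictors
  mat[, names(coefs)] %*% coefs         # predictions
}
As a quick illustration of the point above, we can fit regsubsets() on a training half only and compute the validation MSE for each model size (a sketch, output not shown):
set.seed(1)
train_idx <- sample(nrow(hitters), nrow(hitters) / 2)
fit_train <- regsubsets(Salary ~ ., data = hitters[train_idx, ], nvmax = 19)
val_errors <- map_dbl(1:19, ~mean((hitters$Salary[-train_idx] -
  predict(fit_train, hitters[-train_idx, ], ncoefs = .x))^2))
which.min(val_errors)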
The glmnet package can be used to perform ridge regression and the lasso. The main function is glmnet(). It differs from other model-fitting functions in that we must pass an x matrix and a y vector rather than using the y ~ x formula syntax.
library(glmnet)
library(modelr)
x <- hitters %>% model.matrix(Salary ~ ., .)
y <- hitters$Salary
The glmnet() function has an alpha parameter that determines what type of model is fit: if alpha = 0 a ridge regression model is fit, and if alpha = 1 a lasso is fit.
Let’s fit a ridge regression model over a grid of \(\lambda\) values ranging from \(10^{10}\) to \(10^{-2}\):
ridge.mod <- glmnet(x, y, alpha = 0, lambda = 10 ^ seq(10, -2, length = 100))
By default, the glmnet() function standardises the variables so that they are on the same scale.
Associated with each \(\lambda\) is a vector of ridge regression coefficients, stored in a matrix and accessible using coef().
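For example, a quick look at the dimensions of that matrix, and at the coefficients for one value of \(\lambda\) (output not shown):
dim(coef(ridge.mod))   # one row per coefficient, one column per lambda
coef(ridge.mod)[, 50]  # coefficients for the 50th lambda in the grid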
We can also tidy the model:
ridge.mod %>% tidy()
## # A tibble: 2,000 x 5
## term step estimate lambda dev.ratio
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1 5.36e+2 10000000000 0.000000276
## 2 AtBat 1 5.44e-8 10000000000 0.000000276
## 3 Hits 1 1.97e-7 10000000000 0.000000276
## 4 HmRun 1 7.96e-7 10000000000 0.000000276
## 5 Runs 1 3.34e-7 10000000000 0.000000276
## 6 RBI 1 3.53e-7 10000000000 0.000000276
## 7 Walks 1 4.15e-7 10000000000 0.000000276
## 8 Years 1 1.70e-6 10000000000 0.000000276
## 9 CAtBat 1 4.67e-9 10000000000 0.000000276
## 10 CHits 1 1.72e-8 10000000000 0.000000276
## # … with 1,990 more rows
Let’s take a look at some of the terms and how their estimates change with the value of \(\lambda\). We take the tidy model and filter down to a few terms, take the log of \(\lambda\) and use that as our x axis; the y axis is the coefficient estimate.
ridge.mod %>%
tidy() %>%
dplyr::filter(term %in% c('AtBat', 'Hits', 'Walks', 'Years', 'Salary', 'Runs')) %>%
mutate(log_lambda = log(lambda)) %>%
ggplot(aes(x = log_lambda, y = estimate, colour = term)) +
geom_line()
The predict() function can be used for a number of purposes. For example, we can obtain the ridge regression coefficients for a new value of \(\lambda\), say 50:
predict(ridge.mod, s = 50, type = 'coefficients')[1:10,]
## (Intercept) (Intercept) AtBat Hits HmRun
## 48.766103292 0.000000000 -0.358099859 1.969359286 -1.278247981
## Runs RBI Walks Years CAtBat
## 1.145891632 0.803829228 2.716185796 -6.218319217 0.005447837
Let’s create a training and a test set, and extract the x training model matrix and the y training vector.
set.seed(1)
hitters.resample <-
hitters %>%
resample_partition(c('train' = .5, 'test' = .5))
x.train <-
hitters.resample$train %>%
model.matrix(Salary ~ ., .)
y.train <-
hitters.resample$train %>%
as.tibble() %>%
.$Salary
We now build the test model matrix and response vector, generate predictions from the ridge model using predict() with \(\lambda = 4\), and compute the test MSE.
x.test <- hitters.resample$test %>%
model.matrix(Salary ~ ., .)
y.test <- as.tibble(hitters.resample$test) %>%
.$Salary
y.prediction <- predict(ridge.mod, s = 4, newx = x.test)
mean((y.prediction - y.test)^2)
## [1] 113895.5
Let’s try predicting with a very large \(\lambda\), which should shrink all of the coefficients towards zero so that we are essentially fitting only the intercept:
y.prediction <- predict(ridge.mod, s = 1e10, newx = x.test)
mean((y.prediction - y.test)^2)
## [1] 238591.4
We see the MSE increase. Now we compare against regular least squares to determine whether there is a benefit in using \(\lambda = 4\). Ordinary least squares is the same as ridge regression with \(\lambda = 0\). We have to add exact = T because (from the manual):
If exact=FALSE (default), then the predict function uses linear interpolation to make predictions for values of s (lambda) that do not coincide with those used in the fitting algorithm. While this is often a good approximation, it can sometimes be a bit coarse.
With exact=TRUE, these different values of s are merged (and sorted) with object$lambda, and the model is refit before predictions are made. In this case, it is required to supply the original data x= and y= as additional named arguments to predict() or coef(). The workhorse predict.glmnet() needs to update the model, and so needs the data used to create it.
y.prediction <- predict(ridge.mod, s = 0, newx = x.test, exact = TRUE, x = x.train, y = y.train)
mean((y.prediction - y.test)^2)
## [1] 153964.5
So fitting with \(\lambda = 4\) leads to a lower test MSE than ordinary least squares.
In general it is better to use cross-validation to choose \(\lambda\). We can use cv.glmnet() to do this; by default it performs ten-fold cross-validation. We see that the \(\lambda\) with the smallest cross-validation error is roughly 25 (lambda.min below). Plugging a value in this neighbourhood back into predict() gives a test MSE of 121,165.
set.seed(1)
cv.out <- cv.glmnet(x.train, y.train, alpha = 0)
cv.out %>%
tidy() %>%
mutate(log_lambda = log(lambda)) %>%
ggplot(aes(x = log_lambda, y = estimate)) +
geom_line() +
geom_point()
cv.out$lambda.min
## [1] 24.70255
y.prediction <- predict(ridge.mod, s = 27, newx = x.test)
mean((y.prediction - y.test)^2)
## [1] 121165.3
Let’s see whether the lasso can yield either a more accurate or a more interpretable model than ridge regression. We continue to use the glmnet() function, but we now set alpha = 1. Other than that, most of the steps are the same.
lasso.mod <- glmnet(x.train, y.train, alpha = 1, lambda = 10^seq(10, -2, length = 100))
lasso.mod %>%
tidy() %>%
dplyr::filter(term %in% c('AtBat', 'Hits', 'Walks', 'Years', 'Salary', 'Runs')) %>%
mutate(log_lambda = log(lambda)) %>%
ggplot(aes(x = log_lambda, y = estimate, colour = term)) +
geom_line()
We can see that, as \(\lambda\) increases, the coefficients of the terms we selected eventually go to exactly zero. We now run cross-validation and compute the test error.
set.seed(1)
lasso.cv <- cv.glmnet(x.train, y.train, alpha = 1)
lasso.cv %>%
tidy() %>%
mutate(log_lambda = log(lambda)) %>%
ggplot(aes(x = log_lambda, y = estimate)) +
geom_line() +
geom_point()
lasso.pred <- predict(lasso.mod, s = lasso.cv$lambda.min, newx = x.test)
mean((lasso.pred - y.test)^2)
## [1] 147964.5
The main advantage over ridge regression is that the resulting coefficient estimates are sparse:
predict(lasso.mod, s = lasso.cv$lambda.min, type = 'coefficients')
## 21 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -50.9475142
## (Intercept) .
## AtBat -0.2076188
## Hits 1.0571536
## HmRun 3.1270724
## Runs .
## RBI 2.2654979
## Walks 4.8394796
## Years -3.4568155
## CAtBat .
## CHits 0.3743793
## CHmRun -0.6036365
## CRuns 0.4838344
## CRBI .
## CWalks -0.6536503
## LeagueN 23.0199715
## DivisionW -126.5994583
## PutOuts 0.2709289
## Assists .
## Errors -1.7373176
## NewLeagueN .
We see that a number of the coefficients have been shrunk to exactly zero.
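A quick way to count them (a sketch; lasso.coefs is a name introduced here and not used elsewhere):
lasso.coefs <- predict(lasso.mod, s = lasso.cv$lambda.min, type = 'coefficients')
sum(as.matrix(lasso.coefs) == 0)  # number of coefficients set exactly to zero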
Principal components regression can be performed using the pcr() function, which is part of the pls library. However, we are first going to take a different direction and use the prcomp() function in a tidier manner.
Before looking at the hitters data, we take a look at a simpler data set with two principal components:
library(pls)
set.seed(1)
sample <- tibble(x = 1:1000, y = 2*x + rnorm(length(x), 30, 300))
sample %>% ggplot(aes(x,y)) + geom_point()
We take this data set and nest it in a tibble, then run the prcomp() function over it to get our principal components. We augment these with the broom library, which gives us each observation’s projection into the PCA space, and then unnest this variable.
The two PC columns per observation are ‘.fittedPC1’ and ‘.fittedPC2’, so we summarise the variance of both of these columns. The resulting tibble has those two columns with their variance, so we gather them together so that there’s a column for the PC number and a column for the variance.
We then calculate the proportion of the total variance, ending up with 96.7% of the variance explained by the first principal component and 3.3% by the second.
sample %>%
nest() %>%
mutate(
pc = map(data, ~prcomp(.x)),
pc_aug = map2(data, pc, ~augment(.y, data = .x))
) %>%
unnest(pc_aug) %>%
summarise_at(vars(contains('PC')), funs(var)) %>%
gather(key = pc, value = variance) %>%
mutate(var_explained = variance/sum(variance))
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.
## # A tibble: 2 x 3
## pc variance var_explained
## <chr> <dbl> <dbl>
## 1 .fittedPC1 487593. 0.967
## 2 .fittedPC2 16480. 0.0327
Let’s now apply this to the hitters data. There are two extra steps in the pipeline: we remove columns that aren’t numeric, and we rename the PC columns for aesthetics.
hitters_pca_variance <-
hitters %>%
select_if(is.numeric) %>%
nest() %>%
mutate(
pc = map(data, ~prcomp(.x, center = T, scale = T)),
pc_augment = map2(pc, data, ~augment(.x, .y))
) %>%
unnest(pc_augment) %>%
summarise_at(vars(contains('PC')), funs(var)) %>%
rename_all(funs(str_replace(., '.fittedPC', ''))) %>%
gather(key = pc, value = variance) %>%
mutate(pc = as.integer(pc)) %>%
mutate(explained_variance = variance/sum(variance))
hitters_pca_variance
## # A tibble: 17 x 3
## pc variance explained_variance
## <int> <dbl> <dbl>
## 1 1 7.69 0.452
## 2 2 4.12 0.242
## 3 3 1.73 0.102
## 4 4 0.917 0.0539
## 5 5 0.707 0.0416
## 6 6 0.524 0.0308
## 7 7 0.488 0.0287
## 8 8 0.251 0.0148
## 9 9 0.181 0.0106
## 10 10 0.132 0.00779
## 11 11 0.0974 0.00573
## 12 12 0.0594 0.00349
## 13 13 0.0538 0.00317
## 14 14 0.0267 0.00157
## 15 15 0.0141 0.000828
## 16 16 0.00481 0.000283
## 17 17 0.00120 0.0000707
Let’s have a look at the graph:
hitters_pca_variance %>% ggplot() + geom_bar(aes(pc, explained_variance), stat = 'identity')
Now we go back to using the pcr() function.
pcr_fit <- pcr(formula = Salary ~ ., data = hitters, scale = T, validation = "CV")
pcr_fit %>% summary()
## Data: X dimension: 263 19
## Y dimension: 263 1
## Fit method: svdpc
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 452 356.5 353.8 352.9 350.9 347.7 344.0
## adjCV 452 355.9 353.2 352.3 350.3 347.1 343.1
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 344.3 345.3 346.7 350.7 352.7 354.7 355.2
## adjCV 343.4 344.4 345.7 349.4 351.3 353.2 353.6
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 353.2 353.9 346.8 349.5 348.1 350.6
## adjCV 351.3 352.0 344.7 347.1 345.6 348.0
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 38.31 60.16 70.84 79.03 84.29 88.63 92.26
## Salary 40.63 41.58 42.17 43.22 44.90 46.48 46.69
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 94.96 96.28 97.26 97.98 98.65 99.15 99.47
## Salary 46.75 46.86 47.76 47.82 47.85 48.10 50.40
## 15 comps 16 comps 17 comps 18 comps 19 comps
## X 99.75 99.89 99.97 99.99 100.00
## Salary 50.55 53.01 53.85 54.61 54.61
validationplot(pcr_fit, val.type = 'MSEP')
The CV score is the root mean squared error. To obtain the usual MSE, the values need to be squared.
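For example, pls can report the squared version directly (output not shown):
MSEP(pcr_fit)   # the same validation results as above, on the MSE scale
RMSEP(pcr_fit)  # the RMSE scale reported by summary()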
set.seed(1)
hitters_pcr <- hitters %>%
resample_partition(c('train' = .5, 'test' = .5)) %>%
rbind() %>%
as.tibble() %>%
mutate(pcr = map(train, ~pcr(Salary ~ ., data = as.tibble(.x), scale = T, validation = 'CV')))
hitters_pcr
## # A tibble: 1 x 3
## train test pcr
## <list> <list> <list>
## 1 <resample> <resample> <mvr>
hitters_pcr %>% pull(pcr) %>% .[[1]] %>% validationplot(., val.type = 'MSEP')
We now see that the lowest validation error occurs when \(M = 7\). We calculate the test MSE:
hitters_pcr %>%
transmute(
pred = map2(pcr, test, ~predict(.x, as.tibble(.y), ncomp = 5)),
res = map(test, ~as.tibble(.x))) %>%
unnest() %>%
summarise(test_mse = mean((pred - Salary)^2))
## # A tibble: 1 x 1
## test_mse
## <dbl>
## 1 148922.
Partial least squares is implemented using the plsr() function, which is also part of the pls library. The syntax is the same as for the pcr() function.
hitters <- na.omit(Hitters) %>% as.tibble()
set.seed(2)
hitters_pls <- hitters %>%
nest() %>%
mutate(
sample = map(data, ~resample_partition(.x, c(test = .5, train = .5))),
pls = map2(data, sample, ~plsr(Salary ~ ., data = .x, subset = as.integer(.y$test), scale = T, validation = 'CV'))
)
hitters_pls %>%
pull(pls) %>%
.[[1]] %>%
summary()
## Data: X dimension: 131 19
## Y dimension: 131 1
## Fit method: kernelpls
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 420.9 329.2 332.3 340.4 342.3 356.3 359.1
## adjCV 420.9 328.2 330.9 338.0 339.8 352.4 355.0
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 364.0 356.8 353.9 356.5 356.5 357.8 367.7
## adjCV 358.8 352.4 349.7 352.3 352.0 353.3 362.2
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 366.5 358.5 359.7 360.0 359.4 368.3
## adjCV 360.8 353.6 354.8 355.1 354.6 362.8
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 36.30 53.97 65.17 70.87 75.99 83.24 86.98
## Salary 45.72 48.70 50.92 52.47 54.08 55.21 56.33
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 89.41 91.85 95.89 96.42 97.28 97.89 98.28
## Salary 56.73 56.97 57.03 57.42 57.68 57.99 58.34
## 15 comps 16 comps 17 comps 18 comps 19 comps
## X 98.75 99.69 99.82 99.99 100.00
## Salary 58.49 58.50 58.54 58.56 58.64
hitters_pls %>%
pull(pls) %>%
.[[1]] %>%
validationplot(val.type = 'MSEP')
The lowest cross-validation error occurs when \(M = 2\). We now evaluate this model on the test set.
hitters_pls %>%
mutate(
pred = map2(sample, pls, ~as.vector(predict(.y, .x$test, ncomp = 2))),
salary = map(sample, function(x) as.tibble(x$test)$Salary)) %>%
unnest(pred, salary) %>%
summarise(mse = mean((pred - salary)^2))
## # A tibble: 1 x 1
## mse
## <dbl>
## 1 89506.
We get an MSE of 89,506, which looks rather good.