8) Auto data set

a)

Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Print the results and comment on the output.

auto <- as_tibble(Auto)
lm_auto <- auto %>% lm(mpg ~ horsepower, .)
lm_auto %>% 
    tidy() %>%
    kable() %>%
    kable_styling()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 39.9358610 | 0.7174987 | 55.65984 | 0 |
| horsepower | -0.1578447 | 0.0064455 | -24.48914 | 0 |

i)

Is there a relationship between the predictor and the response?

To determine if there is a relationship, we need to look at the p-values for the coefficients, the F-statistic, and the p-value for the F-statistic:

lm_auto %>% 
    tidy() %>%
    kable() %>%
    kable_styling()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 39.9358610 | 0.7174987 | 55.65984 | 0 |
| horsepower | -0.1578447 | 0.0064455 | -24.48914 | 0 |
lm_auto %>% 
    glance() %>%
    kable() %>%
    kable_styling()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.6059483 | 0.6049379 | 4.905757 | 599.7177 | 0 | 2 | -1178.662 | 2363.324 | 2375.237 | 9385.916 | 390 |

In the tidy() output we can see that the p-values for the intercept and the horsepower coefficient are both very small, indicating strong evidence against the null hypothesis that the coefficient is zero.

In the glance() output, we see the statistic column (the F-statistic) is high at around 600, with a small p-value. This indicates that there is a relationship.

ii)

How strong is the relationship?

To test how strong the relationship is, we can look at the residual standard error (RSE) and the R^2 value.

We look at the glance() output again and see the R^2 value is 0.606. Recall that the R^2 is a value between 0 and 1 that is the ‘proportion of the variance explained’. We can consider the relationship reasonably strong.
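
As a rough sketch (not part of the original output), we can also put the RSE in context by expressing it as a percentage of the mean mpg:

# Sketch: RSE relative to the mean response, as a rough percentage error.
glance(lm_auto) %>%
    transmute(RSE = sigma, mean_mpg = mean(auto$mpg), pct_error = 100 * sigma / mean(auto$mpg))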

iii)

Is the relationship positive or negative? There is a negative relationship between mpg and horsepower, thus miles per gallon goes down as horsepower goes up. This aligns with our conceptual idea.

iv)

What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?

We use the interval argument to the predict function. Note that the default level argument of predict() is 0.95 (95%).

predict(lm_auto, tibble(horsepower = 98), interval = 'confidence') %>%
    kable() %>%
    kable_styling()
| fit | lwr | upr |
|---|---|---|
| 24.46708 | 23.97308 | 24.96108 |
predict(lm_auto, tibble(horsepower = 98), interval = 'predict') %>%
    kable() %>%
    kable_styling()
| fit | lwr | upr |
|---|---|---|
| 24.46708 | 14.8094 | 34.12476 |

b)

Plot the response and the predictor. Display the least squares regression line.

auto %>% ggplot(aes(horsepower, mpg)) + geom_point() + geom_smooth(method = 'lm', formula = 'y ~ x')

c)

Produce diagnostic plots (Resid v Leverage, Resid v Fitted, Fitted v Std Resid) and comment on any problems

First off, let's have a look at the residuals versus the leverage:

augment(lm_auto) %>% ggplot(aes(.hat, .resid)) + geom_point()

There are a few points up in the top right. We take a look at the Cook’s distance for the observations.

augment(lm_auto) %>% mutate(i = 1:n()) %>% ggplot(aes(i, .cooksd)) + geom_bar(stat = 'identity')

There are a couple of higher points, but all are well below 1.
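
As a quick sketch (using the common 4/n rule of thumb, which isn't from the text), we can flag the observations with the largest Cook's distances for a closer look:

# Sketch: flag observations whose Cook's distance exceeds 4/n.
augment(lm_auto) %>%
    mutate(i = row_number()) %>%
    filter(.cooksd > 4 / n())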

Now we look at the fitted values versus the residuals, overlaying a quadratic smooth. We see a bit of a U shape, indicating potential non-linearity in the data.

augment(lm_auto) %>% ggplot(aes(.fitted, .resid)) + geom_point() + geom_smooth(method = 'lm', formula = 'y ~ poly(x,2)')

9) Multiple Linear Regression - Auto Data Set

library(GGally)
library(corrplot)

a)

Produce a scatterplot matrix which includes all the data in the data set

auto %>% dplyr::select(-name) %>% ggpairs()

b)

Compute the matrix of correlations between the variables.

auto %>% dplyr::select(-name) %>% cor() %>% corrplot(method = 'color')

c)

Perform a multiple linear regression with mpg as the response and all other variables except name as the predictors.

lin_reg_auto <- lm(mpg ~ . -name, auto)
tidy(lin_reg_auto) %>%
    kable() %>%
    kable_styling()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -17.2184346 | 4.6442941 | -3.707438 | 0.0002402 |
| cylinders | -0.4933763 | 0.3232823 | -1.526147 | 0.1277965 |
| displacement | 0.0198956 | 0.0075151 | 2.647430 | 0.0084446 |
| horsepower | -0.0169511 | 0.0137869 | -1.229512 | 0.2196328 |
| weight | -0.0064740 | 0.0006520 | -9.928787 | 0.0000000 |
| acceleration | 0.0805758 | 0.0988450 | 0.815174 | 0.4154780 |
| year | 0.7507727 | 0.0509731 | 14.728795 | 0.0000000 |
| origin | 1.4261405 | 0.2781361 | 5.127492 | 0.0000005 |
glance(lin_reg_auto) %>%
    kable() %>%
    kable_styling()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.8214781 | 0.8182238 | 3.327682 | 252.428 | 0 | 8 | -1023.475 | 2064.949 | 2100.691 | 4252.213 | 384 |

i)

Is there a relationship between the predictors and the response? We test the null hypothesis that all of the regression coefficients are zero. The F-statistic is 252 (far greater than 1) and has a p-value of around 2e-139, so it is extremely unlikely we would see an F-statistic this large if none of the predictors were related to the response. We can therefore say there is a relationship between the predictors and the response.
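
As a small sketch, the quoted p-value can be recovered directly from the F-statistic and its degrees of freedom (7 predictors and 384 residual degrees of freedom):

# Sketch: p-value of the overall F-test from the statistic and its degrees of freedom.
pf(252.428, df1 = 7, df2 = 384, lower.tail = FALSE)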

ii)

Which predictors appear to have a statistically significant relationship to the response? We look at the p-values for each of the predictors. The predictors with small p-values - and therefore a statistically significant relationship with mpg, holding the others constant - appear to be weight, year, displacement and origin.

iii)

What does the coefficient for the year variable suggest? The year coefficient suggests that a car's mpg gets larger - and therefore better - the later the car was made: roughly 0.75 mpg more per model year, holding the other predictors fixed.

d)

Plot the diagnostic plots of the regression and comment on any problems with the fit.

We go through our usual plots - first off is looking at the residuals versus the leverage:

augment(lin_reg_auto) %>% mutate(i = 1:n()) %>% ggplot(aes(i, .cooksd)) + geom_bar(stat = 'identity')

We don’t see values with a significant Cook’s distance. We move on to the fitted versus the residuals:

augment(lin_reg_auto) %>% ggplot(aes(.fitted, .resid)) + geom_point() + geom_smooth(method = 'lm', formula = 'y~poly(x,2)')

There is some evidence of non-linearity in the residuals.

e)

**Use the '*' and ':' symbols to fit linear regressions with interaction effects. Are any interactions statistically significant?**

A '*' adds the predictors and the interaction term, whereas a ':' adds only the interaction term, i.e. x * y == x + y + x:y.
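
As a quick illustration of this equivalence (not in the original write-up), the two formulas below fit identical models:

# Sketch: weight*year expands to weight + year + weight:year.
coef(lm(mpg ~ weight * year, auto))
coef(lm(mpg ~ weight + year + weight:year, auto))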

Let's have a think about potential interactions - I think weight and year could interact, given the changes in materials over time. There could also be an interaction between cylinders and displacement:

lm(mpg ~ weight*year + cylinders*displacement, auto) %>% 
    tidy() %>%
    kable() %>%
    kable_styling()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -81.3476273 | 13.8781079 | -5.861579 | 0.0000000 |
| weight | 0.0202479 | 0.0046727 | 4.333191 | 0.0000188 |
| year | 1.7535227 | 0.1755592 | 9.988211 | 0.0000000 |
| cylinders | -1.4865200 | 0.4002062 | -3.714385 | 0.0002338 |
| displacement | -0.0698473 | 0.0127015 | -5.499162 | 0.0000001 |
| weight:year | -0.0003469 | 0.0000618 | -5.612385 | 0.0000000 |
| cylinders:displacement | 0.0090185 | 0.0015597 | 5.782327 | 0.0000000 |

All of the terms, including both interaction terms, appear to be statistically significant. In fact, if we have a look at the fitted values versus the residuals, it looks much better than before:

lm(mpg ~ weight*year + cylinders*displacement, auto) %>% augment() %>% ggplot(aes(.fitted, .resid)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

f)

Try different transformations of the variables and comment on the findings.

We try a few different transformations and pipe them through to the fitted versus residuals graph:

lm(mpg ~ sqrt(horsepower), auto) %>% augment() %>% ggplot() + geom_point(aes(.fitted, .resid))

lm(mpg ~ log(horsepower), auto) %>% augment() %>% ggplot() + geom_point(aes(.fitted, .resid))

lm(1/mpg ~ horsepower, auto) %>% augment() %>% ggplot() + geom_point(aes(.fitted, .resid))

lm(1/mpg ~ horsepower + weight*year, auto) %>% augment() %>% ggplot() + geom_point(aes(.fitted, .resid))

lm(1/mpg ~ horsepower + weight*year, auto) %>% 
    glance() %>%
    kable() %>%
    kable_styling()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.8830842 | 0.8818758 | 0.0057188 | 730.7686 | 0 | 5 | 1470.576 | -2929.152 | -2905.324 | 0.0126569 | 387 |

The last one - regressing 1/mpg on horsepower plus a weight and year interaction - looks quite good, with an R^2 of about 0.88.

10) Carseats Data Set

a)

Fit a multiple regression model to predict Sales using Price, Urban, and US.

carseats <- as_tibble(Carseats)
cs_regress <- lm(Sales ~ Price + Urban + US, carseats)
cs_regress %>% 
    tidy() %>%
    kable() %>%
    kable_styling()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 13.0434689 | 0.6510122 | 20.0356737 | 0.0000000 |
| Price | -0.0544588 | 0.0052419 | -10.3892320 | 0.0000000 |
| UrbanYes | -0.0219162 | 0.2716503 | -0.0806778 | 0.9357389 |
| USYes | 1.2005727 | 0.2590415 | 4.6346731 | 0.0000049 |

b)

Provide an interpretation of each coefficient in the model.

* (Intercept) - the expected sales when Price is zero and the store is neither urban nor in the US; it mainly acts as the model's baseline.
* Price - the regression indicates a relationship between price and sales, given the low p-value of the t-statistic. An increase in price of a dollar is associated with a decrease of roughly 54 carseats sold (Sales is recorded in thousands of units).
* UrbanYes - given the high p-value, there doesn't appear to be a relationship between sales and whether a store is urban.
* USYes - given the low p-value, a store being in the US is associated with roughly 1,200 more carseats sold.

c)

Write out the model in equation form, being careful to handle the qualitative variables properly.

Sales = beta_0 + beta_1 * Price + beta_2 * UrbanYes + beta_3 * USYes, where UrbanYes = 1 if Urban = Yes (0 otherwise) and USYes = 1 if US = Yes (0 otherwise). With the fitted coefficients this is Sales = 13.04 - 0.054 * Price - 0.022 * UrbanYes + 1.20 * USYes.
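
As a small sketch of how R encodes the qualitative variables, model.matrix() shows the 0/1 dummy columns (UrbanYes and USYes) used in the fitted model:

# Sketch: the design matrix with the UrbanYes and USYes dummy variables.
model.matrix(cs_regress) %>% head()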

d)

For which of the predictors can you reject the null hypothesis H 0 : βj = 0?

The null hypothesis can be rejected for Price and US.

e)

On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

cs_regress_reduced <- lm(Sales ~ Price + US, carseats)
cs_regress_reduced %>% 
    tidy() %>%
    kable() %>%
    kable_styling()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 13.0307928 | 0.6309763 | 20.651794 | 0.0e+00 |
| Price | -0.0544776 | 0.0052301 | -10.416123 | 0.0e+00 |
| USYes | 1.1996429 | 0.2584610 | 4.641485 | 4.7e-06 |

f)

How well do the models in (a) and (e) fit the data?

We look at the F-statistic and its associated p-value, along with the R-squared values, to determine how well the models fit:

cs_regress %>% 
    glance() %>%
    kable() %>%
    kable_styling()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.2392754 | 0.2335123 | 2.472492 | 41.51877 | 0 | 4 | -927.656 | 1865.312 | 1885.269 | 2420.835 | 396 |
cs_regress_reduced %>% 
    glance() %>%
    kable() %>%
    kable_styling()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.2392629 | 0.2354305 | 2.469397 | 62.43114 | 0 | 3 | -927.6593 | 1863.319 | 1879.285 | 2420.874 | 397 |

We see essentially no change in the R-squared value - both models explain only around 24% of the variance in Sales - but the adjusted R-squared increases slightly and the F-statistic increases for the reduced model.
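
As a sketch (not in the original), we can also compare the two nested models directly with an F-test:

# Sketch: formal F-test comparing the reduced model to the full model.
anova(cs_regress_reduced, cs_regress)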

g)

Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).

cs_regress_reduced %>% 
    confint()%>%
    kable() %>%
    kable_styling()
| | 2.5 % | 97.5 % |
|---|---|---|
| (Intercept) | 11.7903202 | 14.2712653 |
| Price | -0.0647598 | -0.0441954 |
| USYes | 0.6915196 | 1.7077663 |

h)

Is there evidence of outliers or high leverage observations in the model from (e)?

cs_regress_reduced %>% augment() %>% ggplot(aes(.fitted, .resid)) + geom_point()

cs_regress_reduced %>% augment() %>% ggplot(aes(.hat, .resid)) + geom_point()

There doesn't appear to be any non-linearity in the data, the residuals show no extreme outliers, and the high-leverage points don't appear to affect the fit substantially.

11) T-statistic and null-hypothesis

In this problem we will investigate the t-statistic for the null hypothesis H 0 : β = 0 in simple linear regression without an intercept. To begin, we generate a predictor x and a response y

set.seed(1)
linear_response <- tibble(x = rnorm(100), y = 2 * x + rnorm(100))
linear_response %>% ggplot(aes(x,y)) + geom_point()

a)

Perform a simple linear regression of y onto x , without an intercept. Report the coefficient estimate β̂, the standard error of this coefficient estimate, and the t-statistic and p-value associated with the null hypothesis H 0 : β = 0. Comment on these results.

lr_reg <- lm(y ~ x + 0, linear_response)
lr_reg %>% 
    tidy() %>%
    kable() %>%
    kable_styling()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| x | 1.993876 | 0.1064767 | 18.72593 | 0 |

The coefficient estimate is 1.99 - very close to the 2 that we used to generate the points. The standard error is 0.106, the t-statistic is 18.7 and the p-value is 2.64e-34. The t-statistic tells us how many standard errors the coefficient estimate is away from 0. The p-value gives us the probability of observing a t-statistic at least this extreme if the null hypothesis - that the coefficient is 0 - were true; here it is effectively zero, so we reject the null hypothesis.
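
As a quick sketch, the t-statistic is simply the coefficient estimate divided by its standard error:

# Sketch: reproduce the t-statistic reported by tidy().
tidy(lr_reg) %>% mutate(t_check = estimate / std.error)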

b)

Now perform a simple linear regression of x onto y without an intercept, and report on the same metrics as a)

lr_reg_reverse <- lm(x ~ y + 0, linear_response)
lr_reg_reverse %>% 
    tidy() %>%
    kable() %>%
    kable_styling()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| y | 0.3911145 | 0.0208863 | 18.72593 | 0 |

The coefficient estimate and the standard error have changed, which is as expected. The t-statistic and p-value remain the same.

c)

What is the relationship between the results obtained in (a) and (b)?

Both regressions describe the same underlying relationship between x and y, just with the roles of predictor and response swapped. The coefficient estimates and standard errors differ, but the t-statistic - and therefore the p-value - for the null hypothesis is identical in both.
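
As a sketch of why this happens (this uses the closed-form expression from the ISLR exercise rather than anything computed in the original write-up), the t-statistic for the no-intercept regression can be written in a form that is symmetric in x and y, so it is the same whichever variable plays the role of the response:

# Sketch: t = sqrt(n-1) * sum(x*y) / sqrt(sum(x^2) * sum(y^2) - sum(x*y)^2),
# which is unchanged if x and y are swapped; it reproduces the shared
# t-statistic of about 18.7.
with(linear_response, {
    n <- length(x)
    sqrt(n - 1) * sum(x * y) / sqrt(sum(x^2) * sum(y^2) - sum(x * y)^2)
})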

12) Simple Linear Regression

a)

Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

When the sum of the squares of the y observations is the same as the sum of the squares of the x observations.

b)

Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

set.seed(1)
tibble(x = rnorm(100), y = 4*x) %>% lm(y ~ x, .) %>% tidy()
## Warning in summary.lm(x): essentially perfect fit: summary may be
## unreliable
## # A tibble: 2 x 5
##   term         estimate std.error statistic   p.value
##   <chr>           <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept) -8.88e-17  2.19e-17  -4.06e 0 0.0000999
## 2 x            4.00e+ 0  2.43e-17   1.64e17 0
set.seed(1)
tibble(x = rnorm(100), y = 4*x) %>% lm(x ~ y, .) %>% tidy()
## Warning in summary.lm(x): essentially perfect fit: summary may be
## unreliable
## # A tibble: 2 x 5
##   term         estimate std.error statistic   p.value
##   <chr>           <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept) -2.22e-17  5.47e-18  -4.06e 0 0.0000999
## 2 y            2.50e- 1  1.52e-18   1.64e17 0

c)

Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

To have the same coefficients, we need the sum of the squares to be the same. To do this, we generate 100 random values for X, and we use the same values for Y. However we re-order the values so we don’t simply get a y = x function. We can use the sample_n() function for this:

set.seed(1)
sample_data <- tibble(x = rnorm(100)) %>% 
    mutate(y = (sample_n(., 100) %>% .[['x']]))
lm(y ~ x, sample_data) %>% tidy()
## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)  0.108      0.0909    1.19     0.237
## 2 x            0.00695    0.101     0.0688   0.945
lm(x ~ y, sample_data) %>% tidy()
## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)  0.108      0.0909    1.19     0.237
## 2 y            0.00695    0.101     0.0688   0.945

13) Simulated Linear Regressions

a)

Using the rnorm() function, create a vector, x , containing 100 observations drawn from a N (0, 1) distribution. This represents a feature, X.

x <- rnorm(100)

b)

Using the rnorm() function, create a vector, eps , containing 100 observations drawn from a N (0, 0.25) distribution

eps <- rnorm(100, 0, .25)

c)

Using x and eps , generate a vector y according to the model Y = −1 + 0.5X + e. What is the length of the vector y ? What are the values of β 0 and β 1 in this linear model?

y <- -1 + .5 * x + eps

The vector y has a length of 100. The beta_0 is -1, and the beta_1 is 0.5.

d)

Create a scatterplot displaying the relationship between x and y. Comment on what you observe.

simulated <- tibble(x = x, y = y) 
simulated %>% ggplot(aes(x,y)) + geom_point()

We can see an approximate linear relationship between x and y.

e)

Fit a least squares linear model to predict y using x . Comment on the model obtained. How do β̂ 0 and β̂ 1 compare to β 0 and β 1 ?

lm(y ~ x, simulated) %>%
    tidy() %>%
    kable() %>%
    kable_styling()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.9748778 | 0.0245031 | -39.78592 | 0 |
| x | 0.4953323 | 0.0234770 | 21.09861 | 0 |

We see a beta_0 of -0.975 and a beta_1 of 0.495 - very close to the true values of -1 and 0.5. The standard errors are low, at 0.0245 and 0.0235 respectively.

f)

Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different color. Create a legend.

simulated %>% ggplot(aes(x,y)) + 
    geom_point() + 
    geom_smooth(aes(colour = 'Regression'), method = 'lm', formula = 'y ~ x') + 
    geom_abline(aes(slope = .5, intercept = -1, colour = 'Population Mean'), size = 1) +
    labs(colour = "Lines")

g)

Now fit a polynomial regression model that predicts y using x and x^2 . Is there evidence that the quadratic term improves the model fit? Explain your answer.

lm(y ~ poly(x,2), simulated) %>%
    tidy() %>%
    kable() %>%
    kable_styling()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.9635947 | 0.0246226 | -39.1345311 | 0.0000000 |
| poly(x, 2)1 | 5.1685804 | 0.2462262 | 20.9911878 | 0.0000000 |
| poly(x, 2)2 | -0.0166596 | 0.2462262 | -0.0676599 | 0.9461957 |
lm(y ~ poly(x,2), simulated) %>%
    glance() %>%
    kable() %>%
    kable_styling()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.81958 | 0.81586 | 0.2462262 | 220.3173 | 0 | 3 | -0.2204281 | 8.440856 | 18.86154 | 5.880852 | 97 |

Looking at the tidy() output, the quadratic term's p-value is about 0.95, so there is no evidence that it improves the fit. The R^2 is essentially unchanged and the RSE is slightly higher, while the F-statistic decreases noticeably - as expected when an uninformative term is added to the model.
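
As a sketch (not part of the original output), an explicit F-test comparing the two nested fits makes the same point:

# Sketch: compare the linear and quadratic fits directly.
anova(lm(y ~ x, simulated), lm(y ~ poly(x, 2), simulated))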

h) & i)

Repeat (a)–(f) after modifying the data generation process in such a way that there is less noise in the data.

Let's create a function to do this, with an argument controlling how much noise is in the data.

simulated_linear <- function(observations, mean, noise) {
    x <- rnorm(observations)
    eps <- rnorm(observations, mean, noise)
    y <- -1 + .5 * x + eps
    return(tibble(x = x, y = y))
}

set.seed(1)
low_and_high_noise <- bind_cols( simulated_linear(100, 0, .1), simulated_linear(100, 0, .5))
lm(y ~ x, low_and_high_noise) %>% 
    glance() %>%
    kable() %>%
    kable_styling()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.9564698 | 0.9560256 | 0.0962753 | 2153.309 | 0 | 2 | 93.1706 | -180.3412 | -172.5257 | 0.9083561 | 98 |
lm(y ~ x, low_and_high_noise) %>% 
    tidy() %>%
    kable() %>%
    kable_styling()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -1.003769 | 0.0096987 | -103.49492 | 0 |
| x | 0.499894 | 0.0107727 | 46.40376 | 0 |
lm(y1 ~ x1, low_and_high_noise) %>% 
    tidy() %>%
    kable() %>%
    kable_styling()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.9757750 | 0.0495499 | -19.69276 | 0 |
| x1 | 0.5531092 | 0.0481297 | 11.49206 | 0 |
lm(y1 ~ x1, low_and_high_noise) %>% 
    glance() %>%
    kable() %>%
    kable_styling()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.574038 | 0.5696915 | 0.4952935 | 132.0675 | 0 | 2 | -70.62324 | 147.2465 | 155.062 | 24.04093 | 98 |

Looking at the values, the low-noise regression recovers almost exactly the coefficients used to generate the data. The R^2 is near 1, with a very large F-statistic.

The higher-noise regression, as expected, has a much lower F-statistic and larger standard errors for the coefficients, and the R^2 drops to about 0.57.

low_and_high_noise %>% 
    ggplot() + 
    geom_point(aes(x, y), colour = 'red') + 
    geom_point(aes(x1, y1), colour = 'blue') + 
    geom_smooth(aes(x, y), method = 'lm', formula = 'y ~ x', colour = 'red') + 
    geom_smooth(aes(x1, y1), method = 'lm', formula = 'y ~ x', colour = 'blue')

j)

**What are the confidence intervals of the data sets? Comment on the results.**

lm(y ~ x, simulated) %>% 
    confint() %>%
    kable() %>%
    kable_styling()
| | 2.5 % | 97.5 % |
|---|---|---|
| (Intercept) | -1.023503 | -0.9262522 |
| x | 0.448743 | 0.5419217 |
lm(y ~ x, low_and_high_noise) %>% 
    confint() %>%
    kable() %>%
    kable_styling()
| | 2.5 % | 97.5 % |
|---|---|---|
| (Intercept) | -1.0230161 | -0.9845224 |
| x | 0.4785159 | 0.5212720 |
lm(y1 ~ x1, low_and_high_noise) %>% 
    confint() %>%
    kable() %>%
    kable_styling()
| | 2.5 % | 97.5 % |
|---|---|---|
| (Intercept) | -1.0741052 | -0.8774448 |
| x1 | 0.4575975 | 0.6486210 |

As the noise decreases, the confidence intervals tighten around the true values of -1 and 0.5; with more noise the intervals are noticeably wider, although they still contain the true values.

14) Collinearity Problem

a)

Set up the data

set.seed(1)
colin_data <- tibble(
    x1 = runif(100), 
    x2 = 0.5 * x1 + rnorm(100)/10, 
    y = 2 + 2*x1 + 0.3*x2 + rnorm(100)
)

The beta_{0,1,2} coefficients of the model are (2,2,0.3) respectively.

b)

What is the correlation between x1 and x2 ? Create a scatterplot displaying the relationship between the variables.

colin_data %>% 
    dplyr::select(x1, x2) %>% 
    cor() %>%
    kable() %>%
    kable_styling()
| | x1 | x2 |
|---|---|---|
| x1 | 1.0000000 | 0.8351212 |
| x2 | 0.8351212 | 1.0000000 |
colin_data %>% ggplot(aes(x1, x2)) + geom_point()

c)

Using this data, fit a least squares regression to predict y using x1 and x2

colin_data_reg <- lm(y ~ x1 + x2, colin_data)
colin_data_reg %>% 
    tidy() %>%
    kable() %>%
    kable_styling()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 2.130500 | 0.2318817 | 9.1878742 | 0.0000000 |
| x1 | 1.439555 | 0.7211795 | 1.9961126 | 0.0487252 |
| x2 | 1.009674 | 1.1337225 | 0.8905831 | 0.3753565 |
colin_data_reg %>% 
    glance() %>%
    kable() %>%
    kable_styling()
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.2088293 | 0.1925165 | 1.056178 | 12.80156 | 1.16e-05 | 3 | -145.8366 | 299.6731 | 310.0938 | 108.2046 | 97 |

Can the null hypotheses beta_1 = 0 and beta_2 = 0 be rejected?

Let's take the general view that a p-value below 0.05 is statistically significant. With this in mind, we can reject the null hypothesis for x1, but not for x2.
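
As a sketch (this assumes the car package is available; it isn't loaded anywhere above), the variance inflation factors quantify how much the collinearity inflates the coefficient standard errors:

# Sketch: VIFs for the two-predictor model; values well above 1 indicate
# that collinearity is inflating the standard errors.
car::vif(colin_data_reg)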

d) and e)

Now fit a least squares regression to predict y using only x1, with only x2. Can we reject the null hypothesis for either of these?

colin_data_reg_x1 <- lm(y ~ x1, colin_data)
colin_data_reg_x2 <- lm(y ~ x2, colin_data)
colin_data_reg_x1 %>% 
    tidy() %>%
    kable() %>%
    kable_styling()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 2.112394 | 0.2307448 | 9.154675 | 0.0e+00 |
| x1 | 1.975929 | 0.3962774 | 4.986227 | 2.7e-06 |
colin_data_reg_x2 %>% 
    tidy() %>%
    kable() %>%
    kable_styling()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 2.389949 | 0.1949307 | 12.260508 | 0.00e+00 |
| x2 | 2.899585 | 0.6330467 | 4.580365 | 1.37e-05 |

In both instances we have a p-value below 0.05, and thus we can reject the null hypothesis in both cases.

f)

Do the results obtained in (c)–(e) contradict each other? Explain your answer.

The results don't contradict each other. When regressing on two predictors that are collinear, it can be difficult to separate out their individual effects on the response, and the power of the hypothesis test is reduced (see ISL 3.36).

g)

Now suppose we obtain one additional observation, which was unfortunately mismeasured. Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation have on the each of the models? In each model, is this observation an outlier? A high-leverage point? Both? Explain your answers

We add the new observation and calculate the new regressions.

colin_data_add <- colin_data %>% add_row(x1 = 0.1, x2 = 0.8, y = 6)

colin_data_reg_both <- lm(y ~ x1 + x2, colin_data_add)
colin_data_reg_x1 <- lm(y ~ x1, colin_data_add)
colin_data_reg_x2 <- lm(y ~ x2, colin_data_add)

We recall from 3.4 that an outlier is a point far from the value predicted by the model. We can look at the residuals, but it can be difficult to decide how large a residual needs to be before it's classified as an outlier. This is where studentised residuals are used: each residual e_i is divided by its estimated standard error. Points whose studentised residuals (the .std.resid column in the augment() output below) exceed 3 in absolute value are possible outliers.

colin_data_reg_both %>% 
    augment() %>%
    filter(y == 6) %>%
    kable() %>%
    kable_styling()
| y | x1 | x2 | .fitted | .se.fit | .resid | .hat | .sigma | .cooksd | .std.resid |
|---|---|---|---|---|---|---|---|---|---|
| 6 | 0.1 | 0.8 | 4.292291 | 0.6920985 | 1.707709 | 0.4147284 | 1.056178 | 1.019021 | 2.07706 |
colin_data_reg_x1 %>%
    augment() %>%
    filter(y == 6) %>%
    kable() %>%
    kable_styling()
| y | x1 | .fitted | .se.fit | .resid | .hat | .sigma | .cooksd | .std.resid |
|---|---|---|---|---|---|---|---|---|
| 6 | 0.1 | 2.433497 | 0.2033033 | 3.566503 | 0.0334716 | 1.055063 | 0.1845392 | 3.264591 |
colin_data_reg_x2 %>%
    augment() %>%
    filter(y == 6) %>%
    kable() %>%
    kable_styling()
| y | x2 | .fitted | .se.fit | .resid | .hat | .sigma | .cooksd | .std.resid |
|---|---|---|---|---|---|---|---|---|
| 6 | 0.8 | 4.840347 | 0.3417746 | 1.159653 | 0.1013106 | 1.07214 | 0.0731539 | 1.139229 |

Only in the regression with x1 as the predictor could the response be considered an outlier.
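
As a small sketch confirming this, we can filter directly for standardised residuals beyond 3 in absolute value:

# Sketch: observations flagged as possible outliers in the x1-only regression.
colin_data_reg_x1 %>% augment() %>% filter(abs(.std.resid) > 3)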

Points with high leverage have an unusual x_i value, and tend to have a significant impact on the estimated regression line. The leverage statistic is used to quantify this: the .hat column gives us the leverage statistic, and we plot it against .std.resid. When both x1 and x2 are used in the regression, the new point appears to be a high-leverage point. In the x1 regression it is not a high-leverage point, and in the x2 regression it has some leverage.

colin_data_reg_both %>% augment() %>% ggplot() + geom_point(aes(.hat, .std.resid))

colin_data_reg_x1 %>% augment() %>% ggplot() + geom_point(aes(.hat, .std.resid))

colin_data_reg_x2 %>% augment() %>% ggplot() + geom_point(aes(.hat, .std.resid))

15) Boston Data Set

We will now try to predict per capita crime rate using the other variables in this data set.

a)

For each predictor, fit a simple linear regression model to predict the response. Describe your results. In which of the models is there a statistically significant association between the predictor and the response? Create some plots to back up your assertions.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
boston <- as_tibble(Boston)

boston_regress <- tibble(predictor = names(boston)[-1]) %>% 
    mutate(fit = map(predictor, ~ lm(paste('crim ~', .x), boston) %>% tidy())) %>% 
    unnest(fit) %>% 
    filter(term != '(Intercept)')
boston_regress %>% 
    print(n = 26) %>%
    kable() %>%
    kable_styling()
## # A tibble: 13 x 6
##    predictor term    estimate std.error statistic  p.value
##    <chr>     <chr>      <dbl>     <dbl>     <dbl>    <dbl>
##  1 zn        zn       -0.0739   0.0161      -4.59 5.51e- 6
##  2 indus     indus     0.510    0.0510       9.99 1.45e-21
##  3 chas      chas     -1.89     1.51        -1.26 2.09e- 1
##  4 nox       nox      31.2      3.00        10.4  3.75e-23
##  5 rm        rm       -2.68     0.532       -5.04 6.35e- 7
##  6 age       age       0.108    0.0127       8.46 2.85e-16
##  7 dis       dis      -1.55     0.168       -9.21 8.52e-19
##  8 rad       rad       0.618    0.0343      18.0  2.69e-56
##  9 tax       tax       0.0297   0.00185     16.1  2.36e-47
## 10 ptratio   ptratio   1.15     0.169        6.80 2.94e-11
## 11 black     black    -0.0363   0.00387     -9.37 2.49e-19
## 12 lstat     lstat     0.549    0.0478      11.5  2.65e-27
## 13 medv      medv     -0.363    0.0384      -9.46 1.17e-19
| predictor | term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|---|
| zn | zn | -0.0739350 | 0.0160946 | -4.593776 | 0.0000055 |
| indus | indus | 0.5097763 | 0.0510243 | 9.990848 | 0.0000000 |
| chas | chas | -1.8927766 | 1.5061155 | -1.256727 | 0.2094345 |
| nox | nox | 31.2485312 | 2.9991904 | 10.418989 | 0.0000000 |
| rm | rm | -2.6840512 | 0.5320411 | -5.044820 | 0.0000006 |
| age | age | 0.1077862 | 0.0127364 | 8.462825 | 0.0000000 |
| dis | dis | -1.5509017 | 0.1683300 | -9.213458 | 0.0000000 |
| rad | rad | 0.6179109 | 0.0343318 | 17.998199 | 0.0000000 |
| tax | tax | 0.0297423 | 0.0018474 | 16.099388 | 0.0000000 |
| ptratio | ptratio | 1.1519828 | 0.1693736 | 6.801430 | 0.0000000 |
| black | black | -0.0362796 | 0.0038732 | -9.366951 | 0.0000000 |
| lstat | lstat | 0.5488048 | 0.0477610 | 11.490654 | 0.0000000 |
| medv | medv | -0.3631599 | 0.0383902 | -9.459710 | 0.0000000 |

b)

Fit a multiple regression model to predict the response using all of the predictors. Describe your results. For which predictors can we reject the null hypothesis H 0 : β j = 0?

boston_mult_regress <- lm(crim ~ ., boston) %>%
    tidy() %>%
    arrange(p.value)

boston_mult_regress %>%
    kable() %>%
    kable_styling()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| rad | 0.5882086 | 0.0880493 | 6.6804480 | 0.0000000 |
| dis | -0.9871757 | 0.2818173 | -3.5028930 | 0.0005022 |
| medv | -0.1988868 | 0.0605160 | -3.2865169 | 0.0010868 |
| zn | 0.0448552 | 0.0187341 | 2.3943122 | 0.0170249 |
| (Intercept) | 17.0332275 | 7.2349030 | 2.3543132 | 0.0189491 |
| black | -0.0075375 | 0.0036733 | -2.0519589 | 0.0407023 |
| nox | -10.3135349 | 5.2755363 | -1.9549737 | 0.0511520 |
| lstat | 0.1262114 | 0.0757248 | 1.6667104 | 0.0962084 |
| ptratio | -0.2710806 | 0.1864505 | -1.4539010 | 0.1466113 |
| indus | -0.0638548 | 0.0834072 | -0.7655789 | 0.4442940 |
| tax | -0.0037800 | 0.0051556 | -0.7331884 | 0.4637927 |
| rm | 0.4301305 | 0.6128303 | 0.7018754 | 0.4830888 |
| chas | -0.7491336 | 1.1801468 | -0.6347800 | 0.5258670 |
| age | 0.0014516 | 0.0179251 | 0.0809837 | 0.9354878 |

From the results of the regression with all predictors, taking p < 0.05 to be statistically significant, rad (index of accessibility to radial highways), dis (weighted mean of distances to five Boston employment centres), medv (median value of owner-occupied homes in $1000s), zn (proportion of residential land zoned for lots over 25,000 sq.ft), and black (1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town) appear to have a statistically significant association with the response.
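
As a quick sketch, the same set of predictors can be pulled out programmatically:

# Sketch: predictors whose coefficients are significant at the 5% level.
boston_mult_regress %>% filter(p.value < 0.05, term != '(Intercept)')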

c)

How do your results from (a) compare to your results from (b)? Create a plot displaying the univariate regression coefficients from (a) on the x-axis, and the multiple regression coefficients from (b) on the y-axis. That is, each predictor is displayed as a single point in the plot. Its coefficient in a simple linear regression model is shown on the x-axis, and its coefficient estimate in the multiple linear regression model is shown on the y-axis.

boston_regress %>% 
    inner_join(boston_mult_regress, by = 'term') %>% 
    ggplot(aes(estimate.x, estimate.y)) + geom_point()