Use the lm()
function to perform a simple linear regression with mpg
as the response and horsepower
as the predictor. Print the results and comment on the output.
auto <- as_tibble(Auto)
lm_auto <- auto %>% lm(mpg ~ horsepower, .)
lm_auto %>%
tidy() %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 39.9358610 | 0.7174987 | 55.65984 | 0 |
horsepower | -0.1578447 | 0.0064455 | -24.48914 | 0 |
Is there a relationship between the predictor and the response?
To determine whether there is a relationship, we need to look at three things: the p-values for the coefficients, the F-statistic, and the p-value for the F-statistic:
lm_auto %>%
tidy() %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 39.9358610 | 0.7174987 | 55.65984 | 0 |
horsepower | -0.1578447 | 0.0064455 | -24.48914 | 0 |
lm_auto %>%
glance() %>%
kable() %>%
kable_styling()
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.6059483 | 0.6049379 | 4.905757 | 599.7177 | 0 | 2 | -1178.662 | 2363.324 | 2375.237 | 9385.916 | 390 |
In the tidy() output we can see that the p-values for the intercept and the horsepower coefficient are both very small, indicating strong evidence against the null hypothesis of no relationship.
In the glance()
output, we see the statistic
column (the F-statistic) is high at around 600, with a small p-value. This indicates that there is a relationship.
How strong is the relationship?
To assess how strong the relationship is, we can look at the residual standard error (RSE) and the R^2 value.
We look at the glance()
output again and see that the R^2 value is 0.606. Recall that R^2 is a value between 0 and 1 giving the 'proportion of the variance explained'. Together with an RSE (the sigma column) of about 4.9 mpg, we can consider the relationship reasonably strong.
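As a rough check of the error on the scale of the response (a sketch; it simply relates the RSE to the mean of mpg and is not computed elsewhere in this write-up):
# Sketch: express the residual standard error (sigma from glance()) as a
# fraction of the mean response to get a rough relative error.
glance(lm_auto) %>%
  transmute(rse = sigma, relative_error = sigma / mean(auto$mpg))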
Is the relationship positive or negative? The coefficient on horsepower is negative, so miles per gallon goes down as horsepower goes up; each additional unit of horsepower lowers the predicted mpg by about 0.16. This aligns with intuition.
What is the predicted mpg
associated with a horsepower
of 98? What are the associated 95% confidence and prediction intervals?
We use the interval
argument to the predict function. Note that the default level
argument of predict()
is 0.95 (95%).
predict(lm_auto, tibble(horsepower = 98), interval = 'confidence') %>%
kable() %>%
kable_styling()
fit | lwr | upr |
---|---|---|
24.46708 | 23.97308 | 24.96108 |
predict(lm_auto, tibble(horsepower = 98), interval = 'prediction') %>%
kable() %>%
kable_styling()
fit | lwr | upr |
---|---|---|
24.46708 | 14.8094 | 34.12476 |
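If a different level is wanted, it can be set explicitly (a sketch; a 90% confidence interval is shown here, which is not computed elsewhere in this document):
# Sketch: the same prediction with a 90% confidence interval.
predict(lm_auto, tibble(horsepower = 98), interval = 'confidence', level = 0.90)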
Plot the response and the predictor. Display the least squares regression line.
auto %>% ggplot(aes(horsepower, mpg)) + geom_point() + geom_smooth(method = 'lm', formula = 'y ~ x')
Produce diagnostic plots (residuals vs leverage, residuals vs fitted, fitted vs standardised residuals) and comment on any problems.
First off, let's have a look at the residuals versus the leverage:
augment(lm_auto) %>% ggplot(aes(.hat, .resid)) + geom_point()
There are a few points up in the top right. We take a look at the Cook’s distance for the observations.
augment(lm_auto) %>% mutate(i = 1:n()) %>% ggplot(aes(i, .cooksd)) + geom_bar(stat = 'identity')
A couple of high points but all below 1.
Now we look at the fitted versus the residuals, and also fit a quadratic regression. We see a bit of a U shape, indicating potential non-linearity in the data.
augment(lm_auto) %>% ggplot(aes(.fitted, .resid)) + geom_point() + geom_smooth(method = 'lm', formula = 'y ~ poly(x,2)')
library(GGally)
library(corrplot)
Produce a scatterplot matrix which includes all the data in the data set
auto %>% dplyr::select(-name) %>% ggpairs()
Compute the matrix of correlations between the variables.
auto %>% dplyr::select(-name) %>% cor() %>% corrplot(method = 'color')
Perform a multiple linear regression with mpg
as the response and all other variables except name
as the predictors.
lin_reg_auto <- lm(mpg ~ . -name, auto)
tidy(lin_reg_auto) %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -17.2184346 | 4.6442941 | -3.707438 | 0.0002402 |
cylinders | -0.4933763 | 0.3232823 | -1.526147 | 0.1277965 |
displacement | 0.0198956 | 0.0075151 | 2.647430 | 0.0084446 |
horsepower | -0.0169511 | 0.0137869 | -1.229512 | 0.2196328 |
weight | -0.0064740 | 0.0006520 | -9.928787 | 0.0000000 |
acceleration | 0.0805758 | 0.0988450 | 0.815174 | 0.4154780 |
year | 0.7507727 | 0.0509731 | 14.728795 | 0.0000000 |
origin | 1.4261405 | 0.2781361 | 5.127492 | 0.0000005 |
glance(lin_reg_auto) %>%
kable() %>%
kable_styling()
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.8214781 | 0.8182238 | 3.327682 | 252.428 | 0 | 8 | -1023.475 | 2064.949 | 2100.691 | 4252.213 | 384 |
Is there a relationship between the predictors and the response? We test the null hypothesis that all of the regression coefficients are zero. The F-statistic is 252 (far greater than 1) and has a p-value of about 2e-139, indicating a very low probability that this result is due to chance. We can therefore say there is a relationship between the predictors and the response.
We look at the p-values for each of the predictors. The predictors with statistically significant coefficients - i.e. those that appear to have an effect on mpg, holding all others constant - are weight, year, displacement and origin.
What does the coefficient for the year variable suggest? The year coefficient suggests that a car's mpg increases by about 0.75 mpg for each additional model year, holding the other predictors constant; in other words, newer cars are more fuel efficient.
Plot the diagnostic plots of the regression and comment on any problems with the fit.
We go through our usual plots - first off is looking at the residuals versus the leverage:
augment(lin_reg_auto) %>% mutate(i = 1:n()) %>% ggplot(aes(i, .cooksd)) + geom_bar(stat = 'identity')
We don’t see values with a significant Cook’s distance. We move on to the fitted versus the residuals:
augment(lin_reg_auto) %>% ggplot(aes(.fitted, .resid)) + geom_point() + geom_smooth(method = 'lm', formula = 'y~poly(x,2)')
There is some evidence of non-linearity in the residuals.
**Use the '*' and ':' symbols to fit linear regressions with interaction effects. Are any interactions statistically significant?**
A '*' adds the main effects and the interaction term, whereas a ':' adds only the interaction term, i.e. x*y == x + y + x:y.
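A quick sketch to confirm the equivalence (these two calls are an illustration, not part of the exercise):
# Sketch: `*` expands to the main effects plus the interaction, so these two
# formulas fit identical models.
lm(mpg ~ weight * year, auto) %>% tidy()
lm(mpg ~ weight + year + weight:year, auto) %>% tidy()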
Let's have a think about potential interactions. I think weight and year could interact, given the changes in materials over time, and there could also be an interaction between cylinders and displacement:
lm(mpg ~ weight*year + cylinders*displacement, auto) %>%
tidy() %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -81.3476273 | 13.8781079 | -5.861579 | 0.0000000 |
weight | 0.0202479 | 0.0046727 | 4.333191 | 0.0000188 |
year | 1.7535227 | 0.1755592 | 9.988211 | 0.0000000 |
cylinders | -1.4865200 | 0.4002062 | -3.714385 | 0.0002338 |
displacement | -0.0698473 | 0.0127015 | -5.499162 | 0.0000001 |
weight:year | -0.0003469 | 0.0000618 | -5.612385 | 0.0000000 |
cylinders:displacement | 0.0090185 | 0.0015597 | 5.782327 | 0.0000000 |
All of the terms, including both interactions, are statistically significant, with very small p-values. In fact, if we have a look at the fitted versus residuals plot, it looks much better than before:
lm(mpg ~ weight*year + cylinders*displacement, auto) %>% augment() %>% ggplot(aes(.fitted, .resid)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Try different transformations of the variables and comment on the findings.
We try a few different transformations and pipe them through to the fitted versus residuals graph:
lm(mpg ~ sqrt(horsepower), auto) %>% augment() %>% ggplot() + geom_point(aes(.fitted, .resid))
lm(mpg ~ log(horsepower), auto) %>% augment() %>% ggplot() + geom_point(aes(.fitted, .resid))
lm(1/mpg ~ horsepower, auto) %>% augment() %>% ggplot() + geom_point(aes(.fitted, .resid))
lm(1/mpg ~ horsepower + weight*year, auto) %>% augment() %>% ggplot() + geom_point(aes(.fitted, .resid))
lm(1/mpg ~ horsepower + weight*year, auto) %>%
glance() %>%
kable() %>%
kable_styling()
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.8830842 | 0.8818758 | 0.0057188 | 730.7686 | 0 | 5 | 1470.576 | -2929.152 | -2905.324 | 0.0126569 | 387 |
The last one looks quite good.
Fit a multiple regression model to predict Sales
using Price
, Urban
, and US
.
carseats <- as_tibble(Carseats)
cs_regress <- lm(Sales ~ Price + Urban + US, carseats)
cs_regress %>%
tidy() %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 13.0434689 | 0.6510122 | 20.0356737 | 0.0000000 |
Price | -0.0544588 | 0.0052419 | -10.3892320 | 0.0000000 |
UrbanYes | -0.0219162 | 0.2716503 | -0.0806778 | 0.9357389 |
USYes | 1.2005727 | 0.2590415 | 4.6346731 | 0.0000049 |
Provide an interpretation of each coefficient in the model.
* (Intercept) - the predicted sales (in thousands of units) when the price is zero and the store is neither urban nor in the US; it is the baseline that the other terms adjust.
* Price - the regression indicates a relationship between price and sales, given the low p-value of the t-statistic. An increase in price of one dollar is associated with a decrease in sales of about 54 carseats.
* UrbanYes - given the high p-value, there doesn't appear to be a relationship between sales and whether a store is urban.
* USYes - given the low p-value, a store being in the US is associated with roughly 1,200 more carseats sold.
Write out the model in equation form, being careful to handle the qualitative variables properly.
Sales = 13.04 - 0.054 * Price - 0.022 * Urban + 1.20 * US,
where Urban = 1 if the store is urban and 0 otherwise, and US = 1 if the store is in the US and 0 otherwise.
For which of the predictors can you reject the null hypothesis H0: βj = 0?
The null hypothesis can be rejected for Price
and US
.
On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
cs_regress_reduced <- lm(Sales ~ Price + US, carseats)
cs_regress_reduced %>%
tidy() %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 13.0307928 | 0.6309763 | 20.651794 | 0.0e+00 |
Price | -0.0544776 | 0.0052301 | -10.416123 | 0.0e+00 |
USYes | 1.1996429 | 0.2584610 | 4.641485 | 4.7e-06 |
How well do the models in (a) and (e) fit the data?
We look at the R-squared, the F-statistic, and its associated p-value to see how well the models fit:
cs_regress %>%
glance() %>%
kable() %>%
kable_styling()
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.2392754 | 0.2335123 | 2.472492 | 41.51877 | 0 | 4 | -927.656 | 1865.312 | 1885.269 | 2420.835 | 396 |
cs_regress_reduced %>%
glance() %>%
kable() %>%
kable_styling()
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.2392629 | 0.2354305 | 2.469397 | 62.43114 | 0 | 3 | -927.6593 | 1863.319 | 1879.285 | 2420.874 | 397 |
We see essentially no change in the R-squared value (the adjusted R-squared actually increases slightly), and the F-statistic increases, so dropping Urban costs us nothing in terms of fit.
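As an additional check (a sketch; this comparison is not run elsewhere in the document), the two nested models can be compared directly with anova():
# Sketch: F-test of the reduced model against the full model; the test on the
# dropped Urban term mirrors its individual t-test above.
anova(cs_regress_reduced, cs_regress)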
Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).
cs_regress_reduced %>%
confint()%>%
kable() %>%
kable_styling()
2.5 % | 97.5 % | |
---|---|---|
(Intercept) | 11.7903202 | 14.2712653 |
Price | -0.0647598 | -0.0441954 |
USYes | 0.6915196 | 1.7077663 |
Is there evidence of outliers or high leverage observations in the model from (e)?
cs_regress_reduced %>% augment() %>% ggplot(aes(.fitted, .resid)) + geom_point()
cs_regress_reduced %>% augment() %>% ggplot(aes(.hat, .resid)) + geom_point()
There doesn't appear to be any non-linearity in the data, and the higher-leverage points don't appear to affect the fit substantially.
In this problem we will investigate the t-statistic for the null hypothesis H0: β = 0 in simple linear regression without an intercept. To begin, we generate a predictor x and a response y.
set.seed(1)
linear_response <- tibble(x = rnorm(100), y = 2 * x + rnorm(100))
linear_response %>% ggplot(aes(x,y)) + geom_point()
Perform a simple linear regression of y onto x, without an intercept. Report the coefficient estimate β̂, the standard error of this coefficient estimate, and the t-statistic and p-value associated with the null hypothesis H0: β = 0. Comment on these results.
lr_reg <- lm(y ~ x + 0, linear_response)
lr_reg %>%
tidy() %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
x | 1.993876 | 0.1064767 | 18.72593 | 0 |
The coefficient estimate is 1.99 - very close to the 2 that we used to generate the points. The standard error is 0.106, the t-statistic is 18.7 and the p-value is 2.64e-34. The t-statistic tells us how many standard errors the coefficient estimate is away from 0. The p-value is the probability of observing a t-statistic at least this extreme if the true coefficient were 0.
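As a small sanity check (a sketch, not part of the exercise), the t-statistic can be recomputed from the tidy() output:
# Sketch: the t-statistic is just the estimate divided by its standard error.
lr_reg %>% tidy() %>% mutate(t_check = estimate / std.error)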
Now perform a simple linear regression of x onto y without an intercept, and report on the same metrics as a)
lr_reg_reverse <- lm(x ~ y + 0, linear_response)
lr_reg_reverse %>%
tidy() %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
y | 0.3911145 | 0.0208863 | 18.72593 | 0 |
The coefficient estimate and the standard error have changed, which is as expected. The t-statistic and p-value remain the same.
What is the relationship between the results obtained in (a) and (b)?
Both regressions describe the same line through the same cloud of points, so the t-statistic and p-value are identical even though the coefficient estimates differ.
Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
When the sum of the squares of the y observations is the same as the sum of the squares of the x observations: from (3.38), the coefficient for y onto x is sum(x_i * y_i) / sum(x_i^2), while the coefficient for x onto y is sum(x_i * y_i) / sum(y_i^2), so the two are equal exactly when sum(x_i^2) = sum(y_i^2).
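We can verify the formula numerically on the data from (a) (a sketch; the denominators are the only difference between the two estimates):
# Sketch: compute the no-intercept coefficient (3.38) for both regressions.
with(linear_response, c(
  beta_y_on_x = sum(x * y) / sum(x^2),
  beta_x_on_y = sum(x * y) / sum(y^2)
))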
Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
set.seed(1)
tibble(x = rnorm(100), y = 4*x) %>% lm(y ~ x, .) %>% tidy()
## Warning in summary.lm(x): essentially perfect fit: summary may be
## unreliable
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -8.88e-17 2.19e-17 -4.06e+ 0 0.0000999
## 2 x 4.00e+ 0 2.43e-17 1.64e+17 0
set.seed(1)
tibble(x = rnorm(100), y = 4*x) %>% lm(x ~ y, .) %>% tidy()
## Warning in summary.lm(x): essentially perfect fit: summary may be
## unreliable
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -2.22e-17 5.47e-18 -4.06e+ 0 0.0000999
## 2 y 2.50e- 1 1.52e-18 1.64e+17 0
Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
To have the same coefficients, we need the sum of the squares to be the same. To do this, we generate 100 random values for X
, and we use the same values for Y
. However we re-order the values so we don’t simply get a y = x
function. We can use the sample_n()
function for this:
set.seed(1)
sample_data <- tibble(x = rnorm(100)) %>%
mutate(y = (sample_n(., 100) %>% .[['x']]))
lm(y ~ x, sample_data) %>% tidy()
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.108 0.0909 1.19 0.237
## 2 x 0.00695 0.101 0.0688 0.945
lm(x ~ y, sample_data) %>% tidy()
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.108 0.0909 1.19 0.237
## 2 y 0.00695 0.101 0.0688 0.945
Using the rnorm() function, create a vector, x , containing 100 observations drawn from a N (0, 1) distribution. This represents a feature, X.
x <- rnorm(100)
Using the rnorm()
function, create a vector, eps
, containing 100 observations drawn from a N (0, 0.25) distribution
eps <- rnorm(100, 0, .25)
Using x
and eps
, generate a vector y according to the model Y = −1 + 0.5X + e
. What is the length of the vector y ? What are the values of β 0 and β 1 in this linear model?
y <- -1 + .5 * x + eps
The vector y has length 100, the same as x. The beta_0 is -1, and the beta_1 is 0.5.
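A quick check of the length (a sketch):
# Sketch: confirm that y has 100 elements.
length(y)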
Create a scatterplot displaying the relationship between x and y. Comment on what you observe.
simulated <- tibble(x = x, y = y)
simulated %>% ggplot(aes(x,y)) + geom_point()
We can see an approximate linear relationship between x and y.
Fit a least squares linear model to predict y using x . Comment on the model obtained. How do β̂ 0 and β̂ 1 compare to β 0 and β 1 ?
lm(y ~ x, simulated) %>%
tidy() %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.9748778 | 0.0245031 | -39.78592 | 0 |
x | 0.4953323 | 0.0234770 | 21.09861 | 0 |
We see a beta_0 of -0.975 and a beta_1 of 0.495 - very close to the true values of -1 and 0.5. The standard errors are low at 0.0245 and 0.0235 respectively.
Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different color. Create a legend.
simulated %>% ggplot(aes(x,y)) +
geom_point() +
geom_smooth(aes(colour = 'Regression'), method = 'lm', formula = 'y ~ x') +
geom_abline(aes(slope = .5, intercept = -1, colour = 'Population Mean'), size = 1) +
labs(colour = "Lines")
Now fit a polynomial regression model that predicts y
using x
and x^2
. Is there evidence that the quadratic term improves the model fit? Explain your answer.
lm(y ~ poly(x,2), simulated) %>%
tidy() %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.9635947 | 0.0246226 | -39.1345311 | 0.0000000 |
poly(x, 2)1 | 5.1685804 | 0.2462262 | 20.9911878 | 0.0000000 |
poly(x, 2)2 | -0.0166596 | 0.2462262 | -0.0676599 | 0.9461957 |
lm(y ~ poly(x,2), simulated) %>%
glance() %>%
kable() %>%
kable_styling()
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.81958 | 0.81586 | 0.2462262 | 220.3173 | 0 | 3 | -0.2204281 | 8.440856 | 18.86154 | 5.880852 | 97 |
We can see that the R^2 has barely changed and the RSE is essentially the same once the quadratic term is added, while the F-statistic has decreased. More directly, the coefficient on the quadratic term has a p-value of 0.946, so there is no evidence that it improves the fit.
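A direct comparison of the nested fits (a sketch; this anova() call is an addition to the original write-up) leads to the same conclusion:
# Sketch: F-test of the linear fit against the quadratic fit.
anova(lm(y ~ x, simulated), lm(y ~ poly(x, 2), simulated))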
Repeat (a)–(f) after modifying the data generation process in such a way that there is less noise in the data.
Let's create a function to do this, with an argument controlling how much noise (the standard deviation of the error term) is in the data.
simulated_linear <- function(observations, mean, noise) {
x <- rnorm(observations)
eps <- rnorm(observations, mean, noise)
y <- -1 + .5 * x + eps
return(tibble(x = x, y = y))
}
set.seed(1)
low_and_high_noise <- bind_cols( simulated_linear(100, 0, .1), simulated_linear(100, 0, .5))
lm(y ~ x, low_and_high_noise) %>%
glance() %>%
kable() %>%
kable_styling()
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.9564698 | 0.9560256 | 0.0962753 | 2153.309 | 0 | 2 | 93.1706 | -180.3412 | -172.5257 | 0.9083561 | 98 |
lm(y ~ x, low_and_high_noise) %>%
tidy() %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -1.003769 | 0.0096987 | -103.49492 | 0 |
x | 0.499894 | 0.0107727 | 46.40376 | 0 |
lm(y1 ~ x1, low_and_high_noise) %>%
tidy() %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.9757750 | 0.0495499 | -19.69276 | 0 |
x1 | 0.5531092 | 0.0481297 | 11.49206 | 0 |
lm(y1 ~ x1, low_and_high_noise) %>%
glance() %>%
kable() %>%
kable_styling()
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.574038 | 0.5696915 | 0.4952935 | 132.0675 | 0 | 2 | -70.62324 | 147.2465 | 155.062 | 24.04093 | 98 |
Looking at the values, the low-noise regression recovers coefficients almost identical to those used to generate the data (-1.00 and 0.50), with an R^2 near 1 and a very high F-statistic.
The higher-noise fit, as expected, has a much lower F-statistic (132 versus 2153), larger standard errors on the coefficients, and a lower R^2 (0.57 versus 0.96).
low_and_high_noise %>%
ggplot() +
geom_point(aes(x, y), colour = 'red') +
geom_point(aes(x1, y1), colour = 'blue') +
geom_smooth(aes(x, y), method = 'lm', formula = 'y ~ x', colour = 'red') +
geom_smooth(aes(x1, y1), method = 'lm', formula = 'y ~ x', colour = 'blue')
**What are the confidence intervals of the data sets? Comment on the results**
lm(y ~ x, simulated) %>%
confint() %>%
kable() %>%
kable_styling()
2.5 % | 97.5 % | |
---|---|---|
(Intercept) | -1.023503 | -0.9262522 |
x | 0.448743 | 0.5419217 |
lm(y ~ x, low_and_high_noise) %>%
confint() %>%
kable() %>%
kable_styling()
2.5 % | 97.5 % | |
---|---|---|
(Intercept) | -1.0230161 | -0.9845224 |
x | 0.4785159 | 0.5212720 |
lm(y1 ~ x1, low_and_high_noise) %>%
confint() %>%
kable() %>%
kable_styling()
2.5 % | 97.5 % | |
---|---|---|
(Intercept) | -1.0741052 | -0.8774448 |
x1 | 0.4575975 | 0.6486210 |
The interval widths track the noise level: the low-noise fit gives the narrowest intervals and the high-noise fit the widest, and all three sets of intervals contain the true values of -1 and 0.5.
Set up the data
set.seed(1)
colin_data <- tibble(
x1 = runif(100),
x2 = 0.5 * x1 + rnorm(100)/10,
y = 2 + 2*x1 + 0.3*x2 + rnorm(100)
)
The beta_{0,1,2} coefficients of the model are (2,2,0.3) respectively.
What is the correlation between x1 and x2 ? Create a scatterplot displaying the relationship between the variables.
colin_data %>%
dplyr::select(x1, x2) %>%
cor() %>%
kable() %>%
kable_styling()
x1 | x2 | |
---|---|---|
x1 | 1.0000000 | 0.8351212 |
x2 | 0.8351212 | 1.0000000 |
colin_data %>% ggplot(aes(x1, x2)) + geom_point()
Using this data, fit a least squares regression to predict y using x1 and x2.
colin_data_reg <- lm(y ~ x1 + x2, colin_data)
colin_data_reg %>%
tidy() %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 2.130500 | 0.2318817 | 9.1878742 | 0.0000000 |
x1 | 1.439555 | 0.7211795 | 1.9961126 | 0.0487252 |
x2 | 1.009674 | 1.1337225 | 0.8905831 | 0.3753565 |
colin_data_reg %>%
glance() %>%
kable() %>%
kable_styling()
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.2088293 | 0.1925165 | 1.056178 | 12.80156 | 1.16e-05 | 3 | -145.8366 | 299.6731 | 310.0938 | 108.2046 | 97 |
Can the null hypotheses beta_1 = 0
and beta_2 = 0
be rejected?
Let’s take the general view that a p-value of 0.05 is statistically significant. With this in mind, we can reject the null hypothesis for x1
, but not for x2
.
Now fit a least squares regression to predict y using only x1, and then another using only x2. Can we reject the null hypothesis for either of these?
colin_data_reg_x1 <- lm(y ~ x1, colin_data)
colin_data_reg_x2 <- lm(y ~ x2, colin_data)
colin_data_reg_x1 %>%
tidy() %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 2.112394 | 0.2307448 | 9.154675 | 0.0e+00 |
x1 | 1.975929 | 0.3962774 | 4.986227 | 2.7e-06 |
colin_data_reg_x2 %>%
tidy() %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 2.389949 | 0.1949307 | 12.260508 | 0.00e+00 |
x2 | 2.899585 | 0.6330467 | 4.580365 | 1.37e-05 |
In both instances we have a p-value below 0.05, and thus we can reject the null hypothesis when each predictor is used on its own.
Do the results obtained in (c)–(e) contradict each other? Explain your answer.
The results don't contradict each other. When regressing on two predictors that are collinear, it can be difficult to separate out their individual effects on the response, and the power of the hypothesis test is reduced (see ISL 3.36).
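One way to quantify the collinearity (a sketch; it assumes the car package is installed, which is not otherwise used in this document) is the variance inflation factor:
# Sketch: variance inflation factors for the two-predictor fit.
# car is an assumption here; it is not loaded elsewhere in this document.
car::vif(colin_data_reg)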
Now suppose we obtain one additional observation, which was unfortunately mismeasured. Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation have on each of the models? In each model, is this observation an outlier? A high-leverage point? Both? Explain your answers.
We add the new observation and calculate the new regressions.
colin_data_add <- colin_data %>% add_row(x1 = 0.1, x2 = 0.8, y = 6)
colin_data_reg_both <- lm(y ~ x1 + x2, colin_data_add)
colin_data_reg_x1 <- lm(y ~ x1, colin_data_add)
colin_data_reg_x2 <- lm(y ~ x2, colin_data_add)
We recall from 3.4 that an outlier is a point far from the value predicted by the model. We can look at the raw residuals, but it can be difficult to decide how large a residual needs to be before we classify the point as an outlier. This is where studentised residuals are used: each residual e_i is divided by its estimated standard error. Points with studentised residuals greater than 3 in absolute value are possible outliers.
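Following that rule of thumb, a quick filter on the augment() output flags candidate outliers (a sketch):
# Sketch: list observations whose standardised residual exceeds 3 in
# absolute value for the two-predictor fit.
colin_data_reg_both %>%
  augment() %>%
  filter(abs(.std.resid) > 3)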
colin_data_reg_both %>%
augment() %>%
filter(y == 6) %>%
kable() %>%
kable_styling()
y | x1 | x2 | .fitted | .se.fit | .resid | .hat | .sigma | .cooksd | .std.resid |
---|---|---|---|---|---|---|---|---|---|
6 | 0.1 | 0.8 | 4.292291 | 0.6920985 | 1.707709 | 0.4147284 | 1.056178 | 1.019021 | 2.07706 |
colin_data_reg_x1 %>%
augment() %>%
filter(y == 6) %>%
kable() %>%
kable_styling()
y | x1 | .fitted | .se.fit | .resid | .hat | .sigma | .cooksd | .std.resid |
---|---|---|---|---|---|---|---|---|
6 | 0.1 | 2.433497 | 0.2033033 | 3.566503 | 0.0334716 | 1.055063 | 0.1845392 | 3.264591 |
colin_data_reg_x2 %>%
augment() %>%
filter(y == 6) %>%
kable() %>%
kable_styling()
y | x2 | .fitted | .se.fit | .resid | .hat | .sigma | .cooksd | .std.resid |
---|---|---|---|---|---|---|---|---|
6 | 0.8 | 4.840347 | 0.3417746 | 1.159653 | 0.1013106 | 1.07214 | 0.0731539 | 1.139229 |
Only in the regression with x1 as the predictor could the new observation be considered an outlier (its standardised residual is about 3.3).
Points with high leverage have an unusual x_i value and tend to have a large influence on the estimated regression line. The leverage statistic quantifies this; augment() reports it in the .hat column, so we plot .hat against .std.resid below. When both x1 and x2 are used in the regression, the new observation is clearly a high-leverage point; in the x1-only regression it is not, and in the x2-only regression it has moderate leverage.
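A common rule of thumb (a sketch; this check is an addition, not part of the original analysis) is to flag observations whose leverage is well above the average leverage (p + 1)/n:
# Sketch: flag observations with leverage more than three times the average.
colin_data_reg_both %>%
  augment() %>%
  filter(.hat > 3 * mean(.hat))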
colin_data_reg_both %>% augment() %>% ggplot() + geom_point(aes(.hat, .std.resid))
colin_data_reg_x1 %>% augment() %>% ggplot() + geom_point(aes(.hat, .std.resid))
colin_data_reg_x2 %>% augment() %>% ggplot() + geom_point(aes(.hat, .std.resid))
We will now try to predict per capita crime rate using the other variables in this data set.
For each predictor, fit a simple linear regression model to predict the response. Describe your results. In which of the models is there a statistically significant association between the predictor and the response? Create some plots to back up your assertions.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
boston <- as_tibble(Boston)
boston_regress <- tibble(predictor = names(boston)[-1]) %>%
mutate(fit = map(predictor, ~ lm(paste('crim ~', .x), boston) %>% tidy())) %>% unnest(fit) %>% filter(term != '(Intercept)')
boston_regress %>%
print(n = 26) %>%
kable() %>%
kable_styling()
## # A tibble: 13 x 6
## predictor term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 zn zn -0.0739 0.0161 -4.59 5.51e- 6
## 2 indus indus 0.510 0.0510 9.99 1.45e-21
## 3 chas chas -1.89 1.51 -1.26 2.09e- 1
## 4 nox nox 31.2 3.00 10.4 3.75e-23
## 5 rm rm -2.68 0.532 -5.04 6.35e- 7
## 6 age age 0.108 0.0127 8.46 2.85e-16
## 7 dis dis -1.55 0.168 -9.21 8.52e-19
## 8 rad rad 0.618 0.0343 18.0 2.69e-56
## 9 tax tax 0.0297 0.00185 16.1 2.36e-47
## 10 ptratio ptratio 1.15 0.169 6.80 2.94e-11
## 11 black black -0.0363 0.00387 -9.37 2.49e-19
## 12 lstat lstat 0.549 0.0478 11.5 2.65e-27
## 13 medv medv -0.363 0.0384 -9.46 1.17e-19
predictor | term | estimate | std.error | statistic | p.value |
---|---|---|---|---|---|
zn | zn | -0.0739350 | 0.0160946 | -4.593776 | 0.0000055 |
indus | indus | 0.5097763 | 0.0510243 | 9.990848 | 0.0000000 |
chas | chas | -1.8927766 | 1.5061155 | -1.256727 | 0.2094345 |
nox | nox | 31.2485312 | 2.9991904 | 10.418989 | 0.0000000 |
rm | rm | -2.6840512 | 0.5320411 | -5.044820 | 0.0000006 |
age | age | 0.1077862 | 0.0127364 | 8.462825 | 0.0000000 |
dis | dis | -1.5509017 | 0.1683300 | -9.213458 | 0.0000000 |
rad | rad | 0.6179109 | 0.0343318 | 17.998199 | 0.0000000 |
tax | tax | 0.0297423 | 0.0018474 | 16.099388 | 0.0000000 |
ptratio | ptratio | 1.1519828 | 0.1693736 | 6.801430 | 0.0000000 |
black | black | -0.0362796 | 0.0038732 | -9.366951 | 0.0000000 |
lstat | lstat | 0.5488048 | 0.0477610 | 11.490654 | 0.0000000 |
medv | medv | -0.3631599 | 0.0383902 | -9.459710 | 0.0000000 |
Fit a multiple regression model to predict the response using all of the predictors. Describe your results. For which predictors can we reject the null hypothesis H0: βj = 0?
boston_mult_regress <- lm(crim ~ ., boston) %>%
tidy() %>%
arrange(p.value)
boston_mult_regress %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
rad | 0.5882086 | 0.0880493 | 6.6804480 | 0.0000000 |
dis | -0.9871757 | 0.2818173 | -3.5028930 | 0.0005022 |
medv | -0.1988868 | 0.0605160 | -3.2865169 | 0.0010868 |
zn | 0.0448552 | 0.0187341 | 2.3943122 | 0.0170249 |
(Intercept) | 17.0332275 | 7.2349030 | 2.3543132 | 0.0189491 |
black | -0.0075375 | 0.0036733 | -2.0519589 | 0.0407023 |
nox | -10.3135349 | 5.2755363 | -1.9549737 | 0.0511520 |
lstat | 0.1262114 | 0.0757248 | 1.6667104 | 0.0962084 |
ptratio | -0.2710806 | 0.1864505 | -1.4539010 | 0.1466113 |
indus | -0.0638548 | 0.0834072 | -0.7655789 | 0.4442940 |
tax | -0.0037800 | 0.0051556 | -0.7331884 | 0.4637927 |
rm | 0.4301305 | 0.6128303 | 0.7018754 | 0.4830888 |
chas | -0.7491336 | 1.1801468 | -0.6347800 | 0.5258670 |
age | 0.0014516 | 0.0179251 | 0.0809837 | 0.9354878 |
From the results of the regression with all predictors, taking p < 0.05 to be statistically significant, rad
(index of accessibility to radial highways.), dis
(weighted mean of distances to five Boston employment centres.), medv
(median value of owner-occupied homes in $1000s), zn
(proportion of residential land zoned for lots over 25,000 sq.ft), and black
(1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town) appear to have statistical significance.
How do your results from (a) compare to your results from (b)? Create a plot displaying the univariate regression coefficients from (a) on the x-axis, and the multiple regression coefficients from (b) on the y-axis. That is, each predictor is displayed as a single point in the plot. Its coefficient in a simple linear regression model is shown on the x-axis, and its coefficient estimate in the multiple linear regression model is shown on the y-axis.
boston_regress %>%
inner_join(boston_mult_regress, by = 'term') %>%
ggplot(aes(estimate.x, estimate.y)) + geom_point()