Use the lm()
function to perform a simple linear regression with mpg
as the response and horsepower
as the predictor. Print the results and comment on the output.
auto <- as_tibble(Auto)
lm_auto <- auto %>% lm(mpg ~ horsepower, .)
lm_auto %>%
tidy() %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 39.9358610 | 0.7174987 | 55.65984 | 0 |
horsepower | -0.1578447 | 0.0064455 | -24.48914 | 0 |
Is there a relationship between the predictor and the response?
To determine whether there is a relationship, we need to look at three things: the p-values for the coefficients, the F-statistic, and the p-value for the F-statistic:
lm_auto %>%
tidy() %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 39.9358610 | 0.7174987 | 55.65984 | 0 |
horsepower | -0.1578447 | 0.0064455 | -24.48914 | 0 |
lm_auto %>%
glance() %>%
kable() %>%
kable_styling()
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.6059483 | 0.6049379 | 4.905757 | 599.7177 | 0 | 2 | -1178.662 | 2363.324 | 2375.237 | 9385.916 | 390 |
In the tidy() output we can see that the p-values for the intercept and the horsepower coefficient are both very small, indicating strong evidence against the null hypothesis of no relationship.
In the glance()
output, we see the statistic
column (the F-statistic) is high at around 600, with a small p-value. This indicates that there is a relationship.
How strong is the relationship?
To assess how strong the relationship is, we can look at the residual standard error (RSE) and the R^2 value.
We look at the glance()
output again and see that the R^2 value is 0.606. Recall that R^2 is a value between 0 and 1 giving the 'proportion of the variance explained'. Together with an RSE (the sigma column) of about 4.9 mpg, we can consider the relationship reasonably strong.
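As a rough check of the error on the scale of the response (a sketch; it simply relates the RSE to the mean of mpg and is not computed elsewhere in this write-up):
# Sketch: express the residual standard error (sigma from glance()) as a
# fraction of the mean response to get a rough relative error.
glance(lm_auto) %>%
  transmute(rse = sigma, relative_error = sigma / mean(auto$mpg))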
Is the relationship positive or negative? The coefficient on horsepower is negative, so miles per gallon goes down as horsepower goes up; each additional unit of horsepower lowers the predicted mpg by about 0.16. This aligns with intuition.
What is the predicted mpg
associated with a horsepower
of 98? What are the associated 95% confidence and prediction intervals?
We use the interval
argument to the predict function. Note that the default level
argument of predict()
is 0.95 (95%).
predict(lm_auto, tibble(horsepower = 98), interval = 'confidence') %>%
kable() %>%
kable_styling()
fit | lwr | upr |
---|---|---|
24.46708 | 23.97308 | 24.96108 |
predict(lm_auto, tibble(horsepower = 98), interval = 'prediction') %>%
kable() %>%
kable_styling()
fit | lwr | upr |
---|---|---|
24.46708 | 14.8094 | 34.12476 |
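If a different level is wanted, it can be set explicitly (a sketch; a 90% confidence interval is shown here, which is not computed elsewhere in this document):
# Sketch: the same prediction with a 90% confidence interval.
predict(lm_auto, tibble(horsepower = 98), interval = 'confidence', level = 0.90)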
Plot the response and the predictor. Display the least squares regression line.
auto %>% ggplot(aes(horsepower, mpg)) + geom_point() + geom_smooth(method = 'lm', formula = 'y ~ x')
Produce diagnostic plots (residuals vs leverage, residuals vs fitted, fitted vs standardised residuals) and comment on any problems.
First off, let's have a look at the residuals versus the leverage:
augment(lm_auto) %>% ggplot(aes(.hat, .resid)) + geom_point()
There are a few points up in the top right. We take a look at the Cook’s distance for the observations.
augment(lm_auto) %>% mutate(i = 1:n()) %>% ggplot(aes(i, .cooksd)) + geom_bar(stat = 'identity')
A couple of high points but all below 1.
Now we look at the fitted versus the residuals, and also fit a quadratic regression. We see a bit of a U shape, indicating potential non-linearity in the data.
augment(lm_auto) %>% ggplot(aes(.fitted, .resid)) + geom_point() + geom_smooth(method = 'lm', formula = 'y ~ poly(x,2)')
library(GGally)
library(corrplot)
Produce a scatterplot matrix which includes all the data in the data set
auto %>% dplyr::select(-name) %>% ggpairs()
Compute the matrix of correlations between the variables.
auto %>% dplyr::select(-name) %>% cor() %>% corrplot(method = 'color')
Perform a multiple linear regression with mpg
as the response and all other variables except name
as the predictors.
lin_reg_auto <- lm(mpg ~ . -name, auto)
tidy(lin_reg_auto) %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -17.2184346 | 4.6442941 | -3.707438 | 0.0002402 |
cylinders | -0.4933763 | 0.3232823 | -1.526147 | 0.1277965 |
displacement | 0.0198956 | 0.0075151 | 2.647430 | 0.0084446 |
horsepower | -0.0169511 | 0.0137869 | -1.229512 | 0.2196328 |
weight | -0.0064740 | 0.0006520 | -9.928787 | 0.0000000 |
acceleration | 0.0805758 | 0.0988450 | 0.815174 | 0.4154780 |
year | 0.7507727 | 0.0509731 | 14.728795 | 0.0000000 |
origin | 1.4261405 | 0.2781361 | 5.127492 | 0.0000005 |
glance(lin_reg_auto) %>%
kable() %>%
kable_styling()
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.8214781 | 0.8182238 | 3.327682 | 252.428 | 0 | 8 | -1023.475 | 2064.949 | 2100.691 | 4252.213 | 384 |
Is there a relationship between the predictors and the response? We test the null hypothesis that all of the regression coefficients are zero. The F-statistic is 252 (far greater than 1) and has a p-value of about 2e-139, indicating a very low probability that this result is due to chance. We can therefore say there is a relationship between the predictors and the response.
We look at the p-values for each of the predictors. The predictors with statistically significant coefficients - i.e. those that appear to have an effect on mpg, holding all others constant - are weight, year, displacement and origin.
What does the coefficient for the year variable suggest? The year coefficient suggests that a car's mpg increases by about 0.75 mpg for each additional model year, holding the other predictors constant; in other words, newer cars are more fuel efficient.
Plot the diagnostic plots of the regression and comment on any problems with the fit.
We go through our usual plots - first off is looking at the residuals versus the leverage:
augment(lin_reg_auto) %>% mutate(i = 1:n()) %>% ggplot(aes(i, .cooksd)) + geom_bar(stat = 'identity')
We don’t see values with a significant Cook’s distance. We move on to the fitted versus the residuals:
augment(lin_reg_auto) %>% ggplot(aes(.fitted, .resid)) + geom_point() + geom_smooth(method = 'lm', formula = 'y~poly(x,2)')
There is some evidence of non-linearity in the residuals.
**Use the '*' and ':' symbols to fit linear regressions with interaction effects. Are any interactions statistically significant?**
A '*' adds the main effects and the interaction term, whereas a ':' adds only the interaction term, i.e. x*y == x + y + x:y.
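A quick sketch to confirm the equivalence (these two calls are an illustration, not part of the exercise):
# Sketch: `*` expands to the main effects plus the interaction, so these two
# formulas fit identical models.
lm(mpg ~ weight * year, auto) %>% tidy()
lm(mpg ~ weight + year + weight:year, auto) %>% tidy()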
Let's have a think about potential interactions. I think weight and year could interact, given the changes in materials over time, and there could also be an interaction between cylinders and displacement:
lm(mpg ~ weight*year + cylinders*displacement, auto) %>%
tidy() %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -81.3476273 | 13.8781079 | -5.861579 | 0.0000000 |
weight | 0.0202479 | 0.0046727 | 4.333191 | 0.0000188 |
year | 1.7535227 | 0.1755592 | 9.988211 | 0.0000000 |
cylinders | -1.4865200 | 0.4002062 | -3.714385 | 0.0002338 |
displacement | -0.0698473 | 0.0127015 | -5.499162 | 0.0000001 |
weight:year | -0.0003469 | 0.0000618 | -5.612385 | 0.0000000 |
cylinders:displacement | 0.0090185 | 0.0015597 | 5.782327 | 0.0000000 |
All of the terms, including both interactions, are statistically significant, with very small p-values. In fact, if we have a look at the fitted versus residuals plot, it looks much better than before:
lm(mpg ~ weight*year + cylinders*displacement, auto) %>% augment() %>% ggplot(aes(.fitted, .resid)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Try different transformations of the variables and comment on the findings.
We try a few different transformations and pipe them through to the fitted versus residuals graph:
lm(mpg ~ sqrt(horsepower), auto) %>% augment() %>% ggplot() + geom_point(aes(.fitted, .resid))
lm(mpg ~ log(horsepower), auto) %>% augment() %>% ggplot() + geom_point(aes(.fitted, .resid))
lm(1/mpg ~ horsepower, auto) %>% augment() %>% ggplot() + geom_point(aes(.fitted, .resid))
lm(1/mpg ~ horsepower + weight*year, auto) %>% augment() %>% ggplot() + geom_point(aes(.fitted, .resid))
lm(1/mpg ~ horsepower + weight*year, auto) %>%
glance() %>%
kable() %>%
kable_styling()
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.8830842 | 0.8818758 | 0.0057188 | 730.7686 | 0 | 5 | 1470.576 | -2929.152 | -2905.324 | 0.0126569 | 387 |
The last one looks quite good.
Fit a multiple regression model to predict Sales
using Price
, Urban
, and US
.
carseats <- as_tibble(Carseats)
cs_regress <- lm(Sales ~ Price + Urban + US, carseats)
cs_regress %>%
tidy() %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 13.0434689 | 0.6510122 | 20.0356737 | 0.0000000 |
Price | -0.0544588 | 0.0052419 | -10.3892320 | 0.0000000 |
UrbanYes | -0.0219162 | 0.2716503 | -0.0806778 | 0.9357389 |
USYes | 1.2005727 | 0.2590415 | 4.6346731 | 0.0000049 |
Provide an interpretation of each coefficient in the model.
* (Intercept) - the predicted sales (in thousands of units) when the price is zero and the store is neither urban nor in the US; it is the baseline that the other terms adjust.
* Price - the regression indicates a relationship between price and sales, given the low p-value of the t-statistic. An increase in price of one dollar is associated with a decrease in sales of about 54 carseats.
* UrbanYes - given the high p-value, there doesn't appear to be a relationship between sales and whether a store is urban.
* USYes - given the low p-value, a store being in the US is associated with roughly 1,200 more carseats sold.
Write out the model in equation form, being careful to handle the qualitative variables properly.
Sales = 13.04 - 0.054 * Price - 0.022 * Urban + 1.20 * US,
where Urban = 1 if the store is urban and 0 otherwise, and US = 1 if the store is in the US and 0 otherwise.
For which of the predictors can you reject the null hypothesis H0: βj = 0?
The null hypothesis can be rejected for Price
and US
.
On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
cs_regress_reduced <- lm(Sales ~ Price + US, carseats)
cs_regress_reduced %>%
tidy() %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 13.0307928 | 0.6309763 | 20.651794 | 0.0e+00 |
Price | -0.0544776 | 0.0052301 | -10.416123 | 0.0e+00 |
USYes | 1.1996429 | 0.2584610 | 4.641485 | 4.7e-06 |
How well do the models in (a) and (e) fit the data?
We look at the R-squared, the F-statistic, and its associated p-value to see how well the models fit:
cs_regress %>%
glance() %>%
kable() %>%
kable_styling()
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.2392754 | 0.2335123 | 2.472492 | 41.51877 | 0 | 4 | -927.656 | 1865.312 | 1885.269 | 2420.835 | 396 |
cs_regress_reduced %>%
glance() %>%
kable() %>%
kable_styling()
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.2392629 | 0.2354305 | 2.469397 | 62.43114 | 0 | 3 | -927.6593 | 1863.319 | 1879.285 | 2420.874 | 397 |
We see essentially no change in the R-squared value (the adjusted R-squared actually increases slightly), and the F-statistic increases, so dropping Urban costs us nothing in terms of fit.
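As an additional check (a sketch; this comparison is not run elsewhere in the document), the two nested models can be compared directly with anova():
# Sketch: F-test of the reduced model against the full model; the test on the
# dropped Urban term mirrors its individual t-test above.
anova(cs_regress_reduced, cs_regress)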
Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).
cs_regress_reduced %>%
confint()%>%
kable() %>%
kable_styling()
2.5 % | 97.5 % | |
---|---|---|
(Intercept) | 11.7903202 | 14.2712653 |
Price | -0.0647598 | -0.0441954 |
USYes | 0.6915196 | 1.7077663 |
Is there evidence of outliers or high leverage observations in the model from (e)?
cs_regress_reduced %>% augment() %>% ggplot(aes(.fitted, .resid)) + geom_point()
cs_regress_reduced %>% augment() %>% ggplot(aes(.hat, .resid)) + geom_point()
There doesn't appear to be any non-linearity in the data, and the higher-leverage points don't appear to affect the fit substantially.
In this problem we will investigate the t-statistic for the null hypothesis H0: β = 0 in simple linear regression without an intercept. To begin, we generate a predictor x and a response y.
set.seed(1)
linear_response <- tibble(x = rnorm(100), y = 2 * x + rnorm(100))
linear_response %>% ggplot(aes(x,y)) + geom_point()
Perform a simple linear regression of y onto x, without an intercept. Report the coefficient estimate β̂, the standard error of this coefficient estimate, and the t-statistic and p-value associated with the null hypothesis H0: β = 0. Comment on these results.
lr_reg <- lm(y ~ x + 0, linear_response)
lr_reg %>%
tidy() %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
x | 1.993876 | 0.1064767 | 18.72593 | 0 |
The coefficient estimate is 1.99 - very close to the 2 that we used to generate the points. The standard error is 0.106, the t-statistic is 18.7 and the p-value is 2.64e-34. The t-statistic tells us how many standard errors the coefficient estimate is away from 0. The p-value is the probability of observing a t-statistic at least this extreme if the true coefficient were 0.
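As a small sanity check (a sketch, not part of the exercise), the t-statistic can be recomputed from the tidy() output:
# Sketch: the t-statistic is just the estimate divided by its standard error.
lr_reg %>% tidy() %>% mutate(t_check = estimate / std.error)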
Now perform a simple linear regression of x onto y without an intercept, and report on the same metrics as a)
lr_reg_reverse <- lm(x ~ y + 0, linear_response)
lr_reg_reverse %>%
tidy() %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
y | 0.3911145 | 0.0208863 | 18.72593 | 0 |
The coefficient estimate and the standard error have changed, which is as expected. The t-statistic and p-value remain the same.
What is the relationship between the results obtained in (a) and (b)?
Both regressions describe the same line through the same cloud of points, so the t-statistic and p-value are identical even though the coefficient estimates differ.
Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
When the sum of the squares of the y observations is the same as the sum of the squares of the x observations: from (3.38), the coefficient for y onto x is sum(x_i * y_i) / sum(x_i^2), while the coefficient for x onto y is sum(x_i * y_i) / sum(y_i^2), so the two are equal exactly when sum(x_i^2) = sum(y_i^2).
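We can verify the formula numerically on the data from (a) (a sketch; the denominators are the only difference between the two estimates):
# Sketch: compute the no-intercept coefficient (3.38) for both regressions.
with(linear_response, c(
  beta_y_on_x = sum(x * y) / sum(x^2),
  beta_x_on_y = sum(x * y) / sum(y^2)
))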
Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
set.seed(1)
tibble(x = rnorm(100), y = 4*x) %>% lm(y ~ x, .) %>% tidy()
## Warning in summary.lm(x): essentially perfect fit: summary may be
## unreliable
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -8.88e-17 2.19e-17 -4.06e+ 0 0.0000999
## 2 x 4.00e+ 0 2.43e-17 1.64e+17 0
set.seed(1)
tibble(x = rnorm(100), y = 4*x) %>% lm(x ~ y, .) %>% tidy()
## Warning in summary.lm(x): essentially perfect fit: summary may be
## unreliable
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -2.22e-17 5.47e-18 -4.06e+ 0 0.0000999
## 2 y 2.50e- 1 1.52e-18 1.64e+17 0
Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
To have the same coefficients, we need the sum of the squares to be the same. To do this, we generate 100 random values for X
, and we use the same values for Y
. However we re-order the values so we don’t simply get a y = x
function. We can use the sample_n()
function for this:
set.seed(1)
sample_data <- tibble(x = rnorm(100)) %>%
mutate(y = (sample_n(., 100) %>% .[['x']]))
lm(y ~ x, sample_data) %>% tidy()
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.108 0.0909 1.19 0.237
## 2 x 0.00695 0.101 0.0688 0.945
lm(x ~ y, sample_data) %>% tidy()
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.108 0.0909 1.19 0.237
## 2 y 0.00695 0.101 0.0688 0.945
Using the rnorm() function, create a vector, x , containing 100 observations drawn from a N (0, 1) distribution. This represents a feature, X.
x <- rnorm(100)
Using the rnorm()
function, create a vector, eps
, containing 100 observations drawn from a N (0, 0.25) distribution
eps <- rnorm(100, 0, .25)
Using x
and eps
, generate a vector y according to the model Y = −1 + 0.5X + e
. What is the length of the vector y ? What are the values of β 0 and β 1 in this linear model?
y <- -1 + .5 * x + eps
The vector y has length 100, the same as x. The beta_0 is -1, and the beta_1 is 0.5.
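A quick check of the length (a sketch):
# Sketch: confirm that y has 100 elements.
length(y)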
Create a scatterplot displaying the relationship between x and y. Comment on what you observe.
simulated <- tibble(x = x, y = y)
simulated %>% ggplot(aes(x,y)) + geom_point()
We can see an approximate linear relationship between x and y.
Fit a least squares linear model to predict y using x . Comment on the model obtained. How do β̂ 0 and β̂ 1 compare to β 0 and β 1 ?
lm(y ~ x, simulated) %>%
tidy() %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.9748778 | 0.0245031 | -39.78592 | 0 |
x | 0.4953323 | 0.0234770 | 21.09861 | 0 |
We see a beta_0 of -0.975 and a beta_1 of 0.495 - very close to the true values of -1 and 0.5. The standard errors are low at 0.0245 and 0.0235 respectively.
Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different color. Create a legend.
simulated %>% ggplot(aes(x,y)) +
geom_point() +
geom_smooth(aes(colour = 'Regression'), method = 'lm', formula = 'y ~ x') +
geom_abline(aes(slope = .5, intercept = -1, colour = 'Population Mean'), size = 1) +
labs(colour = "Lines")
Now fit a polynomial regression model that predicts y
using x
and x^2
. Is there evidence that the quadratic term improves the model fit? Explain your answer.
lm(y ~ poly(x,2), simulated) %>%
tidy() %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.9635947 | 0.0246226 | -39.1345311 | 0.0000000 |
poly(x, 2)1 | 5.1685804 | 0.2462262 | 20.9911878 | 0.0000000 |
poly(x, 2)2 | -0.0166596 | 0.2462262 | -0.0676599 | 0.9461957 |
lm(y ~ poly(x,2), simulated) %>%
glance() %>%
kable() %>%
kable_styling()
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.81958 | 0.81586 | 0.2462262 | 220.3173 | 0 | 3 | -0.2204281 | 8.440856 | 18.86154 | 5.880852 | 97 |
We can see that the R^2 has barely changed and the RSE is essentially the same once the quadratic term is added, while the F-statistic has decreased. More directly, the coefficient on the quadratic term has a p-value of 0.946, so there is no evidence that it improves the fit.
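A direct comparison of the nested fits (a sketch; this anova() call is an addition to the original write-up) leads to the same conclusion:
# Sketch: F-test of the linear fit against the quadratic fit.
anova(lm(y ~ x, simulated), lm(y ~ poly(x, 2), simulated))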
Repeat (a)–(f) after modifying the data generation process in such a way that there is less noise in the data.
Let's create a function to do this, with an argument controlling how much noise (the standard deviation of the error term) is in the data.
simulated_linear <- function(observations, mean, noise) {
x <- rnorm(observations)
eps <- rnorm(observations, mean, noise)
y <- -1 + .5 * x + eps
return(tibble(x = x, y = y))
}
set.seed(1)
low_and_high_noise <- bind_cols( simulated_linear(100, 0, .1), simulated_linear(100, 0, .5))
lm(y ~ x, low_and_high_noise) %>%
glance() %>%
kable() %>%
kable_styling()
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.9564698 | 0.9560256 | 0.0962753 | 2153.309 | 0 | 2 | 93.1706 | -180.3412 | -172.5257 | 0.9083561 | 98 |
lm(y ~ x, low_and_high_noise) %>%
tidy() %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -1.003769 | 0.0096987 | -103.49492 | 0 |
x | 0.499894 | 0.0107727 | 46.40376 | 0 |
lm(y1 ~ x1, low_and_high_noise) %>%
tidy() %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.9757750 | 0.0495499 | -19.69276 | 0 |
x1 | 0.5531092 | 0.0481297 | 11.49206 | 0 |
lm(y1 ~ x1, low_and_high_noise) %>%
glance() %>%
kable() %>%
kable_styling()
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.574038 | 0.5696915 | 0.4952935 | 132.0675 | 0 | 2 | -70.62324 | 147.2465 | 155.062 | 24.04093 | 98 |
Looking at the values, the low-noise regression recovers coefficients almost identical to those used to generate the data (-1.00 and 0.50), with an R^2 near 1 and a very high F-statistic.
The higher-noise fit, as expected, has a much lower F-statistic (132 versus 2153), larger standard errors on the coefficients, and a lower R^2 (0.57 versus 0.96).
low_and_high_noise %>%
ggplot() +
geom_point(aes(x, y), colour = 'red') +
geom_point(aes(x1, y1), colour = 'blue') +
geom_smooth(aes(x, y), method = 'lm', formula = 'y ~ x', colour = 'red') +
geom_smooth(aes(x1, y1), method = 'lm', formula = 'y ~ x', colour = 'blue')
**What are the confidence intervals of the data sets? Comment on the results**
lm(y ~ x, simulated) %>%
confint() %>%
kable() %>%
kable_styling()
2.5 % | 97.5 % | |
---|---|---|
(Intercept) | -1.023503 | -0.9262522 |
x | 0.448743 | 0.5419217 |
lm(y ~ x, low_and_high_noise) %>%
confint() %>%
kable() %>%
kable_styling()
2.5 % | 97.5 % | |
---|---|---|
(Intercept) | -1.0230161 | -0.9845224 |
x | 0.4785159 | 0.5212720 |
lm(y1 ~ x1, low_and_high_noise) %>%
confint() %>%
kable() %>%
kable_styling()
2.5 % | 97.5 % | |
---|---|---|
(Intercept) | -1.0741052 | -0.8774448 |
x1 | 0.4575975 | 0.6486210 |
The interval widths track the noise level: the low-noise fit gives the narrowest intervals and the high-noise fit the widest, and all three sets of intervals contain the true values of -1 and 0.5.
Set up the data
set.seed(1)
colin_data <- tibble(
x1 = runif(100),
x2 = 0.5 * x1 + rnorm(100)/10,
y = 2 + 2*x1 + 0.3*x2 + rnorm(100)
)
The beta_{0,1,2} coefficients of the model are (2,2,0.3) respectively.
What is the correlation between x1 and x2 ? Create a scatterplot displaying the relationship between the variables.
colin_data %>%
dplyr::select(x1, x2) %>%
cor() %>%
kable() %>%
kable_styling()
x1 | x2 | |
---|---|---|
x1 | 1.0000000 | 0.8351212 |
x2 | 0.8351212 | 1.0000000 |
colin_data %>% ggplot(aes(x1, x2)) + geom_point()
Using this data, fit a least squares regression to predict y using x1 and x2.
colin_data_reg <- lm(y ~ x1 + x2, colin_data)
colin_data_reg %>%
tidy() %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 2.130500 | 0.2318817 | 9.1878742 | 0.0000000 |
x1 | 1.439555 | 0.7211795 | 1.9961126 | 0.0487252 |
x2 | 1.009674 | 1.1337225 | 0.8905831 | 0.3753565 |
colin_data_reg %>%
glance() %>%
kable() %>%
kable_styling()
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.2088293 | 0.1925165 | 1.056178 | 12.80156 | 1.16e-05 | 3 | -145.8366 | 299.6731 | 310.0938 | 108.2046 | 97 |
Can the null hypotheses beta_1 = 0
and beta_2 = 0
be rejected?
Let’s take the general view that a p-value of 0.05 is statistically significant. With this in mind, we can reject the null hypothesis for x1
, but not for x2
.
Now fit a least squares regression to predict y using only x1, and then another using only x2. Can we reject the null hypothesis for either of these?
colin_data_reg_x1 <- lm(y ~ x1, colin_data)
colin_data_reg_x2 <- lm(y ~ x2, colin_data)
colin_data_reg_x1 %>%
tidy() %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 2.112394 | 0.2307448 | 9.154675 | 0.0e+00 |
x1 | 1.975929 | 0.3962774 | 4.986227 | 2.7e-06 |
colin_data_reg_x2 %>%
tidy() %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 2.389949 | 0.1949307 | 12.260508 | 0.00e+00 |
x2 | 2.899585 | 0.6330467 | 4.580365 | 1.37e-05 |
In both instances we have a p-value below 0.05, and thus we can reject the null hypothesis when each predictor is used on its own.
Do the results obtained in (c)–(e) contradict each other? Explain your answer.
The results don't contradict each other. When regressing on two predictors that are collinear, it can be difficult to separate out their individual effects on the response, and the power of the hypothesis test is reduced (see ISL 3.36).
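One way to quantify the collinearity (a sketch; it assumes the car package is installed, which is not otherwise used in this document) is the variance inflation factor:
# Sketch: variance inflation factors for the two-predictor fit.
# car is an assumption here; it is not loaded elsewhere in this document.
car::vif(colin_data_reg)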
Now suppose we obtain one additional observation, which was unfortunately mismeasured. Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation have on each of the models? In each model, is this observation an outlier? A high-leverage point? Both? Explain your answers.
We add the new observation and calculate the new regressions.
colin_data_add <- colin_data %>% add_row(x1 = 0.1, x2 = 0.8, y = 6)
colin_data_reg_both <- lm(y ~ x1 + x2, colin_data_add)
colin_data_reg_x1 <- lm(y ~ x1, colin_data_add)
colin_data_reg_x2 <- lm(y ~ x2, colin_data_add)
We recall from 3.4 that an outlier is a point far from the value predicted by the model. We can look at the raw residuals, but it can be difficult to decide how large a residual needs to be before we classify the point as an outlier. This is where studentised residuals are used: each residual e_i is divided by its estimated standard error. Points with studentised residuals greater than 3 in absolute value are possible outliers.
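Following that rule of thumb, a quick filter on the augment() output flags candidate outliers (a sketch):
# Sketch: list observations whose standardised residual exceeds 3 in
# absolute value for the two-predictor fit.
colin_data_reg_both %>%
  augment() %>%
  filter(abs(.std.resid) > 3)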
colin_data_reg_both %>%
augment() %>%
filter(y == 6) %>%
kable() %>%
kable_styling()
y | x1 | x2 | .fitted | .se.fit | .resid | .hat | .sigma | .cooksd | .std.resid |
---|---|---|---|---|---|---|---|---|---|
6 | 0.1 | 0.8 | 4.292291 | 0.6920985 | 1.707709 | 0.4147284 | 1.056178 | 1.019021 | 2.07706 |
colin_data_reg_x1 %>%
augment() %>%
filter(y == 6) %>%
kable() %>%
kable_styling()
y | x1 | .fitted | .se.fit | .resid | .hat | .sigma | .cooksd | .std.resid |
---|---|---|---|---|---|---|---|---|
6 | 0.1 | 2.433497 | 0.2033033 | 3.566503 | 0.0334716 | 1.055063 | 0.1845392 | 3.264591 |
colin_data_reg_x2 %>%
augment() %>%
filter(y == 6) %>%
kable() %>%
kable_styling()
y | x2 | .fitted | .se.fit | .resid | .hat | .sigma | .cooksd | .std.resid |
---|---|---|---|---|---|---|---|---|
6 | 0.8 | 4.840347 | 0.3417746 | 1.159653 | 0.1013106 | 1.07214 | 0.0731539 | 1.139229 |
Only in the regression with x1 as the predictor could the new observation be considered an outlier (its standardised residual is about 3.3).
Points with high leverage have an unusual x_i value and tend to have a large influence on the estimated regression line. The leverage statistic quantifies this; augment() reports it in the .hat column, so we plot .hat against .std.resid below. When both x1 and x2 are used in the regression, the new observation is clearly a high-leverage point; in the x1-only regression it is not, and in the x2-only regression it has moderate leverage.
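A common rule of thumb (a sketch; this check is an addition, not part of the original analysis) is to flag observations whose leverage is well above the average leverage (p + 1)/n:
# Sketch: flag observations with leverage more than three times the average.
colin_data_reg_both %>%
  augment() %>%
  filter(.hat > 3 * mean(.hat))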
colin_data_reg_both %>% augment() %>% ggplot() + geom_point(aes(.hat, .std.resid))
colin_data_reg_x1 %>% augment() %>% ggplot() + geom_point(aes(.hat, .std.resid))
colin_data_reg_x2 %>% augment() %>% ggplot() + geom_point(aes(.hat, .std.resid))
We will now try to predict per capita crime rate using the other variables in this data set.
For each predictor, fit a simple linear regression model to predict the response. Describe your results. In which of the models is there a statistically significant association between the predictor and the response? Create some plots to back up your assertions.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
boston <- as_tibble(Boston)
boston_regress <- tibble(predictor = names(boston)[-1]) %>%
mutate(fit = map(predictor, ~ lm(paste('crim ~', .x), boston) %>% tidy())) %>% unnest(fit) %>% filter(term != '(Intercept)')
boston_regress %>%
print(n = 26) %>%
kable() %>%
kable_styling()
## # A tibble: 13 x 6
## predictor term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 zn zn -0.0739 0.0161 -4.59 5.51e- 6
## 2 indus indus 0.510 0.0510 9.99 1.45e-21
## 3 chas chas -1.89 1.51 -1.26 2.09e- 1
## 4 nox nox 31.2 3.00 10.4 3.75e-23
## 5 rm rm -2.68 0.532 -5.04 6.35e- 7
## 6 age age 0.108 0.0127 8.46 2.85e-16
## 7 dis dis -1.55 0.168 -9.21 8.52e-19
## 8 rad rad 0.618 0.0343 18.0 2.69e-56
## 9 tax tax 0.0297 0.00185 16.1 2.36e-47
## 10 ptratio ptratio 1.15 0.169 6.80 2.94e-11
## 11 black black -0.0363 0.00387 -9.37 2.49e-19
## 12 lstat lstat 0.549 0.0478 11.5 2.65e-27
## 13 medv medv -0.363 0.0384 -9.46 1.17e-19
predictor | term | estimate | std.error | statistic | p.value |
---|---|---|---|---|---|
zn | zn | -0.0739350 | 0.0160946 | -4.593776 | 0.0000055 |
indus | indus | 0.5097763 | 0.0510243 | 9.990848 | 0.0000000 |
chas | chas | -1.8927766 | 1.5061155 | -1.256727 | 0.2094345 |
nox | nox | 31.2485312 | 2.9991904 | 10.418989 | 0.0000000 |
rm | rm | -2.6840512 | 0.5320411 | -5.044820 | 0.0000006 |
age | age | 0.1077862 | 0.0127364 | 8.462825 | 0.0000000 |
dis | dis | -1.5509017 | 0.1683300 | -9.213458 | 0.0000000 |
rad | rad | 0.6179109 | 0.0343318 | 17.998199 | 0.0000000 |
tax | tax | 0.0297423 | 0.0018474 | 16.099388 | 0.0000000 |
ptratio | ptratio | 1.1519828 | 0.1693736 | 6.801430 | 0.0000000 |
black | black | -0.0362796 | 0.0038732 | -9.366951 | 0.0000000 |
lstat | lstat | 0.5488048 | 0.0477610 | 11.490654 | 0.0000000 |
medv | medv | -0.3631599 | 0.0383902 | -9.459710 | 0.0000000 |
Fit a multiple regression model to predict the response using all of the predictors. Describe your results. For which predictors can we reject the null hypothesis H0: βj = 0?
boston_mult_regress <- lm(crim ~ ., boston) %>%
tidy() %>%
arrange(p.value)
boston_mult_regress %>%
kable() %>%
kable_styling()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
rad | 0.5882086 | 0.0880493 | 6.6804480 | 0.0000000 |
dis | -0.9871757 | 0.2818173 | -3.5028930 | 0.0005022 |
medv | -0.1988868 | 0.0605160 | -3.2865169 | 0.0010868 |
zn | 0.0448552 | 0.0187341 | 2.3943122 | 0.0170249 |
(Intercept) | 17.0332275 | 7.2349030 | 2.3543132 | 0.0189491 |
black | -0.0075375 | 0.0036733 | -2.0519589 | 0.0407023 |
nox | -10.3135349 | 5.2755363 | -1.9549737 | 0.0511520 |
lstat | 0.1262114 | 0.0757248 | 1.6667104 | 0.0962084 |
ptratio | -0.2710806 | 0.1864505 | -1.4539010 | 0.1466113 |
indus | -0.0638548 | 0.0834072 | -0.7655789 | 0.4442940 |
tax | -0.0037800 | 0.0051556 | -0.7331884 | 0.4637927 |
rm | 0.4301305 | 0.6128303 | 0.7018754 | 0.4830888 |
chas | -0.7491336 | 1.1801468 | -0.6347800 | 0.5258670 |
age | 0.0014516 | 0.0179251 | 0.0809837 | 0.9354878 |
From the results of the regression with all predictors, taking p < 0.05 to be statistically significant, rad
(index of accessibility to radial highways.), dis
(weighted mean of distances to five Boston employment centres.), medv
(median value of owner-occupied homes in $1000s), zn
(proportion of residential land zoned for lots over 25,000 sq.ft), and black
(1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town) appear to have statistical significance.
How do your results from (a) compare to your results from (b)? Create a plot displaying the univariate regression coefficients from (a) on the x-axis, and the multiple regression coefficients from (b) on the y-axis. That is, each predictor is displayed as a single point in the plot. Its coefficient in a simple linear regression model is shown on the x-axis, and its coefficient estimate in the multiple linear regression model is shown on the y-axis.
boston_regress %>%
inner_join(boston_mult_regress, by = 'term') %>%
ggplot(aes(estimate.x, estimate.y)) + geom_point()