Describe the null hypothesis to which the p-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper.
The p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one actually obtained; small p-values are therefore evidence against the null hypothesis.
The null hypothesis for each of the three predictor rows of Table 3.4 is “the advertising spend on {TV, radio, newspaper} has no influence on the amount of sales.”
From the values, we can reject the null hypothesis for the Intercept, TV and radio, but cannot reject it for newspaper.
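As a minimal sketch of where these p-values come from (assuming the Advertising data that accompanies the book has already been read into a data frame called advertising, with columns sales, TV, radio and newspaper):
# Multiple regression behind Table 3.4; `advertising` is assumed to hold the book’s Advertising data
fit <- lm(sales ~ TV + radio + newspaper, data = advertising)
summary(fit)$coefficients  # the Pr(>|t|) column holds the p-values reported in Table 3.4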
Carefully explain the differences between the KNN classifier and KNN regression methods.
The KNN classifier and KNN regression are closely related. The classifier estimates a qualitative class for an observation by assigning it the most common class among its K nearest neighbours in the training data. KNN regression identifies the K training observations that are closest to x0 (a set represented by N0) and then estimates f(x0) using the average of the training responses in N0.
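As an illustrative sketch (not the book’s code), a minimal base-R version of both methods might look like the following, assuming a numeric training matrix train_x, a response vector train_y, and a single query point x0:
# Indices of the K training points closest to x0 (Euclidean distance)
nearest_k <- function(train_x, x0, k) {
  d <- sqrt(rowSums(sweep(train_x, 2, x0)^2))
  order(d)[1:k]
}

# KNN classification: the most common class among the K nearest neighbours
knn_classify <- function(train_x, train_y, x0, k) {
  votes <- table(train_y[nearest_k(train_x, x0, k)])
  names(votes)[which.max(votes)]
}

# KNN regression: the average response over the K nearest neighbours (the set N0)
knn_regress <- function(train_x, train_y, x0, k) {
  mean(train_y[nearest_k(train_x, x0, k)])
}
The only difference between the two is the final aggregation step: a majority vote for classification, an average for regression.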
Consider a data set with five predictors: X1 = GPA, X2 = IQ, X3 = Gender, X4 = the interaction between GPA and IQ, and X5 = the interaction between GPA and Gender. The response is starting salary after graduation (in thousands of dollars).
Suppose we use a least squares fit and get B0 = 50, B1 = 20, B2 = 0.07, B3 = 35, B4 = 0.01, B5 = -10.
To answer this, we hold GPA and IQ fixed, so the comparison between genders depends only on B3 = 35 and B5 = -10. B3 shows males earning more than females (we assume the dummy variable is 1 for males, 0 for females). B5 is the negative coefficient on the GPA × Gender interaction, so that male advantage shrinks as GPA increases. Taking these two together, 3 is the correct answer.
f(x) = 50 + 20(4.0) + 0.07(110) + 35(0) + 0.01(4.0 × 110) + (−10)(0 × 4.0) = 50 + 80 + 7.7 + 0 + 4.4 + 0 = 142.1
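A quick check of that arithmetic in R, under the same assumption that female is coded as 0:
b <- c(50, 20, 0.07, 35, 0.01, -10)  # B0 ... B5 from the fit above
gpa <- 4.0; iq <- 110; gender <- 0   # female coded as 0 under our assumption
b[1] + b[2] * gpa + b[3] * iq + b[4] * gender + b[5] * gpa * iq + b[6] * gpa * gender
## [1] 142.1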
The absolute value of the coefficient can’t be used to judge the evidence for an interaction effect, as it depends on the underlying units. The p-value of the regression coefficient is what determines whether the term is statistically significant or not.
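A small illustration (on simulated data, not the book’s) of why the raw magnitude is unit-dependent: rescaling a predictor rescales its coefficient by the same factor but leaves its p-value unchanged.
set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- 2 * d$x + rnorm(100)
coef(summary(lm(y ~ x, data = d)))            # coefficient on the original scale
coef(summary(lm(y ~ I(x / 1000), data = d)))  # coefficient is 1000x larger; its p-value is identical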
The residual sum of squares (RSS) is a measure of the discrepancy between the data and the estimated model. Even though the true relationship is linear, the cubic regression will fit the training data better, as it has higher flexibility / lower bias.
In the long run, because the true relationship is linear, we would expect the test RSS for the linear regression to be lower.
When the true relationship is not linear, we would again expect the cubic regression to fit the training data better, as it has higher flexibility.
Given we don’t know how far from linear the data is, there isn’t enough information to say which model will have the lower test RSS.
Out of interest’s sake, let’s generate 100 points of training data where y = 3x + ε, with ε drawn from a normal distribution. We then fit both a linear and a cubic regression to it.
library(tidyverse)
f_x <- tibble(x = c(1:100)) %>% rowwise() %>% mutate(y = 3*x + rnorm(1, 0, 10))
linear <- lm(y ~ x, data = f_x)
cubic <- lm(y ~ poly(x,3), data = f_x)
We then calculate the RSS of each fit with deviance().
deviance(linear)
## [1] 9078.048
deviance(cubic)
## [1] 8694.13
We can see that the cubic (in this instance) has a lower RSS. But does this always hold true?
We create a function which returns a tibble of n observations in s sets, based on a function passed to it:
generate_observations <- function(observations, sets, func) {
  a <- tibble(x = 1:(observations * sets)) %>%
    rowwise() %>%
    mutate(y = func(x), set = x %% sets)
  return(a)
}
We then group by set and summarise each set with the RSS of a linear and a cubic regression. When we graph this, we can see the linear RSS (denoted by the square) looks to always be larger than the cubic RSS (denoted by the triangle).
generate_observations(100, 100, function(x) 3 * x + rnorm(1, 0, 20)) %>%
  group_by(set) %>%
  summarise(linear = lm(y ~ x) %>% deviance(), cubic = lm(y ~ poly(x, 3)) %>% deviance()) %>%
  ggplot() +
  geom_point(aes(set, linear), shape = 'square') +
  geom_point(aes(set, cubic), shape = 'triangle') +
  geom_segment(aes(x = set, y = linear, xend = set, yend = cubic), lineend = 'butt')
Even generating 10,000 different sets of 100 observations, we never see a linear RSS below that of the corresponding cubic RSS. This is what we should expect for training RSS: the linear model is nested within the cubic one, so the least squares cubic fit can never do worse on the data it was fitted to:
generate_observations(100, 10000, function(x) 3 * x + rnorm(1, 0, 20)) %>%
  group_by(set) %>%
  summarise(linear = lm(y ~ x) %>% deviance(), cubic = lm(y ~ poly(x, 3)) %>% deviance()) %>%
  mutate(rss_diff = linear - cubic) %>%
  arrange(rss_diff)
## # A tibble: 10,000 x 4
##      set linear  cubic rss_diff
##    <dbl>  <dbl>  <dbl>    <dbl>
##  1  9598 30312. 30312.   0.0225
##  2  4270 42619. 42619.   0.112
##  3  5788 30755. 30755.   0.132
##  4  6639 32849. 32849.   0.162
##  5  1702 37173. 37172.   0.304
##  6  5387 29745. 29744.   0.401
##  7  9273 40342. 40342.   0.432
##  8  9654 39562. 39561.   0.590
##  9   876 38851. 38850.   0.748
## 10  4172 38765. 38764.   0.909
## # … with 9,990 more rows