A simple linear regression predicts a response \(Y\) based on a single predictor variable \(X\).
\[ Y \approx \beta_0 + \beta_1 X \]
Once we have used training data to produce estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) for the coefficients, we can predict values by computing:
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\]
The goal is to obtain coefficients such that the linear model fits the data well - i.e. so that the fitted line is as close as possible to the data points. The most common approach is to minimise the least squares criterion.
We let \(e_i = y_i − \hat{y}_i\), which represents the \(i\)th residual. The residual sum of squares or RSS is the sum of the squares of each of these residuals.
An example in R - we generate our observations as \(y_i = f(x_i) + \epsilon_i\), and also make a ‘guess’ at an \(\hat{f}(x)\). We calculate our \(y_i\), and then calculate the RSS of the guess.
library(tidyverse)
library(knitr)
library(kableExtra)
set.seed(1)
x <- 1:100
f <- function(x) 4 * x
f_hat <- function(x) 3.5 * x
y_i <- f(x) + rnorm(length(x), 0, .5)
y_hat_i <- f_hat(x)
RSS <- sum( (y_i - y_hat_i)^2 )
RSS %>% kable(align = 'left') %>% kable_styling()
| x |
|---|
| 84863.91 |
With some calculus, the RSS minimisers are:
\[ \hat{\beta}_1 = \frac { \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) } { \sum_{i=1}^n (x_i - \bar{x})^2 } \] and
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \] where \(\bar{x}\) and \(\bar{y}\) are the sample means.
beta_hat_1 <- sum( (x - mean(x)) * (y_i - mean(y_i)) ) / sum( (x - mean(x))^2 )
beta_hat_0 <- mean(y_i) - (beta_hat_1 * mean(x))
beta_hat_0
## [1] 0.06583287
beta_hat_1
## [1] 3.999774
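As a quick sanity check (a sketch, not part of the original derivation), lm() should recover the same estimates:
# The built-in least squares fit should match the hand-computed coefficients.
coef(lm(y_i ~ x))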
Consider the analogy of \(\hat{\mu}\), the sample mean, being an estimate for \(\mu\), the population mean. The average sample mean over many data sets will be close to the population mean. In the code below, we create a population of 1,000,000 random variables with a very large standard deviation of 500.
set.seed(1)
population <- rnorm(1000000, 10, 500)
(mu <- mean(population))
## [1] 10.02345
mean(sample(population, 100))
## [1] 52.15676
(mu_hat <- map_dbl(1:2000, ~mean(sample(population, 100))) %>% mean())
## [1] 10.75234
We see the population mean is 10.02.
When we take the mean of a single sample, we see that it is a long way from the population mean.
However when we take the mean of 2000 sample means, we get a result that is very close to the population mean.
How close is a single estimate? In general we answer this by computing the standard error of \(\hat{\mu}\), written \(SE(\hat{\mu})\). We have the well-known formula
\[ Var(\hat{\mu}) = SE(\hat{\mu})^2 = \frac{\sigma^2}{n} \] where \(\sigma\) is the standard deviation of each of the realisations \(y_i\) of \(Y\).
Roughly, the standard error tells us the average amount that \(\hat{\mu}\) differs from the actual value of \(\mu\).
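A sketch of this formula using the simulated population above (the sample size of 100 and the 2,000 resamples mirror the earlier code):
# Compare the theoretical standard error sigma / sqrt(n) with the empirical
# standard deviation of many sample means.
n <- 100
sd(population) / sqrt(n)
sd(map_dbl(1:2000, ~ mean(sample(population, n))))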
We can ask how close our estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are to the true values. We use the following formulas:
\[ SE(\hat{\beta}_0)^2 = \sigma^2 \bigg[ \frac{1}{n} + \frac { \bar{x}^2 } { \sum_{i=1}^n (x_i - \bar{x})^2 } \bigg] \] \[ SE(\hat{\beta}_1)^2 = \frac { \sigma^2 } { \sum_{i=1}^n (x_i - \bar{x})^2 } \] where \(\sigma^2 = Var(\epsilon)\).
For these formulas to be strictly valid, we need to assume that the errors \(\epsilon_i\) for each observation are uncorrelated with common variance \(\sigma^2\).
In general \(\sigma^2\) is not known, but can be estimated from the data. This is the residual standard error \(RSE = \sqrt{RSS/(n-2)}\).
Using our previously generated values of y_i and y_hat_i, let's calculate the RSE:
RSE <- sqrt( (sum((y_i - y_hat_i)^2)) / (length(y_i) - 2) )
RSE
## [1] 29.42717
The RSE for our ‘guess’ is 29.42.
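As a sketch (se_beta_0 and se_beta_1 are names introduced here for illustration), we can plug the RSE of the least squares fit, rather than the guess, into the formulas above:
# Standard errors of the least squares estimates, using the RSE of the least
# squares fit as the estimate of sigma.
y_ls <- beta_hat_0 + beta_hat_1 * x
RSE_ls <- sqrt(sum((y_i - y_ls)^2) / (length(y_i) - 2))
se_beta_1 <- sqrt(RSE_ls^2 / sum((x - mean(x))^2))
se_beta_0 <- sqrt(RSE_ls^2 * (1 / length(x) + mean(x)^2 / sum((x - mean(x))^2)))
se_beta_0
se_beta_1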
The standard error can be used to construct confidence intervals. A 95% confidence interval is a range of values such that, with 95% probability, the range contains the true unknown value of the parameter. For the slope, the approximate 95% confidence interval is: \[ \hat{\beta}_1 \pm 2 \cdot SE(\hat{\beta}_1) \] As a rough illustration, treating the RSE of our ‘guess’ (29.42) as the standard error, the interval would run from 3.5 - 2 * 29.42 to 3.5 + 2 * 29.42. This is very wide, so we can’t be very ‘confident’. Of course our guess didn’t minimise the RSS.
The standard errors are used to perform hypothesis tests. The most common is the null hypothesis: \[ H_0 \text{ : there is no relationship between X and Y} \] versus the alternative hypothesis: \[ H_a \text{ : there is some relationship between X and Y } \] Mathematically: \[ H_0 \text{ : } \beta_1 = 0 \quad \text{versus} \quad H_a \text{ : } \beta_1 \neq 0 \]
To test the null hypothesis, we need to determine whether the coefficient estimate is sufficiently far from zero that we can be confident it is non-zero. How far depends on the accuracy of the estimate, \(SE(\hat{\beta_1})\).
We compute a t-statistic: \[ t = \frac {\hat{\beta}_1 - 0} {SE(\hat{\beta}_1)} \] This measures how many standard deviations \(\hat{\beta}_1\) is away from 0. If there is no relationship between \(X\) and \(Y\), then we expect \(t\) to have a t-distribution with \(n - 2\) degrees of freedom.
We can then compute the probability of observing any value equal to \(\lvert t \rvert\) or larger, assuming \(\beta_1 = 0\). We call this probability the p-value. Roughly speaking, we interpret the p-value as follows: a small p-value indicates that it is unlikely that such a substantial association between the predictor and the response would be observed purely by chance.
If we see a small p-value, then we can infer that there is an association between the predictor and the response, and we reject the null hypothesis. Typical p-value cutoffs for rejecting the null hypothesis are 5% or 1%. When n = 30, these correspond to t-statistics of around 2 and 2.75 respectively.
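A sketch in R, reusing se_beta_1 from the standard error sketch above:
# t-statistic and two-sided p-value for beta_hat_1, with n - 2 degrees of
# freedom.
t_stat <- beta_hat_1 / se_beta_1
p_value <- 2 * pt(abs(t_stat), df = length(x) - 2, lower.tail = FALSE)
t_stat
p_value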
The quality of the model is typically assessed using either the residual standard error (RSE) or the \(R^2\) statistic.
The RSE is an estimate of the standard deviation of \(\epsilon\).
\[ RSE = \sqrt{ \frac{1}{n-2}RSS } = \sqrt{\frac{1}{n-2}\sum_{i=1}^n (y_i - \hat{y}_i)^2} \] The RSE is considered a measure of the lack of fit of the model. However it’s an absolute measure, with the same units as Y. This makes it difficult to determine what a ‘good’ RSE is.
The \(R^2\) is an alternative measure of fit, and is the proportion of variance explained. It takes on a value between 0 and 1. It is \[ R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS} \]
Let’s look at it using our generated data - we re-calculate RSS from above for visibility.
RSS <- sum( (y_i - y_hat_i)^2 )
TSS <- sum( (y_i - mean(y_i))^2 )
R_squared <- 1 - (RSS / TSS)
R_squared
## [1] 0.9363395
The constituent parts can be thought of as follows:
* The TSS is the amount of variability inherent in the response before the regression is performed.
* The RSS measures the amount of variability left unexplained after performing the regression.
* TSS - RSS is therefore the amount of variability in the response that is explained by performing the regression.
* \(R^2\) measures the proportion of variability in Y that can be explained using X.
Instead of fitting a separate simple linear regression for each predictor, the simple linear regression can be extended to accommodate multiple predictors. Each predictor is given its own slope coefficient.
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \epsilon \]
Each \(\beta_j\) quantifies the association between that variable and the response. We interpret \(\beta_j\) as the average effect on \(Y\) of a one unit increase in \(X_j\), holding all other predictors fixed.
We choose \(\hat{\beta}_0, \ldots, \hat{\beta}_p\) to minimise the RSS:
\[ RSS = \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - \ldots - \hat{\beta}_p x_{ip})^2 \]
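A minimal sketch with made-up data (x1, x2 and the coefficients below are invented for illustration); lm() minimises this RSS for us:
# Simulate two predictors and a response, then fit a multiple regression.
set.seed(1)
x1 <- runif(100)
x2 <- runif(100)
y_multi <- 2 + 3 * x1 - 1.5 * x2 + rnorm(100, 0, 0.3)
fit_multi <- lm(y_multi ~ x1 + x2)
coef(fit_multi)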
In a simple linear regression we simply check whether \(\beta_1 = 0\). In the multiple regression setting with \(p\) predictors, we need to ask whether all of the regression coefficients are zero:
\[ H_0 \text{ : } \beta_1 = \beta_2 = \dots = \beta_p = 0 \]
This hypothesis test is performed by computing the F-statistic: \[ F = \frac {(TSS - RSS)/p} {RSS / (n - p - 1)} \] In R this would be (with TSS, RSS, p and n defined):
F_stat <- ((TSS - RSS) / p) / (RSS / (n - p - 1))
If the linear model assumptions are correct, then \[E\{ RSS / (n - p - 1) \} = \sigma^2\] and, provided \(H_0\) is true, \[E\{ (TSS - RSS) / p \} = \sigma^2\]
Thus if there is no relationship between the response and the predictors, one would expect the F-statistic to be close to 1.
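For the simulated multiple regression above, summary() reports this F-statistic directly:
# F-statistic (value, numerator df, denominator df) for fit_multi.
summary(fit_multi)$fstatistic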
How large does the F-statistic need to be? This depends on \(n\) and \(p\): when \(n\) is large, an F-statistic just a little larger than 1 might still provide evidence against \(H_0\). However, if \(n\) is small, a larger F-statistic is needed to reject \(H_0\).
For any values of \(n\) and \(p\), the p-value for the F-statistic can be calculated. Based on the p-value, we can determine whether or not to reject \(H_0\).
Given individual p-values for each coefficient, why look at the overall F-statistic? Consider \(p = 100\) and suppose \(H_0: \beta_1 = \dots = \beta_p = 0\) is true, so no variable is truly associated with the response. About 5% of the individual p-values will still fall below 0.05 by chance. Hence if we relied on the individual t-statistics and p-values, we would almost certainly conclude, incorrectly, that at least one predictor is associated with the response. The F-statistic does not suffer from this problem because it adjusts for the number of predictors.
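A sketch of this effect with made-up data (the sample size of 200 and the pure-noise predictors are invented for illustration):
# With 100 noise predictors and a response unrelated to any of them, a
# nontrivial share of individual p-values fall below 0.05 by chance, while
# the overall F-statistic stays near 1.
set.seed(2)
noise_x <- matrix(rnorm(200 * 100), nrow = 200)
noise_y <- rnorm(200)
fit_noise <- lm(noise_y ~ noise_x)
mean(summary(fit_noise)$coefficients[-1, 4] < 0.05)  # share of 'significant' slopes
summary(fit_noise)$fstatistic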
Three common approaches to variable selection (a sketch in R follows this list):
* Forward selection - begin with the null model, fit \(p\) simple linear regressions, and add to the model the variable that results in the lowest RSS. Repeat with the remaining variables.
* Backward selection - start with all variables in the model, and remove the variable with the largest p-value. The new \((p - 1)\)-variable model is fit and the process continues.
* Mixed selection - start with no variables, and add variables one at a time as in forward selection. The p-values of variables already in the model may grow as new predictors are added; if one rises above a certain threshold, that variable is removed.
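As a related sketch, base R's step() performs stepwise selection, though it ranks candidate models by AIC rather than individual p-values:
# Backward selection by AIC, starting from the simulated multiple regression.
step(fit_multi, direction = "backward")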
The two most common numerical measures of model fit are the RSE and \(R^2\). The \(R^2\) always increases when more variables are added to the model.
Once we have a model, it is straightforward to use it to predict. However, there are three sorts of uncertainty associated with the prediction:
1. The coefficient estimates are estimates for the true population regression plane. This inaccuracy is related to the reducible error. We compute a confidence interval in order to determine how close \(\hat{Y}\) will be to \(f(X)\).
2. Assuming a linear \(f(X)\) is an approximation of reality, so there is another form of reducible error, called model bias.
3. Even if we knew \(f(X)\), the response cannot be predicted perfectly because of the random error \(\epsilon\). This is the irreducible error.
We use prediction intervals to answer how much \(Y\) varies from \(\hat{Y}\). Prediction intervals are wider than confidence intervals, as they take into account both the reducible and the irreducible error.
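A sketch with predict() on the simple simulated data (fit_simple and the new x values are introduced here for illustration); the prediction intervals come out wider than the confidence intervals:
# Confidence intervals for the mean response versus prediction intervals for
# an individual response, at a few new x values.
fit_simple <- lm(y_i ~ x)
new_x <- data.frame(x = c(25, 50, 75))
predict(fit_simple, new_x, interval = "confidence")
predict(fit_simple, new_x, interval = "prediction")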
If a qualitative predictor (factor) has only two levels, then it can be incorporated into a model using a 0/1 encoding.
When a qualitative predictor has more than two levels, additional dummy variables can be created. If the three levels are \(\{A,B,C\}\), then \(x_{i1}\) is 1 if the \(i\)th value is \(A\) and 0 otherwise, and \(x_{i2}\) is 1 if the \(i\)th value is \(B\) and 0 otherwise. If both \(x_{i1}\) and \(x_{i2}\) are 0, then the value is \(C\).
It should be noted that while the final predictions will be the same regardless of the encoding, the p-values do depend on the choice. The F-test should be used to assess the statistical evidence, as it does not depend on the dummy variable encoding.
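A sketch of the encoding R builds automatically for a factor (the three-level factor below is made up):
# model.matrix() shows the dummy variables created for a factor predictor.
grp <- factor(sample(c("A", "B", "C"), 10, replace = TRUE))
model.matrix(~ grp)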
Interaction terms can be used by adding another predictor which is the product of two of the predictors: \[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon \]
A model with no interaction terms contains only main effects.
The hierarchical principle states that if we include an interaction in a model, we should also include the main effects, even if their p-values are not significant.
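A sketch using the simulated predictors from above: in R's formula interface, x1 * x2 expands to the main effects plus their interaction, which keeps the model consistent with the hierarchical principle.
# Equivalent to lm(y_multi ~ x1 + x2 + x1:x2).
coef(lm(y_multi ~ x1 * x2))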
The linear model assumes that there is a straight line relationship between the predictors and the response.
Residual plots are a graphical tool for identifying non-linearity. The residuals \((y_i - \hat{y}_i)\) are plotted against the predictor \(x_i\). When there are multiple predictors, the residuals are plotted against the predicted values \(\hat{y}_i\). Ideally the residual plot will show no discernible pattern.
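A sketch of a residual plot for the simple simulated fit (fit_simple from the prediction interval sketch above):
# Residuals against fitted values; ideally no pattern is visible.
plot(fitted(fit_simple), resid(fit_simple),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)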
An important assumption of the linear regression model is that the error terms \(\epsilon_1, \epsilon_2, \ldots, \epsilon_n\) are uncorrelated. If the error terms are correlated, the estimated standard errors will tend to underestimate the true standard errors. Thus p-values will be lower than they should be.
These correlations frequently occur in time series data, where observations obtained at adjacent time points will have positively correlated errors.
The residuals can be plotted against time, and we can check for tracking - i.e. adjacent residuals with similar values.
Another important assumption is that the error terms have a constant variance: \(Var(\epsilon_i) = \sigma^2\). Non-constant error variance is called heteroscedasticity. In our examples above, we've been using the rnorm() function to generate the error term, so the variance is constant by construction. Let's take a look at the variance:
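As a sketch, the simulated errors can be recovered as y_i minus the true f(x):
# The errors were drawn with sd = 0.5, so their variance should be roughly
# 0.25 and should not depend on x.
e_i <- y_i - f(x)
var(e_i)
var(e_i[1:50])   # first half of the x range
var(e_i[51:100]) # second half of the x range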
One can identify non-constant variances from the presence of a funnel shape in the residual plot (x = fitted, y = residuals).
When faced with this issue, transforming the response with a concave function such as sqrt(Y) or log(Y) results in a greater amount of shrinkage of the larger values, leading to a reduction in heteroscedasticity.
An outlier is a point for which \(y_i\) is far from the value predicted by the model. An outlier typically does not have a large effect on the least squares fit, but it can have an impact on the RSE, and therefore on the confidence intervals and p-values.
To identify outliers, we can compute the studentised residuals, obtained by dividing each residual \(e_i\) by its estimated standard error. Observations whose studentised residuals are greater than 3 in absolute value are possible outliers.
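A sketch using rstudent() on fit_simple:
# Flag observations whose studentised residuals exceed 3 in absolute value.
which(abs(rstudent(fit_simple)) > 3)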
If it’s an error in data collection, the observation may be removed. However care must be taken since it may be a deficiency with the model.
Observations with high leverage have an unusual value for \(x_i\). These observations have a large impact on the least squares line.
In order to quantify an observation's leverage, a leverage statistic is computed for each observation. For a simple linear regression:
\[ h_i = \frac{1}{n} + \frac { (x_i - \bar{x})^2 } { \sum_{i'=1}^n (x_{i'} - \bar{x})^2} \]
The leverage statistic \(h_i\) is always between \(1/n\) and 1.
The average leverage over all the observations is \((p + 1) / n\), so if a given observation has a leverage statistic that greatly exceeds this, we may suspect that the corresponding point has high leverage.
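A sketch using hatvalues() on fit_simple; with a single predictor the average leverage is \((1 + 1)/100\):
# Leverage statistics; their mean equals (p + 1) / n.
h <- hatvalues(fit_simple)
mean(h)
which(h > 3 * mean(h))  # an illustrative cutoff for 'greatly exceeds'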
Collinearity refers to the situation where two or more predictor values are closely related to one another. It can be difficult to separate the individual effects of collinear variables on the response.
Collinearity reduces the accuracy of the coefficient estimates, and causes the standard error of \(\hat{\beta}_j\) to grow. Since the t-statistic is the coefficient estimate divided by its standard error, collinearity results in a decline in the t-statistic. We may then fail to reject \(H_0 \text{ : } \beta_j = 0\).
Looking at the correlation matrix is one way to detect collinearity. However, multicollinearity can exist between three or more variables even when no single pair is highly correlated. The variance inflation factor (VIF) is the ratio of the variance of \(\hat{\beta}_j\) when fitting the full model to the variance of \(\hat{\beta}_j\) if fit on its own. The smallest possible VIF is 1. In general a VIF over 5 or 10 indicates a problematic amount of collinearity.
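A sketch using vif() from the car package (assumed to be installed), applied to the simulated multiple regression:
# Variance inflation factors for each predictor in fit_multi.
car::vif(fit_multi)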
The KNN regression is closely related to the KNN classifier. Given a value for \(K\) and a prediction point \(x_0\), KNN regression identifies \(K\) training observations that are closest to \(x_0\), represented by \(\mathcal{N}_0\). It then estimates \(f(x_0)\) using the average of all the training responses in \(\mathcal{N}_0\).
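A sketch using knn.reg() from the FNN package (assumed to be installed), predicting at a single made-up point \(x_0 = 50.5\) with \(K = 5\), using the simulated simple regression data:
# Average the responses of the 5 training observations nearest to x_0.
library(FNN)
knn.reg(train = matrix(x), test = matrix(50.5), y = y_i, k = 5)$pred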
As the number of dimensions increases (\(p\) gets bigger), it is typical for the MSE of KNN to increase. This results from the fact that in higher dimensions there is effectively a reduction in sample size. This is called the curse of dimensionality: the \(K\) observations nearest to \(x_0\) may be far away in \(p\)-dimensional space when \(p\) is large.
As a general rule, parametric methods tend to outperform non-parametric methods when \(p\) is large relative to the number of observations.