1)
Indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method for the following scenarios:
- The sample size \(n\) is extremely large and the number of predictors \(p\) is small.
- Better - a more flexible approach will fit the data more closely, and the large sample size guards against overfitting.
- The number of predictors \(p\) is extremely large and the number of observations \(n\) is small.
- Worse - the flexible approach would overfit the small number of observations.
- The relationship between the predictors and response is highly non-linear.
- Better - a flexible model can capture the non-linear relationship, whereas an inflexible model would suffer from high bias.
- The variance of the error terms (\(Var(\epsilon)\)) is extremely high.
- Worse - a flexible model would attempt to follow the irreducible error (i.e. fit the noise); see the simulation sketch after this list.
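To make the last two points concrete, here is a minimal simulation sketch (not part of the exercise; the linear true model, sample sizes, and noise level are arbitrary illustrative choices). With a high-variance error term, a flexible polynomial fit chases the noise and its test error ends up worse than that of an inflexible linear fit:

```python
# Illustrative sketch only: compare an inflexible fit (degree-1 polynomial)
# with a flexible fit (degree-10 polynomial) when Var(epsilon) is high.
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)

def simulate(n, noise_sd):
    x = rng.uniform(-3, 3, n)
    y = 2 * x + rng.normal(0, noise_sd, n)  # true relationship is linear
    return x, y

x_train, y_train = simulate(30, noise_sd=4.0)   # small, noisy training set
x_test, y_test = simulate(1000, noise_sd=4.0)

for degree in (1, 10):                           # inflexible vs flexible
    coefs = P.polyfit(x_train, y_train, degree)
    train_mse = np.mean((P.polyval(x_train, coefs) - y_train) ** 2)
    test_mse = np.mean((P.polyval(x_test, coefs) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_mse:.2f}, test MSE = {test_mse:.2f}")
```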
2)
Explain for each scenario below whether it is a classification or regression problem, whether we’re interested in inference or prediction, and what \(n\) and \(p\) are.
- We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding what factors affect CEO salary.
- Regression
- Inference
- \(n = 500,\ p = 3\)
- We are considering launching a new product and wish to know whether it will be a success or failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it is a success or failure, price charged, marketing budget, competition price, and ten other variables.
- Classification (success or failure)
- Prediction
- \(n = 20,\ p = 13\) (price, marketing budget, competition price, and the ten other variables; success/failure is the response)
- We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.
- Regression
- Prediction
- \(n = 52,\ p = 3\)
3)
Sketch typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches.
The x-axis is the amount of flexibility in the model, and the y-axis is the error for each curve; a stylised version of the plot is sketched in the code after this list.
As flexibility of a model increases:
- The bias decreases.
- The variance increases.
- The training error decreases.
- The test error decreases, then increases.
- The irreducible error is a constant.
- The Bayes error rate is the classification analogue of the irreducible error: it lies between 0 and 1 and is the proportion of observations that fall on the ‘wrong’ side of the Bayes decision boundary, so it is the lowest achievable test error rate and does not depend on flexibility.
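The following is a stylised sketch of the requested plot; the functional forms are hand-picked to reproduce the qualitative shapes described in the list above and are not derived from any data set:

```python
# Rough sketch (assumed shapes only) of the curves in the bias-variance
# decomposition: test MSE = squared bias + variance + irreducible error,
# with training error shown for comparison.
import numpy as np
import matplotlib.pyplot as plt

flexibility = np.linspace(0.5, 10, 200)

bias_sq = 4 / flexibility                        # squared bias falls with flexibility
variance = 0.08 * flexibility ** 2               # variance rises with flexibility
irreducible = np.full_like(flexibility, 1.0)     # constant Var(epsilon)
test_error = bias_sq + variance + irreducible    # U-shaped
training_error = 3.5 / flexibility + 0.3         # keeps decreasing

for curve, label in [(bias_sq, "squared bias"), (variance, "variance"),
                     (irreducible, "irreducible error"),
                     (test_error, "test error"), (training_error, "training error")]:
    plt.plot(flexibility, curve, label=label)
plt.xlabel("Flexibility")
plt.ylabel("Error")
plt.legend()
plt.show()
```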
4)
- Describe three real life applications for classification, including the response and the predictors. Inference or prediction?
- Classifying malicious network traffic. Response is malicious / non-malicious; predictors are the ports and protocols used. Inference.
- Describe three real life applications for regression, including the response and the predictors. Inference or prediction?
- Analysing network bandwidth. Response is the bandwidth; predictor is the time of day. Prediction.
- Describe three real life applications for cluster analysis.
- Groups of voters.
- Groups of different customers.
- Different types of genes.
5)
What are the advantages and disadvantages of a very flexible approach? When is flexibility preferred over inflexibility and vice versa?
A flexible approach may be closer to the ‘true’ model, provide a better fit to the data, and give more accurate predictions, which makes it preferable when the underlying relationship is complex or non-linear. However, if too flexible a model is used it may overfit the data: its predictive power is then reduced as its variance increases, so an inflexible model is preferred when the data are noisy or limited, or when interpretable inference is the goal. The sketch below illustrates the trade-off.
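As a hedged illustration (the data-generating process and the choice of k values are arbitrary), k-nearest neighbours regression shows the trade-off directly: the most flexible setting (k = 1) drives the training error to essentially zero, yet its test error is higher than that of a moderately flexible fit.

```python
# Sketch of the flexibility trade-off with k-nearest neighbours regression:
# small k = very flexible (low bias, high variance), large k = inflexible.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.5, 200)  # noisy non-linear data

X_train, X_test = X[:100], X[100:]
y_train, y_test = y[:100], y[100:]

for k in (1, 10, 50):   # decreasing flexibility
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k:2d}  train MSE={mean_squared_error(y_train, knn.predict(X_train)):.3f}"
          f"  test MSE={mean_squared_error(y_test, knn.predict(X_test)):.3f}")
```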