Draw an example (of your own invention) of a partition of two-dimensional feature space that could result from recursive binary splitting. Your example should contain at least six regions. Draw a decision tree corresponding to this partition. Be sure to label all aspects of your figures, including the regions \(R_1, R_2, \ldots\), the cutpoints \(t_1, t_2, \ldots\), and so forth.
library(tidyverse)   # tibble, dplyr, ggplot2

tibble(
xmin = c(0, 3, 2, 3, 3, 0),
xmax = c(2, 4, 3, 4, 4, 2),
ymin = c(0, 0, 0, 4, 6, 4),
ymax = c(4, 4, 8, 6, 8, 8),
R = c('R1', 'R2', 'R3', 'R4', 'R5', 'R6'),
a = seq(from = .9, to = .4, by = -.1)   # fill shade for each region
) %>%
ggplot(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = a)) +
geom_rect(linetype = 1) +
geom_text(aes((xmin + xmax)/2, (ymin + ymax)/2, label = R), colour = 'white') +
labs(x = 'X1', y = 'X2') +
theme(legend.position = 'none')
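The exercise also asks for the decision tree corresponding to this partition. Below is a minimal sketch, using the same data.tree / as.Node() approach used later in this document: the cutpoints are \(t_1: X_1 = 2\), \(t_2: X_2 = 4\), \(t_3: X_1 = 3\), \(t_4: X_2 = 4\), and \(t_5: X_2 = 6\), with the left branch of each split corresponding to ‘yes’.
library(data.tree)   # as.Node() and plot() for tree diagrams

tibble(
pathString = c(
't1: X1 < 2/t2: X2 < 4/R1',
't1: X1 < 2/t2: X2 < 4/R6',
't1: X1 < 2/t3: X1 < 3/R3',
't1: X1 < 2/t3: X1 < 3/t4: X2 < 4/R2',
't1: X1 < 2/t3: X1 < 3/t4: X2 < 4/t5: X2 < 6/R4',
't1: X1 < 2/t3: X1 < 3/t4: X2 < 4/t5: X2 < 6/R5'
)
) %>%
as.Node() %>%
plot()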
It is mentioned in Section 8.2.3 that boosting using depth-one trees (or stumps) leads to an additive model: that is, a model of the form:
\[ f(X) = \sum_{j=1}^p f_j(X_j) \]
Explain why this is the case.
Recall that with boosting we fit many trees sequentially: on each iteration, a tree is fit to the residuals of the current model rather than to \(Y\), and the shrunken trees are added together. The parameter \(d\) determines the number of splits in each tree, yielding \(d + 1\) terminal nodes.
An additive model is one in which the prediction is a sum of functions of the individual predictors (in non-parametric regression these components are often fit with one-dimensional smoothers).
The final boosted model is \(\hat{f}(x) = \sum_{b=1}^B \lambda \hat{f}^b(x)\). With \(d = 1\), each \(\hat{f}^b\) is a stump that makes a single split on a single variable, so each term is a function of one predictor only. Grouping the \(B\) terms by the variable they split on gives the form \(f(X) = \sum_{j=1}^p f_j(X_j)\), so the final model is additive.
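To make this concrete, the sketch below hand-rolls boosting with stumps using the rpart package on simulated data (the data-generating process, the shrinkage \(\lambda\), and the number of iterations \(B\) are illustrative assumptions, not taken from the text):
library(rpart)

set.seed(1)
n  <- 200
df <- data.frame(x1 = runif(n), x2 = runif(n))
df$y <- sin(2 * pi * df$x1) + df$x2^2 + rnorm(n, sd = 0.1)

lambda <- 0.1   # shrinkage
B      <- 100   # number of stumps
f_hat  <- rep(0, n)
df$r   <- df$y  # residuals start as the raw response

for (b in seq_len(B)) {
  # A depth-one tree makes exactly one split, so each fitted stump is a
  # function of a single predictor (whichever split reduces the RSS most).
  stump <- rpart(r ~ x1 + x2, data = df,
                 control = rpart.control(maxdepth = 1, cp = 0))
  f_hat <- f_hat + lambda * predict(stump, df)
  df$r  <- df$r - lambda * predict(stump, df)
}

# f_hat is a sum of B one-variable functions; collecting the stumps that
# split on x1 into f_1 and those that split on x2 into f_2 gives
# f(X) = f_1(X1) + f_2(X2), i.e. an additive model.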
Consider the Gini index, classification error, and cross-entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of \(\hat{p}_{m1}\). The x-axis should display \(\hat{p}_{m1}\), ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.
The three measures mentioned are the ‘replacements’ for RSS in a classification setting when determining the binary splits.
\[ E = 1 - \max_{k}( \hat{p}_{mk} ) \]
Here \(\hat{p}_{mk}\) represents the proportion of training observations in the \(m\)th region that are from the \(k\)th class.
\[ G = \sum_{k=1}^K \hat{p}_{mk} (1 - \hat{p}_{mk}) \]
\[ D = -\sum_{k = 1}^K \hat{p}_{mk} \log(\hat{p}_{mk}) \]
tibble(
p_11 = seq(0, 1, by = 0.01),     # proportion of class 1 in the node
p_12 = seq(1, 0, by = -0.01),    # proportion of class 2 (1 - p_11)
E = 1 - pmax(p_11, p_12),                        # classification error
G = (p_11 * (1 - p_11)) + (p_12 * (1 - p_12)),   # Gini index
D = -( p_11 * log(p_11) + p_12 * log(p_12) )     # cross-entropy; NaN when p is 0 or 1
) %>%
gather(type, value, c(E, G, D)) %>%
ggplot() +
geom_point(aes(p_11, value, colour = type)) +
labs(x = 'p_m1', y = 'Value of measure')
## Warning: Removed 2 rows containing missing values (geom_point).
This question relates to the plots in Figure 8.12.
Sketch the tree corresponding to the partition of the predictor space illustrated in the left-hand panel of Figure 8.12. The numbers inside the boxes indicate the mean of Y within each region.
The partition is as follows, with the mean of \(Y\) shown in each region:
tibble(
xmin = c(1, -1, -1, 0, 0),
xmax = c(3, 1, 0, 1, 1),
ymin = c(-1, 1, -1, -1, 0),
ymax = c(3, 3, 1, 0, 1),
R = c('5', '15', '3', '10', '0'),
colour = seq(3, 7)
) %>%
ggplot() +
geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = colour), linetype = 2) +
geom_text(aes((xmin + xmax)/2, (ymin + ymax)/2, label = R), colour = 'white') +
labs(x = 'X1', y = 'X2') +
theme(legend.position = 'none')
We have the following paths to terminal nodes:
tibble(
L1 = rep('X1 < 1', 5),
L2 = c(rep('X2 < 1', 4), 5),
L3 = c(rep('X1 < 0', 3), 15, ''),
L4 = c(3, rep('X2 < 0', 2), '', ''),
L5 = c('', 10, 0, '', '')
) %>%
unite(col = 'pathString', sep = '/') %>%
as.Node() %>%
plot()
Create a diagram similar to the left-hand panel of Figure 8.12, using the tree illustrated in the right-hand panel of the same figure. You should divide up the predictor space into the correct regions, and indicate the mean for each region.
First, we reproduce the tree from the right-hand panel of Figure 8.12:
tibble(
L1 = c( rep('X2 < 1', 5) ),
L2 = c( rep('X1 < 1', 2), rep('X2 < 2', 3) ),
L3 = c( '-1.80', '0.63', rep('X1 < 0', 2), '2.49' ),
L4 = c( '', '', '-1.06', '0.21', '' )
) %>%
unite(col = 'pathString', sep = '/') %>%
as.Node() %>%
plot()
This tree translates to the following partition of the predictor space:
tibble(
xmin = c(-3, 1, -3, 0, -3),
xmax = c(1, 3, 0, 3, 3),
ymin = c(-3, -3, 1, 1, 2),
ymax = c(1, 1, 2, 2, 3),
R = c(-1.80, 0.63, -1.06, 0.21, 2.49),
colour = seq(3, 7)
) %>%
ggplot() +
geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = colour), linetype = 2) +
geom_text(aes((xmin + xmax)/2, (ymin + ymax)/2, label = R), colour = 'white') +
labs(x = 'X1', y = 'X2') +
theme(legend.position = 'none')
Suppose we produce ten bootstrapped samples from a data set containing red and green classes. We then apply a classification tree to each bootstrapped sample and, for a specific value of \(X\), produce 10 estimates of \(P(\text{Class is Red} \mid X)\):
0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, and 0.75.
There are two common ways to combine these results together into a single class prediction. One is the majority vote approach discussed in this chapter. The second approach is to classify based on the average probability. In this example, what is the final classification under each of these two approaches?
In this instance, an observation is classified as red when the relevant probability is \(\ge 0.5\). Under the majority-vote approach, 6 of the 10 estimates are at least 0.5, so the classification is Red; under the average-probability approach, the mean estimate is 0.45, so the classification is Green.
q6 <- c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75)
# Majority vote: take the most common of the ten class predictions.
names( which.max( table( ifelse(q6 < .5, 'Green', 'Red') ) ) )
## [1] "Red"
# Average probability: classify as red only if the mean probability is >= 0.5.
ifelse( mean(q6) < .5, 'Green', 'Red' )
## [1] "Green"
Provide a detailed explanation of the algorithm that is used to fit a regression tree.
Ideally we would like to partition the feature space into \(J\) boxes \(R_1, \ldots, R_J\) that minimise the RSS, given by:
\[ \sum_{j=1}^J \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2 \]
where \(\hat{y}_{R_j}\) is the mean response of the training observations in the \(j\)th box. Considering every possible partition of the feature space into \(J\) boxes is computationally infeasible, so a greedy, top-down approach is used instead.
The algorithm for fitting a regression tree is as follows (a code sketch appears after the steps):
Use ‘Recursive Binary Splitting’ to grow a large tree on the training data: at each step, choose the predictor \(X_j\) and cutpoint \(s\) whose split gives the greatest reduction in RSS, and stop once every terminal node has fewer than some minimum number of observations.
Apply ‘Cost Complexity Pruning’ to obtain a sequence of best subtrees as a function of \(\alpha\).
Use K-fold cross-validation to choose \(\alpha\).
Return the subtree from the pruning sequence that corresponds to the chosen value of \(\alpha\).
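As a rough illustration of these steps in R, the sketch below uses the rpart package on simulated data. Note that rpart's cp parameter plays the role of \(\alpha\) (up to scaling), and the data set, formula, and parameter values here are illustrative assumptions rather than part of the exercise.
library(rpart)

set.seed(1)
n  <- 500
df <- data.frame(x1 = runif(n), x2 = runif(n))
df$y <- ifelse(df$x1 < 0.5, 2, 5) + 3 * (df$x2 > 0.7) + rnorm(n, sd = 0.5)

# Step 1: recursive binary splitting grows a large tree (cp = 0 keeps
# splitting until the minimum node size is reached).
big_tree <- rpart(y ~ x1 + x2, data = df, method = 'anova',
                  control = rpart.control(cp = 0, xval = 10))

# Steps 2-3: the cptable records the cost-complexity sequence together with
# the 10-fold cross-validation error for each candidate subtree.
cp_table <- big_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, 'xerror']), 'CP']

# Step 4: return the subtree corresponding to the chosen penalty.
pruned_tree <- prune(big_tree, cp = best_cp)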