library(broom)
library(modelr)
library(tidyverse)
library(ISLR)
library(ggdendro)
library(ggfortify)

9.6 - Support Vector Machines

We use the e1071 library to demonstrate support vector machines.

library(e1071)

9.6.1 - Support Vector Classifier

The e1071 library contains implementations of a number of statistical learning methods. The svm() function fits a support vector classifier when the argument kernel = 'linear' is used.

We generate observations that are in two classes:

set.seed(1)
X <- tibble(
    x1 = rnorm(20), 
    x2 = rnorm(20), 
    y = as.factor(c(rep(-1, 10), rep(1,10)))
)

X[1:10,1:2] = X[1:10,1:2] + 1

X %>% ggplot(aes(x2,x1)) +
    geom_point(aes(colour = y))

The classes are not linearly separable.

We now fit the support vector classifier:

X %>%
    svm(y ~ ., data = ., scale = F, cost = 10, kernel = 'linear') -> X_fit

plot(X_fit, data = X)

What we can see is the separating hyperplane (a line in two dimensions) splitting the classes. The decision boundary between the two classes is linear because we used the kernel = 'linear' argument.

The support vectors are plotted as crosses and the other observations as circles. We can get their identities from the $index component:

X_fit$index
##  [1]  1  3  4  6  8 10 11 12 18 19 20

As usual, summary() can give us information on the classifier fit:

X_fit %>% summary()
## 
## Call:
## svm(formula = y ~ ., data = ., cost = 10, kernel = "linear", 
##     scale = F)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
## 
## Number of Support Vectors:  11
## 
##  ( 6 5 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  -1 1

Let's use a smaller cost value:

X %>%
    svm(y~., data = ., scale = F, cost = .09, kernel = 'linear') %>%
    plot(data = X)

The smaller cost means we obtain more support vectors because the margin is wider.
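
As a quick check (a sketch; the exact counts depend on the simulated data above), we can refit at the smaller cost and compare the number of support vectors stored in $index:

X_fit_small <- svm(y ~ ., data = X, scale = F, cost = .09, kernel = 'linear')

# number of support vectors at cost = 0.09, versus the 11 found at cost = 10
length(X_fit_small$index)
length(X_fit$index)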

The e1071 library contains a tune() function to perform cross-validation. By default it performs ten-fold CV.

tune(
    svm, y ~ ., data = X, kernel = 'linear',
    ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 100))
) -> X_tune
summary(X_tune)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     1
## 
## - best performance: 0.25 
## 
## - Detailed performance results:
##    cost error dispersion
## 1 1e-03  0.45  0.4972145
## 2 1e-02  0.45  0.4972145
## 3 1e-01  0.35  0.4743416
## 4 1e+00  0.25  0.3535534
## 5 5e+00  0.30  0.3496029
## 6 1e+02  0.30  0.4216370

We see that cost = 1 results in the lowest cross-validation error rate.
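
Besides summary(), the chosen parameters and the full cross-validation table are stored on the tune object (a quick sketch; output omitted):

# best cost found by cross-validation, and the full CV results table
X_tune$best.parameters
X_tune$performances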

The best model can be accessed using $best.model:

X_tune$best.model -> X_best
X_best
## 
## Call:
## best.tune(method = svm, train.x = y ~ ., data = X, ranges = list(cost = c(0.001, 
##     0.01, 0.1, 1, 5, 100)), kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  13

The predict() function can be used with the fitted model to predict the class labels of new observations. Let's generate a test data set:

set.seed(2)
X_testset <- tibble(
    x1 = rnorm(20), 
    x2 = rnorm(20), 
    y = as.factor(c(rep(-1, 10), rep(1,10)))
)

X_testset[1:10,1:2] = X_testset[1:10,1:2] + 1

X_testset %>% ggplot(aes(x2,x1)) +
    geom_point(aes(colour = y))

Now we can predict the labels:

X_testset %>%
    mutate(class_pred = predict(X_best, X_testset)) -> X_testset

table(real_class = X_testset$y, predicted_class = X_testset$class_pred)
##           predicted_class
## real_class -1 1
##         -1  9 1
##         1   3 7

Four of the observations have been misclassified.
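
The same count can be computed directly from the prediction column rather than read off the table (a quick sketch):

# 4 of the 20 test observations are misclassified, an error rate of 0.2
sum(X_testset$y != X_testset$class_pred)
mean(X_testset$y != X_testset$class_pred)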

Now we consider the case where the classes are linearly separable. We shift the two classes further apart:

X[1:10,1:2] = X[1:10,1:2] + 1.2

X %>% ggplot(aes(x2,x1)) +
    geom_point(aes(colour = y))

We fit a support vector classifier with a very large cost so that no observations are misclassified:

X %>% 
    svm(y ~ ., data = ., kernel = 'linear', cost = 1e5) -> X_fit
plot(X_fit, data = X)

summary(X_fit)
## 
## Call:
## svm(formula = y ~ ., data = ., kernel = "linear", cost = 1e+05)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1e+05 
## 
## Number of Support Vectors:  3
## 
##  ( 2 1 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  -1 1

9.6.2 - Support Vector Machines

In order to fit a support vector machine using a non-linear kernel, we change the value of the kernel argument: kernel = 'polynomial' (together with a degree argument) for a polynomial kernel, or kernel = 'radial' (together with a gamma argument) for a radial kernel.
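
The calls look something like this (a sketch only, reusing the data X from the previous section purely to show the syntax; output not shown):

# sketch: a quadratic polynomial kernel and a radial kernel
svm(y ~ ., data = X, kernel = 'polynomial', degree = 2, cost = 1)
svm(y ~ ., data = X, kernel = 'radial', gamma = 1, cost = 1)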

We generate data with a non-linear class boundary:

tibble(
    x1 = rnorm(200),
    x2 = rnorm(200),
    y = as.factor( c(rep(1, 150), rep(2, 50)) )
) -> A

A[1:100,1:2] = A[1:100,1:2] + 2
A[101:150,1:2] = A[101:150,1:2] - 2

A %>%
    ggplot(aes(x2, x1, colour = y)) +
    geom_point()

We split the data into training and test groups, then fit the training data with svm().

set.seed(1)
A %>% resample_partition(c(train = .5, test = .5)) -> A_sample

A_train_svm <- svm(y~., data = A_sample$train, kernel = 'radial', gamma = 1, cost = 1)

plot(A_train_svm, as_tibble(A_sample$train))

summary(A_train_svm)
## 
## Call:
## svm(formula = y ~ ., data = A_sample$train, kernel = "radial", 
##     gamma = 1, cost = 1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  38
## 
##  ( 21 17 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  1 2

We can see there are a number of training errors in the SVM fit. If we increase the value of cost we can reduce the number of training errors; however, this comes at the price of a more irregular decision boundary and a risk of overfitting the data.
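
To illustrate, a much larger cost fits the training data more tightly (a sketch; the resulting plot, not shown here, has a far more irregular boundary):

# sketch: a very large cost leaves few training errors but risks overfitting
A_train_svm_hc <- svm(y ~ ., data = as_tibble(A_sample$train),
                      kernel = 'radial', gamma = 1, cost = 1e5)
plot(A_train_svm_hc, as_tibble(A_sample$train))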

We perform cross-validation using tune() to select the best cost.

set.seed(1)
# gamma could also be cross-validated by adding it to ranges,
# e.g. ranges = list(cost = ..., gamma = c(0.5, 1, 2, 3, 4));
# here only cost is tuned
A_tuned <- tune(
    svm, y ~ ., data = as_tibble(A_sample$train), kernel = 'radial',
    ranges = list(cost = c(0.1, 1, 10, 100, 1000))
)

summary(A_tuned)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     1
## 
## - best performance: 0.09111111 
## 
## - Detailed performance results:
##    cost      error dispersion
## 1 1e-01 0.26222222 0.12551130
## 2 1e+00 0.09111111 0.07403703
## 3 1e+01 0.11111111 0.08748898
## 4 1e+02 0.11111111 0.08748898
## 5 1e+03 0.13111111 0.11568369

We can now view the test set predictions against the real test values.

as_tibble(A_sample$test) %>%
    mutate(y_prime = predict(A_tuned$best.model, newdata = tibble(x1 = x1, x2 = x2))) %>%
    summarise(misclassified = (1 - mean(y == y_prime)) * nrow(.))
## # A tibble: 1 x 1
##   misclassified
##           <dbl>
## 1            12

We have misclassified 12 observations in the test set.

9.6.3 - ROC Curves

First, a refresher on ROC curves. ROC stands for receiver operating characteristic, and it’s used in evaluating and comparing predictive models.

The ROC curve plots sensitivity (the probability that a true positive is predicted to be positive) against 1 - specificity (the probability that a true negative is predicted to be positive).

A model that predicts at chance will produce a diagonal line:

library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
prediction(
    sample(c(0,1), 5000, replace = T),
    sample(c(0,1), 5000, replace = T)
) %>% 
    performance(measure = 'tpr', x.measure = 'fpr') %>% 
    plot()

The further the curve is from the diagonal line, the better the model is at discriminating between positives and negatives. Let's try it with a simple logistic regression that predicts the direction of the stock market in the Weekly data set:

Weekly %>%
    select(-Year, -Today, -Volume) %>%
    mutate(dir_prime = predict(glm(Direction ~ ., data = ., family = 'binomial'))) %>%
    {prediction(.$dir_prime, .$Direction)} %>%
    performance('tpr', 'fpr') %>%
    plot()

The ROCR package can be used to produce ROC curves. We first define a function to plot a ROC curve given a vector containing a numerical score for each prediction (pred) and a vector containing the class label for each observation (truth).

The prediction() function from ROCR transforms the input data into a standardised format. The performance() function then calculates different kinds of performance measures; in the function below we measure the true positive rate against the false positive rate.

plot_roc <- function(pred, truth) {
    prediction(pred, truth) %>%
        performance("tpr", "fpr") %>%
        plot()
}

SVMs and support vector classifiers output class labels for each observation. However, it is also possible to obtain the fitted values for each observation; these are the numerical scores used to determine the class labels.

In the case of a support vector classifier, the fitted value for an observation \(X = (X_1, X_2, \ldots, X_p)^T\) takes the form \(\hat{\beta}_0 + \hat{\beta}_1 X_1 + \ldots + \hat{\beta}_p X_p\).
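
For a linear kernel fitted without scaling, these fitted values can be reconstructed by hand from the components e1071 stores on the model object (a sketch, fitting an unscaled linear classifier on the simulated data X for simplicity; note the sign convention depends on which factor level e1071 treats as the positive class):

# sketch: recover the linear decision values f(x) = beta_hat_0 + x' beta_hat
lin_fit <- svm(y ~ ., data = X, scale = F, cost = 10, kernel = 'linear')

beta_hat  <- t(lin_fit$coefs) %*% lin_fit$SV   # coefficients (beta_1, ..., beta_p)
beta0_hat <- -lin_fit$rho                      # intercept

head(as.matrix(X[, c('x1', 'x2')]) %*% t(beta_hat) + beta0_hat)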

We can pass decision.values = T when fitting with svm(); the predict() function will then also return these fitted values.

svm_fit <- svm(
    y ~ ., 
    data = as_tibble(A_sample$train), 
    kernel = 'radial', 
    gamma = 2, 
    cost = 1, 
    decision.values = T
)


attributes( 
    predict(svm_fit, as_tibble(A_sample$train), decision.values = T)
)$decision.values -> fitted_values

plot_roc(fitted_values, as_tibble(A_sample$train)$y)
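
The curve above is computed on the training data. The same pattern applied to the held-out split gives a better sense of test-set performance (a sketch):

# sketch: ROC curve for the same fit evaluated on the test split
attributes(
    predict(svm_fit, as_tibble(A_sample$test), decision.values = T)
)$decision.values -> test_fitted_values

plot_roc(test_fitted_values, as_tibble(A_sample$test)$y)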

9.6.4 - SVM with Multiple Classes

If the response vector contains more than two classes, the svm() function will perform multi-class classification by using the ‘one-versus-one’ approach.
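
As a sketch (building on the simulated data A from above, with an assumed third class appended; the object name A3 is ours), the call is exactly the same as in the two-class case:

# sketch: append a third class and let svm() handle one-versus-one internally
set.seed(1)
A3 <- A %>%
    mutate(y = as.character(y)) %>%
    bind_rows(tibble(x1 = rnorm(50), x2 = rnorm(50) + 2, y = '0')) %>%
    mutate(y = as.factor(y))

A3_svm <- svm(y ~ ., data = A3, kernel = 'radial', gamma = 1, cost = 10)
plot(A3_svm, data = A3)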

9.6.5 - Application to Gene Expression Data

We now examine the Khan data set, which consists of a number of tissue samples corresponding to four distinct types of small round blue cell tumors.

For each tissue sample, gene expression measurements for 2,308 genes are available. The data set is already split into training and test sets.

names(Khan)
## [1] "xtrain" "xtest"  "ytrain" "ytest"
dim(Khan$xtrain)
## [1]   63 2308
dim(Khan$xtest)
## [1]   20 2308
length(Khan$ytrain)
## [1] 63
length(Khan$ytest)
## [1] 20

We will use a support vector approach to predict cancer subtype using gene expression measurements.

khan_train <- as_tibble(Khan$xtrain, .name_repair = ~make.names(c(1:ncol(Khan$xtrain)))) %>%
    mutate(y = as.factor(Khan$ytrain))

khan_svm <- svm(y ~ ., data = khan_train, kernel = 'linear', cost = 10)
summary(khan_svm)
## 
## Call:
## svm(formula = y ~ ., data = khan_train, kernel = "linear", cost = 10)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
## 
## Number of Support Vectors:  58
## 
##  ( 20 20 11 7 )
## 
## 
## Number of Classes:  4 
## 
## Levels: 
##  1 2 3 4
table(khan_svm$fitted, khan_train$y)
##    
##      1  2  3  4
##   1  8  0  0  0
##   2  0 23  0  0
##   3  0  0 12  0
##   4  0  0  0 20

We can see there are no training errors. This is because the number of features is very large relative to the number of observations, which makes it easy to find a hyperplane that fully separates the classes.

Let’s see how it works on the test data:

khan_test <- as_tibble(Khan$xtest, .name_repair = ~make.names(c(1:ncol(Khan$xtest)))) %>%
    mutate(y = Khan$ytest) %>%
    mutate(y_prime = predict(khan_svm, newdata = .))

table(khan_test$y, khan_test$y_prime)
##    
##     1 2 3 4
##   1 3 0 0 0
##   2 0 6 0 0
##   3 0 2 4 0
##   4 0 0 0 5

Using cost = 10 yields two test set errors on this data.
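
Expressed as an error rate (a quick sketch):

# 2 of the 20 test observations are misclassified, an error rate of 0.1
mean(as.character(khan_test$y) != as.character(khan_test$y_prime))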