library(broom)
library(modelr)
library(tidyverse)
library(ISLR)
library(ggdendro)
library(ggfortify)
We use the e1071 library to demonstrate support vector machines.
library(e1071)
The e1071 library contains implementations of a number of statistical learning methods. The svm() function fits a support vector classifier when the argument kernel = 'linear' is used.
We generate observations that are in two classes:
set.seed(1)
X <- tibble(
x1 = rnorm(20),
x2 = rnorm(20),
y = as.factor(c(rep(-1, 10), rep(1,10)))
)
X[1:10,1:2] = X[1:10,1:2] + 1
X %>% ggplot(aes(x2,x1)) +
geom_point(aes(colour = y))
The classes are not linearly separable.
We now fit the support vector classifier:
X %>%
svm(y~., data = ., scale = F, cost = 10, kernel = 'linear') -> X_fit
plot(X_fit, data = X)
What we see is the separating hyperplane (a line in two dimensions) splitting the classes. The decision boundary between the two classes is linear because we used the kernel = 'linear' argument.
The support vectors are plotted as crosses and the other observations as circles. We can get their indices using the index component:
X_fit$index
## [1] 1 3 4 6 8 10 11 12 18 19 20
As usual, summary()
can give us information on the classifier fit:
X_fit %>% summary()
##
## Call:
## svm(formula = y ~ ., data = ., cost = 10, kernel = "linear",
## scale = F)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 10
##
## Number of Support Vectors: 11
##
## ( 6 5 )
##
##
## Number of Classes: 2
##
## Levels:
## -1 1
Let's use a smaller cost value:
X %>%
svm(y~., data = ., scale = F, cost = .09, kernel = 'linear') %>%
plot(data = X)
The smaller cost means we obtain more support vectors because the margin is wider.
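We can check this by counting the support vectors in each fit (a quick sketch; X_small is just an illustrative name for the low-cost refit):
# Refit with the small cost and compare the number of support vectors;
# the wider margin at cost = 0.09 should admit more of them
X_small <- svm(y ~ ., data = X, scale = F, cost = .09, kernel = 'linear')
length(X_fit$index)    # cost = 10
length(X_small$index)  # cost = 0.09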
The e1071 library contains a tune() function to perform cross-validation. By default it performs ten-fold CV.
tune(
svm, y ~ ., data = X, kernel = 'linear',
ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 100))
) -> X_tune
summary(X_tune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 1
##
## - best performance: 0.25
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-03 0.45 0.4972145
## 2 1e-02 0.45 0.4972145
## 3 1e-01 0.35 0.4743416
## 4 1e+00 0.25 0.3535534
## 5 5e+00 0.30 0.3496029
## 6 1e+02 0.30 0.4216370
We see that cost = 1 results in the lowest cross-validation error rate.
The best model can be accessed using $best.model:
X_tune$best.model -> X_best
X_best
##
## Call:
## best.tune(method = svm, train.x = y ~ ., data = X, ranges = list(cost = c(0.001,
## 0.01, 0.1, 1, 5, 100)), kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 13
The predict() function can be used with this model to predict the class label of new observations. Let's generate a test data set:
set.seed(2)
X_testset <- tibble(
x1 = rnorm(20),
x2 = rnorm(20),
y = as.factor(c(rep(-1, 10), rep(1,10)))
)
X_testset[1:10,1:2] = X_testset[1:10,1:2] + 1
X_testset %>% ggplot(aes(x2,x1)) +
geom_point(aes(colour = y))
Now we can predict the labels:
X_testset %>%
mutate(class_pred = predict(X_best, X_testset)) -> X_testset
table(real_class = X_testset$y, predicted_class = X_testset$class_pred)
## predicted_class
## real_class -1 1
## -1 9 1
## 1 3 7
Four of the observations have been misclassified.
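Expressed as a test error rate (a small check using the predictions above):
# Proportion of test observations whose predicted label differs from the truth
mean(X_testset$y != X_testset$class_pred)  # 4 / 20 = 0.2 for this split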
Now we consider the case where the classes are linearly separable. We alter our data set:
X[1:10,1:2] = X[1:10,1:2] + 1.2
X %>% ggplot(aes(x2,x1)) +
geom_point(aes(colour = y))
We fit a support vector classifier with a very large cost so that no observations are misclassified:
X %>%
svm(y ~ ., data = ., kernel = 'linear', cost = 1e5) -> X_fit
plot(X_fit, data = X)
summary(X_fit)
##
## Call:
## svm(formula = y ~ ., data = ., kernel = "linear", cost = 1e+05)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1e+05
##
## Number of Support Vectors: 3
##
## ( 2 1 )
##
##
## Number of Classes: 2
##
## Levels:
## -1 1
In order to fit a support vector machine using a non-linear kernel, we change the kernel argument. We can use kernel = 'polynomial' together with degree, or kernel = 'radial' together with gamma for a radial kernel.
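For example, a polynomial fit on the toy data X from the linear examples would look like this (a sketch; the degree value is arbitrary):
# Fit a degree-3 polynomial decision boundary instead of a linear one
X %>%
  svm(y ~ ., data = ., kernel = 'polynomial', degree = 3, cost = 10) %>%
  plot(data = X)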
We generate data with a non-linear boundary:
tibble(
x1 = rnorm(200),
x2 = rnorm(200),
y = as.factor( c(rep(1, 150), rep(2, 50)) )
) -> A
A[1:100,1:2] = A[1:100,1:2] + 2
A[101:150,1:2] = A[101:150,1:2] - 2
A %>%
ggplot(aes(x2, x1, colour = y)) +
geom_point()
We split the data into training and test groups, then fit the training data with svm().
set.seed(1)
A %>% resample_partition(c(train = .5, test = .5)) -> A_sample
A_train_svm <- svm(y~., data = A_sample$train, kernel = 'radial', gamma = 1, cost = 1)
plot(A_train_svm, as_tibble(A_sample$train))
summary(A_train_svm)
##
## Call:
## svm(formula = y ~ ., data = A_sample$train, kernel = "radial",
## gamma = 1, cost = 1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 38
##
## ( 21 17 )
##
##
## Number of Classes: 2
##
## Levels:
## 1 2
We can see there are a number of training errors in the SVM fit. If we increase the value of cost, we can reduce the number of training errors; however, this comes at the cost of a more irregular decision boundary, and we risk overfitting the data.
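As an illustration (a sketch reusing the same training split; the cost value here is arbitrary), a much larger cost gives a fit with fewer training errors but a wigglier boundary:
# Same radial fit with a very large cost: fewer training errors, but the
# decision boundary becomes more irregular and is more likely to overfit
svm(y ~ ., data = A_sample$train, kernel = 'radial', gamma = 1, cost = 1e5) %>%
  plot(as_tibble(A_sample$train))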
We perform cross-validation using tune() to select the best cost.
set.seed(1)
A_tuned <- tune(svm, y ~ ., data = as_tibble(A_sample$train), kernel = 'radial',
  ranges = list(
    cost = c(0.1, 1, 10, 100, 1000)
  ),
  # gamma sits outside the ranges list, so it is passed straight through to
  # svm() rather than tuned; only cost is cross-validated here
  gamma = c(0.5, 1, 2, 3, 4)
)
summary(A_tuned)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 1
##
## - best performance: 0.09111111
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-01 0.26222222 0.12551130
## 2 1e+00 0.09111111 0.07403703
## 3 1e+01 0.11111111 0.08748898
## 4 1e+02 0.11111111 0.08748898
## 5 1e+03 0.13111111 0.11568369
We can now view the test set predictions against the real test values.
as_tibble(A_sample$test) %>%
mutate(y_prime = predict(A_tuned$best.model, newdata = tibble(x1 = x1, x2 = x2))) %>%
summarise(misclassified = (1 - mean(y == y_prime)) * nrow(.))
## # A tibble: 1 x 1
## misclassified
## <dbl>
## 1 12
We have misclassified 12 observations in the test set.
First, a refresher on ROC curves. ROC stands for receiver operating characteristic, and ROC curves are used to evaluate and compare predictive models.
The ROC curve plots sensitivity (the probability that a real positive is predicted positive) against 1 - specificity (the probability that a real negative is predicted positive).
A model that predicts at chance will produce a diagonal line:
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
prediction(
sample(c(0,1), 5000, replace = T),
sample(c(0,1), 5000, replace = T)
) %>%
performance(measure = 'tpr', x.measure = 'fpr') %>%
plot()
The further the curve is from the diagonal line, the better the model is at discriminating between positives and negatives in general. Let's try it with a simple logistic regression that predicts the direction of the stock market (the Weekly data from ISLR):
Weekly %>%
select(-Year, -Today, -Volume) %>%
mutate(dir_prime = predict(glm(Direction ~ ., data = ., family = 'binomial'))) %>%
{prediction(.$dir_prime, .$Direction)} %>%
performance('tpr', 'fpr') %>%
plot()
The ROCR package can be used to produce ROC curves. We first define a function to plot a ROC curve given a vector containing a numerical score for each prediction (pred) and a vector containing the class label for each observation (truth).
The prediction() function from ROCR transforms the input data into a standardised format, and performance() calculates different kinds of performance measures. In the function below we're measuring the true positive rate against the false positive rate.
plot_roc <- function(pred, truth) {
prediction(pred, truth) %>%
performance("tpr", "fpr") %>%
plot()
}
SVMs and support vector classifiers output class labels for each observation. However it is also possible to obtain fitted values for each observation. These are the numerical scores used to determine the class labels.
In the case of a support vector classifier, the fitted value for an observation \(X = (X_1, X_2, \ldots, X_p)^T\) takes the form \(\hat{\beta_0} + \hat{\beta_1}X_1 + \ldots + \hat{\beta_p}X_p\).
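For a linear kernel fit with scale = F, these coefficients can be recovered from the components of the fitted e1071 object (a minimal sketch on the toy data X from earlier; lin_fit, beta and beta0 are illustrative names):
# The decision value for a linear kernel is sum_i coefs_i * <SV_i, x> - rho,
# so the hyperplane coefficients follow from the stored support vectors
lin_fit <- svm(y ~ ., data = X, kernel = 'linear', cost = 10, scale = F)
beta  <- drop(t(lin_fit$coefs) %*% lin_fit$SV)   # beta_1_hat, beta_2_hat
beta0 <- -lin_fit$rho                            # beta_0_hat
# Reproduces the fitted values (up to the sign convention for the two classes)
head(beta0 + as.matrix(X[, c('x1', 'x2')]) %*% beta)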
We can pass decision.values = T when calling svm(). The predict() function then outputs the decision values as an attribute.
svm_fit <- svm(
y ~ .,
data = as_tibble(A_sample$train),
kernel = 'radial',
gamma = 2,
cost = 1,
decision.values = T
)
attributes(
predict(svm_fit, as_tibble(A_sample$train), decision.values = T)
)$decision.values -> fitted_values
plot_roc(fitted_values, as_tibble(A_sample$train)$y)
If the response vector contains more than two classes, the svm()
function will perform multi-class classification by using the ‘one-versus-one’ approach.
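As a minimal sketch (the third class and the name A3 are purely illustrative), we can tack an extra class onto the simulated data from above and refit:
# Append a third class ('0') and let svm() handle it via one-versus-one
A3 <- tibble(
  x1 = c(A$x1, rnorm(50)),
  x2 = c(A$x2, rnorm(50) + 2),
  y  = as.factor(c(as.character(A$y), rep('0', 50)))
)
svm(y ~ ., data = A3, kernel = 'radial', gamma = 1, cost = 10) %>%
  plot(data = A3)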
We now examine the Khan data set, which consists of a number of tissue samples corresponding to four distinct types of small round blue cell tumors. For each tissue sample, gene expression measurements are available. The data consists of separate training and test sets.
names(Khan)
## [1] "xtrain" "xtest" "ytrain" "ytest"
dim(Khan$xtrain)
## [1] 63 2308
dim(Khan$xtest)
## [1] 20 2308
length(Khan$ytrain)
## [1] 63
length(Khan$ytest)
## [1] 20
We will use a support vector approach to predict cancer subtype using gene expression measurements.
khan_train <- as_tibble(Khan$xtrain, .name_repair = ~make.names(c(1:ncol(Khan$xtrain)))) %>%
mutate(y = as.factor(Khan$ytrain))
khan_svm <- svm(y ~ ., data = khan_train, kernel = 'linear', cost = 10)
summary(khan_svm)
##
## Call:
## svm(formula = y ~ ., data = khan_train, kernel = "linear", cost = 10)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 10
##
## Number of Support Vectors: 58
##
## ( 20 20 11 7 )
##
##
## Number of Classes: 4
##
## Levels:
## 1 2 3 4
table(khan_svm$fitted, khan_train$y)
##
## 1 2 3 4
## 1 8 0 0 0
## 2 0 23 0 0
## 3 0 0 12 0
## 4 0 0 0 20
We can see there are no training errors. This is because there are a large number of features relative to the number of observations, which makes it easy to find a separating hyperplane.
Let’s see how it works on the test data:
khan_test <- as_tibble(Khan$xtest, .name_repair = ~make.names(c(1:ncol(Khan$xtest)))) %>%
mutate(y = Khan$ytest) %>%
mutate(y_prime = predict(khan_svm, newdata = .))
table(khan_test$y, khan_test$y_prime)
##
## 1 2 3 4
## 1 3 0 0 0
## 2 0 6 0 0
## 3 0 2 4 0
## 4 0 0 0 5
Using cost = 10
yields two test set errors on this data.
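Equivalently, as an error rate on the test set (a small check reusing the predictions above):
# Proportion of test samples misclassified: 2 / 20 = 0.1
mean(khan_test$y != khan_test$y_prime)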