The support vector machine is an approach for classification that was developed in the computer science community in the 1990s. SVMs are often considered one of the best ‘out of the box’ classifiers.
It is a generalisation of a maximal margin classifier (MMC), which although elegant and simple, cannot be applied to most data sets as it requires the classes be separated by a linear boundary.
The support vector classifier extends the MMC so it can be applied in a broader range of cases.
The support vector machine is a further extension that accommodates non-linear class boundaries. SVMs are intended for binary classification, but there are extensions for cases with more than two classes.
In \(p\)-dimensional space, a hyperplane is a flat affine subspace of dimension \(p - 1\). In two dimensions, a hyperplane is a flat one-dimensional subspace (a line); in three dimensions, it is a flat two-dimensional subspace (an ordinary plane).
In two dimensions, a hyperplane is defined by:
\[ \beta_0 + \beta_1X_1 + \beta_2X_2 = 0 \]
By “defined” we mean that any point \(X = (X_1, X_2)^T\) for which the equation holds lies on the hyperplane.
This extends naturally to \(p\)-dimensional space:
\[ \beta_0 + \beta_1X_1 + \ldots + \beta_pX_p = 0 \]
and
\[ X = (X_1, \ldots, X_p)^T \]
If \(X\) does not satisfy the equation, it lies on one side of the hyperplane (\(\beta_0 + \beta_1X_1 + \ldots + \beta_pX_p < 0\)) or the other (\(\beta_0 + \beta_1X_1 + \ldots + \beta_pX_p > 0\)). So the hyperplane divides the \(p\)-dimensional space into two halves.
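As a quick illustration with arbitrarily chosen coefficients, take the hyperplane \(1 + 2X_1 + 3X_2 = 0\) in two dimensions. The point \((1, 1)^T\) gives
\[ 1 + 2(1) + 3(1) = 6 > 0, \]
so it lies on one side of the hyperplane, while \((-1, -1)^T\) gives \(1 + 2(-1) + 3(-1) = -4 < 0\), placing it on the other side.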
Suppose we have an \(n \times p\) data matrix \(\mathbf{X}\). The \(n\) training observations fall into one of two classes, \(y_1, \ldots, y_n \in \{-1, 1\}\). We also have a test observation \(x^* = (x^*_1, \ldots, x^*_p)^T\). The goal is to develop a classifier that correctly classifies the test observation based on its features.
We have previously seen other approaches to this task; the approach here is based on the concept of a separating hyperplane.
We classify \(x^*\) based on the sign of \(f(x^*) = \beta_0 + \beta_1x^*_1 + \ldots + \beta_px^*_p\). If \(f(x^*)\) is positive, the test observation is assigned to class 1; if it is negative, it is assigned to class -1.
The magnitude of \(f(x^*)\) can also be used to have increased confidence about the class assignment.
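A minimal sketch of this decision rule, using made-up coefficients and a made-up test observation rather than anything fitted to data:

```python
import numpy as np

# Hypothetical hyperplane coefficients: beta_0 and (beta_1, ..., beta_p)
beta_0 = 0.5
beta = np.array([2.0, -1.0, 0.5])

# Hypothetical test observation x* with p = 3 features
x_star = np.array([1.0, 0.2, -0.3])

# f(x*) = beta_0 + beta_1 x*_1 + ... + beta_p x*_p
f_x_star = beta_0 + beta @ x_star

# Classify by the sign of f(x*); the magnitude |f(x*)| reflects confidence
predicted_class = 1 if f_x_star > 0 else -1
print(f_x_star, predicted_class)
```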
If the data can be perfectly separated by a hyperplane, there are in fact an infinite number of such hyperplanes, so we need a reasonable way to construct one.
The maximal margin hyperplane is the separating hyperplane that is farthest from the training observations. We compute the perpendicular distance from each training observation to a given separating hyperplane; the smallest such distance is the margin. The maximal margin hyperplane is the separating hyperplane with the largest margin.
Classification based on this hyperplane is known as the maximal margin classifier. Although often successful, it can overfit when \(p\) is large.
Consider the maximal margin hyperplane to be the midline of the widest slab of space between the two classes. At least three training observations will be equidistant from this midline, lying on the edges of the slab. These are known as support vectors: they are vectors in \(p\)-dimensional space and they “support” the maximal margin hyperplane in the sense that if they moved, the hyperplane would move as well.
In fact the hyperplane depends only on these support vectors - moving any of the other observations would not affect it, provided they do not cross the margin.
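A short sketch of this property, assuming scikit-learn is available: fitting a linear SVC with a very large C on made-up separable data approximates the maximal margin classifier, and the fitted hyperplane can be read off from the support vectors alone.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable data: two well-separated clusters in 2D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.3, size=(20, 2)),
               rng.normal([-2, -2], 0.3, size=(20, 2))])
y = np.array([1] * 20 + [-1] * 20)

# A very large C approximates the hard-margin (maximal margin) classifier
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# The fitted hyperplane is determined entirely by these support vectors;
# moving any other observation (without crossing the margin) leaves it unchanged
print(clf.support_vectors_)
print(clf.coef_, clf.intercept_)
```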
The maximal margin hyperplane is the solution to the optimisation problem:
\[ \underset{\beta_0, \ldots, \beta_p, M}{\text{maximise}}\ M \]
\[ \text{subject to } \sum_{j=1}^p\beta_j^2 = 1 \]
\[ y_i(\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \ldots + \beta_px_{ip}) \ge M \quad \forall\, i = 1, \ldots, n \]
The final constraint guarantees that each observation will be on the correct side of the hyperplane, provided \(M\) is positive. Together with the constraint \(\sum_{j=1}^p\beta_j^2 = 1\), which makes \(y_i(\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip})\) the perpendicular distance from the \(i\)th observation to the hyperplane, the two constraints ensure that each observation is on the correct side of the hyperplane and at least a distance \(M\) from it. Hence \(M\) represents the margin, and the objective maximises it.
The MMC is a natural way to perform classification if a separating hyperplane exists. In many cases it may not exist. However the concept can be extended to develop a hyperplane that almost separates using a soft margin. This is known as a support vector classifier.
Observations belonging to two classes may not be separable by a hyperplane; and even when they are, exact separation may not be desirable, since a classifier based on it can be very sensitive to individual observations.
Thus we might want a hyperplane that does not perfectly separate the two classes, in the interest of greater robustness to individual observations and better classification of most of the training observations. The support vector classifier (or soft margin classifier) does this.
The support vector classifier classifies a test observation based on the side of the hyperplane it falls on. The hyperplane is chosen to correctly classify most of the training observations, but may misclassify some of them.
It is the solution to the optimisation problem:
\[ \underset{\beta_0, \ldots, \beta_p, \epsilon_1, \ldots, \epsilon_n, M}{\text{maximise}}\ M \]
\[ \text{subject to } \sum_{j=1}^p\beta_j^2 = 1 \]
\[ y_i(\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \ldots + \beta_px_{ip}) \ge M(1 - \epsilon_i) \]
\[ \epsilon_i \ge 0, \quad \sum_{i=1}^n \epsilon_i \le C \]
where \(C\) is a non-negative tuning parameter. As with the MMC, \(M\) is the margin we seek to make as large as possible. The \(\epsilon_1, \ldots, \epsilon_n\) are slack variables that allow individual observations to be on the wrong side of the margin or the hyperplane.
The slack variable tells us where the \(i\)th observation is located. If \(\epsilon_i = 0\), the observation is on the correct side of the margin. If \(\epsilon_i > 0\), it has violated the margin and is on the wrong side of it; and if \(\epsilon_i > 1\), it is on the wrong side of the hyperplane.
\(C\) bounds the sum of the \(\epsilon_i\) and so determines the number and severity of the violations to the margin we will tolerate. It can be considered the budget for the amount the margin can be violated by the \(n\) observations.
For \(C > 0\), no more than \(C\) observations can be on the wrong side of the hyperplane, because each such observation has \(\epsilon_i > 1\) and the \(\epsilon_i\) must sum to at most \(C\). In practice, \(C\) is treated as a tuning parameter selected by cross-validation.
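A minimal sketch of choosing the tuning parameter by cross-validation with scikit-learn, on synthetic data. Note that scikit-learn's C penalises margin violations, so it plays roughly the inverse role of the budget \(C\) described above; the candidate grid here is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic two-class data for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Arbitrary grid of candidate values for the tuning parameter
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}

# 5-fold cross-validation over a linear support vector classifier
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```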
The support vector classifier has the interesting property that only observations that lie on the margin or violate it affect the hyperplane. Because its decision rule is based only on a small subset of the training observations, it is robust to the behaviour of observations far from the hyperplane.
The support vector classifier is a natural approach in the two class setting if the decision boundary is linear. However in practice we often deal with non-linear decision boundaries.
With linear regression we enlarged the feature space with functions of the predictors in order to address non-linearity. With a support vector classifier we could address the issue in a similar way, by enlarging the feature space with quadratic, cubic, or higher-order polynomial functions of the predictors; for example, using quadratic terms:
\[ X_1, X_1^2, X_2, X_2^2, \ldots, X_p, X_p^2 \]
In the enlarged feature space, the decision boundary is still linear, however in the original feature space the decision boundary is of the form \(q(x) = 0\) where \(q\) is a quadratic polynomial whose solutions are generally non-linear.
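A sketch of this idea, assuming scikit-learn: the features are explicitly expanded with quadratic terms and a linear support vector classifier is fitted in the enlarged space. The circular toy data and degree 2 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

# Toy data whose true class boundary is non-linear (a circle)
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

# Enlarge the feature space with quadratic terms, then fit a linear classifier.
# The boundary is linear in the enlarged space but quadratic in the original one.
model = make_pipeline(PolynomialFeatures(degree=2),
                      LinearSVC(C=1.0, max_iter=10000))
model.fit(X, y)

print(model.score(X, y))
```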
There are a number of other ways to enlarge the feature space, and we could end up in a situation where there are a huge number of features making it computationally infeasible.
The support vector machine allows us to enlarge the feature space used by the support vector classifier in a way that leads to efficient computations.
The support vector machine (SVM) is an extension of the support vector classifier that results in enlarging the feature space in a specific way: using kernels.
It turns out that the solution to the support vector classifier problem involves only the inner products of the observations, not the observations themselves. The inner product for \(r\)-vectors \(a\) and \(b\) is \(\langle a,b \rangle = \sum_{i=1}^r a_ib_i\).
It can be shown that:
\[ f(x) = \beta_0 + \sum_{i=1}^n \alpha_i \langle x,x_i \rangle \]
where there are \(n\) parameters \(\alpha_i\).
It turns out that \(\alpha_i\) is non-zero only for the support vectors in the solution. So if \(S\) is the collection of indices of these support points, the solution can be rewritten:
\[ f(x) = \beta_0 + \sum_{i \in S} \alpha_i\langle x, x_i \rangle \]
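This representation can be inspected in a fitted scikit-learn SVC: its `dual_coef_` attribute stores the non-zero coefficients for the support vectors (with the class labels absorbed into their sign, matching the \(\alpha_i\) above), so the decision function can be reconstructed from inner products with the support vectors alone. A sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class data
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

x_new = X[:5]  # a few observations to score

# f(x) = beta_0 + sum over support vectors of alpha_i <x, x_i>,
# built only from inner products with the support vectors
inner_products = x_new @ clf.support_vectors_.T          # <x, x_i> for i in S
f_manual = clf.intercept_ + inner_products @ clf.dual_coef_.ravel()

# Matches the library's own decision function
print(np.allclose(f_manual, clf.decision_function(x_new)))
```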
Now suppose that every time an inner product appears, we replace it with a generalisation of the inner product:
\[ K(x, x_i) \]
where \(K\) is a function referred to as a kernel. The kernel quantifies the similarity of two observations.
Taking \(K(x_i, x_{i'}) = \sum_{j=1}^p x_{ij}x_{i'j}\), known as the linear kernel, gives us back the support vector classifier.
Or something like:
\[ K(x_i, x_{i'}) = (1 + \sum_{j=1}^p x_{ij} x_{i'j} )^d \]
could be used, which is a polynomial kernel of degree \(d\). This would lead to a more flexible decision boundary. When the support vector classifier is used with a non-linear kernel, it is known as a support vector machine.
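As a brief sketch, a degree-3 polynomial-kernel SVM in scikit-learn; the toy data and kernel settings are arbitrary choices for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Toy data with a non-linear class boundary
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Polynomial kernel of degree 3: K(x, x') = (gamma * <x, x'> + coef0)^3
svm_poly = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)

print(svm_poly.score(X, y))
```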
It turns out that the concept of separating hyperplanes does not lend itself naturally to more than two classes. The two most popular proposals are the one-versus-one and one-versus-all approaches.
A one-versus-one approach with \(K > 2\) classes constructs \(\binom{K}{2}\) SVMs, each of which compares a pair of classes. We classify a test observation using all of the classifiers and tally the number of times the observation is assigned to each class. The final classification is the class to which it was assigned most often.
In the one-versus-all approach we fit \(K\) SVMs, each time comparing one of the \(K\) classes to the remaining \(K - 1\) classes.
Let \(\beta_{0k}, \beta_{1k}, \ldots, \beta_{pk}\) denote the parameters that result from fitting an SVM comparing the \(k\)th class (coded as +1) to the others (coded as -1). Let \(x^*\) denote a test observation. The observation is assigned to the class for which \(\beta_{0k} + \beta_{1k}x_1^* + \ldots + \beta_{pk}x_p^*\) is largest.
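A short sketch of both strategies using scikit-learn's wrappers on synthetic three-class data (scikit-learn's own SVC handles multiclass problems with a one-versus-one scheme internally):

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic data with K = 3 classes
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)

# One-versus-one: fits K(K-1)/2 = 3 pairwise SVMs and takes a majority vote
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

# One-versus-all: fits K = 3 SVMs, one class against the rest, and assigns
# the class whose fitted decision function value is largest
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

print(ovo.predict(X[:5]), ovr.predict(X[:5]))
```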