
SLM Chapter 7 Support Vector Machines

SVM essentially separates data with hyperplanes such that the margin between the two classes is maximized.

Types

Linear SVM in the linearly separable case: uses a hyperplane with a hard margin to classify linearly separable data.

Linear SVM: uses a hyperplane with a soft margin to separate data that is almost, but not exactly, linearly separable.

Non-linear SVM: uses a (usually non-linear) mapping to transform the data from the input space to a feature space, in which a linear SVM is then applied.

Distance Measure

For a given data point $(x_i, y_i)$ and a classifying hyperplane defined by $(w, b)$, the functional margin between the point and the plane is

\[\hat D_i = y_i(wx_i + b)\]

When $(x_i, y_i)$ is correctly classified, the (signed) geometric margin (the signed Euclidean distance to the hyperplane) is

\[\begin{aligned} D_i &= \frac{\hat D_i}{||w||} \\ & = y_i(\frac{w}{||w||}x_i + \frac{b}{||w||}) \end{aligned}\]

Note that if the point is misclassified, we have $D_i \le 0$.

Unlike the functional margin, the geometric margin does not change when $w$ and $b$ are rescaled by the same positive factor.
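This scale invariance is easy to check numerically. Below is a minimal numpy sketch; the hyperplane $(w, b)$ and the labeled point are made up for illustration.

```python
import numpy as np

# A toy hyperplane (w, b) and one labeled point (x_i, y_i).
w = np.array([2.0, -1.0])
b = 0.5
x_i = np.array([1.0, 3.0])
y_i = -1.0

functional_margin = y_i * (w @ x_i + b)                   # \hat D_i
geometric_margin = functional_margin / np.linalg.norm(w)  # D_i

# Scaling (w, b) by any c > 0 multiplies the functional margin by c
# but leaves the geometric margin unchanged.
c = 10.0
scaled_geometric = y_i * (c * w @ x_i + c * b) / np.linalg.norm(c * w)
assert np.isclose(scaled_geometric, geometric_margin)
```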

Problem Setup: Hard Margin

We want to obtain a classifier with the maximum margin between the two classes. We define the margin of a perfect classifier (i.e. one that separates the two classes with no misclassification, so that $D_i \ge 0$ for all $i$) as

\[D_{w, b} = \min_{i=1,...,N} D_i\]

Learning the SVM model can be viewed as the following optimization problem

\[\begin{aligned} & \max_{w, b} D_{w, b} \\ s.t. \ & y_i(\frac{w}{||w||}x_i + \frac{b}{||w||}) \ge D_{w,b}, \ i = 1,2,..., N \\ \end{aligned}\]

Since $D_{w,b} = \frac{\hat D_{w,b}}{||w||}$, we can rewrite the problem in terms of the functional margin of the model $\hat D_{w,b}$ and get an equivalent problem

\[\begin{aligned} & \max_{w, b} \frac{\hat D_{w, b}}{||w||} \\ s.t. \ & y_i(wx_i + b) \ge \hat D_{w,b}, \ i = 1,2,..., N \\ \end{aligned}\]

If we scale $w$ and $b$ carefully, we can make $\hat D_{w, b} = 1$ without changing our perfect classifier (think about the equivalence of the two lines $x + y + 1 = 0$ and $2x + 2y + 2 = 0$). In other words, we make the functional margin between our perfect classifier and the points closest to the decision boundary equal to 1. All other points then have a functional margin greater than or equal to 1.

Therefore, we get our final optimization problem, a convex quadratic program that we will handle via Lagrangian duality. Note the equivalence between maximizing $ \frac{1}{||w||} $ and minimizing $ \frac{1}{2}||w||^2 $.

\[\begin{aligned} & \min_{w, b} \frac{1}{2}||w||^2 \\ s.t. \ & y_i(wx_i + b) - 1\ge 0, \ i = 1,2,..., N \\ \end{aligned}\]
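As a side note, this primal problem is a small convex quadratic program and can be handed directly to a generic solver. The sketch below uses cvxpy (my choice here for illustration; any QP solver would do) on made-up toy data whose two blobs are well separated, so it is linearly separable with overwhelming probability.

```python
import cvxpy as cp
import numpy as np

# Made-up toy data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([3.0, 3.0], 1.0, (20, 2)),
               rng.normal([-3.0, -3.0], 1.0, (20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])

# Hard-margin primal: minimize (1/2)||w||^2  s.t.  y_i (w x_i + b) >= 1.
w = cp.Variable(2)
b = cp.Variable()
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()
print(w.value, b.value)
```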

We can then write it as its dual problem, introducing Lagrange multipliers $\alpha_i \ge 0$ (read more about Lagrangian duality):

\[\begin{aligned} & \max_\alpha \min_{w, b} L(w, b, \alpha) \\ = & \max_\alpha \min_{w, b} \frac{1}{2}||w||^2 - \sum_{i=1}^{N}\alpha_i(y_i(wx_i + b) - 1) \\ = & \max_\alpha \min_{w, b} \frac{1}{2}||w||^2 - \sum_{i=1}^{N}\alpha_iy_i(wx_i + b) + \sum_{i=1}^{N}\alpha_i \end{aligned}\]

We first deal with the inner minimization problem by solving the following

\[\begin{cases} \nabla_wL(w, b, \alpha) = 0 \\ \nabla_bL(w, b, \alpha) = 0 \end{cases}\]

We get

\[\begin{aligned} w^* &= \sum_{i=1}^N \alpha_i y_i x_i \\ 0 &= \sum_{i=1}^N \alpha_i y_i \end{aligned}\]

For a sample $(x_j, y_j)$ lying exactly on the margin boundary, we have $y_j(w^* x_j + b^*) - 1 = 0$. We solve for $b$ and get

\[b^* = y_j - \sum_{i=1}^{N} \alpha_i y_i x_i x_j\]

However, we don’t need to plug $b$ back into the Lagrangian: when we substitute $w^*$, the $b$ term is eliminated by the fact that $\sum_{i=1}^N \alpha_i y_i = 0$.
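A minimal numpy sketch of this recovery step; the function name and the tolerance are my own, and alpha is assumed to be a dual solution obtained elsewhere.

```python
import numpy as np

def recover_w_b(alpha, X, y, tol=1e-8):
    """Recover w* and b* from a hard-margin dual solution alpha."""
    # w* = sum_i alpha_i y_i x_i
    w = (alpha * y) @ X
    # Pick any support vector j (alpha_j > 0); it lies on the margin boundary.
    j = np.argmax(alpha > tol)
    # b* = y_j - sum_i alpha_i y_i (x_i . x_j)
    b = y[j] - np.sum(alpha * y * (X @ X[j]))
    return w, b
```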

Then we plug this solution into the outer maximization problem and get the final expression that we need to optimize:

\[\begin{aligned} \min_{\alpha} & \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j x_i x_j - \sum_{i=1}^N \alpha_i \\ s.t. & \ \sum_{i=1}^N \alpha_i y_i = 0 \\ \ & \alpha_i \ge 0, \ i = 1,2,..., N \\ \end{aligned}\]

Note that we negated the objective, so the maximization turned into a minimization.

We solve this problem, get the solution $ \alpha^* $, calculate $ w^* $ and $ b^* $, and obtain the final model

\[f(x) = \text{sign}(w^*x + b^*)\]

Note that $w^*$ and $b^*$ are determined only by the few points lying on the margin boundary (the support vectors), for which the $\alpha$ values are nonzero.

Problem Setup: Soft Margin

Previously, we required that for every $i = 1,2,..., N$, the following holds

\[y_i(w x_i + b) \ge 1\]

When the data set is not linearly separable, this is impossible to achieve. Thus, we introduce slack variables $\xi_i$ to allow some points to violate the margin (or even be misclassified). Our new requirement becomes

\[y_i(w x_i + b) \ge 1 - \xi_i\]

and, with a new term that penalizes these violations, our objective function becomes

\[\frac{1}{2}||w||^2 + C\sum_{i=1}^N \xi_i\]

where $C > 0$ is the penalty parameter.

The optimization problem is now

\[\begin{align} \min_{w, b} & \ \frac{1}{2}||w||^2 + C \sum_{i=1}^N \xi_i \\ s.t. & \ y_i(wx_i + b) \ge 1 - \xi_i \\ \ & \xi_i \ge 0, \ i = 1, 2,..., N \\ \end{align}\]
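At the optimum each slack takes its smallest feasible value, $\xi_i = \max(0, 1 - y_i(w x_i + b))$ (this follows directly from the two constraints), so for a fixed $(w, b)$ the objective can be evaluated as in the sketch below; the function name is my own.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C=1.0):
    """(1/2)||w||^2 + C * sum_i xi_i, with each xi_i at its smallest feasible value."""
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))   # slack of each sample
    return 0.5 * np.dot(w, w) + C * np.sum(xi)
```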

Applying the same methodology as the previous section, we get our final optimization problem

\[\begin{align} \min_{\alpha} & \ \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j x_i x_j - \sum_{i=1}^N \alpha_i \\ s.t. & \ \sum_{i=1}^N \alpha_i y_i = 0 \\ & \ 0 \le \alpha_i \le C, \ i = 1,2,..., N \\ \end{align}\]
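In practice this dual problem is what off-the-shelf SVM solvers handle. Below is a sketch using scikit-learn's SVC, which solves the dual with an SMO-style algorithm (libsvm); the toy data is made up, and a smaller C tolerates more margin violations.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up toy data: two overlapping Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([1.5, 1.5], 1.0, (30, 2)),
               rng.normal([-1.5, -1.5], 1.0, (30, 2))])
y = np.array([1] * 30 + [-1] * 30)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.support_)                 # indices of the support vectors
print(clf.dual_coef_)               # alpha_i * y_i for the support vectors
print(clf.coef_, clf.intercept_)    # w* and b* (available for the linear kernel)
```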

Non-linear SVMs

Note that in the dual problem and (after expanding $w^*$) in the decision function, the inputs never appear on their own; they appear only through inner products such as $x_i \cdot x_j$. We can take advantage of this by applying the kernel trick.

We map a data point $x_i$ from input space to a feature space through

\[z = \phi(x)\]

We define the kernel function K as

\[K(x, z) = \phi(x) \phi(z)\]

where $x$ and $z$ are two elements in the input space.

With the kernel function, the decision function can be written as

\[f(x) = \text{sign}(\sum_{i=1}^{N} \alpha_i^* y_i K(x, x_i) + b^*)\]

And our optimization problem becomes the following

\[\begin{aligned} \min_{\alpha} & \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^N \alpha_i \\ s.t. & \ \sum_{i=1}^N \alpha_i y_i = 0 \\ \ & 0 \le \alpha_i \le C, \ i = 1,2,..., N \\ \end{aligned}\]
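A minimal numpy sketch of the kernelized decision function above; the names are my own, and alpha and b are assumed to come from solving this dual problem.

```python
import numpy as np

def decision(x, alpha, b, X_train, y_train, kernel):
    """f(x) = sign(sum_i alpha_i y_i K(x, x_i) + b)."""
    k = np.array([kernel(x, x_i) for x_i in X_train])
    return np.sign(np.sum(alpha * y_train * k) + b)
```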

Common Kernel Functions

Polynomial Kernel Function

\[K(x, z) = (x \cdot z + 1)^p\]

Gaussian Kernel Function

\[K(x, z) = \exp(-\frac{||x - z||^2}{2 \sigma ^ 2})\]
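Minimal numpy sketches of these two kernels; the default degree and bandwidth are arbitrary choices for illustration. Either can be passed as the kernel argument of the decision sketch above.

```python
import numpy as np

def polynomial_kernel(x, z, p=3):
    """K(x, z) = (x . z + 1)^p"""
    return (np.dot(x, z) + 1.0) ** p

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))
```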

SMO Algorithm (TODO)

The Sequential Minimal Optimization (SMO) algorithm is used to solve the dual optimization problem above.
