SVM essentially separates data with hyperplanes such that the margin between the two classes is maximized.
Types
Linear SVM in the linearly separable case: simply uses a hyperplane (with a hard margin) to classify linearly separable data.
Linear SVM: uses a hyperplane with a soft margin to separate data. The data in this case is almost linearly separable.
Non-linear SVM: uses a mapping (usually non-linear) to transform the data from the input space to a feature space.
Distance Measure
For a given data point $(x_i, y_i)$ and a classifying hyperplane defined by $(w, b)$, the functional margin between the point and the plane is
\[\hat D_i = y_i(wx_i + b)\]When $(x_i, y_i)$ is correctly classified, the (signed) geometric margin (signed euclidean distance) is
\[\begin{aligned} D_i &= \frac{\hat D_i}{||w||} \\ & = y_i(\frac{w}{||w||}x_i + \frac{b}{||w||}) \end{aligned}\]Note that if there is misclassification, we have $D_i \le 0$.
Unlike the functional margin, the geometric margin does not change when $w$ and $b$ are scaled up.
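To make the two margins concrete, here is a minimal sketch in plain NumPy (the hyperplane $(w, b)$ and the points are made up for illustration) that computes both quantities and shows that scaling $(w, b)$ changes the functional margin but not the geometric one:

```python
import numpy as np

# made-up hyperplane and labeled points
w = np.array([2.0, 1.0])
b = -1.0
X = np.array([[1.0, 2.0], [0.0, 0.0], [2.0, -1.0]])
y = np.array([1, -1, 1])

def functional_margin(w, b, X, y):
    return y * (X @ w + b)

def geometric_margin(w, b, X, y):
    return functional_margin(w, b, X, y) / np.linalg.norm(w)

print(functional_margin(w, b, X, y))            # changes under scaling
print(functional_margin(10 * w, 10 * b, X, y))  # 10x larger
print(geometric_margin(w, b, X, y))             # identical for both scalings
print(geometric_margin(10 * w, 10 * b, X, y))
```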
Problem Setup: Hard Margin
We want to obtain a classifier that has a maximum margin between two classes. We define the margin of a perfect classifier (i.e. one that separates two classes perfectly with no misclassification, so that $D_i \ge 0$ for all $i$) as
\[D_{w, b} = \min_{i=1,...,N} D_i\]Learning the SVM model can be viewed as the following optimization problem
\[\begin{aligned} & \max_{w, b} D_{w, b} \\ s.t. \ & y_i(\frac{w}{||w||}x_i + \frac{b}{||w||}) \ge D_{w,b}, \ i = 1,2,..., N \\ \end{aligned}\]We plug in the functional margin of the model $\hat D_{w,b}$ and get an equivalent problem
\[\begin{aligned} & \max_{w, b} \frac{\hat D_{w, b}}{||w||} \\ s.t. \ & y_i(wx_i + b) \ge \hat D_{w,b}, \ i = 1,2,..., N \\ \end{aligned}\]If we scale $w$ and $b$ carefully, we can make $\hat D_{w, b} = 1$ without changing our perfect classifier (think about the equivalence of the two lines $x + y + 1 = 0$ and $2x + 2y + 2 = 0$). In other words, we make the functional margin between our perfect classifier and the points closest to the decision boundary equal to 1. All other points then have a functional margin greater than or equal to 1.
Therefore, we get our final optimization problem. Note the equivalence between maximizing $ \frac{1}{ ||w|| } $ and minimizing $ \frac{1}{2} ||w||^2 $.
\[\begin{aligned} & \min_{w, b} \frac{1}{2}||w||^2 \\ s.t. \ & y_i(wx_i + b) - 1\ge 0, \ i = 1,2,..., N \\ \end{aligned}\]
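This is a convex quadratic program, so it can be handed directly to an off-the-shelf solver. Below is a minimal sketch, assuming the cvxpy package is available and using a small made-up separable data set:

```python
import numpy as np
import cvxpy as cp

# made-up linearly separable data, labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# minimize (1/2)||w||^2  subject to  y_i (w x_i + b) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
```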
We can then write it as its dual problem, where the $\alpha_i \ge 0$ are Lagrange multipliers (read more about Lagrangian optimization problems):
\[\begin{aligned} & \max_\alpha \min_{w, b} L(w, b, \alpha) \\ = & \max_\alpha \min_{w, b} \frac{1}{2}||w||^2 - \sum_{i=1}^{N}\alpha_i(y_i(wx_i + b) - 1) \\ = & \max_\alpha \min_{w, b} \frac{1}{2}||w||^2 - \sum_{i=1}^{N}\alpha_iy_i(wx_i + b) + \sum_{i=1}^{N}\alpha_i \end{aligned}\]We first deal with the inner minimization problem by solving the following
\[\begin{cases} \nabla_wL(w, b, \alpha) = 0 \\ \nabla_bL(w, b, \alpha) = 0 \end{cases}\]We get
\[\begin{aligned} w^* &= \sum_{i=1}^N \alpha_i y_i x_i \\ 0 &= \sum_{i=1}^N \alpha_i y_i \end{aligned}\]For a support vector $(x_j, y_j)$, i.e. a sample lying exactly on the margin boundary, we have $y_j(w^* x_j + b^*) - 1 = 0$. We solve for $b$ and get
\[b^* = y_j - \sum_{i=1}^{N} \alpha_i y_i x_i x_j\]However, we don’t need to plug $b$ back into the Lagrangian: during the calculation, the $b$ term is eliminated by the fact that $\sum_{i=1}^N \alpha_i y_i = 0$.
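Concretely, substituting $w^* = \sum_{i=1}^N \alpha_i y_i x_i$ into $L(w, b, \alpha)$ and using $\sum_{i=1}^N \alpha_i y_i = 0$ gives
\[\begin{aligned} \min_{w, b} L(w, b, \alpha) &= \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j x_i x_j - \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j x_i x_j + \sum_{i=1}^N \alpha_i \\ &= -\frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j x_i x_j + \sum_{i=1}^N \alpha_i \end{aligned}\]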
We then plug this result into the outer maximization problem and get the final expression that we need to optimize:
\[\begin{aligned} \min_{\alpha} & \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j x_i x_j - \sum_{i=1}^N \alpha_i \\ s.t. & \ \sum_{i=1}^N \alpha_i y_i = 0 \\ \ & \alpha_i \ge 0, \ i = 1,2,..., N \\ \end{aligned}\]Note that we negated the objective, which turns the maximization into a minimization.
We solve this problem, get our solution $ \alpha^* $, calculate $ w^* $ and $ b^* $, and arrive at the final model
\[f(x) = \text{sign}(w^*x + b^*)\]Note that $w^*$ and $b^*$ are determined only by the few points lying on the margin boundary (the support vectors), for which the $\alpha$ values are nonzero.
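Putting the hard-margin pieces together, here is a minimal sketch of the dual route (again assuming cvxpy and made-up data): solve for $\alpha^*$, read off the support vectors, and recover $w^*$ and $b^*$ exactly as derived above.

```python
import numpy as np
import cvxpy as cp

# made-up linearly separable data, labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

G = y[:, None] * X                      # row i holds y_i * x_i
alpha = cp.Variable(N)

# (1/2) sum_ij alpha_i alpha_j y_i y_j x_i x_j written as (1/2)||sum_i alpha_i y_i x_i||^2
objective = cp.Minimize(0.5 * cp.sum_squares(alpha @ G) - cp.sum(alpha))
constraints = [alpha >= 0, alpha @ y == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
support = np.flatnonzero(a > 1e-6)      # points with nonzero alpha
w_star = G.T @ a                        # w* = sum_i alpha_i y_i x_i
j = support[0]                          # any support vector
b_star = y[j] - X[j] @ w_star           # b* = y_j - sum_i alpha_i y_i x_i x_j

predict = lambda x: np.sign(x @ w_star + b_star)
print(support, w_star, b_star, predict(np.array([1.0, 1.0])))
```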
Problem Setup: Soft Margin
Previously, we required that for every $i = 1,2,..., N$, the following holds
\[y_i(w x_i + b) \ge 1\]When the data set is not linearly separable, this becomes impossible to achieve. Thus, we introduce slack variables $\xi_i$ to allow some misclassification. Our new requirement becomes
\[y_i(w x_i + b) \ge 1 - \xi_i\]and, with a new term that penalizes these violations, our objective function becomes
\[\frac{1}{2}||w||^2 + C\sum_{i=1}^N \xi_i\]where $C > 0$ is the penalty parameter that controls how strongly violations are punished.
The optimization problem is now
\[\begin{aligned} \min_{w, b} & \ \frac{1}{2}||w||^2 + C \sum_{i=1}^N \xi_i \\ s.t. & \ y_i(wx_i + b) \ge 1 - \xi_i \\ \ & \xi_i \ge 0, \ i = 1, 2,..., N \\ \end{aligned}\]Applying the same methodology as the previous section, we get our final optimization problem
\[\begin{aligned} \min_{\alpha} & \ \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j x_i x_j - \sum_{i=1}^N \alpha_i \\ s.t. & \ \sum_{i=1}^N \alpha_i y_i = 0 \\ & \ 0 \le \alpha_i \le C, \ i = 1,2,..., N \\ \end{aligned}\]
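In practice, this soft-margin problem is rarely solved by hand. Below is a minimal sketch using scikit-learn's SVC with a linear kernel (assuming scikit-learn is available; the small, slightly noisy data set is made up), where the C argument plays exactly the role of the penalty parameter above:

```python
import numpy as np
from sklearn.svm import SVC

# made-up data that is not linearly separable:
# the last point, labeled +1, sits among the negative points
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0], [-1.3, -1.0]])
y = np.array([1, 1, -1, -1, 1])

clf = SVC(kernel="linear", C=1.0)       # smaller C tolerates larger slacks xi_i
clf.fit(X, y)

print(clf.coef_, clf.intercept_)        # w* and b*
print(clf.support_)                     # indices of points with nonzero alpha
```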
Non-linear SVMs
Notice that, when training the model and making decisions, we never use an input $x_i$ on its own: the inputs only ever appear through inner products $x_i \cdot x_j$. We can take advantage of this by applying the kernel trick.
We map a data point $x_i$ from input space to a feature space through
\[z = \phi(x)\]We define the kernel function K as
\[K(x, z) = \phi(x) \phi(z)\]where $x$ and $z$ are two elements in the input space.
With the kernel function, the decision function can be written as
\[f(x) = \text{sign}(\sum_{i=1}^{N} \alpha_i^* y_i K(x, x_i) + b^*)\]And our optimization problem becomes the following
\[\begin{aligned} \min_{\alpha} & \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^N \alpha_i \\ s.t. & \ \sum_{i=1}^N \alpha_i y_i = 0 \\ \ & 0 \le \alpha_i \le C, \ i = 1,2,..., N \\ \end{aligned}\]
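As a quick check of the kernelized decision function, the sketch below fits scikit-learn's SVC with a Gaussian kernel on a made-up XOR-like data set (again assuming scikit-learn is available) and reproduces its prediction by summing $\alpha_i^* y_i K(x, x_i) + b^*$ by hand; SVC exposes the products $\alpha_i^* y_i$ for the support vectors as dual_coef_.

```python
import numpy as np
from sklearn.svm import SVC

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

# made-up XOR-like data: not separable by any single hyperplane in input space
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([-1, -1, 1, 1])

sigma = 1.0
clf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma ** 2), C=10.0).fit(X, y)

x_new = np.array([0.9, 0.1])
score = sum(coef * gaussian_kernel(x_new, sv, sigma)
            for coef, sv in zip(clf.dual_coef_[0], clf.support_vectors_))
print(np.sign(score + clf.intercept_[0]))   # manual sign(sum_i alpha_i* y_i K(x, x_i) + b*)
print(clf.predict([x_new])[0])              # matches SVC's own prediction
```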
Common Kernel Functions
Polynomial Kernel Function
\[K(x, z) = (x \cdot z + 1)^p\]
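To see that this really is an inner product in some feature space, here is a small sketch (plain NumPy, two-dimensional inputs, $p = 2$; the explicit feature map is written out only for this special case) that checks $\phi(x) \cdot \phi(z) = (x \cdot z + 1)^2$:

```python
import numpy as np

def phi(x):
    # explicit feature map for the p = 2 polynomial kernel in two dimensions
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def poly_kernel(x, z, p=2):
    return (x @ z + 1) ** p

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])

print(poly_kernel(x, z))        # kernel computed in the input space
print(phi(x) @ phi(z))          # same value, computed in the feature space
```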
Gaussian Kernel Function
\[K(x, z) = \exp(-\frac{||x - z||^2}{2 \sigma^2})\]
SMO Algorithm (TODO)
Sequential Minimal Optimization (SMO) is the standard algorithm for solving the dual optimization problem above; details to be added.