We want to find a hyper-plane $w^\top x + b = 0$ that maximizes the margin.
We first show that the vector $w$ is orthogonal to this hyper-plane. Let $x_1$, $x_2$ be any two points on the hyper-plane, so $w^\top x_1 + b = 0$ and $w^\top x_2 + b = 0$. Subtracting gives $w^\top (x_1 - x_2) = 0$; since $x_1 - x_2$ is an arbitrary direction within the hyper-plane, $w$ is orthogonal to the hyper-plane. Now we set the two dashed lines to $w^\top x + b = 1$ and $w^\top x + b = -1$. In fact, the value “$1$” doesn’t matter: any positive constant works, because $w$ and $b$ can be rescaled. “$1$” is just the convention.
Now we pick any line parallel to $w$ (that is, orthogonal to the hyper-plane). This line intersects the two dashed lines at the points $x^{(+)}$ and $x^{(-)}$. We want to maximize the margin
$$\text{margin} = \|x^{(+)} - x^{(-)}\|$$
Since $w$ is parallel to $x^{(+)} - x^{(-)}$, we have $x^{(+)} - x^{(-)} = \lambda w$ for some $\lambda$. Then
$$x^{(+)} = \lambda w + x^{(-)}$$
Since $w^\top x^{(+)} + b = 1$, we have $w^\top (\lambda w + x^{(-)}) + b = 1$ and then $\lambda w^\top w + w^\top x^{(-)} + b = 1$. Since $w^\top x^{(-)} + b = -1$, we have $\lambda w^\top w = 2$. So
$$\lambda = \frac{2}{w^\top w}$$
Now we can rewrite the margin as
$$\text{margin} = \|(\lambda w + x^{(-)}) - x^{(-)}\| = \|\lambda w\| = \left\|\frac{2}{w^\top w} w\right\| = \frac{2}{\|w\|}$$
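The margin formula can be checked numerically. The sketch below is only an illustration: the helper name `margin_points` and the sample values of `w` and `b` are assumptions, not part of the derivation. It builds one point on each dashed line, joined by a segment parallel to $w$, and compares their distance with $2/\|w\|$.

```python
import numpy as np

def margin_points(w, b):
    """Return (x_minus, x_plus): points on w.x + b = -1 and w.x + b = +1
    that are joined by a segment parallel to w. Hypothetical helper."""
    w = np.asarray(w, dtype=float)
    ww = w @ w
    # A multiple of w that lies on the plane w.x + b = -1.
    x_minus = (-1.0 - b) / ww * w
    # lambda = 2 / (w.w), as derived above.
    lam = 2.0 / ww
    x_plus = x_minus + lam * w
    return x_minus, x_plus

w, b = np.array([3.0, 4.0]), -2.0   # arbitrary example values
x_minus, x_plus = margin_points(w, b)
margin = np.linalg.norm(x_plus - x_minus)
print(margin, 2.0 / np.linalg.norm(w))  # the two numbers should agree
```

Starting from a multiple of $w$ is just a convenient way to land on the plane $w^\top x + b = -1$; any other point on that plane would give the same distance.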
we can construct the following objective function for SVM:
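Using the margin $2/\|w\|$ derived above, one standard way to write this objective is sketched here. The index $i$ over $n$ training points $x^{(i)}$ with labels $y^{(i)} \in \{-1, +1\}$ is an assumption about notation, chosen to match the constraints that appear later in this section.

```latex
\max_{w,\,b} \; \frac{2}{\|w\|}
\quad \text{s.t.} \quad (w^\top x^{(i)} + b)\, y^{(i)} \geq 1, \qquad i = 1, \dots, n
```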
we can re-write it as
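Maximizing $2/\|w\|$ is equivalent to minimizing $\|w\|^2/2$, which gives the usual quadratic program. The factor $1/2$ is a common convention assumed here, and the notation ($n$ training points $x^{(i)}$ with labels $y^{(i)} \in \{-1, +1\}$) is likewise an assumption.

```latex
\min_{w,\,b} \; \frac{1}{2}\|w\|^2
\quad \text{s.t.} \quad (w^\top x^{(i)} + b)\, y^{(i)} \geq 1, \qquad i = 1, \dots, n
```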
Note that the above constraints are hard constraints, so this formulation only works if the data are linearly separable.
Therefore, if the data are not linearly separable, we want to make a soft constraint (relaxation). That is, we allow the model to make some errors, but we add a penalty for each of them.
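A standard soft-margin formulation consistent with this description is sketched below. The symbols $\lambda$ (penalty weight) and $\epsilon_i$ (slack variables) are the ones used in the surrounding text; the factor $1/2$ and the sum over $n$ training points are assumptions.

```latex
\min_{w,\,b,\,\epsilon} \; \frac{1}{2}\|w\|^2 + \lambda \sum_{i=1}^{n} \epsilon_i
\quad \text{s.t.} \quad (w^\top x^{(i)} + b)\, y^{(i)} \geq 1 - \epsilon_i, \quad \epsilon_i \geq 0, \qquad i = 1, \dots, n
```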
Note that if $\lambda \to \infty$, then we allow no error; if $\lambda = 0$, then we add no penalty. Here $\lambda$ denotes the penalty weight on the slack variables, not the scalar used in the margin derivation. We call $\epsilon_i$ a slack variable. Ideally, we want $\epsilon_i = 0$; if point $i$ violates the margin (or is misclassified), then $\epsilon_i > 0$.
Since $(w^\top x^{(i)} + b)y^{(i)} \geq 1 - \epsilon_i$, we have $\epsilon_i \geq 1-(w^\top x^{(i)} + b)y^{(i)}$. If $1-(w^\top x^{(i)} + b)y^{(i)} \leq 0$, we have no loss. Otherwise, we add $\epsilon_i = 1-(w^\top x^{(i)} + b)y^{(i)}$ as loss. So right now our new objective function is
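Substituting the smallest feasible slack, $\epsilon_i = \max(0,\, 1 - (w^\top x^{(i)} + b)\, y^{(i)})$, gives an unconstrained hinge-loss form. The version below is a sketch: the factor $1/2$ and the sum over $n$ points are assumptions, while $\lambda$ is the penalty weight discussed above.

```latex
\min_{w,\,b} \; \frac{1}{2}\|w\|^2
  + \lambda \sum_{i=1}^{n} \max\!\left(0,\; 1 - (w^\top x^{(i)} + b)\, y^{(i)}\right)
```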
Stochastic Gradient descent for Hinge Loss objective function:
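A minimal sketch of such an update loop is given below. It assumes a per-sample objective $\frac{1}{2}\|w\|^2 + \lambda \max(0, 1 - y^{(i)}(w^\top x^{(i)} + b))$, so the regularizer is applied at every step, which is a common simplification rather than necessarily the exact scheme intended here. The function names, learning rate `lr`, epoch count and toy data are all hypothetical.

```python
import numpy as np

def sgd_hinge(X, y, lam=1.0, lr=0.01, epochs=200, seed=0):
    """SGD on 0.5*||w||^2 + lam * max(0, 1 - y_i (w.x_i + b)), one sample per step."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):          # visit samples in random order
            if y[i] * (X[i] @ w + b) < 1:
                # Hinge term is active: subgradient is
                # (w - lam*y_i*x_i) w.r.t. w and (-lam*y_i) w.r.t. b.
                w -= lr * (w - lam * y[i] * X[i])
                b += lr * lam * y[i]
            else:
                # Hinge term is zero: only the regularizer contributes.
                w -= lr * w
    return w, b

def predict(X, w, b):
    """Label points by the sign of w.x + b."""
    return np.where(X @ w + b >= 0, 1, -1)

# Small linearly separable toy set (made up for illustration).
X = np.array([[2., 2.], [3., 1.], [1., 3.], [2., 3.],
              [-2., -2.], [-3., -1.], [-1., -3.], [-2., -3.]])
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])
w, b = sgd_hinge(X, y)
print(predict(X, w, b))
```

The hinge loss is not differentiable at the kink, so the code uses a subgradient: the data term contributes only when the point is inside the margin.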
We map all points through a non-linear function and then use SVM in the transformed space. The idea is that if the non-linear map sends the two sets of points to sets that can be separated by a hyperplane, then SVM can be applied in the transformed space instead of the original space.
Let $\phi: \mathcal{X} \to \mathcal{F}$ be the non-linear map described in the previous paragraph, where $\mathcal{X}$ is the space the input points come from and $\mathcal{F}$ is the transformed space. For SVM to work, we don’t need to know $\phi$ explicitly; we only need the dot product of the transformed points, $\langle \phi(x_i), \phi(x_j) \rangle$. So, instead of working with $\phi$, we can work with $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, where $K$ takes two points as input and returns a real value that represents $\langle \phi(x_i), \phi(x_j) \rangle$. Note that such a $\phi$ exists only when $K$ is positive semi-definite (often simply called a positive definite kernel). With this, we are able to run SVM even when $\mathcal{F}$ is infinite-dimensional.
Gram Matrix: Given a set of vectors $v_1, \dots, v_n$ in $\mathcal{V}$, the Gram matrix is the matrix of all pairwise inner products of these vectors. That is, $G_{ij} = v_i \cdot v_j$.
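As a small illustration of the definition, the sketch below computes a Gram matrix with NumPy. The function name `gram_matrix` and the three example vectors are made up for this example.

```python
import numpy as np

def gram_matrix(V):
    """V holds one vector per row; returns G with G[i, j] = v_i . v_j."""
    V = np.asarray(V, dtype=float)
    return V @ V.T

V = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])
G = gram_matrix(V)
print(G)
```

A Gram matrix built this way is always symmetric, because $v_i \cdot v_j = v_j \cdot v_i$.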
As the dimensionality increases, the classifier’s performance increases until the optimal number of features is reached. Further increasing the dimensionality without increasing the number of training samples results in a decrease in classifier performance. The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality. Also, the time complexity of a classification algorithm typically grows with the dimension of the data points, so higher dimension means more computation (not to mention the space needed to store those high-dimensional points).
Mercer’s Theorem (simplified idea): In a finite input space, if the kernel matrix $\mathbf{K}$ (also known as the Gram matrix), with entries $\mathbf{K}_{ij} = K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$, is positive semi-definite, then the function $K$ that generates its entries can be used as a kernel function.
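The positive semi-definite condition can be tested numerically on a finite sample. The sketch below is an assumption-laden illustration: the helper names, the tolerance `tol` and the two candidate functions are hypothetical. It builds the kernel matrix and checks that it is symmetric with no eigenvalue below zero (up to rounding).

```python
import numpy as np

def kernel_matrix(X, kernel):
    """K[i, j] = kernel(X[i], X[j]) for every pair of rows of X."""
    n = len(X)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

def is_psd(K, tol=1e-9):
    """True if K is symmetric and all eigenvalues are >= -tol."""
    return bool(np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def quadratic(a, b):
    return (a @ b) ** 2                # a valid kernel

def neg_dist(a, b):
    return -np.linalg.norm(a - b)      # symmetric, but not PSD on this sample

print(is_psd(kernel_matrix(X, quadratic)), is_psd(kernel_matrix(X, neg_dist)))
```

Passing the check on one sample does not prove a function is a kernel, since the condition must hold for every finite set of points. A single failing sample is enough to rule it out.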
Let $x_i = (x_{i1}, x_{i2})$, $x_j = (x_{j1}, x_{j2})$ and $\phi(x_i)=(x_{i1}^2, \sqrt{2}x_{i1}x_{i2}, x_{i2}^2)$, $\phi(x_j)=(x_{j1}^2, \sqrt{2}x_{j1}x_{j2}, x_{j2}^2)$. Then
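Expanding the dot product of the two feature vectors defined above, using only those definitions, gives:

```latex
\begin{aligned}
\phi(x_i) \cdot \phi(x_j)
  &= x_{i1}^2 x_{j1}^2 + 2\, x_{i1} x_{i2}\, x_{j1} x_{j2} + x_{i2}^2 x_{j2}^2 \\
  &= (x_{i1} x_{j1} + x_{i2} x_{j2})^2 \\
  &= (x_i \cdot x_j)^2
\end{aligned}
```

So the three-dimensional dot product equals the square of the two-dimensional one, which means $K(x_i, x_j) = (x_i \cdot x_j)^2$ can be evaluated without ever forming $\phi$.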
From this example, we see that even if the dimension of the feature space is higher than that of the input space, we can still do the computation in the low-dimensional input space as long as we choose a good kernel. Therefore, it is possible to run SVM on an infinite-dimensional feature space while doing roughly the same amount of computation as in the low-dimensional input space.
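The same point can be checked in code. In the sketch below, `phi` is the explicit map from the example and `quadratic_kernel` computes $(x_i \cdot x_j)^2$ directly in the input space; the function names and sample points are assumptions made for illustration.

```python
import numpy as np

def phi(x):
    """Explicit feature map from R^2 to R^3 used in the example."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def quadratic_kernel(x, y):
    """Same inner product, computed without leaving the input space."""
    return float(np.dot(x, y)) ** 2

x_i = np.array([1.0, 2.0])
x_j = np.array([3.0, -1.0])
explicit = phi(x_i) @ phi(x_j)          # dot product in feature space
implicit = quadratic_kernel(x_i, x_j)   # kernel evaluated in input space
print(explicit, implicit)               # the two values should agree
```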
Note:
$$k(x_i, x_j) = (x_i \cdot x_j + 1)^d$$
where $d$ is the degree of the polynomial. This type of kernel represents the similarity of vectors in a feature space over polynomials of the original variables. It is popular in natural language processing.
$$k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$$
This type of kernel is useful when there is no prior knowledge about the data; it performs well under the assumption of general smoothness of the data. It is an example of the radial basis function kernel (below). $\sigma$ is a width (bandwidth) parameter that can be tuned for each problem.
$$k(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$$
for $\gamma > 0$. Setting $\gamma = \frac{1}{2\sigma^2}$ recovers the Gaussian kernel above, so the two differ only in how the kernel width is parameterized.
$$k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|}{2\sigma^2}\right)$$
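The four formulas above translate directly into code. The sketch below is illustrative only: the function names and default parameter values are assumptions, and `exponential_kernel` is simply a label chosen here for the last formula, which uses the unsquared distance.

```python
import numpy as np

def polynomial_kernel(x, y, d=2):
    """(x . y + 1)^d"""
    return (np.dot(x, y) + 1.0) ** d

def gaussian_kernel(x, y, sigma=1.0):
    """exp(-||x - y||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def rbf_kernel(x, y, gamma=0.5):
    """exp(-gamma ||x - y||^2), gamma > 0"""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def exponential_kernel(x, y, sigma=1.0):
    """exp(-||x - y|| / (2 sigma^2)), using the unsquared distance"""
    return np.exp(-np.linalg.norm(x - y) / (2.0 * sigma ** 2))

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
print(polynomial_kernel(x, y), gaussian_kernel(x, y),
      rbf_kernel(x, y), exponential_kernel(x, y))
```

With `sigma=1.0` and `gamma=0.5`, the Gaussian and RBF functions return the same value, which reflects the relation $\gamma = 1/(2\sigma^2)$.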
Our primal problem is
Note that the primal problem of SVM is a convex problem: both the objective and the constraints are convex. We know that for any convex optimization problem with differentiable objective and constraint functions, any point that satisfies the KKT conditions is primal and dual optimal, with zero duality gap.
We can re-write the primal problem as
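One way to write the primal in feature space, replacing each input by its image under $\phi$, is sketched below. The penalty weight $\lambda$ and the slack variables $\epsilon_i$ follow the soft-margin discussion earlier; the factor $1/2$, the label symbol $y_i \in \{-1, +1\}$ and the count $n$ are assumptions.

```latex
\min_{w,\,b,\,\epsilon} \; \frac{1}{2}\|w\|^2 + \lambda \sum_{i=1}^{n} \epsilon_i
\quad \text{s.t.} \quad y_i\,(w^\top \phi(x_i) + b) \geq 1 - \epsilon_i, \quad \epsilon_i \geq 0, \qquad i = 1, \dots, n
```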
where $\phi(x_i)$ maps $x_i$ from the input space to the feature space. Now
Then we can use the fourth KKT condition (the gradient with respect to $w$, $b$, $\epsilon_i$ is $0$):
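A sketch of this step follows. It assumes the soft-margin primal with objective $\frac{1}{2}\|w\|^2 + \lambda \sum_i \epsilon_i$ and labels $y_i \in \{-1, +1\}$, and it introduces multipliers $\alpha_i \geq 0$ for the margin constraints and $\mu_i \geq 0$ for $\epsilon_i \geq 0$. These multiplier symbols are assumptions, not taken from the text.

```latex
\begin{aligned}
L(w, b, \epsilon, \alpha, \mu)
  &= \frac{1}{2}\|w\|^2 + \lambda \sum_{i=1}^{n} \epsilon_i
     - \sum_{i=1}^{n} \alpha_i \left[ y_i\,(w^\top \phi(x_i) + b) - 1 + \epsilon_i \right]
     - \sum_{i=1}^{n} \mu_i \epsilon_i \\
\frac{\partial L}{\partial w}
  &= w - \sum_{i=1}^{n} \alpha_i y_i \phi(x_i) = 0
  \;\Longrightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i \phi(x_i) \\
\frac{\partial L}{\partial b}
  &= -\sum_{i=1}^{n} \alpha_i y_i = 0 \\
\frac{\partial L}{\partial \epsilon_i}
  &= \lambda - \alpha_i - \mu_i = 0
  \;\Longrightarrow\; 0 \leq \alpha_i \leq \lambda
\end{aligned}
```

The last implication uses $\mu_i \geq 0$: since $\alpha_i = \lambda - \mu_i$, each $\alpha_i$ is capped by the penalty weight.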
Therefore we have
Our dual problem of SVM is
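Substituting the stationarity conditions back into the Lagrangian gives the standard soft-margin dual, sketched below. Here $\alpha_i$ denotes the multiplier of the $i$-th margin constraint (an assumed symbol), $y_i \in \{-1, +1\}$ the labels, and $\lambda$ the penalty weight of the primal.

```latex
\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_i \alpha_j\, y_i y_j\, \phi(x_i) \cdot \phi(x_j)
\quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad 0 \leq \alpha_i \leq \lambda
```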
Note that the maximization depends on the data only through the dot products $\phi(x_i) \cdot \phi(x_j)$. We define a function
$$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$$
All we need is the function $K$, a kernel function, which provides the dot product of two vectors in another space; we don’t need to know the transformation into that space.
We can re-write the problem as
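With $K$ in place of the dot product, the dual and the resulting classifier can be sketched as follows. The multipliers $\alpha_i$, the labels $y_i \in \{-1, +1\}$ and the bound $\lambda$ (the penalty weight of the primal) are assumed notation.

```latex
\begin{aligned}
&\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_i \alpha_j\, y_i y_j\, K(x_i, x_j)
\quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad 0 \leq \alpha_i \leq \lambda \\
&f(x) = \operatorname{sign}\!\left( \sum_{i=1}^{n} \alpha_i y_i\, K(x_i, x) + b \right)
\end{aligned}
```

Both the training problem and the prediction rule use only kernel evaluations, so $\phi$ never has to be computed.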