Machine Learning Notes (5): Support Vector Machines

Introduction

Here is a not-so-short note (rolling update) for machine learning and artificial intelligence. Main concepts and some references are presented. This note is based on Wikipedia, Andrew Ng's CS229 lecture notes from Stanford, the Machine Learning course of SJTU IEEE-yangyang, and Google's Machine Learning Crash Course.

In this note, let's discuss the support vector machine (SVM). An additional reference: the CMU course slides (http://www.cs.cmu.edu/~guestrin/Class/10701-S07/Slides/).

Linear classifier

A linear classifier makes a classification decision based on the value of a linear combination of the characteristics (features), which are typically presented to the machine as a feature vector.

  • Input: $\vec{x}$
  • Output: $y = f(\vec{w} \cdot \vec{x}) = f(\sum_j w_j x_j)$ (see the sketch below)
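
As a minimal sketch (pure NumPy, with made-up weights and threshold), the decision rule above with a simple sign threshold looks like this:

```python
import numpy as np

# A linear classifier decision: y = f(w . x) with f a sign threshold.
# The weight vector and threshold below are made-up numbers for illustration.
w = np.array([2.0, -1.0, 0.5])   # weights, normally learned from labeled samples
b = 0.3                          # threshold

def predict(x):
    """Return +1 if w . x is above the threshold, otherwise -1."""
    return 1 if np.dot(w, x) - b > 0 else -1

print(predict(np.array([1.0, 0.2, 0.0])))   # -> 1
```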

$f$ is a function that maps values above a threshold to the first class and other values to the second class, or gives a probability. $\vec{w}$ is a vector of weights, which can be learned from a set of labeled samples. There are two classes of methods for determining the parameters of $\vec{w}$:

  • Generative model: a model of the joint probability distribution $P(X,Y)$
    • Linear discriminant analysis (LDA)
    • Naive Bayes classifier
    • Hidden Markov model
  • Discriminative model: a model of the conditional probability of $Y$ given an observation $x$, i.e. $P(Y \mid X=x)$
    • Logistic regression: maximum likelihood estimation of $\vec{w}$
    • Perceptron: an algorithm that attempts to fix all errors encountered in the training set
    • SVM: maximizes the margin between the decision hyperplane and the examples in the training set
    • Neural networks

Generative algorithms try to learn $P(X,Y)$, which can later be transformed into $P(Y \mid X=x)$ to classify the data. Discriminative algorithms try to learn $P(Y \mid X)$ directly from the data and then classify it, without caring how the data was generated ($P(X)$). Here is the relation:

$$P(Y \mid X) = P(X,Y) / P(X) = P(X \mid Y) P(Y) / P(X)$$

Generative algorithms are suitable for unsupervised learning. Discriminative models usually proceed in a supervised way.
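
As a quick numeric illustration of the relation above (all probabilities are made-up numbers):

```python
# Toy check of P(Y|X) = P(X|Y) P(Y) / P(X) with invented numbers.
p_y = 0.3             # prior P(Y=1)
p_x_given_y = 0.8     # likelihood P(X=x | Y=1)
p_x_given_not_y = 0.1 # likelihood P(X=x | Y=0)

p_x = p_x_given_y * p_y + p_x_given_not_y * (1 - p_y)  # marginal P(X=x)
p_y_given_x = p_x_given_y * p_y / p_x                  # posterior P(Y=1 | X=x)
print(round(p_y_given_x, 3))                           # 0.24 / 0.31 ≈ 0.774
```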

Maximum margin classifier

In the case of support vector machines, a data point is viewed as a $p$-dimensional vector (a list of $p$ numbers). The points are supposed to be separated by a $(p-1)$-dimensional hyperplane. Many such hyperplanes exist. The best choice is the one that represents the largest margin between the two classes, which is known as the maximum-margin hyperplane. This classifier is also called a maximum margin classifier.

The equation of a hyperplane can be written as:

$$\omega^T x - b = 0$$

Note that $\omega^T x = \vec{\omega}\cdot \vec{x}$.

Two classes can be determined by:

  • $f(x) = \omega^T x - b > 0$ leads to $y = 1$
  • $f(x) = \omega^T x - b < 0$ leads to $y = -1$

The problem is how to find the optimum hyperplane with a maximum margin.

Functional margin

The functional margin is defined as:

$$\hat{\gamma} = y f(x) = y \cdot (\omega^T x - b)$$

The result is positive for properly classified points and negative otherwise. Hence, a large functional margin represents a confident and correct prediction. The functional margin has a problem, however. For example, the two equations $\omega^T x - b = 0$ and $2\omega^T x - 2b = 0$ represent the same hyperplane, but the functional margins differ.

Geometrical margin

The geometrical margin is defined as:

$$\tilde{\gamma} = y\gamma = y \cdot \frac{\omega^T x - b}{|| \omega ||} = \frac{\hat{\gamma}}{|| \omega ||}$$

where $\omega$ is the normal vector of the hyperplane. Here $|| \omega ||$ is the Euclidean norm (2-norm), which gives the ordinary distance from the origin to the point $\omega$.

The margin in the maximum margin classifier refers to the geometrical margin. Samples on the margin are called the support vectors.
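
A small numeric sketch (made-up point and weights) showing that rescaling $(\omega, b)$ changes the functional margin but not the geometrical margin:

```python
import numpy as np

w, b = np.array([3.0, 4.0]), 1.0      # made-up hyperplane parameters
x, y = np.array([1.0, 1.0]), 1        # one labeled sample

functional = y * (np.dot(w, x) - b)           # scales with (w, b)
geometrical = functional / np.linalg.norm(w)  # invariant under rescaling
print(functional, geometrical)                # 6.0  1.2

# Doubling (w, b) doubles the functional margin, not the geometrical one.
print(2 * functional, 2 * functional / np.linalg.norm(2 * w))  # 12.0  1.2
```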

If the training data is linearly separable, two parallel hyperplanes can be selected to separate the two classes of data so that the distance between them (called the margin) is as large as possible. After normalization or standardization, these hyperplanes are:

  • label 1: $\omega^T x - b = 1$
  • label -1: $\omega^T x - b = -1$

The geometrical margin between these two hyperplanes is $2/|| \omega ||$. The optimization problem is to minimize $\frac{1}{2}|| \omega ||^2$ s.t. (subject to) $y_i(\vec{\omega}\cdot \vec{x_i} - b)\geq 1$ for each $i$ in the data. Because of the 's.t.', this is called a constrained optimization; it can be solved efficiently by quadratic programming (QP). The max-margin hyperplane is completely determined by those $\vec{x_i}$ which lie nearest to the hyperplane. These $\vec{x_i}$ are called support vectors. Moving other points a little does not affect the decision boundary, and to predict the labels of new points, only the support vectors are necessary. We use $\frac{1}{2}|| \omega ||^2$ rather than $|| \omega ||$ for mathematical convenience when differentiating the Lagrangian function during the optimization.
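
The constrained QP above can be written down almost verbatim with a generic convex solver. A minimal sketch, assuming cvxpy is available and using a tiny made-up separable data set:

```python
import numpy as np
import cvxpy as cp

# Hard-margin SVM: minimize 1/2 ||w||^2  s.t.  y_i (w . x_i - b) >= 1.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
constraints = [cp.multiply(y, X @ w - b) >= 1]
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints).solve()

print(w.value, b.value)   # parameters of the max-margin hyperplane
```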

Soft-margin for not linearly separable data

In practice it could be very complicated, or lead to over-fitting, to classify all samples correctly. The soft-margin SVM allows errors in classification, which is the more common setting. The criterion now is to minimize both the number and the magnitude of the errors. For example, a large margin causes many misclassified samples; a smaller margin can reduce the number of errors, but each error becomes bigger.

The optimization problem is to minimize $\frac{1}{2}|| \omega ||^2 + C \sum\limits_{i} \xi_i$ s.t. $y_i(\vec{\omega}\cdot \vec{x_i} - b)\geq 1-\xi_i$ and $\xi_i \geq 0$ for each $i$ in the data. $\xi$ is pronounced /ksi/; the $\xi_i$ are called "slack" variables and pay a linear penalty.

  • $\xi_i = 0$ means sample $i$ is classified correctly. In general, the larger $\xi_i$ is, the bigger the mistake the classification makes.
  • $0 < \xi_i \leq 1$ means sample $i$ lies inside the margin, resulting in a margin violation.
  • $\xi_i > 1$ means sample $i$ is misclassified.

$C$ is the tradeoff parameter, chosen by cross-validation. A large $C$ gives a small margin width and reduces the number of errors as much as possible. Conversely, a small $C$ allows a large margin width even though we get more errors. The product $C \xi_i$ is the penalty for misclassification. As $C \to \infty$, the soft-margin SVM recovers the hard-margin SVM.

$\xi_i$ is defined as:

$$\xi_i = loss(sgn(\vec{\omega}\cdot \vec{x_i} - b), y_i) = 1-y_i(\vec{\omega}\cdot \vec{x_i} - b)$$

With regularization and the hinge loss introduced:

$$\max(0, 1-y_i(\vec{\omega}\cdot \vec{x_i} - b))$$

the optimization problem is to minimize:

$$\frac{1}{2}|| \omega ||^2 + \lambda \sum_{i=1}^m \max(0, 1-y_i(\vec{\omega}\cdot \vec{x_i} - b))$$

It is a quadratic programming (QP) problem with a convex objective function $f(x)$.
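
A minimal sketch of minimizing this regularized hinge-loss objective by sub-gradient descent (the step size, $\lambda$, and epoch count are arbitrary choices, not from the original note):

```python
import numpy as np

def train_soft_margin(X, y, lam=1.0, lr=0.01, epochs=200):
    """Sub-gradient descent on 1/2 ||w||^2 + lam * sum_i max(0, 1 - y_i (w.x_i - b))."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        margins = y * (X @ w - b)
        active = margins < 1                   # samples with nonzero hinge loss
        grad_w = w - lam * (y[active, None] * X[active]).sum(axis=0)
        grad_b = lam * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```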

A convex function is one for which the line segment between any two points on the graph lies above or on the graph, e.g., $f(x)=x^2$. For a twice differentiable function of a single variable, if the second derivative is greater than or equal to zero on its entire domain, then the function is convex. A locally optimal point of a convex problem is globally optimal.

Lagrangian Duality

The standard-form convex optimization problem can be expressed as: minimize $f(\omega)=\frac{1}{2}|| \omega ||^2$, subject to $g_i(\omega) \leq 0, i=1,...,k$ and $h_i(\omega) = 0, i=1,...,l$. The generalized Lagrange function is introduced as:

$$\mathscr{L}(\omega, \alpha, \beta) = f(\omega) + \sum_{i=1}^k \alpha_i g_i(\omega) + \sum_{i=1}^l \beta_i h_i(\omega)$$

where $\alpha_i$ and $\beta_i$ are the Lagrangian multipliers, with $\alpha_i \geq 0$ required.

There is a lemma: if $\omega$ satisfies the primal constraints, then

$$\max_{\alpha,\beta;\ \alpha_i \geq 0} \mathscr{L}(\omega, \alpha, \beta) = f(\omega)$$

Otherwise, the maximum of $\mathscr{L}$ can be $+\infty$, depending on the values of $\alpha_i, \beta_i$.

The primal optimization problem $\min_\omega f(\omega)$ is equivalent to:

$$p^* = \min_\omega \ \max_{\alpha,\beta;\ \alpha_i \geq 0} \mathscr{L}(\omega, \alpha, \beta)$$

The dual problem is:

$$d^* = \max_{\alpha,\beta;\ \alpha_i \geq 0} \ \min_\omega \mathscr{L}(\omega, \alpha, \beta)$$

If both the optimal value of the primal problem $p^*$ and that of the dual problem $d^*$ exist, we have the theorem of weak duality:

$$d^* \leq p^*$$

Proof:

$$\min_\omega \mathscr{L}(\omega, \alpha, \beta) \leq \mathscr{L}(\omega, \alpha, \beta) \leq \max_{\alpha,\beta;\ \alpha_i \geq 0} \mathscr{L}(\omega, \alpha, \beta)$$

$$\max_{\alpha,\beta;\ \alpha_i \geq 0} \ \min_\omega \mathscr{L}(\omega, \alpha, \beta) \leq \min_\omega \ \max_{\alpha,\beta;\ \alpha_i \geq 0} \mathscr{L}(\omega, \alpha, \beta)$$

If the primal optimization problem is to be solved through the dual optimization problem, we need to make sure that $d^* = p^*$, which is called strong duality. If there exists a saddle point $(\omega^*, \alpha^*, \beta^*)$ of $\mathscr{L}$, then

$$d^* = p^* = \mathscr{L}(\omega^*, \alpha^*, \beta^*)$$

The saddle point satisfies the Karush-Kuhn-Tucker (KKT) conditions:

$$\frac{\partial}{\partial \omega} \mathscr{L}(\omega, \alpha, \beta)=0$$

$$\frac{\partial}{\partial \alpha_i} \mathscr{L}(\omega, \alpha, \beta)=0,\ i=1,...,k$$

$$\frac{\partial}{\partial \beta_i} \mathscr{L}(\omega, \alpha, \beta)=0,\ i=1,...,l$$

$$\alpha_i g_i(\omega) = 0,\ i=1,...,k$$

$$g_i(\omega) \leq 0,\ i=1,...,k$$

$$\alpha_i \geq 0,\ i=1,...,k$$

$$h_j(\omega) = 0,\ j=1,...,l$$

Note that $g_i(\omega) = 0$ can be derived whenever $\alpha_i > 0$. The theorem is: if $\omega^*, \alpha^*$ and $\beta^*$ satisfy the KKT conditions, then they are also a solution to the primal and the dual problems.

Recall the optimization problem from the section on the geometrical margin:

$$\min_{\omega,b} \frac{1}{2}|| \omega ||^2$$

s.t. $1 - y_i(\vec{\omega}\cdot \vec{x_i} - b) \leq 0, i=1,...,m$; note that there are $m$ conditions.

The Lagrangian function is:

$$\mathscr{L}(\omega,b, \alpha) = \frac{1}{2} \vec{\omega}^T \vec{\omega} - \sum_{i=1}^m \alpha_i [y_i(\vec{\omega}\cdot \vec{x_i} - b) - 1]$$

Thus, the primal optimization problem is equivalent to:

$$p^* = \min_{\omega,b} \ \max_{\alpha;\ \alpha_i \geq 0} \mathscr{L}(\omega,b, \alpha)$$

The corresponding dual problem is:

$$d^* = \max_{\alpha;\ \alpha_i \geq 0}\ \min_{\omega,b} \mathscr{L}(\omega,b, \alpha)$$

The inner minimization $\min_{\omega,b} \mathscr{L}(\omega,b, \alpha)$ can be solved by setting $\nabla_\omega \mathscr{L} = 0$ and $\nabla_b \mathscr{L} = 0$.

$$\nabla_\omega \mathscr{L} = \vec{\omega} - \sum_{i=1}^m \alpha_i y_i \vec{x_i} = 0$$

$$\nabla_b \mathscr{L} = \sum_{i=1}^m \alpha_i y_i = 0$$

Thus,

$$\vec{\omega} = \sum_{i=1}^m \alpha_i y_i \vec{x_i}$$

$$\vec{\omega}^T \vec{\omega} = \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (\vec{x_i}^T \vec{x_j})$$

$$\sum_{i=1}^m \alpha_i [y_i(\vec{\omega}\cdot \vec{x_i} - b)] = \sum_{i=1}^m \alpha_i y_i \vec{\omega}\cdot \vec{x_i} - 0 = \sum_{i=1}^m \alpha_i y_i \vec{\omega}^T \vec{x_i} = \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (\vec{x_i}^T \vec{x_j}) = \vec{\omega}^T \vec{\omega}$$

$$\mathscr{L}(\omega,b, \alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (\vec{x_i}^T \vec{x_j})$$

The training data points whose weights $\alpha_i$ are nonzero are called the support vectors. The optimal $\vec{\omega}$ is a linear combination of a small number of data points. For a new data point $\vec{z}$, we can compute:

$$\zeta = \vec{\omega} \cdot \vec{z} - b = \sum_{i=1}^m \alpha_i y_i (\vec{x_i}^T \vec{z}) - b$$

where $\vec{z}$ is classified as class 1 if $\zeta>0$ and as class 2 otherwise. This means we make decisions by comparing each new example with only the support vectors.
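
A minimal sketch of that decision rule in the dual form (assuming $\alpha$ and $b$ come from solving the dual; the helper name is illustrative):

```python
import numpy as np

def decide(alpha, X, y, b, z):
    """Classify z using zeta = sum_i alpha_i y_i (x_i . z) - b."""
    sv = alpha > 1e-8                                # only support vectors contribute
    zeta = np.sum(alpha[sv] * y[sv] * (X[sv] @ z)) - b
    return 1 if zeta > 0 else -1                     # class 1 vs. class 2
```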

The kernel method

The SVM optimization problem is rewritten as finding the maximum of:

$$\mathscr{L}(\omega,b, \alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (\vec{x_i}^T \vec{x_j})$$

s.t. $0\leq \alpha_i \leq C, i=1,...,m$ and $\sum\limits_{i=1}^m \alpha_i y_i=0$. Note that there are $m+1$ conditions. The data points only appear as inner products, never through their explicit coordinates.

Typically, in the original features, the data is not linearly separable. We can replace $\vec{x}$ by $\phi(\vec{x})$ for some nonlinear $\phi$, and the decision boundary becomes a nonlinear surface $\vec{\omega} \cdot \phi(\vec{x}) - b = 0$. For example, adding the feature $xy$ expands a two-dimensional point $(x, y)$ into a three-dimensional point $(x, y, xy)$.
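
A tiny sketch of that expansion on XOR-style data (made-up points): in the expanded space $(x, y, xy)$, the third coordinate alone separates the two classes.

```python
import numpy as np

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)  # XOR-style data
y = np.array([1, 1, -1, -1])

# phi(x1, x2) = (x1, x2, x1 * x2): not separable in 2-D, separable in 3-D.
phi = np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])
print(phi[:, 2] * y)   # all positive: the hyperplane "third coordinate = 0" separates them
```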

Define the kernel function $K$ by:

$$K(\vec{x_i},\vec{x_j}) = \phi(\vec{x_i})^T \phi(\vec{x_j})$$

The expanded SVM optimization problem is written as finding the maximum of:

$$\mathscr{L}(\omega,b, \alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j \phi(\vec{x_i})^T \phi(\vec{x_j})$$

s.t. $0\leq \alpha_i \leq C, i=1,...,m$ and $\sum\limits_{i=1}^m \alpha_i y_i=0$. For a new data point $\vec{z}$:

$$\zeta = \vec{\omega} \cdot \phi(\vec{z}) - b = \sum_{i=1}^m \alpha_i y_i \phi(\vec{x_i})^T \phi(\vec{z}) - b$$

where z\vec{z} is classified as class 1 if ζ>0\zeta>0 and as class 2 otherwise.

The kernel trick

The problem is that the feature expansion yields lots of features because the expanded space is high-dimensional; the feature space can even be infinite-dimensional. For example, a polynomial of degree $k$ on $n$ original features yields $O(n^k)$ expanded features. Thus explicit feature expansion can be inefficient and sometimes leads to overfitting.

By using the kernel function, we can operate in the high-dimensional, implicit feature space while only computing the inner products $\phi(\vec{x_i})^T \phi(\vec{x_j})$. This is often computationally cheaper than the explicit computation of $\phi$. This approach is called the kernel trick; it makes the optimization efficient when there are lots of features. Commonly used kernels include (a short sketch follows the list):

  • Linear kernel: $K(\vec{x},\vec{x}') = \vec{x}^T \vec{x}'$
  • Polynomial kernel: $K(\vec{x},\vec{x}') = (1 + \vec{x}^T \vec{x}')^p$
  • Radial basis function (RBF) kernel: $K(\vec{x},\vec{x}') = \exp(-\frac{1}{2} ||\vec{x} - \vec{x}'||^2)$
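
A minimal sketch of these three kernels written directly on feature vectors (the numbers are arbitrary):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def poly_kernel(x, z, p=2):
    return (1 + x @ z) ** p

def rbf_kernel(x, z):
    return np.exp(-0.5 * np.linalg.norm(x - z) ** 2)

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, z), poly_kernel(x, z), rbf_kernel(x, z))
```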

The kernel matrix

If we have the feature mapping $\phi$ for $x_1,x_2,...,x_m$, the kernel matrix $K$ is an $m\times m$ matrix with entries:

$$K_{i,j} = \phi(\vec{x_i})^T \phi(\vec{x_j})$$

where $K$ is a symmetric positive-semidefinite matrix (or non-negative definite: all of its eigenvalues are nonnegative).

$$K_{i,j}=K_{j,i}$$

The necessary and sufficient condition for a valid kernel is given by Mercer's theorem: $K$ is a continuous symmetric non-negative definite kernel, i.e., its eigenvalues $\lambda_i$ are nonnegative. Such a $K$ is also called a Mercer kernel.

$$y^T K y \geq 0, \quad \forall y$$
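
A quick numerical check of these properties on a random RBF kernel matrix (a sketch, not a proof):

```python
import numpy as np

X = np.random.randn(20, 3)                                    # 20 random points
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq_dists)                                   # RBF Gram matrix

print(np.allclose(K, K.T))                                    # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)                  # eigenvalues >= 0 (up to rounding)
```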

Sequential minimal optimization (SMO) algorithm

Coordinate descent / ascend

Coordinate descent is used to deal with unconstrained optimization problems. To find the minimum of a multivariable function, coordinate descent (ascent) successively minimizes (maximizes) along one coordinate direction at a time while fixing all other coordinates. It may get stuck for a non-smooth multivariable function.
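
A minimal sketch of coordinate descent on the smooth convex function $f(x, y) = x^2 + y^2 + xy$:

```python
# Minimize f(x, y) = x^2 + y^2 + x*y one coordinate at a time.
x, y = 3.0, -2.0
for _ in range(20):
    x = -y / 2.0   # argmin over x with y fixed: df/dx = 2x + y = 0
    y = -x / 2.0   # argmin over y with x fixed: df/dy = 2y + x = 0
print(x, y)        # approaches the global minimum at (0, 0)
```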

SMO

The expanded SVM optimization problem above is a constrained one.

Training a support vector machine requires the solution of a very large quadratic programming (QP) optimization problem. Sequential minimal optimization breaks this large QP problem into a series of smallest possible QP problems. These small QP problems are solved analytically, which avoids using a time-consuming numerical QP optimization as an inner loop. (John C. Platt, 1998)

Because the constraint involves the Lagrange multipliers $\alpha_i$, the smallest possible subproblem involves two such multipliers, $\alpha_1$ and $\alpha_2$. The SMO algorithm is:

  • Select a pair $\alpha_i$ and $\alpha_j$ and optimize the pair ($\alpha_i, \alpha_j$) while holding the other multipliers fixed.
  • Repeat until convergence.
  • The problem is solved when all the multipliers satisfy the KKT conditions. Heuristics are used to choose the pair of multipliers so as to accelerate the rate of convergence, since there are $\frac{n(n-1)}{2}$ choices for ($\alpha_i, \alpha_j$).

For a selected pair, the constraints reduce to:

$$\alpha_1 y_1 + \alpha_2 y_2 = \xi$$

$$0 \leq \alpha_1 \leq C,\ 0 \leq \alpha_2 \leq C$$

where $\xi$ is the negative of the sum over the remaining terms in the equality constraint, which is fixed in each iteration.
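
A simplified sketch of one SMO pair update in this spirit (following Platt's analytic update; heuristic pair selection and the re-estimation of $b$ are omitted, and the helper name is illustrative):

```python
import numpy as np

def smo_pair_update(i, j, alpha, y, K, b, C):
    """Analytically re-optimize (alpha_i, alpha_j) with all other multipliers fixed."""
    f = K @ (alpha * y) - b                  # current decision values on the training set
    E_i, E_j = f[i] - y[i], f[j] - y[j]      # prediction errors

    eta = K[i, i] + K[j, j] - 2 * K[i, j]
    if eta <= 0:
        return alpha                         # skip degenerate pairs in this sketch

    # Box [L, H] that keeps alpha_i y_i + alpha_j y_j fixed and 0 <= alpha <= C.
    if y[i] == y[j]:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    else:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])

    a_j = np.clip(alpha[j] + y[j] * (E_i - E_j) / eta, L, H)
    a_i = alpha[i] + y[i] * y[j] * (alpha[j] - a_j)

    new_alpha = alpha.copy()
    new_alpha[i], new_alpha[j] = a_i, a_j
    return new_alpha
```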

The SMO algorithm can be considered a special case of the Osuna algorithm, where the size of the optimization is two and both Lagrange multipliers are replaced at every step with new multipliers that are chosen via good heuristics.
