Support Vector Machines

Matthieu R Bloch

Optimal soft-margin hyperplane revisited

  • The optimal soft-margin hyperplane is the solution of the following optimization problem (a numerical sketch follows this list) \[\argmin_{\bfw,b,\boldsymbol{\xi}}\frac{1}{2}\norm[2]{\bfw}^2+\frac{C}{N}\sum_{i=1}^N\xi_i\quad\text{ s.t. }\quad\forall i\in\intseq{1}{N}\quad y_i(\bfw^\intercal\bfx_i+b)\geq 1-\xi_i\text{ and }\xi_i\geq 0\]
    • The problem is differentiable and convex with zero duality gap: the KKT conditions are necessary and sufficient.
    • We will kernelize the dual problem
  • The Lagrangian is \[\calL(\bfw,b,\boldsymbol{\xi},\boldsymbol{\lambda},\boldsymbol{\mu})\eqdef \frac{1}{2}\bfw^\intercal\bfw + \frac{C}{N}\sum_{i=1}^N\xi_i+\sum_{i=1}^N\lambda_i (1-\xi_i-y_i(\bfw^\intercal \bfx_i+b))-\sum_{i=1}^N\mu_i\xi_i\] with \(\boldsymbol{\lambda}\geq \boldsymbol{0},\boldsymbol{\mu}\geq \boldsymbol{0}\).

  • The Lagrange dual function is \[\calL_D(\boldsymbol{\lambda},\boldsymbol{\mu})=\min_{\bfw,b,\boldsymbol{\xi}}\calL(\bfw,b,\boldsymbol{\xi},\boldsymbol{\lambda},\boldsymbol{\mu})\]

  • The dual problem is \(\max_{\boldsymbol{\lambda}\geq \boldsymbol{0},\boldsymbol{\mu}\geq \boldsymbol{0}}\calL_D(\boldsymbol{\lambda},\boldsymbol{\mu})\)
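  • As a concrete illustration, the primal problem above can be handed to a generic convex solver; below is a minimal sketch assuming the cvxpy modeling library is available (the function name and the default value of \(C\) are illustrative choices, not part of the lecture)

``` python
import cvxpy as cp

def soft_margin_primal(X, y, C=1.0):
    """Solve min 0.5*||w||^2 + (C/N)*sum(xi)
       s.t.  y_i*(w^T x_i + b) >= 1 - xi_i,  xi_i >= 0.
       X: (N, d) array, y: (N,) array with entries in {-1, +1}."""
    N, d = X.shape
    w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(N, nonneg=True)
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + (C / N) * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value, xi.value
```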

Optimal soft-margin hyperplane: kernelization

  • Let’s simplify \(\calL_D(\boldsymbol{\lambda},\boldsymbol{\mu})\) using the KKT conditions
    • Stationarity gives \(\bfw=\sum_{i=1}^N\lambda_iy_i\bfx_i\), \(\sum_{i=1}^N\lambda_iy_i=0\), and \(\lambda_i+\mu_i=\frac{C}{N}\) for all \(i\in\intseq{1}{N}\)

  • The dual function is \[\calL_D(\boldsymbol{\lambda},\boldsymbol{\mu})=-\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N\lambda_i\lambda_j y_i y_j \bfx_i^\intercal\bfx_j + \sum_{i=1}^N\lambda_i\]

  • The dual optimization problem is \[\max_{\boldsymbol{\lambda}} -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N\lambda_i\lambda_j y_i y_j \bfx_i^\intercal\bfx_j + \sum_{i=1}^N\lambda_i\quad\text{ s.t. }\quad\begin{cases}\sum_{i=1}^N\lambda_iy_i=0\\\forall i\in\intseq{1}{N}\quad 0\leq\lambda_i\leq\frac{C}{N}\end{cases}\]

  • This is a quadratic program in \(\boldsymbol{\lambda}\), so we can solve for \(\boldsymbol{\lambda}^*\) very efficiently (see the sketch below)
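  • Below is a minimal sketch of this dual QP, again assuming cvxpy; since the quadratic term equals \(\norm[2]{G^\intercal\boldsymbol{\lambda}}^2\), where \(G\) is the matrix whose \(i\)-th row is \(y_i\bfx_i^\intercal\), it can be expressed as a sum of squares

``` python
import cvxpy as cp

def soft_margin_dual(X, y, C=1.0):
    """Maximize sum(lam) - 0.5 * lam^T Q lam with Q_ij = y_i y_j x_i^T x_j,
       subject to sum_i lam_i y_i = 0 and 0 <= lam_i <= C/N."""
    N = X.shape[0]
    G = y[:, None] * X                  # i-th row is y_i * x_i, so Q = G @ G.T
    lam = cp.Variable(N)
    objective = cp.Maximize(cp.sum(lam) - 0.5 * cp.sum_squares(G.T @ lam))
    constraints = [y @ lam == 0, lam >= 0, lam <= C / N]
    cp.Problem(objective, constraints).solve()
    return lam.value
```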

Optimal soft-margin hyperplane: primal solutions

  • Assume that we now know \((\boldsymbol{\lambda}^*,\boldsymbol{\mu}^*)\); how do we find \((\bfw^*,b^*)\)? (A numerical sketch follows this list.)

  • \[\bfw^* = \sum_{i=1}^N\lambda_i^*y_i\bfx_i\quad\text{and}\quad b^* = y_i-{\bfw^*}^\intercal\bfx_i\] for some \(i\in\intseq{1}{N}\) such that \(0<\lambda_i^*<\frac{C}{N}\)

  • The only data points that matter are those for which \(\lambda_i^*\neq 0\)

  • By complementary slackness, they are the ones for which \(y_i({{\bfw^*}^\intercal} \bfx_i+b^*)=1-\xi_i^*\)
    • These points are called support vectors
    • Points are on or inside the margin
    • In practice, the number of support vectors is often \(\ll N\)
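  • As a numerical sketch (the tolerance tol below is an illustrative choice for deciding which multipliers are strictly between the bounds), recovering \((\bfw^*,b^*)\) and the support vectors from \(\boldsymbol{\lambda}^*\) takes only a few lines

``` python
import numpy as np

def recover_primal(X, y, lam, C=1.0, tol=1e-6):
    """Recover (w*, b*) and the support-vector mask from the dual solution lam."""
    w = (lam * y) @ X                                  # w* = sum_i lam_i y_i x_i
    support = lam > tol                                # support vectors: lam_i != 0
    on_margin = np.where((lam > tol) & (lam < C / len(y) - tol))[0]
    i = on_margin[0]                                   # any i with 0 < lam_i < C/N
    b = y[i] - w @ X[i]                                # b* = y_i - w*^T x_i
    return w, b, support
```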

Support Vector Machines

  • Given an inner product kernel \(k(\cdot, \cdot)\), the SVM classifier is \[h^{\mathrm{SVM}}(\bfx)\eqdef \sgn{\sum_{i\in\intseq{1}{N}}\lambda_i^*y_i k(\bfx_i,\bfx)+b^*}\] where \(\boldsymbol{\lambda}^*\) is the solution of \[\max_{\boldsymbol{\lambda}} -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N\lambda_i\lambda_j y_i y_j k(\bfx_i,\bfx_j) + \sum_{i=1}^N\lambda_i\quad\text{ s.t. }\quad\begin{cases}\sum_{i=1}^N\lambda_iy_i=0\\\forall i\in\intseq{1}{N}\quad 0\leq\lambda_i\leq\frac{C}{N}\end{cases}\] and \[b^*=y_i-\sum_{j=1}^N\lambda_j^* y_j k(\bfx_i,\bfx_j)\] for some \(i\in\intseq{1}{N}\) such that \(0<\lambda_i^*<\frac{C}{N}\) (an end-to-end sketch with a Gaussian kernel follows below)

  • We never need to specify \(\Phi\) and \(\calH\) explicitly
  • Other algorithms can be kernelized
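  • Putting the pieces together, here is a hedged end-to-end sketch of the kernelized classifier above, using a Gaussian kernel \(k(\bfx,\bfx')=\exp(-\gamma\norm[2]{\bfx-\bfx'}^2)\) as an example; the kernel choice, function names, and the small diagonal jitter added before the Cholesky factorization are illustrative assumptions, not part of the lecture

``` python
import numpy as np
import cvxpy as cp

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def svm_fit(X, y, C=1.0, gamma=1.0, tol=1e-6):
    """Solve the kernelized dual and return (lam*, b*)."""
    N = X.shape[0]
    Q = np.outer(y, y) * rbf_kernel(X, X, gamma)       # Q_ij = y_i y_j k(x_i, x_j)
    L = np.linalg.cholesky(Q + 1e-10 * np.eye(N))      # jitter keeps Q positive definite
    lam = cp.Variable(N)
    obj = cp.Maximize(cp.sum(lam) - 0.5 * cp.sum_squares(L.T @ lam))
    cp.Problem(obj, [y @ lam == 0, lam >= 0, lam <= C / N]).solve()
    lam_star = lam.value
    i = np.where((lam_star > tol) & (lam_star < C / N - tol))[0][0]
    b = y[i] - (lam_star * y) @ rbf_kernel(X, X[i:i+1], gamma)[:, 0]
    return lam_star, b

def svm_predict(X_train, y_train, lam_star, b, X_new, gamma=1.0):
    """h_SVM(x) = sgn(sum_i lam_i* y_i k(x_i, x) + b*)."""
    K = rbf_kernel(X_train, X_new, gamma)              # K[i, m] = k(x_i, x_m)
    return np.sign((lam_star * y_train) @ K + b)
```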