Matthieu R Bloch
The Lagrangian is \[\calL(\bfw,b,\boldsymbol{\xi},\boldsymbol{\lambda},\boldsymbol{\mu})\eqdef \frac{1}{2}\bfw^\intercal\bfw + \frac{C}{N}\sum_{i=1}^N\xi_i+\sum_{i=1}^N\lambda_i (1-\xi_i-y_i(\bfw^\intercal \bfx_i+b))-\sum_{i=1}^N\mu_i\xi_i\] with \(\boldsymbol{\lambda}\geq \boldsymbol{0},\boldsymbol{\mu}\geq \boldsymbol{0}\).
The Lagrange dual function is \[\calL_D(\boldsymbol{\lambda},\boldsymbol{\mu})=\min_{\bfw,b,\boldsymbol{\xi}}\calL(\bfw,b,\boldsymbol{\xi},\boldsymbol{\lambda},\boldsymbol{\mu})\]
The dual problem is \(\max_{\boldsymbol{\lambda}\geq \boldsymbol{0},\boldsymbol{\mu}\geq \boldsymbol{0}}\calL_D(\boldsymbol{\lambda},\boldsymbol{\mu})\)
Let’s simplify \(\calL_D(\boldsymbol{\lambda},\boldsymbol{\mu})\) using the KKT conditions
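Setting the gradients of \(\calL\) with respect to \(\bfw\), \(b\), and \(\boldsymbol{\xi}\) to zero (the stationarity part of the KKT conditions) yields \[\nabla_\bfw\calL=\boldsymbol{0}\Rightarrow\bfw=\sum_{i=1}^N\lambda_iy_i\bfx_i\qquad \frac{\partial\calL}{\partial b}=0\Rightarrow\sum_{i=1}^N\lambda_iy_i=0\qquad \frac{\partial\calL}{\partial \xi_i}=0\Rightarrow\lambda_i+\mu_i=\frac{C}{N}\]
Substituting these back into \(\calL\) eliminates \(\bfw\), \(b\), \(\boldsymbol{\xi}\), and \(\boldsymbol{\mu}\) (since \(\mu_i=\frac{C}{N}-\lambda_i\))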
The dual function is \[\calL_D(\boldsymbol{\lambda},\boldsymbol{\mu})=-\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N\lambda_i\lambda_j y_i y_j \bfx_i^\intercal\bfx_j + \sum_{i=1}^N\lambda_i\]
The dual optimization problem is \[\max_{\boldsymbol{\lambda}} -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N\lambda_i\lambda_j y_i y_j \bfx_i^\intercal\bfx_j + \sum_{i=1}^N\lambda_i\text{ s.t. }\begin{cases}\sum_{i=1}^N\lambda_iy_i=0\\\forall i\in\intseq{1}{N}\quad 0\leq\lambda_i\leq\frac{C}{N}\end{cases}\]
This is a quadratic program in \(\boldsymbol{\lambda}\) with box and linear equality constraints, which we can solve very efficiently for \(\boldsymbol{\lambda}^*\)
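As a minimal sketch (assuming small NumPy arrays `X` of shape `(N, d)` and labels `y` in \(\{-1,+1\}\), and using a generic solver rather than a dedicated SVM routine such as SMO), the dual can be solved numerically:

```python
import numpy as np
from scipy.optimize import minimize

def solve_svm_dual(X, y, C):
    """Sketch: solve the soft-margin SVM dual for lambda*.

    maximize  sum_i lambda_i - 1/2 sum_ij lambda_i lambda_j y_i y_j x_i^T x_j
    s.t.      sum_i lambda_i y_i = 0  and  0 <= lambda_i <= C/N
    """
    N = X.shape[0]
    Yx = y[:, None] * X          # row i is y_i * x_i
    Q = Yx @ Yx.T                # Q_ij = y_i y_j x_i^T x_j

    # minimize the negative of the dual objective
    obj = lambda lam: 0.5 * lam @ Q @ lam - lam.sum()
    grad = lambda lam: Q @ lam - np.ones(N)

    res = minimize(obj, np.zeros(N), jac=grad, method="SLSQP",
                   bounds=[(0.0, C / N)] * N,
                   constraints={"type": "eq", "fun": lambda lam: lam @ y})
    return res.x                 # lambda*
```

Dedicated solvers such as SMO exploit the structure of this QP and scale far better than a generic solver; the snippet only illustrates the problem being solved.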
Assume now that we know \((\boldsymbol{\lambda}^*,\boldsymbol{\mu}^*)\); how do we find \((\bfw^*,b^*)\)?
\[\bfw^* = \sum_{i=1}^N\lambda_i^*y_i\bfx_i\quad\text{and}\quad b^* = y_i-{\bfw^*}^\intercal\bfx_i\] for some \(i\in\intseq{1}{N}\) such that \(0<\lambda_i^*<\frac{C}{N}\)
The only data points that matter are those for which \(\lambda_i^*\neq 0\): these are the support vectors
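A minimal sketch of recovering \((\bfw^*,b^*)\) from \(\boldsymbol{\lambda}^*\), reusing the hypothetical `X`, `y`, `C` from the previous snippet:

```python
import numpy as np

def recover_primal(lam_star, X, y, C, tol=1e-8):
    """Sketch: recover (w*, b*) from the dual solution lambda*."""
    w_star = (lam_star * y) @ X          # w* = sum_i lambda_i* y_i x_i

    # b* from any on-margin support vector, i.e. 0 < lambda_i* < C/N
    idx = np.where((lam_star > tol) & (lam_star < C / len(y) - tol))[0][0]
    b_star = y[idx] - w_star @ X[idx]
    return w_star, b_star
```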
Given an inner product kernel \(k(\cdot, \cdot)\), the SVM classifier is \[h^{\mathrm{SVM}}(\bfx)\eqdef \sgn{\sum_{i\in\intseq{1}{N}}\lambda_i^*y_i k(\bfx_i,\bfx)+b^*}\] where \(\boldsymbol{\lambda}^*\) is the solution of \[\max_{\boldsymbol{\lambda}} -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N\lambda_i\lambda_j y_i y_j k(\bfx_i,\bfx_j) + \sum_{i=1}^N\lambda_i\text{ s.t. }\begin{cases}\sum_{i=1}^N\lambda_iy_i=0\\\forall i\in\intseq{1}{N}\quad 0\leq\lambda_i\leq\frac{C}{N}\end{cases}\] and \[b^*=y_i-\sum_{j=1}^N\lambda_j^* y_j k(\bfx_i,\bfx_j)\] for some \(i\in\intseq{1}{N}\) such that \(0<\lambda_i^*<\frac{C}{N}\)
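As an illustration (with a Gaussian kernel standing in for \(k(\cdot,\cdot)\); the names below are placeholders, not a reference implementation), the classifier only needs kernel evaluations against the training points:

```python
import numpy as np

def gaussian_kernel(x, z, gamma=1.0):
    """Example inner-product kernel k(x, z); any valid kernel works here."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svm_classify(x, X, y, lam_star, b_star, k=gaussian_kernel):
    """h_SVM(x) = sgn( sum_i lambda_i* y_i k(x_i, x) + b* )   (sketch)."""
    score = sum(lam_star[i] * y[i] * k(X[i], x) for i in range(len(y))) + b_star
    return np.sign(score)
```

In practice only the support vectors (\(\lambda_i^*\neq 0\)) contribute to the sum, so the remaining training points can be discarded after training.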
Other algorithms can be kernelized in the same way, as long as they only access the data through inner products \(\bfx_i^\intercal\bfx_j\)