Support Vector Machines

Matthieu R Bloch

Optimal soft-margin hyperplane revisited

  • The optimal soft-margin hyperplane is the solution of the following optimization problem (a numerical sketch follows this list) \[\argmin_{\bfw,b,\boldsymbol{\xi}}\frac{1}{2}\norm[2]{\bfw}^2+\frac{C}{N}\sum_{i=1}^N\xi_i\quad\text{ s.t. }\quad\forall i\in\intseq{1}{N}\quad y_i(\bfw^\intercal\bfx_i+b)\geq 1-\xi_i\text{ and }\xi_i\geq 0\]
    • The problem is differentiable and convex with zero duality gap: the KKT conditions are necessary and sufficient.
    • We will kernelize the dual problem
  • The Lagrangian is \[\calL(\bfw,b,\boldsymbol{\xi},\boldsymbol{\lambda},\boldsymbol{\mu})\eqdef \frac{1}{2}\bfw^\intercal\bfw + \frac{C}{N}\sum_{i=1}^N\xi_i+\sum_{i=1}^N\lambda_i (1-\xi_i-y_i(\bfw^\intercal \bfx_i+b))-\sum_{i=1}^N\mu_i\xi_i\] with \(\boldsymbol{\lambda}\geq \boldsymbol{0},\boldsymbol{\mu}\geq \boldsymbol{0}\).

  • The Lagrange dual function is \[\calL_D(\boldsymbol{\lambda},\boldsymbol{\mu})=\min_{\bfw,b,\boldsymbol{\xi}}\calL(\bfw,b,\boldsymbol{\xi},\boldsymbol{\lambda},\boldsymbol{\mu})\]

  • The dual problem is \(\max_{\boldsymbol{\lambda}\geq \boldsymbol{0},\boldsymbol{\mu}\geq \boldsymbol{0}}\calL_D(\boldsymbol{\lambda},\boldsymbol{\mu})\)
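  • As a concrete illustration, the primal problem above can be handed to a generic convex solver; below is a minimal sketch assuming the cvxpy modeling library is available (the function name and the default value of \(C\) are illustrative choices, not part of the lecture)

``` python
import cvxpy as cp

def soft_margin_primal(X, y, C=1.0):
    """Solve min 0.5*||w||^2 + (C/N)*sum(xi)
       s.t.  y_i*(w^T x_i + b) >= 1 - xi_i,  xi_i >= 0.
       X: (N, d) array, y: (N,) array with entries in {-1, +1}."""
    N, d = X.shape
    w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(N, nonneg=True)
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + (C / N) * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value, xi.value
```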

Optimal soft-margin hyperplane: kernelization

  • Let’s simplify \(\calL_D(\boldsymbol{\lambda},\boldsymbol{\mu})\) using the KKT conditions
    • Stationarity gives \(\bfw=\sum_{i=1}^N\lambda_iy_i\bfx_i\), \(\sum_{i=1}^N\lambda_iy_i=0\), and \(\lambda_i+\mu_i=\frac{C}{N}\) for all \(i\in\intseq{1}{N}\)

  • The dual function is \[\calL_D(\boldsymbol{\lambda},\boldsymbol{\mu})=-\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N\lambda_i\lambda_j y_i y_j \bfx_i^\intercal\bfx_j + \sum_{i=1}^N\lambda_i\]

  • The dual optimization problem is \[\max_{\boldsymbol{\lambda}} -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N\lambda_i\lambda_j y_i y_j \bfx_i^\intercal\bfx_j + \sum_{i=1}^N\lambda_i\quad\text{ s.t. }\quad\begin{cases}\sum_{i=1}^N\lambda_iy_i=0\\\forall i\in\intseq{1}{N}\quad 0\leq\lambda_i\leq\frac{C}{N}\end{cases}\]

  • This is a quadratic program in \(\boldsymbol{\lambda}\), so we can solve for \(\boldsymbol{\lambda}^*\) very efficiently (see the sketch below)
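  • Below is a minimal sketch of this dual QP, again assuming cvxpy; since the quadratic term equals \(\norm[2]{G^\intercal\boldsymbol{\lambda}}^2\), where \(G\) is the matrix whose \(i\)-th row is \(y_i\bfx_i^\intercal\), it can be expressed as a sum of squares

``` python
import cvxpy as cp

def soft_margin_dual(X, y, C=1.0):
    """Maximize sum(lam) - 0.5 * lam^T Q lam with Q_ij = y_i y_j x_i^T x_j,
       subject to sum_i lam_i y_i = 0 and 0 <= lam_i <= C/N."""
    N = X.shape[0]
    G = y[:, None] * X                  # i-th row is y_i * x_i, so Q = G @ G.T
    lam = cp.Variable(N)
    objective = cp.Maximize(cp.sum(lam) - 0.5 * cp.sum_squares(G.T @ lam))
    constraints = [y @ lam == 0, lam >= 0, lam <= C / N]
    cp.Problem(objective, constraints).solve()
    return lam.value
```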

Optimal soft-margin hyperplane: primal solutions

  • Assume that we now know \((\boldsymbol{\lambda}^*,\boldsymbol{\mu}^*)\); how do we find \((\bfw^*,b^*)\)? (A numerical sketch follows this list.)

  • \[\bfw^* = \sum_{i=1}^N\lambda_i^*y_i\bfx_i\quad\text{and}\quad b^* = y_i-{\bfw^*}^\intercal\bfx_i\] for some \(i\in\intseq{1}{N}\) such that \(0<\lambda_i^*<\frac{C}{N}\)

  • The only data points that matter are those for which \(\lambda_i^*\neq 0\)

  • By complementary slackness, they are the ones for which \(y_i({{\bfw^*}^\intercal} \bfx_i+b^*)=1-\xi_i^*\)
    • These points are called support vectors
    • Points are on or inside the margin
    • In practice, the number of support vectors is often \(\ll N\)
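  • As a numerical sketch (the tolerance tol below is an illustrative choice for deciding which multipliers are strictly between the bounds), recovering \((\bfw^*,b^*)\) and the support vectors from \(\boldsymbol{\lambda}^*\) takes only a few lines

``` python
import numpy as np

def recover_primal(X, y, lam, C=1.0, tol=1e-6):
    """Recover (w*, b*) and the support-vector mask from the dual solution lam."""
    w = (lam * y) @ X                                  # w* = sum_i lam_i y_i x_i
    support = lam > tol                                # support vectors: lam_i != 0
    on_margin = np.where((lam > tol) & (lam < C / len(y) - tol))[0]
    i = on_margin[0]                                   # any i with 0 < lam_i < C/N
    b = y[i] - w @ X[i]                                # b* = y_i - w*^T x_i
    return w, b, support
```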

Support Vector Machines

  • Given an inner product kernel \(k(\cdot, \cdot)\), the SVM classifier is \[h^{\mathrm{SVM}}(\bfx)\eqdef \sgn{\sum_{i\in\intseq{1}{N}}\lambda_i^*y_i k(\bfx_i,\bfx)+b^*}\] where \(\boldsymbol{\lambda}^*\) is the solution of \[\max_{\boldsymbol{\lambda}} -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N\lambda_i\lambda_j y_i y_j k(\bfx_i,\bfx_j) + \sum_{i=1}^N\lambda_i\quad\text{ s.t. }\quad\begin{cases}\sum_{i=1}^N\lambda_iy_i=0\\\forall i\in\intseq{1}{N}\quad 0\leq\lambda_i\leq\frac{C}{N}\end{cases}\] and \[b^*=y_i-\sum_{j=1}^N\lambda_j^* y_j k(\bfx_i,\bfx_j)\] for some \(i\in\intseq{1}{N}\) such that \(0<\lambda_i^*<\frac{C}{N}\) (an end-to-end sketch with a Gaussian kernel follows below)

  • We never need to specify \(\Phi\) and \(\calH\) explicitly
  • Other algorithms can be kernelized
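  • Putting the pieces together, here is a hedged end-to-end sketch of the kernelized classifier above, using a Gaussian kernel \(k(\bfx,\bfx')=\exp(-\gamma\norm[2]{\bfx-\bfx'}^2)\) as an example; the kernel choice, function names, and the small diagonal jitter added before the Cholesky factorization are illustrative assumptions, not part of the lecture

``` python
import numpy as np
import cvxpy as cp

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def svm_fit(X, y, C=1.0, gamma=1.0, tol=1e-6):
    """Solve the kernelized dual and return (lam*, b*)."""
    N = X.shape[0]
    Q = np.outer(y, y) * rbf_kernel(X, X, gamma)       # Q_ij = y_i y_j k(x_i, x_j)
    L = np.linalg.cholesky(Q + 1e-10 * np.eye(N))      # jitter keeps Q positive definite
    lam = cp.Variable(N)
    obj = cp.Maximize(cp.sum(lam) - 0.5 * cp.sum_squares(L.T @ lam))
    cp.Problem(obj, [y @ lam == 0, lam >= 0, lam <= C / N]).solve()
    lam_star = lam.value
    i = np.where((lam_star > tol) & (lam_star < C / N - tol))[0][0]
    b = y[i] - (lam_star * y) @ rbf_kernel(X, X[i:i+1], gamma)[:, 0]
    return lam_star, b

def svm_predict(X_train, y_train, lam_star, b, X_new, gamma=1.0):
    """h_SVM(x) = sgn(sum_i lam_i* y_i k(x_i, x) + b*)."""
    K = rbf_kernel(X_train, X_new, gamma)              # K[i, m] = k(x_i, x_m)
    return np.sign((lam_star * y_train) @ K + b)
```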