Introduction to Kernel methods

Matthieu R Bloch

Nonlinear features

  • LDA, logistic regression, and the PLA are all linear classifiers: the classification region boundaries are hyperplanes
    • Some datasets are not linearly separable!
  • We can create nonlinear classifiers by transforming the data through a nonlinear map \(\Phi:\bbR^d\to\bbR^p\)
  • \[\Phi:\left[\begin{array}{c}x_1\\\vdots\\ x_d\end{array}\right]\to \left[\begin{array}{c}\phi_1(\bfx)\\\vdots\\ \phi_p(\bfx)\end{array}\right]\]

  • One can then apply linear methods on the transformed feature vector \(\Phi(\bfx)\)

  • Example: ring data is not linearly separable in \(\bbR^2\) but becomes separable after a suitable nonlinear transform (see the sketch after this list)

  • Challenges: if \(p\gg N\) (many more features than training samples), this gets computationally challenging and there is a risk of overfitting!
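
  • A minimal numpy sketch of this idea (my own illustration, not from the slides): synthetic ring data, with an assumed feature map that appends the squared radius \(x_1^2+x_2^2\), becomes separable by a hyperplane in \(\bbR^3\)

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "ring" data: inner disk (label -1) vs. outer ring (label +1)
n = 200
radii = np.concatenate([rng.uniform(0, 1, n), rng.uniform(2, 3, n)])
angles = rng.uniform(0, 2 * np.pi, 2 * n)
X = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
y = np.concatenate([-np.ones(n), np.ones(n)])

# Nonlinear map Phi: R^2 -> R^3 (an assumed choice: append the squared radius)
def Phi(X):
    return np.column_stack([X, X[:, 0] ** 2 + X[:, 1] ** 2])

# In the transformed space the hyperplane x1^2 + x2^2 = 2.25 separates the classes
w, b = np.array([0.0, 0.0, 1.0]), -2.25
print(np.all(y * (Phi(X) @ w + b) > 0))  # True: linearly separable after Phi
```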

Kernel methods - An observation

  • Consider the maximum margin hyperplane with nonlinear transform \(\Phi:\bbR^d\to\bbR^p\) \[\argmin_{\bfw,b}\frac{1}{2}\norm[2]{\bfw}^2\text{ s.t. }\forall i\quad y_i(\bfw^\intercal\Phi(\bfx_i)+b)\geq 1\]

  • One can show (later) that the optimal \(\bfw\) is of the form \(\bfw = \sum_{i=1}^N \alpha_i \Phi(\bfx_i)\)

  • \[\begin{aligned}\norm[2]{\bfw}^2 = \bfw^\intercal\bfw = \sum_{i=1}^N\sum_{j=1}^N \alpha_i\alpha_j\Phi(\bfx_i)^\intercal\Phi(\bfx_j)=\sum_{i=1}^N\sum_{j=1}^N \alpha_i\alpha_j\langle\Phi(\bfx_i),\Phi(\bfx_j)\rangle\end{aligned}\]

  • \[\begin{aligned}\bfw^\intercal \Phi(\bfx_j) = \sum_{i=1}^N \alpha_i \Phi(\bfx_i)^\intercal\Phi(\bfx_j) = \sum_{i=1}^N\alpha_i \langle\Phi(\bfx_i),\Phi(\bfx_j)\rangle\end{aligned}\]

  • The only quantities we really care about are the dot products \(\langle\Phi(\bfx_i),\Phi(\bfx_j)\rangle\)
    • There are only \(N^2\) of them
    • The dimension \(p\) of \(\Phi(\bfx)\) does not appear explicitly (it is hidden in the dot product); we only ever manipulate vectors in \(\bbR^d\)
    • The nonlinear features need not be computed explicitly if \(\langle\Phi(\bfx_i),\Phi(\bfx_j)\rangle\) can be evaluated directly
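
  • A quick numerical check (my own sketch, with an assumed toy feature map and coefficients) that \(\norm[2]{\bfw}^2\) and \(\bfw^\intercal\Phi(\bfx_j)\) only require the \(N\times N\) matrix of dot products

```python
import numpy as np

# Toy explicit feature map (an assumed example): Phi(x) = [x1, x2, x1*x2]
def Phi(x):
    return np.array([x[0], x[1], x[0] * x[1]])

X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]])  # N = 3 training points
alpha = np.array([0.3, -0.2, 0.1])                   # assumed coefficients

# N x N matrix of the dot products <Phi(x_i), Phi(x_j)> -- all we ever need
G = np.array([[Phi(xi) @ Phi(xj) for xj in X] for xi in X])

# ||w||^2 and w^T Phi(x_j) computed explicitly and from G alone agree
w = sum(a * Phi(xi) for a, xi in zip(alpha, X))
print(np.isclose(w @ w, alpha @ G @ alpha))               # True
print(np.allclose([w @ Phi(xj) for xj in X], G @ alpha))  # True
```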

Kernel methods - the trick

  • Implicitly define features through the choice of a kernel

  • An inner product kernel is a mapping \(k:\bbR^d\times\bbR^d\to\bbR\) for which there exists a Hilbert space \(\calH\) and a mapping \(\Phi:\bbR^d\to\calH\) such that \[\forall \bfu,\bfv\in\bbR^d\quad k(\bfu,\bfv)=\langle\Phi(\bfu),\Phi(\bfv)\rangle_\calH\]

  • Quadratic kernel \(k(\bfu,\bfv)=(\bfu^\intercal \bfv)^2\)
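    • For instance, for \(d=2\), \((\bfu^\intercal\bfv)^2=u_1^2v_1^2+2u_1u_2v_1v_2+u_2^2v_2^2=\langle\Phi(\bfu),\Phi(\bfv)\rangle\) with \(\Phi(\bfx)\eqdef[x_1^2,\sqrt{2}\,x_1x_2,x_2^2]^\intercal\): the kernel implicitly computes quadratic features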

  • A function \(k:\bbR^d\times\bbR^d\to\bbR\) is a positive semidefinite kernel if
    • \(k\) is symmetric, i.e., \(k(\bfu,\bfv)=k(\bfv,\bfu)\)
    • for all \(\{\bfx_i\}_{i=1}^N\), the Gram matrix \(\bfK\) is positive semidefinite, i.e., \[\forall \bfa\in\bbR^N\quad\bfa^\intercal\bfK\bfa\geq 0\text{ with }\bfK=[K_{i,j}]\text{ and }K_{i,j}\eqdef k(\bfx_i,\bfx_j)\]
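
  • A quick numerical sanity check (my own sketch): the Gram matrix of the quadratic kernel on random points is symmetric with (numerically) nonnegative eigenvalues

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))                 # 50 random points in R^3

# Gram matrix of the quadratic kernel k(u, v) = (u^T v)^2
K = (X @ X.T) ** 2

print(np.allclose(K, K.T))                   # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-8)  # nonnegative up to round-off
```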

Kernel methods - Aronszajn’s theorem

  • A function \(k:\bbR^d\times\bbR^d\to\bbR\) is an inner product kernel if and only if \(k\) is a positive semidefinite kernel.

  • Examples of kernels (implemented in the sketch after this list)
    • Homogeneous polynomial kernel: \(k(\bfu,\bfv) = (\bfu^\intercal\bfv)^m\) with \(m\in\bbN^*\)
    • Inhomogeneous polynomial kernel: \(k(\bfu,\bfv) = (\bfu^\intercal\bfv+c)^m\) with \(c>0\), \(m\in\bbN^*\)
    • Radial basis function (RBF) kernel: \(k(\bfu,\bfv) = \exp\left(-\frac{\norm{\bfu-\bfv}^2}{2\sigma^2}\right)\) with \(\sigma^2>0\)
  • Question: How do we kernelize the maximum margin optimization problem and solve it efficiently?
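
  • A minimal sketch of the three kernels above as vectorized functions (names, signatures, and default parameter values are my own choices)

```python
import numpy as np

def poly_kernel(U, V, m=2):
    """Homogeneous polynomial kernel (u^T v)^m, applied to rows of U and V."""
    return (U @ V.T) ** m

def inhom_poly_kernel(U, V, c=1.0, m=2):
    """Inhomogeneous polynomial kernel (u^T v + c)^m with c > 0."""
    return (U @ V.T + c) ** m

def rbf_kernel(U, V, sigma2=1.0):
    """RBF kernel exp(-||u - v||^2 / (2 sigma^2))."""
    sq_dists = np.sum(U**2, 1)[:, None] + np.sum(V**2, 1)[None, :] - 2 * U @ V.T
    return np.exp(-sq_dists / (2 * sigma2))

# Each Gram matrix is positive semidefinite (eigenvalues >= 0 up to round-off)
X = np.random.default_rng(2).normal(size=(5, 4))
for kernel in (poly_kernel, inhom_poly_kernel, rbf_kernel):
    print(kernel.__name__, np.linalg.eigvalsh(kernel(X, X)).min() >= -1e-8)
```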

Constrained optimization

  • General form of constrained optimization problem \[\min_{\bfx\in\bbR^d} f(\bfx)\text{ such that }\begin{cases}g_i(\bfx)\leq 0\quad\forall i\in\intseq{1}{m}\\h_j(\bfx)=0\quad\forall j\in\intseq{1}{p}\end{cases}\]
    • \(f\) is called the objective function
    • \(g_i(\bfx)\) is called an inequality constraint
    • \(h_j(\bfx)\) is called an equality constraint
    • if \(\bfx\) satisfies all the constraints, we say that it is feasible.
  • A constrained optimization problem is convex if \(f\) is convex, the \(g_i\)’s are convex, and the \(h_j\)’s are affine

  • Turn the constrained optimization problem into an unconstrained one using the Lagrangian \[\calL(\bfx,\boldsymbol{\lambda},\boldsymbol{\mu})\eqdef f(\bfx)+\sum_{i=1}^m\lambda_i g_i(\bfx)+\sum_{j=1}^p\mu_jh_j(\bfx)\text{ with }\boldsymbol{\lambda}\geq \boldsymbol{0}\]
    • \(\boldsymbol{\lambda}\eqdef [\lambda_1,\cdots,\lambda_m]^\intercal\) and \(\boldsymbol{\mu}\eqdef [\mu_1,\cdots,\mu_p]^\intercal\) are dual variables or Lagrange multipliers
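
  • Example (the maximum margin problem from earlier): with \(f(\bfw,b)=\frac{1}{2}\norm[2]{\bfw}^2\) and \(g_i(\bfw,b)=1-y_i(\bfw^\intercal\Phi(\bfx_i)+b)\leq 0\), the Lagrangian is \[\calL(\bfw,b,\boldsymbol{\lambda})=\frac{1}{2}\norm[2]{\bfw}^2+\sum_{i=1}^N\lambda_i\left(1-y_i(\bfw^\intercal\Phi(\bfx_i)+b)\right)\text{ with }\boldsymbol{\lambda}\geq\boldsymbol{0}\]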

Primal and Dual problems

  • The Lagrange dual function is \[\calL_D(\boldsymbol{\lambda},\boldsymbol{\mu})=\min_{\bfx}\calL(\bfx,\boldsymbol{\lambda},\boldsymbol{\mu})\qquad \textsf{always concave}\]

  • The dual optimization problem is \[\max_{\boldsymbol{\lambda}\geq \boldsymbol{0},\boldsymbol{\mu}} \calL_D(\boldsymbol{\lambda},\boldsymbol{\mu}) = \max_{\boldsymbol{\lambda}\geq \boldsymbol{0},\boldsymbol{\mu}}\min_{\bfx}\calL(\bfx,\boldsymbol{\lambda},\boldsymbol{\mu})\]

  • The primal function is \[\calL_P(\bfx)\eqdef \max_{\boldsymbol{\lambda}\geq\boldsymbol{0},\boldsymbol{\mu}}\calL(\bfx,\boldsymbol{\lambda},\boldsymbol{\mu})\]

  • The primal optimization problem is \[\min_{\bfx}\calL_P(\bfx) = \min_{\bfx}\max_{\boldsymbol{\lambda}\geq\boldsymbol{0},\boldsymbol{\mu}}\calL(\bfx,\boldsymbol{\lambda},\boldsymbol{\mu})\]
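
  • A small worked example (not from the slides): for \(\min_{x\in\bbR} x^2\) s.t. \(1-x\leq 0\), the Lagrangian is \(\calL(x,\lambda)=x^2+\lambda(1-x)\); minimizing over \(x\) gives \(x=\lambda/2\), so \(\calL_D(\lambda)=\lambda-\frac{\lambda^2}{4}\), which is maximized over \(\lambda\geq 0\) at \(\lambda=2\) with value \(1\), matching the primal optimum \(x^\star=1\)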