Matthieu R. Bloch
\[\Phi:\left[\begin{array}{c}x_1\\\vdots\\ x_d\end{array}\right]\to \left[\begin{array}{c}\phi_1(\bfx)\\\vdots\\ \phi_p(\bfx)\end{array}\right]\]
One can then apply linear methods to the transformed feature vector \(\Phi(\bfx)\)
Figure: ring data
Challenges: if \(p\gg n\), the approach becomes computationally expensive and there is a risk of overfitting!
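As a concrete illustration, here is a minimal numpy sketch (my own, not from the notes; the data generator and the plain least-squares classifier are illustrative choices) showing that ring data becomes linearly separable after the explicit quadratic map \(\Phi(\bfx)=(x_1^2,\sqrt{2}\,x_1x_2,x_2^2)\):

```python
# A minimal sketch of the explicit-feature-map idea: ring data is not
# linearly separable in R^2, but after the quadratic map
# Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2) a linear classifier separates it.
import numpy as np

rng = np.random.default_rng(0)

def make_rings(n):
    """Inner disk (label -1) and outer ring (label +1) in R^2."""
    theta = rng.uniform(0, 2 * np.pi, n)
    r = np.where(rng.uniform(size=n) < 0.5, 1.0, 3.0)
    y = np.where(r < 2.0, -1.0, 1.0)
    X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
    return X + 0.2 * rng.standard_normal(X.shape), y

def quad_features(X):
    """Explicit feature map Phi: R^2 -> R^3 for the quadratic kernel."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1**2, np.sqrt(2) * x1 * x2, x2**2])

X, y = make_rings(200)
Z = quad_features(X)                       # transformed features
A = np.column_stack([Z, np.ones(len(Z))])  # append bias column
w, *_ = np.linalg.lstsq(A, y, rcond=None)  # linear least-squares classifier
acc = np.mean(np.sign(A @ w) == y)
print(f"training accuracy in feature space: {acc:.2f}")
```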
Consider the maximum margin hyperplane with a nonlinear transform \(\Phi:\bbR^d\to\bbR^p\) \[\argmin_{\bfw,b}\frac{1}{2}\norm[2]{\bfw}^2\text{ s.t. }\forall i\quad y_i(\bfw^\intercal\Phi(\bfx_i)+b)\geq 1\]
One can show (later) that the optimal \(\bfw\) is of the form \(\bfw = \sum_{i=1}^N \alpha_i \Phi(\bfx_i)\)
\[\begin{aligned}\norm[2]{\bfw}^2 = \bfw^\intercal\bfw = \sum_{i=1}^N\sum_{j=1}^N \alpha_i\alpha_j\Phi(\bfx_i)^\intercal\Phi(\bfx_j)=\sum_{i=1}^N\sum_{j=1}^N \alpha_i\alpha_j\langle\Phi(\bfx_i),\Phi(\bfx_j)\rangle\end{aligned}\]
\[\begin{aligned}\bfw^\intercal \Phi(\bfx_j) = \sum_{i=1}^N \alpha_i \Phi(\bfx_i)^\intercal\Phi(\bfx_j) = \sum_{i=1}^N\alpha_i \langle\Phi(\bfx_i),\Phi(\bfx_j)\rangle\end{aligned}\]
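The computational payoff is that both \(\norm[2]{\bfw}^2\) and the decision function depend on the data only through inner products. A minimal sketch of this (my illustration, using the quadratic kernel \((\bfu^\intercal\bfv)^2\) as an example; the coefficients \(\alpha\) and offset \(b\) below are placeholders, since in practice they come from solving the dual):

```python
# Once w = sum_i alpha_i Phi(x_i), everything needs only inner products,
# so a kernel k replaces the explicit feature map Phi.
import numpy as np

def k_quad(U, V):
    """Quadratic kernel k(u, v) = (u^T v)^2, evaluated for all pairs."""
    return (U @ V.T) ** 2

def decision(X_train, alpha, b, X_test):
    """f(x) = sum_i alpha_i k(x_i, x) + b, with no explicit feature map."""
    return k_quad(X_test, X_train) @ alpha + b

rng = np.random.default_rng(1)
X_train = rng.standard_normal((5, 2))
alpha = rng.standard_normal(5)   # placeholder dual coefficients
b = 0.0                          # placeholder offset

# ||w||^2 = alpha^T K alpha is also computable from the Gram matrix alone
K = k_quad(X_train, X_train)
print("||w||^2 =", alpha @ K @ alpha)
print("f(x) =", decision(X_train, alpha, b, rng.standard_normal((1, 2))))
```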
Implicitly define features through the choice of a kernel
An inner product kernel is a mapping \(k:\bbR^d\times\bbR^d\to\bbR\) for which there exists a Hilbert space \(\calH\) and a mapping \(\Phi:\bbR^d\to\calH\) such that \[\forall \bfu,\bfv\in\bbR^d\quad k(\bfu,\bfv)=\langle\Phi(\bfu),\Phi(\bfv)\rangle_\calH\]
Quadratic kernel \(k(\bfu,\bfv)=(\bfu^\intercal \bfv)^2\)
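For instance, for \(d=2\) the feature map can be exhibited explicitly, which verifies the definition above: \[(\bfu^\intercal\bfv)^2=(u_1v_1+u_2v_2)^2=u_1^2v_1^2+2u_1u_2v_1v_2+u_2^2v_2^2=\langle\Phi(\bfu),\Phi(\bfv)\rangle\quad\text{with}\quad\Phi(\bfx)=\left[\begin{array}{c}x_1^2\\\sqrt{2}\,x_1x_2\\x_2^2\end{array}\right]\]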
A function \(k:\bbR^d\times\bbR^d\to\bbR\) is an inner product kernel if and only if \(k\) is a positive semidefinite kernel, i.e., \(k\) is symmetric and for any \(N\) points \(\bfx_1,\dots,\bfx_N\) the Gram matrix \([k(\bfx_i,\bfx_j)]_{i,j}\) is positive semidefinite.
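A quick numerical sanity check (my illustration, not part of the notes): the Gram matrix of the quadratic kernel on arbitrary points should have no negative eigenvalues, consistent with the characterization above.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 3))    # 50 arbitrary points in R^3
K = (X @ X.T) ** 2                  # Gram matrix of k(u, v) = (u^T v)^2
eigvals = np.linalg.eigvalsh(K)     # K is symmetric, so eigvalsh applies
print("smallest eigenvalue:", eigvals.min())  # >= 0 up to round-off
```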
A constrained optimization problem \(\min_{\bfx}f(\bfx)\) subject to \(g_i(\bfx)\leq 0\) and \(h_j(\bfx)=0\) is convex if \(f\) is convex, the \(g_i\)’s are convex, and the \(h_j\)’s are affine
The Lagrangian is \(\calL(\bfx,\boldsymbol{\lambda},\boldsymbol{\mu})\eqdef f(\bfx)+\sum_i\lambda_i g_i(\bfx)+\sum_j\mu_j h_j(\bfx)\) and the Lagrange dual function is \[\calL_D(\boldsymbol{\lambda},\boldsymbol{\mu})=\min_{\bfx}\calL(\bfx,\boldsymbol{\lambda},\boldsymbol{\mu})\qquad \textsf{always concave}\]
The dual optimization problem is \[\max_{\boldsymbol{\lambda}\geq \boldsymbol{0},\boldsymbol{\mu}} \calL_D(\boldsymbol{\lambda},\boldsymbol{\mu}) = \max_{\boldsymbol{\lambda}\geq \boldsymbol{0},\boldsymbol{\mu}}\min_{\bfx}\calL(\bfx,\boldsymbol{\lambda},\boldsymbol{\mu})\]
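As a toy illustration (my example, not from the notes), take \(f(x)=x^2\) with the single constraint \(g(x)=1-x\leq 0\): \[\calL(x,\lambda)=x^2+\lambda(1-x),\qquad \calL_D(\lambda)=\min_x\calL(x,\lambda)=\lambda-\frac{\lambda^2}{4}\quad(\text{minimum at }x=\lambda/2)\] The dual is maximized at \(\lambda^\star=2\) with value \(1\), matching the primal optimum \(x^\star=1\), \(f(x^\star)=1\).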
The primal function is \[\calL_P(\bfx)\eqdef \max_{\boldsymbol{\lambda}\geq\boldsymbol{0},\boldsymbol{\mu}}\calL(\bfx,\boldsymbol{\lambda},\boldsymbol{\mu})\]
The primal optimization problem is \[\min_{\bfx}\calL_P(\bfx) = \min_{\bfx}\max_{\boldsymbol{\lambda}\geq\boldsymbol{0},\boldsymbol{\mu}}\calL(\bfx,\boldsymbol{\lambda},\boldsymbol{\mu})\]
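Since \(\calL_D(\boldsymbol{\lambda},\boldsymbol{\mu})\leq\calL(\bfx,\boldsymbol{\lambda},\boldsymbol{\mu})\leq\calL_P(\bfx)\) for every \(\bfx\) and every \(\boldsymbol{\lambda}\geq\boldsymbol{0}\), the two problems are related by weak duality: \[\max_{\boldsymbol{\lambda}\geq\boldsymbol{0},\boldsymbol{\mu}}\min_{\bfx}\calL(\bfx,\boldsymbol{\lambda},\boldsymbol{\mu})\leq \min_{\bfx}\max_{\boldsymbol{\lambda}\geq\boldsymbol{0},\boldsymbol{\mu}}\calL(\bfx,\boldsymbol{\lambda},\boldsymbol{\mu})\] with equality (strong duality) for convex problems under mild regularity conditions such as Slater’s condition.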