Kernel Methods

Matthieu R Bloch

Why kernel methods?

  • Linear least squares regression is well understood
    • Analysis goes back to Legendre and Gauss
    • Many results available from the statistical literature
    • Challenge: real data is more complicated
  • Kernel methods
    • Methods based on pairwise comparisons (with some constraints)
      • Given \(x_i,x_j\in\calX\), compare them with a kernel \(K:\calX\times\calX\to\bbR\) (see the sketch after this list)
      • Some choices of \(K\) are particularly useful
    • No assumptions on type of data (vectors, strings, images, graphs)
    • Versatile algorithms for supervised and unsupervised learning
    • Leverages linear methods after a non-linear transformation of the features
    • Appealing from a computational point of view
      • Scales “well” with dimension
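
  • A minimal numerical sketch of such pairwise comparisons, assuming NumPy and a Gaussian kernel (one standard choice of \(K\); the bandwidth below is an arbitrary illustrative value):

        import numpy as np

        def gaussian_kernel(x, y, sigma=1.0):
            # Gaussian (RBF) kernel, a standard positive definite kernel on R^d
            return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

        rng = np.random.default_rng(0)
        X = rng.normal(size=(5, 3))     # 5 points in R^3
        # Pairwise comparisons: the 5 x 5 matrix of kernel evaluations K(x_i, x_j)
        K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])
        print(K)                        # symmetric, entries in (0, 1], diagonal equal to 1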

More technical justification

  • \(K:\calX\times\calX\to\bbR\) is an inner product kernel if and only if \(K\) is a positive definite kernel.

  • Useful to implicitly define a transformation of feature vectors \(\bfx\in\bbR^d\) to \(\Phi(\bfx)\in\calH\) (a worked sketch follows at the end of this list)

  • Particularly useful when a machine learning task can be kernelized (SVM, etc.)

  • Objectives of module
    • What can we say about \(\calH\) more generally?
    • Can we develop a principled approach to kernelization?
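
  • A minimal sketch of the implicit transformation mentioned above, assuming NumPy: for the degree-2 polynomial kernel on \(\bbR^2\), the explicit map \(\Phi(\bfx)=(x_1^2,\sqrt{2}x_1x_2,x_2^2)\) satisfies \(K(\bfx,\bfy)=(\bfx^\intercal\bfy)^2=\dotp{\Phi(\bfx)}{\Phi(\bfy)}_{\bbR^3}\)

        import numpy as np

        def poly2_kernel(x, y):
            # Degree-2 polynomial kernel K(x, y) = (x^T y)^2
            return float(x @ y) ** 2

        def phi(x):
            # Explicit feature map associated with the degree-2 polynomial kernel on R^2
            return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

        x = np.array([1.0, 2.0])
        y = np.array([3.0, -1.0])
        print(poly2_kernel(x, y))   # 1.0
        print(phi(x) @ phi(y))      # 1.0: the kernel is an inner product after the transformation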

Inner product and norm

  • An inner product space over \(\bbR\) is a vector space \(\calV\) equipped with a positive definite symmetric bilinear form \(\dotp{\cdot}{\cdot}:\calV\times\calV\to\bbR\) called an inner product

  • An inner product space is also called a pre-Hilbert space

  • An inner product satisfies \(\forall x,y\in\calV\) \(\dotp{x}{y}^2\leq\dotp{x}{x}\dotp{y}{y}\)

  • A norm on a vector space \(\calV\) over \(\bbR\) is a function \(\norm{\cdot}:\calV\to\bbR\) that satisfies:
    • Positive definiteness: \(\forall x\in\calV\) \(\norm{x}\geq 0\) with equality iff \(x=0\)
    • Homogeneity: \(\forall x\in\calV\) \(\forall\alpha\in\bbR\) \(\norm{\alpha x}=\abs{\alpha}\norm{x}\)
    • Subadditivity: \(\forall x,y\in\calV\) \(\norm{x+y}\leq \norm{x}+\norm{y}\)
  • \[\bfx\in\bbR^d\qquad\norm[0]{\bfx}\eqdef\card{\set{i:x_i\neq 0}}\quad\norm[1]{\bfx}\eqdef\sum_{i=1}^d\abs{x_i}\quad \norm[2]{\bfx}\eqdef\sqrt{\sum_{i=1}^d x_i^2}\]
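
  • A minimal numerical sketch of these three quantities, assuming NumPy (note that \(\norm[0]{\cdot}\) fails homogeneity, so it is not a true norm):

        import numpy as np

        x = np.array([3.0, 0.0, -4.0, 0.0])
        l0 = np.count_nonzero(x)        # number of non-zero entries: 2 (not homogeneous, hence not a norm)
        l1 = np.sum(np.abs(x))          # l1 norm: 7.0
        l2 = np.sqrt(np.sum(x ** 2))    # l2 norm: 5.0
        print(l0, l1, l2)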

Induced norm and orthogonality

  • In an inner product space, an inner product induces a norm \(\norm{x} \eqdef \sqrt{\dotp{x}{x}}\)

  • A norm \(\norm{\cdot}\) is induced by an inner product on \(\calV\) iff \(\forall x,y\in\calV\) \(\norm{x}^2+\norm{y}^2 = \frac{1}{2}\left(\norm{x+y}^2+\norm{x-y}^2\right)\)

    If this is the case, the inner product is given by the polarization identity \[\dotp{x}{y}=\frac{1}{2}\left(\norm{x}^2+\norm{y}^2-\norm{x-y}^2\right)\]

  • Two vectors \(x,y\in\calV\) are orthogonal if \(\dotp{x}{y}=0\). We write \(x\perp y\) for simplicity.

    A vector \(x\in\calV\) is orthogonal to a set \(\calS\subset\calV\) if \(\forall s\in\calS\) \(\dotp{x}{s}=0\). We write \(x\perp \calS\) for simplicity.

  • If \(x\perp y\) then \(\norm{x+y}^2=\norm{x}^2+\norm{y}^2\)
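
  • A minimal numerical sketch of the characterization above, assuming NumPy: \(\norm[2]{\cdot}\) satisfies the parallelogram identity (it is induced by the usual inner product), while \(\norm[1]{\cdot}\) does not, so no inner product induces it

        import numpy as np

        def parallelogram_gap(norm, x, y):
            # norm(x)^2 + norm(y)^2 - (norm(x+y)^2 + norm(x-y)^2) / 2; zero when the identity holds
            return norm(x) ** 2 + norm(y) ** 2 - 0.5 * (norm(x + y) ** 2 + norm(x - y) ** 2)

        x = np.array([1.0, 0.0])
        y = np.array([0.0, 1.0])
        print(parallelogram_gap(lambda v: np.linalg.norm(v, 2), x, y))  # 0.0
        print(parallelogram_gap(lambda v: np.linalg.norm(v, 1), x, y))  # -2.0: identity fails for l1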

Hilbert space

  • A Cauchy sequence in \(\calV\) is a sequence \(\set{x_n}_{n\geq 0}\) such that \(\forall\epsilon>0\) \(\exists N\in\bbN^*\) such that \(\forall n,m>N\) \(\norm{x_n-x_m}\leq \epsilon\)

  • A Hilbert space \(\calH\) is a complete inner product space, i.e., an inner product space in which every Cauchy sequence in \(\calH\) converges to a point in \(\calH\)

  • Intuition: a Hilbert space has no “missing point”

  • A finite dimensional inner product space is complete (many interesting Hilbert spaces are infinite dimensional!)

  • Let \(\calH\) be a Hilbert space and \(\calM\) a closed subspace of \(\calH\). For any \(x\in\calH\), \(\exists\) a unique \(m^*\in\calM\) for which \[ \forall m\in\calM\quad \norm{x-m^*}\leq\norm{x-m} \] Furthermore \(m^*\) is the unique vector \(m\in\calM\) such that \(x-m\perp \calM\).

    The vector \(m^*\) is called the orthogonal projection of \(x\) onto \(\calM\).
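
  • A minimal finite-dimensional sketch of the projection theorem, assuming NumPy: project \(x\in\bbR^3\) onto the column space of a matrix \(M\) and check that the residual is orthogonal to the subspace

        import numpy as np

        M = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [0.0, 0.0]])       # closed subspace spanned by the columns of M
        x = np.array([1.0, 2.0, 3.0])

        # Orthogonal projection via the normal equations: m* = M (M^T M)^{-1} M^T x
        coeffs = np.linalg.solve(M.T @ M, M.T @ x)
        m_star = M @ coeffs
        print(m_star)                    # [1. 2. 0.]: closest point to x in the subspace
        print(M.T @ (x - m_star))        # [0. 0.]: x - m* is orthogonal to the subspace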

Dual space and Riesz representation

  • The (topological) dual of a Hilbert space \(\calH\) is the space of continuous linear functionals \(f:\calH\to\bbR\).

  • In a Hilbert space \(\calH\), every continuous linear functional is of the form \(\dotp{\cdot}{g}_\calH\) for some \(g\in\calH\) (Riesz representation theorem).
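
  • A minimal finite-dimensional illustration: on \(\calH=\bbR^d\) with the usual inner product, every linear functional \(f\) satisfies \(f(x)=\sum_{i=1}^d x_if(e_i)=\dotp{x}{g}\) with \(g\eqdef\mat{ccc}{f(e_1)&\cdots&f(e_d)}^\intercal\), where \(\set{e_i}_{i=1}^d\) is the canonical basis; the representer \(g\) is obtained by evaluating \(f\) on a basis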

Reproducing Kernel Hilbert Space

  • Let \(\calX\) be a set, \(\calH\subset \bbR^\calX\) be a Hilbert space with inner product \(\dotp{\cdot}{\cdot}_\calH\). The function \(K:\calX\times\calX\to\bbR\) is a reproducing kernel of \(\calH\) if the following holds:
    • \(\calH\) contains all functions of the form \[\forall x\in\calX \quad K_x:\calX\to\bbR:t\mapsto K(x,t)\]
    • \(\forall x\in\calX\) and \(\forall f\in\calH\), the reproducing property holds \[f(x) = \dotp{f}{K_x}_\calH\]

    If a reproducing kernel exists, \(\calH\) is called a reproducing kernel Hilbert space (RKHS).

  • An RKHS is a space of functions

  • Every \(x\in\calX\) can be mapped to a function in \(\calH\) with the kernel map \(\Phi:\calX\to\calH:x\mapsto K_x\)

  • The kernel and the reproducing property require an RKHS to be very “well behaved”

  • Compare to the Hilbert space \(L^2\) of square integrable functions: its elements are only defined up to sets of measure zero, so pointwise evaluation is not even well defined and \(L^2\) is not an RKHS
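
  • A minimal illustration with the linear kernel \(K(\bfx,\bfy)=\bfx^\intercal\bfy\) on \(\calX=\bbR^d\): the associated RKHS is \(\calH=\set{f_\bfy:\bfy\in\bbR^d}\) with \(f_\bfy(\bfx)\eqdef\bfy^\intercal\bfx\) and \(\dotp{f_\bfy}{f_{\bfy'}}_\calH\eqdef\bfy^\intercal\bfy'\); then \(K_\bfx=f_\bfx\), and the reproducing property reads \(\dotp{f_\bfy}{K_\bfx}_\calH=\bfy^\intercal\bfx=f_\bfy(\bfx)\)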

Properties of RKHS

  • A Hilbert space \(\calH\subset \bbR^\calX\) is an RKHS if and only if for all \(x\in\calX\) the linear functional \(\delta_x:\calH\to\bbR:f\mapsto f(x)\) is continuous.

  • Convergence in an RKHS implies pointwise convergence.

  • An RKHS \(\calH\) has a unique reproducing kernel \(K\). Conversely a reproducing kernel \(K:\calX\times\calX\to\bbR\) is associated to a unique RKHS.

  • We can talk equivalently about a reproducing kernel or an RKHS without ambiguity

  • For any function \(f\) in an RKHS \(\calH\), for any \(x,y\in\calX\) \[\abs{f(x)-f(y)}\leq \norm[\calH]{f}\norm[\calH]{K_x-K_y}\]

  • The norm \(\norm[\calH]{f}\) controls how fast the function varies w.r.t. the geometry defined by the kernel on \(\calX\).
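
  • Sketch of why the bound holds: by the reproducing property \(f(x)-f(y)=\dotp{f}{K_x-K_y}_\calH\), and Cauchy-Schwarz gives the inequality; moreover \(\norm[\calH]{K_x-K_y}^2=K(x,x)-2K(x,y)+K(y,y)\), so the right-hand side only involves kernel evaluations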

Finding reproducing kernels

  • A function \(K:\calX\times\calX\to\bbR\) is a reproducing kernel if and only if it is a positive definite kernel.

  • This is essentially Aronszajn’s theorem!

  • Take \(\calX=\bbR^d\). The linear kernel is \(K(\bfx,\bfy)=\dotp{\bfx}{\bfy}_{\bbR^d}=\bfx^\intercal\bfy\).

    The polynomial kernel of degree 2 is \(K(\bfx,\bfy)=\dotp{\bfx}{\bfy}_{\bbR^d}^2=(\bfx^\intercal\bfy)^2\).

  • One can easily combine kernels to create new ones

  • Let \(K_1\) and \(K_2\) be positive definite kernels and let \(\alpha\geq 0\). Then, \(K_1+K_2\), \(K_1\times K_2\) and \(\alpha K_1\) are positive definite kernels.

    If \(\set{K_i}_{i\geq 1}\) is a sequence of positive definite kernels that converges pointwise to \(K\), then \(K\) is a positive definite kernel.
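
  • A minimal numerical sketch of these closure properties, assuming NumPy: build Gram matrices for \(K_1+K_2\), \(K_1\times K_2\), and \(\alpha K_1\) from the linear and degree-2 polynomial kernels and check that their eigenvalues are non-negative (up to numerical tolerance)

        import numpy as np

        def gram(kernel, X):
            # Gram matrix [kernel(x_i, x_j)]_{i,j} of a finite set of points
            return np.array([[kernel(xi, xj) for xj in X] for xi in X])

        linear = lambda x, y: float(x @ y)          # K_1: linear kernel
        poly2 = lambda x, y: float(x @ y) ** 2      # K_2: degree-2 polynomial kernel
        alpha = 3.0

        rng = np.random.default_rng(0)
        X = rng.normal(size=(6, 4))

        for k in (lambda x, y: linear(x, y) + poly2(x, y),   # K_1 + K_2
                  lambda x, y: linear(x, y) * poly2(x, y),   # K_1 * K_2
                  lambda x, y: alpha * linear(x, y)):        # alpha * K_1
            print(np.linalg.eigvalsh(gram(k, X)).min() >= -1e-10)   # True: PSD Gram matrix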

Representer theorem

  • Let \(\calX\) be a set endowed with a positive definite kernel \(K\). Let \(\calH\) be the associated RKHS.

    Let \(\calS\eqdef\set{x_i}_{i=1}^n\) be a finite set of points in \(\calX\).

    Let \(\Psi:\bbR^{n+1}\to\bbR\) be strictly increasing w.r.t. the last variable.

    The solution of the problem \(\min_{f\in\calH}\Psi(f(x_1),\cdots,f(x_n),\norm[\calH]{f})\) is of the form \[ \forall x\in\calX\quad f(x) = \bfalpha^\intercal \bfK(x) = \sum_{i=1}^n\alpha_iK(x_i,x) \] where \(\bfalpha\eqdef\mat{ccc}{\alpha_1&\cdots&\alpha_n}^\intercal\) and \(\bfK(x)\eqdef\mat{ccc}{K(x_1,x)&\cdots&K(x_n,x)}^\intercal\).

  • The solution lives in a subspace of dimension \(n\)

  • Practical implication: Let \(\bfK\eqdef\mat{c}{K(x_i,x_j)}_{i,j\in\intseq{1}{n}}\) be the Gram matrix of \(\calS\). We can equivalently solve \[ \min_{\bfalpha} \Psi(\mat{c}{\bfK\bfalpha}_1,\cdots,\mat{c}{\bfK\bfalpha}_n,\bfalpha^\intercal\bfK\bfalpha) \] where \(\mat{c}{\bfK\bfalpha}_j= \bfalpha^\intercal\bfK(x_j)=f(x_j)\). This finite-dimensional problem can usually be solved analytically or by numerical methods.
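
  • A minimal sketch of this recipe, assuming NumPy, a Gaussian kernel, and the squared loss with a norm penalty (kernel ridge regression), for which the minimization over \(\bfalpha\) has a closed form

        import numpy as np

        def gaussian_kernel(x, y, sigma=1.0):
            return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

        # Toy regression data
        rng = np.random.default_rng(0)
        X = rng.uniform(-3, 3, size=(30, 1))
        y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

        # Gram matrix of the training points
        K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

        # Objective sum_i (y_i - [K alpha]_i)^2 + lam * alpha^T K alpha;
        # setting the gradient to zero gives alpha = (K + lam I)^{-1} y
        lam = 0.1
        alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)

        def f(x):
            # Fitted function: a linear combination of kernel sections K_{x_i}
            return sum(a * gaussian_kernel(xi, x) for a, xi in zip(alpha, X))

        print(f(np.array([0.5])), np.sin(0.5))   # prediction close to the noiseless target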

Benefits of kernel methods

  • Principle: map data from \(\calX\) to RKHS \(\calH\) defined by kernel \(K\) as \[ \varphi:\calX\to\calH:x\mapsto K_x\eqdef K(\cdot,x) \]

  • Embed data in vector space: there might not exist a nice structure in \(\calX\), but \(\calH\) is a Hilbert space
    • Geometrical operations are possible (projections, etc.)
    • Dimension could be infinite!
  • Lift data to higher dimension: even if \(\calX\) is already a Hilbert space (\(\calX=\bbR^d\)), lifting can help
    • Nicer properties: linear separability, clustering
    • Linear forms \(\dotp{f}{K_x}_\calH=f(x)\) in \(\calH\) correspond to non-linear models over \(\calX\)
  • Kernel trick: machine learning in \(\calH\) does not require explicit operation in \(\calH\)
    • Any algorithm to process finite-dimensional vectors that can be expressed only in terms of pairwise inner products can be applied to potentially infinite-dimensional vectors in the feature space of a p.d. kernel by replacing each inner product evaluation by a kernel evaluation. \[ \bfx_i^\intercal\bfx_j\eqdef \dotp{\bfx_i}{\bfx_j}_{\bbR^d }\leftrightarrow K(x_i,x_j) \eqdef \dotp{K_{x_i}}{K_{x_j}}_{\calH} \]
    • Elements in the feature space implicitly manipulated through pairwise inner products.
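
  • A minimal sketch of the kernel trick above, assuming NumPy and a Gaussian kernel: the squared distance in \(\calH\) from \(K_x\) to the mean \(\frac{1}{n}\sum_i K_{x_i}\) of a mapped dataset expands into kernel evaluations only, \(K(x,x)-\frac{2}{n}\sum_iK(x,x_i)+\frac{1}{n^2}\sum_{i,j}K(x_i,x_j)\), so the feature space is never materialized

        import numpy as np

        def gaussian_kernel(x, y, sigma=1.0):
            return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

        rng = np.random.default_rng(0)
        X = rng.normal(size=(20, 2))    # dataset, mapped implicitly to H by the kernel map
        x = np.array([0.5, -0.5])       # query point

        n = len(X)
        K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])
        k_x = np.array([gaussian_kernel(x, xi) for xi in X])

        # ||K_x - (1/n) sum_i K_{x_i}||_H^2 computed with kernel evaluations only
        sq_dist = gaussian_kernel(x, x) - 2.0 / n * k_x.sum() + K.sum() / n ** 2
        print(sq_dist)                  # non-negative squared distance in feature space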

Support Vector Machine revisited

Kernel Ridge Regression revisited

Kernel logistic regression

Kernel PCA

Kernel k-means