Kernel Methods

Matthieu R Bloch

Why kernel methods?

  • Linear least squares regression is well understood
    • Analysis goes back to Legendre and Gauss
    • Many results available from the statistical literature
    • Challenge: real data is more complicated
  • Kernel methods
    • Methods based on pairwise comparisons (with some constraints)
      • Given \(x_i,x_j\in\calX\), compare them with a kernel \(K:\calX\times\calX\to\bbR\) (see the sketch after this list)
      • Some choices of \(K\) are particularly useful
    • No assumptions on type of data (vectors, strings, images, graphs)
    • Versatile algorithms for supervised and unsupervised learning
    • Leverages linear methods after a non-linear transformation of the features
    • Appealing from a computational point of view
      • Scales “well” with dimension
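
  • A minimal numerical sketch of such pairwise comparisons, assuming NumPy and a Gaussian kernel (one standard choice of \(K\); the bandwidth below is an arbitrary illustrative value):

        import numpy as np

        def gaussian_kernel(x, y, sigma=1.0):
            # Gaussian (RBF) kernel, a standard positive definite kernel on R^d
            return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

        rng = np.random.default_rng(0)
        X = rng.normal(size=(5, 3))     # 5 points in R^3
        # Pairwise comparisons: the 5 x 5 matrix of kernel evaluations K(x_i, x_j)
        K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])
        print(K)                        # symmetric, entries in (0, 1], diagonal equal to 1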

More technical justification

  • \(K:\calX\times\calX\to\bbR\) is an inner product kernel if and only if \(K\) is a positive definite kernel.

  • Useful to implicitly define a transformation of feature vectors \(\bfx\in\bbR^d\) to \(\Phi(\bfx)\in\calH\) (a worked sketch follows at the end of this list)

  • Particularly useful when a machine learning task can be kernelized (SVM, etc.)

  • Objectives of module
    • What can we say about \(\calH\) more generally?
    • Can we develop a principled approach to kernelization?
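
  • A minimal sketch of the implicit transformation mentioned above, assuming NumPy: for the degree-2 polynomial kernel on \(\bbR^2\), the explicit map \(\Phi(\bfx)=(x_1^2,\sqrt{2}x_1x_2,x_2^2)\) satisfies \(K(\bfx,\bfy)=(\bfx^\intercal\bfy)^2=\dotp{\Phi(\bfx)}{\Phi(\bfy)}_{\bbR^3}\)

        import numpy as np

        def poly2_kernel(x, y):
            # Degree-2 polynomial kernel K(x, y) = (x^T y)^2
            return float(x @ y) ** 2

        def phi(x):
            # Explicit feature map associated with the degree-2 polynomial kernel on R^2
            return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

        x = np.array([1.0, 2.0])
        y = np.array([3.0, -1.0])
        print(poly2_kernel(x, y))   # 1.0
        print(phi(x) @ phi(y))      # 1.0: the kernel is an inner product after the transformation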

Inner product and norm

  • An inner product space over \(\bbR\) is a vector space \(\calV\) equipped with a positive definite symmetric bilinear form \(\dotp{\cdot}{\cdot}:\calV\times\calV\to\bbR\) called an inner product

  • An inner product space is also called a pre-Hilbert space

  • An inner product satisfies \(\forall x,y\in\calV\) \(\dotp{x}{y}^2\leq\dotp{x}{x}\dotp{y}{y}\)

  • A norm on a vector space \(\calV\) over \(\bbR\) is a function \(\norm{\cdot}:\calV\to\bbR\) that satisfies:
    • Positive definiteness: \(\forall x\in\calV\) \(\norm{x}\geq 0\) with equality iff \(x=0\)
    • Homogeneity: \(\forall x\in\calV\) \(\forall\alpha\in\bbR\) \(\norm{\alpha x}=\abs{\alpha}\norm{x}\)
    • Subadditivity: \(\forall x,y\in\calV\) \(\norm{x+y}\leq \norm{x}+\norm{y}\)
  • \[\bfx\in\bbR^d\qquad\norm[0]{\bfx}\eqdef\card{\set{i:x_i\neq 0}}\quad\norm[1]{\bfx}\eqdef\sum_{i=1}^d\abs{x_i}\quad \norm[2]{\bfx}\eqdef\sqrt{\sum_{i=1}^d x_i^2}\]
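
  • A minimal numerical sketch of these three quantities, assuming NumPy (note that \(\norm[0]{\cdot}\) fails homogeneity, so it is not a true norm):

        import numpy as np

        x = np.array([3.0, 0.0, -4.0, 0.0])
        l0 = np.count_nonzero(x)        # number of non-zero entries: 2 (not homogeneous, hence not a norm)
        l1 = np.sum(np.abs(x))          # l1 norm: 7.0
        l2 = np.sqrt(np.sum(x ** 2))    # l2 norm: 5.0
        print(l0, l1, l2)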

Induced norm and orthogonality

  • In an inner product space, an inner product induces a norm \(\norm{x} \eqdef \sqrt{\dotp{x}{x}}\)

  • A norm \(\norm{\cdot}\) is induced by an inner product on \(\calV\) iff \(\forall x,y\in\calV\) \(\norm{x}^2+\norm{y}^2 = \frac{1}{2}\left(\norm{x+y}^2+\norm{x-y}^2\right)\)

    If this is the case, the inner product is given by the polarization identity \[\dotp{x}{y}=\frac{1}{2}\left(\norm{x}^2+\norm{y}^2-\norm{x-y}^2\right)\]

  • Two vectors \(x,y\in\calV\) are orthogonal if \(\dotp{x}{y}=0\). We write \(x\perp y\) for simplicity.

    A vector \(x\in\calV\) is orthogonal to a set \(\calS\subset\calV\) if \(\forall s\in\calS\) \(\dotp{x}{s}=0\). We write \(x\perp \calS\) for simplicity.

  • If \(x\perp y\) then \(\norm{x+y}^2=\norm{x}^2+\norm{y}^2\)
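
  • A minimal numerical sketch of the characterization above, assuming NumPy: \(\norm[2]{\cdot}\) satisfies the parallelogram identity (it is induced by the usual inner product), while \(\norm[1]{\cdot}\) does not, so no inner product induces it

        import numpy as np

        def parallelogram_gap(norm, x, y):
            # norm(x)^2 + norm(y)^2 - (norm(x+y)^2 + norm(x-y)^2) / 2; zero when the identity holds
            return norm(x) ** 2 + norm(y) ** 2 - 0.5 * (norm(x + y) ** 2 + norm(x - y) ** 2)

        x = np.array([1.0, 0.0])
        y = np.array([0.0, 1.0])
        print(parallelogram_gap(lambda v: np.linalg.norm(v, 2), x, y))  # 0.0
        print(parallelogram_gap(lambda v: np.linalg.norm(v, 1), x, y))  # -2.0: identity fails for l1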

Hilbert space

  • A Cauchy sequence in \(\calV\) is a sequence \(\set{x_n}_{n\geq 0}\) such that \(\forall\epsilon>0\) \(\exists N\in\bbN^*\) such that \(\forall n,m>N\) \(\norm{x_n-x_m}\leq \epsilon\)

  • A Hilbert space \(\calH\) is a complete inner product space, i.e., an inner product space in which every Cauchy sequence in \(\calH\) converges to a point in \(\calH\)

  • Intuition: a Hilbert space has no “missing point”

  • A finite dimensional inner product space is complete (many interesting Hilbert spaces are infinite dimensional!)

  • Let \(\calH\) be a Hilbert space and \(\calM\) a closed subspace of \(\calH\). For any \(x\in\calH\), \(\exists\) a unique \(m^*\in\calM\) for which \[ \forall m\in\calM\quad \norm{x-m^*}\leq\norm{x-m} \] Furthermore \(m^*\) is the unique vector \(m\in\calM\) such that \(x-m\perp \calM\).

    The vector \(m^*\) is called the orthogonal projection of \(x\) onto \(\calM\).
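
  • A minimal finite-dimensional sketch of the projection theorem, assuming NumPy: project \(x\in\bbR^3\) onto the column space of a matrix \(M\) and check that the residual is orthogonal to the subspace

        import numpy as np

        M = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [0.0, 0.0]])       # closed subspace spanned by the columns of M
        x = np.array([1.0, 2.0, 3.0])

        # Orthogonal projection via the normal equations: m* = M (M^T M)^{-1} M^T x
        coeffs = np.linalg.solve(M.T @ M, M.T @ x)
        m_star = M @ coeffs
        print(m_star)                    # [1. 2. 0.]: closest point to x in the subspace
        print(M.T @ (x - m_star))        # [0. 0.]: x - m* is orthogonal to the subspace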

Dual space and Riesz representation

  • The (topological) dual of a Hilbert space \(\calH\) is the space of continuous linear functionals \(f:\calH\to\bbR\).

  • In a Hilbert space \(\calH\), every continuous linear functional is of the form \(\dotp{\cdot}{g}_\calH\) for some \(g\in\calH\) (Riesz representation theorem).
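
  • A minimal finite-dimensional illustration: on \(\calH=\bbR^d\) with the usual inner product, every linear functional \(f\) satisfies \(f(x)=\sum_{i=1}^d x_if(e_i)=\dotp{x}{g}\) with \(g\eqdef\mat{ccc}{f(e_1)&\cdots&f(e_d)}^\intercal\), where \(\set{e_i}_{i=1}^d\) is the canonical basis; the representer \(g\) is obtained by evaluating \(f\) on a basis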

Reproducing Kernel Hilbert Space

  • Let \(\calX\) be a set, \(\calH\subset \bbR^\calX\) be a Hilbert space with inner product \(\dotp{\cdot}{\cdot}_\calH\). The function \(K:\calX\times\calX\to\bbR\) is a reproducing kernel of \(\calH\) if the following holds:
    • \(\calH\) contains all functions of the form \[\forall x\in\calX \quad K_x:\calX\to\bbR:t\mapsto K(x,t)\]
    • \(\forall x\in\calX\) and \(\forall f\in\calH\), the reproducing property holds \[f(x) = \dotp{f}{K_x}_\calH\]

    If a reproducing kernel exists, \(\calH\) is called a reproducing kernel Hilbert space (RKHS).

  • An RKHS is a space of functions

  • Every \(x\in\calX\) can be mapped to a function in \(\calH\) with the kernel map \(\Phi:\calX\to\calH:x\mapsto K_x\)

  • The kernel and the reproducing property require an RKHS to be very “well behaved”

  • Compare to the Hilbert space \(L^2\) of square integrable functions: its elements are only defined up to sets of measure zero, so pointwise evaluation is not even well defined and \(L^2\) is not an RKHS
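
  • A minimal illustration with the linear kernel \(K(\bfx,\bfy)=\bfx^\intercal\bfy\) on \(\calX=\bbR^d\): the associated RKHS is \(\calH=\set{f_\bfy:\bfy\in\bbR^d}\) with \(f_\bfy(\bfx)\eqdef\bfy^\intercal\bfx\) and \(\dotp{f_\bfy}{f_{\bfy'}}_\calH\eqdef\bfy^\intercal\bfy'\); then \(K_\bfx=f_\bfx\), and the reproducing property reads \(\dotp{f_\bfy}{K_\bfx}_\calH=\bfy^\intercal\bfx=f_\bfy(\bfx)\)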

Properties of RKHS

  • A Hilbert space \(\calH\subset \bbR^\calX\) is an RKHS if and only if for all \(x\in\calX\) the linear functional \(\delta_x:\calH\to\bbR:f\mapsto f(x)\) is continuous.

  • Convergence in an RKHS implies pointwise convergence.

  • An RKHS \(\calH\) has a unique reproducing kernel \(K\). Conversely a reproducing kernel \(K:\calX\times\calX\to\bbR\) is associated to a unique RKHS.

  • We can talk equivalently about a reproducing kernel or an RKHS without ambiguity

  • For any function \(f\) in an RKHS \(\calH\), for any \(x,y\in\calX\) \[\abs{f(x)-f(y)}\leq \norm[\calH]{f}\norm[\calH]{K_x-K_y}\]

  • The norm \(\norm[\calH]{f}\) controls how fast the function varies w.r.t. the geometry defined by the kernel on \(\calX\).
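
  • Sketch of why the bound holds: by the reproducing property \(f(x)-f(y)=\dotp{f}{K_x-K_y}_\calH\), and Cauchy-Schwarz gives the inequality; moreover \(\norm[\calH]{K_x-K_y}^2=K(x,x)-2K(x,y)+K(y,y)\), so the right-hand side only involves kernel evaluations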

Finding reproducing kernels

  • A function \(K:\calX\times\calX\to\bbR\) is a reproducing kernel if and only if it is a positive definite kernel.

  • This is essentially Aronszajn’s theorem!

  • Take \(\calX=\bbR^d\). The linear kernel is \(K(\bfx,\bfy)=\dotp{\bfx}{\bfy}_{\bbR^d}=\bfx^\intercal\bfy\).

    The polynomial kernel of degree 2 is \(K(\bfx,\bfy)=\dotp{\bfx}{\bfy}_{\bbR^d}^2=(\bfx^\intercal\bfy)^2\).

  • One can easily combine kernels to create new ones

  • Let \(K_1\) and \(K_2\) be positive definite kernels and let \(\alpha\geq 0\). Then, \(K_1+K_2\), \(K_1\times K_2\) and \(\alpha K_1\) are positive definite kernels.

    If \(\set{K_i}_{i\geq 1}\) is a sequence of positive definite kernels that converges pointwise to \(K\), then \(K\) is a positive definite kernel.
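
  • A minimal numerical sketch of these closure properties, assuming NumPy: build Gram matrices for \(K_1+K_2\), \(K_1\times K_2\), and \(\alpha K_1\) from the linear and degree-2 polynomial kernels and check that their eigenvalues are non-negative (up to numerical tolerance)

        import numpy as np

        def gram(kernel, X):
            # Gram matrix [kernel(x_i, x_j)]_{i,j} of a finite set of points
            return np.array([[kernel(xi, xj) for xj in X] for xi in X])

        linear = lambda x, y: float(x @ y)          # K_1: linear kernel
        poly2 = lambda x, y: float(x @ y) ** 2      # K_2: degree-2 polynomial kernel
        alpha = 3.0

        rng = np.random.default_rng(0)
        X = rng.normal(size=(6, 4))

        for k in (lambda x, y: linear(x, y) + poly2(x, y),   # K_1 + K_2
                  lambda x, y: linear(x, y) * poly2(x, y),   # K_1 * K_2
                  lambda x, y: alpha * linear(x, y)):        # alpha * K_1
            print(np.linalg.eigvalsh(gram(k, X)).min() >= -1e-10)   # True: PSD Gram matrix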

Representer theorem

  • Let \(\calX\) be a set endowed with a positive definite kernel \(K\). Let \(\calH\) be the associated RKHS.

    Let \(\calS\eqdef\set{x_i}_{i=1}^n\) be a finite set of points in \(\calX\).

    Let \(\Psi:\bbR^{n+1}\to\bbR\) be strictly increasing w.r.t. the last variable.

    The solution of the problem \(\min_{f\in\calH}\Psi(f(x_1),\cdots,f(x_n),\norm[\calH]{f})\) is of the form \[ \forall x\in\calX\quad f(x) = \bfalpha^\intercal \bfK(x) = \sum_{i=1}^n\alpha_iK(x_i,x) \] where \(\bfalpha\eqdef\mat{ccc}{\alpha_1&\cdots&\alpha_n}^\intercal\) and \(\bfK(x)\eqdef\mat{ccc}{K(x_1,x)&\cdots&K(x_n,x)}^\intercal\).

  • The solution lives in a subspace of dimension \(n\)

  • Practical implication: Let \(\bfK\eqdef\mat{c}{K(x_i,x_j)}_{i,j\in\intseq{1}{n}}\) be the Gram matrix of \(\calS\). We can equivalently solve \[ \min_{\bfalpha} \Psi(\mat{c}{\bfK\bfalpha}_1,\cdots,\mat{c}{\bfK\bfalpha}_n,\bfalpha^\intercal\bfK\bfalpha) \] where \(\mat{c}{\bfK\bfalpha}_j= \bfalpha^\intercal\bfK(x_j)=f(x_j)\). This finite-dimensional problem can usually be solved analytically or by numerical methods.
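
  • A minimal sketch of this recipe, assuming NumPy, a Gaussian kernel, and the squared loss with a norm penalty (kernel ridge regression), for which the minimization over \(\bfalpha\) has a closed form

        import numpy as np

        def gaussian_kernel(x, y, sigma=1.0):
            return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

        # Toy regression data
        rng = np.random.default_rng(0)
        X = rng.uniform(-3, 3, size=(30, 1))
        y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

        # Gram matrix of the training points
        K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

        # Objective sum_i (y_i - [K alpha]_i)^2 + lam * alpha^T K alpha;
        # setting the gradient to zero gives alpha = (K + lam I)^{-1} y
        lam = 0.1
        alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)

        def f(x):
            # Fitted function: a linear combination of kernel sections K_{x_i}
            return sum(a * gaussian_kernel(xi, x) for a, xi in zip(alpha, X))

        print(f(np.array([0.5])), np.sin(0.5))   # prediction close to the noiseless target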

Benefits of kernel methods

  • Principle: map data from \(\calX\) to RKHS \(\calH\) defined by kernel \(K\) as \[ \varphi:\calX\to\calH:x\mapsto K_x\eqdef K(\cdot,x) \]

  • Embed data in vector space: there might not exist a nice structure in \(\calX\), but \(\calH\) is a Hilbert space
    • Geometrical operations are possible (projections, etc.)
    • Dimension could be infinite!
  • Lift data to higher dimension: even if \(\calX\) is already a Hilbert space (\(\calX=\bbR^d\)), lifting can help
    • Nicer properties: linear separability, clustering
    • Linear forms \(\dotp{f}{K_x}_\calH=f(x)\) in \(\calH\) correspond to non-linear models over \(\calX\)
  • Kernel trick: machine learning in \(\calH\) does not require explicit operation in \(\calH\)
    • Any algorithm to process finite-dimensional vectors that can be expressed only in terms of pairwise inner products can be applied to potentially infinite-dimensional vectors in the feature space of a p.d. kernel by replacing each inner product evaluation by a kernel evaluation. \[ \bfx_i^\intercal\bfx_j\eqdef \dotp{\bfx_i}{\bfx_j}_{\bbR^d }\leftrightarrow K(x_i,x_j) \eqdef \dotp{K_{x_i}}{K_{x_j}}_{\calH} \]
    • Elements in the feature space implicitly manipulated through pairwise inner products.
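
  • A minimal sketch of the kernel trick above, assuming NumPy and a Gaussian kernel: the squared distance in \(\calH\) from \(K_x\) to the mean \(\frac{1}{n}\sum_i K_{x_i}\) of a mapped dataset expands into kernel evaluations only, \(K(x,x)-\frac{2}{n}\sum_iK(x,x_i)+\frac{1}{n^2}\sum_{i,j}K(x_i,x_j)\), so the feature space is never materialized

        import numpy as np

        def gaussian_kernel(x, y, sigma=1.0):
            return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

        rng = np.random.default_rng(0)
        X = rng.normal(size=(20, 2))    # dataset, mapped implicitly to H by the kernel map
        x = np.array([0.5, -0.5])       # query point

        n = len(X)
        K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])
        k_x = np.array([gaussian_kernel(x, xi) for xi in X])

        # ||K_x - (1/n) sum_i K_{x_i}||_H^2 computed with kernel evaluations only
        sq_dist = gaussian_kernel(x, x) - 2.0 / n * k_x.sum() + K.sum() / n ** 2
        print(sq_dist)                  # non-negative squared distance in feature space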

Support Vector Machine revisited

Kernel Ridge Regression revisited

Kernel logistic regression

Kernel PCA

Kernel k-means