Mathematical Foundations of Machine Learning

Prof. Matthieu Bloch

Monday September 23, 2024 (v1.1)

Last time

  • Last class: Wednesday, September 18, 2024
    • We talked about regularized least square regression and the solution
  • Today: representer theorem and RKHS
  • To be effectively prepared for today's class, you should have:
    1. Come to class last week
    2. Gone over slides and read associated lecture notes
    3. Started Homework 3 (due Sunday September 29, 2024)
  • Logistics
    • Office hours
      • Jack Hill: Wednesday 11:30am-12:30pm in TSRB and hybrid
      • Anuvab Sen: Thursday 12pm-1pm in TSRB and hybrid
    • Gradescope: make sure you perform a quality check of your submission
  • Midterm: Wednesday October 9, 2024 3:30pm-4:45pm in usual classroom

Ridge regression

  • We can adapt the regularization approach to the situation of a finite-dimensional Hilbert space \(\calF\) \[ \min_{f\in\calF}\sum_{i=1}^n(y_i-f(\bfx_i))^2 + \lambda\norm[\calF]{f}^2 \]

    • We are penalizing the norm of the entire function \(f\)
  • Using a basis for the space \(\set{\psi_i}_{i=1}^d\) , and constructing \(\boldsymbol{\Psi}\) as earlier, we obtain \[ \min_{\bftheta\in\bbR^d}\norm[2]{\bfy-\boldsymbol{\Psi}\bftheta}^2 + \lambda \bftheta^\intercal\matG\bftheta \] with \(\matG\) the Gram matrix for the basis.

  • Since \(\boldsymbol{\Psi}^\intercal \boldsymbol{\Psi}+\lambda\matG\) is invertible, we find the solution as \[ \bftheta^* = (\boldsymbol{\Psi}^\intercal \boldsymbol{\Psi}+\lambda\matG)^{-1}\boldsymbol{\Psi}^\intercal \bfy \] and we can reconstruct the function as \(f(\bfx) = \sum_{i=1}^d\theta_i^*\psi_{i}(\bfx)\).

  • If \(\matG\) is well conditioned, the resulting function is not too sensitive to the choice of basis (see the numerical sketch below)
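A minimal numerical sketch of this construction; the 1-D polynomial basis, synthetic data, and numerically approximated Gram matrix are illustrative assumptions, not from the lecture:

```python
import numpy as np

# Hypothetical 1-D example: polynomial basis psi_k(x) = x^k, k = 0..d-1.
rng = np.random.default_rng(0)
n, d, lam = 50, 6, 1e-2

x = rng.uniform(-1, 1, size=n)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(n)

# Psi[i, k] = psi_k(x_i)
Psi = np.vander(x, d, increasing=True)

# Gram matrix of the basis, G[j, k] = <psi_j, psi_k>, taken here on L^2([-1, 1])
# and approximated by numerical integration on a fine grid (an assumed choice).
grid = np.linspace(-1, 1, 2001)
B = np.vander(grid, d, increasing=True)
G = (B.T @ B) * (grid[1] - grid[0])

# theta* = (Psi^T Psi + lam * G)^{-1} Psi^T y
theta = np.linalg.solve(Psi.T @ Psi + lam * G, Psi.T @ y)

# Reconstruct f(x) = sum_k theta_k psi_k(x) at new points
x_new = np.linspace(-1, 1, 5)
f_new = np.vander(x_new, d, increasing=True) @ theta
print(f_new)
```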

Least-squares in infinite-dimensional Hilbert spaces

  • In \(\bbR^d\), the problem \(\min_{\bftheta\in\bbR^d}\norm[2]{\bfy-\bfX\bftheta}^2 + \lambda\norm[2]{\bftheta}^2\) has a solution

    • \(\bftheta^* = \bfX^\intercal\bfalpha\) with \(\bfalpha =(\bfX\bfX^\intercal+\lambda\bfI)^{-1}\bfy\)
    • \(\bfX\bfX^\intercal\in\bbR^{n\times n}\) is dimension independent!
    • We will be able to extend this to infinite dimensional Hilbert spaces!
  • Let \(\calF\) be a Hilbert space and let \(f\in\calF\) be the function we are trying to estimate

    • We will estimate \(f\in\calF\) from noisy observations \(y_i\) of \(\dotp{f}{x_i}\), where \(\set{x_i}_{i=1}^n\) are elements of \(\calF\)

    • This is the equivalent of saying \(\bfy = \bfA\bfx+\bfn\) in finite dimension, where \(\bfn\) is additive noise

\[ \min_{f\in\calF}\sum_{i=1}^n\abs{y_i-{\dotp{f}{x_i}}_{\calF}}^2+\lambda\norm[\calF]{f}^2 \] has solution \[ f = \sum_{i=1}^n\alpha_i x_i\textsf{ with } \bfalpha = (\matK+\lambda\matI)^{-1}\bfy\qquad \matK=\mat{c}{\dotp{x_i}{x_j}}_{1\leq i,j\leq n} \]
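A quick numerical check, on assumed random data, that the two finite-dimensional forms above agree, mirroring the structure of the infinite-dimensional solution:

```python
import numpy as np

# primal: theta* = (X^T X + lam I_d)^{-1} X^T y                  -> d x d system
# dual:   theta* = X^T alpha, alpha = (X X^T + lam I_n)^{-1} y   -> n x n system
rng = np.random.default_rng(0)
n, d, lam = 20, 100, 0.5                  # d may be much larger than n
X = rng.standard_normal((n, d))           # rows of X play the role of the x_i
y = rng.standard_normal(n)

theta_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)   # K = X X^T is only n x n
theta_dual = X.T @ alpha

print(np.allclose(theta_primal, theta_dual))  # True: the two solutions coincide
```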

The big picture

  • For a Hilbert space \(\calF\) and \(n\) pairs \((x_i,y_i)\in\calF\times \bbR\), we know how to solve the following problem: \[ \min_{f\in\calF}\sum_{i=1}^n\abs{y_i-{\dotp{f}{x_i}}_{\calF}}^2+\lambda\norm[\calF]{f}^2 \]

  • We would really like to solve the following problem for \(n\) pairs \((\bfx_i,y_i)\in\bbR^d\times\bbR\) \[ \min_{f\in\calF}\sum_{i=1}^n\abs{y_i-f(\bfx_i)}^2+\lambda\norm[\calF]{f}^2 \]

  • The question is whether \(f(\bfx_i) = {\dotp{f}{x_i}}_{\calF}\) for some \(x_i\in\calF\) that is a function of \(\bfx_i\): can this be done?

  • Reproducing Kernel Hilbert Spaces (RKHSs) are specific Hilbert spaces where this happens to be true
    • Hilbert spaces of functions in which the sampling operation \(f\mapsto f(\bftau)\) is a continuous linear functional
  • As usual, we're throwing definitions at our problem to make progress

Linear functionals on Hilbert spaces

  • In what follows, \(\calF\) is a Hilbert space with scalar field \(\bbR\)

A functional \(F:\calF\to\bbR\) associates a real number to an element of a Hilbert space \(\calF\)

  • Notation can be tricky when the Hilbert space is a space of functions: \(F\) can act on a function \(f\in\calF\)

A functional \(F:\calF\to\bbR\) is continuous at \(x\in\calF\) if \[ \forall \epsilon>0\exists\delta>0\textsf{ such that } \norm[\calF]{x-y}\leq \delta\Rightarrow \abs{F(x)-F(y)}\leq\epsilon \] If this is true for every \(x\in\calF\), \(F\) is continuous.

  1. All norms are continuous functionals
  2. For \(c\in\calF\), the functional \(F_c:\calF\to\bbR:x\mapsto\dotp{x}{c}\) is continuous

Continuous linear functionals on Hilbert spaces

A functional \(F\) is linear if \(\forall a,b\in\bbR\) \(\forall x,y\in\calF\) \(F(ax+by) = aF(x)+bF(y)\).

  • Continuous linear functionals are much more constrained than one might imagine

A linear functional \(F:\calF\to\bbR\) is bounded if there exists \(M>0\) such that \[ \forall x\in\calF\quad\abs{F(x)}\leq M\norm[\calF]{x} \]

A linear functional on a Hilbert space that is continuous at \(0\) is bounded.

For a linear functional \(F:\calF\to\bbR\), the following statements are equivalent:

  1. \(F\) is continuous at 0
  2. \(F\) is continuous at some point \(x\in\calF\)
  3. \(F\) is continuous everywhere on \(\calF\)
  4. \(F\) is uniformly continuous everywhere on \(\calF\)

Representation of (continuous) linear functionals

Let \(F:\calF\to\bbR\) be a linear functional on an \(n\)-dimensional Hilbert space \(\calF\).

There exists \(c\in\calF\) such that \(F(x)=\dotp{x}{c}\) for every \(x\in\calF\)

  • Linear functionals over finite-dimensional Hilbert spaces are continuous!

  • This is not true in infinite dimension

Let \(F:\calF\to\bbR\) be a continuous linear functional on a (possibly infinite-dimensional) separable Hilbert space \(\calF\).

There exists \(c\in\calF\) such that \(F(x)=\dotp{x}{c}\) for every \(x\in\calF\)

If \(\set{\psi_n}_{n\geq 1}\) is an orthobasis for \(\calF\), then we can construct \(c\) above as \[ c\eqdef \sum_{n=1}^\infty F(\psi_n)\psi_n \]
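For example (an illustrative functional, not from the slides): in \(\calF=\bbR^3\) with the standard basis \(\set{\bfe_n}_{n=1}^3\), the functional \(F(\bfx)=2x_1-x_3\) has representer \[ c = F(\bfe_1)\bfe_1+F(\bfe_2)\bfe_2+F(\bfe_3)\bfe_3 = (2,0,-1)^\intercal, \] and one checks that \(F(\bfx)=\dotp{\bfx}{c}\) for every \(\bfx\in\bbR^3\).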

Reproducing Kernel Hilbert Spaces

An RKHS is a Hilbert space \(\calH\) of real-valued functions \(f:\bbR^d\to\bbR\) in which the sampling operation \(\calS_\bftau:\calH\to\bbR:f\mapsto f(\bftau)\) is continuous for every \(\bftau\in\bbR^d\).

In other words, for each \(\bftau\in\bbR^d\), there exists \(k_\bftau\in\calH\) s.t. \[ f(\bftau) = {\dotp{f}{k_\bftau}}_\calH\text{ for all } f\in\calH \]

The kernel of an RKHS is \[ k:\bbR^d\times\bbR^d\to\bbR:(\bft,\bftau)\mapsto k_{\bftau}(\bft) \] where \(k_\bftau\) is the element of \(\calH\) that defines the sampling at \(\bftau\).

A (separable) Hilbert space with orthobasis \(\set{\psi_n}_{n\geq 1}\) is an RKHS iff \(\forall \bftau\in\bbR^d\) \(\sum_{n=1}^\infty\abs{\psi_{n}(\bftau)}^2<\infty\)

RKHS and non-orthogonal bases

  • If \(\set{\phi_n}_{n\geq 1}\) is a Riesz basis for \(\calH\), we know that every \(x\in\calH\) can be written \[ x = \sum_{n\geq 1}\alpha_n\phi_n\textsf{ with } \alpha_n\eqdef\dotp{x}{\smash{\widetilde{\phi}_n}} \] where \(\set{\widetilde{\phi}_n}_{n\geq 1}\) is the dual basis.

A (separable) Hilbert space with Riesz basis \(\set{\phi_n}_{n\geq 1}\) is an RKHS with kernel \[ k(\bft,\bftau) =\sum_{n=1}^\infty \phi_n(\bftau)\widetilde{\phi}_n(\bft) \] iff \(\forall \bftau\in\bbR^d\) \(\sum_{n=1}^\infty\abs{\phi_{n}(\bftau)}^2<\infty\)

Kernel regression

  • Regression problem: given \(n\) pairs \((\bfx_i,y_i)\in\bbR^d\times\bbR\), solve \[ \min_{f\in\calF}\sum_{i=1}^n\abs{y_i-f(\bfx_i)}^2+\lambda\norm[\calF]{f}^2 \]

  • If we restrict \(\calF\) to be an RKHS, the problem becomes \[ \min_{f\in\calF}\sum_{i=1}^n\abs{y_i-{\dotp{f}{x_i}}_{\calF}}^2+\lambda\norm[\calF]{f}^2 \]

    where \(x_i\eqdef k_{\bfx_i}\) provides the mapping between \(\bbR^d\) and \(\calF\) \[ x_i:\bbR^d\to\bbR:\bft\mapsto k_{\bfx_i}(\bft) = k(\bfx_i,\bft) \]

  • The solution is given by \(\widehat{f} = \sum_{i=1}^n \widehat{\alpha}_i x_i\textsf{ with }\widehat{\bfalpha}\eqdef (\bfK+\lambda\bfI)^{-1}\bfy\) and \(\bfK\eqdef[K_{i,j}]_{1\leq i,j\leq n}\) with \(K_{i,j}=\dotp{x_i}{x_j}\)

Kernel regression

  • Kernel magic
    1. \(K_{i,j} = \dotp{x_i}{x_j}=\dotp{k_{\bfx_i}}{k_{\bfx_j}} = k_{\bfx_i}(\bfx_j) = k(\bfx_i,\bfx_j)\)
    2. \(\widehat{f}(\bfx) = \dotp{\widehat{f}}{k_{\bfx}} = \sum_{i=1}^n\widehat{\alpha}_i k(\bfx_i,\bfx)\)
  • Remarks
    • We solved an infinite dimensional problem using an \(n\times n\) system of equations and linear algebra
    • Our solution and the evaluation only depend on the kernel; we never need to work directly in \(\calF\)
  • Question: can we skip \(\calF\) entirely? How do we find "good" kernels? (A numerical sketch follows below.)
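A minimal kernel ridge regression sketch in the spirit of these remarks, assuming toy data and an RBF kernel (illustrative choices, not from the lecture):

```python
import numpy as np

# Mirrors the formulas above: alpha_hat = (K + lam I)^{-1} y and
# f_hat(x) = sum_i alpha_hat_i k(x_i, x).
rng = np.random.default_rng(0)
n, d, lam, sigma2 = 40, 2, 1e-1, 0.5

X = rng.uniform(-1, 1, size=(n, d))              # training inputs x_i in R^d
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

def rbf_kernel(U, V, sigma2):
    """k(u, v) = exp(-||u - v||^2 / (2 sigma^2)), evaluated for all pairs."""
    sq = ((U[:, None, :] - V[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma2))

K = rbf_kernel(X, X, sigma2)                     # K_ij = k(x_i, x_j)
alpha = np.linalg.solve(K + lam * np.eye(n), y)  # alpha_hat = (K + lam I)^{-1} y

# Evaluate f_hat on new points: only kernel evaluations are needed
X_new = rng.uniform(-1, 1, size=(5, d))
f_new = rbf_kernel(X_new, X, sigma2) @ alpha     # f_hat(x) = sum_i alpha_i k(x_i, x)
print(f_new)
```

Note that fitting and prediction use only kernel evaluations \(k(\bfx_i,\bfx_j)\); the space \(\calF\) never appears explicitly.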

Aronszajn's theorem

An inner product kernel is a mapping \(k:\bbR^d\times\bbR^d\to\bbR\) for which there exists a Hilbert space \(\calH\) and a mapping \(\Phi:\bbR^d\to\calH\) such that \[\forall \bfu,\bfv\in\bbR^d\quad k(\bfu,\bfv)=\langle\Phi(\bfu),\Phi(\bfv)\rangle_\calH\]

A function \(k:\bbR^d\times\bbR^d\to\bbR\) is a positive semidefinite kernel if

  1. \(k\) is symmetric, i.e., \(k(\bfu,\bfv)=k(\bfv,\bfu)\)
  2. for all \(\{\bfx_i\}_{i=1}^N\), the Gram matrix \(\bfK\) is positive semidefinite, i.e., \[\forall\bfa\in\bbR^N\quad\bfa^\intercal\bfK\bfa\geq 0\text{ with }\bfK=[K_{i,j}]\text{ and }K_{i,j}\eqdef k(\bfx_i,\bfx_j)\]

A function \(k:\bbR^d\times\bbR^d\to\bbR\) is an inner product kernel if and only if \(k\) is a positive semidefinite kernel.

Examples

Regression using linear and quadratic functions in \(\bbR^d\)

Regression using Radial Basis Functions

  • Examples of kernels
    • Homogeneous polynomial kernel: \(k(\bfu,\bfv) = (\bfu^\intercal\bfv)^m\) with \(m\in\bbN^*\)
    • Inhomogeneous polynomial kernel: \(k(\bfu,\bfv) = (\bfu^\intercal\bfv+c)^m\) with \(c>0\), \(m\in\bbN^*\)
    • Radial basis function (RBF) kernel: \(k(\bfu,\bfv) = \exp\left(-\frac{\norm{\bfu-\bfv}^2}{2\sigma^2}\right)\) with \(\sigma^2>0\)
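A quick numerical illustration of the kernels above on assumed random data: per Aronszajn's theorem, each Gram matrix should be positive semidefinite, which we can check via its smallest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))          # a set {x_i} in R^d (illustrative data)

def poly_hom(U, V, m=3):                  # homogeneous polynomial kernel
    return (U @ V.T) ** m

def poly_inhom(U, V, c=1.0, m=3):         # inhomogeneous polynomial kernel
    return (U @ V.T + c) ** m

def rbf(U, V, sigma2=1.0):                # radial basis function kernel
    sq = ((U[:, None, :] - V[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma2))

for k in (poly_hom, poly_inhom, rbf):
    K = k(X, X)                           # Gram matrix K_ij = k(x_i, x_j)
    eigmin = np.linalg.eigvalsh(K).min()
    print(k.__name__, eigmin >= -1e-8)    # smallest eigenvalue is (numerically) >= 0
```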

Next time

  • Next class: Wednesday September 25, 2024
  • To be effectively prepared for next class, you should:
    1. Go over today's slides and read the associated lecture notes
    2. Work on Homework 3
  • Optional
    • Export slides for next lecture as PDF (be on the lookout for an announcement when they're ready)