Gaussian Processes

Matthieu Bloch

Thursday, November 17, 2022

Today in ECE 6555

  • Announcements
    • 5 lectures left (including today)
    • Class on Tuesday November 22, 2022 will be online
  • Last time
    • Gaussian processes - prediction
  • Today
    • More Gaussian Processes
  • Questions?

Last time: Schur complement

  • Suppose that \[ \matM = \left[\begin{array}{cc}\matM_{11}&\matM_{12}\\\matM_{21}&\matM_{22}\end{array}\right]. \] If \(\matM_{22}\) is invertible, the Schur complement of \(\matM_{22}\) in \(\matM\) is \(\matS_{22} \eqdef \matM_{11}-\matM_{12}\matM_{22}^{-1}\matM_{21}\).

    If \(\matM_{11}\) is invertible, the Schur complement of \(\matM_{11}\) in \(\matM\) is \(\matS_{11} \eqdef \matM_{22}-\matM_{21}\matM_{11}^{-1}\matM_{12}\).

  • \[ \matM^{-1} = \left[\begin{array}{cc}\matS_{22}^{-1}&-\matS_{22}^{-1}\matM_{12}\matM_{22}^{-1}\\-\matM_{22}^{-1}\matM_{21}\matS_{22}^{-1}&\matM_{22}^{-1}+\matM_{22}^{-1}\matM_{21}\matS_{22}^{-1}\matM_{12}\matM_{22}^{-1}\end{array}\right]. \] \[ \matM^{-1} = \left[\begin{array}{cc}\matM_{11}^{-1}+\matM_{11}^{-1}\matM_{12}\matS_{11}^{-1}\matM_{21}\matM_{11}^{-1}&-\matM_{11}^{-1}\matM_{12}\matS_{11}^{-1}\\-\matS_{11}^{-1}\matM_{21}\matM_{11}^{-1}&\matS_{11}^{-1}\end{array}\right]. \]
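  • A quick numerical sanity check of the first block-inverse formula (a minimal numpy sketch; the matrix and the block split are arbitrary choices, not anything specific to the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a well-conditioned symmetric positive definite M and split it into blocks
A = rng.standard_normal((5, 5))
M = A @ A.T + 5 * np.eye(5)
M11, M12 = M[:2, :2], M[:2, 2:]
M21, M22 = M[2:, :2], M[2:, 2:]

# Schur complement of M22 in M: S22 = M11 - M12 M22^{-1} M21
S22 = M11 - M12 @ np.linalg.solve(M22, M21)

# Assemble M^{-1} block by block from the first formula on the slide
S22inv, M22inv = np.linalg.inv(S22), np.linalg.inv(M22)
Minv = np.block([
    [S22inv, -S22inv @ M12 @ M22inv],
    [-M22inv @ M21 @ S22inv, M22inv + M22inv @ M21 @ S22inv @ M12 @ M22inv],
])

assert np.allclose(Minv, np.linalg.inv(M))  # matches the direct inverse
```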

Last time: Predicting unknown function values

  • \(n\) training pairs \(\set{(\vecx_i,f_i)}_{i=1}^n\) with inputs collected in \(X\); \(n^*\) test pairs \(\set{(\vecx_j^*,f_j^*)}_{j=1}^{n^*}\) with inputs collected in \(X^*\)
  • Assuming a zero-mean Gaussian process, the joint distribution is \[ \left[\begin{array}{c}\vecf\\\vecf^*\end{array}\right]\sim\calN\left(\boldsymbol{0},\left[\begin{array}{cc}\matK(X,X)&\matK(X,X^*)\\\matK(X^*,X)&\matK(X^*,X^*)\end{array}\right]\right) \]
  • The distribution of \(\vecf^*\) conditioned on \(X\), \(X^*\) and \(\vecf\) is \[ \calN\left(\matK(X^*,X)\matK(X,X)^{-1}\vecf,\matK(X^*,X^*)-\matK(X^*,X)\matK(X,X)^{-1}\matK(X,X^*)\right) \] Hence we can estimate \(\vecf^*\) as \(\matK(X^*,X)\matK(X,X)^{-1}\vecf\) and we can quantify our uncertainty.
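  • Note the connection to the Schur complement of the previous slide: taking \(\matM_{11}=\matK(X,X)\), \(\matM_{12}=\matK(X,X^*)\), \(\matM_{21}=\matK(X^*,X)\), and \(\matM_{22}=\matK(X^*,X^*)\), the conditional covariance above is exactly \(\matS_{11}=\matM_{22}-\matM_{21}\matM_{11}^{-1}\matM_{12}\); conditioning a Gaussian amounts to computing a Schur complement.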

Predicting unknown function values with noise

  • Often, we only observe \(y = f(x)+{\varepsilon}\) with \({\varepsilon}\sim\calN(0,\sigma^2)\) (i.i.d. across different measurements)

    • In vector form, we observe \(\vecy = \vecf+\boldsymbol{\varepsilon}\) with \(\boldsymbol{\varepsilon}\sim\calN(\boldsymbol{0},\sigma^2\matI)\), so that \[ \left[\begin{array}{c}\vecy\\\vecf^*\end{array}\right]\sim\calN\left(\boldsymbol{0},\left[\begin{array}{cc}\matK(X,X)+\sigma^2\matI&\matK(X,X^*)\\\matK(X^*,X)&\matK(X^*,X^*)\end{array}\right]\right) \]

  • The distribution of \(\vecf^*\) conditioned on \(X\), \(X^*\) and \(\vecy\) is \[ \calN\left(\matK(X^*,X)(\matK(X,X)+\sigma^2\matI)^{-1}\vecy,\matK(X^*,X^*)-\matK(X^*,X)(\matK(X,X)+\sigma^2\matI)^{-1}\matK(X,X^*)\right) \] Hence we can estimate \(\vecf^*\) as \(\matK(X^*,X)(\matK(X,X)+\sigma^2\matI)^{-1}\vecy\) and we can quantify our uncertainty.

  • Small simplification: for a single test point \(x^*\), with \(\veck_*\eqdef[k(x_1,x^*),\dots,k(x_n,x^*)]^T\) and \(\matK\eqdef\matK(X,X)\), \[ f^* = \veck_*^T(\matK + \sigma^2\matI)^{-1}\vecy\qquad \sigma_{f^*}^2 = k(x^*,x^*) - \veck_*^T(\matK+\sigma^2\matI)^{-1}\veck_* \]

    • Note that we can view this as a linear predictor \(f^* = \sum_{i=1}^n\alpha_ik(x_i,x^*)\) with \(\bfalpha \eqdef (\matK+\sigma^2\matI)^{-1}\vecy\); see the numerical sketch below
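  • A minimal numpy sketch of these posterior formulas (the squared-exponential kernel and the toy data are illustrative assumptions):

```python
import numpy as np

def rbf_kernel(A, B, ell=1.0):
    # Squared-exponential kernel k(x, x') = exp(-(x - x')^2 / (2 ell^2)), 1-D inputs
    return np.exp(-((A[:, None] - B[None, :]) ** 2) / (2 * ell**2))

# Toy data: noisy samples of an unknown function (hypothetical example)
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 10)
sigma2 = 0.05
y = np.sin(X) + np.sqrt(sigma2) * rng.standard_normal(X.size)
Xstar = np.linspace(-4, 4, 200)

K = rbf_kernel(X, X)             # K(X, X)
Kstar = rbf_kernel(Xstar, X)     # K(X*, X)
Kss = rbf_kernel(Xstar, Xstar)   # K(X*, X*)

# Posterior mean and covariance; solve linear systems rather than forming inverses
alpha = np.linalg.solve(K + sigma2 * np.eye(X.size), y)
mean = Kstar @ alpha             # K(X*,X)(K(X,X) + sigma^2 I)^{-1} y
cov = Kss - Kstar @ np.linalg.solve(K + sigma2 * np.eye(X.size), Kstar.T)
std = np.sqrt(np.clip(np.diag(cov), 0, None))   # pointwise uncertainty
```

    Taking \(\sigma^2\to 0\) (with \(\matK(X,X)\) invertible) recovers the noise-free predictor of the previous slide.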

Remarks

  • Non zero-mean Gaussian processes
    • Predict \(\vecf^*\) as
    \[ \vecm(X^*) + \matK(X^*,X)(\matK(X,X)+\sigma^2\matI)^{-1}(\vecy-\vecm(X)) \]
    • Variance is still \[ \matK(X^*,X^*)-\matK(X^*,X)(\matK(X,X)+\sigma^2\matI)^{-1}\matK(X,X^*) \]
  • Do we need the probabilistic modeling?
    • Recall our two approaches to the Kalman filter
    • We can think of GP modeling completely deterministically using an appropriate space of functions \(\calH\) \[ \min_{f\in\calH}\sum_{i=1}^n\abs{y_i-f(\vecx_i)}^2+\lambda\norm[\calH]{f}^2 \]
    • What is this space of functions? A Reproducing Kernel Hilbert Space (RKHS)

Vector spaces

  • A vector space \(\calV\) over a field \(\bbK\) consists of a set \(\calV\) of vectors, a closed addition rule \(+\) and a closed scalar multiplication \(\cdot\) such that 8 axioms are satisfied:
    1. \(\forall x,y\in\calV\) \(x+y=y+x\) (commutativity)
    2. \(\forall x,y,z\in\calV\) \(x+(y+z)=(x+y)+z\) (associativity)
    3. \(\exists 0\in\calV\) such that \(\forall x\in\calV\) \(x+0=x\) (identity element)
    4. \(\forall x\in\calV\) \(\exists y\in\calV\) such that \(x+y=0\) (inverse element)
    5. \(\forall x\in\calV\) \(1\cdot x= x\) (scalar identity)
    6. \(\forall \alpha, \beta\in\bbK\) \(\forall x\in\calV\) \(\alpha\cdot(\beta\cdot x)=(\alpha\cdot\beta)\cdot x\) (associativity)
    7. \(\forall \alpha, \beta\in\bbK\) \(\forall x\in\calV\) \((\alpha+\beta)x = \alpha x+\beta x\) (distributivity)
    8. \(\forall \alpha\in\bbK\) \(\forall x,y\in\calV\) \(\alpha(x+y) = \alpha x+\alpha y\) (distributivity)
    • \(0\in\calV\) is unique
    • Every \(x\in\calV\) has a unique inverse
    • \(0\cdot x = 0\)
    • The inverse of \(x\in\calV\) is \((-1)\cdot x\eqdef -x\)

Subspaces and span

  • A subset \(\calW\neq\emptyset\) of a vector space \(\calV\) is a vector subspace if \(\forall x,y\in\calW\) \(\forall \lambda,\mu\in\bbK\) \(\lambda x+\mu y \in\calW\)

  • If \(\calW\) is a vector subspace of a vector space \(\calV\), \(\calW\) is a vector space.

  • Let \(\set{v_i}_{i=1}^n\) be a set of vectors in a vector space \(\calV\).

  • For \(\set{a_i}_{i=1}^n\in\bbK^n\), \(\sum_{i=1}^na_iv_i\) is called a linear combination of the vectors \(\set{v_i}_{i=1}^n\).

  • The span of the vectors \(\set{v_i}_{i=1}^n\) is the set \[ \text{span}(\set{v_i}_{i=1}^n)\eqdef \{\sum_{i=1}^na_iv_i:\set{a_i}_{i=1}^n\in\bbK^n\} \]

  • The span of the vectors \(\set{v_i}_{i=1}^n\in\calV^n\) is a vector subspace of \(\calV\).

Linear independence

  • Let \(\set{v_i}_{i=1}^n\) be a set of vectors in a vector space \(\calV\)

  • \(\set{v_i}_{i=1}^n\) is linearly independent (or the vectors \(\set{v_i}_{i=1}^n\) are linearly independent) if (and only if) \[ \sum_{i=1}^na_iv_i = 0\Rightarrow \forall i\in\intseq{1}{n}\,a_i=0 \] Otherwise the set is (or the vectors are) linearly dependent.

  • Any set of linearly dependent vectors contains a subset of linearly independent vectors with the same span.

Bases

  • A basis of a vector subspace \(\calW\) of a vector space \(\calV\) is a countable set of vectors \(\calB\) such that:
    1. \(\text{span}(\calB)=\calW\)
    2. \(\calB\) is linearly independent
  • If a nontrivial vector space \(\calV\neq \set{0}\) has a finite basis with \(n\in\bbN^*\) elements, \(n\) is called the dimension of \(\calV\), denoted \(\dim{\calV}\). If every basis has an infinite number of elements, the dimension is infinite.

  • Any two bases for the same finite dimensional vector space contain the same number of elements.

  • You should be somewhat familiar with bases (at least in \(\bbR^n\)):
    • the representation of a vector on a basis is unique
    • every subspace has a basis
    • having a basis reduces the operations on vectors to operations on their components
  • Things sort of work in infinite dimensions, but we have to be a bit more careful

Norm and inner product

  • The properties of vector spaces seen thus far provide an algebraic structure

  • We are missing a topological structure to measure length and distance

  • A norm on a vector space \(\calV\) over \(\bbR\) is a function \(\norm{\cdot}:\calV\to\bbR\) that satisfies:
    • Positive definiteness: \(\forall x\in\calV\) \(\norm{x}\geq 0\) with equality iff \(x=0\)
    • Homogeneity: \(\forall x\in\calV\) \(\forall\alpha\in\bbR\) \(\norm{\alpha x}=\abs{\alpha}\norm{x}\)
    • Subadditivity: \(\forall x,y\in\calV\) \(\norm{x+y}\leq \norm{x}+\norm{y}\)
  • \(\norm{x}\) measures a length, \(\norm{x-y}\) measures a distance

  • In addition to a topological and algebraic structure, what if we want to do geometry?

  • An inner product space over \(\bbR\) is a vector space \(\calV\) equipped with a positive definite symmetric bilinear form \(\dotp{\cdot}{\cdot}:\calV\times\calV\to\bbR\) called an inner product

  • An inner product space is also called a pre-Hilbert space

Induced norm

  • In an inner product space, an inner product induces a norm \(\norm{x} \eqdef \sqrt{\dotp{x}{x}}\)

  • A norm \(\norm{\cdot}\) is induced by an inner product on \(\calV\) iff \(\forall x,y\in\calV\) \(\norm{x}^2+\norm{y}^2 = \frac{1}{2}\left(\norm{x+y}^2+\norm{x-y}^2\right)\) If this is the case, the inner product is given by the polarization identity \[\dotp{x}{y}=\frac{1}{2}\left(\norm{x}^2+\norm{y}^2-\norm{x-y}^2\right)\]

  • Induced norms have some nice additional properties
  • An induced norm satisfies \(\forall x,y\in\calV\) \(\norm{x+y}\leq \norm{x}+\norm{y}\)

  • An inner product satisfies \(\forall x,y\in\calV\) \(\dotp{x}{y}^2\leq\dotp{x}{x}\dotp{y}{y}\) (Cauchy-Schwarz inequality)
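  • A minimal numerical check (hypothetical vectors) that the Euclidean norm, which is induced by the dot product, satisfies the parallelogram law, and that the polarization identity recovers the inner product from norms alone:

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.standard_normal(4), rng.standard_normal(4)
n = np.linalg.norm

# Parallelogram law: ||x||^2 + ||y||^2 = (||x+y||^2 + ||x-y||^2) / 2
assert np.isclose(n(x)**2 + n(y)**2, 0.5 * (n(x + y)**2 + n(x - y)**2))
# Polarization identity: <x,y> = (||x||^2 + ||y||^2 - ||x-y||^2) / 2
assert np.isclose(x @ y, 0.5 * (n(x)**2 + n(y)**2 - n(x - y)**2))
# Cauchy-Schwarz: <x,y>^2 <= <x,x><y,y>
assert (x @ y) ** 2 <= (x @ x) * (y @ y)
```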

Orthogonality

  • In the following \(\calV\) is an inner product space with induced norm \(\norm{\cdot}\)
  • The angle between two non-zero vectors \(x,y\in\calV\) is \[ \cos\theta \eqdef \frac{\dotp{x}{y}}{\norm{x}\norm{y}} \]

  • Two vectors \(x,y\in\calV\) are orthogonal if \(\dotp{x}{y}=0\). We write \(x\perp y\) for simplicity.

    A vector \(x\in\calV\) is orthogonal to a set \(\calS\subset\calV\) if \(\forall s\in\calS\) \(\dotp{x}{s}=0\). We write \(x\perp \calS\) for simplicity.

  • If \(x\perp y\) then \(\norm{x+y}^2=\norm{x}^2+\norm{y}^2\)

  • Inner product spaces have almost all the properties of \(\bbR^n\)

Hilbert spaces

  • In infinite dimensions, things are a little bit tricky. What does the following mean? \[ x(t) = \sum_{n=1}^\infty \alpha_n\psi_n(t) \]

  • We need to define a notion of convergence, e.g., \[ \lim_{N\to\infty}\norm{x(t)-\sum_{n=1}^N \alpha_n\psi_n(t)}=0 \]

  • Problems can still arise if "points are missing"; we avoid this by introducing the notion of completeness

  • An inner product space \(\calV\) is complete if every Cauchy sequence converges, i.e., for every \(\set{x_i}_{i\geq1}\) in \(\calV\) \[ \lim_{\min(m,n)\to\infty}\norm{x_m-x_n}=0\Rightarrow \lim_{n\to\infty}x_n = x^*\in\calV. \]

  • We won't worry too much about proving that spaces are complete

  • A complete normed vector space is a Banach space; a complete inner product space is a Hilbert space

Orthogonality principle

  • Let \(\calH\) be a Hilbert space with inner product \(\dotp{\cdot}{\cdot}\) and induced norm \(\norm{\cdot}\); let \(\calT\) be a subspace of \(\calH\)

  • For \(x\in\calH\), what is the closest point \(\hat{x}\in\calT\)? How do we solve \(\min_{y\in\calT}\norm{x-y}\)?

  • This problem has a unique solution given by the orthogonality principle

  • Let \(\calX\) be a pre-Hilbert space, \(\calT\) be a subspace of \(\calX\), and \(x\in\calX\).

    If there exists a vector \(m^*\in\calT\) such that \(\forall m\in\calT\) \(\norm{x-m^*}\leq \norm{x-m}\), then \(m^*\) is unique.

    \(m^*\in\calT\) is the unique minimizer if and only if the error \(x-m^*\) is orthogonal to \(\calT\).

  • This doesn't say that \(m^*\) exists!

  • Let \(\calH\) be a Hilbert space, \(\calT\) be a closed subspace of \(\calH\), and \(x\in\calH\).

    There exists a unique vector \(m^*\in\calT\) such that \(\forall m\in\calT\) \(\norm{x-m^*}\leq \norm{x-m}\).

    \(m^*\in\calT\) is the unique minimizer if and only if the error \(x-m^*\) is orthogonal to \(\calT\).
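  • A small numpy illustration of the orthogonality principle (hypothetical example: \(\calT\) is the column span of a random matrix, and least squares computes the projection):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 3))   # columns span the subspace T of R^6
x = rng.standard_normal(6)

# Least squares finds m* = A c minimizing ||x - A c||
c, *_ = np.linalg.lstsq(A, x, rcond=None)
m_star = A @ c

# Orthogonality principle: the error x - m* is orthogonal to T
assert np.allclose(A.T @ (x - m_star), 0)
```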

Orthobases

  • A collection of vectors \(\set{v_i}_{i=1}^n\) in a finite dimensional Hilbert space \(\calH\) is an orthobasis if 1) \(\text{span}(\set{v_i}_{i=1}^n)=\calH\); 2) \(\forall i\neq j\in\intseq{1}{n}\,v_i\perp v_j\); 3) \(\forall i\in\intseq{1}{n} \,\norm{v_i}=1\).

  • If the last condition is not met, this is just called an orthogonal basis

  • Orthobases are useful because we can write \(x=\sum_{i=1}^n\dotp{x}{v_i}v_i\) (what happens in a non-orthonormal basis?)

  • We would like to extend this idea to infinite dimensions and happily write \(x=\sum_{i=1}^\infty\dotp{x}{v_i}v_i\)

    • We have to be a bit careful
    • With a little bit of machinery, this works: separability + completeness

Separable space

  • A space is separable if it contains a countable dense subset.

  • Separability is the key property to deal with sequences instead of collections

  • Any separable Hilbert space has an orthonormal basis.

  • Most useful Hilbert spaces are separable! We won't worry about non-separable Hilbert spaces

  • Key take away for separable Hilbert spaces

    • \(x=\sum_{i=1}^\infty\dotp{x}{v_i}v_i\) is perfectly well defined for an orthonormal basis
    • Parseval's identity tells us that \(\norm{x}^2=\sum_{k\geq 1}\abs{\dotp{x}{v_k}}^2\).
    • It looks like we don't even need to worry about the nature of \(\calH\) and can think only about the coefficients \(\dotp{x}{v_i}\)
  • Any separable Hilbert space is isomorphic to \(\ell_2\)
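  • A finite-dimensional stand-in for these take-aways (a hypothetical example: an orthonormal basis of \(\bbR^5\) from a QR factorization):

```python
import numpy as np

rng = np.random.default_rng(3)
V, _ = np.linalg.qr(rng.standard_normal((5, 5)))   # columns form an orthobasis
x = rng.standard_normal(5)

coeffs = V.T @ x                              # <x, v_i> for each basis vector
assert np.allclose(V @ coeffs, x)             # x = sum_i <x, v_i> v_i
assert np.isclose(np.sum(coeffs**2), x @ x)   # Parseval: ||x||^2 = sum_i |<x, v_i>|^2
```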

Linear functionals on Hilbert spaces

  • In what follows, \(\calF\) is a Hilbert space with scalar field \(\bbR\)
  • A functional \(F:\calF\to\bbR\) associates a real number with each element of a Hilbert space \(\calF\)

  • Notation can be tricky when the Hilbert space is a space of functions: \(F\) can act on a function \(f\in\calF\)
  • A functional \(F:\calF\to\bbR\) is continuous at \(x\in\calF\) if \[ \forall \epsilon>0\exists\delta>0\textsf{ such that } \norm[\calF]{x-y}\leq \delta\Rightarrow \abs{F(x)-F(y)}\leq\epsilon \] If this is true for every \(x\in\calF\), \(F\) is continuous.

    1. All norms are continuous functionals
    2. \(F:\calF\to\bbR:x\mapsto\dotp{x}{c}\) for some \(c\in\calF\) is continuous

Continuous linear functionals on Hilbert spaces

  • A functional \(F\) is linear if \(\forall a,b\in\bbR\) \(\forall x,y\in\calF\) \(F(ax+by) = aF(x)+bF(y)\).

  • Continuous linear functionals are much more constrained than one would imagine

  • A linear functional \(F:\calF\to\bbR\) is bounded if there exists \(M>0\) such that \[ \forall x\in\calF\quad\abs{F(x)}\leq M\norm[\calF]{x} \]

  • A linear functional on a Hilbert space that is continuous at \(0\) is bounded.

  • For a linear functional \(F:\calF\to\bbR\), the following statements are equivalent:
    1. \(F\) is continuous at 0
    2. \(F\) is continuous at some point \(x\in\calF\)
    3. \(F\) is continuous everywhere on \(\calF\)
    4. \(F\) is uniformly continuous everywhere on \(\calF\)

Representing (continuous) linear functionals

  • Let \(F:\calF\to\bbR\) be a linear functional on an \(n\)-dimensional Hilbert space \(\calF\).

    Then there exists \(c\in\calF\) such that \(F(x)=\dotp{x}{c}\) for every \(x\in\calF\)

  • Linear functionals over finite-dimensional Hilbert spaces are always continuous!

  • This is not true in infinite dimensions

  • Let \(F:\calF\to\bbR\) be a continuous linear functional on a (possibly infinite-dimensional) separable Hilbert space \(\calF\).

    Then there exists \(c\in\calF\) such that \(F(x)=\dotp{x}{c}\) for every \(x\in\calF\)

  • If \(\set{\psi_n}_{n\geq 1}\) is an orthobasis for \(\calF\), then we can construct \(c\) above as \[ c\eqdef \sum_{n=1}^\infty F(\psi_n)\psi_n \]
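  • A sketch of this construction in \(\bbR^4\) (hypothetical example; the standard basis plays the role of the orthobasis \(\set{\psi_n}\)):

```python
import numpy as np

rng = np.random.default_rng(4)
w = rng.standard_normal(4)
F = lambda x: w @ x        # an arbitrary linear functional on R^4

E = np.eye(4)              # columns e_1, ..., e_4 form an orthobasis
c = sum(F(E[:, i]) * E[:, i] for i in range(4))   # c = sum_n F(psi_n) psi_n

x = rng.standard_normal(4)
assert np.isclose(F(x), x @ c)   # Riesz representation: F(x) = <x, c>
```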

Reproducing Kernel Hilbert Spaces

  • An RKHS is a Hilbert space \(\calH\) of real-valued functions \(f:\bbR^d\to\bbR\) in which the sampling operation \(\calS_\bftau:\calH\to\bbR:f\mapsto f(\bftau)\) is continuous for every \(\bftau\in\bbR^d\).

    In other words (this is the Riesz representation theorem at work), for each \(\bftau\in\bbR^d\), there exists \(k_\bftau\in\calH\) s.t. \[ f(\bftau) = {\dotp{f}{k_\bftau}}_\calH\text{ for all } f\in\calH \]

  • The kernel of an RKHS is \[ k:\bbR^d\times\bbR^d\to\bbR:(\bft,\bftau)\mapsto k_{\bftau}(\bft) \] where \(k_\bftau\) is the element of \(\calH\) that defines the sampling at \(\bftau\).

  • A (separable) Hilbert space with orthobasis \(\set{\psi_n}_{n\geq 1}\) is an RKHS iff \(\forall \bftau\in\bbR^d\) \(\sum_{n=1}^\infty\abs{\psi_{n}(\bftau)}^2<\infty\)

Representer theorem

  • An RKHS is just the right space to solve our problem

  • If \(\calH\) is an RKHS, then \[ \min_{f\in\calH}\sum_{i=1}^n\abs{y_i-f(\vecx_i)}^2+\lambda\norm[\calH]{f}^2 \] has solution \[ f = \sum_{i=1}^n\alpha_i k_{\vecx_i}\textsf{ with } \bfalpha = (\matK+\lambda\matI)^{-1}\vecy\qquad \matK\eqdef\left[k(\vecx_i,\vecx_j)\right]_{1\leq i,j\leq n} \]
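  • A minimal sketch of the resulting kernel ridge regression (reusing the squared-exponential kernel and toy data assumed earlier):

```python
import numpy as np

def rbf_kernel(A, B, ell=1.0):
    return np.exp(-((A[:, None] - B[None, :]) ** 2) / (2 * ell**2))

rng = np.random.default_rng(5)
X = np.linspace(-3, 3, 10)
y = np.sin(X) + 0.2 * rng.standard_normal(X.size)
lam = 0.05

K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(X.size), y)   # alpha = (K + lam I)^{-1} y

# Evaluate the minimizer f = sum_i alpha_i k_{x_i} at new points
Xstar = np.linspace(-4, 4, 200)
f_star = rbf_kernel(Xstar, X) @ alpha
```

    With \(\lambda=\sigma^2\) this is exactly the GP posterior mean from earlier: the deterministic RKHS view and the probabilistic view agree on the point estimate.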