Gaussian Processes

Matthieu Bloch

Thursday, November 17, 2022

Today in ECE 6555

  • Announcements
    • 5 lectures left (including today)
    • Class on Tuesday November 22, 2022 will be online
  • Last time
    • Gaussian processes - prediction
  • Today
    • More Gaussian Processes
  • Questions?

Last time: Schur complement

  • Suppose that \[ \matM = \left[\begin{array}{cc}\matM_{11}&\matM_{12}\\\matM_{21}&\matM_{22}\end{array}\right]. \] If \(\matM_{22}\) is invertible, the Schur complement of \(\matM_{22}\) in \(\matM\) is \(\matS_{22} \eqdef \matM_{11}-\matM_{12}\matM_{22}^{-1}\matM_{21}\).

    If \(\matM_{11}\) is invertible, the Schur complement of \(\matM_{11}\) in \(\matM\) is \(\matS_{11} \eqdef \matM_{22}-\matM_{21}\matM_{11}^{-1}\matM_{12}\).

  • \[ \matM^{-1} = \left[\begin{array}{cc}\matS_{22}^{-1}&-\matS_{22}^{-1}\matM_{12}\matM_{22}^{-1}\\-\matM_{22}^{-1}\matM_{21}\matS_{22}^{-1}&\matM_{22}^{-1}+\matM_{22}^{-1}\matM_{21}\matS_{22}^{-1}\matM_{12}\matM_{22}^{-1}\end{array}\right]. \] \[ \matM^{-1} = \left[\begin{array}{cc}\matM_{11}^{-1}+\matM_{11}^{-1}\matM_{12}\matS_{11}^{-1}\matM_{21}\matM_{11}^{-1}&-\matM_{11}^{-1}\matM_{12}\matS_{11}^{-1}\\-\matS_{11}^{-1}\matM_{21}\matM_{11}^{-1}&\matS_{11}^{-1}\end{array}\right]. \]
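  • A quick numerical sanity check of the first block-inverse formula (a minimal numpy sketch; the matrix and the block split are arbitrary choices, not anything specific to the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a well-conditioned symmetric positive definite M and split it into blocks
A = rng.standard_normal((5, 5))
M = A @ A.T + 5 * np.eye(5)
M11, M12 = M[:2, :2], M[:2, 2:]
M21, M22 = M[2:, :2], M[2:, 2:]

# Schur complement of M22 in M: S22 = M11 - M12 M22^{-1} M21
S22 = M11 - M12 @ np.linalg.solve(M22, M21)

# Assemble M^{-1} block by block from the first formula on the slide
S22inv, M22inv = np.linalg.inv(S22), np.linalg.inv(M22)
Minv = np.block([
    [S22inv, -S22inv @ M12 @ M22inv],
    [-M22inv @ M21 @ S22inv, M22inv + M22inv @ M21 @ S22inv @ M12 @ M22inv],
])

assert np.allclose(Minv, np.linalg.inv(M))  # matches the direct inverse
```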

Last time: Predicting unknown function values

  • \(n\) training pairs \(\set{(\vecx_i,f_i)}_{i=1}^n\) with inputs collected in \(X\); \(n^*\) test pairs \(\set{(\vecx_j^*,f_j^*)}_{j=1}^{n^*}\) with inputs collected in \(X^*\)
  • Assuming a zero-mean Gaussian process, the joint distribution is \[ \left[\begin{array}{c}\vecf\\\vecf^*\end{array}\right]\sim\calN\left(\boldsymbol{0},\left[\begin{array}{cc}\matK(X,X)&\matK(X,X^*)\\\matK(X^*,X)&\matK(X^*,X^*)\end{array}\right]\right) \]
  • The distribution of \(\vecf^*\) conditioned on \(X\), \(X^*\) and \(\vecf\) is \[ \calN\left(\matK(X^*,X)\matK(X,X)^{-1}\vecf,\matK(X^*,X^*)-\matK(X^*,X)\matK(X,X)^{-1}\matK(X,X^*)\right) \] Hence we can estimate \(\vecf^*\) as \(\matK(X^*,X)\matK(X,X)^{-1}\vecf\) and we can quantify our uncertainty.
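  • Note the connection to the Schur complement of the previous slide: taking \(\matM_{11}=\matK(X,X)\), \(\matM_{12}=\matK(X,X^*)\), \(\matM_{21}=\matK(X^*,X)\), and \(\matM_{22}=\matK(X^*,X^*)\), the conditional covariance above is exactly \(\matS_{11}=\matM_{22}-\matM_{21}\matM_{11}^{-1}\matM_{12}\); conditioning a Gaussian amounts to computing a Schur complement.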

Predicting unknown function values with noise

  • Often, we only observe \(y = f(x)+{\varepsilon}\) with \({\varepsilon}\sim\calN(0,\sigma^2)\) (i.i.d. across different measurements)

    • In vector form, we observe \(\vecy = \vecf+\boldsymbol{\varepsilon}\) with \(\boldsymbol{\varepsilon}\sim\calN(\boldsymbol{0},\sigma^2\matI)\), so that \[ \left[\begin{array}{c}\vecy\\\vecf^*\end{array}\right]\sim\calN\left(\boldsymbol{0},\left[\begin{array}{cc}\matK(X,X)+\sigma^2\matI&\matK(X,X^*)\\\matK(X^*,X)&\matK(X^*,X^*)\end{array}\right]\right) \]

  • The distribution of \(\vecf^*\) conditioned on \(X\), \(X^*\) and \(\vecy\) is \[ \calN\left(\matK(X^*,X)(\matK(X,X)+\sigma^2\matI)^{-1}\vecy,\matK(X^*,X^*)-\matK(X^*,X)(\matK(X,X)+\sigma^2\matI)^{-1}\matK(X,X^*)\right) \] Hence we can estimate \(\vecf^*\) as \(\matK(X^*,X)(\matK(X,X)+\sigma^2\matI)^{-1}\vecy\) and we can quantify our uncertainty.

  • Small simplification: for a single test point \(x^*\), with \(\veck_*\eqdef[k(x_1,x^*),\dots,k(x_n,x^*)]^T\) and \(\matK\eqdef\matK(X,X)\), \[ f^* = \veck_*^T(\matK + \sigma^2\matI)^{-1}\vecy\qquad \sigma_{f^*}^2 = k(x^*,x^*) - \veck_*^T(\matK+\sigma^2\matI)^{-1}\veck_* \]

    • Note that we can view this as a linear predictor \(f^* = \sum_{i=1}^n\alpha_ik(x_i,x^*)\) with \(\bfalpha \eqdef (\matK+\sigma^2\matI)^{-1}\vecy\); see the numerical sketch below
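  • A minimal numpy sketch of these posterior formulas (the squared-exponential kernel and the toy data are illustrative assumptions):

```python
import numpy as np

def rbf_kernel(A, B, ell=1.0):
    # Squared-exponential kernel k(x, x') = exp(-(x - x')^2 / (2 ell^2)), 1-D inputs
    return np.exp(-((A[:, None] - B[None, :]) ** 2) / (2 * ell**2))

# Toy data: noisy samples of an unknown function (hypothetical example)
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 10)
sigma2 = 0.05
y = np.sin(X) + np.sqrt(sigma2) * rng.standard_normal(X.size)
Xstar = np.linspace(-4, 4, 200)

K = rbf_kernel(X, X)             # K(X, X)
Kstar = rbf_kernel(Xstar, X)     # K(X*, X)
Kss = rbf_kernel(Xstar, Xstar)   # K(X*, X*)

# Posterior mean and covariance; solve linear systems rather than forming inverses
alpha = np.linalg.solve(K + sigma2 * np.eye(X.size), y)
mean = Kstar @ alpha             # K(X*,X)(K(X,X) + sigma^2 I)^{-1} y
cov = Kss - Kstar @ np.linalg.solve(K + sigma2 * np.eye(X.size), Kstar.T)
std = np.sqrt(np.clip(np.diag(cov), 0, None))   # pointwise uncertainty
```

    Taking \(\sigma^2\to 0\) (with \(\matK(X,X)\) invertible) recovers the noise-free predictor of the previous slide.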

Remarks

  • Non zero-mean Gaussian processes
    • Predict \(\vecf^*\) as
    \[ \vecm(X^*) + \matK(X^*,X)(\matK(X,X)+\sigma^2\matI)^{-1}(\vecy-\vecm(X)) \]
    • Variance is still \[ \matK(X^*,X^*)-\matK(X^*,X)(\matK(X,X)+\sigma^2\matI)^{-1}\matK(X,X^*) \]
  • Do we need the probabilistic modeling?
    • Recall our two approaches to the Kalman filter
    • We can think of GP modeling completely deterministically using an appropriate space of functions \(\calH\) \[ \min_{f\in\calH}\sum_{i=1}^n\abs{y_i-f(\vecx_i)}^2+\lambda\norm[\calH]{f}^2 \]
    • What is this space of functions? A Reproducing Kernel Hilbert Space (RKHS)

Vector spaces

  • A vector space \(\calV\) over a field \(\bbK\) consists of a set \(\calV\) of vectors, a closed addition rule \(+\) and a closed scalar multiplication \(\cdot\) such that 8 axioms are satisfied:
    1. \(\forall x,y\in\calV\) \(x+y=y+x\) (commutativity)
    2. \(\forall x,y,z\in\calV\) \(x+(y+z)=(x+y)+z\) (associativity)
    3. \(\exists 0\in\calV\) such that \(\forall x\in\calV\) \(x+0=x\) (identity element)
    4. \(\forall x\in\calV\) \(\exists y\in\calV\) such that \(x+y=0\) (inverse element)
    5. \(\forall x\in\calV\) \(1\cdot x= x\) (scalar identity)
    6. \(\forall \alpha, \beta\in\bbK\) \(\forall x\in\calV\) \(\alpha\cdot(\beta\cdot x)=(\alpha\cdot\beta)\cdot x\) (associativity)
    7. \(\forall \alpha, \beta\in\bbK\) \(\forall x\in\calV\) \((\alpha+\beta)x = \alpha x+\beta x\) (distributivity)
    8. \(\forall \alpha\in\bbK\) \(\forall x,y\in\calV\) \(\alpha(x+y) = \alpha x+\alpha y\) (distributivity)
    • \(0\in\calV\) is unique
    • Every \(x\in\calV\) has a unique inverse
    • \(0\cdot x = 0\)
    • The inverse of \(x\in\calV\) is \((-1)\cdot x\eqdef -x\)

Subspaces and span

  • A subset \(\calW\neq\emptyset\) of a vector space \(\calV\) is a vector subspace if \(\forall x,y\in\calW\) \(\forall \lambda,\mu\in\bbK\) \(\lambda x+\mu y \in\calW\)

  • If \(\calW\) is a vector subspace of a vector space \(\calV\), \(\calW\) is a vector space.

  • Let \(\set{v_i}_{i=1}^n\) be a set of vectors in a vector space \(\calV\).

  • For \(\set{a_i}_{i=1}^n\in\bbK^n\), \(\sum_{i=1}^na_iv_i\) is called a linear combination of the vectors \(\set{v_i}_{i=1}^n\).

  • The span of the vectors \(\set{v_i}_{i=1}^n\) is the set \[ \text{span}(\set{v_i}_{i=1}^n)\eqdef \{\sum_{i=1}^na_iv_i:\set{a_i}_{i=1}^n\in\bbK^n\} \]

  • The span of the vectors \(\set{v_i}_{i=1}^n\in\calV^n\) is a vector subspace of \(\calV\).

Linear independence

  • Let \(\set{v_i}_{i=1}^n\) be a set of vectors in a vector space \(\calV\)

  • \(\set{v_i}_{i=1}^n\) is linearly independent (or the vectors \(\set{v_i}_{i=1}^n\) are linearly independent) if (and only if) \[ \sum_{i=1}^na_iv_i = 0\Rightarrow \forall i\in\intseq{1}{n}\,a_i=0 \] Otherwise the set is (or the vectors are) linearly dependent.

  • Any set of linearly dependent vectors contains a subset of linearly independent vectors with the same span.

Bases

  • A basis of a vector subspace \(\calW\) of a vector space \(\calV\) is a countable set of vectors \(\calB\) such that:
    1. \(\text{span}(\calB)=\calW\)
    2. \(\calB\) is linearly independent
  • If a nontrivial vector space \(\calV\neq \set{0}\) has a finite basis with \(n\in\bbN^*\) elements, \(n\) is called the dimension of \(\calV\), denoted \(\dim{\calV}\). If every basis has an infinite number of elements, the dimension is infinite.

  • Any two bases for the same finite dimensional vector space contain the same number of elements.

  • You should be somewhat familiar with bases (at least in \(\bbR^n\)):
    • the representation of a vector on a basis is unique
    • every subspace has a basis
    • having a basis reduces the operations on vectors to operations on their components
  • Things sort of work in infinite dimensions, but we have to be a bit more careful

Norm and inner product

  • The properties of vector spaces seen thus far provide an algebraic structure

  • We are missing a topological structure to measure length and distance

  • A norm on a vector space \(\calV\) over \(\bbR\) is a function \(\norm{\cdot}:\calV\to\bbR\) that satisfies:
    • Positive definiteness: \(\forall x\in\calV\) \(\norm{x}\geq 0\) with equality iff \(x=0\)
    • Homogeneity: \(\forall x\in\calV\) \(\forall\alpha\in\bbR\) \(\norm{\alpha x}=\abs{\alpha}\norm{x}\)
    • Subadditivity: \(\forall x,y\in\calV\) \(\norm{x+y}\leq \norm{x}+\norm{y}\)
  • \(\norm{x}\) measures a length, \(\norm{x-y}\) measures a distance

  • In addition to a topological and algebraic structure, what if we want to do geometry?

  • An inner product space over \(\bbR\) is a vector space \(\calV\) equipped with a positive definite symmetric bilinear form \(\dotp{\cdot}{\cdot}:\calV\times\calV\to\bbR\) called an inner product

  • An inner product space is also called a pre-Hilbert space

Induced norm

  • In an inner product space, an inner product induces a norm \(\norm{x} \eqdef \sqrt{\dotp{x}{x}}\)

  • A norm \(\norm{\cdot}\) is induced by an inner product on \(\calV\) iff \(\forall x,y\in\calV\) \(\norm{x}^2+\norm{y}^2 = \frac{1}{2}\left(\norm{x+y}^2+\norm{x-y}^2\right)\) If this is the case, the inner product is given by the polarization identity \[\dotp{x}{y}=\frac{1}{2}\left(\norm{x}^2+\norm{y}^2-\norm{x-y}^2\right)\]

  • Induced norms have some nice additional properties
  • An induced norm satisfies \(\forall x,y\in\calV\) \(\norm{x+y}\leq \norm{x}+\norm{y}\)

  • An inner product satisfies \(\forall x,y\in\calV\) \(\dotp{x}{y}^2\leq\dotp{x}{x}\dotp{y}{y}\) (Cauchy-Schwarz inequality)
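  • A minimal numerical check (hypothetical vectors) that the Euclidean norm, which is induced by the dot product, satisfies the parallelogram law, and that the polarization identity recovers the inner product from norms alone:

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.standard_normal(4), rng.standard_normal(4)
n = np.linalg.norm

# Parallelogram law: ||x||^2 + ||y||^2 = (||x+y||^2 + ||x-y||^2) / 2
assert np.isclose(n(x)**2 + n(y)**2, 0.5 * (n(x + y)**2 + n(x - y)**2))
# Polarization identity: <x,y> = (||x||^2 + ||y||^2 - ||x-y||^2) / 2
assert np.isclose(x @ y, 0.5 * (n(x)**2 + n(y)**2 - n(x - y)**2))
# Cauchy-Schwarz: <x,y>^2 <= <x,x><y,y>
assert (x @ y) ** 2 <= (x @ x) * (y @ y)
```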

Orthogonality

  • In the following \(\calV\) is an inner product space with induced norm \(\norm{\cdot}\)
  • The angle between two non-zero vectors \(x,y\in\calV\) is \[ \cos\theta \eqdef \frac{\dotp{x}{y}}{\norm{x}\norm{y}} \]

  • Two vectors \(x,y\in\calV\) are orthogonal if \(\dotp{x}{y}=0\). We write \(x\perp y\) for simplicity.

    A vector \(x\in\calV\) is orthogonal to a set \(\calS\subset\calV\) if \(\forall s\in\calS\) \(\dotp{x}{s}=0\). We write \(x\perp \calS\) for simplicity.

  • If \(x\perp y\) then \(\norm{x+y}^2=\norm{x}^2+\norm{y}^2\)

  • Inner product spaces have almost all the properties of \(\bbR^n\)

Hilbert spaces

  • In infinite dimensions, things are a little bit tricky. What does the following mean? \[ x(t) = \sum_{n=1}^\infty \alpha_n\psi_n(t) \]

  • We need to define a notion of convergence, e.g., \[ \lim_{N\to\infty}\norm{x(t)-\sum_{n=1}^N \alpha_n\psi_n(t)}=0 \]

  • Problems can still arise if "points are missing"; we avoid this by introducing the notion of completeness

  • An inner product space \(\calV\) is complete if every Cauchy sequence converges, i.e., for every \(\set{x_i}_{i\geq1}\) in \(\calV\) \[ \lim_{\min(m,n)\to\infty}\norm{x_m-x_n}=0\Rightarrow \lim_{n\to\infty}x_n = x^*\in\calV. \]

  • We won't worry too much about proving that spaces are complete

  • A complete normed vector space is a Banach space; a complete inner product space is a Hilbert space

Orthogonality principle

  • Let \(\calH\) be a Hilbert space with inner product \(\dotp{\cdot}{\cdot}\) and induced norm \(\norm{\cdot}\); let \(\calT\) be a subspace of \(\calH\)

  • For \(x\in\calH\), what is the closest point \(\hat{x}\in\calT\)? How do we solve \(\min_{y\in\calT}\norm{x-y}\)?

  • This problem has a unique solution given by the orthogonality principle

  • Let \(\calX\) be a pre-Hilbert space, \(\calT\) be a subspace of \(\calX\), and \(x\in\calX\).

    If there exists a vector \(m^*\in\calT\) such that \(\forall m\in\calT\) \(\norm{x-m^*}\leq \norm{x-m}\), then \(m^*\) is unique.

    \(m^*\in\calT\) is the unique minimizer if and only if the error \(x-m^*\) is orthogonal to \(\calT\).

  • This doesn't say that \(m^*\) exists!

  • Let \(\calH\) be a Hilbert space, \(\calT\) be a closed subspace of \(\calH\), and \(x\in\calH\).

    There exists a unique vector \(m^*\in\calT\) such that \(\forall m\in\calT\) \(\norm{x-m^*}\leq \norm{x-m}\).

    \(m^*\in\calT\) is the unique minimizer if and only if the error \(x-m^*\) is orthogonal to \(\calT\).
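  • A small numpy illustration of the orthogonality principle (hypothetical example: \(\calT\) is the column span of a random matrix, and least squares computes the projection):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 3))   # columns span the subspace T of R^6
x = rng.standard_normal(6)

# Least squares finds m* = A c minimizing ||x - A c||
c, *_ = np.linalg.lstsq(A, x, rcond=None)
m_star = A @ c

# Orthogonality principle: the error x - m* is orthogonal to T
assert np.allclose(A.T @ (x - m_star), 0)
```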

Orthobases

  • A collection of vectors \(\set{v_i}_{i=1}^n\) in a finite dimensional Hilbert space \(\calH\) is an orthobasis if 1) \(\text{span}(\set{v_i}_{i=1}^n)=\calH\); 2) \(\forall i\neq j\in\intseq{1}{n}\,v_i\perp v_j\); 3) \(\forall i\in\intseq{1}{n} \,\norm{v_i}=1\).

  • If the last condition is not met, this is just called an orthogonal basis

  • Orthobases are useful because we can write \(x=\sum_{i=1}^n\dotp{x}{v_i}v_i\) (what happens in a non-orthonormal basis?)

  • We would like to extend this idea to infinite dimensions and happily write \(x=\sum_{i=1}^\infty\dotp{x}{v_i}v_i\)

    • We have to be a bit careful
    • With a little bit of machinery, this works: separability + completeness

Separable space

  • A space is separable if it contains a countable dense subset.

  • Separability is the key property to deal with sequences instead of collections

  • Any separable Hilbert space has an orthonormal basis.

  • Most useful Hilbert spaces are separable! We won't worry about non-separable Hilbert spaces

  • Key take away for separable Hilbert spaces

    • \(x=\sum_{i=1}^\infty\dotp{x}{v_i}v_i\) is perfectly well defined for an orthonormal basis
    • Parseval's identity tells us that \(\norm{x}^2=\sum_{k\geq 1}\abs{\dotp{x}{v_k}}^2\).
    • It looks like we don't even need to worry about the nature of \(\calH\) and can think only about the coefficients \(\dotp{x}{v_i}\)
  • Any separable Hilbert space is isomorphic to \(\ell_2\)
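  • A finite-dimensional stand-in for these take-aways (a hypothetical example: an orthonormal basis of \(\bbR^5\) from a QR factorization):

```python
import numpy as np

rng = np.random.default_rng(3)
V, _ = np.linalg.qr(rng.standard_normal((5, 5)))   # columns form an orthobasis
x = rng.standard_normal(5)

coeffs = V.T @ x                              # <x, v_i> for each basis vector
assert np.allclose(V @ coeffs, x)             # x = sum_i <x, v_i> v_i
assert np.isclose(np.sum(coeffs**2), x @ x)   # Parseval: ||x||^2 = sum_i |<x, v_i>|^2
```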

Linear functionals on Hilbert spaces

  • In what follows, \(\calF\) is a Hilbert space with scalar field \(\bbR\)
  • A functional \(F:\calF\to\bbR\) associates a real number with each element of a Hilbert space \(\calF\)

  • Notation can be tricky when the Hilbert space is a space of functions: \(F\) can act on a function \(f\in\calF\)
  • A functional \(F:\calF\to\bbR\) is continuous at \(x\in\calF\) if \[ \forall \epsilon>0\exists\delta>0\textsf{ such that } \norm[\calF]{x-y}\leq \delta\Rightarrow \abs{F(x)-F(y)}\leq\epsilon \] If this is true for every \(x\in\calF\), \(F\) is continuous.

    1. All norms are continuous functionals
    2. \(F:\calF\to\bbR:x\mapsto\dotp{x}{c}\) for some \(c\in\calF\) is continuous

Continuous linear functionals on Hilbert spaces

  • A functional \(F\) is linear if \(\forall a,b\in\bbR\) \(\forall x,y\in\calF\) \(F(ax+by) = aF(x)+bF(y)\).

  • Continuous linear functionals are much more constrained than one would imagine

  • A linear functional \(F:\calF\to\bbR\) is bounded if there exists \(M>0\) such that \[ \forall x\in\calF\quad\abs{F(x)}\leq M\norm[\calF]{x} \]

  • A linear functional on a Hilbert space that is continuous at \(0\) is bounded.

  • For a linear functional \(F:\calF\to\bbR\), the following statements are equivalent:
    1. \(F\) is continuous at 0
    2. \(F\) is continuous at some point \(x\in\calF\)
    3. \(F\) is continuous everywhere on \(\calF\)
    4. \(F\) is uniformly continuous everywhere on \(\calF\)

Representing (continuous) linear functionals

  • Let \(F:\calF\to\bbR\) be a linear functional on an \(n\)-dimensional Hilbert space \(\calF\).

    Then there exists \(c\in\calF\) such that \(F(x)=\dotp{x}{c}\) for every \(x\in\calF\)

  • Linear functionals over finite-dimensional Hilbert spaces are always continuous!

  • This is not true in infinite dimensions

  • Let \(F:\calF\to\bbR\) be a continuous linear functional on a (possibly infinite-dimensional) separable Hilbert space \(\calF\).

    Then there exists \(c\in\calF\) such that \(F(x)=\dotp{x}{c}\) for every \(x\in\calF\)

  • If \(\set{\psi_n}_{n\geq 1}\) is an orthobasis for \(\calF\), then we can construct \(c\) above as \[ c\eqdef \sum_{n=1}^\infty F(\psi_n)\psi_n \]
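  • A sketch of this construction in \(\bbR^4\) (hypothetical example; the standard basis plays the role of the orthobasis \(\set{\psi_n}\)):

```python
import numpy as np

rng = np.random.default_rng(4)
w = rng.standard_normal(4)
F = lambda x: w @ x        # an arbitrary linear functional on R^4

E = np.eye(4)              # columns e_1, ..., e_4 form an orthobasis
c = sum(F(E[:, i]) * E[:, i] for i in range(4))   # c = sum_n F(psi_n) psi_n

x = rng.standard_normal(4)
assert np.isclose(F(x), x @ c)   # Riesz representation: F(x) = <x, c>
```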

Reproducing Kernel Hilbert Spaces

  • An RKHS is a Hilbert space \(\calH\) of real-valued functions \(f:\bbR^d\to\bbR\) in which the sampling operation \(\calS_\bftau:\calH\to\bbR:f\mapsto f(\bftau)\) is continuous for every \(\bftau\in\bbR^d\).

    In other words (this is the Riesz representation theorem at work), for each \(\bftau\in\bbR^d\), there exists \(k_\bftau\in\calH\) s.t. \[ f(\bftau) = {\dotp{f}{k_\bftau}}_\calH\text{ for all } f\in\calH \]

  • The kernel of an RKHS is \[ k:\bbR^d\times\bbR^d\to\bbR:(\bft,\bftau)\mapsto k_{\bftau}(\bft) \] where \(k_\bftau\) is the element of \(\calH\) that defines the sampling at \(\bftau\).

  • A (separable) Hilbert space with orthobasis \(\set{\psi_n}_{n\geq 1}\) is an RKHS iff \(\forall \bftau\in\bbR^d\) \(\sum_{n=1}^\infty\abs{\psi_{n}(\bftau)}^2<\infty\)

Representer theorem

  • An RKHS is just the right space to solve our problem

  • If \(\calH\) is an RKHS, then \[ \min_{f\in\calH}\sum_{i=1}^n\abs{y_i-f(\vecx_i)}^2+\lambda\norm[\calH]{f}^2 \] has solution \[ f = \sum_{i=1}^n\alpha_i k_{\vecx_i}\textsf{ with } \bfalpha = (\matK+\lambda\matI)^{-1}\vecy\qquad \matK\eqdef\left[k(\vecx_i,\vecx_j)\right]_{1\leq i,j\leq n} \]
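  • A minimal sketch of the resulting kernel ridge regression (reusing the squared-exponential kernel and toy data assumed earlier):

```python
import numpy as np

def rbf_kernel(A, B, ell=1.0):
    return np.exp(-((A[:, None] - B[None, :]) ** 2) / (2 * ell**2))

rng = np.random.default_rng(5)
X = np.linspace(-3, 3, 10)
y = np.sin(X) + 0.2 * rng.standard_normal(X.size)
lam = 0.05

K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(X.size), y)   # alpha = (K + lam I)^{-1} y

# Evaluate the minimizer f = sum_i alpha_i k_{x_i} at new points
Xstar = np.linspace(-4, 4, 200)
f_star = rbf_kernel(Xstar, X) @ alpha
```

    With \(\lambda=\sigma^2\) this is exactly the GP posterior mean from earlier: the deterministic RKHS view and the probabilistic view agree on the point estimate.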