Gaussian Processes

Matthieu Bloch

Tuesday, November 15, 2022

Today in ECE 6555

  • Announcements
    • 6 lectures left (including today)
    • Kalman filtering project submission window extended to Wednesday November 16, 2022
    • Next project coming up (particle filtering most likely)
    • No office hours today
  • Last time
    • Particle Filtering
    • Gaussian processes
  • Today
    • More Gaussian Processes
  • Questions?

Last time: Gaussian processes

  • What if we don't know part of the system?
    • Need to specify what we mean by "don't know": statistics, functional form?
  • Two solutions to extend our results
    1. Show robustness to uncertainty, i.e., show that results still hold with "unknown" disturbances
    2. Integrate the ability to learn what we don't know
  • Which solution to adopt depends on the problem, e.g., can we afford to learn?
  • Gaussian processes are a powerful tool to model unknown functions and learn them from samples

  • A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution

  • A Gaussian process is completely specified by a mean function \(m(\vecx)\) and a covariance function \(k(\vecx,\vecx')\) of a real-valued process \(f(\vecx)\) such that \[ m(\vecx)\eqdef \E{f(\vecx)}\qquad k(\vecx,\vecx') \eqdef \E{(f(\vecx)-m(\vecx))(f(\vecx')-m(\vecx'))} \]

  • Possible to generalize to vector-valued functions (more on this later), often assume \(m(\vecx)=0\)
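  • Concretely, the definition says that for any finite set of inputs \(\vecx_1,\dots,\vecx_n\), \[ \left[\begin{array}{c}f(\vecx_1)\\\vdots\\f(\vecx_n)\end{array}\right]\sim\calN\left(\left[\begin{array}{c}m(\vecx_1)\\\vdots\\m(\vecx_n)\end{array}\right],\matK\right)\qquad\text{with } \matK_{ij}\eqdef k(\vecx_i,\vecx_j) \]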

Last time: Gaussian Processes

  • Example of kernel: squared exponential kernel \[ k(\vecx,\vecx') \eqdef \exp\left(-\frac{1}{2\sigma^2}\norm[2]{\vecx-\vecx'}^2\right) \]
    • The kernel is a hyper-parameter of the model
    • The kernel controls the smoothness of the functions we model (think of role of \(\sigma^2\))
    • (Why can we use this to define a covariance matrix?)
  • Key benefit of GPs: incorporate knowledge of observations of the function
    • Assume we know \(n\) observations \((\vecx_i,f_i)\)
    • Assume we would like to approximate the function at \(n^*\) test points \((\vecx_j^*,f_j^*)\)
    • The joint distribution is \[ \left[\begin{array}{c}\vecf\\\vecf^*\end{array}\right]\sim\calN\left(\boldsymbol{0},\left[\begin{array}{cc}\matK(X,X)&\matK(X,X^*)\\\matK(X^*,X)&\matK(X^*,X^*)\end{array}\right]\right) \]
    • How do we use this to estimate \(f^*\)? (A minimal numerical sketch follows below.)
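  • To make this concrete, here is a minimal NumPy sketch (the function name se_kernel, the toy 1-D inputs, and the jitter term are illustrative choices, not from the lecture): it builds the blocks \(\matK(X,X)\), \(\matK(X,X^*)\), \(\matK(X^*,X^*)\) for the squared exponential kernel and draws a few functions from the zero-mean prior.

```python
import numpy as np

def se_kernel(X, Xp, sigma=1.0):
    """Squared exponential kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    d2 = np.sum((X[:, None, :] - Xp[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Toy 1-D training and test inputs (illustrative)
X = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
Xs = np.linspace(0.0, 1.0, 50).reshape(-1, 1)

# Blocks of the joint covariance of [f; f*]
K_xx = se_kernel(X, X)    # K(X, X)
K_xs = se_kernel(X, Xs)   # K(X, X*)
K_ss = se_kernel(Xs, Xs)  # K(X*, X*)

# Draw three functions from the zero-mean GP prior at the test points
# (a small jitter keeps the covariance numerically positive semidefinite)
prior_samples = np.random.multivariate_normal(
    np.zeros(len(Xs)), K_ss + 1e-10 * np.eye(len(Xs)), size=3)
```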

Schur complement

  • Suppose that \[ \matM = \left[\begin{array}{cc}\matM_{11}&\matM_{12}\\\matM_{21}&\matM_{22}\end{array}\right]. \] If \(\matM_{22}\) is invertible, the Schur complement of \(\matM_{22}\) in \(\matM\) is \(\matS_{22} \eqdef \matM_{11}-\matM_{12}\matM_{22}^{-1}\matM_{21}\).

    If \(\matM_{11}\) is invertible, the Schur complement of \(\matM_{11}\) in \(\matM\) is \(\matS_{11} \eqdef \matM_{22}-\matM_{21}\matM_{11}^{-1}\matM_{12}\).

  • Block matrix inversion formulas in terms of the Schur complements: \[ \matM^{-1} = \left[\begin{array}{cc}\matS_{22}^{-1}&-\matS_{22}^{-1}\matM_{12}\matM_{22}^{-1}\\-\matM_{22}^{-1}\matM_{21}\matS_{22}^{-1}&\matM_{22}^{-1}+\matM_{22}^{-1}\matM_{21}\matS_{22}^{-1}\matM_{12}\matM_{22}^{-1}\end{array}\right]. \] \[ \matM^{-1} = \left[\begin{array}{cc}\matM_{11}^{-1}+\matM_{11}^{-1}\matM_{12}\matS_{11}^{-1}\matM_{21}\matM_{11}^{-1}&-\matM_{11}^{-1}\matM_{12}\matS_{11}^{-1}\\-\matS_{11}^{-1}\matM_{21}\matM_{11}^{-1}&\matS_{11}^{-1}\end{array}\right]. \]
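  • For reference (a standard consequence, implicit between this slide and the next): if \(\left[\begin{array}{c}\boldsymbol{a}\\\boldsymbol{b}\end{array}\right]\sim\calN\left(\boldsymbol{0},\left[\begin{array}{cc}\matM_{11}&\matM_{12}\\\matM_{21}&\matM_{22}\end{array}\right]\right)\), then completing the square using the block inverse above gives \[ \boldsymbol{a}\mid\boldsymbol{b} \sim \calN\left(\matM_{12}\matM_{22}^{-1}\boldsymbol{b},\,\matM_{11}-\matM_{12}\matM_{22}^{-1}\matM_{21}\right) = \calN\left(\matM_{12}\matM_{22}^{-1}\boldsymbol{b},\,\matS_{22}\right). \] Taking \(\boldsymbol{a}=\vecf^*\) and \(\boldsymbol{b}=\vecf\) yields the prediction formula on the next slide.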

Predicting unknown function values

  • \(n\) observations \((\vecx_i,f_i)\) with inputs collected in \(X\), \(n^*\) test points \((\vecx_j^*,f_j^*)\) with inputs collected in \(X^*\)
  • The joint distribution is \[ \left[\begin{array}{c}\vecf\\\vecf^*\end{array}\right]\sim\calN\left(\boldsymbol{0},\left[\begin{array}{cc}\matK(X,X)&\matK(X,X^*)\\\matK(X^*,X)&\matK(X^*,X^*)\end{array}\right]\right) \]
  • Assume zero mean
  • The distribution of \(\vecf^*\) conditioned on \(X\), \(X^*\) and \(\vecf\) is \[ \calN\left(\matK(X^*,X)\matK(X,X)^{-1}\vecf,\matK(X^*,X^*)-\matK(X^*,X)\matK(X,X)^{-1}\matK(X,X^*)\right) \] Hence we can estimate \(\vecf^*\) as \(\matK(X^*,X)\matK(X,X)^{-1}\vecf\) and we can quantify our uncertainty (a numerical continuation of the earlier sketch follows below).
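  • Continuing the NumPy sketch from earlier (the observations f below are toy values, and the explicit matrix inverse is used only for readability), the posterior mean and covariance of \(\vecf^*\) are computed exactly as in the formula above.

```python
# Toy noise-free observations at the training inputs (illustrative only)
f = np.sin(6.0 * X).ravel()

# Posterior of f* given f (zero-mean prior, noise-free observations)
K_inv = np.linalg.inv(K_xx + 1e-10 * np.eye(len(X)))  # jitter for numerical stability
mean_star = K_xs.T @ K_inv @ f                # K(X*,X) K(X,X)^{-1} f
cov_star = K_ss - K_xs.T @ K_inv @ K_xs       # K(X*,X*) - K(X*,X) K(X,X)^{-1} K(X,X*)
```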

Predicting unknown function values with noise

  • Often, we only observe \(y = f(x)+{\varepsilon}\) with \({\varepsilon}\sim\calN(0,\sigma^2)\) (i.i.d. across different measurements)

    • In vector form, \(\vecy = \vecf+\boldsymbol{\varepsilon}\) with \(\boldsymbol{\varepsilon}\sim\calN(0,\sigma^2\matI)\), so that \[ \left[\begin{array}{c}\vecy\\\vecf^*\end{array}\right]\sim\calN\left(\boldsymbol{0},\left[\begin{array}{cc}\matK(X,X)+\sigma^2\matI&\matK(X,X^*)\\\matK(X^*,X)&\matK(X^*,X^*)\end{array}\right]\right) \]

  • The distribution of \(\vecf^*\) conditioned on \(X\), \(X^*\) and \(\vecy\) is \[ \calN\left(\matK(X^*,X)(\matK(X,X)+\sigma^2\matI)^{-1}\vecy,\matK(X^*,X^*)-\matK(X^*,X)(\matK(X,X)+\sigma^2\matI)^{-1}\matK(X,X^*)\right) \] Hence we can estimate \(\vecf^*\) as \(\matK(X^*,X)(\matK(X,X)+\sigma^2\matI)^{-1}\vecy\) and we can quantify our uncertainty.

  • Small simplification: for a single test point \(x^*\), writing \(\matK\eqdef\matK(X,X)\) and \(\veck_*\eqdef\matK(X,x^*)\), \[ f^* = \veck_*^T(\matK + \sigma^2\matI)^{-1}\vecy\qquad \sigma^2_{f^*} = k(x^*,x^*) - \veck_*^T(\matK+\sigma^2\matI)^{-1}\veck_* \]

    • Note that we can view this as a linear predictor \(f^* = \sum_{i=1}^n\alpha_ik(x_i,x^*)\) with \(\boldsymbol{\alpha} \eqdef (\matK+\sigma^2\matI)^{-1}\vecy\) (see the sketch below)
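  • Below is a minimal sketch of the noisy-observation case (the function gp_predict, the noise level, and the toy data are illustrative, not from the lecture). It uses a Cholesky factorization of \(\matK+\sigma^2\matI\) instead of an explicit inverse, which is the numerically preferred way to solve the linear systems appearing in the formulas above.

```python
import numpy as np

def gp_predict(X, y, Xs, kernel, noise_var):
    """Posterior mean and pointwise variance at test inputs Xs, given
    noisy observations y = f(X) + eps with eps ~ N(0, noise_var * I)."""
    K = kernel(X, X) + noise_var * np.eye(len(X))
    L = np.linalg.cholesky(K)                              # K + sigma^2 I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # (K + sigma^2 I)^{-1} y
    Ks = kernel(X, Xs)                                     # K(X, X*)
    mean = Ks.T @ alpha                                    # K(X*,X) (K + sigma^2 I)^{-1} y
    V = np.linalg.solve(L, Ks)                             # L^{-1} K(X, X*)
    var = np.diag(kernel(Xs, Xs)) - np.sum(V * V, axis=0)  # pointwise predictive variances
    return mean, var

# Reusing se_kernel, X, Xs from the earlier sketch, with noisy toy observations
y = np.sin(6.0 * X).ravel() + 0.1 * np.random.randn(len(X))
mean_s, var_s = gp_predict(X, y, Xs, se_kernel, noise_var=0.01)
```

  • The vector alpha computed inside gp_predict is exactly \(\boldsymbol{\alpha} = (\matK+\sigma^2\matI)^{-1}\vecy\) from the linear-predictor remark above.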

Remarks

  • Non zero-mean Gaussian processes
    • Predict \(\vecf^*\) as
    \[ \vecm(X^*) + \matK(X^*,X)(\matK(X,X)+\sigma^2\matI)^{-1}(\vecy-\vecm(X)) \]
    • Variance is still \[ \matK(X^*,X^*)-\matK(X^*,X)(\matK(X,X)+\sigma^2\matI)^{-1}\matK(X,X^*) \]
  • Do we need the probabilistic modeling?
    • Recall our two approaches to the Kalman filter
    • We can think of GP modeling completely deterministically using an appropriate space of functions

Vector spaces

  • A vector space \(\calV\) over a field \(\bbK\) consists of a set \(\calV\) of vectors, a closed addition rule \(+\) and a closed scalar multiplication \(\cdot\) such that 8 axioms are satisfied:
    1. \(\forall x,y\in\calV\) \(x+y=y+x\) (commutativity)
    2. \(\forall x,y,z\in\calV\) \(x+(y+z)=(x+y)+z\) (associativity)
    3. \(\exists 0\in\calV\) such that \(\forall x\in\calV\) \(x+0=x\) (identity element)
    4. \(\forall x\in\calV\) \(\exists y\in\calV\) such that \(x+y=0\) (inverse element)
    5. \(\forall x\in\calV\) \(1\cdot x= x\)
    6. \(\forall \alpha, \beta\in\bbK\) \(\forall x\in\calV\) \(\alpha\cdot(\beta\cdot x)=(\alpha\cdot\beta)\cdot x\) (associativity)
    7. \(\forall \alpha, \beta\in\bbK\) \(\forall x\in\calV\) \((\alpha+\beta)x = \alpha x+\beta x\) (distributivity)
    8. \(\forall \alpha\in\bbK\) \(\forall x,y\in\calV\) \(\alpha(x+y) = \alpha x+\alpha y\) (distributivity)
    • \(0\in\calV\) is unique
    • Every \(x\in\calV\) has a unique inverse
    • \(0\cdot x = 0\)
    • The inverse of \(x\in\calV\) is \((-1)\cdot x\eqdef -x\)
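    • For instance, the third consequence follows directly from the axioms: \(0\cdot x = (0+0)\cdot x = 0\cdot x+0\cdot x\) by distributivity, and adding the inverse of \(0\cdot x\) to both sides gives \(0\cdot x = 0\).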

Vector spaces

  • A subset \(\calW\) of a vector space \(\calV\) is a vector subspace if \(\forall x,y\in\calW\) \(\forall \lambda,\mu\in\bbK\) \(\lambda x+\mu y \in\calW\)

  • If \(\calW\) is a vector subspace of a vector space \(\calV\), \(\calW\) is a vector space.
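  • Example: \(\calW \eqdef \{x\in\bbR^3 : x_1+x_2+x_3=0\}\) is a vector subspace of \(\bbR^3\): any combination \(\lambda x+\mu y\) of vectors whose coordinates sum to zero again has coordinates summing to zero.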

Norm and inner product

  • The properties of vector spaces seen thus far provide an algebraic structure

  • We are missing a topological structure to measure length and distance

  • A norm on a vector space \(\calV\) over \(\bbR\) is a function \(\norm{\cdot}:\calV\to\bbR\) that satisfies:
    • Positive definiteness: \(\forall x\in\calV\) \(\norm{x}\geq 0\) with equality iff \(x=0\)
    • Homogeneity: \(\forall x\in\calV\) \(\forall\alpha\in\bbR\) \(\norm{\alpha x}=\abs{\alpha}\norm{x}\)
    • Subadditivity: \(\forall x,y\in\calV\) \(\norm{x+y}\leq \norm{x}+\norm{y}\)
  • \(\norm{x}\) measures a length, \(\norm{x-y}\) measures a distance
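  • Example: on \(\bbR^n\), \(\norm[2]{x}\eqdef\sqrt{\sum_{i=1}^n x_i^2}\), \(\norm[1]{x}\eqdef\sum_{i=1}^n\abs{x_i}\), and \(\norm[\infty]{x}\eqdef\max_i\abs{x_i}\) all satisfy the three properties above.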

  • In addition to a topological and algebraic structure, what if we want to do geometry?

  • An inner product space over \(\bbR\) is a vector space \(\calV\) equipped with a positive definite symmetric bilinear form \(\dotp{\cdot}{\cdot}:\calV\times\calV\to\bbR\) called an inner product

  • An inner product space is also called a pre-Hilbert space
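  • Example: \(\bbR^n\) with \(\dotp{x}{y}\eqdef x^Ty\) is an inner product space; every inner product induces a norm through \(\norm{x}\eqdef\sqrt{\dotp{x}{x}}\).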