Gaussian Processes

Matthieu Bloch

Tuesday, November 15, 2022

Today in ECE 6555

  • Announcements
    • 6 lectures left (including today)
    • Kalman filtering project submission window extended to Wednesday November 16, 2022
    • Next project coming up (particle filtering most likely)
    • No office hours today
  • Last time
    • Particle Filtering
    • Gaussian processes
  • Today
    • More Gaussian Processes
  • Questions?

Last time: Gaussian processes

  • What if we don't know part of the system?
    • Need to specify what we mean by "don't know": statistics, functional form?
  • Two solutions to extend our results
    1. Show robustness to uncertainty, i.e., show that results still hold with "unknown" disturbances
    2. Integrate the ability to learn what we don't know
  • Which solution to adopt depends on the problem, e.g., can we afford to learn?
  • Gaussian processes are a powerful tool to model unknown functions and learn them from samples

  • A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution

  • A Gaussian process is completely specified by a mean function \(m(\vecx)\) and a covariance function \(k(\vecx,\vecx')\) of a real-valued process \(f(\vecx)\) such that \[ m(\vecx)\eqdef \E{f(\vecx)}\qquad k(\vecx,\vecx') \eqdef \E{(f(\vecx)-m(\vecx))(f(\vecx')-m(\vecx'))} \]

  • Possible to generalize to vector-valued functions (more on this later), often assume \(m(\vecx)=0\)
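  • Concretely, the definition says that for any finite set of inputs \(\vecx_1,\dots,\vecx_n\), \[ \left[\begin{array}{c}f(\vecx_1)\\\vdots\\f(\vecx_n)\end{array}\right]\sim\calN\left(\left[\begin{array}{c}m(\vecx_1)\\\vdots\\m(\vecx_n)\end{array}\right],\matK\right)\qquad\text{with } \matK_{ij}\eqdef k(\vecx_i,\vecx_j) \]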

Last time: Gaussian Processes

  • Example of kernel: squared exponential kernel \[ k(\vecx,\vecx') \eqdef \exp\left(-\frac{1}{2\sigma^2}\norm[2]{\vecx-\vecx'}^2\right) \]
    • The kernel is a hyper-parameter of the model
    • The kernel controls the smoothness of the functions we model (think of role of \(\sigma^2\))
    • (Why can we use this to define a covariance matrix?)
  • Key benefit of GPs: incorporate knowledge of observations of the function
    • Assume we know \(n\) observations \((\vecx_i,f_i)\)
    • Assume we would like to approximate the function at \(n^*\) test points \((\vecx_j^*,f_j^*)\)
    • The joint distribution is \[ \left[\begin{array}{c}\vecf\\\vecf^*\end{array}\right]\sim\calN\left(\boldsymbol{0},\left[\begin{array}{cc}\matK(X,X)&\matK(X,X^*)\\\matK(X^*,X)&\matK(X^*,X^*)\end{array}\right]\right) \]
    • How do we use this to estimate \(f^*\)? (A minimal numerical sketch follows below.)
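  • To make this concrete, here is a minimal NumPy sketch (the function name se_kernel, the toy 1-D inputs, and the jitter term are illustrative choices, not from the lecture): it builds the blocks \(\matK(X,X)\), \(\matK(X,X^*)\), \(\matK(X^*,X^*)\) for the squared exponential kernel and draws a few functions from the zero-mean prior.

```python
import numpy as np

def se_kernel(X, Xp, sigma=1.0):
    """Squared exponential kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    d2 = np.sum((X[:, None, :] - Xp[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Toy 1-D training and test inputs (illustrative)
X = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
Xs = np.linspace(0.0, 1.0, 50).reshape(-1, 1)

# Blocks of the joint covariance of [f; f*]
K_xx = se_kernel(X, X)    # K(X, X)
K_xs = se_kernel(X, Xs)   # K(X, X*)
K_ss = se_kernel(Xs, Xs)  # K(X*, X*)

# Draw three functions from the zero-mean GP prior at the test points
# (a small jitter keeps the covariance numerically positive semidefinite)
prior_samples = np.random.multivariate_normal(
    np.zeros(len(Xs)), K_ss + 1e-10 * np.eye(len(Xs)), size=3)
```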

Schur complement

  • Suppose that \[ \matM = \left[\begin{array}{cc}\matM_{11}&\matM_{12}\\\matM_{21}&\matM_{22}\end{array}\right]. \] If \(\matM_{22}\) is invertible, the Schur complement of \(\matM_{22}\) in \(\matM\) is \(\matS_{22} \eqdef \matM_{11}-\matM_{12}\matM_{22}^{-1}\matM_{21}\).

    If \(\matM_{11}\) is invertible, the Schur complement of \(\matM_{11}\) in \(\matM\) is \(\matS_{11} \eqdef \matM_{22}-\matM_{21}\matM_{11}^{-1}\matM_{12}\).

  • Block matrix inversion formulas in terms of the Schur complements: \[ \matM^{-1} = \left[\begin{array}{cc}\matS_{22}^{-1}&-\matS_{22}^{-1}\matM_{12}\matM_{22}^{-1}\\-\matM_{22}^{-1}\matM_{21}\matS_{22}^{-1}&\matM_{22}^{-1}+\matM_{22}^{-1}\matM_{21}\matS_{22}^{-1}\matM_{12}\matM_{22}^{-1}\end{array}\right]. \] \[ \matM^{-1} = \left[\begin{array}{cc}\matM_{11}^{-1}+\matM_{11}^{-1}\matM_{12}\matS_{11}^{-1}\matM_{21}\matM_{11}^{-1}&-\matM_{11}^{-1}\matM_{12}\matS_{11}^{-1}\\-\matS_{11}^{-1}\matM_{21}\matM_{11}^{-1}&\matS_{11}^{-1}\end{array}\right]. \]
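  • For reference (a standard consequence, implicit between this slide and the next): if \(\left[\begin{array}{c}\boldsymbol{a}\\\boldsymbol{b}\end{array}\right]\sim\calN\left(\boldsymbol{0},\left[\begin{array}{cc}\matM_{11}&\matM_{12}\\\matM_{21}&\matM_{22}\end{array}\right]\right)\), then completing the square using the block inverse above gives \[ \boldsymbol{a}\mid\boldsymbol{b} \sim \calN\left(\matM_{12}\matM_{22}^{-1}\boldsymbol{b},\,\matM_{11}-\matM_{12}\matM_{22}^{-1}\matM_{21}\right) = \calN\left(\matM_{12}\matM_{22}^{-1}\boldsymbol{b},\,\matS_{22}\right). \] Taking \(\boldsymbol{a}=\vecf^*\) and \(\boldsymbol{b}=\vecf\) yields the prediction formula on the next slide.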

Predicting unknown function values

  • \(n\) observations \((\vecx_i,f_i)\) with inputs collected in \(X\), \(n^*\) test points \((\vecx_j^*,f_j^*)\) with inputs collected in \(X^*\)
  • The joint distribution is \[ \left[\begin{array}{c}\vecf\\\vecf^*\end{array}\right]\sim\calN\left(\boldsymbol{0},\left[\begin{array}{cc}\matK(X,X)&\matK(X,X^*)\\\matK(X^*,X)&\matK(X^*,X^*)\end{array}\right]\right) \]
  • Assume zero mean
  • The distribution of \(\vecf^*\) conditioned on \(X\), \(X^*\) and \(\vecf\) is \[ \calN\left(\matK(X^*,X)\matK(X,X)^{-1}\vecf,\matK(X^*,X^*)-\matK(X^*,X)\matK(X,X)^{-1}\matK(X,X^*)\right) \] Hence we can estimate \(\vecf^*\) as \(\matK(X^*,X)\matK(X,X)^{-1}\vecf\) and we can quantify our uncertainty (a numerical continuation of the earlier sketch follows below).
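  • Continuing the NumPy sketch from earlier (the observations f below are toy values, and the explicit matrix inverse is used only for readability), the posterior mean and covariance of \(\vecf^*\) are computed exactly as in the formula above.

```python
# Toy noise-free observations at the training inputs (illustrative only)
f = np.sin(6.0 * X).ravel()

# Posterior of f* given f (zero-mean prior, noise-free observations)
K_inv = np.linalg.inv(K_xx + 1e-10 * np.eye(len(X)))  # jitter for numerical stability
mean_star = K_xs.T @ K_inv @ f                # K(X*,X) K(X,X)^{-1} f
cov_star = K_ss - K_xs.T @ K_inv @ K_xs       # K(X*,X*) - K(X*,X) K(X,X)^{-1} K(X,X*)
```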

Predicting unknown function values with noise

  • Often, we only observe \(y = f(x)+{\varepsilon}\) with \({\varepsilon}\sim\calN(0,\sigma^2)\) (i.i.d. across different measurements)

    • In vector form, \(\vecy = \vecf+\boldsymbol{\varepsilon}\) with \(\boldsymbol{\varepsilon}\sim\calN(0,\sigma^2\matI)\), so that \[ \left[\begin{array}{c}\vecy\\\vecf^*\end{array}\right]\sim\calN\left(\boldsymbol{0},\left[\begin{array}{cc}\matK(X,X)+\sigma^2\matI&\matK(X,X^*)\\\matK(X^*,X)&\matK(X^*,X^*)\end{array}\right]\right) \]

  • The distribution of \(\vecf^*\) conditioned on \(X\), \(X^*\) and \(\vecy\) is \[ \calN\left(\matK(X^*,X)(\matK(X,X)+\sigma^2\matI)^{-1}\vecy,\matK(X^*,X^*)-\matK(X^*,X)(\matK(X,X)+\sigma^2\matI)^{-1}\matK(X,X^*)\right) \] Hence we can estimate \(\vecf^*\) as \(\matK(X^*,X)(\matK(X,X)+\sigma^2\matI)^{-1}\vecy\) and we can quantify our uncertainty.

  • Small simplification: for a single test point \(x^*\), writing \(\matK\eqdef\matK(X,X)\) and \(\veck_*\eqdef\matK(X,x^*)\), \[ f^* = \veck_*^T(\matK + \sigma^2\matI)^{-1}\vecy\qquad \sigma^2_{f^*} = k(x^*,x^*) - \veck_*^T(\matK+\sigma^2\matI)^{-1}\veck_* \]

    • Note that we can view this as a linear predictor \(f^* = \sum_{i=1}^n\alpha_ik(x_i,x^*)\) with \(\boldsymbol{\alpha} \eqdef (\matK+\sigma^2\matI)^{-1}\vecy\) (see the sketch below)
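  • Below is a minimal sketch of the noisy-observation case (the function gp_predict, the noise level, and the toy data are illustrative, not from the lecture). It uses a Cholesky factorization of \(\matK+\sigma^2\matI\) instead of an explicit inverse, which is the numerically preferred way to solve the linear systems appearing in the formulas above.

```python
import numpy as np

def gp_predict(X, y, Xs, kernel, noise_var):
    """Posterior mean and pointwise variance at test inputs Xs, given
    noisy observations y = f(X) + eps with eps ~ N(0, noise_var * I)."""
    K = kernel(X, X) + noise_var * np.eye(len(X))
    L = np.linalg.cholesky(K)                              # K + sigma^2 I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # (K + sigma^2 I)^{-1} y
    Ks = kernel(X, Xs)                                     # K(X, X*)
    mean = Ks.T @ alpha                                    # K(X*,X) (K + sigma^2 I)^{-1} y
    V = np.linalg.solve(L, Ks)                             # L^{-1} K(X, X*)
    var = np.diag(kernel(Xs, Xs)) - np.sum(V * V, axis=0)  # pointwise predictive variances
    return mean, var

# Reusing se_kernel, X, Xs from the earlier sketch, with noisy toy observations
y = np.sin(6.0 * X).ravel() + 0.1 * np.random.randn(len(X))
mean_s, var_s = gp_predict(X, y, Xs, se_kernel, noise_var=0.01)
```

  • The vector alpha computed inside gp_predict is exactly \(\boldsymbol{\alpha} = (\matK+\sigma^2\matI)^{-1}\vecy\) from the linear-predictor remark above.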

Remarks

  • Non zero-mean Gaussian processes
    • Predict \(\vecf^*\) as
    \[ \vecm(X^*) + \matK(X^*,X)(\matK(X,X)+\sigma^2\matI)^{-1}(\vecy-\vecm(X)) \]
    • Variance is still \[ \matK(X^*,X^*)-\matK(X^*,X)(\matK(X,X)+\sigma^2\matI)^{-1}\matK(X,X^*) \]
  • Do we need the probabilistic modeling?
    • Recall our two approaches to the Kalman filter
    • We can think of GP modeling completely deterministically using an appropriate space of functions

Vector spaces

  • A vector space \(\calV\) over a field \(\bbK\) consists of a set \(\calV\) of vectors, a closed addition rule \(+\) and a closed scalar multiplication \(\cdot\) such that 8 axioms are satisfied:
    1. \(\forall x,y\in\calV\) \(x+y=y+x\) (commutativity)
    2. \(\forall x,y,z\in\calV\) \(x+(y+z)=(x+y)+z\) (associativity)
    3. \(\exists 0\in\calV\) such that \(\forall x\in\calV\) \(x+0=x\) (identity element)
    4. \(\forall x\in\calV\) \(\exists y\in\calV\) such that \(x+y=0\) (inverse element)
    5. \(\forall x\in\calV\) \(1\cdot x= x\)
    6. \(\forall \alpha, \beta\in\bbK\) \(\forall x\in\calV\) \(\alpha\cdot(\beta\cdot x)=(\alpha\cdot\beta)\cdot x\) (associativity)
    7. \(\forall \alpha, \beta\in\bbK\) \(\forall x\in\calV\) \((\alpha+\beta)x = \alpha x+\beta x\) (distributivity)
    8. \(\forall \alpha\in\bbK\) \(\forall x,y\in\calV\) \(\alpha(x+y) = \alpha x+\alpha y\) (distributivity)
    • \(0\in\calV\) is unique
    • Every \(x\in\calV\) has a unique inverse
    • \(0\cdot x = 0\)
    • The inverse of \(x\in\calV\) is \((-1)\cdot x\eqdef -x\)
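    • For instance, the third consequence follows directly from the axioms: \(0\cdot x = (0+0)\cdot x = 0\cdot x+0\cdot x\) by distributivity, and adding the inverse of \(0\cdot x\) to both sides gives \(0\cdot x = 0\).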

Vector spaces

  • A subset \(\calW\) of a vector space \(\calV\) is a vector subspace if \(\forall x,y\in\calW\) \(\forall \lambda,\mu\in\bbK\) \(\lambda x+\mu y \in\calW\)

  • If \(\calW\) is a vector subspace of a vector space \(\calV\), \(\calW\) is a vector space.
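  • Example: \(\calW \eqdef \{x\in\bbR^3 : x_1+x_2+x_3=0\}\) is a vector subspace of \(\bbR^3\): any combination \(\lambda x+\mu y\) of vectors whose coordinates sum to zero again has coordinates summing to zero.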

Norm and inner product

  • The properties of vector spaces seen thus far provide an algebraic structure

  • We are missing a topological structure to measure length and distance

  • A norm on a vector space \(\calV\) over \(\bbR\) is a function \(\norm{\cdot}:\calV\to\bbR\) that satisfies:
    • Positive definiteness: \(\forall x\in\calV\) \(\norm{x}\geq 0\) with equality iff \(x=0\)
    • Homogeneity: \(\forall x\in\calV\) \(\forall\alpha\in\bbR\) \(\norm{\alpha x}=\abs{\alpha}\norm{x}\)
    • Subadditivity: \(\forall x,y\in\calV\) \(\norm{x+y}\leq \norm{x}+\norm{y}\)
  • \(\norm{x}\) measures a length, \(\norm{x-y}\) measures a distance
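  • Example: on \(\bbR^n\), \(\norm[2]{x}\eqdef\sqrt{\sum_{i=1}^n x_i^2}\), \(\norm[1]{x}\eqdef\sum_{i=1}^n\abs{x_i}\), and \(\norm[\infty]{x}\eqdef\max_i\abs{x_i}\) all satisfy the three properties above.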

  • In addition to a topological and algebraic structure, what if we want to do geometry?

  • An inner product space over \(\bbR\) is a vector space \(\calV\) equipped with a positive definite symmetric bilinear form \(\dotp{\cdot}{\cdot}:\calV\times\calV\to\bbR\) called an inner product

  • An inner product space is also called a pre-Hilbert space
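  • Example: \(\bbR^n\) with \(\dotp{x}{y}\eqdef x^Ty\) is an inner product space; every inner product induces a norm through \(\norm{x}\eqdef\sqrt{\dotp{x}{x}}\).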