Particle Filtering

Matthieu Bloch

Thursday, November 10, 2022

Today in ECE 6555

  • Announcements
    • 7 lectures left (including today)
    • Kalman filtering mini project due today (late deadline November 13, 2022)
  • Last time
    • Importance sampling
  • Today
    • Particle Filtering
    • Gaussian processes
  • Questions?

Last time: Monte-Carlo approximations

  • Gaussian approximations often work well…
    • but cannot capture multi-modal distributions, discrete components, etc.
  • Particle filters are Monte-Carlo methods that approximate the solution of the Bayesian filtering equations
  • Monte Carlo approximation
    • The central difficulty in Bayesian inference is to compute \[ \E{g(\vecx)|\vecy_{1:T}} \eqdef \int g(\vecx)p(\vecx|\vecy_{1:T})d\vecx \] where \(g:\bbR^n\to\bbR^m\)
    • No closed-form expression in general
    • Approach
      • Draw \(n\) samples \(\vecx_i\sim p(\vecx|\vecy_{1:T})\)
      • Estimate \[ \E{g(\vecx)|\vecy_{1:T}} \approx \frac{1}{n}\sum_{i=1}^n g(\vecx_i) \]
    • The approximation error converges as \(\calO(n^{-1/2})\) regardless of the dimension (see the numerical sketch below)
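  • A minimal numerical sketch of the Monte-Carlo estimator (illustrative only: the posterior \(p(\vecx|\vecy_{1:T})\) is taken to be a known one-dimensional Gaussian so that we can sample from it directly and compare against the exact answer):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration: pretend the posterior p(x | y_{1:T}) is N(1, 0.5^2),
# so we can sample from it directly and know the exact expectation.
def sample_posterior(n):
    return rng.normal(loc=1.0, scale=0.5, size=n)

g = lambda x: x**2            # quantity of interest E[g(x) | y_{1:T}]
exact = 1.0**2 + 0.5**2       # E[x^2] = mean^2 + variance = 1.25

for n in (10**2, 10**4, 10**6):
    x = sample_posterior(n)
    estimate = np.mean(g(x))  # (1/n) sum_i g(x_i)
    print(n, estimate, abs(estimate - exact))
```

  • Multiplying \(n\) by 100 shrinks the error by roughly a factor of 10, consistent with the \(\calO(n^{-1/2})\) rate.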

Last time: Importance sampling

  • It might still be challenging to sample according to \(p(\vecx|\vecy_{1:T})\) (complicated functional form)
  • Importance sampling
    • Use an importance distribution \(\pi(\vecx|\vecy_{1:T})\) \[ \E{g(\vecx)|\vecy_{1:T}} \eqdef \int g(\vecx)p(\vecx|\vecy_{1:T})d\vecx = \int g(\vecx)\frac{p(\vecx|\vecy_{1:T})}{\pi(\vecx|\vecy_{1:T})}\pi(\vecx|\vecy_{1:T})d\vecx \] assuming \(p(\vecx|\vecy_{1:T}) \ll \pi(\vecx|\vecy_{1:T})\), i.e., the importance distribution puts mass wherever the posterior does
    • Draw \(n\) samples \(\vecx^{(i)}\sim \pi(\vecx|\vecy_{1:T})\)
    • Estimate \[ \E{g(\vecx)|\vecy_{1:T}} \approx \sum_{i=1}^n \tilde{w}^{(i)} g(\vecx^{(i)})\textsf{ where } \tilde{w}^{(i)} \eqdef \frac{1}{n}\frac{p(\vecx^{(i)}|\vecy_{1:T})}{\pi(\vecx^{(i)}|\vecy_{1:T})} \]
    • Challenge: assumes that we can evaluate \(p(\vecx^{(i)}|\vecy_{1:T})\)

Last time: Importance sampling

  • In general, \(p(\vecy_{1:T}|\vecx^{(i)})\) is easier to evaluate (measurement model)

  • Use Bayes' rule \[ p(\vecx^{(i)}|\vecy_{1:T}) = \frac{p(\vecy_{1:T}|\vecx^{(i)})p(\vecx^{(i)})}{\int p(\vecy_{1:T}|\vecx)p(\vecx)d\vecx} \] and note the denominator is still hard to compute!

  • Importance sampling to the rescue \[ \E{g(\vecx)|\vecy_{1:T}} \approx \sum_{i=1}^n \tilde{w}^{(i)} g(\vecx^{(i)})\textsf{ where } \tilde{w}^{(i)} \eqdef \frac{\frac{p(\vecy_{1:T}|\vecx^{(i)})p(\vecx^{(i)})}{\pi(\vecx^{(i)}|\vecy_{1:T})}}{\sum_{j=1}^n\frac{p(\vecy_{1:T}|\vecx^{(j)})p(\vecx^{(j)})}{\pi(\vecx^{(j)}|\vecy_{1:T})}} \]

Last time: Importance sampling algorithm

  • Assume measurement model \(p(\vecy_{1:T}|\vecx)\) and prior \(p(\vecx)\)
    1. Draw \(n\) samples \[ \vecx^{(i)}\sim \pi(\vecx|\vecy_{1:T})\quad i=1\dots n \]
    2. Compute unnormalized weights \[ v^{(i)} = \frac{p(\vecy_{1:T}|\vecx^{(i)})p(\vecx^{(i)})}{\pi(\vecx^{(i)}|\vecy_{1:T})} \]
    3. Compute normalized weights \[ w^{(i)} = \frac{v^{(i)}}{\sum_{j=1}^n v^{(j)}} \]
    4. Approximate posteriors \[ p(\vecx|\vecy_{1:T})\approx \sum_{i=1}^n w^{(i)}\delta(\vecx-\vecx^{(i)})\qquad \E{g(\vecx)|\vecy_{1:T}}\approx \sum_{i=1}^n w^{(i)} g(\vecx^{(i)}) \]
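  • A short sketch of the four steps above on a toy one-dimensional problem (all names and numbers are illustrative, not from the lecture): prior \(x\sim\calN(0,1)\), a single measurement \(y = x + r\) with \(r\sim\calN(0,0.5^2)\), and a Gaussian importance distribution centred at the observed value:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Illustrative 1-D setup: prior x ~ N(0, 1), measurement y = x + noise
# with noise ~ N(0, 0.5^2); the observed value is y = 0.8.
y_obs = 0.8
prior_pdf = lambda x: norm.pdf(x, loc=0.0, scale=1.0)
likelihood = lambda x: norm.pdf(y_obs, loc=x, scale=0.5)

# Importance distribution pi(x | y): a broad Gaussian centred at the data.
pi_mean, pi_std = y_obs, 1.0
pi_pdf = lambda x: norm.pdf(x, loc=pi_mean, scale=pi_std)

n = 10_000
x = rng.normal(loc=pi_mean, scale=pi_std, size=n)   # 1. draw samples from pi
v = likelihood(x) * prior_pdf(x) / pi_pdf(x)        # 2. unnormalized weights
w = v / v.sum()                                     # 3. normalized weights

# 4. approximate posterior expectations, e.g. the posterior mean and second moment
post_mean = np.sum(w * x)
post_second_moment = np.sum(w * x**2)
print(post_mean, post_second_moment)
```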

Last time: Sequential importance sampling

  • Sequential importance sampling is a variation that can be used for probabilistic state-space models with \[ \vecx_k\sim p(\vecx_k|\vecx_{k-1})\qquad \vecy_k\sim p(\vecy_k|\vecx_k) \]

  • Idea

    • Keep track of particles \(\set{(w^{(i)}_{k},\vecx^{(i)}_k):i=1\dots n}\) at every time \(k\) to compute \[ p(\vecx_k|\vecy_{1:k})\approx \sum_{i=1}^n w_k^{(i)}\delta(\vecx_k-\vecx_k^{(i)})\qquad \E{g(\vecx_k)|\vecy_{1:k}}\approx \sum_{i=1}^n w_k^{(i)} g(\vecx_k^{(i)}) \]
    • Use properties of state-space model to only keep track of current states at time \(k\), not entire history

Sequential importance sampling

  1. Draw \(n\) samples from the prior \[ \vecx_0^{(i)}\sim p(\vecx_0)\quad i=1\dots n \] and set \(w_0^{(i)}=\frac{1}{n}\) for \(i=1\dots n\)
  2. For each \(k=1\dots T\)
    1. Draw samples \(\vecx_k^{(i)}\) from importance distributions \[ \vecx_k^{(i)}\sim \pi(\vecx_k|\vecx^{(i)}_{0:k-1},\vecy_{1:k})\quad i=1\dots n \]
    2. Compute new weights \[ w_k^{(i)}\propto w_{k-1}^{(i)} \frac{p(\vecy_k|\vecx_k^{(i)})p(\vecx_k^{(i)}|\vecx_{k-1}^{(i)})}{\pi(\vecx_k^{(i)}|\vecx_{0:k-1}^{(i)},\vecy_{1:k})} \] and normalize
  • Try to choose \(\pi(\vecx_k|\vecx_{0:k-1},\vecy_{1:k})=\pi(\vecx_k|\vecx_{k-1},\vecy_{1:k})\)
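  • A sketch of sequential importance sampling for an illustrative one-dimensional linear-Gaussian state-space model (the model, its parameters, and the bootstrap choice \(\pi(\vecx_k|\vecx_{k-1},\vecy_{1:k}) = p(\vecx_k|\vecx_{k-1})\) are assumptions made here for concreteness; with that choice the transition density cancels and the weight update reduces to the likelihood):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Illustrative 1-D state-space model: x_k = 0.9 x_{k-1} + q_k,  y_k = x_k + r_k.
a, q_std, r_std = 0.9, 1.0, 0.5
T, n = 50, 500

# Simulate a trajectory to have some data to filter.
x_true = np.zeros(T + 1)
y = np.zeros(T + 1)
for k in range(1, T + 1):
    x_true[k] = a * x_true[k - 1] + q_std * rng.standard_normal()
    y[k] = x_true[k] + r_std * rng.standard_normal()

# 1. initialize particles from the prior x_0 ~ N(0, 1) with uniform weights
particles = rng.standard_normal(n)
w = np.full(n, 1.0 / n)

for k in range(1, T + 1):
    # 2a. propose from pi(x_k | x_{k-1}, y_{1:k}); here the transition prior
    #     p(x_k | x_{k-1}) (bootstrap choice), so the proposal term cancels below.
    particles = a * particles + q_std * rng.standard_normal(n)
    # 2b. weight update: w_k ∝ w_{k-1} * p(y_k | x_k), then normalize
    w = w * norm.pdf(y[k], loc=particles, scale=r_std)
    w = w / w.sum()

print("estimate:", np.sum(w * particles), "true state:", x_true[T])
```

  • Without resampling the weights eventually degenerate, which motivates the particle filter on the next slide.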

Particle filter

  • Resampling might be needed when most weights become negligible (the corresponding particles effectively disappear)
  1. Draw \(n\) samples from the prior \(\vecx_0^{(i)}\sim p(\vecx_0)\quad i=1\dots n\) and set \(w_0^{(i)}=\frac{1}{n}\) for \(i=1\dots n\)
  2. For each \(k=1\dots T\)
    1. Draw samples \(\vecx_k^{(i)}\) from importance distributions \[ \vecx_k^{(i)}\sim \pi(\vecx_k|\vecx^{(i)}_{0:k-1},\vecy_{1:k})\quad i=1\dots n \]
    2. Compute new weights \[ w_k^{(i)}\propto w_{k-1}^{(i)} \frac{p(\vecy_k|\vecx_k^{(i)})p(\vecx_k^{(i)}|\vecx_{k-1}^{(i)})}{\pi(\vecx_k^{(i)}|\vecx_{0:k-1}^{(i)},\vecy_{1:k})} \] and normalize
    3. If the effective number of particles gets too low, resample and reset to uniform weights \[ n_\textsf{eff}\approx\frac{1}{\sum_{i=1}^n (w_k^{(i)})^2} \]
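  • A minimal sketch of the resampling step (the multinomial scheme and the threshold \(n_\textsf{eff} < n/2\) are common choices assumed here, not prescribed by the lecture):

```python
import numpy as np

def effective_sample_size(w):
    # n_eff ≈ 1 / sum_i (w_i)^2 for normalized weights w
    return 1.0 / np.sum(w**2)

def resample(particles, w, rng):
    # Multinomial resampling: draw n indices with probabilities w,
    # keep the selected particles, and reset to uniform weights.
    n = len(particles)
    idx = rng.choice(n, size=n, p=w)
    return particles[idx], np.full(n, 1.0 / n)
```

  • Inside the sequential importance sampling loop above, this would be invoked after the weight normalization, e.g. `if effective_sample_size(w) < n / 2: particles, w = resample(particles, w, rng)`.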

Gaussian processes

  • What if we don't know part of the system?
    • Need to specify what we mean by "don't know": statistics, functional form?
  • Two solutions to extend our results
    1. Show robustness to uncertainty, i.e., show that the results still approximately hold with "unknown" disturbances
    2. Integrate the ability to learn what we don't know
  • Which solution to adopt depends on the problem, e.g., can we afford to learn?
  • Gaussian processes have emerged as a powerful tool to model unknown functions and learn them from samples

  • A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution

  • A Gaussian process is completely specified by a mean function \(m(\vecx)\) and a covariance function \(k(\vecx,\vecx')\) of a real-valued process \(f(\vecx)\) such that \[ m(\vecx)\eqdef \E{f(\vecx)}\qquad k(\vecx,\vecx') \eqdef \E{(f(\vecx)-m(\vecx))(f(\vecx')-m(\vecx'))} \]

  • Possible to generalize to vector-valued functions (more on this later), often assume \(m(\vecx)=0\)
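  • As a small illustration of the definition, the sketch below (assuming zero mean and, for concreteness, a squared-exponential covariance) evaluates the process on a finite grid of inputs, where by definition it is jointly Gaussian, and draws sample paths:

```python
import numpy as np

rng = np.random.default_rng(0)

def k(xa, xb):
    # Example covariance function: k(x, x') = exp(-0.5 * (x - x')^2) for scalar inputs
    return np.exp(-0.5 * (xa[:, None] - xb[None, :])**2)

x = np.linspace(-5.0, 5.0, 200)          # finite collection of input points
K = k(x, x) + 1e-9 * np.eye(len(x))      # covariance matrix (jitter for stability)

# By definition, (f(x_1), ..., f(x_200)) is jointly Gaussian with mean 0 and covariance K
f_samples = rng.multivariate_normal(mean=np.zeros(len(x)), cov=K, size=3)
```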

Gaussian Processes

  • Example of kernel: squared exponential kernel \[ k(\vecx,\vecx') \eqdef \exp\left(-\frac{1}{2}\norm[2]{\vecx-\vecx'}^2\right) \]
    • The kernel is a hyperparameter of the model
    • The kernel controls the smoothness of the functions we model
  • Key benefit of GPs: incorporate knowledge of observations of the function
    • Assume we know \(n\) observations \((\vecx_i,f_i)\)
    • Assume we would like to approximate the function at \(n^*\) test points \((\vecx_j^*,f_j^*)\)
    • The joint distribution is \[ \left[\begin{array}{c}\vecf\\\vecf^*\end{array}\right]\sim\calN\left(\boldsymbol{0},\left[\begin{array}{cc}\matK(X,X)&\matK(X,X^*)\\\matK(X^*,X)&\matK(X^*,X^*)\end{array}\right]\right) \]
    • How do we use this to estimate \(f^*\)?
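  • A sketch of how the joint distribution above is assembled in code, assuming the squared-exponential kernel and noise-free observations of an arbitrary test function (here \(\sin\), purely for illustration); the last two lines apply the Gaussian conditioning formula that the Schur complement discussion below makes precise:

```python
import numpy as np

def sq_exp_kernel(xa, xb):
    # k(x, x') = exp(-0.5 * (x - x')^2) for scalar inputs
    return np.exp(-0.5 * (xa[:, None] - xb[None, :])**2)

# Noise-free training observations (x_i, f_i) of an unknown function (here sin)
X = np.array([-4.0, -2.0, 0.0, 1.0, 3.0])
f = np.sin(X)
Xs = np.linspace(-5.0, 5.0, 100)                   # test points x*

K = sq_exp_kernel(X, X) + 1e-9 * np.eye(len(X))    # K(X, X) with jitter
Ks = sq_exp_kernel(X, Xs)                          # K(X, X*)
Kss = sq_exp_kernel(Xs, Xs)                        # K(X*, X*)

# Conditioning the joint Gaussian on f gives the GP posterior at the test points
mean_star = Ks.T @ np.linalg.solve(K, f)           # E[f* | f]
cov_star = Kss - Ks.T @ np.linalg.solve(K, Ks)     # Cov[f* | f]
```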

Schur complement