Singular value decomposition

Dr. Matthieu R Bloch

Monday, November 15, 2021

Logistics

  • General announcements

    • Assignment 6 to be posted… (grading traffic jam)

    • 7 lectures left!

  • Assignment 5 and Midterm 2:

    • Grading starting, we’ll keep you posted

What’s on the agenda for today?

[Figure: Toddlers can do it!]
  • Last time:

    • Singular value decomposition
  • Today

    • Application to solving least squares
  • Reading: lecture notes 13/14

Singular value decomposition

  • What happens for non-square matrices?

  • Let \(\matA\in\bbR^{m\times n}\) with \(\text{rank}(\matA)=r\). Then \(\matA=\matU\boldsymbol{\Sigma}\matV^T\) where

    • \(\matU\in\bbR^{m\times r}\) such that \(\matU^\intercal\matU=\bfI_r\) (orthonormal columns)
    • \(\matV\in\bbR^{n\times r}\) such that \(\matV^\intercal\matV=\bfI_r\) (orthonormal columns)
    • \(\boldsymbol{\Sigma}\in\bbR^{r\times r}\) is diagonal with positive entries \[ \boldsymbol{\Sigma}\eqdef\mat{cccc}{\sigma_1&0&0&\cdots\\0&\sigma_2&0&\cdots\\\vdots&&\ddots&\\0&\cdots&\cdots&\sigma_r} \] and \(\sigma_1\geq\sigma_2\geq\cdots\geq\sigma_r>0\). The \(\sigma_i\) are called the singular values
  • We say that \(\matA\) is full rank if \(r=\min(m,n)\)

  • We can write \(\matA=\sum_{i=1}^r\sigma_i\vecu_i\vecv_i^\intercal\)

  • Important properties of the SVD (see the numerical sketch after this list)

    • The columns of \(\matV\), \(\set{\vecv_i}_{i=1}^r\), are eigenvectors of the psd matrix \(\matA^\intercal\matA\); the singular values \(\set{\sigma_i}_{i=1}^r\) are the square roots of the non-zero eigenvalues of \(\matA^\intercal\matA\).

    • The columns of \(\matU\), \(\set{\vecu_i}_{i=1}^r\), are eigenvectors of the psd matrix \(\matA\matA^\intercal\); the singular values \(\set{\sigma_i}_{i=1}^r\) are the square roots of the non-zero eigenvalues of \(\matA\matA^\intercal\).

    • The columns of \(\matV\) form an orthobasis for \(\text{row}(\matA)\)

    • The columns of \(\matU\) form an orthobasis for \(\text{col}(\matA)\)

    • Equivalent form of the SVD: \(\matA=\widetilde{\matU}\widetilde{\boldsymbol{\Sigma}}\widetilde{\matV}^T\) where

      • \(\widetilde{\matU}\in\bbR^{m\times m}\) is orthonormal
      • \(\widetilde{\matV}\in\bbR^{n\times n}\) is orthonormal
      • \(\widetilde{\boldsymbol{\Sigma}}\in\bbR^{m\times n}\) is

      \[ \widetilde{\boldsymbol{\Sigma}}\eqdef\mat{cc}{\boldsymbol{\Sigma}&\boldsymbol{0}\\\boldsymbol{0}&\boldsymbol{0}} \]
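
To make the definitions concrete, here is a minimal numpy sketch (not part of the lecture notes; all variable names are ours) that builds a random rank-\(r\) matrix, extracts its compact SVD, and checks the properties listed above.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 6, 4, 2
# Random rank-r matrix A in R^{m x n}
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

# Full SVD: U_full is m x m, Vt_full is n x n, s has min(m, n) entries
U_full, s, Vt_full = np.linalg.svd(A, full_matrices=True)

# Compact SVD: keep only the strictly positive singular values
tol = 1e-10
r_eff = int(np.sum(s > tol))
U, Sigma, Vt = U_full[:, :r_eff], np.diag(s[:r_eff]), Vt_full[:r_eff, :]

assert np.allclose(U @ Sigma @ Vt, A)          # A = U Sigma V^T
assert np.allclose(U.T @ U, np.eye(r_eff))     # orthonormal columns of U
assert np.allclose(Vt @ Vt.T, np.eye(r_eff))   # orthonormal columns of V

# The sigma_i are the square roots of the non-zero eigenvalues of A^T A
eig = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1][:r_eff]
assert np.allclose(np.sqrt(eig), s[:r_eff])
```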

SVD and least-squares

    • When we cannot solve \(\vecy=\matA\vecx\), we solve instead \[ \min_{\bfx\in\bbR^n}\norm[2]{\vecx}^2\text{ such that } \matA^\intercal\matA\vecx = \matA^\intercal\vecy \]

      • This allows us to pick the minimum norm solution among potentially infinitely many solutions of the normal equations.
    • Recall: when \(\matA\in\bbR^{m\times n}\) has rank \(m\), then \(\hat{\bfx}=\matA^\intercal(\matA\matA^\intercal)^{-1}\vecy\)

    • The solution of \[ \min_{\bfx\in\bbR^n}\norm[2]{\vecx}^2\text{ such that } \matA^\intercal\matA\vecx = \matA^\intercal\vecy \] is \[ \hat{\vecx} = \matV\boldsymbol{\Sigma}^{-1}\matU^\intercal\vecy = \sum_{i=1}^r\frac{1}{\sigma_i}\dotp{\vecy}{\vecu_i}\vecv_i \] where \(\matA=\matU\boldsymbol{\Sigma}\matV^T\) is the SVD of \(\matA\).
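
A small numerical check, assuming numpy (the code and names are ours, not from the notes): the formula \(\hat{\vecx}=\matV\boldsymbol{\Sigma}^{-1}\matU^\intercal\vecy\) built from the compact SVD satisfies the normal equations and coincides with the minimum-norm solution returned by numpy's pseudo-inverse.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 5, 8, 3                       # underdetermined and rank-deficient
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
y = rng.standard_normal(m)

# Compact SVD of A, keeping only the strictly positive singular values
U, s, Vt = np.linalg.svd(A, full_matrices=False)
keep = s > 1e-10
U, s, Vt = U[:, keep], s[keep], Vt[keep, :]

# x_hat = V Sigma^{-1} U^T y = sum_i (1/sigma_i) <y, u_i> v_i
x_hat = Vt.T @ ((U.T @ y) / s)

# x_hat solves the normal equations and matches the pseudo-inverse solution
assert np.allclose(A.T @ A @ x_hat, A.T @ y)
assert np.allclose(x_hat, np.linalg.pinv(A, rcond=1e-10) @ y)
```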

Pseudo inverse

    • \(\matA^+ = \matV\boldsymbol{\Sigma}^{-1}\matU^\intercal\) is called the pseudo-inverse, Lanczos inverse, or Moore-Penrose inverse of \(\matA=\matU\boldsymbol{\Sigma}\matV^T\).

    • If \(\matA\) is square and invertible then \(\matA^+=\matA^{-1}\)

    • If \(m\geq n\) (tall and skinny matrix) of rank \(n\) then \(\matA^+ = (\matA^\intercal\matA)^{-1}\matA^\intercal\)

    • If \(m\leq n\) (short and fat matrix) of rank \(m\) then \(\matA^+ = \matA^\intercal(\matA\matA^\intercal)^{-1}\)

    • Note \(\matA^+\) is as “close” to an inverse of \(\matA\) as possible
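
As a sanity check (ours, assuming numpy), np.linalg.pinv computes \(\matV\boldsymbol{\Sigma}^{-1}\matU^\intercal\), and the three special cases above can be verified directly on random matrices (which are full rank with probability one).

```python
import numpy as np

rng = np.random.default_rng(2)

# Tall, full column rank: A^+ = (A^T A)^{-1} A^T
A_tall = rng.standard_normal((7, 3))
assert np.allclose(np.linalg.pinv(A_tall),
                   np.linalg.inv(A_tall.T @ A_tall) @ A_tall.T)

# Short, full row rank: A^+ = A^T (A A^T)^{-1}
A_wide = rng.standard_normal((3, 7))
assert np.allclose(np.linalg.pinv(A_wide),
                   A_wide.T @ np.linalg.inv(A_wide @ A_wide.T))

# Square invertible: A^+ = A^{-1}
A_sq = rng.standard_normal((4, 4))
assert np.allclose(np.linalg.pinv(A_sq), np.linalg.inv(A_sq))
```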

Stability of least squares

    • What if we observe \(\vecy = \matA\vecx_0+\vece\) and we apply the pseudo inverse? \(\hat{\vecx} = \matA^+\vecy\)

    • We can separate the error analysis into two components \[ \hat{\vecx}-\vecx_0 = \underbrace{\matA^+\matA\vecx_0-\vecx_0}_{\text{null space error}} + \underbrace{\matA^+\vece}_{\text{noise error}} \]

    • We will express the error in terms of the SVD \(\matA=\matU\boldsymbol{\Sigma}\matV^\intercal\) with

      • \(\set{\vecv_i}_{i=1}^r\) orthobasis of \(\text{row}(\matA)\), augmented by \(\set{\vecv_i}_{i=r+1}^{n}\subset\ker{\matA}\) to form an orthobasis of \(\bbR^n\)
      • \(\set{\vecu_i}_{i=1}^r\) orthobasis of \(\text{col}(\matA)\), augmented by \(\set{\vecu_i}_{i=r+1}^{m}\subset\ker{\matA^\intercal}\) to form an orthobasis of \(\bbR^m\)
    • The null space error is given by \[ \norm[2]{\matA^+\matA\vecx_0-\vecx_0}^2=\sum_{i=r+1}^n\abs{\dotp{\vecv_i}{\vecx_0}}^2 \]

    • The noise error is given by \[ \norm[2]{\matA^+\vece}^2=\sum_{i=1}^r \frac{1}{\sigma_i^2}\abs{\dotp{\vece}{\vecu_i}}^2 \]
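
The decomposition can be verified numerically; the sketch below (ours, assuming numpy) computes both error terms from the SVD and checks that they add up to \(\norm[2]{\hat{\vecx}-\vecx_0}^2\), since the two error components live in orthogonal subspaces.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, r = 8, 6, 3
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
x0 = rng.standard_normal(n)
e = 0.01 * rng.standard_normal(m)
y = A @ x0 + e

U, s, Vt = np.linalg.svd(A, full_matrices=True)
r_eff = int(np.sum(s > 1e-10))

x_hat = np.linalg.pinv(A, rcond=1e-10) @ y

# Null space error: sum_{i>r} |<v_i, x0>|^2
null_err = np.sum((Vt[r_eff:, :] @ x0) ** 2)
# Noise error: sum_{i<=r} |<e, u_i>|^2 / sigma_i^2
noise_err = np.sum((U[:, :r_eff].T @ e) ** 2 / s[:r_eff] ** 2)

# The two components are orthogonal, so their squared norms add up
assert np.isclose(np.linalg.norm(x_hat - x0) ** 2, null_err + noise_err)
```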

Stable reconstruction by truncation

    • How do we mitigate the effect of small singular values in reconstruction? \[ \hat{\vecx} = \matV\boldsymbol{\Sigma}^{-1}\matU^\intercal\vecy = \sum_{i=1}^r\frac{1}{\sigma_i}\dotp{\vecy}{\vecu_i}\vecv_i \]

    • Truncate the SVD to \(r'<r\) \[ \matA_t\eqdef \sum_{i=1}^{r'}\sigma_i\vecu_i\vecv_i^\intercal\qquad\matA_t^+ = \sum_{i=1}^{r'}\frac{1}{\sigma_i}\vecu_i\vecv_i^\intercal \]

    • Reconstruct \(\hat{\vecx}_t = \matA_t^+\vecy = \sum_{i=1}^{r'}\frac{1}{\sigma_i}\dotp{\vecy}{\vecu_i}\vecv_i\)

    • Error analysis: \[ \norm[2]{\hat{\vecx}_t-\vecx_0}^2 = \sum_{i=r+1}^n\abs{\dotp{\vecx_0}{\vecv_i}}^2+\sum_{i=r'+1}^r\abs{\dotp{\vecx_0}{\vecv_i}}^2+\sum_{i=1}^{r'}\frac{1}{\sigma_i^2}\abs{\dotp{\vece}{\vecu_i}}^2 \]
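
A minimal sketch of the resulting trade-off (ours, assuming numpy; the matrix construction is an assumption for illustration): with an ill-conditioned \(\matA\), an intermediate truncation level \(r'\) typically beats both aggressive truncation and the full pseudo-inverse.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 50, 30
# Ill-conditioned matrix with geometrically decaying singular values
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.logspace(0, -6, n)
A = U @ np.diag(s) @ V.T

x0 = rng.standard_normal(n)
y = A @ x0 + 1e-4 * rng.standard_normal(m)

def truncated_solve(r_t):
    # x_hat_t = A_t^+ y = sum_{i<=r'} (1/sigma_i) <y, u_i> v_i
    return V[:, :r_t] @ ((U[:, :r_t].T @ y) / s[:r_t])

# The intermediate truncation level usually gives the smallest error:
# small r' incurs truncation bias, large r' amplifies the noise.
for r_t in (5, 15, 30):
    err = np.linalg.norm(truncated_solve(r_t) - x0)
    print(f"r' = {r_t:2d}  ||x_hat_t - x0|| = {err:.3e}")
```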

Stable reconstruction by regularization

    • Regularization means changing the problem to solve \[ \min_{\vecx\in\bbR^n}\norm[2]{\vecy-\matA\vecx}^2+\lambda\norm[2]{\vecx}^2\qquad\ \lambda>0 \]

    • The solution is \[ \hat{\vecx} = (\matA^\intercal\matA+\lambda\matI)^{-1}\matA^\intercal\vecy = \matV(\boldsymbol{\Sigma}^2+\lambda\matI)^{-1}\boldsymbol{\Sigma}\matU^\intercal\vecy \]
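
A quick numerical check (ours, assuming numpy) that the two expressions for the regularized solution coincide:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, lam = 20, 10, 0.1
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)

# Direct regularized solution (A^T A + lambda I)^{-1} A^T y
x_direct = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)

# Same solution through the SVD: V (Sigma^2 + lambda I)^{-1} Sigma U^T y
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ (s / (s ** 2 + lam) * (U.T @ y))

assert np.allclose(x_direct, x_svd)
```

Note how the regularization replaces the factors \(1/\sigma_i\) by \(\sigma_i/(\sigma_i^2+\lambda)\), which stay bounded even for tiny singular values.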

Numerical methods

    • We have seen several solutions to systems of linear equations \(\matA\vecx=\vecy\) so far

      • \(\matA\) full column rank: \(\hat{\bfx} = (\matA^\intercal\matA)^{-1}\matA^\intercal\bfy\)
      • \(\matA\) full row rank: \(\hat{\bfx} = \matA^\intercal(\matA\matA^\intercal)^{-1}\bfy\)
      • Ridge regression: \(\hat{\bfx} = (\matA^\intercal\matA+\delta\matI)^{-1}\matA^\intercal\bfy\)
      • Kernel regression: \(\hat{\bfx} = (\matK+\delta\matI)^{-1}\bfy\)
      • Ridge regression in Hilbert space: \(\hat{\bfx} = (\matA^\intercal\matA+\delta\matG)^{-1}\matA^\intercal\bfy\)
    • Extension: constrained least-squares \[ \min_{\vecx\in\bbR^n}\norm[2]{\vecy-\matA\vecx}^2\text{ s.t. } \vecx=\matB\vecalpha\text{ for some }\vecalpha \]

      • The solution is \(\hat{\bfx} = \matB(\matB^\intercal\matA^\intercal\matA\matB)^{-1}\matB^\intercal\matA^\intercal\bfy\)
    • All these problems involve a symmetric positive definite system of equations.

      • Many methods exist to solve such systems efficiently, based on matrix factorization
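
The constrained least-squares formula can be checked by substituting \(\vecx=\matB\vecalpha\) and solving the ordinary least-squares problem in \(\vecalpha\); a minimal sketch, assuming numpy (names and dimensions are ours):

```python
import numpy as np

rng = np.random.default_rng(6)
m, n, k = 12, 8, 3
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, k))     # constraint: x = B alpha
y = rng.standard_normal(m)

# Closed form from the notes: x_hat = B (B^T A^T A B)^{-1} B^T A^T y
M = B.T @ A.T @ A @ B
x_hat = B @ np.linalg.solve(M, B.T @ A.T @ y)

# Equivalent: ordinary least squares in alpha for min ||y - (A B) alpha||^2
alpha, *_ = np.linalg.lstsq(A @ B, y, rcond=None)
assert np.allclose(x_hat, B @ alpha)
```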

Easy systems

    • Diagonal system
      • \(\bfA\in\bbR^{n\times n}\) invertible and diagonal
      • \(O(n)\) complexity
    • Orthogonal system
      • \(\bfA\in\bbR^{n\times n}\) invertible and orthogonal
      • \(O(n^2)\) complexity
    • Lower triangular system
      • \(\bfA\in\bbR^{n\times n}\) invertible and lower triangular
      • \(O(n^2)\) complexity
    • General strategy: factorize \(\matA\) to recover some of the structures above
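
As an illustration of the \(O(n^2)\) cost for triangular systems, here is a small forward-substitution sketch (ours, assuming numpy); optimized routines such as scipy.linalg.solve_triangular do the same job.

```python
import numpy as np

def forward_substitution(L, y):
    """Solve L x = y for an invertible lower-triangular L in O(n^2) operations."""
    n = L.shape[0]
    x = np.zeros(n)
    for i in range(n):
        # x_i = (y_i - sum_{j<i} L_ij x_j) / L_ii
        x[i] = (y[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

rng = np.random.default_rng(7)
n = 5
L = np.tril(rng.standard_normal((n, n))) + n * np.eye(n)  # well-conditioned
y = rng.standard_normal(n)
assert np.allclose(forward_substitution(L, y), np.linalg.solve(L, y))
```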