Mathematical Foundations of Machine Learning

Prof. Matthieu Bloch

Monday, October 28, 2024

Last time

  • Last class: Wednesday October 23, 2024
    • We proved the spectral theorem
    • We talked about relating the spectral theorem to the stability of solutions of \(\vecy=\matA\vecx\)
      • (\(\matA\) square and positive definite)
  • Today: We will talk about the singular value decomposition
  • To be effectively prepared for today's class, you should have:
    1. Gone over slides and read the associated lecture notes
    2. Started working on Homework 5 (due Tuesday October 29, 2024)
  • Logistics: use office hours to prepare for Homework 5
    • Jack Hill office hours: Wednesday 11:30am-12:30pm in TSRB and hybrid
    • Anuvab Sen office hours: Thursday 12pm-1pm in TSRB and hybrid
  • Homework 6: due Thursday November 7, 2024
  • Wednesday October 30, 2024 no synchronous class
    • I am traveling to a conference and will upload an asynchronous lecture

Midterm statistics

(Figure: midterm statistics.)

What can we do

  • We will be introducing recitations with GTAs every week
    • Opportunity to go through examples
    • Opportunity to work on a problem with some guidance
  • Think about your study habits

Spectral theorem

Every complex matrix \(\matA\) has at least one complex eigenvector, and every real symmetric matrix has real eigenvalues and at least one real eigenvector.

Every matrix \(\matA\in\bbC^{n\times n}\) is unitarily similar to an upper triangular matrix, i.e., \[ \bfA = \bfV\boldsymbol{\Delta}\bfV^\dagger \] with \(\boldsymbol{\Delta}\) upper triangular and \(\bfV^\dagger=\bfV^{-1}\).

Every Hermitian matrix is unitarily similar to a real-valued diagonal matrix.

  • Note that if \(\matA = \matV\matD\matV^\dagger\) then \(\matA = \sum_{i=1}^n\lambda_i \vecv_i\vecv_i^\dagger\) (see the numerical sketch after this list)
  • How about real-valued matrices \(\matA\in\bbR^{n\times n}\)?
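
A minimal numerical sketch of the rank-one expansion noted above, for a real symmetric matrix; this assumes numpy, and the matrix sizes and seed are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                        # real symmetric matrix

lam, V = np.linalg.eigh(A)               # real eigenvalues, orthonormal eigenvectors
assert np.allclose(V.T @ V, np.eye(4))   # V^T V = I

# Rank-one expansion A = sum_i lambda_i v_i v_i^T
A_rebuilt = sum(lam[i] * np.outer(V[:, i], V[:, i]) for i in range(4))
assert np.allclose(A, A_rebuilt)
```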

Singular value decomposition

  • What happens for non-square matrices?

Let \(\matA\in\bbR^{m\times n}\) with \(\text{rank}(\matA)=r\). Then \(\matA=\matU\boldsymbol{\Sigma}\matV^T\) where

  • \(\matU\in\bbR^{m\times r}\) such that \(\matU^\intercal\matU=\bfI_r\) (orthonormal columns)
  • \(\matV\in\bbR^{n\times r}\) such that \(\matV^\intercal\matV=\bfI_r\) (orthonormal columns)
  • \(\boldsymbol{\Sigma}\in\bbR^{r\times r}\) is diagonal with positive entries

\[ \boldsymbol{\Sigma}\eqdef\mat{cccc}{\sigma_1&0&0&\cdots\\0&\sigma_2&0&\cdots\\\vdots&&\ddots&\\0&\cdots&\cdots&\sigma_r} \] and \(\sigma_1\geq\sigma_2\geq\cdots\geq\sigma_r>0\). The \(\sigma_i\) are called the singular values

  • We say that \(\matA\) is full rank if \(r=\min(m,n)\)

  • We can write \(\matA=\sum_{i=1}^r\sigma_i\vecu_i\vecv_i^\intercal\)
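
A small sketch (assuming numpy) of the compact SVD \(\matA=\matU\boldsymbol{\Sigma}\matV^\intercal\) and the rank-one expansion above; the example matrix and the rank tolerance are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 4))  # rank <= 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # "economy" SVD
r = np.sum(s > 1e-10)                              # numerical rank
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]              # keep the r nonzero singular values

assert np.allclose(U.T @ U, np.eye(r))             # orthonormal columns of U
assert np.allclose(Vt @ Vt.T, np.eye(r))           # orthonormal columns of V
A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(r))
assert np.allclose(A, A_rebuilt)                   # A = sum_i sigma_i u_i v_i^T
```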

Important properties of the SVD

  • The columns \(\set{\vecv_i}_{i=1}^r\) of \(\matV\) are eigenvectors of the psd matrix \(\matA^\intercal\matA\).
    • The singular values \(\set{\sigma_i}_{i=1}^{r}\) are the square roots of the non-zero eigenvalues of \(\matA^\intercal\matA\).
  • The columns \(\set{\vecu_i}_{i=1}^r\) of \(\matU\) are eigenvectors of the psd matrix \(\matA\matA^\intercal\).
    • The singular values \(\set{\sigma_i}_{i=1}^{r}\) are the square roots of the non-zero eigenvalues of \(\matA\matA^\intercal\).
  • The columns of \(\matV\) form an orthobasis for \(\text{row}(\matA)\)
  • The columns of \(\matU\) form an orthobasis for \(\text{col}(\matA)\)
  • Equivalent form of the SVD: \(\matA=\widetilde{\matU}\widetilde{\boldsymbol{\Sigma}}\widetilde{\matV}^T\) where
    • \(\widetilde{\matU}\in\bbR^{m\times m}\) is orthonormal
    • \(\widetilde{\matV}\in\bbR^{n\times n}\) is orthonormal
    • \(\widetilde{\boldsymbol{\Sigma}}\in\bbR^{m\times n}\) is
    \[ \widetilde{\boldsymbol{\Sigma}}\eqdef\mat{cc}{\boldsymbol{\Sigma}&\boldsymbol{0}\\\boldsymbol{0}&\boldsymbol{0}} \]
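
A quick numerical check (numpy assumed) of the properties above: the squared singular values match the non-zero eigenvalues of \(\matA^\intercal\matA\), and the full-form factors are square orthogonal matrices. Sizes and tolerances are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
evals_AtA = np.linalg.eigvalsh(A.T @ A)[::-1]      # eigenvalues, descending

# Singular values are square roots of the eigenvalues of A^T A
assert np.allclose(s**2, evals_AtA[: len(s)])

# Full form: Utilde (m x m) and Vtilde (n x n) are orthogonal
Uf, sf, Vtf = np.linalg.svd(A, full_matrices=True)
assert np.allclose(Uf @ Uf.T, np.eye(6))
assert np.allclose(Vtf @ Vtf.T, np.eye(4))
```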

SVD and least-squares

  • When we cannot solve \(\vecy=\matA\vecx\), we solve instead \[ \min_{\bfx\in\bbR^n}\norm[2]{\vecx}^2\text{ such that } \matA^\intercal\matA\vecx = \matA^\intercal\vecy \]
    • This allows us to pick the minimum norm solution among potentially infinitely many solutions of the normal equations.
  • Recall: when \(\matA\in\bbR^{m\times n}\) is of rank \(m\), then \(\bfx=\matA^\intercal(\matA\matA^\intercal)^{-1}\vecy\)

The solution of \[ \min_{\bfx\in\bbR^n}\norm[2]{\vecx}^2\text{ such that } \matA^\intercal\matA\vecx = \matA^\intercal\vecy \] is \[ \hat{\vecx} = \matV\boldsymbol{\Sigma}^{-1}\matU^\intercal\vecy = \sum_{i=1}^r\frac{1}{\sigma_i}\dotp{\vecy}{\vecu_i}\vecv_i \] where \(\matA=\matU\boldsymbol{\Sigma}\matV^T\) is the SVD of \(\matA\).
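
As a sanity check (numpy assumed), the SVD formula above can be compared against the minimum-norm least-squares solution returned by numpy's pinv and lstsq; the data below are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 5))
y = rng.standard_normal(8)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_hat = Vt.T @ ((U.T @ y) / s)          # sum_i <y, u_i>/sigma_i * v_i

assert np.allclose(x_hat, np.linalg.pinv(A) @ y)
assert np.allclose(x_hat, np.linalg.lstsq(A, y, rcond=None)[0])
```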

Pseudo inverse

  • \(\matA^+ = \matV\boldsymbol{\Sigma}^{-1}\matU^\intercal\) is called the pseudo-inverse, Lanczos inverse, or Moore-Penrose inverse of \(\matA=\matU\boldsymbol{\Sigma}\matV^T\).
  • If \(\matA\) is square and invertible then \(\matA^+=\matA^{-1}\)
  • If \(m\geq n\) (tall and skinny matrix) of rank \(n\) then \(\matA^+ = (\matA^\intercal\matA)^{-1}\matA^\intercal\)
  • If \(m\leq n\) (short and fat matrix) of rank \(m\) then \(\matA^+ = \matA^\intercal(\matA\matA^\intercal)^{-1}\)
  • Note \(\matA^+\) is as "close" to an inverse of \(\matA\) as possible
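
A quick check (numpy assumed) of the special cases listed above; the matrices are random full-rank examples.

```python
import numpy as np

rng = np.random.default_rng(4)

A_tall = rng.standard_normal((6, 3))    # m > n, rank n
assert np.allclose(np.linalg.pinv(A_tall),
                   np.linalg.inv(A_tall.T @ A_tall) @ A_tall.T)

A_fat = rng.standard_normal((3, 6))     # m < n, rank m
assert np.allclose(np.linalg.pinv(A_fat),
                   A_fat.T @ np.linalg.inv(A_fat @ A_fat.T))

A_sq = rng.standard_normal((4, 4))      # square, (almost surely) invertible
assert np.allclose(np.linalg.pinv(A_sq), np.linalg.inv(A_sq))
```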

Stability of least squares

  • What if we observe \(\vecy = \matA\vecx_0+\vece\) and we apply the pseudo inverse? \(\hat{\vecx} = \matA^+\vecy\)
  • We can separate the error analysis into two components \[ \hat{\vecx}-\vecx_0 = \underbrace{\matA^+\matA\vecx_0-\vecx_0}_{\text{null space error}} + \underbrace{\matA^+\vece}_{\text{noise error}} \]

  • We will express the error in terms of the SVD \(\matA=\matU\boldsymbol{\Sigma}\matV^\intercal\) with

    • \(\set{\vecv_i}_{i=1}^r\) orthobasis of \(\text{row}(\matA)\), augmented by \(\set{\vecv_i}_{i=r+1}^{n}\subset\ker\matA\) to form an orthobasis of \(\bbR^n\)
    • \(\set{\vecu_i}_{i=1}^r\) orthobasis of \(\text{col}(\matA)\), augmented by \(\set{\vecu_i}_{i=r+1}^{m}\subset\ker\matA^\intercal\) to form an orthobasis of \(\bbR^m\)
  • The null space error is given by \[ \norm[2]{\matA^+\matA\vecx_0-\vecx_0}^2=\sum_{i=r+1}^n\abs{\dotp{\vecv_i}{\vecx_0}}^2 \]

  • The noise error is given by \[ \norm[2]{\matA^+\vece}^2=\sum_{i=1}^r \frac{1}{\sigma_i^2}\abs{\dotp{\vece}{\vecu_i}}^2 \]
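
A sketch (numpy assumed) checking the error split above numerically: the null-space error depends on the components of \(\vecx_0\) outside \(\text{row}(\matA)\), while the noise error amplifies noise components by \(1/\sigma_i\). The low-rank matrix, noise level, and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))  # rank 2, m=6, n=4
x0 = rng.standard_normal(4)
e = 0.01 * rng.standard_normal(6)

Apinv = np.linalg.pinv(A)
U, s, Vt = np.linalg.svd(A, full_matrices=True)
r = np.sum(s > 1e-10)

# Null space error: components of x0 orthogonal to row(A)
null_err = np.sum((Vt[r:] @ x0) ** 2)
assert np.isclose(np.linalg.norm(Apinv @ A @ x0 - x0) ** 2, null_err)

# Noise error: noise components in col(A), weighted by 1/sigma_i^2
noise_err = np.sum((U[:, :r].T @ e) ** 2 / s[:r] ** 2)
assert np.isclose(np.linalg.norm(Apinv @ e) ** 2, noise_err)
```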

Stable reconstruction by truncation

  • How do we mitigate the effect of small singular values in reconstruction? \[ \hat{\vecx} = \matV\boldsymbol{\Sigma}^{-1}\matU^\intercal\vecy = \sum_{i=1}^r\frac{1}{\sigma_i}\dotp{\vecy}{\vecu_i}\vecv_i \]

  • Truncate the SVD to \(r'<r\) \[ \matA_t\eqdef \sum_{i=1}^{r'}\sigma_i\vecu_i\vecv_i^\intercal\qquad\matA_t^+ = \sum_{i=1}^{r'}\frac{1}{\sigma_i}\vecu_i\vecv_i^\intercal \]

  • Reconstruct \(\hat{\vecx}_t = \sum_{i=1}^{r'}\frac{1}{\sigma_i}\dotp{\vecy}{\vecu_i}\vecv_i=\matA_t^+\vecy\)

  • Error analysis: \[ \norm[2]{\hat{\vecx}_t-\vecx_0}^2 = \sum_{i=r+1}^n\abs{\dotp{\vecx_0}{\vecv_i}}^2+\sum_{i=r'+1}^r\abs{\dotp{\vecx_0}{\vecv_i}}^2+\sum_{i=1}^{r'}\frac{1}{\sigma_i^2}\abs{\dotp{\vece}{\vecu_i}}^2 \]
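
A sketch (numpy assumed) of truncated reconstruction: keeping only the \(r'\) largest singular values avoids dividing noise components by the smallest \(\sigma_i\). The matrix, noise level, and truncation level are placeholders.

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((8, 5))
x0 = rng.standard_normal(5)
e = 0.05 * rng.standard_normal(8)
y = A @ x0 + e

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r_trunc = 3                                   # keep the 3 largest singular values

# x_t = sum_{i<=r'} <y, u_i>/sigma_i * v_i  (truncated pseudo-inverse applied to y)
x_t = Vt[:r_trunc].T @ ((U[:, :r_trunc].T @ y) / s[:r_trunc])

x_full = Vt.T @ ((U.T @ y) / s)               # un-truncated reconstruction
print(np.linalg.norm(x_t - x0), np.linalg.norm(x_full - x0))
```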

Stable reconstruction by regularization

  • Regularization means changing the problem to solve \[ \min_{\vecx\in\bbR^n}\norm[2]{\vecy-\matA\vecx}^2+\lambda\norm[2]{\vecx}^2\qquad\ \lambda>0 \]

  • The solution is \[ \hat{\vecx} = (\matA^\intercal\matA+\lambda\matI)^{-1}\matA^\intercal\vecy = \matV(\boldsymbol{\Sigma}^2+\lambda\matI)^{-1}\boldsymbol{\Sigma}\matU^\intercal\vecy \]
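
A minimal check (numpy assumed) that the regularized solution matches the SVD form above, where each component is shrunk by the filter factor \(\sigma_i/(\sigma_i^2+\lambda)\); the data and \(\lambda\) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((8, 5))
y = rng.standard_normal(8)
lam = 0.1

# Direct solution of the regularized normal equations
x_ridge = np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ y)

# SVD form: shrink each component by sigma_i / (sigma_i^2 + lambda)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))

assert np.allclose(x_ridge, x_svd)
```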

Numerical methods

  • We have seen several solutions to systems of linear equations \(\matA\vecx=\vecy\) so far

    • \(\matA\) full column rank: \(\hat{\bfx} = (\matA^\intercal\matA)^{-1}\matA^\intercal\bfy\)
    • \(\matA\) full row rank: \(\hat{\bfx} = \matA^\intercal(\matA\matA^\intercal)^{-1}\bfy\)
    • Ridge regression: \(\hat{\bfx} = (\matA^\intercal\matA+\delta\matI)^{-1}\matA^\intercal\bfy\)
    • Kernel regression: \(\hat{\bfx} = (\matK+\delta\matI)^{-1}\bfy\)
    • Ridge regression in Hilbert space: \(\hat{\bfx} = (\matA^\intercal\matA+\delta\matG)^{-1}\matA^\intercal\bfy\)
  • Extension: constrained least-squares \[ \min_{\vecx\in\bbR^n}\norm[2]{\vecy-\matA\vecx}^2\text{ s.t. } \vecx=\matB\vecalpha\text{ for some }\vecalpha \]

    • The solution is \(\hat{\bfx} = \matB(\matB^\intercal\matA^\intercal\matA\matB)^{-1}\matB^\intercal\matA^\intercal\bfy\)
  • All these problems involve a symmetric positive definite system of equations.

    • Many methods exist to solve such systems efficiently, based on matrix factorizations
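
As one illustrative sketch (numpy and scipy assumed, not part of the lecture), a symmetric positive definite system such as the ridge normal equations can be solved via a Cholesky factorization.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(8)
A = rng.standard_normal((20, 6))
y = rng.standard_normal(20)
delta = 0.1

M = A.T @ A + delta * np.eye(6)       # symmetric positive definite matrix
c, low = cho_factor(M)                # Cholesky factorization M = L L^T
x_hat = cho_solve((c, low), A.T @ y)  # solve M x = A^T y using the factorization

assert np.allclose(x_hat, np.linalg.solve(M, A.T @ y))
```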

Next time

  • Next class: Wednesday October 30, 2024 (asynchronous)
  • To be effectively prepared for next class, you should:
    1. Go over today's slides and read associated lecture notes
    2. Work on Homework 5
  • Optional
    • Export slides for next lecture as PDF (be on the lookout for an announcement when they're ready)