Dr. Matthieu R Bloch

Wednesday October 06, 2021

**Assignment 4** assigned Tuesday, October 5, 2021. Includes a (small) programming component.

Due **October 14, 2021** (soft deadline, hard deadline on October 16)

**Last time**: Least-squares regression

**Today**

- Solving linear least-squares regression
- Extension to infinite dimension

**Reading:** Romberg, lecture notes 8

- Any solution \(\bftheta^*\) to the problem \(\min_{\bftheta\in\bbR^d} \norm[2]{\bfy-\bfX\bftheta}^2\) must satisfy \[ \bfX^\intercal\bfX\bftheta^* = \bfX^\intercal\bfy \] This system is called the *normal equations*.

**Facts:** for any matrix \(\bfA\in\bbR^{m\times n}\)

- \(\ker{\bfA^\intercal\bfA}=\ker{\bfA}\)
- \(\text{col}(\bfA^\intercal\bfA)=\text{row}(\bfA)\)
- \(\text{row}(\bfA)\) and \(\ker{\bfA}\) are orthogonal complements
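The normal equations can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the sizes `n`, `d`, the random data, and the variable names are illustrative choices, not part of the lecture. It computes a least-squares solution with `np.linalg.lstsq` and verifies that it satisfies \(\bfX^\intercal\bfX\bftheta^* = \bfX^\intercal\bfy\), which is the same as saying the residual \(\bfy-\bfX\bftheta^*\) is orthogonal to the columns of \(\bfX\).

```python
import numpy as np

# Hypothetical data: n observations, d parameters (illustrative sizes only)
rng = np.random.default_rng(0)
n, d = 6, 3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# A least-squares solution of min ||y - X theta||^2
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Normal equations: X^T X theta = X^T y
normal_eq_residual = np.linalg.norm(X.T @ X @ theta - X.T @ y)

# Equivalent statement: the residual is orthogonal to every column of X
orthogonality = X.T @ (y - X @ theta)

print("normal-equation residual:", normal_eq_residual)
print("X^T (y - X theta):", orthogonality)
```

Both printed quantities should be zero up to floating-point rounding.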

We can say a lot more about the normal equations

- There is always a solution
- If \(\textsf{rank}(\bfX)=d\), there is a unique solution: \((\bfX^\intercal\bfX)^{-1}\bfX^\intercal \bfy\)
- If \(\textsf{rank}(\bfX)<d\), there are infinitely many non-trivial solutions
- If \(\textsf{rank}(\bfX)=n\), there exists a solution \(\bftheta^*\) for which \(\bfy=\bfX\bftheta^*\)
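The rank-deficient case can be illustrated on a small example. This is a sketch under assumed data: the matrix below is a hypothetical design whose third column is the sum of the first two, so \(\textsf{rank}(\bfX)=2<d=3\) and \(\ker{\bfX}\) contains \((1,1,-1)\). Adding any multiple of that kernel vector to one solution of the normal equations gives another solution.

```python
import numpy as np

# Hypothetical rank-deficient design: column 3 = column 1 + column 2
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [2.0, 1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

rank = np.linalg.matrix_rank(X)  # 2, which is less than d = 3

def normal_residual(theta):
    """Norm of X^T X theta - X^T y (zero for any least-squares solution)."""
    return np.linalg.norm(X.T @ X @ theta - X.T @ y)

# One solution of the normal equations (lstsq returns the minimum-norm one)
theta0, *_ = np.linalg.lstsq(X, y, rcond=None)

# v lies in ker(X), so theta0 + c * v is also a solution for any scalar c
v = np.array([1.0, 1.0, -1.0])
theta1 = theta0 + 5.0 * v

print("rank:", rank)
print("residuals:", normal_residual(theta0), normal_residual(theta1))
print("norms:", np.linalg.norm(theta0), np.linalg.norm(theta1))
```

Both residuals are zero up to rounding, while the norms differ. This is why an extra selection principle is needed when the kernel is non-trivial.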

In machine learning, there are often infinitely many solutions

One reasonable way to choose a solution among infinitely many is the *minimum energy* principle \[ \min_{\bftheta\in\bbR^d}\norm[2]{\bftheta}^2\text{ such that } \bfX^\intercal\bfX\bftheta = \bfX^\intercal\bfy \]

- We will see using the SVD that the solution is always unique

For now, assume that \(\textsf{rank}(\bfX)=n\), so that \(\bfX\bfX^\intercal\) is invertible and the problem becomes \[ \min_{\bftheta\in\bbR^d}\norm[2]{\bftheta}^2\text{ such that } \bfX\bftheta = \bfy \]

- The solution is \(\bftheta^*=\bfX^\intercal(\bfX\bfX^\intercal)^{-1}\bfy\)
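The minimum-energy closed form can be sketched as follows, assuming \(\bfX\) has full row rank (rank \(n\), with \(n<d\)) so that \(\bfX\bfX^\intercal\) is invertible. The data and variable names are illustrative assumptions. The sketch compares \(\bfX^\intercal(\bfX\bfX^\intercal)^{-1}\bfy\) with the pseudo-inverse solution from `np.linalg.pinv`, then builds another feasible \(\bftheta\) by adding a kernel component and checks that its norm is larger.

```python
import numpy as np

# Hypothetical underdetermined problem: fewer observations than parameters
rng = np.random.default_rng(1)
n, d = 3, 6
X = rng.standard_normal((n, d))   # full row rank with probability one
y = rng.standard_normal(n)

# Minimum-norm solution theta* = X^T (X X^T)^{-1} y
# (a linear solve is used instead of forming the inverse explicitly)
theta_min = X.T @ np.linalg.solve(X @ X.T, y)

# The pseudo-inverse yields the same minimum-norm solution
theta_pinv = np.linalg.pinv(X) @ y

# Another solution of X theta = y: add the projection of z onto ker(X)
z = rng.standard_normal(d)
z_ker = z - X.T @ np.linalg.solve(X @ X.T, X @ z)
theta_other = theta_min + z_ker

print("constraint error:", np.linalg.norm(X @ theta_min - y))
print("norms:", np.linalg.norm(theta_min), np.linalg.norm(theta_other))
```

Since `theta_min` lies in the row space of `X` and `z_ker` lies in its kernel, the two are orthogonal, so any feasible point other than `theta_min` has strictly larger norm.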

- Recall the problem \[
\min_{\bftheta\in\bbR^d}\norm[2]{\bftheta}^2\text{ such that } \bfX^\intercal\bfX\bftheta = \bfX^\intercal\bfy
\]
- The normal equations have infinitely many solutions if \(\ker{\bfX}\) is non-trivial
- The set of solutions is unbounded!
- Even if \(\ker{\bfX}=\set{0}\), the system can be poorly conditioned
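Poor conditioning with a trivial kernel can be illustrated on a toy matrix. This is a sketch under assumptions: the matrix below, with two nearly collinear columns controlled by a hypothetical parameter `eps`, is chosen only for illustration. Its columns are linearly independent, yet the condition number of \(\bfX^\intercal\bfX\) (roughly the square of that of \(\bfX\)) is very large, so solving the normal equations amplifies small perturbations.

```python
import numpy as np

# Hypothetical design with two nearly collinear columns
eps = 1e-6
X = np.array([[1.0, 1.0],
              [1.0, 1.0 + eps],
              [1.0, 1.0 - eps]])

rank = np.linalg.matrix_rank(X)        # 2: the kernel of X is trivial
cond_X = np.linalg.cond(X)             # ratio of largest to smallest singular value
cond_gram = np.linalg.cond(X.T @ X)    # roughly cond_X squared

print("rank:", rank)
print("cond(X):", cond_X)
print("cond(X^T X):", cond_gram)
```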

**Regularization** with \(\lambda>0\) consists in solving \[ \min_{\bftheta\in\bbR^d}\norm[2]{\bfy-\bfX\bftheta}^2 + \lambda\norm[2]{\bftheta}^2 \]

- This problem *always* has a unique solution
- The solution is \(\bftheta^*=(\bfX^\intercal\bfX+\lambda\bfI)^{-1}\bfX^\intercal\bfy = \bfX^\intercal(\bfX\bfX^\intercal+\lambda\bfI)^{-1}\bfy\)
- Note that \(\bftheta^*\) is in the row space of \(\bfX\) \[ \bftheta^* = \bfX^\intercal\bfalpha\textsf{ with } \bfalpha =(\bfX\bfX^\intercal+\lambda\bfI)^{-1}\bfy \]
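The two expressions for the regularized solution agree because \(\bfX^\intercal(\bfX\bfX^\intercal+\lambda\bfI) = (\bfX^\intercal\bfX+\lambda\bfI)\bfX^\intercal\), and both matrices in parentheses are invertible for \(\lambda>0\). The sketch below checks this numerically, assuming NumPy. The function names `regularized_ls_dxd` and `regularized_ls_nxn` and the random data are hypothetical, introduced only for illustration. The second form writes \(\bftheta^* = \bfX^\intercal\bfalpha\), which makes explicit that \(\bftheta^*\) lies in the row space of \(\bfX\).

```python
import numpy as np

def regularized_ls_dxd(X, y, lam):
    """theta = (X^T X + lam I)^{-1} X^T y, solving a d x d system."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def regularized_ls_nxn(X, y, lam):
    """theta = X^T alpha with alpha = (X X^T + lam I)^{-1} y, an n x n system."""
    n = X.shape[0]
    alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
    return X.T @ alpha, alpha

# Hypothetical data with n < d, so X^T X alone is singular
rng = np.random.default_rng(2)
X = rng.standard_normal((4, 7))
y = rng.standard_normal(4)
lam = 0.1

theta_a = regularized_ls_dxd(X, y, lam)
theta_b, alpha = regularized_ls_nxn(X, y, lam)

# Up to a factor 2, the gradient of the regularized objective; zero at the solution
grad = X.T @ (X @ theta_a - y) + lam * theta_a

print("difference between the two forms:", np.linalg.norm(theta_a - theta_b))
print("gradient norm at the solution:", np.linalg.norm(grad))
```

The \(n\times n\) form is cheaper when \(n<d\), and the \(d\times d\) form when \(d<n\).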