Mathematical Foundations of Machine Learning

Prof. Matthieu Bloch

Wednesday, September 18, 2024 (v1.0)

Last time

  • Last class: Monday September 16, 2024
    • We talked about least square regression
  • To be effectively prepared for today's class, you should have:
    1. Come to class last week
    2. Gone over slides and read associated lecture notes
  • Logistics
    • Office hours
      • Jack Hill: 11:30am-12:30pm on Wednesdays in TSRB and hybrid
      • Anuvab: 12pm-1pm on Thursdays in TSRB and hybrid
    • Gradescope
      • Make sure you assign each solution to the corresponding question
      • Make sure you do some quality control on your submission

Solving the least-squares problem

Any solution \(\bftheta^*\) to the problem \(\min_{\bftheta\in\bbR^d} \norm[2]{\bfy-\matX\bftheta}^2\) must satisfy \[ \matX^\intercal\matX\bftheta^* = \matX^\intercal\bfy \] This system is called the normal equations.
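
The normal equations follow from setting the gradient of the objective to zero (a one-line derivation, included here for completeness): \[ \nabla_{\bftheta}\norm[2]{\bfy-\matX\bftheta}^2 = -2\matX^\intercal(\bfy-\matX\bftheta) = \boldsymbol{0} \quad\Longleftrightarrow\quad \matX^\intercal\matX\bftheta = \matX^\intercal\bfy \]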

  • Facts: for any matrix \(\bfA\in\bbR^{m\times n}\)

    • \(\ker{\bfA^\intercal\bfA}=\ker{\bfA}\)

    • \(\text{col}(\bfA^\intercal\bfA)=\text{row}(\bfA)\)

    • \(\text{row}(\bfA)\) and \(\ker{\bfA}\) are orthogonal complements

  • We can say a lot more about the normal equations

    1. There is always a solution
    2. If \(\textsf{rank}(\bfX)=d\), there is a unique solution: \((\matX^\intercal\matX)^{-1}\matX^\intercal \bfy\)
    3. If \(\textsf{rank}(\bfX)<d\), there are infinitely many solutions (the kernel of \(\bfX\) is non-trivial)
    4. If \(\textsf{rank}(\bfX)=n\), there exists a solution \(\bftheta^*\) for which \(\bfy=\bfX\bftheta^*\)
  • In machine learning, there are often infinitely many solutions (a numerical sketch follows below)
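
A minimal numerical check of facts 1 and 2 (a sketch using NumPy; the data and variable names are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5                      # n observations, d features
X = rng.standard_normal((n, d))   # rank(X) = d here, so the solution is unique
y = rng.standard_normal(n)

# Solve the normal equations X^T X theta = X^T y
theta_star = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq minimizes ||y - X theta||^2 directly and agrees with it
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta_star, theta_lstsq))   # True
```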

Minimum 2-norm solutions

  • Reasonable approach: among the infinitely many solutions, choose the one with minimum energy \[ \min_{\bftheta\in\bbR^d}\norm[2]{\bftheta}^2\text{ such that } \bfX^\intercal\bfX\bftheta = \bfX^\intercal\bfy \]

    • We will see the solution is always unique using the SVD
  • For now, assume that \(\textsf{rank}(\bfX)=n\), so that the problem becomes \[ \min_{\bftheta\in\bbR^d}\norm[2]{\bftheta}^2\text{ such that } \bfX\bftheta = \bfy \]

If \(\textsf{rank}(\bfX)=n\), the solution is \(\bftheta^*=\matX^\intercal(\matX\matX^\intercal)^{-1}\bfy\)
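
A quick sketch of this formula in the underdetermined case (assuming \(\textsf{rank}(\bfX)=n<d\); the NumPy names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 20                      # underdetermined: fewer observations than features
X = rng.standard_normal((n, d))   # rank(X) = n almost surely
y = rng.standard_normal(n)

# Minimum 2-norm solution theta* = X^T (X X^T)^{-1} y
theta_min = X.T @ np.linalg.solve(X @ X.T, y)

print(np.allclose(X @ theta_min, y))                   # interpolates the data exactly
print(np.allclose(theta_min, np.linalg.pinv(X) @ y))   # agrees with the pseudoinverse
```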

Regularization

  • Recall the problem \(\min_{\bftheta\in\bbR^d}\norm[2]{\bftheta}^2\) such that \(\bfX^\intercal\bfX\bftheta = \bfX^\intercal\bfy\)
    • There are infinitely many solutions if \(\ker{\bfX}\) is non-trivial
    • The set of solutions is unbounded!
    • Even if \(\ker{\bfX}=\set{0}\), the system can be poorly conditioned
  • Regularization with \(\lambda>0\) consists of solving \[ \min_{\bftheta\in\bbR^d}\norm[2]{\bfy-\bfX\bftheta}^2 + \lambda\norm[2]{\bftheta}^2 \]
    • This problem always has a unique solution

The solution is \(\bftheta^*=(\bfX^\intercal\bfX+\lambda\bfI)^{-1}\bfX^\intercal\bfy = \bfX^\intercal(\bfX\bfX^\intercal+\lambda\bfI)^{-1}\bfy\)

  • Note that \(\bftheta^*\) is in the row space of \(\matX\) \[ \bftheta^* = \matX^\intercal\bfalpha\textsf{ with } \bfalpha =(\bfX\bfX^\intercal+\lambda\bfI)^{-1}\bfy \]
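
A short numerical check that the two forms of the regularized solution coincide (a sketch with arbitrary synthetic data):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 8, 30, 0.5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Primal form: (X^T X + lambda I)^{-1} X^T y   (d x d system)
theta_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Dual form: X^T (X X^T + lambda I)^{-1} y     (n x n system, dimension independent)
alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
theta_dual = X.T @ alpha

print(np.allclose(theta_primal, theta_dual))   # True: theta* lies in the row space of X
```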

Ridge regression

  • We can adapt the regularization approach to a finite-dimensional Hilbert space \(\calF\) \[ \min_{f\in\calF}\sum_{i=1}^n(y_i-f(\bfx_i))^2 + \lambda\norm[\calF]{f}^2 \]

    • We are penalizing the norm of the entire function \(f\)
  • Using a basis \(\set{\psi_i}_{i=1}^d\) for the space, and constructing \(\boldsymbol{\Psi}\) as earlier, we obtain \[ \min_{\bftheta\in\bbR^d}\norm[2]{\bfy-\boldsymbol{\Psi}\bftheta}^2 + \lambda \bftheta^\intercal\matG\bftheta \] with \(\matG\) the Gram matrix for the basis.

  • If \(\boldsymbol{\Psi}^\intercal \boldsymbol{\Psi}+\lambda\matG\) is invertible, we find the solution as \[ \bftheta^* = (\boldsymbol{\Psi}^\intercal \boldsymbol{\Psi}+\lambda\matG)^{-1}\boldsymbol{\Psi}^\intercal \bfy \] and we can reconstruct the function as \(f(\bfx) = \sum_{i=1}^d\theta_i^*\psi_{i}(\bfx)\).

  • If \(\matG\) is well-conditioned, the resulting function is not too sensitive to the choice of basis (see the sketch below)
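
A minimal sketch with one concrete (assumed) choice: the monomial basis \(\psi_i(x)=x^{i-1}\) on \([0,1]\), whose Gram matrix under the \(L^2([0,1])\) inner product is the Hilbert matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, lam = 4, 25, 1e-2

# Assumed basis: monomials psi_i(x) = x^{i-1} on [0, 1].
# Gram matrix G_ij = <psi_i, psi_j>_{L^2([0,1])} = 1/(i + j - 1)  (Hilbert matrix)
idx = np.arange(1, d + 1)
G = 1.0 / (idx[:, None] + idx[None, :] - 1)

x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)   # noisy samples of some f
Psi = np.vander(x, d, increasing=True)                     # Psi[i, j] = psi_{j+1}(x_i)

# theta* = (Psi^T Psi + lambda G)^{-1} Psi^T y
theta = np.linalg.solve(Psi.T @ Psi + lam * G, Psi.T @ y)

# Reconstruct f_hat(x) = sum_i theta_i psi_i(x)
f_hat = lambda t: np.vander(np.atleast_1d(t), d, increasing=True) @ theta
print(f_hat(0.25))
```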

Least-squares in infinite-dimensional Hilbert spaces

  • In \(\bbR^d\), the problem \(\min_{\bftheta\in\bbR^d}\norm[2]{\bfy-\bfX\bftheta}^2 + \lambda\norm[2]{\bftheta}^2\) has a solution

    • \(\bftheta^* = \matX^\intercal\bfalpha\) with \(\bfalpha =(\bfX\bfX^\intercal+\lambda\bfI)^{-1}\bfy\)
    • \(\matX\matX^\intercal\in\bbR^{n\times n}\) is dimension independent!
    • We will be able to extend this to infinite dimensional Hilbert spaces!
  • Let \(\calF\) be a Hilbert space and let \(f\in\calF\) be the function we are trying to estimate

    • We will estimate \(f\in\calF\) from noisy observations \(y_i\) of \(\dotp{f}{x_i}\), where \(\set{x_i}_{i=1}^n\) are elements of \(\calF\)

    • This is the analog of the finite-dimensional model \(\bfy = \bfA\bfx+\bfn\)

\[ \min_{f\in\calF}\sum_{i=1}^n\abs{y_i-{\dotp{f}{x_i}}_{\calF}}^2+\lambda\norm[\calF]{f}^2 \] has solution \[ f = \sum_{i=1}^n\alpha_i x_i\textsf{ with } \bfalpha = (\matK+\lambda\matI)^{-1}\bfy\qquad \matK=\mat{c}{\dotp{x_i}{x_j}}_{1\leq i,j\leq n} \]

  • This happens in a Reproducing Kernel Hilbert Space (RKHS); a numerical sketch follows below
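
A sketch of the resulting recipe (kernel ridge regression), assuming a Gaussian kernel for illustration; the kernel choice, data, and names are not from the lecture:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all pairs of rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma**2))

rng = np.random.default_rng(4)
n, lam = 30, 1e-1
x = rng.uniform(-3, 3, (n, 1))
y = np.sin(x[:, 0]) + 0.1 * rng.standard_normal(n)

# alpha = (K + lambda I)^{-1} y, with K_ij the inner product of x_i and x_j in the RKHS,
# i.e. K_ij = k(x_i, x_j)
K = gaussian_kernel(x, x)
alpha = np.linalg.solve(K + lam * np.eye(n), y)

# Evaluate f(t) = sum_i alpha_i k(t, x_i) at new points
t = np.linspace(-3, 3, 5)[:, None]
print(gaussian_kernel(t, x) @ alpha)
```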

Next time

  • Next class: Monday September 23, 2024
  • To be effectively prepared for next class, you should:
    1. Go over today's slides and read associated lecture notes
    2. Work on Homework 3 (to be posted soon)
  • Optional
    • Export slides for next lecture as PDF (be on the lookout for an announcement when they're ready)