Mathematical Foundations of Machine Learning

Prof. Matthieu Bloch

Monday September 16, 2024 (v1.1)

Last time

  • Last class: Wednesday, September 11, 2024
    • We wrapped up our discussion of infinite dimensional spaces
    • Take away: separable Hilbert spaces are like \(\ell_2\)
  • To be effectively prepared for today's class, you should have:
    1. Come to class last week
    2. Gone over slides and read associated lecture notes
    3. Submitted Homework 2
  • Logistics
    • Office hours
      • Jack Hill: 11:30am-12:30pm on Wednesdays in TSRB and hybrid
      • Anuvab: 12pm-1pm on Thursdays in TSRB and hybrid
    • Gradescope
      • Make sure you assign each solution to the corresponding question
      • Make sure you do some quality control on your submission

Regression

  • A fundamental problem in supervised machine learning can be cast as follows \[ \textsf{Given dataset }\calD\eqdef\{(\vecx_i,y_i)\}_{i=1}^n\textsf{, how do we find $f$ such that $f(\bfx_i)\approx y_i$ for all }i? \]

    • Often \(\vecx_i\in\bbR^d\), but sometimes \(\vecx_i\) is a weirder object (think tRNA string)
    • If \(y_i\in\calY\subseteq\bbR\) with \(\card{\calY}<\infty\), the problem is called classification
    • If \(y_i\in\calY=\bbR\), the problem is called regression
  • We need to introduce several ingredients to make the question well defined

    1. We need a class \(\calF\) to which \(f\) should belong
    2. We need a loss function \(\ell:\bbR\times\bbR\to\bbR^+\) to measure the quality of our approximation
  • We can then formulate the question as \[ \min_{f\in\calF}\sum_{i=1}^n\ell(f(\bfx_i),y_i) \]

  • We will focus quite a bit on the square loss \(\ell(u,v)\eqdef (u-v)^2\), which leads to least-squares regression (a short numerical sketch follows below)
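
As a minimal illustration of this formulation, here is a sketch with a hypothetical toy dataset and a fixed candidate \(f\) (both made up for illustration, not taken from the lecture): it simply evaluates the empirical square-loss objective.

```python
import numpy as np

# Hypothetical toy dataset: n = 4 points x_i in R^2 with real-valued labels y_i
X = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0],
              [2.0, 1.0]])
y = np.array([1.0, 0.5, 1.5, 2.0])

def f(x):
    """A fixed candidate predictor (an arbitrary linear guess, for illustration only)."""
    theta = np.array([0.5, 1.0])
    return theta @ x

# Empirical objective: sum_i ell(f(x_i), y_i) with the square loss ell(u, v) = (u - v)^2
objective = sum((f(x_i) - y_i) ** 2 for x_i, y_i in zip(X, y))
print(objective)
```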

Least-squares linear regression

  • A common choice of \(\calF\) is the set of continuous linear functions.

    • \(f:\bbR^d\to\bbR\) is linear iff \[ \forall \bfx,\bfy\in\bbR^d,\lambda,\mu\in\bbR\quad f(\lambda\bfx + \mu\bfy) = \lambda f(\bfx)+\mu f(\bfy) \]

    • We will see that every continuous linear function on \(\bbR^d\) is actually an inner product, i.e., \[ \exists \bftheta_f\in\bbR^d\textsf{ s.t. } f(\bfx)=\bftheta_f^\intercal\bfx \quad\forall \bfx\in\bbR^d \]
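
    • A quick sketch of why this holds (spelled out here for convenience, using the standard basis \(e_1,\dots,e_d\) of \(\bbR^d\)): writing \(\bfx=\sum_{i=1}^d x_i e_i\) and using linearity, \[ f(\bfx)=\sum_{i=1}^d x_i f(e_i)=\bftheta_f^\intercal\bfx \quad\textsf{with}\quad \bftheta_f\eqdef\mat{c}{f(e_1)\\\vdots\\f(e_d)}, \] so \(\bftheta_f\) simply collects the values of \(f\) on the basis vectors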

Least-squares affine regression

  • Canonical form I

    • Stack \(\bfx_i\) as row vectors into a matrix \(\bfX\in\bbR^{n\times d}\), stack \(y_i\) as elements of column vector \(\bfy\in\bbR^n\) \[ \min_{\bftheta\in\bbR^d} \norm[2]{\bfy-\matX\bftheta}^2\textsf{ with } \matX\eqdef\mat{c}{-\vecx_1^\intercal-\\\vdots\\-\vecx_n^\intercal-} \]
  • Canonical form II

    • Allow for affine functions (not just linear)

    • Add a 1 to every \(\vecx_i\) \[ \min_{\bftheta\in\bbR^{d+1}} \norm[2]{\bfy-\matX\bftheta}^2\textsf{ with } \matX\eqdef\mat{c}{1-\vecx_1^\intercal-\\\vdots\\1-\vecx_n^\intercal-} \]
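
A minimal numpy sketch of canonical form II (with hypothetical random data, not from the course materials): prepend a 1 to each \(\vecx_i\) and solve the least-squares problem; the names `X_aug` and the data are assumptions for illustration.

```python
import numpy as np

# Hypothetical data: n = 5 samples in R^2
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))    # row i is x_i^T
y = rng.normal(size=5)

# Canonical form II: add a leading 1 to every x_i to allow an affine (offset) term
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # shape (n, d+1)

# Solve min_theta ||y - X_aug theta||_2^2
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
print(theta)   # theta[0] is the offset, theta[1:] is the linear part
```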

Nonlinear regression using a basis

  • Let \(\calF\) be a \(d\)-dimensional subspace of a vector space with basis \(\set{\psi_i}_{i=1}^d\)

    • We model \(f(\bfx) = \sum_{i=1}^d\theta_i\psi_i(\bfx)\)
  • The problem becomes \[ \min_{\bftheta\in\bbR^d}\norm[2]{\bfy-\boldsymbol{\Psi}\bftheta}^2\textsf{ with }\boldsymbol{\Psi}\eqdef \mat{c}{-\psi(\bfx_1)^\intercal-\\\vdots\\-\psi(\bfx_n)^\intercal-}\eqdef\mat{cccc}{\psi_1(\bfx_1)&\psi_2(\bfx_1)&\cdots&\psi_d(\bfx_1)\\ \vdots&\vdots&\vdots&\vdots\\ \psi_1(\bfx_n)&\psi_2(\bfx_n)&\cdots&\psi_d(\bfx_n) } \]

  • We are recovering a nonlinear function of a continuous variable

    • This is the exact same computational framework as linear regression.
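
To make this concrete, here is a hedged sketch assuming scalar inputs and a monomial basis \(\psi_k(x)=x^{k-1}\) (one illustrative choice of basis among many; the data is synthetic): the fit reduces to exactly the linear least-squares computation, with \(\boldsymbol{\Psi}\) in place of \(\matX\).

```python
import numpy as np

# Hypothetical data: noisy samples of a nonlinear function of a scalar variable
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)

# Basis matrix Psi with Psi[i, k] = psi_{k+1}(x_i) = x_i^k (monomial basis of dimension d)
d = 5
Psi = np.vander(x, N=d, increasing=True)   # shape (n, d)

# Same computation as linear least squares, with Psi playing the role of X
theta, *_ = np.linalg.lstsq(Psi, y, rcond=None)

def f_hat(t):
    """Evaluate the recovered nonlinear function f(t) = sum_k theta_k psi_k(t)."""
    return np.vander(np.atleast_1d(t), N=d, increasing=True) @ theta

print(f_hat(0.3))
```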

Solving the least-squares problem

Any solution \(\bftheta^*\) to the problem \(\min_{\bftheta\in\bbR^d} \norm[2]{\bfy-\matX\bftheta}^2\) must satisfy \[ \matX^\intercal\matX\bftheta^* = \matX^\intercal\vecy \] This system is called the normal equations

  • Facts: for any matrix \(\bfA\in\bbR^{m\times n}\)

    • \(\ker{\bfA^\intercal\bfA}=\ker{\bfA}\)

    • \(\text{col}(\bfA^\intercal\bfA)=\text{row}(\bfA)\)

    • \(\text{row}(\bfA)\) and \(\ker{\bfA}\) are orthogonal complements

  • We can say a lot more about the normal equations

    1. There is always a solution
    2. If \(\textsf{rank}(\bfX)=d\), there is a unique solution: \((\matX^\intercal\matX)^{-1}\matX^\intercal \bfy\)
    3. If \(\textsf{rank}(\bfX)<d\), there are infinitely many solutions (they differ by elements of the non-trivial \(\ker{\bfX}\))
    4. If \(\textsf{rank}(\bfX)=n\), there exists a solution \(\bftheta^*\) for which \(\bfy=\bfX\bftheta^*\)
  • In machine learning, there are often infinitely many solutions
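
A small numerical check of these facts, assuming hypothetical random data with \(n>d\) (so that a generic \(\bfX\) has rank \(d\) and the solution is unique):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 3
X = rng.normal(size=(n, d))   # generically rank(X) = d
y = rng.normal(size=n)

# Unique solution when rank(X) = d: theta* = (X^T X)^{-1} X^T y
theta_star = np.linalg.solve(X.T @ X, X.T @ y)

# It satisfies the normal equations X^T X theta* = X^T y ...
assert np.allclose(X.T @ X @ theta_star, X.T @ y)

# ... and agrees with a generic least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(theta_star, theta_lstsq)
```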

Minimum \(\ell_2\)-norm solutions

  • Reasonable approach: among the infinitely many solutions, choose one according to a minimum energy principle \[ \min_{\bftheta\in\bbR^d}\norm[2]{\bftheta}^2\text{ such that } \bfX^\intercal\bfX\bftheta = \bfX^\intercal\bfy \]

    • We will see, using the SVD, that the solution is always unique
  • For now, assume that \(\textsf{rank}(\bfX)=n\), so that the problem becomes \[ \min_{\bftheta\in\bbR^d}\norm[2]{\bftheta}^2\text{ such that } \bfX\bftheta = \bfy \]

If \(\textsf{rank}(\bfX)=n\), the solution is \(\bftheta^*=\matX^\intercal(\matX\matX^\intercal)^{-1}\bfy\)
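
A sketch of this closed form in the underdetermined regime, assuming hypothetical data with \(n<d\) (so that a generic \(\bfX\) has rank \(n\)):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 3, 10                      # fewer samples than parameters
X = rng.normal(size=(n, d))       # generically rank(X) = n
y = rng.normal(size=n)

# Minimum-norm interpolating solution: theta* = X^T (X X^T)^{-1} y
theta_star = X.T @ np.linalg.solve(X @ X.T, y)

assert np.allclose(X @ theta_star, y)                    # interpolates the data exactly
assert np.allclose(theta_star, np.linalg.pinv(X) @ y)    # matches the pseudoinverse solution
```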

Regularization

  • Recall the problem \[ \min_{\bftheta\in\bbR^d}\norm[2]{\bftheta}^2\text{ such that } \bfX^\intercal\bfX\bftheta = \bfX^\intercal\bfy \]
    • There are infinitely many solutions if \(\ker{\bfX}\) is non-trivial
    • The set of solutions is unbounded!
    • Even if \(\ker{\bfX}=\set{0}\), the system can be poorly conditioned
  • Regularization with \(\lambda>0\) consists in solving \[ \min_{\bftheta\in\bbR^d}\norm[2]{\bfy-\bfX\bftheta}^2 + \lambda\norm[2]{\bftheta}^2 \]
    • This problem always has a unique solution

The solution is \(\bftheta^*=(\bfX^\intercal\bfX+\lambda\bfI)^{-1}\bfX^\intercal\bfy = \bfX^\intercal(\bfX\bfX^\intercal+\lambda\bfI)^{-1}\bfy\)

  • Note that \(\bftheta^*\) is in the row space of \(\matX\) \[ \bftheta^* = \matX^\intercal\bfalpha\textsf{ with } \bfalpha =(\bfX\bfX^\intercal+\lambda\bfI)^{-1}\bfy \]
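
A quick numerical sanity check of the two equivalent expressions and of the row-space observation, on hypothetical random data:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, lam = 5, 8, 0.1
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Form 1: theta* = (X^T X + lambda I_d)^{-1} X^T y
theta_1 = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Form 2: theta* = X^T (X X^T + lambda I_n)^{-1} y = X^T alpha
alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
theta_2 = X.T @ alpha

assert np.allclose(theta_1, theta_2)   # the two forms agree
# theta* = X^T alpha makes explicit that theta* lies in the row space of X
```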

Next time

  • Next class: Wednesday September 18, 2024
  • To be effectively prepared for next class, you should:
    1. Go over today's slides and read associated lecture notes
    2. Work on Homework 3 (due date: Wednesday September 25, 2024)
  • Optional
    • Export slides for next lecture as PDF (be on the lookout for an announcement when they're ready)