Prof. Matthieu Bloch
Monday September 16, 2024 (v1.1)
A fundamental problem in supervised machine learning can be cast as follows \[ \textsf{Given dataset }\calD\eqdef\{(\vecx_i,y_i)\}_{i=1}^n\textsf{, how do we find $f$ such that $f(\bfx_i)\approx y_i$ for all }i? \]
We need to introduce several ingredients to make the question well defined: a class of candidate functions \(\calF\) and a loss function \(\ell\) that measures how far \(f(\bfx_i)\) is from \(y_i\)
We can then formulate the question as \[ \min_{f\in\calF}\sum_{i=1}^n\ell(f(\bfx_i),y_i) \]
We will focus quite a bit on the square loss \(\ell(u,v)\eqdef (u-v)^2\), which leads to least-squares regression
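As a minimal numerical illustration (a NumPy sketch; the synthetic dataset and the linear candidate \(f\) below are assumptions made purely for concreteness), the objective is just a sum of per-sample losses:
```python
# Sketch: evaluate the empirical square-loss objective for one candidate f.
# The dataset and the candidate predictor below are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                                     # 50 inputs x_i in R^3
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)   # labels y_i

def f(x):
    """A hypothetical linear candidate f(x) = theta^T x."""
    return x @ np.array([1.0, -2.0, 0.5])

objective = np.sum((f(X) - y) ** 2)                              # sum_i l(f(x_i), y_i)
print(objective)
```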
A common choice of \(\calF\) is the set of continuous linear functions.
\(f:\bbR^d\to\bbR\) is linear iff \[ \forall \bfx,\bfy\in\bbR^d,\lambda,\mu\in\bbR\quad f(\lambda\bfx + \mu\bfy) = \lambda f(\bfx)+\mu f(\bfy) \]
We will see that every continuous linear function on \(\bbR^d\) is actually an inner product, i.e., \[ \exists \bftheta_f\in\bbR^d\textsf{ s.t. } f(\bfx)=\bftheta_f^\intercal\bfx \quad\forall \bfx\in\bbR^d \]
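One way to see this: write \(\bfx=\sum_{i=1}^d x_i\mathbf{e}_i\) in the canonical basis and use linearity, \[ f(\bfx)=f\Big(\sum_{i=1}^d x_i\mathbf{e}_i\Big)=\sum_{i=1}^d x_i f(\mathbf{e}_i)=\bftheta_f^\intercal\bfx \quad\textsf{with}\quad (\bftheta_f)_i\eqdef f(\mathbf{e}_i) \]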
Canonical form I
With the square loss and \(f(\bfx)=\bftheta^\intercal\bfx\), the problem reads \(\min_{\bftheta\in\bbR^d}\norm[2]{\bfy-\matX\bftheta}^2\), where the rows of \(\matX\) are the \(\vecx_i^\intercal\)
Allow for affine functions (not just linear)
Add a 1 to every \(\vecx_i\) \[ \min_{\bftheta\in\bbR^{d+1}} \norm[2]{\bfy-\matX\bftheta}^2\textsf{ with } \matX\eqdef\mat{cc}{1&-\vecx_1^\intercal-\\\vdots&\vdots\\1&-\vecx_n^\intercal-} \]
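A NumPy sketch of this construction (synthetic data assumed): stack a column of ones next to the \(\vecx_i^\intercal\) and solve the least-squares problem.
```python
# Sketch of canonical form I with an affine term; the data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X_raw = rng.normal(size=(n, d))                        # the inputs x_i
y = 2.0 + X_raw @ np.array([1.0, -1.0, 0.5]) + 0.05 * rng.normal(size=n)

X = np.hstack([np.ones((n, 1)), X_raw])                # add a 1 to every x_i
theta, *_ = np.linalg.lstsq(X, y, rcond=None)          # argmin ||y - X theta||^2
intercept, weights = theta[0], theta[1:]
print(intercept, weights)
```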
Canonical form II
Let \(\calF\) be a \(d\)-dimensional subspace of a vector space of functions, with basis \(\set{\psi_i}_{i=1}^d\)
The problem becomes \[ \min_{\bftheta\in\bbR^d}\norm[2]{\bfy-\boldsymbol{\Psi}\bftheta}^2\textsf{ with }\boldsymbol{\Psi}\eqdef \mat{c}{-\psi(\bfx_1)^\intercal-\\\vdots\\-\psi(\bfx_n)^\intercal-}\eqdef\mat{cccc}{\psi_1(\bfx_1)&\psi_2(\bfx_1)&\cdots&\psi_d(\bfx_1)\\ \vdots&\vdots&\vdots&\vdots\\ \psi_1(\bfx_n)&\psi_2(\bfx_n)&\cdots&\psi_d(\bfx_n) } \]
We are now recovering a nonlinear function of a continuous variable, namely \(f=\sum_{j=1}^d\theta_j\psi_j\)
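As an illustration, here is a NumPy sketch with a hypothetical monomial basis \(\psi_j(x)=x^{j-1}\) on scalar inputs (any other basis of \(\calF\) would be handled the same way):
```python
# Sketch of canonical form II with the monomial basis psi_j(x) = x^(j-1),
# chosen here purely for illustration; the data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
n, d = 60, 4                                       # 60 scalar samples, 4 basis functions
x = rng.uniform(-1.0, 1.0, size=n)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=n)   # a nonlinear ground truth

Psi = np.column_stack([x**j for j in range(d)])    # Psi[i, j] = psi_{j+1}(x_i)
theta, *_ = np.linalg.lstsq(Psi, y, rcond=None)    # argmin ||y - Psi theta||^2

def f_hat(t):
    """Recovered nonlinear function f = sum_j theta_j psi_j."""
    return sum(theta[j] * t**j for j in range(d))

print(f_hat(0.3))
```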
Any solution \(\bftheta^*\) to the problem \(\min_{\bftheta\in\bbR^d} \norm[2]{\bfy-\matX\bftheta}^2\) must satisfy \[ \matX^\intercal\matX\bftheta^* = \matX^\intercal\vecy \] This system of equations is called the normal equations
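This follows by setting the gradient of the convex quadratic objective to zero: \[ \nabla_{\bftheta}\norm[2]{\bfy-\matX\bftheta}^2 = \nabla_{\bftheta}\left(\bfy^\intercal\bfy - 2\bfy^\intercal\matX\bftheta + \bftheta^\intercal\matX^\intercal\matX\bftheta\right) = 2\matX^\intercal\matX\bftheta - 2\matX^\intercal\bfy, \] which vanishes exactly when \(\matX^\intercal\matX\bftheta = \matX^\intercal\bfy\)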
Facts: for any matrix \(\bfA\in\bbR^{m\times n}\)
\(\ker{\bfA^\intercal\bfA}=\ker{\bfA}\)
\(\text{col}(\bfA^\intercal\bfA)=\text{row}(\bfA)\)
\(\text{row}(\bfA)\) and \(\ker{\bfA}\) are orthogonal complements
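These facts can be checked numerically; a small NumPy sanity check on a random rank-deficient matrix (an assumption made for illustration, not a proof):
```python
# Illustrate ker(A^T A) = ker(A) and col(A^T A) = row(A) via matching ranks
# and a shared null-space direction (a sanity check, not a proof).
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3)) @ rng.normal(size=(3, 8))   # a 5x8 matrix of rank 3
AtA = A.T @ A

print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(AtA))   # both equal 3

_, _, Vt = np.linalg.svd(A)
v = Vt[-1]                            # a direction in ker(A) since rank(A) < 8
print(np.allclose(A @ v, 0), np.allclose(AtA @ v, 0))         # True True
```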
Using these facts, we can say a lot more about the normal equations: a solution always exists since \(\matX^\intercal\bfy\in\text{row}(\matX)=\text{col}(\matX^\intercal\matX)\), and it is unique if and only if \(\ker{\matX}=\set{\mathbf{0}}\)
In machine learning we often have \(d>n\) (more parameters than data points), so that \(\ker{\matX}\neq\set{\mathbf{0}}\) and there are infinitely many solutions
Reasonable approach: among the infinitely many solutions, pick the one with minimum energy (minimum norm) \[ \min_{\bftheta\in\bbR^d}\norm[2]{\bftheta}^2\text{ such that } \bfX^\intercal\bfX\bftheta = \bfX^\intercal\bfy \]
For now, assume that \(\textsf{rank}(\bfX)=n\); then \(\ker{\bfX^\intercal}=\set{\mathbf{0}}\), so the normal equations reduce to \(\bfX\bftheta=\bfy\) and the problem becomes \[ \min_{\bftheta\in\bbR^d}\norm[2]{\bftheta}^2\text{ such that } \bfX\bftheta = \bfy \]
If \(\textsf{rank}(\bfX)=n\) the solution is \(\bftheta^*=\matX^\intercal(\matX\matX^\intercal)^{-1}\bfy\)
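A NumPy sketch of this formula in an underdetermined setting (synthetic data assumed); np.linalg.pinv returns the same minimum-norm solution:
```python
# Minimum-energy solution of X theta = y when rank(X) = n < d (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                                   # more unknowns than equations
X = rng.normal(size=(n, d))                     # full row rank with probability 1
y = rng.normal(size=n)

theta_star = X.T @ np.linalg.solve(X @ X.T, y)  # theta* = X^T (X X^T)^{-1} y
theta_pinv = np.linalg.pinv(X) @ y              # pseudoinverse, same minimum-norm solution

print(np.allclose(X @ theta_star, y))           # interpolates the data: True
print(np.allclose(theta_star, theta_pinv))      # the two expressions agree: True
```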
For the regularized problem \(\min_{\bftheta\in\bbR^d}\norm[2]{\bfy-\bfX\bftheta}^2+\lambda\norm[2]{\bftheta}^2\) with \(\lambda>0\) (ridge regression), the solution is \(\bftheta^*=(\bfX^\intercal\bfX+\lambda\bfI)^{-1}\bfX^\intercal\bfy = \bfX^\intercal(\bfX\bfX^\intercal+\lambda\bfI)^{-1}\bfy\)
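The equality of the two expressions (a push-through identity) can be checked numerically; a short NumPy sketch with an arbitrary \(\lambda\) and synthetic data:
```python
# Check (X^T X + lam I_d)^{-1} X^T y == X^T (X X^T + lam I_n)^{-1} y numerically.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 50, 0.1
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

theta_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)   # d x d system
theta_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)     # n x n system
print(np.allclose(theta_primal, theta_dual))                         # True
```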