Matthieu Bloch
Regression: \(y_i\in\calY=\bbR\)
Linear regression: \(\calH\) is the set of affine functions \[f(\bfx)\eqdef \bfbeta^\intercal\bfx+\beta_0\textsf{ with }\bfbeta\eqdef[\beta_1,\cdots,\beta_d]^\intercal\]
Least squares regression: the loss function is the sum of squared errors \[\mathrm{SSE}(\bfbeta,\beta_0)\eqdef\sum_{i=1}^N(y_i-\bfbeta^\intercal\bfx_i -\beta_0)^2\]
Change of notation \[\bftheta\eqdef\left[\begin{array}{c} \beta_0\\ \beta_1\\ \vdots\\ \beta_d\end{array}\right]\in\bbR^{d+1}\qquad \bfy \eqdef\left[\begin{array}{c} y_1\\ y_2\\ \vdots\\ y_N \end{array}\right]\in\bbR^{N} \qquad \bfX\eqdef\left[\begin{array}{cc}1&-\bfx_1^\intercal-\\1&-\bfx_2^\intercal-\\\vdots&\vdots\\1&-\bfx_N^\intercal-\end{array}\right]\in\bbR^{N\times (d+1)}\]
Rewrite the sum of squared errors as \(\mathrm{SSE}(\bftheta)\eqdef \norm[2]{\bfy-\bfX\bftheta}^2\)
If \(\bfX^\intercal\bfX\) is non-singular, the minimizer of the SSE is \[\hat{\bftheta} = (\bfX^\intercal\bfX)^{-1}\bfX^\intercal\bfy\]
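A minimal NumPy sketch of this closed-form solution (the data `X_raw`, `y` and the true coefficients are made up for illustration):

```python
import numpy as np

# Toy data (placeholders): N observations in d dimensions
rng = np.random.default_rng(0)
N, d = 50, 3
X_raw = rng.normal(size=(N, d))
y = X_raw @ np.array([1.0, -2.0, 0.5]) + 0.3 + 0.1 * rng.normal(size=N)

# Design matrix with a leading column of ones (absorbs the intercept beta_0)
X = np.hstack([np.ones((N, 1)), X_raw])            # shape (N, d+1)

# Closed-form least-squares solution theta_hat = (X^T X)^{-1} X^T y
# (np.linalg.solve is preferred over an explicit inverse for numerical stability)
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

sse = np.sum((y - X @ theta_hat) ** 2)             # SSE(theta_hat) = ||y - X theta_hat||^2
print(theta_hat, sse)
```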
As for classification, linear methods have their limits
Create non-linear estimators using a non-linear feature map \(\Phi:\bbR^d\to\bbR^\ell:\bfx\mapsto\Phi(\bfx)\)
Regression model becomes \[ y = \bfbeta^\intercal\Phi(\bfx)+\beta_0\textsf{ with }\bfbeta\in\bbR^\ell \]
Least squares estimate of a cubic polynomial \(f\) with \(d=1\)
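A sketch of such a fit, assuming a noisy cubic ground truth and the feature map \(\Phi(x)=(x,x^2,x^3)\) (data and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 30
x = rng.uniform(-1, 1, size=N)                          # scalar inputs (d = 1)
y = 1 - 2 * x + 0.5 * x**3 + 0.1 * rng.normal(size=N)   # illustrative cubic target + noise

# Non-linear feature map Phi(x) = (x, x^2, x^3), plus a column of ones for beta_0
X = np.column_stack([np.ones(N), x, x**2, x**3])        # shape (N, 4)

# Same least-squares formula as before, now in the lifted feature space
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)   # approximately [1, -2, 0, 0.5]
```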
Overfitting describes the situation when fitting the data well no longer ensures that the out-of-sample (generalization) error is small
Overfitting occurs as the number of features \(d\) begins to approach the number of observations \(N\)
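A small illustration of this effect, using polynomial features of growing degree on a handful of training points (the target function, noise level, and degrees are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 15
x_train = rng.uniform(-1, 1, size=N)
x_test = rng.uniform(-1, 1, size=200)
f = lambda x: np.sin(np.pi * x)                    # illustrative target
y_train = f(x_train) + 0.1 * rng.normal(size=N)
y_test = f(x_test) + 0.1 * rng.normal(size=200)

def poly_features(x, deg):
    # Columns 1, x, x^2, ..., x^deg (the number of features approaches N as deg grows)
    return np.vander(x, deg + 1, increasing=True)

for deg in (1, 3, 9, 14):
    X_tr, X_te = poly_features(x_train, deg), poly_features(x_test, deg)
    theta, *_ = np.linalg.lstsq(X_tr, y_train, rcond=None)
    err_tr = np.mean((y_train - X_tr @ theta) ** 2)
    err_te = np.mean((y_test - X_te @ theta) ** 2)
    print(f"degree {deg:2d}: train MSE {err_tr:.3f}, test MSE {err_te:.3f}")
```

As the degree approaches \(N\), the training error keeps shrinking while the test error blows up.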
Idea: introduce a penalty term to “regularize” the vector \(\bftheta\): \[\hat{\bftheta} = \argmin_{\bftheta} \norm[2]{\bfy-\bfX\bftheta}^2+\norm[2]{\bfGamma\bftheta}^2\quad\textsf{where}\quad\bfGamma\in\bbR^{(d+1)\times(d+1)}\]
The minimizer of the least-squares problem with Tikhonov regularization is \[\hat{\bftheta} = (\bfX^\intercal\bfX+\bfGamma^\intercal\bfGamma)^{-1}\bfX^\intercal\bfy\]
With \(\bfGamma=\sqrt{\lambda} \mathbf{I}\) for some \(\lambda>0\), we obtain \(\hat{\bftheta} = (\bfX^\intercal\bfX+{\lambda}\mathbf{I})^{-1}\bfX^\intercal\bfy\)
Ridge regression does not penalize \(\beta_0\) and corresponds to \[\bfGamma = \left[\begin{array}{cccc}0 &0 & \cdots & 0\\0&\sqrt{\lambda}&\cdots&0\\\vdots&\ddots&\ddots&\vdots\\0&\cdots&\cdots&\sqrt{\lambda} \end{array}\right]\]
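A sketch of the Tikhonov closed form with this ridge choice of \(\bfGamma\) (data and \(\lambda\) are placeholders):

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 50, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # design matrix with intercept column
y = X @ np.array([0.3, 1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

lam = 0.1                                          # regularization weight (illustrative)

# Ridge choice of Gamma: sqrt(lambda) on the diagonal, with a 0 so beta_0 is not penalized
Gamma = np.sqrt(lam) * np.eye(d + 1)
Gamma[0, 0] = 0.0

# Tikhonov closed form: theta_hat = (X^T X + Gamma^T Gamma)^{-1} X^T y
theta_hat = np.linalg.solve(X.T @ X + Gamma.T @ Gamma, X.T @ y)
print(theta_hat)
```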
The minimizer of the least-squares problem with Tikhonov regularization is the solution of \[\argmin_{\bftheta} \norm[2]{\bfy-\bfX\bftheta}^2\textsf{ such that }\norm[2]{\bfGamma\bftheta}^2\leq \tau\] for some \(\tau>0\)
Tikhonov regularization yields a shrinkage estimator, which “shrinks” a naive estimate towards some guess
Let \(\{x_i\}_{i=1}^N\) be i.i.d. samples drawn from an unknown distribution with mean \(\mu\) and variance \(\sigma^2\), and let \(\hat{\mu}\eqdef\frac{1}{N}\sum_{i=1}^N x_i\). Consider the estimator of the variance \(\hat{\sigma}^2\eqdef \frac{1}{N}\sum_{i=1}^N(x_i-\hat{\mu})^2\). Then \(\E{\hat{\sigma}^2}=\frac{N-1}{N}\sigma^2\).
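Indeed, writing \(x_i-\hat{\mu}=(x_i-\mu)-(\hat{\mu}-\mu)\) and using \(\E{(x_i-\mu)(\hat{\mu}-\mu)}=\frac{\sigma^2}{N}\) and \(\E{(\hat{\mu}-\mu)^2}=\frac{\sigma^2}{N}\), \[\E{\hat{\sigma}^2}=\frac{1}{N}\sum_{i=1}^N\left(\sigma^2-\frac{2\sigma^2}{N}+\frac{\sigma^2}{N}\right)=\frac{N-1}{N}\sigma^2\]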
Least absolute shrinkage and selection operator (LASSO)
In constrained form \[\hat{\bftheta} = \argmin_{\bftheta} \norm[2]{\bfy-\bfX\bftheta}^2\textsf{ such that }\norm[1]{\bftheta}\leq \tau\]
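A minimal sketch of the LASSO in its penalized (Lagrangian) form, solved by iterative soft-thresholding (ISTA); the data, \(\lambda\), and iteration count are illustrative, and for a suitable \(\lambda\) the solution matches the constrained form above:

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 50, 10
X = rng.normal(size=(N, d))
theta_true = np.zeros(d)
theta_true[:3] = [2.0, -1.0, 0.5]                  # sparse ground truth (illustrative)
y = X @ theta_true + 0.1 * rng.normal(size=N)

lam = 0.5                                          # penalty weight (Lagrangian form of the constraint)
step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)       # step <= 1 / (2 ||X||_2^2) ensures convergence

# ISTA: gradient step on ||y - X theta||^2, then soft-thresholding (prox of lam * ||theta||_1)
theta = np.zeros(d)
for _ in range(2000):
    grad = 2 * X.T @ (X @ theta - y)
    z = theta - step * grad
    theta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)

print(theta)   # small entries are driven exactly to zero
```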