Regression and regularization

Matthieu Bloch

From classification to regression

  • Supervised learning with labeled dataset \(\{(\bfx_i,y_i)\}_{i=1}^N\), \(\bfx_i\in\bbR^d\)
    • Classification: \(y_i\in\calY\subset\bbR\) with \(\card{\calY}\eqdef K<\infty\)
    • Regression: \(y_i\in\calY=\bbR\)

  • Regression model: \(y=f(\bfx)+n\)
    • \(f\in\calH\), where \(\calH\) is a class of functions (polynomials, splines, kernels, etc.)
    • \(n\) is some random noise
  • Linear regression: \(\calH\) is the set of affine functions \[f(\bfx)\eqdef \bfbeta^\intercal\bfx+\beta_0\textsf{ with }\bfbeta\eqdef[\beta_1,\cdots,\beta_d]^\intercal\]

  • Least squares regression: the loss function is the sum of squared errors \[\mathrm{SSE}(\bfbeta,\beta_0)\eqdef\sum_{i=1}^N(y_i-\bfbeta^\intercal\bfx_i -\beta_0)^2\]
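
  • A minimal numpy sketch of evaluating the SSE for a given \((\bfbeta,\beta_0)\); the synthetic data and parameter values below are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3
X = rng.normal(size=(N, d))                    # rows are the feature vectors x_i
beta_true = np.array([1.0, -2.0, 0.5])
beta0_true = 0.3
y = X @ beta_true + beta0_true + 0.1 * rng.normal(size=N)   # y = f(x) + n

def sse(beta, beta0, X, y):
    """Sum of squared errors of the affine model f(x) = beta^T x + beta_0."""
    residuals = y - (X @ beta + beta0)
    return float(residuals @ residuals)

print(sse(beta_true, beta0_true, X, y))        # small: parameters match the data
print(sse(np.zeros(d), 0.0, X, y))             # much larger: all-zero parameters
```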

Linear Least Squares Regression

  • Widely used in applied mathematics
    • Legendre: Nouvelles méthodes pour la détermination des orbites des comètes, 1805
    • Gauss: Theoria Motus, 1809, with a claim of discovery in 1795
  • Change of notation \[\bftheta\eqdef\left[\begin{array}{c} \beta_0\\ \beta_1\\ \vdots\\ \beta_d\end{array}\right]\in\bbR^{d+1}\qquad \bfy \eqdef\left[\begin{array}{c} y_1\\ y_2\\ \vdots\\ y_N \end{array}\right]\in\bbR^{N} \qquad \bfX\eqdef\left[\begin{array}{cc}1&-\bfx_1^\intercal-\\1&-\bfx_2^\intercal-\\\vdots&\vdots\\1&-\bfx_N^\intercal-\end{array}\right]\in\bbR^{N\times (d+1)}\]

  • Rewrite the sum of squared errors as \(\mathrm{SSE}(\bftheta)\eqdef \norm[2]{\bfy-\bfX\bftheta}^2\)

  • If \(\bfX^\intercal\bfX\) is nonsingular, the minimizer of the SSE is \[\hat{\bftheta} = (\bfX^\intercal\bfX)^{-1}\bfX^\intercal\bfy\]
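
  • A sketch of this closed form using the augmented matrix \(\bfX\) defined above, on synthetic data; the normal equations are solved explicitly for transparency, though a direct least-squares solver such as np.linalg.lstsq is numerically preferable in practice.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 100, 4
X_raw = rng.normal(size=(N, d))
theta_true = rng.normal(size=d + 1)             # [beta_0, beta_1, ..., beta_d]
X = np.hstack([np.ones((N, 1)), X_raw])         # prepend the all-ones column
y = X @ theta_true + 0.05 * rng.normal(size=N)

# Normal equations (valid when X^T X is nonsingular)
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferable: solve the least-squares problem directly
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta_hat, theta_lstsq))      # True: both give the same minimizer
```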

Nonlinear regression

  • As for classification, linear methods have their limits

  • Create non-linear estimators using a non-linear feature map \(\Phi:\bbR^d\to\bbR^\ell:\bfx\mapsto\Phi(\bfx)\)

  • The regression model becomes \[ y = \bfbeta^\intercal\Phi(\bfx)+\beta_0\textsf{ with }\bfbeta\in\bbR^\ell \]

  • Least squares estimate of a cubic polynomial \(f\) with \(d=1\)
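
  • A sketch of this cubic example, assuming an invented target and noise level: the feature map \(\Phi(x)=[x, x^2, x^3]^\intercal\) turns the problem back into linear least squares.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 30
x = np.sort(rng.uniform(-1, 1, size=N))
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.1 * rng.normal(size=N)   # some cubic f plus noise

# Feature map Phi(x) = [x, x^2, x^3]; the all-ones column handles beta_0
Phi = np.column_stack([np.ones(N), x, x**2, x**3])
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)

print(theta_hat)   # estimates of [beta_0, beta_1, beta_2, beta_3]
```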

Overfitting

  • Overfitting describes the situation in which fitting the data well no longer ensures that the out-of-sample (generalization) error is small

  • Overfitting can happen in many situations:
    1. When there are too many degrees of freedom in the model (the model learns the noise, not \(f\))
    2. When the hypothesis set contains functions simpler than the target function \(f\) but there are too few samples
  • Overfitting also occurs as the number of features \(d\) approaches the number of observations \(N\)

  • Consider the true model \(y=x^2\). The example performs a least squares estimate with polynomial features of degree at most \(d\) using \(N=5\) sample points (a sketch of this experiment follows this list).
  • Solution to combat overfitting: regularization
    • Regularization aims at penalizing models that are likely to overfit
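
  • A minimal sketch of the experiment described above, assuming the true model \(y=x^2\) with a small amount of added noise: with \(N=5\) points, a degree-4 polynomial interpolates the samples (near-zero training error) but typically has a larger out-of-sample error than the degree-2 fit.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 5
x_train = np.linspace(-1, 1, N)
y_train = x_train**2 + 0.05 * rng.normal(size=N)   # true model y = x^2 plus noise
x_test = np.linspace(-1, 1, 200)
y_test = x_test**2

for deg in (2, 4):
    # Fit a degree-`deg` polynomial by least squares
    coeffs = np.polyfit(x_train, y_train, deg)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # degree 4 interpolates the noisy samples; its test error is typically the larger one
    print(deg, train_err, test_err)
```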

Tikhonov regularization

  • Idea: introduce a penalty term to “regularize” the vector \(\bftheta\): \[\hat{\bftheta} = \argmin_{\bftheta} \norm[2]{\bfy-\bfX\bftheta}^2+\norm[2]{\bfGamma\bftheta}^2\quad\textsf{where}\quad\bfGamma\in\bbR^{(d+1)\times(d+1)}\]

  • The minimizer of the least-squares problem with Tikhonov regularization is \[\hat{\bftheta} = (\bfX^\intercal\bfX+\bfGamma^\intercal\bfGamma)^{-1}\bfX^\intercal\bfy\]

  • With \(\bfGamma=\sqrt{\lambda} \mathbf{I}\) for some \(\lambda>0\), we obtain \(\hat{\bftheta} = (\bfX^\intercal\bfX+{\lambda}\mathbf{I})^{-1}\bfX^\intercal\bfy\)

  • Ridge regression does not penalize \(\beta_0\) and corresponds to \[\bfGamma = \left[\begin{array}{cccc}0 & 0 & \cdots & 0\\0 & \sqrt{\lambda} & \cdots & 0\\\vdots & \vdots & \ddots & \vdots\\0 & 0 & \cdots & \sqrt{\lambda} \end{array}\right]\]
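
  • A sketch of the Tikhonov closed form with \(\bfGamma=\sqrt{\lambda}\mathbf{I}\), together with the ridge variant that leaves \(\beta_0\) unpenalized; the targets and the value of \(\lambda\) below are placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 40, 10
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # augmented with ones column
y = rng.normal(size=N)                                       # placeholder targets
lam = 0.5

# Tikhonov with Gamma = sqrt(lam) * I: shrink every coordinate, including beta_0
eye = np.eye(d + 1)
theta_tikhonov = np.linalg.solve(X.T @ X + lam * eye, X.T @ y)

# Ridge regression: same formula, but the (0,0) entry of Gamma^T Gamma is zero
G2 = lam * np.eye(d + 1)
G2[0, 0] = 0.0
theta_ridge = np.linalg.solve(X.T @ X + G2, X.T @ y)

print(theta_tikhonov[0], theta_ridge[0])   # the intercept is penalized only in the first case
```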

Constrained optimization view

  • The minimizer of the least-squares problem with Tikhonov regularization is the solution of \[\argmin_{\bftheta} \norm[2]{\bfy-\bfX\bftheta}^2\textsf{ such that }\norm[2]{\bfGamma\bftheta}^2\leq \tau\] for some \(\tau>0\) (a Lagrangian sketch follows below)

  • Illustration: shrinking the estimator
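
  • One way to see the connection (a standard Lagrangian argument, sketched rather than proved): introducing a multiplier \(\lambda\geq 0\) for the constraint gives \[\mathcal{L}(\bftheta,\lambda)\eqdef \norm[2]{\bfy-\bfX\bftheta}^2+\lambda\left(\norm[2]{\bfGamma\bftheta}^2-\tau\right),\] so that, for fixed \(\lambda\), minimizing \(\mathcal{L}\) over \(\bftheta\) is exactly the penalized problem up to a constant; conversely, setting \(\tau=\norm[2]{\bfGamma\hat{\bftheta}}^2\) for the penalized minimizer \(\hat{\bftheta}\) makes \(\hat{\bftheta}\) feasible and optimal for the constrained problem.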

Shrinkage estimators

  • Tikhonov regularization yields a shrinkage estimator, which “shrinks” a naive estimate towards some guess

  • Let \(\{x_i\}_{i=1}^N\) be iid samples drawn according to an unknown distribution with variance \(\sigma^2\), and let \(\hat{\mu}\eqdef\frac{1}{N}\sum_{i=1}^N x_i\). Consider the estimator of the variance \(\hat{\sigma}^2\eqdef \frac{1}{N}\sum_{i=1}^N(x_i-\hat{\mu})^2\). Then \(\E{\hat{\sigma}^2}=\frac{N-1}{N}\sigma^2\) (a short computation follows this list).

  • Alternative regularizers
    • Akaike information criterion (AIC): penalty that is an increasing function of the number of estimated parameters
    • Bayesian information criterion (BIC): similar, with a penalty on the number of estimated parameters that also grows with the number of samples \(N\)
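
  • A short check of the claim about \(\hat{\sigma}^2\) above, assuming the distribution has mean \(\mu\) and using \(\sum_{i}(x_i-\hat{\mu})^2=\sum_i x_i^2-N\hat{\mu}^2\), \(\E{x_i^2}=\sigma^2+\mu^2\), and \(\E{\hat{\mu}^2}=\frac{\sigma^2}{N}+\mu^2\): \[\E{\hat{\sigma}^2}=\frac{1}{N}\E{\sum_{i=1}^N x_i^2-N\hat{\mu}^2}=\frac{1}{N}\left(N(\sigma^2+\mu^2)-N\left(\frac{\sigma^2}{N}+\mu^2\right)\right)=\frac{N-1}{N}\sigma^2\] The naive estimator thus underestimates \(\sigma^2\); rescaling by \(\frac{N}{N-1}\) removes the bias.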

LASSO

  • Least absolute shrinkage and selection operator (LASSO)

  • \[\hat{\bftheta} = \argmin_{\bftheta} \norm[2]{\bfy-\bfX\bftheta}^2+\lambda\norm[1]{\bftheta}\]
    • coordinates are shrunk by the same amount, which promotes sparsity
    • more computationally tractable than the \(\norm[0]{\bftheta}\) penalty
  • In constrained form \[\hat{\bftheta} = \argmin_{\bftheta} \norm[2]{\bfy-\bfX\bftheta}^2\textsf{ such that }\norm[1]{\bftheta}\leq \tau\]

  • No closed-form solution, but very powerful when
    • \(N \ll d\) (the least squares fit is susceptible to overfitting)
    • \(\bfX\) has a non-trivial null space and there is no obvious way to pick the best solution among many
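
  • Since there is no closed form, the LASSO is solved iteratively. Below is a minimal sketch of one standard approach, proximal gradient descent (ISTA) with soft-thresholding, on synthetic data with a sparse ground-truth \(\bftheta\); the step size, number of iterations, and regularization weight are illustrative choices, not tuned values.

```python
import numpy as np

rng = np.random.default_rng(5)
N, d = 50, 200                                   # N << d: more features than samples
X = rng.normal(size=(N, d))
theta_true = np.zeros(d)
theta_true[:5] = rng.normal(size=5)              # sparse ground truth
y = X @ theta_true + 0.01 * rng.normal(size=N)

lam = 0.1 * np.max(np.abs(X.T @ y))              # illustrative weight, a fraction of ||X^T y||_inf
step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)     # 1 / Lipschitz constant of the gradient

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (coordinate-wise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

theta = np.zeros(d)
for _ in range(5000):                            # ISTA iterations
    grad = 2 * X.T @ (X @ theta - y)             # gradient of ||y - X theta||_2^2
    theta = soft_threshold(theta - step * grad, step * lam)

# Typically only a few coordinates remain nonzero, illustrating the selection effect
print(np.count_nonzero(theta), "nonzero coordinates out of", d)
```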