Regression and regularization

Matthieu Bloch

From classification to regression

  • Supervised learning with labeled dataset \(\{(\bfx_i,y_i)\}_{i=1}^N\), \(\bfx_i\in\bbR^d\)
    • Classification: \(y_i\in\calY\subset\bbR\) with \(\card{\calY}\eqdef K<\infty\)
    • Regression: \(y_i\in\calY=\bbR\)

  • Regression model: \(y=f(\bfx)+n\)
    • \(f\in\calH\), where \(\calH\) is a class of functions (polynomials, splines, kernels, etc.)
    • \(n\) is some random noise
  • Linear regression: \(\calH\) is the set of affine functions \[f(\bfx)\eqdef \bfbeta^\intercal\bfx+\beta_0\textsf{ with }\bfbeta\eqdef[\beta_1,\cdots,\beta_d]^\intercal\]

  • Least squares regression: the loss function is the sum of squared errors \[\mathrm{SSE}(\bfbeta,\beta_0)\eqdef\sum_{i=1}^N(y_i-\bfbeta^\intercal\bfx_i -\beta_0)^2\]
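
  • A minimal numpy sketch of evaluating the SSE for a given \((\bfbeta,\beta_0)\); the synthetic data and parameter values below are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3
X = rng.normal(size=(N, d))                    # rows are the feature vectors x_i
beta_true = np.array([1.0, -2.0, 0.5])
beta0_true = 0.3
y = X @ beta_true + beta0_true + 0.1 * rng.normal(size=N)   # y = f(x) + n

def sse(beta, beta0, X, y):
    """Sum of squared errors of the affine model f(x) = beta^T x + beta_0."""
    residuals = y - (X @ beta + beta0)
    return float(residuals @ residuals)

print(sse(beta_true, beta0_true, X, y))        # small: parameters match the data
print(sse(np.zeros(d), 0.0, X, y))             # much larger: all-zero parameters
```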

Linear Least Squares Regression

  • Widely used in applied mathematics
    • Legendre: Nouvelles méthodes pour la détermination des orbites des comètes, 1805
    • Gauss: Theoria Motus, 1809, with a claim of discovery in 1795
  • Change of notation \[\bftheta\eqdef\left[\begin{array}{c} \beta_0\\ \beta_1\\ \vdots\\ \beta_d\end{array}\right]\in\bbR^{d+1}\qquad \bfy \eqdef\left[\begin{array}{c} y_1\\ y_2\\ \vdots\\ y_N \end{array}\right]\in\bbR^{N} \qquad \bfX\eqdef\left[\begin{array}{cc}1&-\bfx_1^\intercal-\\1&-\bfx_2^\intercal-\\\vdots&\vdots\\1&-\bfx_N^\intercal-\end{array}\right]\in\bbR^{N\times (d+1)}\]

  • Rewrite the sum of squared errors as \(\mathrm{SSE}(\bftheta)\eqdef \norm[2]{\bfy-\bfX\bftheta}^2\)

  • If \(\bfX^\intercal\bfX\) is nonsingular, the minimizer of the SSE is \[\hat{\bftheta} = (\bfX^\intercal\bfX)^{-1}\bfX^\intercal\bfy\]
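
  • A sketch of this closed form using the augmented matrix \(\bfX\) defined above, on synthetic data; the normal equations are solved explicitly for transparency, though a direct least-squares solver such as np.linalg.lstsq is numerically preferable in practice.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 100, 4
X_raw = rng.normal(size=(N, d))
theta_true = rng.normal(size=d + 1)             # [beta_0, beta_1, ..., beta_d]
X = np.hstack([np.ones((N, 1)), X_raw])         # prepend the all-ones column
y = X @ theta_true + 0.05 * rng.normal(size=N)

# Normal equations (valid when X^T X is nonsingular)
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferable: solve the least-squares problem directly
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta_hat, theta_lstsq))      # True: both give the same minimizer
```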

Nonlinear regression

  • As for classification, linear methods have their limits

  • Create non-linear estimators using a non-linear feature map \(\Phi:\bbR^d\to\bbR^\ell:\bfx\mapsto\Phi(\bfx)\)

  • The regression model becomes \[ y = \bfbeta^\intercal\Phi(\bfx)+\beta_0\textsf{ with }\bfbeta\in\bbR^\ell \]

  • Least squares estimate of a cubic polynomial \(f\) with \(d=1\)
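
  • A sketch of this cubic example, assuming an invented target and noise level: the feature map \(\Phi(x)=[x, x^2, x^3]^\intercal\) turns the problem back into linear least squares.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 30
x = np.sort(rng.uniform(-1, 1, size=N))
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.1 * rng.normal(size=N)   # some cubic f plus noise

# Feature map Phi(x) = [x, x^2, x^3]; the all-ones column handles beta_0
Phi = np.column_stack([np.ones(N), x, x**2, x**3])
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)

print(theta_hat)   # estimates of [beta_0, beta_1, beta_2, beta_3]
```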

Overfitting

  • Overfitting describes the situation in which fitting the data well no longer ensures that the out-of-sample (generalization) error is small

  • Overfitting can happen in many situations:
    1. When there are too many degrees of freedom in the model (the model learns the noise, not \(f\))
    2. When the hypothesis set contains functions simpler than the target function \(f\) but there are too few samples
  • Overfitting also occurs as the number of features \(d\) approaches the number of observations \(N\)

  • Consider the true model \(y=x^2\). The example performs a least squares estimate with polynomial features of degree at most \(d\) using \(N=5\) sample points (a sketch of this experiment follows this list).
  • Solution to combat overfitting: regularization
    • Regularization aims at penalizing models that are likely to overfit
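
  • A minimal sketch of the experiment described above, assuming the true model \(y=x^2\) with a small amount of added noise: with \(N=5\) points, a degree-4 polynomial interpolates the samples (near-zero training error) but typically has a larger out-of-sample error than the degree-2 fit.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 5
x_train = np.linspace(-1, 1, N)
y_train = x_train**2 + 0.05 * rng.normal(size=N)   # true model y = x^2 plus noise
x_test = np.linspace(-1, 1, 200)
y_test = x_test**2

for deg in (2, 4):
    # Fit a degree-`deg` polynomial by least squares
    coeffs = np.polyfit(x_train, y_train, deg)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # degree 4 interpolates the noisy samples; its test error is typically the larger one
    print(deg, train_err, test_err)
```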

Tikhonov regularization

  • Idea: introduce a penalty term to “regularize” the vector \(\bftheta\): \[\hat{\bftheta} = \argmin_{\bftheta} \norm[2]{\bfy-\bfX\bftheta}^2+\norm[2]{\bfGamma\bftheta}^2\quad\textsf{where}\quad\bfGamma\in\bbR^{(d+1)\times(d+1)}\]

  • The minimizer of the least-squares problem with Tikhonov regularization is \[\hat{\bftheta} = (\bfX^\intercal\bfX+\bfGamma^\intercal\bfGamma)^{-1}\bfX^\intercal\bfy\]

  • With \(\bfGamma=\sqrt{\lambda} \mathbf{I}\) for some \(\lambda>0\), we obtain \(\hat{\bftheta} = (\bfX^\intercal\bfX+{\lambda}\mathbf{I})^{-1}\bfX^\intercal\bfy\)

  • Ridge regression does not penalize \(\beta_0\) and corresponds to \[\bfGamma = \left[\begin{array}{cccc}0 & 0 & \cdots & 0\\0 & \sqrt{\lambda} & \cdots & 0\\\vdots & \vdots & \ddots & \vdots\\0 & 0 & \cdots & \sqrt{\lambda} \end{array}\right]\]
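
  • A sketch of the Tikhonov closed form with \(\bfGamma=\sqrt{\lambda}\mathbf{I}\), together with the ridge variant that leaves \(\beta_0\) unpenalized; the targets and the value of \(\lambda\) below are placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 40, 10
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # augmented with ones column
y = rng.normal(size=N)                                       # placeholder targets
lam = 0.5

# Tikhonov with Gamma = sqrt(lam) * I: shrink every coordinate, including beta_0
eye = np.eye(d + 1)
theta_tikhonov = np.linalg.solve(X.T @ X + lam * eye, X.T @ y)

# Ridge regression: same formula, but the (0,0) entry of Gamma^T Gamma is zero
G2 = lam * np.eye(d + 1)
G2[0, 0] = 0.0
theta_ridge = np.linalg.solve(X.T @ X + G2, X.T @ y)

print(theta_tikhonov[0], theta_ridge[0])   # the intercept is penalized only in the first case
```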

Constrained optimization view

  • The minimizer of the least-squares problem with Tikhonov regularization is the solution of \[\argmin_{\bftheta} \norm[2]{\bfy-\bfX\bftheta}^2\textsf{ such that }\norm[2]{\bfGamma\bftheta}^2\leq \tau\] for some \(\tau>0\) (a Lagrangian sketch follows below)

  • Illustration: shrinking the estimator
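
  • One way to see the connection (a standard Lagrangian argument, sketched rather than proved): introducing a multiplier \(\lambda\geq 0\) for the constraint gives \[\mathcal{L}(\bftheta,\lambda)\eqdef \norm[2]{\bfy-\bfX\bftheta}^2+\lambda\left(\norm[2]{\bfGamma\bftheta}^2-\tau\right),\] so that, for fixed \(\lambda\), minimizing \(\mathcal{L}\) over \(\bftheta\) is exactly the penalized problem up to a constant; conversely, setting \(\tau=\norm[2]{\bfGamma\hat{\bftheta}}^2\) for the penalized minimizer \(\hat{\bftheta}\) makes \(\hat{\bftheta}\) feasible and optimal for the constrained problem.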

Shrinkage estimators

  • Tikhonov regularization yields a shrinkage estimator, which “shrinks” a naive estimate towards some guess

  • Let \(\{x_i\}_{i=1}^N\) be iid samples drawn according to an unknown distribution with variance \(\sigma^2\), and let \(\hat{\mu}\eqdef\frac{1}{N}\sum_{i=1}^N x_i\). Consider the estimator of the variance \(\hat{\sigma}^2\eqdef \frac{1}{N}\sum_{i=1}^N(x_i-\hat{\mu})^2\). Then \(\E{\hat{\sigma}^2}=\frac{N-1}{N}\sigma^2\) (a short computation follows this list).

  • Alternative regularizers
    • Akaike information criterion (AIC): penalty that is an increasing function of the number of estimated parameters
    • Bayesian information criterion (BIC): similar, with a penalty on the number of estimated parameters that also grows with the number of samples \(N\)
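
  • A short check of the claim about \(\hat{\sigma}^2\) above, assuming the distribution has mean \(\mu\) and using \(\sum_{i}(x_i-\hat{\mu})^2=\sum_i x_i^2-N\hat{\mu}^2\), \(\E{x_i^2}=\sigma^2+\mu^2\), and \(\E{\hat{\mu}^2}=\frac{\sigma^2}{N}+\mu^2\): \[\E{\hat{\sigma}^2}=\frac{1}{N}\E{\sum_{i=1}^N x_i^2-N\hat{\mu}^2}=\frac{1}{N}\left(N(\sigma^2+\mu^2)-N\left(\frac{\sigma^2}{N}+\mu^2\right)\right)=\frac{N-1}{N}\sigma^2\] The naive estimator thus underestimates \(\sigma^2\); rescaling by \(\frac{N}{N-1}\) removes the bias.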

LASSO

  • Least absolute shrinkage and selection operator (LASSO)

  • \[\hat{\bftheta} = \argmin_{\bftheta} \norm[2]{\bfy-\bfX\bftheta}^2+\lambda\norm[1]{\bftheta}\]
    • coordinates are shrunk by the same amount, which promotes sparsity
    • more computationally tractable than the \(\norm[0]{\bftheta}\) penalty
  • In constrained form \[\hat{\bftheta} = \argmin_{\bftheta} \norm[2]{\bfy-\bfX\bftheta}^2\textsf{ such that }\norm[1]{\bftheta}\leq \tau\]

  • No closed-form solution, but very powerful when
    • \(N \ll d\) (the least squares fit is susceptible to overfitting)
    • \(\bfX\) has a non-trivial null space and there is no obvious way to pick the best solution among many
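
  • Since there is no closed form, the LASSO is solved iteratively. Below is a minimal sketch of one standard approach, proximal gradient descent (ISTA) with soft-thresholding, on synthetic data with a sparse ground-truth \(\bftheta\); the step size, number of iterations, and regularization weight are illustrative choices, not tuned values.

```python
import numpy as np

rng = np.random.default_rng(5)
N, d = 50, 200                                   # N << d: more features than samples
X = rng.normal(size=(N, d))
theta_true = np.zeros(d)
theta_true[:5] = rng.normal(size=5)              # sparse ground truth
y = X @ theta_true + 0.01 * rng.normal(size=N)

lam = 0.1 * np.max(np.abs(X.T @ y))              # illustrative weight, a fraction of ||X^T y||_inf
step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)     # 1 / Lipschitz constant of the gradient

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (coordinate-wise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

theta = np.zeros(d)
for _ in range(5000):                            # ISTA iterations
    grad = 2 * X.T @ (X @ theta - y)             # gradient of ||y - X theta||_2^2
    theta = soft_threshold(theta - step * grad, step * lam)

# Typically only a few coordinates remain nonzero, illustrating the selection effect
print(np.count_nonzero(theta), "nonzero coordinates out of", d)
```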