Regression and regularization

Matthieu Bloch

General approach to regression

  • A general approach to regression is to solve \[ \hat{\bftheta} = \argmin_{\bftheta} L(\bftheta,\bfX,\bfy)+\lambda r(\bftheta) \] where \(L:\bbR^{d+1}\times\bbR^{N\times (d+1)}\times\bbR^N\to\bbR\) is a loss function and \(r:\bbR^{d+1}\to\bbR\) is a regularizer

  • Loss functions ensure data fidelity, while regularizers limit model complexity

  • Examples of loss functions (see the code sketch below) \[\begin{aligned} L_{\mathrm{AE}}(r)&\eqdef\abs{r}\qquad\textsf{(absolute error)}\\ L_{\mathrm{H}}(r)&=\begin{cases}\frac{1}{2}r^2&\textsf{ if }\abs{r}\leq c\\c\abs{r}-\frac{c^2}{2}&\textsf{ else}\end{cases}\qquad\textsf{(Huber loss)}\\ L_{\epsilon}(r)&=\begin{cases}0&\textsf{ if }\abs{r}\leq \epsilon\\\abs{r}-\epsilon&\textsf{ else}\end{cases}\qquad\textsf{($\epsilon$-insensitive loss)}\\ \end{aligned} \]
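
  • A minimal NumPy sketch of these three losses (the function names and the values of \(c\) and \(\epsilon\) are illustrative choices, not fixed by the slides):

```python
import numpy as np

def absolute_loss(r):
    """Absolute error loss L_AE(r) = |r|."""
    return np.abs(r)

def huber_loss(r, c=1.0):
    """Huber loss: quadratic for |r| <= c, linear beyond."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= c, 0.5 * r**2, c * np.abs(r) - 0.5 * c**2)

def eps_insensitive_loss(r, eps=0.5):
    """epsilon-insensitive loss: zero inside the margin, |r| - eps outside."""
    return np.maximum(0.0, np.abs(r) - eps)

residuals = np.linspace(-3, 3, 7)
print(absolute_loss(residuals))
print(huber_loss(residuals, c=1.0))
print(eps_insensitive_loss(residuals, eps=0.5))
```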

Regularized robust regression

  • Combine \(\epsilon\)-insensitive loss with \(\ell_2\) regularizer \[\hat{\bfbeta},\hat{\beta}_0=\argmin_{\bfbeta,\beta_0}\sum_{i=1}^NL_{\epsilon}(y_i-(\bfbeta^\intercal\bfx_i+\beta_0))+\frac{\lambda}{2}\norm[2]{\bfbeta}^2\]
    • Insensitive as long as the prediction is within a margin of \(\epsilon\) of the true value
    • This looks like SVMs… can we kernelize?
  • The dual problem is \[\argmin_{\bfalpha,\bfalpha^*}\sum_{i=1}^{N}\left((\epsilon-y_i)\alpha_i+(\epsilon+y_i)\alpha_i^*\right)+\frac{1}{2}\sum_{i,j}(\alpha_i-\alpha_i^*)(\alpha_j-\alpha_j^*)\bfx_j^\intercal\bfx_i\] \[\textsf{such that }0\leq \alpha_i^*,\alpha_i\leq\frac{1}{\lambda},\quad \sum_{i=1}^N(\alpha_i^*-\alpha_i)=0,\quad \alpha_i^*\alpha_i=0\] The solution is then \(\hat{y} = \sum_{i=1}^N(\alpha_i-\alpha_i^*)\bfx_i^\intercal\bfx+\hat{\beta}_0\)
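  • A hedged sketch of this problem using scikit-learn's `SVR` with a linear kernel; the mapping \(C=1/\lambda\) between the slide's regularization weight and scikit-learn's parameter is an assumption about scaling, and the toy data are arbitrary:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.8 * X[:, 0] + 0.3 * rng.standard_normal(100)

lam, eps = 0.1, 0.2
# scikit-learn solves min (1/2)||w||^2 + C * sum of eps-insensitive losses,
# which matches the slide's objective (after dividing by lambda) with C = 1/lambda.
model = SVR(kernel="linear", C=1.0 / lam, epsilon=eps).fit(X, y)

print("slope:", model.coef_, "intercept:", model.intercept_)
print("fraction of support vectors:", len(model.support_) / len(X))
```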

Kernelized regularized robust regression

  • Recap: kernelizing an algorithm
    1. Show that the training process only involves the training data via inner products, e.g., \(\bfx_i^\intercal\bfx_j\)
    2. Show that applying the decision rule only involves inner products, e.g., \(\bfw^\intercal\bfx\)
    3. Replace all inner products with evaluations of the kernel function \(k(\cdot,\cdot)\)
  • The kernelized dual problem is \[\argmin_{\bfalpha,\bfalpha^*}\sum_{i=1}^{N}\left((\epsilon-y_i)\alpha_i+(\epsilon+y_i)\alpha_i^*\right)+\frac{1}{2}\sum_{i,j}(\alpha_i-\alpha_i^*)(\alpha_j-\alpha_j^*)k(\bfx_j,\bfx_i)\] \[\textsf{such that }0\leq \alpha_i^*,\alpha_i\leq\frac{1}{\lambda},\quad \sum_{i=1}^N(\alpha_i^*-\alpha_i)=0\] The solution is then \(\hat{y} = \sum_{i=1}^N(\alpha_i-\alpha_i^*)k(\bfx_i,\bfx)+\hat{\beta}_0\).
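  • A sketch checking the kernelized predictor numerically with scikit-learn's `SVR`: the fitted `dual_coef_` stores \(\alpha_i-\alpha_i^*\) for the support vectors, so \(\hat{y}=\sum_i(\alpha_i-\alpha_i^*)k(\bfx_i,\bfx)+\hat{\beta}_0\) can be rebuilt by hand (the RBF kernel and toy data are illustrative choices):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(80)

gamma = 0.5
model = SVR(kernel="rbf", gamma=gamma, C=10.0, epsilon=0.1).fit(X, y)

# Rebuild y_hat = sum_i (alpha_i - alpha_i^*) k(x_i, x) + beta_0 from the duals
X_test = rng.uniform(-3, 3, size=(5, 1))
K_test = rbf_kernel(model.support_vectors_, X_test, gamma=gamma)  # (n_sv, 5)
y_manual = (model.dual_coef_ @ K_test).ravel() + model.intercept_

print(np.allclose(y_manual, model.predict(X_test)))  # True
```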

Kernelized LASSO

  • LASSO regression \[\hat{\bftheta} = \argmin_{\bftheta} \norm[2]{\bfy-\bfX\bftheta}^2+\lambda\norm[1]{\bftheta}\]

  • Not easy to kernelize directly

  • Look instead for a specific \(\bftheta\) of the form \(\bftheta\eqdef \sum_{i}\alpha_i\bfx_i=\bfX^\intercal\bfalpha\) and solve \[\hat{\bfalpha} = \argmin_{\bfalpha} \norm[2]{\bfy-\bfX\bfX^\intercal\bfalpha}^2+\lambda\norm[1]{\bfalpha}\]
    • This can be kernelized since \(\bfX\bfX^\intercal\) only involves inner products \(\bfx_i^\intercal\bfx_j\) (see the sketch below), but it promotes sparsity in \(\bfalpha\), not \(\bftheta\)
    • Not really LASSO anymore
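
  • A minimal sketch of this kernelized variant, assuming \(\bftheta=\bfX^\intercal\bfalpha\) so that the fit only needs \(\bfK=\bfX\bfX^\intercal\); scikit-learn's `Lasso` scales the two terms slightly differently, so the penalty value here is illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
N, d = 50, 5
X = rng.standard_normal((N, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(N)

# With theta = X^T alpha, ||y - X theta||^2 = ||y - K alpha||^2 where K = X X^T
K = X @ X.T
# sklearn's Lasso minimizes (1/2N)||y - K a||^2 + alpha ||a||_1:
# the l1 penalty is on the dual weights alpha, not on theta
model = Lasso(alpha=0.1, fit_intercept=False).fit(K, y)
alpha_hat = model.coef_
theta_hat = X.T @ alpha_hat  # recovered theta is generally not sparse

print("nonzero entries of alpha:", int(np.sum(alpha_hat != 0)), "out of", N)
```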

Kernelized ridge regression

  • Centered ridge regression: \[\hat{\bfbeta}=\argmin_{\bfbeta}\sum_{i=1}^N\left(y_i-\bar{y}-\bfbeta^\intercal(\bfx_i-\bar{\bfx})\right)^2+\frac{\lambda}{2}\norm[2]{\bfbeta}^2\] with \(\bar{\bfx}=\frac{1}{N}\sum_i\bfx_i\) and \(\bar{y}=\frac{1}{N}\sum_{i}y_i\). The solution is then \(\hat{y}=\bar{y}+\hat{\bfbeta}^\intercal(\bfx-\bar{\bfx})\).

  • \[\hat{\bfbeta}=(\bfA^\intercal\bfA+\lambda\bfI)^{-1}\bfA^\intercal\tilde{\bfy}\] \[\bfA\eqdef\left[\begin{array}{c}(\bfx_1-\bar{\bfx})^\intercal\\(\bfx_2-\bar{\bfx})^\intercal\\\vdots\\(\bfx_N-\bar{\bfx})^\intercal\end{array}\right]\qquad\tilde{\bfy}\eqdef\left[\begin{array}{c}y_1-\bar{y}\\y_2-\bar{y}\\\vdots\\y_N-\bar{y}\end{array}\right]\]

  • Not straightforward to kernelize because \(\bfA^\intercal\bfA\) is a sum of outer products \((\bfx_i-\bar{\bfx})(\bfx_i-\bar{\bfx})^\intercal\), not inner products
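
  • A direct NumPy sketch of this centered ridge solution (toy data and \(\lambda\) are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, lam = 60, 3, 0.5
X = rng.standard_normal((N, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.2 * rng.standard_normal(N)

# Center the data, then solve beta_hat = (A^T A + lambda I)^{-1} A^T y_tilde
x_bar, y_bar = X.mean(axis=0), y.mean()
A = X - x_bar
y_tilde = y - y_bar
beta_hat = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y_tilde)

# Prediction for a new point: y_hat = y_bar + beta_hat^T (x - x_bar)
x_new = rng.standard_normal(d)
y_hat = y_bar + beta_hat @ (x_new - x_bar)
print(beta_hat, y_hat)
```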

Kernelized ridge regression

  • Matrix inversion lemma (Woodbury identity): \[(\bfP+\bfQ\bfR\bfS)^{-1} = \bfP^{-1}-\bfP^{-1}\bfQ(\bfR^{-1}+\bfS\bfP^{-1}\bfQ)^{-1}\bfS\bfP^{-1}\]

  • Applying the identity with \(\bfP=\lambda\bfI\), \(\bfQ=\bfA^\intercal\), \(\bfR=\bfI\), and \(\bfS=\bfA\) gives \[\hat{\bfbeta} = \frac{1}{\lambda}(\bfA^\intercal-\bfA^\intercal(\lambda\bfI+\bfK)^{-1}\bfK)\tilde{\bfy}\textsf{ with }\bfK_{i,j}\eqdef(\bfx_i-\bar{\bfx})^\intercal(\bfx_j-\bar{\bfx})\] \[\hat{y} = \bar{y}+\frac{1}{\lambda}\tilde{\bfy}^\intercal(\bfI-\bfK(\lambda\bfI+\bfK)^{-1})\bfk(\bfx)\textsf{ with }\bfk(\bfx) =\left[\begin{array}{c}(\bfx_1-\bar{\bfx})^\intercal(\bfx-\bar{\bfx})\\\vdots\\(\bfx_N-\bar{\bfx})^\intercal(\bfx-\bar{\bfx})\end{array}\right]\]
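
  • A quick numerical check that the Woodbury rewrite matches the primal solution (random toy data, arbitrary \(\lambda\)):

```python
import numpy as np

rng = np.random.default_rng(4)
N, d, lam = 40, 3, 0.7
X = rng.standard_normal((N, d))
y = X @ np.array([0.5, 1.5, -1.0]) + 0.1 * rng.standard_normal(N)

A = X - X.mean(axis=0)
y_tilde = y - y.mean()
K = A @ A.T  # K_ij = (x_i - x_bar)^T (x_j - x_bar): inner products only

# Primal form: (A^T A + lambda I)^{-1} A^T y_tilde
beta_primal = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y_tilde)

# Kernel form: (1/lambda) (A^T - A^T (lambda I + K)^{-1} K) y_tilde
beta_kernel = (A.T @ y_tilde
               - A.T @ np.linalg.solve(lam * np.eye(N) + K, K @ y_tilde)) / lam

print(np.allclose(beta_primal, beta_kernel))  # True
```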

Kernelized ridge regression

  • For many kernels, the feature map \(\Phi(\bfx)\) already contains a constant component, so we often omit \(\beta_0\)

  • \[\hat{y} = {\bfy}^\intercal(\lambda\bfI+\bfK)^{-1}\bfk(\bfx)\] \[\bfk(\bfx) =\left[\begin{array}{c}k(\bfx_1,\bfx)\\\vdots\\k(\bfx_N,\bfx)\end{array}\right]\qquad \bfK_{i,j}=k(\bfx_i,\bfx_j)\]

  • Illustration: Gaussian kernels
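
  • A minimal sketch of this predictor with a Gaussian kernel, in the spirit of the illustration (the bandwidth and \(\lambda\) are arbitrary choices):

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=0.5):
    """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(50)

lam = 0.1
K = gaussian_kernel(X, X)

# y_hat(x) = y^T (lambda I + K)^{-1} k(x), with k(x)_i = k(x_i, x)
weights = np.linalg.solve(lam * np.eye(len(X)) + K, y)  # (lambda I + K)^{-1} y
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
y_hat = gaussian_kernel(X_test, X) @ weights

print(y_hat)
```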

Regularization and classification

  • Question: does the viewpoint of loss function + regularization apply to classification?

  • Logistic regression: the maximum likelihood problem \(\min_{\bftheta}-\ell(\bftheta)\), where \(\ell\) is the log-likelihood, can be regularized as \(\min_{\bftheta}-\ell(\bftheta) + \lambda\norm[2]{\bftheta}^2\)
    • makes the Hessian matrix well conditioned
    • super useful when the number of observations is small
    • also helpful when data is not separable
  • New way to think about classification: solve the problem \[ \argmin_{\bfw,b}\frac{1}{N}\sum_{i=1}^N \indic{y_i(\bfw^\intercal\bfx_i+b)<0} \]
    • very hard analytically and computationally
  • Modified problem with an upper bound \(\phi(t)\) on \(\indic{t<0}\) (a surrogate loss) that is easier to optimize. \[ \argmin_{\bfw,b}\frac{1}{N}\sum_{i=1}^N \phi\left(y_i(\bfw^\intercal\bfx_i+b)\right) \]
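  • A small numerical illustration that a surrogate \(\phi\) upper-bounds the 0-1 loss; the hinge \(\phi(t)=(1-t)_+\) from the next slide is used, and the toy classifier \((\bfw,b)\) is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(6)
N = 200
X = rng.standard_normal((N, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(N))

# Some candidate linear classifier (w, b)
w, b = np.array([1.0, 0.3]), 0.05
margins = y * (X @ w + b)  # t_i = y_i (w^T x_i + b)

zero_one = np.mean(margins < 0)                   # empirical 0-1 loss
hinge = np.mean(np.maximum(0.0, 1.0 - margins))   # surrogate phi(t) = (1 - t)_+

print(zero_one, hinge, hinge >= zero_one)  # the surrogate is an upper bound
```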

Hinge loss

  • Take \(\phi(t)\eqdef \max(0,1-t)\eqdef (1-t)_+\)

  • Add regularization to prevent overfitting \[ \argmin_{\bfw,b}\frac{1}{N}\sum_{i=1}^N \left(1-y_i(\bfw^\intercal\bfx_i+b)\right)_+ + \frac{\lambda}{2}\norm[2]{\bfw}^2 \]

  • Question: what classifier is this?
    • Answer: Soft-margin hyperplanes!
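
  • A hedged sketch of the connection: minimizing the regularized hinge objective by plain subgradient descent yields a soft-margin hyperplane, with the margin violations playing the role of the slack variables (toy data, step size, and iteration count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
N = 200
X = rng.standard_normal((N, 2))
y = np.sign(X[:, 0] - X[:, 1] + 0.3 * rng.standard_normal(N))

lam, lr, iters = 0.1, 0.05, 2000
w, b = np.zeros(2), 0.0
for _ in range(iters):
    margins = y * (X @ w + b)
    active = margins < 1  # points with nonzero hinge loss (margin violators)
    # Subgradient of (1/N) sum_i (1 - y_i(w^T x_i + b))_+ + (lambda/2)||w||^2
    grad_w = -(y[active, None] * X[active]).sum(axis=0) / N + lam * w
    grad_b = -y[active].sum() / N
    w -= lr * grad_w
    b -= lr * grad_b

margins = y * (X @ w + b)
print("training error:", np.mean(margins < 0))
print("fraction with positive slack:", np.mean(margins < 1))
```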