Regression and regularization

Matthieu Bloch

General approach to regression

  • A general approach to regression is to solve \[ \hat{\bftheta} = \argmin_{\bftheta} L(\bftheta,\bfX,\bfy)+\lambda r(\bftheta) \] where \(L:\bbR^{d+1}\times\bbR^{N\times (d+1)}\times\bbR^N\to\bbR\) is a loss function and \(r:\bbR^{d+1}\to\bbR\) is a regularizer

  • Loss functions ensure data fidelity, while regularizers limit model complexity

  • Examples of loss functions (see the code sketch below) \[\begin{aligned} L_{\mathrm{AE}}(r)&\eqdef\abs{r}\qquad\textsf{(absolute error)}\\ L_{\mathrm{H}}(r)&=\begin{cases}\frac{1}{2}r^2&\textsf{ if }\abs{r}\leq c\\c\abs{r}-\frac{c^2}{2}&\textsf{ else}\end{cases}\qquad\textsf{(Huber loss)}\\ L_{\epsilon}(r)&=\begin{cases}0&\textsf{ if }\abs{r}\leq \epsilon\\\abs{r}-\epsilon&\textsf{ else}\end{cases}\qquad\textsf{($\epsilon$-insensitive loss)}\\ \end{aligned} \]
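
  • A minimal NumPy sketch of these three losses (the function names and the values of \(c\) and \(\epsilon\) are illustrative choices, not fixed by the slides):

```python
import numpy as np

def absolute_loss(r):
    """Absolute error loss L_AE(r) = |r|."""
    return np.abs(r)

def huber_loss(r, c=1.0):
    """Huber loss: quadratic for |r| <= c, linear beyond."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= c, 0.5 * r**2, c * np.abs(r) - 0.5 * c**2)

def eps_insensitive_loss(r, eps=0.5):
    """epsilon-insensitive loss: zero inside the margin, |r| - eps outside."""
    return np.maximum(0.0, np.abs(r) - eps)

residuals = np.linspace(-3, 3, 7)
print(absolute_loss(residuals))
print(huber_loss(residuals, c=1.0))
print(eps_insensitive_loss(residuals, eps=0.5))
```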

Regularized robust regression

  • Combine \(\epsilon\)-insensitive loss with \(\ell_2\) regularizer \[\hat{\bfbeta},\hat{\beta}_0=\argmin_{\bfbeta,\beta_0}\sum_{i=1}^NL_{\epsilon}(y_i-(\bfbeta^\intercal\bfx_i+\beta_0))+\frac{\lambda}{2}\norm[2]{\bfbeta}^2\]
    • Insensitive as long as the prediction is within a margin of \(\epsilon\) of the true value
    • This looks like SVMs… can we kernelize?
  • The dual problem is \[\argmin_{\bfalpha,\bfalpha^*}\sum_{i=1}^{N}\left((\epsilon-y_i)\alpha_i+(\epsilon+y_i)\alpha_i^*\right)+\frac{1}{2}\sum_{i,j}(\alpha_i-\alpha_i^*)(\alpha_j-\alpha_j^*)\bfx_j^\intercal\bfx_i\] \[\textsf{such that }0\leq \alpha_i^*,\alpha_i\leq\frac{1}{\lambda},\quad \sum_{i=1}^N(\alpha_i^*-\alpha_i)=0,\quad \alpha_i^*\alpha_i=0\] The solution is then \(\hat{y} = \sum_{i=1}^N(\alpha_i-\alpha_i^*)\bfx_i^\intercal\bfx+\hat{\beta}_0\)
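  • A hedged sketch of this problem using scikit-learn's `SVR` with a linear kernel; the mapping \(C=1/\lambda\) between the slide's regularization weight and scikit-learn's parameter is an assumption about scaling, and the toy data are arbitrary:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.8 * X[:, 0] + 0.3 * rng.standard_normal(100)

lam, eps = 0.1, 0.2
# scikit-learn solves min (1/2)||w||^2 + C * sum of eps-insensitive losses,
# which matches the slide's objective (after dividing by lambda) with C = 1/lambda.
model = SVR(kernel="linear", C=1.0 / lam, epsilon=eps).fit(X, y)

print("slope:", model.coef_, "intercept:", model.intercept_)
print("fraction of support vectors:", len(model.support_) / len(X))
```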

Kernelized regularized robust regression

  • Recap: kernelizing an algorithm
    1. Show that the training process only involves the training data via inner products, e.g., \(\bfx_i^\intercal\bfx_j\)
    2. Show that applying the decision rule only involves inner products, e.g., \(\bfw^\intercal\bfx\)
    3. Replace all inner products with evaluations of the kernel function \(k(\cdot,\cdot)\)
  • The kernelized dual problem is \[\argmin_{\bfalpha,\bfalpha^*}\sum_{i=1}^{N}\left((\epsilon-y_i)\alpha_i+(\epsilon+y_i)\alpha_i^*\right)+\frac{1}{2}\sum_{i,j}(\alpha_i-\alpha_i^*)(\alpha_j-\alpha_j^*)k(\bfx_j,\bfx_i)\] \[\textsf{such that }0\leq \alpha_i^*,\alpha_i\leq\frac{1}{\lambda},\quad \sum_{i=1}^N(\alpha_i^*-\alpha_i)=0\] The solution is then \(\hat{y} = \sum_{i=1}^N(\alpha_i-\alpha_i^*)k(\bfx_i,\bfx)+\hat{\beta}_0\).
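  • A sketch checking the kernelized predictor numerically with scikit-learn's `SVR`: the fitted `dual_coef_` stores \(\alpha_i-\alpha_i^*\) for the support vectors, so \(\hat{y}=\sum_i(\alpha_i-\alpha_i^*)k(\bfx_i,\bfx)+\hat{\beta}_0\) can be rebuilt by hand (the RBF kernel and toy data are illustrative choices):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(80)

gamma = 0.5
model = SVR(kernel="rbf", gamma=gamma, C=10.0, epsilon=0.1).fit(X, y)

# Rebuild y_hat = sum_i (alpha_i - alpha_i^*) k(x_i, x) + beta_0 from the duals
X_test = rng.uniform(-3, 3, size=(5, 1))
K_test = rbf_kernel(model.support_vectors_, X_test, gamma=gamma)  # (n_sv, 5)
y_manual = (model.dual_coef_ @ K_test).ravel() + model.intercept_

print(np.allclose(y_manual, model.predict(X_test)))  # True
```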

Kernelized LASSO

  • LASSO regression \[\hat{\bftheta} = \argmin_{\bftheta} \norm[2]{\bfy-\bfX\bftheta}^2+\lambda\norm[1]{\bftheta}\]

  • Not easy to kernelize directly

  • Look instead for a specific \(\bftheta\) of the form \(\bftheta\eqdef \sum_{i}\alpha_i\bfx_i=\bfX^\intercal\bfalpha\) and solve \[\hat{\bfalpha} = \argmin_{\bfalpha} \norm[2]{\bfy-\bfX\bfX^\intercal\bfalpha}^2+\lambda\norm[1]{\bfalpha}\]
    • This can be kernelized since \(\bfX\bfX^\intercal\) only involves inner products \(\bfx_i^\intercal\bfx_j\) (see the sketch below), but it promotes sparsity in \(\bfalpha\), not \(\bftheta\)
    • Not really LASSO anymore
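
  • A minimal sketch of this kernelized variant, assuming \(\bftheta=\bfX^\intercal\bfalpha\) so that the fit only needs \(\bfK=\bfX\bfX^\intercal\); scikit-learn's `Lasso` scales the two terms slightly differently, so the penalty value here is illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
N, d = 50, 5
X = rng.standard_normal((N, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(N)

# With theta = X^T alpha, ||y - X theta||^2 = ||y - K alpha||^2 where K = X X^T
K = X @ X.T
# sklearn's Lasso minimizes (1/2N)||y - K a||^2 + alpha ||a||_1:
# the l1 penalty is on the dual weights alpha, not on theta
model = Lasso(alpha=0.1, fit_intercept=False).fit(K, y)
alpha_hat = model.coef_
theta_hat = X.T @ alpha_hat  # recovered theta is generally not sparse

print("nonzero entries of alpha:", int(np.sum(alpha_hat != 0)), "out of", N)
```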

Kernelized ridge regression

  • Centered ridge regression: \[\hat{\bfbeta}=\argmin_{\bfbeta}\sum_{i=1}^N\left(y_i-\bar{y}-\bfbeta^\intercal(\bfx_i-\bar{\bfx})\right)^2+\frac{\lambda}{2}\norm[2]{\bfbeta}^2\] with \(\bar{\bfx}=\frac{1}{N}\sum_i\bfx_i\) and \(\bar{y}=\frac{1}{N}\sum_{i}y_i\). The solution is then \(\hat{y}=\bar{y}+\hat{\bfbeta}^\intercal(\bfx-\bar{\bfx})\).

  • \[\hat{\bfbeta}=(\bfA^\intercal\bfA+\lambda\bfI)^{-1}\bfA^\intercal\tilde{\bfy}\] \[\bfA\eqdef\left[\begin{array}{c}(\bfx_1-\bar{\bfx})^\intercal\\(\bfx_2-\bar{\bfx})^\intercal\\\vdots\\(\bfx_N-\bar{\bfx})^\intercal\end{array}\right]\qquad\tilde{\bfy}\eqdef\left[\begin{array}{c}y_1-\bar{y}\\y_2-\bar{y}\\\vdots\\y_N-\bar{y}\end{array}\right]\]

  • Not straightforward to kernelize because \(\bfA^\intercal\bfA\) is a sum of outer products \((\bfx_i-\bar{\bfx})(\bfx_i-\bar{\bfx})^\intercal\), not inner products
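
  • A direct NumPy sketch of this centered ridge solution (toy data and \(\lambda\) are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, lam = 60, 3, 0.5
X = rng.standard_normal((N, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.2 * rng.standard_normal(N)

# Center the data, then solve beta_hat = (A^T A + lambda I)^{-1} A^T y_tilde
x_bar, y_bar = X.mean(axis=0), y.mean()
A = X - x_bar
y_tilde = y - y_bar
beta_hat = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y_tilde)

# Prediction for a new point: y_hat = y_bar + beta_hat^T (x - x_bar)
x_new = rng.standard_normal(d)
y_hat = y_bar + beta_hat @ (x_new - x_bar)
print(beta_hat, y_hat)
```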

Kernelized ridge regression

  • Matrix inversion lemma (Woodbury identity): \[(\bfP+\bfQ\bfR\bfS)^{-1} = \bfP^{-1}-\bfP^{-1}\bfQ(\bfR^{-1}+\bfS\bfP^{-1}\bfQ)^{-1}\bfS\bfP^{-1}\]

  • Applying the identity with \(\bfP=\lambda\bfI\), \(\bfQ=\bfA^\intercal\), \(\bfR=\bfI\), and \(\bfS=\bfA\) gives \[\hat{\bfbeta} = \frac{1}{\lambda}(\bfA^\intercal-\bfA^\intercal(\lambda\bfI+\bfK)^{-1}\bfK)\tilde{\bfy}\textsf{ with }\bfK_{i,j}\eqdef(\bfx_i-\bar{\bfx})^\intercal(\bfx_j-\bar{\bfx})\] \[\hat{y} = \bar{y}+\frac{1}{\lambda}\tilde{\bfy}^\intercal(\bfI-\bfK(\lambda\bfI+\bfK)^{-1})\bfk(\bfx)\textsf{ with }\bfk(\bfx) =\left[\begin{array}{c}(\bfx_1-\bar{\bfx})^\intercal(\bfx-\bar{\bfx})\\\vdots\\(\bfx_N-\bar{\bfx})^\intercal(\bfx-\bar{\bfx})\end{array}\right]\]
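
  • A quick numerical check that the Woodbury rewrite matches the primal solution (random toy data, arbitrary \(\lambda\)):

```python
import numpy as np

rng = np.random.default_rng(4)
N, d, lam = 40, 3, 0.7
X = rng.standard_normal((N, d))
y = X @ np.array([0.5, 1.5, -1.0]) + 0.1 * rng.standard_normal(N)

A = X - X.mean(axis=0)
y_tilde = y - y.mean()
K = A @ A.T  # K_ij = (x_i - x_bar)^T (x_j - x_bar): inner products only

# Primal form: (A^T A + lambda I)^{-1} A^T y_tilde
beta_primal = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y_tilde)

# Kernel form: (1/lambda) (A^T - A^T (lambda I + K)^{-1} K) y_tilde
beta_kernel = (A.T @ y_tilde
               - A.T @ np.linalg.solve(lam * np.eye(N) + K, K @ y_tilde)) / lam

print(np.allclose(beta_primal, beta_kernel))  # True
```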

Kernelized ridge regression

  • For many kernels, the feature map \(\Phi(\bfx)\) already contains a constant component, so we often omit \(\beta_0\)

  • \[\hat{y} = {\bfy}^\intercal(\lambda\bfI+\bfK)^{-1}\bfk(\bfx)\] \[\bfk(\bfx) =\left[\begin{array}{c}k(\bfx_1,\bfx)\\\vdots\\k(\bfx_N,\bfx)\end{array}\right]\qquad \bfK_{i,j}=k(\bfx_i,\bfx_j)\]

  • Illustration: Gaussian kernels
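
  • A minimal sketch of this predictor with a Gaussian kernel, in the spirit of the illustration (the bandwidth and \(\lambda\) are arbitrary choices):

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=0.5):
    """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(50)

lam = 0.1
K = gaussian_kernel(X, X)

# y_hat(x) = y^T (lambda I + K)^{-1} k(x), with k(x)_i = k(x_i, x)
weights = np.linalg.solve(lam * np.eye(len(X)) + K, y)  # (lambda I + K)^{-1} y
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
y_hat = gaussian_kernel(X_test, X) @ weights

print(y_hat)
```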

Regularization and classification

  • Question: does the viewpoint of loss function + regularization apply to classification?

  • Logistic regression: the maximum likelihood problem \(\min_{\bftheta}-\ell(\bftheta)\), where \(\ell\) is the log-likelihood, can be regularized as \(\min_{\bftheta}-\ell(\bftheta) + \lambda\norm[2]{\bftheta}^2\)
    • makes the Hessian matrix well conditioned
    • super useful when the number of observations is small
    • also helpful when data is not separable
  • New way to think about classification: solve the problem \[ \argmin_{\bfw,b}\frac{1}{N}\sum_{i=1}^N \indic{y_i(\bfw^\intercal\bfx_i+b)<0} \]
    • very hard analytically and computationally
  • Modified problem with an upper bound \(\phi(t)\) on \(\indic{t<0}\) (a surrogate loss) that is easier to optimize. \[ \argmin_{\bfw,b}\frac{1}{N}\sum_{i=1}^N \phi\left(y_i(\bfw^\intercal\bfx_i+b)\right) \]
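  • A small numerical illustration that a surrogate \(\phi\) upper-bounds the 0-1 loss; the hinge \(\phi(t)=(1-t)_+\) from the next slide is used, and the toy classifier \((\bfw,b)\) is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(6)
N = 200
X = rng.standard_normal((N, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(N))

# Some candidate linear classifier (w, b)
w, b = np.array([1.0, 0.3]), 0.05
margins = y * (X @ w + b)  # t_i = y_i (w^T x_i + b)

zero_one = np.mean(margins < 0)                   # empirical 0-1 loss
hinge = np.mean(np.maximum(0.0, 1.0 - margins))   # surrogate phi(t) = (1 - t)_+

print(zero_one, hinge, hinge >= zero_one)  # the surrogate is an upper bound
```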

Hinge loss

  • Take \(\phi(t)\eqdef \max(0,1-t)\eqdef (1-t)_+\)

  • Add regularization to prevent overfitting \[ \argmin_{\bfw,b}\frac{1}{N}\sum_{i=1}^N \left(1-y_i(\bfw^\intercal\bfx_i+b)\right)_+ + \frac{\lambda}{2}\norm[2]{\bfw}^2 \]

  • Question: what classifier is this?
    • Answer: Soft-margin hyperplanes!
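
  • A hedged sketch of the connection: minimizing the regularized hinge objective by plain subgradient descent yields a soft-margin hyperplane, with the margin violations playing the role of the slack variables (toy data, step size, and iteration count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
N = 200
X = rng.standard_normal((N, 2))
y = np.sign(X[:, 0] - X[:, 1] + 0.3 * rng.standard_normal(N))

lam, lr, iters = 0.1, 0.05, 2000
w, b = np.zeros(2), 0.0
for _ in range(iters):
    margins = y * (X @ w + b)
    active = margins < 1  # points with nonzero hinge loss (margin violators)
    # Subgradient of (1/N) sum_i (1 - y_i(w^T x_i + b))_+ + (lambda/2)||w||^2
    grad_w = -(y[active, None] * X[active]).sum(axis=0) / N + lam * w
    grad_b = -y[active].sum() / N
    w -= lr * grad_w
    b -= lr * grad_b

margins = y * (X @ w + b)
print("training error:", np.mean(margins < 0))
print("fraction with positive slack:", np.mean(margins < 1))
```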