Matthieu Bloch
A general approach to regression is to solve \[ \hat{\bftheta} = \argmin_{\bftheta} L(\bftheta,\bfX,\bfy)+\lambda r(\bftheta) \] where \(L:\bbR^{d+1}\times\bbR^{N\times (d+1)}\times\bbR^N\to\bbR\) is a loss function and \(r:\bbR^{d+1}\to\bbR\) is a regularizer
Loss functions ensure data fidelity, while regularizers limit complexity
Examples of loss functions \[\begin{aligned} L_{\mathrm{AE}}(r)\eqdef\abs{r}\qquad\textsf{(mean absolute error)}\\ L_{\mathrm{H}}(r)\eqdef\begin{cases}\frac{1}{2}r^2\textsf{ if }\abs{r}\leq c\\c\abs{r}-\frac{c^2}{2}\textsf{ else}\end{cases}\qquad\textsf{(Huber loss)}\\ L_{\epsilon}(r)\eqdef\begin{cases}0\textsf{ if }\abs{r}\leq \epsilon\\\abs{r}-\epsilon\textsf{ else}\end{cases}\qquad\textsf{($\epsilon$-insensitive loss)}\\ \end{aligned} \]
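A minimal NumPy sketch of these three losses (the constants \(c\) and \(\epsilon\) below are illustrative defaults, not values from the slides):

```python
import numpy as np

def absolute_loss(r):
    """Mean absolute error loss: L_AE(r) = |r|."""
    return np.abs(r)

def huber_loss(r, c=1.0):
    """Huber loss: quadratic for |r| <= c, linear with slope c beyond."""
    return np.where(np.abs(r) <= c, 0.5 * r ** 2, c * np.abs(r) - 0.5 * c ** 2)

def eps_insensitive_loss(r, eps=0.1):
    """epsilon-insensitive loss: zero inside the epsilon-tube, linear outside."""
    return np.maximum(np.abs(r) - eps, 0.0)
```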
Regression with the \(\epsilon\)-insensitive loss (support vector regression) leads to the dual problem \[\argmin_{\bfalpha,\bfalpha^*}\sum_{i=1}^{N}\left((\epsilon-y_i)\alpha_i+(\epsilon+y_i)\alpha_i^*\right)+\frac{1}{2}\sum_{i,j}(\alpha_i-\alpha_i^*)(\alpha_j-\alpha_j^*)\bfx_j^\intercal\bfx_i\] \[\textsf{such that }0\leq \alpha_i^*,\alpha_i\leq\frac{1}{\lambda},\quad \sum_{i=1}^N(\alpha_i^*-\alpha_i)=0,\quad \alpha_i^*\alpha_i=0\] The solution is then \(\hat{y} = \sum_{i=1}^N(\alpha_i-\alpha_i^*)\bfx_i^\intercal\bfx+\hat{\beta}_0\)
Kernelized version, replacing inner products with a kernel \(k\): \[\argmin_{\bfalpha,\bfalpha^*}\sum_{i=1}^{N}\left((\epsilon-y_i)\alpha_i+(\epsilon+y_i)\alpha_i^*\right)+\frac{1}{2}\sum_{i,j}(\alpha_i-\alpha_i^*)(\alpha_j-\alpha_j^*)k(\bfx_j,\bfx_i)\] \[\textsf{such that }0\leq \alpha_i^*,\alpha_i\leq\frac{1}{\lambda},\quad \sum_{i=1}^N(\alpha_i^*-\alpha_i)=0\] The solution is then \(\hat{y} = \sum_{i=1}^N(\alpha_i-\alpha_i^*)k(\bfx_i,\bfx)+\hat{\beta}_0\).
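As a sanity check (not part of the slides), scikit-learn's `SVR` solves this dual: its `dual_coef_` attribute stores \(\alpha_i-\alpha_i^*\) for the support vectors and its parameter `C` plays the role of the box constraint \(1/\lambda\) above. A minimal sketch with a Gaussian kernel, reconstructing \(\hat{y}\) from the dual variables:

```python
import numpy as np
from sklearn.svm import SVR

# Toy 1-D regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)

svr = SVR(kernel="rbf", gamma=0.5, C=10.0, epsilon=0.1).fit(X, y)

def rbf(A, B, gamma=0.5):
    """Pairwise Gaussian kernel k(x, x') = exp(-gamma ||x - x'||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

# yhat(x) = sum_i (alpha_i - alpha_i^*) k(x_i, x) + beta_0, over the support vectors only
Xtest = np.linspace(-3, 3, 5).reshape(-1, 1)
manual = svr.dual_coef_[0] @ rbf(svr.support_vectors_, Xtest) + svr.intercept_[0]
print(np.allclose(manual, svr.predict(Xtest)))  # True
```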
Lasso regression \[\hat{\bftheta} = \argmin_{\bftheta} \norm[2]{\bfy-\bfX\bftheta}^2+\lambda\norm[1]{\bftheta}\]
Not easy to kernelize directly
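The \(\ell_1\) term is non-smooth, so there is no closed-form solution; a minimal proximal-gradient (ISTA) sketch for the primal problem above (iteration budget and helper names are mine):

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize ||y - X theta||_2^2 + lam * ||theta||_1 by proximal gradient descent."""
    eta = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant of the smooth part
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ theta - y)         # gradient of the squared-error term
        theta = soft_threshold(theta - eta * grad, eta * lam)
    return theta
```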
Ridge regression (with centered data): \[\hat{\bfbeta}=\argmin_{\bfbeta}\sum_{i=1}^N\left(y_i-\bar{y}-\bfbeta^\intercal(\bfx_i-\bar{\bfx})\right)^2+\frac{\lambda}{2}\norm[2]{\bfbeta}^2\] with \(\bar{\bfx}=\frac{1}{N}\sum_i\bfx_i\) and \(\bar{y}=\frac{1}{N}\sum_{i}y_i\). The solution is then \(\hat{y}=\bar{y}+\hat{\bfbeta}^\intercal(\bfx-\bar{\bfx})\).
\[\hat{\bfbeta}=(\bfA^\intercal\bfA+\lambda\bfI)^{-1}\bfA^\intercal\tilde{\bfy}\] \[\bfA\eqdef\left[\begin{array}{c}(\bfx_1-\bar{\bfx})^\intercal\\(\bfx_2-\bar{\bfx})^\intercal\\\vdots\\(\bfx_N-\bar{\bfx})^\intercal\end{array}\right]\qquad\tilde{\bfy}\eqdef\left[\begin{array}{c}y_1-\bar{y}\\y_2-\bar{y}\\\vdots\\y_N-\bar{y}\end{array}\right]\]
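A minimal NumPy sketch of this closed form (function and variable names are mine):

```python
import numpy as np

def ridge_centered(X, y, lam):
    """beta = (A^T A + lam I)^{-1} A^T ytilde, with prediction yhat = ybar + beta^T (x - xbar)."""
    xbar, ybar = X.mean(axis=0), y.mean()
    A = X - xbar                                   # rows are (x_i - xbar)^T
    ytilde = y - ybar
    beta = np.linalg.solve(A.T @ A + lam * np.eye(X.shape[1]), A.T @ ytilde)

    def predict(Xnew):
        return ybar + (Xnew - xbar) @ beta

    return beta, predict
```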
Matrix inversion lemma: \[(\bfP+\bfQ\bfR\bfS)^{-1} = \bfP^{-1}-\bfP^{-1}\bfQ(\bfR^{-1}+\bfS\bfP^{-1}\bfQ)^{-1}\bfS\bfP^{-1}\]
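A quick numerical check of the identity on random matrices (dimensions chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.diag(rng.uniform(1.0, 2.0, 4))   # invertible 4x4
R = np.diag(rng.uniform(1.0, 2.0, 3))   # invertible 3x3
Q = rng.standard_normal((4, 3))
S = rng.standard_normal((3, 4))

inv = np.linalg.inv
lhs = inv(P + Q @ R @ S)
rhs = inv(P) - inv(P) @ Q @ inv(inv(R) + S @ inv(P) @ Q) @ S @ inv(P)
print(np.allclose(lhs, rhs))            # True
```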
Applying the matrix inversion lemma, \[\hat{\bfbeta} = \frac{1}{\lambda}(\bfA^\intercal-\bfA^\intercal(\lambda\bfI+\bfK)^{-1}\bfK)\tilde{\bfy}\textsf{ with }\bfK_{i,j}\eqdef(\bfx_i-\bar{\bfx})^\intercal(\bfx_j-\bar{\bfx})\] \[\hat{y} = \bar{y}+\frac{1}{\lambda}\tilde{\bfy}^\intercal(\bfI-\bfK(\lambda\bfI+\bfK)^{-1})\bfk(\bfx)\textsf{ with }\bfk(\bfx) =\left[\begin{array}{c}(\bfx_1-\bar{\bfx})^\intercal(\bfx-\bar{\bfx})\\\vdots\\(\bfx_N-\bar{\bfx})^\intercal(\bfx-\bar{\bfx})\end{array}\right]\]
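As a sanity check (variable names mine), the kernelized expression for \(\hat{\bfbeta}\), which only uses the Gram matrix \(\bfK=\bfA\bfA^\intercal\), matches the direct closed form:

```python
import numpy as np

rng = np.random.default_rng(1)
X, y, lam = rng.standard_normal((20, 5)), rng.standard_normal(20), 0.5
A = X - X.mean(axis=0)
ytilde = y - y.mean()

# Direct closed form: (A^T A + lam I)^{-1} A^T ytilde
beta_direct = np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ ytilde)

# Kernelized form, using only K = A A^T
K = A @ A.T
beta_kernel = (A.T @ ytilde - A.T @ np.linalg.solve(lam * np.eye(20) + K, K @ ytilde)) / lam
print(np.allclose(beta_direct, beta_kernel))  # True
```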
For many kernels, \(\Phi(\bfx)\) already contains a constant component, so we often omit \(\beta_0\)
Kernel ridge regression then takes the form \[\hat{y} = {\bfy}^\intercal(\lambda\bfI+\bfK)^{-1}\bfk(\bfx)\] \[\bfk(\bfx) =\left[\begin{array}{c}k(\bfx_1,\bfx)\\\vdots\\k(\bfx_N,\bfx)\end{array}\right]\qquad \bfK_{i,j}=k(\bfx_i,\bfx_j)\]
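A minimal kernel ridge regression sketch with a Gaussian kernel, matching the formula above (the bandwidth and function names are illustrative):

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """k(x, x') = exp(-gamma ||x - x'||^2), pairwise between rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def kernel_ridge_predict(X, y, Xnew, lam, gamma=1.0):
    """yhat(x) = y^T (lam I + K)^{-1} k(x) with K_ij = k(x_i, x_j)."""
    K = gaussian_kernel(X, X, gamma)
    alpha = np.linalg.solve(lam * np.eye(len(X)) + K, y)  # (lam I + K)^{-1} y
    return gaussian_kernel(Xnew, X, gamma) @ alpha        # rows of the first factor are k(x)^T
```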
Illustration: Gaussian kernels
Question: does the viewpoint of loss function + regularization apply to classification?
Modified problem with a surrogate upper bound \(\phi(t)\) on \(\indic{t<0}\): \[ \argmin_{\bfw,b}\frac{1}{N}\sum_{i=1}^N \phi\left(y_i(\bfw^\intercal\bfx_i+b)\right) \]
Take \(\phi(t)\eqdef \max(0,1-t)= (1-t)_+\) (hinge loss)
Add regularization to prevent overfitting \[ \argmin_{\bfw,b}\frac{1}{N}\sum_{i=1}^N \left(1-y_i(\bfw^\intercal\bfx_i+b)\right)_+ + \frac{\lambda}{2}\norm[2]{\bfw}^2 \]
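A minimal subgradient-descent sketch for this objective (labels assumed in \(\{-1,+1\}\); step size and iteration count are illustrative, not tuned):

```python
import numpy as np

def hinge_svm_subgradient(X, y, lam, eta=0.01, n_iter=2000):
    """Minimize (1/N) sum_i (1 - y_i (w^T x_i + b))_+ + (lam/2) ||w||^2 by subgradient descent."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        active = y * (X @ w + b) < 1.0                        # points with a nonzero hinge subgradient
        grad_w = lam * w - (y[active, None] * X[active]).sum(axis=0) / N
        grad_b = -y[active].sum() / N
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b
```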