Logistic regression, gradient descent, Perceptron Learning Algorithm

Matthieu R Bloch

Logistic regression

  • Assume that \(\eta(\bfx)\) is of the form \(\frac{1}{1+\exp(-(\bfw^\intercal\bfx+b))}\)

  • Estimate \(\hat{\bfw}\) and \(\hat{b}\) from the data directly

  • Plug in the result to obtain \(\hat{\eta}(\bfx)=\frac{1}{1+\exp(-(\hat{\bfw}^\intercal\bfx+\hat{b}))}\)

  • The function \(x\mapsto \frac{1}{1+e^{-x}}\) is called the logistic function

  • The binary logistic regression classifier is \(h^{\text{LC}}(\bfx)=\indic{\hat{\eta}(\bfx)\geq \frac{1}{2}} = \indic{\hat{\bfw}^\intercal\bfx+\hat{b}\geq 0}\) (a linear classifier; see the sketch at the end of this section)

  • How do we estimate \(\hat{\bfw}\) and \(\hat{b}\)?
    • From LDA analysis: \(\hat{\bfw}=\hat{\Sigma}^{-1}(\hat{\boldsymbol{\mu}}_1-\hat{\boldsymbol{\mu}}_0)\), \(\hat{b}=\frac{1}{2}\hat{\boldsymbol{\mu}}_0^\intercal\hat{\Sigma}^{-1}\hat{\boldsymbol{\mu}}_0-\frac{1}{2}\hat{\boldsymbol{\mu}}_1^\intercal\hat{\Sigma}^{-1}\hat{\boldsymbol{\mu}}_1+\log\frac{\hat{\pi}_1}{\hat{\pi}_0}\)
    • Direct estimation of \((\hat{\bfw},\hat{b})\) by maximum likelihood
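
  As a concrete illustration, here is a minimal sketch (in Python/NumPy) of the plug-in classifier above; the names `w_hat` and `b_hat` are placeholders for whatever estimates the estimation step returns. The predicted label depends on \(\hat{\bfw}^\intercal\bfx+\hat{b}\) only through its sign.

```python
import numpy as np

def sigmoid(z):
    """Logistic function z -> 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def eta_hat(x, w_hat, b_hat):
    """Plug-in estimate of the posterior probability of class 1."""
    return sigmoid(w_hat @ x + b_hat)

def classify(x, w_hat, b_hat):
    """Binary logistic regression classifier: predict 1 iff eta_hat(x) >= 1/2,
    which is equivalent to w_hat^T x + b_hat >= 0 (a linear rule)."""
    return int(w_hat @ x + b_hat >= 0)
```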

MLE for logistic regression

  • We have a parametric model for the posterior: \(p_{\theta}(1|\bfx)=\eta(\bfx)\), \(p_{\theta}(0|\bfx)=1-\eta(\bfx)\)
    • Standard trick: set \(\tilde{\bfx} = [1,\,\bfx^\intercal]^\intercal\) and \(\theta = [b,\,\bfw^\intercal]^\intercal\)
    • This allows us to lump the offset and write \[\eta(\bfx) = \frac{1}{1+\exp(-\theta^\intercal \tilde{\bfx})}\]
  • Given our dataset \(\{(\tilde{\bfx}_i,y_i)\}_{i=1}^N\) the likelihood is \(\mathcal{L}(\theta)\eqdef \prod_{i=1}^N \P[\theta]{y_i|\tilde{\bfx}_i}\)

  • For \(K=2\) with \(\calY=\{0,1\}\) we obtain \[\mathcal{L}(\theta)\eqdef \prod_{i=1}^N \eta(\tilde{\bfx}_i)^{y_i}(1-\eta(\tilde{\bfx}_i))^{1-y_i}\]

  • The log-likelihood is \[\ell(\theta)= \sum_{i=1}^N \left(y_i\log\eta(\tilde{\bfx}_i)+(1-y_i)\log(1-\eta(\tilde{\bfx}_i))\right)\]

  • Substituting \(\eta(\tilde{\bfx}_i)=\frac{1}{1+e^{-\theta^\intercal\tilde{\bfx}_i}}\), this simplifies to \[\ell(\theta)= \sum_{i=1}^N \left(y_i\theta^{\intercal}\tilde{\bfx}_i -\log(1+e^{\theta^\intercal\tilde{\bfx}_i})\right)\]
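
  The last expression translates directly into code. Below is a minimal sketch of the negative log-likelihood \(-\ell(\theta)\) that the numerical methods of the next section minimize, assuming a design matrix `X_tilde` of shape \((N, d+1)\) with a leading column of ones and labels `y` in \(\{0,1\}\).

```python
import numpy as np

def neg_log_likelihood(theta, X_tilde, y):
    """-l(theta) for logistic regression, using
    l(theta) = sum_i [ y_i * theta^T x_i - log(1 + exp(theta^T x_i)) ].
    X_tilde: (N, d+1) array with a leading column of ones; y: labels in {0, 1}."""
    z = X_tilde @ theta                       # theta^T x_i for every sample i
    # np.logaddexp(0, z) computes log(1 + exp(z)) in a numerically stable way
    return -np.sum(y * z - np.logaddexp(0.0, z))
```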

Finding the MLE

  • A necessary condition for optimality is \(\nabla_{\theta} \ell(\theta)=\mathbf{0}\)

  • Here this means \(\sum_{i=1}^N\tilde{\bfx}_i\left(y_i-\frac{1}{1+\exp(-\theta^\intercal\tilde{\bfx}_i)}\right)=\mathbf{0}\)
    • System of \(d+1\) nonlinear equations!
    • No closed-form solution in general
  • Need to use a numerical algorithm to solve \(\argmin_{\theta}-\ell(\theta)\) (a sketch of the gradient computation follows this list)
    • Gradient descent
    • Stochastic gradient descent
    • Newton’s method
    • There are many more, especially useful in high dimension
    • Provable convergence when \(-\ell\) is convex
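
  For reference, here is a minimal sketch of the gradient appearing in the stationarity condition above (same assumed `X_tilde` and `y` as before); this is the quantity the algorithms listed above need at each iteration.

```python
import numpy as np

def neg_log_likelihood_grad(theta, X_tilde, y):
    """Gradient of -l(theta): -sum_i x_i * (y_i - sigma(theta^T x_i)).
    Setting it to zero gives the d+1 nonlinear stationarity equations."""
    p = 1.0 / (1.0 + np.exp(-(X_tilde @ theta)))   # sigma(theta^T x_i)
    return -X_tilde.T @ (y - p)
```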

Gradient descent

  • Consider the canonical problem \[\min_{\bfx\in\bbR^d}f(\bfx)\text{ with }f:\bbR^d\to\bbR\]

  • Find the minimum iteratively by “rolling downhill”
    • Start from point \(\bfx^{(0)}\)
    • \(\bfx^{(1)} = \bfx^{(0)} - \eta\nabla f(\bfx)|_{\bfx=\bfx^{(0)}}\); \(\eta\) is the step size
    • \(\bfx^{(2)} = \bfx^{(1)} - \eta\nabla f(\bfx)|_{\bfx=\bfx^{(1)}}\)
    • \(\cdots\)

    • Choice of step size \(\eta\) really matters: too small and convergence takes forever, too big and the iterates might never converge (see the sketch at the end of this section)

  • Many variants of gradient descent
    • Momentum: \(v_{t} = \gamma v_{t-1}+\eta\nabla f(\bfx)|_{\bfx=\bfx^{(t)}}\) and \(\bfx^{(t+1)} = \bfx^{(t)} - v_{t}\)
    • Accelerated: \(v_{t} = \gamma v_{t-1}+\eta\nabla f(\bfx)|_{\bfx=\bfx^{(t)}-\gamma v_{t-1}}\) and \(\bfx^{(t+1)} = \bfx^{(t)} - v_{t}\)
  • In practice, the gradient has to be evaluated from data
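
  A minimal sketch of plain gradient descent and of the momentum variant above; the callback `grad_f`, the initial point `x0`, and the hyperparameter values are assumptions, not prescriptions.

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, n_iters=100):
    """Plain gradient descent with a fixed step size eta."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - eta * grad_f(x)                  # roll downhill along -grad f
    return x

def gradient_descent_momentum(grad_f, x0, eta=0.1, gamma=0.9, n_iters=100):
    """Momentum variant: v_t = gamma * v_{t-1} + eta * grad f(x_t), x_{t+1} = x_t - v_t."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(n_iters):
        v = gamma * v + eta * grad_f(x)
        x = x - v
    return x
```

  Combined with the gradient sketch of the previous section, e.g. `gradient_descent(lambda th: neg_log_likelihood_grad(th, X_tilde, y), np.zeros(X_tilde.shape[1]))` would, for a suitable step size, minimize \(-\ell\).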

Newton’s method

  • Newton-Raphson method uses the second derivative to automatically adapt the step size

  • \[ \bfx^{(j+1)} = \bfx^{(j)} - [\nabla^2 f(\bfx)|_{\bfx=\bfx^{(j)}}]^{-1}\nabla f(\bfx)|_{\bfx=\bfx^{(j)}}\]

  • Hessian matrix \[\nabla^2 f(\bfx)=\left[\begin{array}{cccc} \frac{\partial^2 f}{\partial x_1^2}& \frac{\partial^2 f}{\partial x_1\partial x_2}&\cdots & \frac{\partial^2 f}{\partial x_1\partial x_d}\\\frac{\partial^2 f}{\partial x_2\partial x_1}& \frac{\partial^2 f}{\partial x_2^2}&\cdots & \frac{\partial^2 f}{\partial x_2\partial x_d}\\ \vdots&\vdots& \ddots &\vdots\\ \frac{\partial^2 f}{\partial x_d\partial x_1}& \frac{\partial^2 f}{\partial x_d\partial x_2}&\cdots & \frac{\partial^2 f}{\partial x_d^2}\end{array}\right]\]

  • Newton’s method converges in far fewer iterations, which makes it much faster when the dimension \(d\) is small, but forming and inverting the \(d\times d\) Hessian makes it impractical when \(d\) is large
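
  A minimal sketch of the Newton-Raphson update, together with the Hessian of \(-\ell\) for logistic regression, which works out to \(\sum_i \tilde{\bfx}_i\tilde{\bfx}_i^\intercal\,\eta(\tilde{\bfx}_i)(1-\eta(\tilde{\bfx}_i))\); `grad_f`, `hess_f`, and `X_tilde` are assumed as before.

```python
import numpy as np

def newton_method(grad_f, hess_f, x0, n_iters=20):
    """Newton-Raphson: x_{j+1} = x_j - [Hessian f(x_j)]^{-1} grad f(x_j)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        # Solve the d x d linear system rather than forming the inverse explicitly
        x = x - np.linalg.solve(hess_f(x), grad_f(x))
    return x

def neg_log_likelihood_hess(theta, X_tilde):
    """Hessian of -l(theta) for logistic regression: X^T diag(p * (1 - p)) X."""
    p = 1.0 / (1.0 + np.exp(-(X_tilde @ theta)))
    return X_tilde.T @ (X_tilde * (p * (1.0 - p))[:, None])
```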

Stochastic gradient descent

  • Often have a loss function of the form \(\ell(\theta) = \sum_{i=1}^N\ell_i(\theta)\) where \(\ell_i(\theta)=f(\bfx_i,y_i,\theta)\)

  • The gradient is \(\nabla_\theta\ell(\theta)=\sum_{i=1}^N\nabla\ell_i(\theta)\) and gradient descent update is \[\theta^{(j+1)} = \theta^{(j)} - \eta \sum_{i=1}^N\nabla \ell_i(\theta)\]
    • Problematic if the dataset is huge or if not all data is available
    • Use an iterative technique instead, updating with one term at a time (see the sketch below) \[\theta^{(j+1)} = \theta^{(j)} - \eta \nabla \ell_i(\theta)\]
  • Tons of variations on this principle
    • Batch, minibatch, Adagrad, RMSprop, Adam, etc.
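
  A minimal sketch of the basic SGD loop; the per-sample gradient callback `grad_i(theta, i)` (i.e. \(\nabla\ell_i(\theta)\)) and the hyperparameter values are assumptions.

```python
import numpy as np

def sgd(grad_i, theta0, N, eta=0.01, n_epochs=10, rng=None):
    """Stochastic gradient descent: update with the gradient of a single,
    randomly chosen term l_i instead of the full sum over i = 1, ..., N."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_epochs):
        for i in rng.permutation(N):          # one pass over the shuffled indices
            theta = theta - eta * grad_i(theta, i)
    return theta
```

  A minibatch variant would average `grad_i` over a small set of indices per update; Adagrad, RMSprop, and Adam further adapt the step size per coordinate.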

Linearly separable datasets

  • Dataset \(\{\bfx_i,y_i\}_{i=1}^N\) is linearly separable if there exists \(\bfw\in\bbR^d\) and \(b\in\bbR\) such that \[\forall i\in\intseq{1}{N}\quad y_i=\sgn{\bfw^\intercal\bfx_i+b} \qquad y_i\in \{\pm 1\}\] By definition \(\sgn{x}=+1\) if \(x> 0\) and \(-1\) otherwise. The affine set \(\{\bfx:\bfw^\intercal\bfx+b=0\}\) is called a separating hyperplane.

  • A bit of geometry…
  • Consider the hyperplane \(\calH\eqdef \{\bfx:\bfw^\intercal\bfx+b=0\}\). The vector \(\bfw\) is orthogonal to all vectors parallel to the hyperplane. For \(\bfz\in\bbR^d\), the distance of \(\bfz\) to the hyperplane is \[d(\bfz,\calH)=\frac{\abs{\bfw^\intercal\bfz+b}}{\norm[2]{\bfw}}\] (a numerical sanity check follows this list)

  • How can we find a separating hyperplane? (assuming it exists…)
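
  A quick numerical sanity check of the distance formula above, on a hypothetical hyperplane in \(\bbR^2\):

```python
import numpy as np

def distance_to_hyperplane(z, w, b):
    """Distance from z to the hyperplane {x : w^T x + b = 0}: |w^T z + b| / ||w||_2."""
    return abs(w @ z + b) / np.linalg.norm(w)

# Hyperplane x_1 + x_2 = 1 in R^2, i.e. w = (1, 1), b = -1.
w, b = np.array([1.0, 1.0]), -1.0
# The closest point of the hyperplane to (1, 1) is (1/2, 1/2), at distance 1/sqrt(2).
print(distance_to_hyperplane(np.array([1.0, 1.0]), w, b))   # ~0.7071
```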

Perceptron learning algorithm

  • Consider \(\bfx = [1,\,x_1,\cdots,x_d]^\intercal\in\bbR^{d+1}\) and hyperplanes of the form \(\theta^\intercal\bfx = 0\).
  • Data \(\calD\eqdef\{(\bfx_i,y_i)\}_{i=1}^N\) such that \(y_i\in\{\pm 1\}\)

  • Perceptron Learning Algorithm (PLA) invented by Rosenblatt in 1958 to find separating hyperplanes

    • Start from guess for \(\theta^{(0)}\) and go over data points in sequence to update \[\theta^{(j+1)}=\begin{cases} \theta^{(j)}+y_i\bfx_i\text{ if }y_i\neq \sgn{{\theta^{(j)}}^\intercal\bfx_i}\\ \theta^{(j)}\text{ else} \end{cases}\]
    • Geometric intuition behind operation
    • Stochastic gradient descent view
  • PLA finds a non-parametric linear classifier
    • Can be viewed as a single-layer neural network
  • PLA is guaranteed to find a separating hyperplane if the data is linearly separable (see the sketch below)
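
  A minimal sketch of the PLA update above, with the common convention that \(\theta^\intercal\bfx_i=0\) is treated as a mistake; `max_passes` is a safeguard for non-separable data, not part of the algorithm.

```python
import numpy as np

def perceptron(X, y, max_passes=1000):
    """Perceptron Learning Algorithm.
    X: (N, d) data matrix, y: labels in {-1, +1}.
    Returns theta in R^{d+1} (bias coordinate first) such that
    sgn(theta^T [1, x_i]) = y_i for all i if the data is linearly separable."""
    N, d = X.shape
    X_tilde = np.hstack([np.ones((N, 1)), X])       # prepend the constant feature 1
    theta = np.zeros(d + 1)                         # initial guess theta^(0)
    for _ in range(max_passes):
        mistakes = 0
        for i in range(N):
            if y[i] * (theta @ X_tilde[i]) <= 0:    # point i is misclassified
                theta = theta + y[i] * X_tilde[i]   # move the hyperplane toward x_i
                mistakes += 1
        if mistakes == 0:                           # full pass with no mistakes: done
            return theta
    return theta
```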