Logistic regression, gradient descent, Perceptron Learning Algorithm

Matthieu R Bloch

Logistic regression

  • Assume that \(\eta(\bfx)\) is of the form \(\frac{1}{1+\exp(-(\bfw^\intercal\bfx+b))}\)

  • Estimate \(\hat{\bfw}\) and \(\hat{b}\) from the data directly

  • Plug in the result to obtain \(\hat{\eta}(\bfx)=\frac{1}{1+\exp(-(\hat{\bfw}^\intercal\bfx+\hat{b}))}\)

  • The function \(x\mapsto \frac{1}{1+e^{-x}}\) is called the logistic function

  • The binary logistic regression classifier is \(h^{\text{LC}}(\bfx)=\indic{\hat{\eta}(\bfx)\geq \frac{1}{2}} = \indic{\hat{\bfw}^\intercal\bfx+\hat{b}\geq 0}\) (a linear classifier; see the sketch at the end of this section)

  • How do we estimate \(\hat{\bfw}\) and \(\hat{b}\)?
    • From LDA analysis: \(\hat{\bfw}=\hat{\Sigma}^{-1}(\hat{\boldsymbol{\mu}}_1-\hat{\boldsymbol{\mu}}_0)\), \(\hat{b}=\frac{1}{2}\hat{\boldsymbol{\mu}}_0^\intercal\hat{\Sigma}^{-1}\hat{\boldsymbol{\mu}}_0-\frac{1}{2}\hat{\boldsymbol{\mu}}_1^\intercal\hat{\Sigma}^{-1}\hat{\boldsymbol{\mu}}_1+\log\frac{\hat{\pi}_1}{\hat{\pi}_0}\)
    • Direct estimation of \((\hat{\bfw},\hat{b})\) by maximum likelihood
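
  As a concrete illustration, here is a minimal sketch (in Python/NumPy) of the plug-in classifier above; the names `w_hat` and `b_hat` are placeholders for whatever estimates the estimation step returns. The predicted label depends on \(\hat{\bfw}^\intercal\bfx+\hat{b}\) only through its sign.

```python
import numpy as np

def sigmoid(z):
    """Logistic function z -> 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def eta_hat(x, w_hat, b_hat):
    """Plug-in estimate of the posterior probability of class 1."""
    return sigmoid(w_hat @ x + b_hat)

def classify(x, w_hat, b_hat):
    """Binary logistic regression classifier: predict 1 iff eta_hat(x) >= 1/2,
    which is equivalent to w_hat^T x + b_hat >= 0 (a linear rule)."""
    return int(w_hat @ x + b_hat >= 0)
```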

MLE for logistic regression

  • We have a parametric model for the posterior: \(p_{\theta}(1|\bfx)=\eta(\bfx)\), \(p_{\theta}(0|\bfx)=1-\eta(\bfx)\)
    • Standard trick: set \(\tilde{\bfx} = [1,\,\bfx^\intercal]^\intercal\) and \(\theta = [b,\,\bfw^\intercal]^\intercal\)
    • This allows us to lump the offset and write \[\eta(\bfx) = \frac{1}{1+\exp(-\theta^\intercal \tilde{\bfx})}\]
  • Given our dataset \(\{(\tilde{\bfx}_i,y_i)\}_{i=1}^N\) the likelihood is \(\mathcal{L}(\theta)\eqdef \prod_{i=1}^N \P[\theta]{y_i|\tilde{\bfx}_i}\)

  • For \(K=2\) with \(\calY=\{0,1\}\) we obtain \[\mathcal{L}(\theta)\eqdef \prod_{i=1}^N \eta(\tilde{\bfx}_i)^{y_i}(1-\eta(\tilde{\bfx}_i))^{1-y_i}\]

  • The log-likelihood is \[\ell(\theta)= \sum_{i=1}^N \left(y_i\log\eta(\tilde{\bfx}_i)+(1-y_i)\log(1-\eta(\tilde{\bfx}_i))\right)\]

  • Substituting \(\eta(\tilde{\bfx}_i)=\frac{1}{1+e^{-\theta^\intercal\tilde{\bfx}_i}}\), this simplifies to \[\ell(\theta)= \sum_{i=1}^N \left(y_i\theta^{\intercal}\tilde{\bfx}_i -\log(1+e^{\theta^\intercal\tilde{\bfx}_i})\right)\]
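
  The last expression translates directly into code. Below is a minimal sketch of the negative log-likelihood \(-\ell(\theta)\) that the numerical methods of the next section minimize, assuming a design matrix `X_tilde` of shape \((N, d+1)\) with a leading column of ones and labels `y` in \(\{0,1\}\).

```python
import numpy as np

def neg_log_likelihood(theta, X_tilde, y):
    """-l(theta) for logistic regression, using
    l(theta) = sum_i [ y_i * theta^T x_i - log(1 + exp(theta^T x_i)) ].
    X_tilde: (N, d+1) array with a leading column of ones; y: labels in {0, 1}."""
    z = X_tilde @ theta                       # theta^T x_i for every sample i
    # np.logaddexp(0, z) computes log(1 + exp(z)) in a numerically stable way
    return -np.sum(y * z - np.logaddexp(0.0, z))
```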

Finding the MLE

  • A necessary condition for optimality is \(\nabla_{\theta} \ell(\theta)=\mathbf{0}\)

  • Here this means \(\sum_{i=1}^N\tilde{\bfx}_i\left(y_i-\frac{1}{1+\exp(-\theta^\intercal\tilde{\bfx}_i)}\right)=\mathbf{0}\)
    • System of \(d+1\) nonlinear equations!
    • No closed-form solution in general
  • Need to use a numerical algorithm to solve \(\argmin_{\theta}-\ell(\theta)\) (a sketch of the gradient computation follows this list)
    • Gradient descent
    • Stochastic gradient descent
    • Newton’s method
    • There are many more, especially useful in high dimension
    • Provable convergence when \(-\ell\) is convex
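
  For reference, here is a minimal sketch of the gradient appearing in the stationarity condition above (same assumed `X_tilde` and `y` as before); this is the quantity the algorithms listed above need at each iteration.

```python
import numpy as np

def neg_log_likelihood_grad(theta, X_tilde, y):
    """Gradient of -l(theta): -sum_i x_i * (y_i - sigma(theta^T x_i)).
    Setting it to zero gives the d+1 nonlinear stationarity equations."""
    p = 1.0 / (1.0 + np.exp(-(X_tilde @ theta)))   # sigma(theta^T x_i)
    return -X_tilde.T @ (y - p)
```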

Gradient descent

  • Consider the canonical problem \[\min_{\bfx\in\bbR^d}f(\bfx)\text{ with }f:\bbR^d\to\bbR\]

  • Find the minimum iteratively by “rolling downhill”
    • Start from point \(\bfx^{(0)}\)
    • \(\bfx^{(1)} = \bfx^{(0)} - \eta\nabla f(\bfx)|_{\bfx=\bfx^{(0)}}\); \(\eta\) is the step size
    • \(\bfx^{(2)} = \bfx^{(1)} - \eta\nabla f(\bfx)|_{\bfx=\bfx^{(1)}}\)
    • \(\cdots\)

    • Choice of step size \(\eta\) really matters: too small and convergence takes forever, too big and the iterates might never converge (see the sketch at the end of this section)

  • Many variants of gradient descent
    • Momentum: \(v_{t} = \gamma v_{t-1}+\eta\nabla f(\bfx)|_{\bfx=\bfx^{(t)}}\) and \(\bfx^{(t+1)} = \bfx^{(t)} - v_{t}\)
    • Accelerated: \(v_{t} = \gamma v_{t-1}+\eta\nabla f(\bfx)|_{\bfx=\bfx^{(t)}-\gamma v_{t-1}}\) and \(\bfx^{(t+1)} = \bfx^{(t)} - v_{t}\)
  • In practice, the gradient has to be evaluated from data
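
  A minimal sketch of plain gradient descent and of the momentum variant above; the callback `grad_f`, the initial point `x0`, and the hyperparameter values are assumptions, not prescriptions.

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, n_iters=100):
    """Plain gradient descent with a fixed step size eta."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - eta * grad_f(x)                  # roll downhill along -grad f
    return x

def gradient_descent_momentum(grad_f, x0, eta=0.1, gamma=0.9, n_iters=100):
    """Momentum variant: v_t = gamma * v_{t-1} + eta * grad f(x_t), x_{t+1} = x_t - v_t."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(n_iters):
        v = gamma * v + eta * grad_f(x)
        x = x - v
    return x
```

  Combined with the gradient sketch of the previous section, e.g. `gradient_descent(lambda th: neg_log_likelihood_grad(th, X_tilde, y), np.zeros(X_tilde.shape[1]))` would, for a suitable step size, minimize \(-\ell\).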

Newton’s method

  • Newton-Raphson method uses the second derivative to automatically adapt the step size

  • \[ \bfx^{(j+1)} = \bfx^{(j)} - [\nabla^2 f(\bfx)|_{\bfx=\bfx^{(j)}}]^{-1}\nabla f(\bfx)|_{\bfx=\bfx^{(j)}}\]

  • Hessian matrix \[\nabla^2 f(\bfx)=\left[\begin{array}{cccc} \frac{\partial^2 f}{\partial x_1^2}& \frac{\partial^2 f}{\partial x_1\partial x_2}&\cdots & \frac{\partial^2 f}{\partial x_1\partial x_d}\\\frac{\partial^2 f}{\partial x_2\partial x_1}& \frac{\partial^2 f}{\partial x_2^2}&\cdots & \frac{\partial^2 f}{\partial x_2\partial x_d}\\ \vdots&\vdots& \ddots &\vdots\\ \frac{\partial^2 f}{\partial x_d\partial x_1}& \frac{\partial^2 f}{\partial x_d\partial x_2}&\cdots & \frac{\partial^2 f}{\partial x_d^2}\end{array}\right]\]

  • Newton’s method converges in far fewer iterations, which makes it much faster when the dimension \(d\) is small, but forming and inverting the \(d\times d\) Hessian makes it impractical when \(d\) is large
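
  A minimal sketch of the Newton-Raphson update, together with the Hessian of \(-\ell\) for logistic regression, which works out to \(\sum_i \tilde{\bfx}_i\tilde{\bfx}_i^\intercal\,\eta(\tilde{\bfx}_i)(1-\eta(\tilde{\bfx}_i))\); `grad_f`, `hess_f`, and `X_tilde` are assumed as before.

```python
import numpy as np

def newton_method(grad_f, hess_f, x0, n_iters=20):
    """Newton-Raphson: x_{j+1} = x_j - [Hessian f(x_j)]^{-1} grad f(x_j)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        # Solve the d x d linear system rather than forming the inverse explicitly
        x = x - np.linalg.solve(hess_f(x), grad_f(x))
    return x

def neg_log_likelihood_hess(theta, X_tilde):
    """Hessian of -l(theta) for logistic regression: X^T diag(p * (1 - p)) X."""
    p = 1.0 / (1.0 + np.exp(-(X_tilde @ theta)))
    return X_tilde.T @ (X_tilde * (p * (1.0 - p))[:, None])
```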

Stochastic gradient descent

  • Often have a loss function of the form \(\ell(\theta) = \sum_{i=1}^N\ell_i(\theta)\) where \(\ell_i(\theta)=f(\bfx_i,y_i,\theta)\)

  • The gradient is \(\nabla_\theta\ell(\theta)=\sum_{i=1}^N\nabla\ell_i(\theta)\) and gradient descent update is \[\theta^{(j+1)} = \theta^{(j)} - \eta \sum_{i=1}^N\nabla \ell_i(\theta)\]
    • Problematic if the dataset is huge or if not all data is available
    • Use an iterative technique instead, updating with one term at a time (see the sketch below) \[\theta^{(j+1)} = \theta^{(j)} - \eta \nabla \ell_i(\theta)\]
  • Tons of variations on this principle
    • Batch, minibatch, Adagrad, RMSprop, Adam, etc.
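
  A minimal sketch of the basic SGD loop; the per-sample gradient callback `grad_i(theta, i)` (i.e. \(\nabla\ell_i(\theta)\)) and the hyperparameter values are assumptions.

```python
import numpy as np

def sgd(grad_i, theta0, N, eta=0.01, n_epochs=10, rng=None):
    """Stochastic gradient descent: update with the gradient of a single,
    randomly chosen term l_i instead of the full sum over i = 1, ..., N."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_epochs):
        for i in rng.permutation(N):          # one pass over the shuffled indices
            theta = theta - eta * grad_i(theta, i)
    return theta
```

  A minibatch variant would average `grad_i` over a small set of indices per update; Adagrad, RMSprop, and Adam further adapt the step size per coordinate.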

Linearly separable datasets

  • Dataset \(\{\bfx_i,y_i\}_{i=1}^N\) is linearly separable if there exists \(\bfw\in\bbR^d\) and \(b\in\bbR\) such that \[\forall i\in\intseq{1}{N}\quad y_i=\sgn{\bfw^\intercal\bfx_i+b} \qquad y_i\in \{\pm 1\}\] By definition \(\sgn{x}=+1\) if \(x> 0\) and \(-1\) otherwise. The affine set \(\{\bfx:\bfw^\intercal\bfx+b=0\}\) is called a separating hyperplane.

  • A bit of geometry…
  • Consider the hyperplane \(\calH\eqdef \{\bfx:\bfw^\intercal\bfx+b=0\}\). The vector \(\bfw\) is orthogonal to all vectors parallel to the hyperplane. For \(\bfz\in\bbR^d\), the distance of \(\bfz\) to the hyperplane is \[d(\bfz,\calH)=\frac{\abs{\bfw^\intercal\bfz+b}}{\norm[2]{\bfw}}\] (a numerical sanity check follows this list)

  • How can we find a separating hyperplane? (assuming it exists…)
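
  A quick numerical sanity check of the distance formula above, on a hypothetical hyperplane in \(\bbR^2\):

```python
import numpy as np

def distance_to_hyperplane(z, w, b):
    """Distance from z to the hyperplane {x : w^T x + b = 0}: |w^T z + b| / ||w||_2."""
    return abs(w @ z + b) / np.linalg.norm(w)

# Hyperplane x_1 + x_2 = 1 in R^2, i.e. w = (1, 1), b = -1.
w, b = np.array([1.0, 1.0]), -1.0
# The closest point of the hyperplane to (1, 1) is (1/2, 1/2), at distance 1/sqrt(2).
print(distance_to_hyperplane(np.array([1.0, 1.0]), w, b))   # ~0.7071
```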

Perceptron learning algorithm

  • Consider \(\bfx = [1,\,x_1,\cdots,x_d]^\intercal\in\bbR^{d+1}\) and hyperplanes of the form \(\theta^\intercal\bfx = 0\).
  • Data \(\calD\eqdef\{(\bfx_i,y_i)\}_{i=1}^N\) such that \(y_i\in\{\pm 1\}\)

  • Perceptron Learning Algorithm (PLA) invented by Rosenblatt in 1958 to find separating hyperplanes

    • Start from guess for \(\theta^{(0)}\) and go over data points in sequence to update \[\theta^{(j+1)}=\begin{cases} \theta^{(j)}+y_i\bfx_i\text{ if }y_i\neq \sgn{{\theta^{(j)}}^\intercal\bfx_i}\\ \theta^{(j)}\text{ else} \end{cases}\]
    • Geometric intuition behind operation
    • Stochastic gradient descent view
  • PLA finds a non-parametric linear classifier
    • Can be viewed as a single-layer neural network
  • PLA is guaranteed to find a separating hyperplane if the data is linearly separable (see the sketch below)
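
  A minimal sketch of the PLA update above, with the common convention that \(\theta^\intercal\bfx_i=0\) is treated as a mistake; `max_passes` is a safeguard for non-separable data, not part of the algorithm.

```python
import numpy as np

def perceptron(X, y, max_passes=1000):
    """Perceptron Learning Algorithm.
    X: (N, d) data matrix, y: labels in {-1, +1}.
    Returns theta in R^{d+1} (bias coordinate first) such that
    sgn(theta^T [1, x_i]) = y_i for all i if the data is linearly separable."""
    N, d = X.shape
    X_tilde = np.hstack([np.ones((N, 1)), X])       # prepend the constant feature 1
    theta = np.zeros(d + 1)                         # initial guess theta^(0)
    for _ in range(max_passes):
        mistakes = 0
        for i in range(N):
            if y[i] * (theta @ X_tilde[i]) <= 0:    # point i is misclassified
                theta = theta + y[i] * X_tilde[i]   # move the hyperplane toward x_i
                mistakes += 1
        if mistakes == 0:                           # full pass with no mistakes: done
            return theta
    return theta
```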