Perceptron Learning Algorithm

Matthieu R Bloch

Linearly separable datasets

  • Dataset \(\{\bfx_i,y_i\}_{i=1}^N\) is linearly separable if there exists \(\bfw\in\bbR^d\) and \(b\in\bbR\) such that \[\forall i\in\intseq{1}{N}\quad y_i=\sgn{\bfw^\intercal\bfx_i+b} \qquad y_i\in \{\pm 1\}\] By definition, \(\sgn{x}=+1\) if \(x> 0\) and \(-1\) otherwise. The affine set \(\{\bfx:\bfw^\intercal\bfx+b=0\}\) is called a separating hyperplane.

  • A bit of geometry…
  • Consider the hyperplane \(\calH\eqdef \{\bfx:\bfw^\intercal\bfx+b=0\}\). The vector \(\bfw\) is orthogonal to all vectors parallel to the hyperplane. For \(\bfz\in\bbR^d\), the distance of \(\bfz\) to the hyperplane is \[d(\bfz,\calH)=\frac{\abs{\bfw^\intercal\bfz+b }}{\norm[2]{\bfw}}\] (a short numerical check appears after this list)

  • How can we find a separating hyperplane? (assuming it exists…)
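
As a quick numerical check of the two points above, here is a minimal NumPy sketch of the separability condition and the distance formula (the dataset, the function names, and the chosen \(\bfw, b\) are illustrative assumptions, not taken from these notes):

```python
import numpy as np

def distance_to_hyperplane(z, w, b):
    """Distance of z to the hyperplane {x : w^T x + b = 0}."""
    return abs(w @ z + b) / np.linalg.norm(w)

def separates(X, y, w, b):
    """Check y_i = sgn(w^T x_i + b) for every point (labels in {-1, +1})."""
    return bool(np.all(np.sign(X @ w + b) == y))

# Toy 2D dataset separated by the line x1 + x2 = 1 (made-up values)
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, 0.0], [0.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), -1.0
print(separates(X, y, w, b))                               # True
print(distance_to_hyperplane(np.array([2.0, 1.0]), w, b))  # 2/sqrt(2) ≈ 1.41
```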

Perceptron learning algorithm

  • Consider augmented inputs \(\bfx = [1,x_1,\cdots,x_d]^\intercal\in\bbR^{d+1}\) and hyperplanes of the form \(\theta^\intercal\bfx = 0\) with \(\theta\in\bbR^{d+1}\).
  • Data \(\calD\eqdef\{(\bfx_i,y_i)\}_{i=1}^N\) such that \(y_i\in\{\pm 1\}\)

  • Perceptron Learning Algorithm (PLA) invented by Rosenblatt in 1958 to find separating hyperplanes

    • Start from guess for \(\theta^{(0)}\) and go over data points in sequence to update \[\theta^{(j+1)}=\begin{cases} \theta^{(j)}+y_i\bfx_i\text{ if }y_i\neq \sgn{{\theta^{(j)}}^\intercal\bfx_i}\\ \theta^{(j)}\text{ else} \end{cases}\]
    • Geometric intuition behind operation
    • Stochastic gradient descent view
  • PLA finds a non-parametric linear classifier
    • Can be viewed as a single-layer neural network (NN)
  • PLA is guaranteed to find a separating hyperplane, after finitely many updates, if the data is linearly separable (a minimal implementation sketch follows below)
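
The update rule above translates almost line for line into code. Below is a minimal NumPy sketch (the function name, the epoch cap, and the convention of counting \(\theta^\intercal\bfx_i=0\) as a mistake are my own choices, not from these notes); on linearly separable data it stops once a full pass over the dataset produces no updates:

```python
import numpy as np

def pla(X, y, max_epochs=1000):
    """Perceptron learning algorithm on augmented inputs [1, x_1, ..., x_d].
    X: (N, d) array, y: labels in {-1, +1}. Returns theta in R^{d+1}."""
    N, d = X.shape
    Xa = np.hstack([np.ones((N, 1)), X])   # prepend the constant feature 1
    theta = np.zeros(d + 1)                # initial guess theta^(0)
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(Xa, y):
            if yi * (theta @ xi) <= 0:     # sign mismatch: point is misclassified
                theta = theta + yi * xi    # update theta^(j+1) = theta^(j) + y_i x_i
                mistakes += 1
        if mistakes == 0:                  # all points classified correctly: done
            break
    return theta
```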

Maximum margin hyperplane

  • “All separating hyperplanes are equal but some are more equal than others”

  • Margin \(\rho(\bfw,b) \eqdef \min_i \frac{\abs{\bfw^\intercal\bfx_i+b}}{\norm[2]{\bfw}}\); the maximum margin hyperplane is the solution of \[(\bfw^*,b^*)=\argmax_{\bfw,b}\rho(\bfw,b)\]
    • A larger margin typically leads to better generalization
  • The canonical form \((\bfw,b)\) of a separating hyperplane is such that \[\forall i\,\, y_i(\bfw^\intercal\bfx_i+b) \geq 1\text{ and }\exists i^* \text{ s.t. }y_{i^*}(\bfw^\intercal\bfx_{i^*}+b) = 1\]

  • For canonical hyperplanes, the optimization problem is \[\argmin_{\bfw,b}\frac{1}{2}\norm[2]{\bfw}^2\text{ s.t. }\forall i\quad y_i(\bfw^\intercal\bfx_i+b)\geq 1\]
    • This is a convex quadratic program with linear constraints, which standard solvers handle very efficiently (a short solver sketch follows below)
    • We will come back to this when we discuss support vector machines
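
As a sketch of how one might solve this quadratic program numerically, the snippet below uses CVXPY, an off-the-shelf convex solver that is an assumed dependency, not part of these notes (the function name is also illustrative); it assumes the data is linearly separable so that the constraints are feasible:

```python
import cvxpy as cp

def max_margin_hyperplane(X, y):
    """Solve min (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1.
    X: (N, d) NumPy array, y: NumPy array of labels in {-1, +1}."""
    N, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) >= 1]   # margin constraints for all i
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints).solve()
    return w.value, b.value
```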

Optimal soft-margin hyperplane

  • What if our data is not linearly separable?
    • The constraints \(\forall i\quad y_i(\bfw^\intercal\bfx_i+b)\geq 1\) cannot all be satisfied simultaneously
    • Introduce slack variables \(\xi_i\geq 0\) such that \(\forall i\quad y_i(\bfw^\intercal\bfx_i+b)\geq 1-\xi_i\)
  • The optimal soft-margin hyperplane is the solution of the following \[\argmin_{\bfw,b,\boldsymbol{\xi}}\frac{1}{2}\norm[2]{\bfw}^2+\frac{C}{N}\sum_{i=1}^N\xi_i\text{ s.t. }\forall i\quad y_i(\bfw^\intercal\bfx_i+b)\geq 1-\xi_i\text{ and }\xi_i\geq 0\]

  • \(C>0\) is a cost set by the user; it trades off margin maximization against the total slack \(\sum_i\xi_i\) and thereby controls the influence of outliers (a solver sketch follows below)
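
A hedged sketch of the soft-margin program, again using CVXPY as an assumed dependency (the interface below is illustrative, not from these notes); it returns the slack values \(\xi_i\) along with \((\bfw,b)\) so that margin violations can be inspected:

```python
import cvxpy as cp

def soft_margin_hyperplane(X, y, C=1.0):
    """Solve min (1/2)||w||^2 + (C/N) sum_i xi_i
    s.t. y_i (w^T x_i + b) >= 1 - xi_i and xi_i >= 0.
    X: (N, d) NumPy array, y: NumPy array of labels in {-1, +1}."""
    N, d = X.shape
    w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(N)
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + (C / N) * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w + b) >= 1 - xi,  # relaxed margin constraints
                   xi >= 0]                              # slacks are nonnegative
    cp.Problem(objective, constraints).solve()
    return w.value, b.value, xi.value
```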