Perceptron Learning Algorithm

Matthieu R Bloch

Linearly separable datasets

  • Dataset \(\{\bfx_i,y_i\}_{i=1}^N\) is linearly separable if there exists \(\bfw\in\bbR^d\) and \(b\in\bbR\) such that \[\forall i\in\intseq{1}{N}\quad y_i=\sgn{\bfw^\intercal\bfx_i+b} \qquad y_i\in \{\pm 1\}\] By definition, \(\sgn{x}=+1\) if \(x> 0\) and \(-1\) otherwise. The affine set \(\{\bfx:\bfw^\intercal\bfx+b=0\}\) is called a separating hyperplane.

  • A bit of geometry…
  • Consider the hyperplane \(\calH\eqdef \{\bfx:\bfw^\intercal\bfx+b=0\}\). The vector \(\bfw\) is orthogonal to all vectors parallel to the hyperplane. For \(\bfz\in\bbR^d\), the distance of \(\bfz\) to the hyperplane is \[d(\bfz,\calH)=\frac{\abs{\bfw^\intercal\bfz+b }}{\norm[2]{\bfw}}\] (a short numerical check appears after this list)

  • How can we find a separating hyperplane? (assuming it exists…)
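
As a quick numerical check of the two points above, here is a minimal NumPy sketch of the separability condition and the distance formula (the dataset, the function names, and the chosen \(\bfw, b\) are illustrative assumptions, not taken from these notes):

```python
import numpy as np

def distance_to_hyperplane(z, w, b):
    """Distance of z to the hyperplane {x : w^T x + b = 0}."""
    return abs(w @ z + b) / np.linalg.norm(w)

def separates(X, y, w, b):
    """Check y_i = sgn(w^T x_i + b) for every point (labels in {-1, +1})."""
    return bool(np.all(np.sign(X @ w + b) == y))

# Toy 2D dataset separated by the line x1 + x2 = 1 (made-up values)
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, 0.0], [0.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), -1.0
print(separates(X, y, w, b))                               # True
print(distance_to_hyperplane(np.array([2.0, 1.0]), w, b))  # 2/sqrt(2) ≈ 1.41
```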

Perceptron learning algorithm

  • Consider augmented inputs \(\bfx = [1,x_1,\cdots,x_d]^\intercal\in\bbR^{d+1}\) and hyperplanes of the form \(\theta^\intercal\bfx = 0\) with \(\theta\in\bbR^{d+1}\).
  • Data \(\calD\eqdef\{(\bfx_i,y_i)\}_{i=1}^N\) such that \(y_i\in\{\pm 1\}\)

  • Perceptron Learning Algorithm (PLA) invented by Rosenblatt in 1958 to find separating hyperplanes

    • Start from guess for \(\theta^{(0)}\) and go over data points in sequence to update \[\theta^{(j+1)}=\begin{cases} \theta^{(j)}+y_i\bfx_i\text{ if }y_i\neq \sgn{{\theta^{(j)}}^\intercal\bfx_i}\\ \theta^{(j)}\text{ else} \end{cases}\]
    • Geometric intuition behind operation
    • Stochastic gradient descent view
  • PLA finds a non-parametric linear classifier
    • Can be viewed as a single-layer neural network (NN)
  • PLA is guaranteed to find a separating hyperplane, after finitely many updates, if the data is linearly separable (a minimal implementation sketch follows below)
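
The update rule above translates almost line for line into code. Below is a minimal NumPy sketch (the function name, the epoch cap, and the convention of counting \(\theta^\intercal\bfx_i=0\) as a mistake are my own choices, not from these notes); on linearly separable data it stops once a full pass over the dataset produces no updates:

```python
import numpy as np

def pla(X, y, max_epochs=1000):
    """Perceptron learning algorithm on augmented inputs [1, x_1, ..., x_d].
    X: (N, d) array, y: labels in {-1, +1}. Returns theta in R^{d+1}."""
    N, d = X.shape
    Xa = np.hstack([np.ones((N, 1)), X])   # prepend the constant feature 1
    theta = np.zeros(d + 1)                # initial guess theta^(0)
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(Xa, y):
            if yi * (theta @ xi) <= 0:     # sign mismatch: point is misclassified
                theta = theta + yi * xi    # update theta^(j+1) = theta^(j) + y_i x_i
                mistakes += 1
        if mistakes == 0:                  # all points classified correctly: done
            break
    return theta
```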

Maximum margin hyperplane

  • “All separating hyperplanes are equal but some are more equal than others”

  • Margin \(\rho(\bfw,b) \eqdef \min_i \frac{\abs{\bfw^\intercal\bfx_i+b}}{\norm[2]{\bfw}}\); the maximum margin hyperplane is the solution of \[(\bfw^*,b^*)=\argmax_{\bfw,b}\rho(\bfw,b)\]
    • A larger margin typically leads to better generalization
  • The canonical form \((\bfw,b)\) of a separating hyperplane is such that \[\forall i\,\, y_i(\bfw^\intercal\bfx_i+b) \geq 1\text{ and }\exists i^* \text{ s.t. }y_{i^*}(\bfw^\intercal\bfx_{i^*}+b) = 1\]

  • For canonical hyperplanes, the optimization problem is \[\argmin_{\bfw,b}\frac{1}{2}\norm[2]{\bfw}^2\text{ s.t. }\forall i\quad y_i(\bfw^\intercal\bfx_i+b)\geq 1\]
    • This is a convex quadratic program with linear constraints, which standard solvers handle very efficiently (a short solver sketch follows below)
    • We will come back to this when we discuss support vector machines
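
As a sketch of how one might solve this quadratic program numerically, the snippet below uses CVXPY, an off-the-shelf convex solver that is an assumed dependency, not part of these notes (the function name is also illustrative); it assumes the data is linearly separable so that the constraints are feasible:

```python
import cvxpy as cp

def max_margin_hyperplane(X, y):
    """Solve min (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1.
    X: (N, d) NumPy array, y: NumPy array of labels in {-1, +1}."""
    N, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) >= 1]   # margin constraints for all i
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints).solve()
    return w.value, b.value
```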

Optimal soft-margin hyperplane

  • What if our data is not linearly separable?
    • The constraints \(\forall i\quad y_i(\bfw^\intercal\bfx_i+b)\geq 1\) cannot all be satisfied simultaneously
    • Introduce slack variables \(\xi_i\geq 0\) such that \(\forall i\quad y_i(\bfw^\intercal\bfx_i+b)\geq 1-\xi_i\)
  • The optimal soft-margin hyperplane is the solution of the following \[\argmin_{\bfw,b,\boldsymbol{\xi}}\frac{1}{2}\norm[2]{\bfw}^2+\frac{C}{N}\sum_{i=1}^N\xi_i\text{ s.t. }\forall i\quad y_i(\bfw^\intercal\bfx_i+b)\geq 1-\xi_i\text{ and }\xi_i\geq 0\]

  • \(C>0\) is a cost set by the user; it trades off margin maximization against the total slack \(\sum_i\xi_i\) and thereby controls the influence of outliers (a solver sketch follows below)
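
A hedged sketch of the soft-margin program, again using CVXPY as an assumed dependency (the interface below is illustrative, not from these notes); it returns the slack values \(\xi_i\) along with \((\bfw,b)\) so that margin violations can be inspected:

```python
import cvxpy as cp

def soft_margin_hyperplane(X, y, C=1.0):
    """Solve min (1/2)||w||^2 + (C/N) sum_i xi_i
    s.t. y_i (w^T x_i + b) >= 1 - xi_i and xi_i >= 0.
    X: (N, d) NumPy array, y: NumPy array of labels in {-1, +1}."""
    N, d = X.shape
    w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(N)
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + (C / N) * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w + b) >= 1 - xi,  # relaxed margin constraints
                   xi >= 0]                              # slacks are nonnegative
    cp.Problem(objective, constraints).solve()
    return w.value, b.value, xi.value
```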