Assume that \(\eta(\bfx)\) is of the form \(\frac{1}{1+\exp(-(\bfw^\intercal\bfx+b))}\)
Estimate \(\hat{\bfw}\) and \(\hat{b}\) from the data directly
Plug in the result to obtain \(\hat{\eta}(\bfx)=\frac{1}{1+\exp(-(\hat{\bfw}^\intercal\bfx+\hat{b}))}\)
The function \(x\mapsto \frac{1}{1+e^{-x}}\) is called the logistic function
The binary logistic regression classifier is \(h^{\text{LC}}(\bfx)=\indic{\hat{\eta}(\bfx)\geq \frac{1}{2}} = \indic{\hat{\bfw}^\intercal\bfx+\hat{b}\geq 0}\), which is a linear classifier
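As an illustration, here is a minimal sketch of the plug-in rule in Python (NumPy), where `w_hat` and `b_hat` stand for the estimates \(\hat{\bfw}\), \(\hat{b}\) obtained by the maximum-likelihood procedure described next:

```python
import numpy as np

def eta_hat(x, w_hat, b_hat):
    """Plug-in estimate of P[Y=1|X=x] via the logistic function."""
    return 1.0 / (1.0 + np.exp(-(w_hat @ x + b_hat)))

def h_lc(x, w_hat, b_hat):
    """Binary logistic classifier: 1 iff eta_hat(x) >= 1/2, i.e. w^T x + b >= 0."""
    return 1 if w_hat @ x + b_hat >= 0 else 0
```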
Given our dataset \(\{(\tilde{\bfx}_i,y_i)\}_{i=1}^N\) the likelihood is \(\mathcal{L}(\theta)\eqdef \prod_{i=1}^N \P[\theta]{y_i|\tilde{\bfx}_i}\)
For \(K=2\) with \(\calY=\{0,1\}\) we obtain \[\mathcal{L}(\theta)\eqdef \prod_{i=1}^N \eta(\tilde{\bfx}_i)^{y_i}(1-\eta(\tilde{\bfx}_i))^{1-y_i}\]
The log-likelihood \(\ell(\theta)\eqdef\log\mathcal{L}(\theta)\) is \[\ell(\theta)= \sum_{i=1}^N \left(y_i\log\eta(\tilde{\bfx}_i)+(1-y_i)\log(1-\eta(\tilde{\bfx}_i))\right)\]
Writing \(\theta\eqdef[b\;\bfw^\intercal]^\intercal\) and \(\tilde{\bfx}\eqdef[1\;\bfx^\intercal]^\intercal\), so that \(\eta(\tilde{\bfx})=\frac{1}{1+e^{-\theta^\intercal\tilde{\bfx}}}\), this simplifies to \[\ell(\theta)= \sum_{i=1}^N \left(y_i\theta^{\intercal}\tilde{\bfx}_i -\log(1+e^{\theta^\intercal\tilde{\bfx}_i})\right)\]
A necessary condition for optimality is \(\nabla_{\theta} \ell(\theta)=\mathbf{0}\)
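Concretely, differentiating the simplified log-likelihood term by term gives \[\nabla_{\theta} \ell(\theta)=\sum_{i=1}^N \left(y_i-\frac{1}{1+e^{-\theta^\intercal\tilde{\bfx}_i}}\right)\tilde{\bfx}_i = \sum_{i=1}^N \left(y_i-\eta(\tilde{\bfx}_i)\right)\tilde{\bfx}_i\] Setting this to \(\mathbf{0}\) yields no closed-form solution in \(\theta\), which is why we turn to iterative optimization.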
Consider the canonical problem \[\min_{\bfx\in\bbR^d}f(\bfx)\text{ with }f:\bbR^d\to\bbR\]
Gradient descent: starting from an initial guess \(\bfx^{(0)}\), iterate \[\bfx^{(j+1)} = \bfx^{(j)} - \eta \nabla f(\bfx)|_{\bfx=\bfx^{(j)}}\] for some step size \(\eta>0\)
The choice of step size \(\eta\) really matters: too small and convergence takes forever; too big and the iterates might never converge
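A minimal gradient descent sketch in Python; `grad_f` is any callable returning \(\nabla f\), and the quadratic example at the end is only for illustration:

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, n_iters=1000):
    """Iterate x^(j+1) = x^(j) - eta * grad f(x^(j))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - eta * grad_f(x)
    return x

# Example: f(x) = ||x - 1||^2 has gradient 2(x - 1); the minimizer is x = 1.
x_star = gradient_descent(lambda x: 2.0 * (x - 1.0), x0=np.zeros(3))
```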
In practice, the gradient has to be evaluated from data
The Newton-Raphson method uses the second derivative to adapt the step size automatically
\[ \bfx^{(j+1)} = \bfx^{(j)} - [\nabla^2 f(\bfx)]^{-1}\nabla f(\bfx)|_{\bfx=\bfx^{(j)}}\]
Hessian matrix \[\nabla^2 f(\bfx)=\left[\begin{array}{cccc} \frac{\partial^2 f}{\partial x_1^2}& \frac{\partial^2 f}{\partial x_1\partial x_2}&\cdots & \frac{\partial^2 f}{\partial x_1\partial x_d}\\\frac{\partial^2 f}{\partial x_2\partial x_1}& \frac{\partial^2 f}{\partial x_2^2}&\cdots & \frac{\partial^2 f}{\partial x_2\partial x_d}\\ \vdots&\vdots& \ddots &\vdots\\ \frac{\partial^2 f}{\partial x_d\partial x_1}& \frac{\partial^2 f}{\partial x_d\partial x_2}&\cdots & \frac{\partial^2 f}{\partial x_d^2}\end{array}\right]\]
Newton’s method is much faster when the dimension \(d\) is small but impractical when \(d\) is large, since forming and inverting the \(d\times d\) Hessian becomes too expensive
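A corresponding Newton-Raphson sketch; `hess_f` is an assumed callable returning the Hessian, and solving a linear system replaces the explicit inverse:

```python
import numpy as np

def newton_raphson(grad_f, hess_f, x0, n_iters=20):
    """Iterate x^(j+1) = x^(j) - [Hessian f(x^(j))]^{-1} grad f(x^(j))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        # Solving H d = grad is cheaper and more stable than forming H^{-1}.
        x = x - np.linalg.solve(hess_f(x), grad_f(x))
    return x
```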
We often have a loss function of the form \(\ell(\theta) = \sum_{i=1}^N\ell_i(\theta)\), where \(\ell_i(\theta)=f(\bfx_i,y_i,\theta)\) is the loss incurred on the \(i\)th sample, so that the gradient decomposes as \(\nabla_\theta\ell(\theta)=\sum_{i=1}^N\nabla_\theta\ell_i(\theta)\)
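For the logistic log-likelihood above, for instance, each sample contributes \((y_i-\eta(\tilde{\bfx}_i))\tilde{\bfx}_i\) to the gradient; a sketch, assuming the augmented samples are stacked as rows of an \(N\times(d+1)\) array `X_tilde`:

```python
import numpy as np

def grad_ell(theta, X_tilde, y):
    """Gradient of the log-likelihood as a sum of per-sample terms."""
    eta = 1.0 / (1.0 + np.exp(-(X_tilde @ theta)))  # eta(x_i) for every sample
    return (y - eta) @ X_tilde  # sum_i (y_i - eta_i) * x_i, shape (d+1,)
```

This finite-sum structure is what allows the gradient to be evaluated directly from the data, one sample at a time.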
Dataset \(\{\bfx_i,y_i\}_{i=1}^N\) is linearly separable if there exists \(\bfw\in\bbR^d\) and \(b\in\bbR\) such that \[\forall i\in\intseq{1}{N}\quad y_i=\sgn{\bfw^\intercal\bfx_i+b} \qquad y_i\in \{\pm 1\}\] By definition, \(\sgn{x}=+1\) if \(x> 0\) and \(\sgn{x}=-1\) otherwise. The affine set \(\{\bfx:\bfw^\intercal\bfx+b=0\}\) is called a separating hyperplane.
Consider the hyperplane \(\calH\eqdef \{\bfx:\bfw^\intercal\bfx+b=0\}\). The vector \(\bfw\) is orthogonal to all vectors parallel to the hyperplane. For \(\bfz\in\bbR^d\), the distance of \(\bfz\) to the hyperplane is \[d(\bfz,\calH)=\frac{\abs{\bfw^\intercal\bfz+b }}{\norm[2]{\bfw}}\]
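To see why, note that since \(\bfw\) is normal to \(\calH\), the point of \(\calH\) closest to \(\bfz\) has the form \(\bfz-t\bfw\) for some \(t\in\bbR\); enforcing \(\bfw^\intercal(\bfz-t\bfw)+b=0\) gives \(t=\frac{\bfw^\intercal\bfz+b}{\norm[2]{\bfw}^2}\), so that \[d(\bfz,\calH)=\norm[2]{t\bfw}=\frac{\abs{\bfw^\intercal\bfz+b}}{\norm[2]{\bfw}}\]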
Data \(\calD\eqdef\{(\bfx_i,y_i)\}_{i=1}^N\) such that \(y_i\in\{\pm 1\}\)
The Perceptron Learning Algorithm (PLA) was invented by Rosenblatt in 1958 to find separating hyperplanes
If the data is linearly separable, the PLA finds a separating hyperplane in a finite number of updates (see the sketch below)
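A minimal PLA sketch in Python; the `max_iters` cap is an added safeguard for non-separable data, not part of the original algorithm:

```python
import numpy as np

def pla(X, y, max_iters=100_000):
    """Perceptron Learning Algorithm.

    X: (N, d) array of samples; y: (N,) array of labels in {-1, +1}.
    Returns (w, b) defining the hyperplane w^T x + b = 0.
    """
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_iters):
        mistakes = [i for i in range(N) if y[i] * (X[i] @ w + b) <= 0]
        if not mistakes:
            return w, b                       # every sample correctly classified
        i = mistakes[0]
        w, b = w + y[i] * X[i], b + y[i]      # update toward the misclassified sample
    return w, b                               # cap reached (data may not be separable)
```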