Plugin classifiers

Matthieu R Bloch

Plugin methods

  • The Bayes classifier requires the knowledge of \(P_X\) and \(P_{Y|X}\)

  • We can learn \(P_X\) and \(P_{Y|X}\) from the data itself and plug the result into the Bayes classifier

  • In fact, the Bayes classifier only requires \(P_{Y|X}\)
    • Generative classifier: learn \(P_X\) and \(P_{Y|X}\) (can compute \(P_{X|Y}\))
    • Discriminative classifier: learn \(P_{Y|X}\) directly
  • Models can be of two types
    • Parametric: strong assumption about underlying distribution (e.g., Gaussian, Multinomial, etc.)
    • Non-Parametric: no assumption about form of underlying distribution
  • All four combinations (generative/discriminative, parametric/non-parametric) are possible

  • Question: where do \(K\)-NN classifiers fit in this taxonomy?

Maximum likelihood estimation

  • Consider a parametric density \(p_{\theta}(x)\) with unknown \(\theta\in\bbR^d\)

  • Assume that we have i.i.d. generated data points \(\{x_i\}_{i=1}^N\)

  • The likelihood is \(\mathcal{L}(\theta)\eqdef\P[\theta]{\{x_i\}_{i=1}^N}=\prod_{i=1}^N p_{\theta}(x_i)\)

  • The log-likelihood is \(\ell(\theta)\eqdef\log\mathcal{L}(\theta)=\log\P[\theta]{\{x_i\}_{i=1}^N}=\sum_{i=1}^N \log p_{\theta}(x_i)\)

  • The maximum likelihood estimator (MLE) is \(\theta_{\text{ML}} = \argmax_{\theta}\mathcal{L}(\theta)\)
    • Sometimes, there exists a closed form solution
    • More often than not, we need to resort to numerical solutions
  • Assume \(y=\theta^\intercal\bfx+n\) where \(n\sim\calN(0,\sigma^2)\). Then \[\theta_{\text{ML}} = \argmin_{\theta}\sum_{i=1}^N \abs{y_i-\theta^\intercal\bfx_i}^2\] i.e., maximizing the Gaussian log-likelihood amounts to minimizing the sum of squared residuals (least squares)
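
  • As a quick check of this equivalence, here is a minimal sketch (assuming NumPy and synthetic data; all variable names are illustrative, not from the notes):

```python
# Minimal sketch: for y = theta^T x + n with n ~ N(0, sigma^2), the MLE of theta
# is the least-squares solution. The data below is synthetic and purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 200
theta_true = rng.normal(size=d)                        # hypothetical ground truth
X = rng.normal(size=(N, d))                            # rows are the feature vectors x_i
y = X @ theta_true + rng.normal(scale=0.5, size=N)     # y_i = theta^T x_i + n_i

# theta_ML = argmin_theta sum_i |y_i - theta^T x_i|^2  (least squares)
theta_ml, *_ = np.linalg.lstsq(X, y, rcond=None)

# The negative log-likelihood equals the residual sum of squares up to scaling by
# 1/(2 sigma^2) and an additive constant, so minimizing one minimizes the other.
rss = np.sum((y - X @ theta_ml) ** 2)
print(theta_ml, rss)
```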

Naive Bayes

  • Consider \(\bfx={[x_1,\cdots,x_d]}^\intercal\in\bbR^d\) (random) feature vector and \(y\) the label
    • Naive assumption: given \(y\), the features \(\{x_i\}_{i=1}^d\) of \(\bfx\) are independent, i.e., \(P_{\bfx|y} = \prod_{i=1}^d P_{x_i|y}\)
    • Main benefit: only univariate densities \(P_{x_i|y}\) are needed, and discrete and continuous features can be combined
    • The assumption can be completely wrong (example: the bivariate Gaussian case)
  • Procedure
    • Estimate a priori class probabilities: \(\pi_k\) for \(0\leq k\leq K-1\)
    • Estimate class conditional densities \(p_{x_i|y}(x|k)\) for \(1\leq i\leq d\) and \(0\leq k\leq K-1\)
  • The maximum likelihood estimate of \(\pi_k\) is \(\hat{\pi}_k = \frac{N_k}{N}\) where \(N_k\eqdef \card{\{i:y_i=k\}}\)

  • Assume the \(j\)th feature \(x_j\) takes \(J\) distinct values \(\{0,\dots,J-1\}\). The maximum likelihood estimate of \(P_{x_j|y}(\ell|k)\) is \(\widehat{P}_{x_j|y}(\ell|k) = \frac{N^{(j)}_{\ell,k}}{N_k}\) where \(N^{(j)}_{\ell,k}\eqdef \card{\{i:y_i=k\text{ and } x_{i,j}=\ell\}}\) and \(x_{i,j}\) denotes the \(j\)th feature of \(\bfx_i\)

  • The naive Bayes classifier is \(h^{\text{NB}}(\bfx)=\argmax_{k} \hat{\pi}_k\prod_{j=1}^d \widehat{P}_{x_j|y}(x_j|k)\)

  • \(P_{\bfx|y}\) often modeled with parametric models:
    • Continuous features: often Gaussian, use ML estimate
    • Discrete (categorical) features: often Multinomial, use ML estimate
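
  • As a sketch of the procedure above (assuming NumPy; the function names and data layout are illustrative, not from the notes), for discrete features taking values in \(\{0,\dots,J-1\}\):

```python
# Minimal sketch of the plug-in naive Bayes procedure for discrete features.
import numpy as np

def fit_naive_bayes(X, y, K, J):
    """X: (N, d) integer features, y: (N,) labels in 0..K-1."""
    N, d = X.shape
    pi = np.array([np.mean(y == k) for k in range(K)])    # pi_hat_k = N_k / N
    # cond[k, j, l] = P_hat_{x_j|y}(l|k) = N^{(j)}_{l,k} / N_k
    cond = np.zeros((K, d, J))
    for k in range(K):
        Xk = X[y == k]                                    # training points of class k
        for j in range(d):
            for l in range(J):
                cond[k, j, l] = np.mean(Xk[:, j] == l)
    return pi, cond

def predict_naive_bayes(x, pi, cond):
    # h_NB(x) = argmax_k pi_hat_k * prod_j P_hat(x_j | k), computed in the log domain.
    # Note: a zero count gives log(0) = -inf, the issue Laplace smoothing (later) addresses.
    d = len(x)
    scores = [np.log(pi[k]) + sum(np.log(cond[k, j, x[j]]) for j in range(d))
              for k in range(len(pi))]
    return int(np.argmax(scores))
```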

Naive Bayes and Bag of Words

  • Bag-of-words model for document classification
    • Ignore positions of the words, focus on number of occurrences
    • Remove uninformative words, e.g., “the,” “an,” “of,” which are unlikely to matter (stop words)
  • Naive Bayes
    • Assume all features are conditionally independent given the class
  • Document classification
    • Think of document as \[ \bfx = \left[\text{word 1},\text{word 2},\cdots,\text{word $n$}\right]^\intercal\]
    • The dictionary contains \(d\) words, labeled by \(j\in\{1,\dots,d\}\)
  • For every class \(k\) and every position \(i\) in the document, \(\P{\text{word } i = j|k} = \mu_{j,k}\)

  • The likelihood of document \(\bfx\) in class \(k\) is \[\P{\bfx|k} = \prod_{i=1}^n \prod_{j=1}^d \mu_{j,k}^{\indic{x_i=j}}=\prod_{j=1}^{d}\mu_{j,k}^{N_{j}}\] where \(N_j\eqdef\sum_{i=1}^n\indic{x_i=j}\) is the number of occurrences of word \(j\) in the document.
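
  • For instance (a toy example added here for illustration), with a two-word dictionary and the document \(\bfx={[1,2,1]}^\intercal\), so that \(n=3\), \(N_1=2\), and \(N_2=1\), \[\P{\bfx|k}=\mu_{1,k}\cdot\mu_{2,k}\cdot\mu_{1,k}=\mu_{1,k}^{2}\,\mu_{2,k}\]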

Naive Bayes and Bag of Words (ct’d)

  • Estimate parameters
    • Compute the ML estimate of the class priors: \(\hat{\pi}_k = \frac{N_k}{N}\)
    • Compute the ML estimate of \(\mu_{j,k}\) from the total number of occurrences of word \(j\) across all class-\(k\) documents: \[\hat{\mu}_{j,k}=\frac{\sum_{\ell}\ell N^{(j)}_{\ell,k}}{\sum_{j'=1}^d\sum_{\ell}\ell N^{(j')}_{\ell,k}}\] where \(N^{(j)}_{\ell,k}\) is the number of class-\(k\) documents containing word \(j\) exactly \(\ell\) times
  • Run the classifier: \(\hat{h}^{\text{NB}}(\bfx)=\argmax_{k}\hat{\pi}_k\prod_{j=1}^d\left(\hat{\mu}_{j,k}\right)^{N_j}\) where \(N_j\) is the number of occurrences of word \(j\) in the document to classify

  • Weakness of the approach: a word may never appear in the training documents of a class yet appear at test time, making the estimated likelihood vanish
    • Use Laplace smoothing \[\hat{\mu}_{j,k}=\frac{1+\sum_{\ell}\ell N^{(j)}_{\ell,k}}{d+\sum_{j'=1}^d\sum_{\ell}\ell N^{(j')}_{\ell,k}}\]
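
  • Putting the last two slides together, a minimal sketch (assuming NumPy; representing each document by its length-\(d\) vector of word counts is an assumption of this sketch, and names are illustrative):

```python
# Minimal sketch of the bag-of-words naive Bayes classifier with Laplace smoothing.
import numpy as np

def fit_bow_nb(counts, y, K, alpha=1.0):
    """counts: (N, d) word-count matrix, y: (N,) class labels in 0..K-1."""
    N, d = counts.shape
    log_pi = np.log(np.bincount(y, minlength=K) / N)       # log pi_hat_k = log(N_k / N)
    log_mu = np.zeros((K, d))
    for k in range(K):
        class_counts = counts[y == k].sum(axis=0)           # total count of each word in class k
        # Laplace smoothing; alpha = 1 gives (1 + count_j) / (d + total count) as on the slide
        log_mu[k] = np.log((alpha + class_counts) / (alpha * d + class_counts.sum()))
    return log_pi, log_mu

def predict_bow_nb(x_counts, log_pi, log_mu):
    # argmax_k  log pi_hat_k + sum_j N_j log mu_hat_{j,k}
    return int(np.argmax(log_pi + log_mu @ x_counts))
```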

Linear Discriminant Analysis (LDA)

  • Consider \(\bfx={[x_1,\cdots,x_d]}^\intercal\in\bbR^d\) (random) feature vector and \(y\) the label
    • Assumption: given \(y=k\), the feature vector has a Gaussian distribution, \(\bfx|y{=}k\sim\calN(\boldsymbol{\mu}_k,\Sigma)\)
    • The mean is class dependent but the covariance matrix is not \[\phi(\bfx;\boldsymbol{\mu},\Sigma)\eqdef\frac{1}{(2\pi)^{\frac{d}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(\bfx-\boldsymbol{\mu})^\intercal\Sigma^{-1}(\bfx-\boldsymbol{\mu})\right)\]
  • Estimate parameters from data
    • \(\hat{\pi}_k=\frac{N_k}{N}\)
    • \(\hat{\boldsymbol{\mu}}_k=\frac{1}{N_k}\sum_{i:y_i=k}\bfx_i\)
    • \(\hat{\Sigma}=\frac{1}{N}\sum_{k=0}^{K-1}\sum_{i:y_i=k}(\bfx_i-\hat{\boldsymbol{\mu}}_k)(\bfx_i-\hat{\boldsymbol{\mu}}_k)^\intercal\) (recall assumption about covariance matrix)
  • The LDA classifier is \[h^{\text{LDA}}(\bfx)=\argmin_k \left(\frac{1}{2}(\bfx-\hat{\boldsymbol{\mu}}_k)^\intercal\hat{\Sigma}^{-1}(\bfx-\hat{\boldsymbol{\mu}}_k)-\log\hat{\pi}_k\right)\] For \(K=2\), LDA is a linear classifier: the quadratic term \(\bfx^\intercal\hat{\Sigma}^{-1}\bfx\) is common to both classes and cancels when comparing them
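
  • The plug-in procedure above translates directly into a few lines of code. A minimal sketch (assuming NumPy; function and variable names are illustrative, not from the notes):

```python
# Minimal sketch of plug-in LDA: class priors, class means, pooled covariance,
# and the argmin decision rule from the slide.
import numpy as np

def fit_lda(X, y, K):
    """X: (N, d) features, y: (N,) labels in 0..K-1."""
    N, d = X.shape
    pi = np.array([np.mean(y == k) for k in range(K)])                  # pi_hat_k
    mu = np.array([X[y == k].mean(axis=0) for k in range(K)])           # mu_hat_k
    # Single covariance matrix shared by all classes (pooled estimate)
    Sigma = sum((X[y == k] - mu[k]).T @ (X[y == k] - mu[k]) for k in range(K)) / N
    return pi, mu, Sigma

def predict_lda(x, pi, mu, Sigma):
    Sigma_inv = np.linalg.inv(Sigma)
    # argmin_k  0.5 (x - mu_k)^T Sigma^{-1} (x - mu_k) - log pi_hat_k
    scores = [0.5 * (x - mu[k]) @ Sigma_inv @ (x - mu[k]) - np.log(pi[k])
              for k in range(len(pi))]
    return int(np.argmin(scores))
```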

Linear Discriminant Analysis (ct’d)

  • Generative model rarely accurate

  • Number of parameters to estimate:
    • \(K-1\) class priors, \(Kd\) means, \(\frac{1}{2}d(d+1)\) elements of the covariance matrix (e.g., for \(d=100\) and \(K=2\): \(1+200+5050=5251\) parameters)
    • Works well if \(N\gg d\)
    • Works poorly if \(N\ll d\) without other tricks (dimensionality reduction, structured covariance)
  • Biggest concern: “one should solve the [classification] problem directly and never solve a more general problem as an intermediate step [such as modeling p(x|y)]” (Vapnik, 1998)

  • Revisit the binary classifier with LDA: \[\eta_1(\bfx) = \frac{\pi_1 \phi(\bfx;\boldsymbol{\mu}_1,\Sigma)}{\pi_1 \phi(\bfx;\boldsymbol{\mu}_1,\Sigma)+\pi_0 \phi(\bfx;\boldsymbol{\mu}_0,\Sigma)}=\frac{1}{1+\exp(-(\bfw^\intercal\bfx+b))}\] for some \(\bfw\in\bbR^d\) and \(b\in\bbR\) made explicit below

  • We do not need to estimate the full joint distribution!
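
  • Spelling out the algebra behind the last expression (a standard computation, reproduced here for completeness): because the covariance is shared, the quadratic terms \(\bfx^\intercal\Sigma^{-1}\bfx\) cancel in the log-odds, leaving \[\log\frac{\pi_1\phi(\bfx;\boldsymbol{\mu}_1,\Sigma)}{\pi_0\phi(\bfx;\boldsymbol{\mu}_0,\Sigma)}=\bfw^\intercal\bfx+b\quad\text{with}\quad \bfw=\Sigma^{-1}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_0),\quad b=\frac{1}{2}\boldsymbol{\mu}_0^\intercal\Sigma^{-1}\boldsymbol{\mu}_0-\frac{1}{2}\boldsymbol{\mu}_1^\intercal\Sigma^{-1}\boldsymbol{\mu}_1+\log\frac{\pi_1}{\pi_0}\] so the posterior \(\eta_1(\bfx)\) depends on the underlying distribution only through \(\bfw\) and \(b\)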