Plugin classifiers

Matthieu R Bloch

Plugin methods

  • The Bayes classifier requires the knowledge of \(P_X\) and \(P_{Y|X}\)

  • We can learn \(P_X\) and \(P_{Y|X}\) from the data itself and plug the result into the Bayes classifier

  • In fact, the Bayes classifier only requires \(P_{Y|X}\)
    • Generative classifier: learn \(P_X\) and \(P_{Y|X}\) (can compute \(P_{X|Y}\))
    • Discriminative classifier: learn \(P_{Y|X}\) directly
  • Models can be of two types
    • Parametric: strong assumption about underlying distribution (e.g., Gaussian, Multinomial, etc.)
    • Non-Parametric: no assumption about form of underlying distribution
  • All four combinations (generative/discriminative, parametric/non-parametric) are possible

  • Question: where do \(K\)-NN classifiers fit in this taxonomy?

Maximum likelihood estimation

  • Consider a parametric density \(p_{\theta}(x)\) with unknown \(\theta\in\bbR^d\)

  • Assume that we have i.i.d. generated data points \(\{x_i\}_{i=1}^N\)

  • The likelihood is \(\mathcal{L}(\theta)\eqdef\P[\theta]{\{x_i\}_{i=1}^N}=\prod_{i=1}^N p_{\theta}(x_i)\)

  • The log-likelihood is \(\ell(\theta)\eqdef\log\mathcal{L}(\theta)=\log\P[\theta]{\{x_i\}_{i=1}^N}=\sum_{i=1}^N \log p_{\theta}(x_i)\)

  • The maximum likelihood estimator (MLE) is \(\theta_{\text{ML}} = \argmax_{\theta}\mathcal{L}(\theta)\)
    • Sometimes, there exists a closed form solution
    • More often than not, we need to resort to numerical solutions
  • Assume \(y=\theta^\intercal\bfx+n\) where \(n\sim\calN(0,\sigma^2)\). Then \[\theta_{\text{ML}} = \argmin_{\theta}\sum_{i=1}^N \abs{y_i-\theta^\intercal\bfx_i}^2\] i.e., maximizing the Gaussian log-likelihood amounts to minimizing the sum of squared residuals (least squares)
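
  • As a quick check of this equivalence, here is a minimal sketch (assuming NumPy and synthetic data; all variable names are illustrative, not from the notes):

```python
# Minimal sketch: for y = theta^T x + n with n ~ N(0, sigma^2), the MLE of theta
# is the least-squares solution. The data below is synthetic and purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 200
theta_true = rng.normal(size=d)                        # hypothetical ground truth
X = rng.normal(size=(N, d))                            # rows are the feature vectors x_i
y = X @ theta_true + rng.normal(scale=0.5, size=N)     # y_i = theta^T x_i + n_i

# theta_ML = argmin_theta sum_i |y_i - theta^T x_i|^2  (least squares)
theta_ml, *_ = np.linalg.lstsq(X, y, rcond=None)

# The negative log-likelihood equals the residual sum of squares up to scaling by
# 1/(2 sigma^2) and an additive constant, so minimizing one minimizes the other.
rss = np.sum((y - X @ theta_ml) ** 2)
print(theta_ml, rss)
```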

Naive Bayes

  • Consider \(\bfx={[x_1,\cdots,x_d]}^\intercal\in\bbR^d\) (random) feature vector and \(y\) the label
    • Naive assumption: given \(y\), the features \(\{x_i\}_{i=1}^d\) of \(\bfx\) are independent, i.e., \(P_{\bfx|y} = \prod_{i=1}^d P_{x_i|y}\)
    • Main benefit: only univariate densities \(P_{x_i|y}\) are needed, and discrete and continuous features can be combined
    • The assumption can be completely wrong (example: the bivariate Gaussian case)
  • Procedure
    • Estimate a priori class probabilities: \(\pi_k\) for \(0\leq k\leq K-1\)
    • Estimate class conditional densities \(p_{x_i|y}(x|k)\) for \(1\leq i\leq d\) and \(0\leq k\leq K-1\)
  • The maximum likelihood estimate of \(\pi_k\) is \(\hat{\pi}_k = \frac{N_k}{N}\) where \(N_k\eqdef \card{\{i:y_i=k\}}\)

  • Assume the \(j\)th feature \(x_j\) takes \(J\) distinct values \(\{0,\dots,J-1\}\). The maximum likelihood estimate of \(P_{x_j|y}(\ell|k)\) is \(\widehat{P}_{x_j|y}(\ell|k) = \frac{N^{(j)}_{\ell,k}}{N_k}\) where \(N^{(j)}_{\ell,k}\eqdef \card{\{i:y_i=k\text{ and } x_{i,j}=\ell\}}\) and \(x_{i,j}\) denotes the \(j\)th feature of \(\bfx_i\)

  • The naive Bayes classifier is \(h^{\text{NB}}(\bfx)=\argmax_{k} \hat{\pi}_k\prod_{j=1}^d \widehat{P}_{x_j|y}(x_j|k)\)

  • \(P_{\bfx|y}\) often modeled with parametric models:
    • Continuous features: often Gaussian, use ML estimate
    • Discrete (categorical) features: often Multinomial, use ML estimate
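
  • As a sketch of the procedure above (assuming NumPy; the function names and data layout are illustrative, not from the notes), for discrete features taking values in \(\{0,\dots,J-1\}\):

```python
# Minimal sketch of the plug-in naive Bayes procedure for discrete features.
import numpy as np

def fit_naive_bayes(X, y, K, J):
    """X: (N, d) integer features, y: (N,) labels in 0..K-1."""
    N, d = X.shape
    pi = np.array([np.mean(y == k) for k in range(K)])    # pi_hat_k = N_k / N
    # cond[k, j, l] = P_hat_{x_j|y}(l|k) = N^{(j)}_{l,k} / N_k
    cond = np.zeros((K, d, J))
    for k in range(K):
        Xk = X[y == k]                                    # training points of class k
        for j in range(d):
            for l in range(J):
                cond[k, j, l] = np.mean(Xk[:, j] == l)
    return pi, cond

def predict_naive_bayes(x, pi, cond):
    # h_NB(x) = argmax_k pi_hat_k * prod_j P_hat(x_j | k), computed in the log domain.
    # Note: a zero count gives log(0) = -inf, the issue Laplace smoothing (later) addresses.
    d = len(x)
    scores = [np.log(pi[k]) + sum(np.log(cond[k, j, x[j]]) for j in range(d))
              for k in range(len(pi))]
    return int(np.argmax(scores))
```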

Naive Bayes and Bag of Words

  • Bag-of-words model for document classification
    • Ignore positions of the words, focus on number of occurrences
    • Remove uninformative words, e.g., “the,” “an,” “of,” which are unlikely to matter (stop words)
  • Naive Bayes
    • Assume all features are conditionally independent given the class
  • Document classification
    • Think of document as \[ \bfx = \left[\text{word 1},\text{word 2},\cdots,\text{word $n$}\right]^\intercal\]
    • The dictionary contains \(d\) words, labeled by \(j\in\{1,\dots,d\}\)
  • For every class \(k\) and every position \(i\) in the document, \(\P{\text{word } i = j|k} = \mu_{j,k}\)

  • The likelihood of document \(\bfx\) in class \(k\) is \[\P{\bfx|k} = \prod_{i=1}^n \prod_{j=1}^d \mu_{j,k}^{\indic{x_i=j}}=\prod_{j=1}^{d}\mu_{j,k}^{N_{j}}\] where \(N_j\eqdef\sum_{i=1}^n\indic{x_i=j}\) is the number of occurrences of word \(j\) in the document.
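
  • For instance (a toy example added here for illustration), with a two-word dictionary and the document \(\bfx={[1,2,1]}^\intercal\), so that \(n=3\), \(N_1=2\), and \(N_2=1\), \[\P{\bfx|k}=\mu_{1,k}\cdot\mu_{2,k}\cdot\mu_{1,k}=\mu_{1,k}^{2}\,\mu_{2,k}\]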

Naive Bayes and Bag of Words (ct’d)

  • Estimate parameters
    • Compute the ML estimate of the class priors: \(\hat{\pi}_k = \frac{N_k}{N}\)
    • Compute the ML estimate of \(\mu_{j,k}\) from the total number of occurrences of word \(j\) across all class-\(k\) documents: \[\hat{\mu}_{j,k}=\frac{\sum_{\ell}\ell N^{(j)}_{\ell,k}}{\sum_{j'=1}^d\sum_{\ell}\ell N^{(j')}_{\ell,k}}\] where \(N^{(j)}_{\ell,k}\) is the number of class-\(k\) documents containing word \(j\) exactly \(\ell\) times
  • Run the classifier: \(\hat{h}^{\text{NB}}(\bfx)=\argmax_{k}\hat{\pi}_k\prod_{j=1}^d\left(\hat{\mu}_{j,k}\right)^{N_j}\) where \(N_j\) is the number of occurrences of word \(j\) in the document to classify

  • Weakness of the approach: a word may never appear in the training documents of a class yet appear at test time, making the estimated likelihood vanish
    • Use Laplace smoothing \[\hat{\mu}_{j,k}=\frac{1+\sum_{\ell}\ell N^{(j)}_{\ell,k}}{d+\sum_{j'=1}^d\sum_{\ell}\ell N^{(j')}_{\ell,k}}\]
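
  • Putting the last two slides together, a minimal sketch (assuming NumPy; representing each document by its length-\(d\) vector of word counts is an assumption of this sketch, and names are illustrative):

```python
# Minimal sketch of the bag-of-words naive Bayes classifier with Laplace smoothing.
import numpy as np

def fit_bow_nb(counts, y, K, alpha=1.0):
    """counts: (N, d) word-count matrix, y: (N,) class labels in 0..K-1."""
    N, d = counts.shape
    log_pi = np.log(np.bincount(y, minlength=K) / N)       # log pi_hat_k = log(N_k / N)
    log_mu = np.zeros((K, d))
    for k in range(K):
        class_counts = counts[y == k].sum(axis=0)           # total count of each word in class k
        # Laplace smoothing; alpha = 1 gives (1 + count_j) / (d + total count) as on the slide
        log_mu[k] = np.log((alpha + class_counts) / (alpha * d + class_counts.sum()))
    return log_pi, log_mu

def predict_bow_nb(x_counts, log_pi, log_mu):
    # argmax_k  log pi_hat_k + sum_j N_j log mu_hat_{j,k}
    return int(np.argmax(log_pi + log_mu @ x_counts))
```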

Linear Discriminant Analysis (LDA)

  • Consider \(\bfx={[x_1,\cdots,x_d]}^\intercal\in\bbR^d\) (random) feature vector and \(y\) the label
    • Assumption: given \(y=k\), the feature vector has a Gaussian distribution, \(\bfx|y{=}k\sim\calN(\boldsymbol{\mu}_k,\Sigma)\)
    • The mean is class dependent but the covariance matrix is not \[\phi(\bfx;\boldsymbol{\mu},\Sigma)\eqdef\frac{1}{(2\pi)^{\frac{d}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(\bfx-\boldsymbol{\mu})^\intercal\Sigma^{-1}(\bfx-\boldsymbol{\mu})\right)\]
  • Estimate parameters from data
    • \(\hat{\pi}_k=\frac{N_k}{N}\)
    • \(\hat{\boldsymbol{\mu}}_k=\frac{1}{N_k}\sum_{i:y_i=k}\bfx_i\)
    • \(\hat{\Sigma}=\frac{1}{N}\sum_{k=0}^{K-1}\sum_{i:y_i=k}(\bfx_i-\hat{\boldsymbol{\mu}}_k)(\bfx_i-\hat{\boldsymbol{\mu}}_k)^\intercal\) (recall assumption about covariance matrix)
  • The LDA classifier is \[h^{\text{LDA}}(\bfx)=\argmin_k \left(\frac{1}{2}(\bfx-\hat{\boldsymbol{\mu}}_k)^\intercal\hat{\Sigma}^{-1}(\bfx-\hat{\boldsymbol{\mu}}_k)-\log\hat{\pi}_k\right)\] For \(K=2\), LDA is a linear classifier: the quadratic term \(\bfx^\intercal\hat{\Sigma}^{-1}\bfx\) is common to both classes and cancels when comparing them
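
  • The plug-in procedure above translates directly into a few lines of code. A minimal sketch (assuming NumPy; function and variable names are illustrative, not from the notes):

```python
# Minimal sketch of plug-in LDA: class priors, class means, pooled covariance,
# and the argmin decision rule from the slide.
import numpy as np

def fit_lda(X, y, K):
    """X: (N, d) features, y: (N,) labels in 0..K-1."""
    N, d = X.shape
    pi = np.array([np.mean(y == k) for k in range(K)])                  # pi_hat_k
    mu = np.array([X[y == k].mean(axis=0) for k in range(K)])           # mu_hat_k
    # Single covariance matrix shared by all classes (pooled estimate)
    Sigma = sum((X[y == k] - mu[k]).T @ (X[y == k] - mu[k]) for k in range(K)) / N
    return pi, mu, Sigma

def predict_lda(x, pi, mu, Sigma):
    Sigma_inv = np.linalg.inv(Sigma)
    # argmin_k  0.5 (x - mu_k)^T Sigma^{-1} (x - mu_k) - log pi_hat_k
    scores = [0.5 * (x - mu[k]) @ Sigma_inv @ (x - mu[k]) - np.log(pi[k])
              for k in range(len(pi))]
    return int(np.argmin(scores))
```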

Linear Discriminant Analysis (ct’d)

  • Generative model rarely accurate

  • Number of parameters to estimate:
    • \(K-1\) class priors, \(Kd\) means, \(\frac{1}{2}d(d+1)\) elements of the covariance matrix (e.g., for \(d=100\) and \(K=2\): \(1+200+5050=5251\) parameters)
    • Works well if \(N\gg d\)
    • Works poorly if \(N\ll d\) without other tricks (dimensionality reduction, structured covariance)
  • Biggest concern: “one should solve the [classification] problem directly and never solve a more general problem as an intermediate step [such as modeling p(x|y)]” (Vapnik, 1998)

  • Revisit the binary classifier with LDA: \[\eta_1(\bfx) = \frac{\pi_1 \phi(\bfx;\boldsymbol{\mu}_1,\Sigma)}{\pi_1 \phi(\bfx;\boldsymbol{\mu}_1,\Sigma)+\pi_0 \phi(\bfx;\boldsymbol{\mu}_0,\Sigma)}=\frac{1}{1+\exp(-(\bfw^\intercal\bfx+b))}\] for some \(\bfw\in\bbR^d\) and \(b\in\bbR\) made explicit below

  • We do not need to estimate the full joint distribution!
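
  • Spelling out the algebra behind the last expression (a standard computation, reproduced here for completeness): because the covariance is shared, the quadratic terms \(\bfx^\intercal\Sigma^{-1}\bfx\) cancel in the log-odds, leaving \[\log\frac{\pi_1\phi(\bfx;\boldsymbol{\mu}_1,\Sigma)}{\pi_0\phi(\bfx;\boldsymbol{\mu}_0,\Sigma)}=\bfw^\intercal\bfx+b\quad\text{with}\quad \bfw=\Sigma^{-1}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_0),\quad b=\frac{1}{2}\boldsymbol{\mu}_0^\intercal\Sigma^{-1}\boldsymbol{\mu}_0-\frac{1}{2}\boldsymbol{\mu}_1^\intercal\Sigma^{-1}\boldsymbol{\mu}_1+\log\frac{\pi_1}{\pi_0}\] so the posterior \(\eta_1(\bfx)\) depends on the underlying distribution only through \(\bfw\) and \(b\)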