Matthieu R Bloch
The Bayes classifier requires knowledge of \(P_X\) and \(P_{Y|X}\)
We can learn \(P_X\) and \(P_{Y|X}\) from the data itself and plug the result into the Bayes classifier
All combinations are possible
Question: what are \(K\)-NN classifiers?
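As an aside (illustrative, not from the original slides), a \(K\)-NN classifier can be read as a plug-in rule: it estimates \(P_{Y|X}(k|\bfx)\) by the fraction of the \(K\) nearest training points with label \(k\) and picks the majority class. A minimal NumPy sketch, with an illustrative function name:

import numpy as np

def knn_predict(X_train, y_train, x, K=5):
    """Plug-in view of K-NN: estimate P(Y=k|x) by the fraction of the
    K nearest neighbors with label k, then take the argmax (majority vote)."""
    dists = np.linalg.norm(X_train - x, axis=1)        # Euclidean distances to x
    nearest = np.argsort(dists)[:K]                    # indices of the K closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                   # majority class among neighbors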
Consider a parametric density \(p_{\theta}(x)\) with unknown \(\theta\in\bbR^d\)
Assume that we have \(N\) data points \(\{x_i\}_{i=1}^N\) generated i.i.d. according to \(p_{\theta}\)
The likelihood is \(\mathcal{L}(\theta)\eqdef\P[\theta]{\{x_i\}_{i=1}^N}=\prod_{i=1}^N p_{\theta}(x_i)\)
The log-likelihood is \(\ell(\theta)\eqdef\log\mathcal{L}(\theta)=\log\P[\theta]{\{x_i\}_{i=1}^N}=\sum_{i=1}^N \log p_{\theta}(x_i)\)
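As a concrete check (illustrative, not from the slides), for an exponential density \(p_{\lambda}(x)=\lambda e^{-\lambda x}\) the log-likelihood is \(\ell(\lambda)=N\log\lambda-\lambda\sum_{i=1}^N x_i\), maximized at \(\hat{\lambda}=1/\bar{x}\); a short NumPy sketch comparing the closed form with a grid search:

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1000)     # i.i.d. samples, true lambda = 0.5

def log_likelihood(lam, x):
    # l(lambda) = sum_i log(lambda * exp(-lambda * x_i))
    return len(x) * np.log(lam) - lam * np.sum(x)

grid = np.linspace(0.01, 2.0, 2000)
lam_grid = grid[np.argmax([log_likelihood(lam, x) for lam in grid])]
lam_closed = 1.0 / np.mean(x)                 # closed-form MLE for the exponential
print(lam_grid, lam_closed)                   # both close to 0.5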
Assume \(y=\theta^\intercal\bfx+n\) where \(n\sim\calN(0,\sigma^2)\). Then \[\theta_{\text{MLE}} = \argmin_{\theta}\sum_{i=1}^N \abs{y_i-\theta^\intercal\bfx_i}^2\]
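Indeed, under this model \(y_i\sim\calN(\theta^\intercal\bfx_i,\sigma^2)\), so \[\ell(\theta)=\sum_{i=1}^N\log\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_i-\theta^\intercal\bfx_i)^2}{2\sigma^2}\right)=-\frac{N}{2}\log(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^N\abs{y_i-\theta^\intercal\bfx_i}^2,\] and maximizing \(\ell(\theta)\) over \(\theta\) is equivalent to minimizing the sum of squared errors.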
The maximum likelihood estimate of \(\pi_k\) is \(\hat{\pi}_k = \frac{N_k}{N}\) where \(N_k\eqdef \card{\{i:y_i=k\}}\)
Assume the \(j\)th feature \(x_j\) takes \(J\) distinct values \(\{0,\dots,J-1\}\). The maximum likelihood estimate of \(P_{x_j|y}(\ell|k)\) is \(\widehat{P}_{x_j|y}(\ell|k) = \frac{N^{(j)}_{\ell,k}}{N_k}\) where \(N^{(j)}_{\ell,k}\eqdef \card{\{i:y_i=k\text{ and } x_{i,j}=\ell\}}\)
The naive Bayes classifier is \(h^{\text{NB}}(\bfx)=\argmax_{k} \hat{\pi}_k\prod_{j=1}^d \widehat{P}_{x_j|y}(x_j|k)\)
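As an illustration (not from the original slides), a minimal NumPy sketch of this estimator for categorical features; function names are illustrative and no smoothing is applied, so unseen feature values get probability zero:

import numpy as np

def fit_naive_bayes(X, y, K, J):
    """X: (N, d) integer features in {0,...,J-1}; y: (N,) labels in {0,...,K-1}.
    Returns class priors pi_hat (K,) and conditionals P_hat (d, J, K)."""
    N, d = X.shape
    pi_hat = np.array([np.sum(y == k) for k in range(K)]) / N            # pi_k = N_k / N
    P_hat = np.zeros((d, J, K))
    for k in range(K):
        Xk = X[y == k]                                                   # samples of class k
        for j in range(d):
            for l in range(J):
                P_hat[j, l, k] = np.sum(Xk[:, j] == l) / max(len(Xk), 1) # N_{l,k}^{(j)} / N_k
    return pi_hat, P_hat

def predict_naive_bayes(x, pi_hat, P_hat):
    """h_NB(x) = argmax_k pi_k * prod_j P(x_j | k)."""
    d, J, K = P_hat.shape
    scores = [pi_hat[k] * np.prod([P_hat[j, x[j], k] for j in range(d)]) for k in range(K)]
    return int(np.argmax(scores))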
For every class \(k\), the \(i\)th word of a document is word \(j\) of the vocabulary with probability \(\P{\text{word } i = j|k} = \mu_{jk}\)
Likelihood of document \(\bfx\) in class \(k\) is \[\P{\bfx|k} = \prod_{i=1}^n \prod_j \mu_{jk}^{\indic{x_i=j}}=\prod_{j}\mu_{jk}^{N_{j}}\] where \(N_j\) is the number of occurrences of word \(j\) in the document.
Run the classifier: \(\hat{h}^{\text{NB}}(\bfx)=\argmax_{k}\hat{\pi}_k\prod_{j=1}^d\left(\hat{\mu}_{j,k}\right)^{x_j}\) where \(x_j\) is the number of occurrences of word \(j\) in \(\bfx\)
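A similar sketch (illustrative, not from the slides) for this word-count model, where \(\hat{\mu}_{j,k}\) is estimated as the fraction of word occurrences in class \(k\) that are word \(j\), and the product is evaluated in the log domain to avoid underflow; the constant eps is an assumption added to guard against \(\log 0\):

import numpy as np

def fit_multinomial_nb(X, y, K):
    """X: (N, d) word-count matrix; y: (N,) labels in {0,...,K-1}.
    Assumes every class appears at least once in the training set."""
    pi_hat = np.array([np.mean(y == k) for k in range(K)])               # class priors
    mu_hat = np.zeros((X.shape[1], K))
    for k in range(K):
        counts = X[y == k].sum(axis=0)                                   # total count of each word in class k
        mu_hat[:, k] = counts / counts.sum()                             # mu_{j,k}
    return pi_hat, mu_hat

def predict_multinomial_nb(x, pi_hat, mu_hat, eps=1e-12):
    # log pi_k + sum_j x_j log mu_{j,k}; eps is an added assumption for unseen words
    log_scores = np.log(pi_hat) + x @ np.log(mu_hat + eps)
    return int(np.argmax(log_scores))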
The LDA classifier is \[h^{\text{LDA}}(\bfx)=\argmin_k \left(\frac{1}{2}(\bfx-\hat{\boldsymbol{\mu}}_k)^\intercal\hat{\Sigma}^{-1}(\bfx-\hat{\boldsymbol{\mu}}_k)-\log\hat{\pi}_k\right)\] For \(K=2\), the LDA classifier is linear
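A minimal sketch of the plug-in LDA classifier (illustrative, not from the slides), using empirical class means and a pooled maximum likelihood covariance; the small ridge term reg is an assumption added here to keep \(\hat{\Sigma}\) invertible:

import numpy as np

def fit_lda(X, y, K, reg=1e-6):
    """Estimate class priors, class means, and a shared (pooled) covariance."""
    N, d = X.shape
    pi_hat = np.array([np.mean(y == k) for k in range(K)])
    mu_hat = np.array([X[y == k].mean(axis=0) for k in range(K)])        # (K, d) class means
    Sigma_hat = np.zeros((d, d))
    for k in range(K):
        Z = X[y == k] - mu_hat[k]
        Sigma_hat += Z.T @ Z
    Sigma_hat = Sigma_hat / N + reg * np.eye(d)                          # pooled ML covariance + small ridge (assumption)
    return pi_hat, mu_hat, np.linalg.inv(Sigma_hat)

def predict_lda(x, pi_hat, mu_hat, Sigma_inv):
    # argmin_k (1/2)(x - mu_k)^T Sigma^{-1} (x - mu_k) - log pi_k
    scores = [0.5 * (x - m) @ Sigma_inv @ (x - m) - np.log(p)
              for m, p in zip(mu_hat, pi_hat)]
    return int(np.argmin(scores))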
Generative models are rarely accurate
Biggest concern: “one should solve the [classification] problem directly and never solve a more general problem as an intermediate step [such as modeling \(p(x|y)\)]” (Vapnik, 1998)
Revisit the binary classifier with LDA \[\eta_1(\bfx) = \frac{\pi_1 \phi(\bfx;\boldsymbol{\mu}_1,\Sigma)}{\pi_1 \phi(\bfx;\boldsymbol{\mu}_1,\Sigma)+\pi_0 \phi(\bfx;\boldsymbol{\mu}_0,\Sigma)}=\frac{1}{1+\exp(-(\bfw^\intercal\bfx+b))}\]
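The quadratic terms \(\bfx^\intercal\Sigma^{-1}\bfx\) cancel in the log-likelihood ratio, leaving \[\bfw=\Sigma^{-1}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_0),\qquad b=\log\frac{\pi_1}{\pi_0}-\frac{1}{2}\boldsymbol{\mu}_1^\intercal\Sigma^{-1}\boldsymbol{\mu}_1+\frac{1}{2}\boldsymbol{\mu}_0^\intercal\Sigma^{-1}\boldsymbol{\mu}_0.\]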
We do not need to estimate the full joint distribution!
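For instance (an illustrative sketch, not from the slides), \(\bfw\) and \(b\) can be fitted directly by gradient ascent on the conditional log-likelihood \(\sum_i\log P(y_i|\bfx_i)\), i.e., logistic regression:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    """Gradient ascent on the conditional log-likelihood sum_i log P(y_i | x_i)
    for the model eta_1(x) = sigmoid(w^T x + b), with y in {0, 1}."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)              # current estimates of eta_1(x_i)
        grad_w = X.T @ (y - p) / N          # gradient of the average log-likelihood
        grad_b = np.mean(y - p)
        w += lr * grad_w
        b += lr * grad_b
    return w, b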