Nearest Neighbors Classifiers

Matthieu R Bloch

Nearest neighbor classifier

  • Back to our training dataset \(\calD\eqdef\{(\bfx_1,y_1),\cdots,(\bfx_N,y_N)\}\)

  • The nearest-neighbor (NN) classifier is \(h^{\text{NN}}(\bfx)\eqdef y_{\text{NN}(\bfx)}\) where \(\text{NN}(\bfx)\eqdef \argmin_i \norm{\bfx_i-\bfx}\) (a brute-force sketch follows this list)

  • Risk of NN classifier conditioned on \(\bfx\) and \(\bfx_{\text{NN}(\bfx)}\): the label at \(\bfx\) and the label of its nearest neighbor are drawn independently given their respective features, so \[ R_{\text{NN}}(\bfx,\bfx_{\text{NN}(\bfx)}) = \sum_{k}\eta_k(\bfx_{\text{NN}(\bfx)})(1-\eta_k(\bfx))= \sum_{k}\eta_k(\bfx)(1-\eta_k(\bfx_{\text{NN}(\bfx)})). \]
    • How well does the average risk \(R_{\text{NN}}=R(h^{\text{NN}})\) compare to the Bayes risk for large \(N\)?
  • Let \(\bfx\), \(\{\bfx_i\}_{i=1}^N\) be i.i.d. \(\sim P_{\bfx}\) in a separable metric space \(\calX\). Let \(\bfx_{\text{NN}(\bfx)}\) be the nearest neighbor of \(\bfx\). Then \(\bfx_{\text{NN}(\bfx)} \to \bfx\) with probability one as \(N\to\infty\)

  • Let \(\calX\) be a separable metric space. Let \(p(\bfx|y=0)\), \(p(\bfx|y=1)\) be such that, with probability one, \(\bfx\) is either a continuity point of \(p(\bfx|y=0)\) and \(p(\bfx|y=1)\) or a point of non-zero probability measure. Then \[R(h^{\text{B}}) \leq \lim_{N\to\infty}\E{R(h^{\text{NN}})}\leq 2R(h^{\text{B}})(1-R(h^{\text{B}}))\]
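
The 1-NN rule above is straightforward to prototype. Here is a minimal brute-force sketch in Python/NumPy (the names `nn_classify`, `X_train`, `y_train`, the toy data, and the Euclidean metric are illustrative assumptions, not part of the lecture):

```python
import numpy as np

def nn_classify(x, X_train, y_train):
    """1-NN rule: label of the training point closest to x (Euclidean norm)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # \norm{x_i - x} for every i
    return y_train[np.argmin(dists)]              # y_{NN(x)}

# Tiny made-up example
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y_train = np.array([0, 1, 1])
print(nn_classify(np.array([0.2, 0.1]), X_train, y_train))   # -> 0
```

Note that the brute-force search scans all \(N\) training points per query, which is exactly the computational cost revisited at the end of these notes.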

K Nearest neighbors classifier

  • Can drive the risk of the NN classifier to the Bayes risk by increasing the size of the neighborhood
    • Assign label to \(\bfx\) by taking majority vote among \(K\) nearest neighbors \(h^\text{$K$-NN}\) \[\lim_{N\to\infty}\E{R(h^{\text{$K$-NN}})}\leq \left(1+\sqrt{\frac{2}{K}}\right)R(h^{\text{B}})\]
  • Let \(\hat{h}_N\) be a classifier learned from \(N\) data points; \(\hat{h}_N\) is consistent if \(\E{R(\hat{h}_N)}\to R(h^{\text{B}})\) as \(N\to\infty\).

  • If \(N\to\infty\), \(K\to\infty\), \(K/N\to 0\), then \(h^{\text{$K$-NN}}\) is consistent

  • Choosing \(K\) is a problem of model selection
    • Do not choose \(K\) by minimizing the empirical risk on the training data: \[\widehat{R}_N(h^{\text{$1$-NN}}) = \frac{1}{N}\sum_{i=1}^N\indic{h^{\text{$1$-NN}}(\bfx_i)\neq y_i}=0\]
    • Need to rely on estimates from model selection techniques (more later!); a held-out validation sketch follows this list
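
To make both points concrete, the sketch below implements the majority vote and shows that the 1-NN training error is identically zero, so \(K\) has to be chosen on held-out data. It is a non-authoritative illustration: the Gaussian data is made up, and the simple train/validation split stands in for the model selection techniques discussed later.

```python
import numpy as np

def knn_classify(x, X_train, y_train, K):
    """K-NN rule: majority vote among the K training points closest to x."""
    dists = np.linalg.norm(X_train - x, axis=1)          # distances to all training points
    nearest = np.argsort(dists)[:K]                       # indices of the K nearest neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                      # majority label (ties broken arbitrarily)

def error_rate(X, y, X_train, y_train, K):
    preds = np.array([knn_classify(x, X_train, y_train, K) for x in X])
    return np.mean(preds != y)

# Made-up data: two overlapping Gaussian classes, split into train/validation
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)), rng.normal(1.5, 1.0, size=(100, 2))])
y = np.r_[np.zeros(100, dtype=int), np.ones(100, dtype=int)]
perm = rng.permutation(200)
X_train, y_train = X[perm[:150]], y[perm[:150]]
X_val, y_val = X[perm[150:]], y[perm[150:]]

print(error_rate(X_train, y_train, X_train, y_train, K=1))   # 0.0 -- training error says nothing
for K in (1, 3, 5, 11, 21):
    print(K, error_rate(X_val, y_val, X_train, y_train, K))   # held-out error guides the choice of K
```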

K Nearest neighbors classifier

  • Given enough data, a \(K\)-NN classifier will do essentially as well as any other method (by consistency, its expected risk approaches the Bayes risk)

  • The number of samples \(N\) required can be huge (especially in high-dimension)

  • The choice of \(K\) matters a lot, model selection is important
    • Need to use cross-validation
  • Finding the nearest neighbors out of millions of datapoints is still computationally hard
    • \(K\)-d trees help, but are still expensive in high dimension when \(N\approx d\) (a lookup sketch follows this list)
  • We will discuss other classifiers that make more assumptions about the underlying data
    • In general, more structure \(\approx\) simpler model \(\approx\) computational benefits
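
As a final illustration of the computational point, here is a sketch of a faster neighbor lookup built on SciPy's `cKDTree` (an off-the-shelf k-d tree; the data and parameters below are made up, and the speedup degrades as the dimension grows, as noted above):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100_000, 3))        # low dimension: the favorable case for k-d trees
y_train = rng.integers(0, 2, size=100_000)

tree = cKDTree(X_train)                        # built once over the training features

def knn_classify_kdtree(x, K=5):
    """Majority vote among the K neighbors returned by the k-d tree query."""
    _, idx = tree.query(x, k=K)                # indices of the K nearest training points
    labels, counts = np.unique(y_train[idx], return_counts=True)
    return labels[np.argmax(counts)]

print(knn_classify_kdtree(np.zeros(3)))
```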