Nearest Neighbors Classifiers

Matthieu R Bloch

Nearest neighbor classifier

  • Back to our training dataset \(\calD\eqdef\{(\bfx_1,y_1),\cdots,(\bfx_N,y_N)\}\)

  • The nearest-neighbor (NN) classifier is \(h^{\text{NN}}(\bfx)\eqdef y_{\text{NN}(\bfx)}\) where \(\text{NN}(\bfx)\eqdef \argmin_i \norm{\bfx_i-\bfx}\) (a brute-force sketch follows this list)

  • Risk of NN classifier conditioned on \(\bfx\) and \(\bfx_{\text{NN}(\bfx)}\): the label at \(\bfx\) and the label of its nearest neighbor are drawn independently given their respective features, so \[ R_{\text{NN}}(\bfx,\bfx_{\text{NN}(\bfx)}) = \sum_{k}\eta_k(\bfx_{\text{NN}(\bfx)})(1-\eta_k(\bfx))= \sum_{k}\eta_k(\bfx)(1-\eta_k(\bfx_{\text{NN}(\bfx)})). \]
    • How well does the average risk \(R_{\text{NN}}=R(h^{\text{NN}})\) compare to the Bayes risk for large \(N\)?
  • Let \(\bfx\), \(\{\bfx_i\}_{i=1}^N\) be i.i.d. \(\sim P_{\bfx}\) in a separable metric space \(\calX\). Let \(\bfx_{\text{NN}(\bfx)}\) be the nearest neighbor of \(\bfx\). Then \(\bfx_{\text{NN}(\bfx)} \to \bfx\) with probability one as \(N\to\infty\)

  • Let \(\calX\) be a separable metric space. Let \(p(\bfx|y=0)\), \(p(\bfx|y=1)\) be such that, with probability one, \(\bfx\) is either a continuity point of \(p(\bfx|y=0)\) and \(p(\bfx|y=1)\) or a point of non-zero probability measure. Then \[R(h^{\text{B}}) \leq \lim_{N\to\infty}\E{R(h^{\text{NN}})}\leq 2R(h^{\text{B}})(1-R(h^{\text{B}}))\]
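
The 1-NN rule above is straightforward to prototype. Here is a minimal brute-force sketch in Python/NumPy (the names `nn_classify`, `X_train`, `y_train`, the toy data, and the Euclidean metric are illustrative assumptions, not part of the lecture):

```python
import numpy as np

def nn_classify(x, X_train, y_train):
    """1-NN rule: label of the training point closest to x (Euclidean norm)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # \norm{x_i - x} for every i
    return y_train[np.argmin(dists)]              # y_{NN(x)}

# Tiny made-up example
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y_train = np.array([0, 1, 1])
print(nn_classify(np.array([0.2, 0.1]), X_train, y_train))   # -> 0
```

Note that the brute-force search scans all \(N\) training points per query, which is exactly the computational cost revisited at the end of these notes.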

K Nearest neighbors classifier

  • Can drive the risk of the NN classifier to the Bayes risk by increasing the size of the neighborhood
    • Assign label to \(\bfx\) by taking majority vote among \(K\) nearest neighbors \(h^\text{$K$-NN}\) \[\lim_{N\to\infty}\E{R(h^{\text{$K$-NN}})}\leq \left(1+\sqrt{\frac{2}{K}}\right)R(h^{\text{B}})\]
  • Let \(\hat{h}_N\) be a classifier learned from \(N\) data points; \(\hat{h}_N\) is consistent if \(\E{R(\hat{h}_N)}\to R(h^{\text{B}})\) as \(N\to\infty\).

  • If \(N\to\infty\), \(K\to\infty\), \(K/N\to 0\), then \(h^{\text{$K$-NN}}\) is consistent

  • Choosing \(K\) is a problem of model selection
    • Do not choose \(K\) by minimizing the empirical risk on the training data: \[\widehat{R}_N(h^{\text{$1$-NN}}) = \frac{1}{N}\sum_{i=1}^N\indic{h^{\text{$1$-NN}}(\bfx_i)\neq y_i}=0\]
    • Need to rely on estimates from model selection techniques (more later!); a held-out validation sketch follows this list
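
To make both points concrete, the sketch below implements the majority vote and shows that the 1-NN training error is identically zero, so \(K\) has to be chosen on held-out data. It is a non-authoritative illustration: the Gaussian data is made up, and the simple train/validation split stands in for the model selection techniques discussed later.

```python
import numpy as np

def knn_classify(x, X_train, y_train, K):
    """K-NN rule: majority vote among the K training points closest to x."""
    dists = np.linalg.norm(X_train - x, axis=1)          # distances to all training points
    nearest = np.argsort(dists)[:K]                       # indices of the K nearest neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                      # majority label (ties broken arbitrarily)

def error_rate(X, y, X_train, y_train, K):
    preds = np.array([knn_classify(x, X_train, y_train, K) for x in X])
    return np.mean(preds != y)

# Made-up data: two overlapping Gaussian classes, split into train/validation
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)), rng.normal(1.5, 1.0, size=(100, 2))])
y = np.r_[np.zeros(100, dtype=int), np.ones(100, dtype=int)]
perm = rng.permutation(200)
X_train, y_train = X[perm[:150]], y[perm[:150]]
X_val, y_val = X[perm[150:]], y[perm[150:]]

print(error_rate(X_train, y_train, X_train, y_train, K=1))   # 0.0 -- training error says nothing
for K in (1, 3, 5, 11, 21):
    print(K, error_rate(X_val, y_val, X_train, y_train, K))   # held-out error guides the choice of K
```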

K Nearest neighbors classifier

  • Given enough data, a \(K\)-NN classifier will do essentially as well as any other method (by consistency, its expected risk approaches the Bayes risk)

  • The number of samples \(N\) required can be huge (especially in high-dimension)

  • The choice of \(K\) matters a lot, model selection is important
    • Need to use cross-validation
  • Finding the nearest neighbors out of millions of datapoints is still computationally hard
    • \(K\)-d trees help, but are still expensive in high dimension when \(N\approx d\) (a lookup sketch follows this list)
  • We will discuss other classifiers that make more assumptions about the underlying data
    • In general, more structure \(\approx\) simpler model \(\approx\) computational benefits
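
As a final illustration of the computational point, here is a sketch of a faster neighbor lookup built on SciPy's `cKDTree` (an off-the-shelf k-d tree; the data and parameters below are made up, and the speedup degrades as the dimension grows, as noted above):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100_000, 3))        # low dimension: the favorable case for k-d trees
y_train = rng.integers(0, 2, size=100_000)

tree = cKDTree(X_train)                        # built once over the training features

def knn_classify_kdtree(x, K=5):
    """Majority vote among the K neighbors returned by the k-d tree query."""
    _, idx = tree.query(x, k=K)                # indices of the K nearest training points
    labels, counts = np.unique(y_train[idx], return_counts=True)
    return labels[np.argmax(counts)]

print(knn_classify_kdtree(np.zeros(3)))
```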