Many possibilities to assign a rank \(r(j)\) to feature \(j\)
Misclassification rate: \[r(j)\eqdef\frac{1}{N}\sum_{i=1}^N\indic{y_i\neq\theta(x_{i,j})}\quad\textsf{for some classifier}\quad \theta\]
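A minimal numpy sketch of this criterion, assuming \(\theta\) is a decision stump trained on feature \(j\) alone (the classifier choice is left open above) and labels in \(\{-1,+1\}\); the function name is illustrative:

```python
import numpy as np

def misclassification_rank(X, y):
    """Error of the best decision stump trained on each feature alone.
    X: (N, d) feature matrix, y: (N,) labels in {-1, +1}; lower is better."""
    N, d = X.shape
    r = np.empty(d)
    for j in range(d):
        xs = np.sort(X[:, j])
        thresholds = (xs[:-1] + xs[1:]) / 2        # candidate split points
        best = 1.0
        for t in thresholds:
            for sign in (+1, -1):                  # theta(x) = +/- sign(x - t)
                pred = np.where(X[:, j] > t, sign, -sign)
                best = min(best, np.mean(pred != y))
        r[j] = best
    return r
```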
Two-sample t-test statistic: \[r(j)\eqdef \frac{\abs{\overline{x_j^{+}}-\overline{x_j^{-}}}}{s/\sqrt{n}}\] where \(\overline{x_j^{\pm}}\) are the class means for feature \(j\) and \(s\) is the pooled sample standard deviation; larger values indicate a more discriminative feature
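A sketch with sample estimates, assuming labels in \(\{-1,+1\}\); the denominator uses the standard pooled two-sample scaling \(s\sqrt{1/n_+ + 1/n_-}\), which matches \(s/\sqrt{n}\) up to the class-size convention:

```python
import numpy as np

def t_test_rank(X, y):
    """Two-sample t-statistic per feature; larger = more discriminative."""
    pos, neg = X[y == +1], X[y == -1]
    n_pos, n_neg = len(pos), len(neg)
    mean_gap = np.abs(pos.mean(axis=0) - neg.mean(axis=0))
    # pooled sample standard deviation, per feature
    s = np.sqrt(((n_pos - 1) * pos.var(axis=0, ddof=1)
                 + (n_neg - 1) * neg.var(axis=0, ddof=1)) / (n_pos + n_neg - 2))
    return mean_gap / (s * np.sqrt(1.0 / n_pos + 1.0 / n_neg))
```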
Margin: for separable data, compute \[ r(j)\eqdef \min_{k:y_k=+1,\ \ell:y_\ell=-1}\abs{x_{k,j}-x_{\ell,j}}\] or use order statistics if the data is not separable (to ignore outliers); see the sketch below
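A sketch of the margin criterion under the same label convention; the comment shows how order statistics could replace the min for non-separable data:

```python
import numpy as np

def margin_rank(X, y):
    """Smallest cross-class gap |x_{k,j} - x_{l,j}| per feature; larger = better."""
    pos, neg = X[y == +1], X[y == -1]
    diffs = np.abs(pos[:, None, :] - neg[None, :, :])    # shape (n+, n-, d)
    # non-separable variant: replace the min with a low quantile, e.g.
    # np.quantile(diffs.reshape(-1, X.shape[1]), 0.05, axis=0)
    return diffs.min(axis=(0, 1))
```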
Again, many possibilities
Correlation coefficient: \[r(j)\eqdef \abs{\rho(j)}\textsf{ with }\rho(j)\eqdef\frac{\textsf{Cov}(x_j,y)}{\sqrt{\Var{x_j}\Var{y}}}\] with sample estimates
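The sample version of this score is one line per feature; a sketch assuming numpy arrays:

```python
import numpy as np

def correlation_rank(X, y):
    """|Pearson correlation| between each feature and the label, sample estimates."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    cov = Xc.T @ yc / (len(y) - 1)                       # sample covariances
    return np.abs(cov / (X.std(axis=0, ddof=1) * y.std(ddof=1)))
```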
Mutual information: \[I(X;Y)=\sum_{x,y}p(x,y)\log \frac{p(x,y)}{p(x)p(y)}\] measures how much \(X\) tells us about \(Y\). Could rank features with \[r(j)\eqdef I(x_j;y)\] using plug-in estimates of the probabilities
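A crude plug-in sketch that bins each feature and uses empirical frequencies for the probabilities; the bin count and estimator are illustrative choices, not prescribed above:

```python
import numpy as np

def mutual_information_rank(X, y, bins=10):
    """Plug-in estimate of I(x_j; y): bin each feature, use empirical frequencies."""
    r = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        xb = np.digitize(X[:, j], np.histogram_bin_edges(X[:, j], bins=bins))
        mi = 0.0
        for b in np.unique(xb):
            for c in np.unique(y):
                pxy = np.mean((xb == b) & (y == c))
                if pxy > 0:
                    mi += pxy * np.log(pxy / (np.mean(xb == b) * np.mean(y == c)))
        r[j] = mi
    return r
```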
Ranking is legitimate but can lead to the selection of redundant features: two nearly identical features rank equally well yet provide no complementary information
One solution is incremental maximization: greedily grow the selected set, adding at each step the feature that most improves the criterion given those already chosen (see the sketch below)
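A sketch of the greedy loop, with `score` a user-supplied set criterion (e.g., validation accuracy, or relevance minus redundancy with already-selected features); the interface is hypothetical:

```python
def greedy_forward_selection(score, d, k):
    """Incrementally grow a feature set of size k out of d features:
    at each step, add the feature whose inclusion maximizes score(S)."""
    selected = []
    for _ in range(k):
        remaining = [j for j in range(d) if j not in selected]
        selected.append(max(remaining, key=lambda j: score(selected + [j])))
    return selected
```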
The biggest drawback of filtering is its inability to capture interactions between features: two features that are individually uninformative can be jointly discriminative