Model Selection

Matthieu Bloch

April 9, 2020

Model selection

  • A model is a mathematical representation of a function such as a classifier, regression function, etc.
    • Might have several free parameters not determined by the learning algorithm
    • Choice of free parameters has huge impact on performance
    • Choosing the value of free parameters is the problem of model selection
| Method | Parameter |
|---|---|
| Polynomial regression | polynomial degree \(d\) |
| Ridge regression | regularization parameter \(\lambda\) |
| SVMs | margin violation constraint \(C\) |
| Kernel methods | kernel choice |
| \(K\) nearest neighbors | number of neighbors \(K\) |

Model selection issue

  • We need to select appropriate values for the free parameters
    • All we have is the training data
    • We must use the training data to select the parameters
  • Free parameters usually control the balance between underfitting and overfitting
    • Left free because letting the training data determine them almost always leads to overfitting
    • Example 1: what happens if we let training data determine the degree in polynomial regression? (see the sketch after this list)
    • Example 2: what happens if we let training data set the number of neighbors in nearest-neighbor classification?
  • We have considered two approaches so far
    • VC approach: \(R(h)= \widehat{R}_N(h)+\textsf{excess risk}\)
    • Bias-variance approach: \(R(h)=\textsf{bias}^2+\textsf{variance}\)
  • Validation approach: try to estimate \(R(h)\) directly
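
As a concrete illustration of Example 1, here is a minimal numpy sketch (the synthetic data, noise level, and candidate degrees are assumptions made for illustration): selecting the polynomial degree by training error always favors the largest degree, while a held-out validation set flags the overfitting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D regression data (an assumption for this sketch)
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(30)

# Split into N - K = 20 training points and K = 10 validation points
x_tr, y_tr, x_val, y_val = x[:20], y[:20], x[20:], y[20:]

def empirical_risk(coef, x, y):
    """Average squared loss of the fitted polynomial on (x, y)."""
    return np.mean((np.polyval(coef, x) - y) ** 2)

for d in range(1, 9):
    coef = np.polyfit(x_tr, y_tr, d)  # least-squares fit of degree d
    print(d,
          empirical_risk(coef, x_tr, y_tr),    # decreases as d grows
          empirical_risk(coef, x_val, y_val))  # eventually increases: overfitting
```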

Principle of validation

  • In addition to the training data \(\calD\), suppose we have access to a separate validation set \(\calV\eqdef\{(\bfx_i,y_i)\}_{i=1}^K\)

  • Assume \(h\) is selected using the training set \(\calD\) alone, and use the validation set to form the estimate \[\widehat{R}_{\textsf{val}}(h)\eqdef\frac{1}{K}\sum_{i=1}^K\ell(h(\bfx_i),y_i)\]

  • How accurate is the estimate? \[\E{\widehat{R}_{\textsf{val}}(h)} = R(h)\qquad \Var{\widehat{R}_{\textsf{val}}(h)}=\frac{\sigma^2}{K}\textsf{ with }\sigma^2\eqdef\Var{\ell(h(\bfx),y)}\]
    • In general \(\widehat{R}_{\textsf{val}}(h)=R(h)\pm \calO(\frac{1}{\sqrt{K}})\) (the \(1/\sqrt{K}\) rate is illustrated in the sketch after this list)
  • Question: where is the validation set coming from?
    • Split the original set of \(N\) data points into training (\(N-K\) points) and validation (\(K\) points)?
    • \(\calD\eqdef\{(\bfx_i,y_i)\}_{i=1}^{N-K}\) and \(\calV\eqdef\{(\bfx_i,y_i)\}_{i=N-K+1}^{N}\)
    • Small \(K\) leads to poor estimation accuracy
    • Large \(K\) leads to high estimation accuracy… but of the risk of a hypothesis trained on only \(N-K\) points
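
A minimal simulation sketch of the accuracy statement above (the fixed classifier and its true risk are assumptions made for illustration): the validation estimate is unbiased, and its standard deviation shrinks like \(1/\sqrt{K}\).

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy setting: a fixed classifier h whose 0-1 loss on a fresh
# example is Bernoulli with true risk R(h) = 0.3, so sigma^2 = 0.3 * 0.7
true_risk = 0.3

def validation_estimate(K):
    """Validation risk estimate: average loss over K i.i.d. validation points."""
    return rng.binomial(1, true_risk, size=K).mean()

for K in [10, 100, 1000]:
    estimates = np.array([validation_estimate(K) for _ in range(10_000)])
    # mean stays ~ 0.3 (unbiased); std matches sqrt(0.3 * 0.7 / K)
    print(K, estimates.mean().round(3), estimates.std().round(3))
```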

Validation vs testing

  • How is validation different from testing?
    • \(\widehat{R}_{\textsf{val}}(h)\) can be used to make learning choices
    • If an estimate of \(R(h)\) affects learning, it is no longer testing
    • A test set is unbiased; a validation set has an optimistic bias
    • Assume two hypotheses \(h_1\) and \(h_2\) such that \(\E{R(h_1)}=\E{R(h_2)}=p\).
    • Assume \(\widehat{R}_{\textsf{val}}(h_1)\) and \(\widehat{R}_{\textsf{val}}(h_2)\) are independent and uniformly distributed on \([p-\eta,p+\eta]\).
    • Pick \(h=\argmin_{h_1,h_2}\widehat{R}_{\textsf{val}}(h)\). Then \(\E{\widehat{R}_{\textsf{val}}(h)}<p\): the selected estimate is optimistically biased even though \(\E{R(h)}=p\)
  • Using validation for model selection

  • Effect of bias (see the simulation sketch below)
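
A minimal simulation sketch of the two-hypothesis example above (the values of \(p\), \(\eta\), and the number of trials are arbitrary choices for illustration): the expectation of the minimum of the two uniform estimates is \(p-\eta/3\), strictly below \(p\).

```python
import numpy as np

rng = np.random.default_rng(2)

# Setup from the example: two independent validation estimates, each
# uniform on [p - eta, p + eta]; select the hypothesis with the smaller one
p, eta, trials = 0.5, 0.1, 1_000_000
e1 = rng.uniform(p - eta, p + eta, trials)
e2 = rng.uniform(p - eta, p + eta, trials)

selected = np.minimum(e1, e2)   # validation estimate of the chosen hypothesis
print(selected.mean())          # ~ p - eta/3 = 0.4667 < p: optimistic bias
```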

Data contamination

  • Three estimates of the risk \(R(h)\)
    • \(\widehat{R}_{\textsf{train}}(h)\): totally contaminated
    • \(\widehat{R}_{\textsf{test}}(h)\): totally clean
    • \(\widehat{R}_{\textsf{validation}}(h)\): partially contaminated
  • Dilemma: we would like \(R(h)\approx R(h^{-})\approx\widehat{R}_{\textsf{val}}(h^{-})\), where \(h^{-}\) is the hypothesis trained on the \(N-K\) points left after removing the validation set

  • Can we do this?
    • Leave-one-out cross validation
    • \(k\)-fold cross validation
  • Remarks
    • For \(k\)-fold cross validation, the estimate depends on the particular choice of partition
    • It is common to form several estimates based on different random partitions and then average them
    • When using \(k\)-fold cross validation for classification, you should ensure that each of the sets \(\calD_j\) contains training data from each class in the same proportion as in the full data set (stratification)
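
A minimal sketch of stratified \(k\)-fold cross validation for choosing \(K\) in nearest-neighbor classification, using scikit-learn's StratifiedKFold (the synthetic data set, the candidate values of \(K\), and the choice of 5 folds are assumptions made for illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data set (an assumption for this sketch)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Stratified folds: each fold keeps the class proportions of the full data set
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

for K in [1, 3, 5, 9, 15]:
    fold_errors = []
    for train_idx, val_idx in skf.split(X, y):
        clf = KNeighborsClassifier(n_neighbors=K).fit(X[train_idx], y[train_idx])
        fold_errors.append(1.0 - clf.score(X[val_idx], y[val_idx]))
    # Cross-validation estimate of the risk for this choice of K
    print(K, np.mean(fold_errors).round(3))
```

Averaging over several random partitions, as suggested in the remarks above, further reduces the dependence of the estimate on the particular split.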