Model Selection

Matthieu Bloch

April 9, 2020

Model selection

  • A model is a mathematical representation of a function such as a classifier, regression function, etc.
    • Might have several free parameters not determined by the learning algorithm
    • Choice of free parameters has huge impact on performance
    • Choosing the value of free parameters is the problem of model selection
| Method | Parameter |
|---|---|
| Polynomial regression | polynomial degree \(d\) |
| Ridge regression | regularization parameter \(\lambda\) |
| SVMs | margin violation constraint \(C\) |
| Kernel methods | kernel choice |
| \(K\) nearest neighbors | number of neighbors \(K\) |

Model selection issue

  • We need to select appropriate values for the free parameters
    • All we have is the training data
    • We must use the training data to select the parameters
  • Free parameters usually control the balance between underfitting and overfitting
    • Left free because letting the training data determine them almost always leads to overfitting
    • Example 1: what happens if we let training data determine the degree in polynomial regression? (see the sketch after this list)
    • Example 2: what happens if we let training data set the number of neighbors in nearest-neighbor classification?
  • We have considered two approaches so far
    • VC approach: \(R(h)= \widehat{R}_N(h)+\textsf{excess risk}\)
    • Bias-variance approach: \(R(h)=\textsf{bias}^2+\textsf{variance}\)
  • Validation approach: try to estimate \(R(h)\) directly
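
As a concrete illustration of Example 1, here is a minimal numpy sketch (the synthetic data, noise level, and candidate degrees are assumptions made for illustration): selecting the polynomial degree by training error always favors the largest degree, while a held-out validation set flags the overfitting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D regression data (an assumption for this sketch)
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(30)

# Split into N - K = 20 training points and K = 10 validation points
x_tr, y_tr, x_val, y_val = x[:20], y[:20], x[20:], y[20:]

def empirical_risk(coef, x, y):
    """Average squared loss of the fitted polynomial on (x, y)."""
    return np.mean((np.polyval(coef, x) - y) ** 2)

for d in range(1, 9):
    coef = np.polyfit(x_tr, y_tr, d)  # least-squares fit of degree d
    print(d,
          empirical_risk(coef, x_tr, y_tr),    # decreases as d grows
          empirical_risk(coef, x_val, y_val))  # eventually increases: overfitting
```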

Principle of validation

  • In addition to the training data \(\calD\), suppose we have access to a separate validation set \(\calV\eqdef\{(\bfx_i,y_i)\}_{i=1}^K\)

  • Assume \(h\) is selected using the training set \(\calD\) alone, and use the validation set to form the estimate \[\widehat{R}_{\textsf{val}}(h)\eqdef\frac{1}{K}\sum_{i=1}^K\ell(h(\bfx_i),y_i)\]

  • How accurate is the estimate? \[\E{\widehat{R}_{\textsf{val}}(h)} = R(h)\qquad \Var{\widehat{R}_{\textsf{val}}(h)}=\frac{\sigma^2}{K}\textsf{ with }\sigma^2\eqdef\Var{\ell(h(\bfx),y)}\]
    • In general \(\widehat{R}_{\textsf{val}}(h)=R(h)\pm \calO(\frac{1}{\sqrt{K}})\) (the \(1/\sqrt{K}\) rate is illustrated in the sketch after this list)
  • Question: where is the validation set coming from?
    • Split the original set of \(N\) data points into training (\(N-K\) points) and validation (\(K\) points)?
    • \(\calD\eqdef\{(\bfx_i,y_i)\}_{i=1}^{N-K}\) and \(\calV\eqdef\{(\bfx_i,y_i)\}_{i=N-K+1}^{N}\)
    • Small \(K\) leads to poor estimation accuracy
    • Large \(K\) leads to high estimation accuracy… but of the risk of a hypothesis trained on only \(N-K\) points
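
A minimal simulation sketch of the accuracy statement above (the fixed classifier and its true risk are assumptions made for illustration): the validation estimate is unbiased, and its standard deviation shrinks like \(1/\sqrt{K}\).

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy setting: a fixed classifier h whose 0-1 loss on a fresh
# example is Bernoulli with true risk R(h) = 0.3, so sigma^2 = 0.3 * 0.7
true_risk = 0.3

def validation_estimate(K):
    """Validation risk estimate: average loss over K i.i.d. validation points."""
    return rng.binomial(1, true_risk, size=K).mean()

for K in [10, 100, 1000]:
    estimates = np.array([validation_estimate(K) for _ in range(10_000)])
    # mean stays ~ 0.3 (unbiased); std matches sqrt(0.3 * 0.7 / K)
    print(K, estimates.mean().round(3), estimates.std().round(3))
```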

Validation vs testing

  • How is validation different from testing?
    • \(\widehat{R}_{\textsf{val}}(h)\) can be used to make learning choices
    • If an estimate of \(R(h)\) affects learning, it is no longer testing
    • A test set is unbiased; a validation set has an optimistic bias
    • Assume two hypotheses \(h_1\) and \(h_2\) such that \(\E{R(h_1)}=\E{R(h_2)}=p\).
    • Assume \(\widehat{R}_{\textsf{val}}(h_1)\) and \(\widehat{R}_{\textsf{val}}(h_2)\) are independent and uniformly distributed on \([p-\eta,p+\eta]\).
    • Pick \(h=\argmin_{h_1,h_2}\widehat{R}_{\textsf{val}}(h)\). Then \(\E{\widehat{R}_{\textsf{val}}(h)}<p\): the selected estimate is optimistically biased even though \(\E{R(h)}=p\)
  • Using validation for model selection

  • Effect of bias (see the simulation sketch below)
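
A minimal simulation sketch of the two-hypothesis example above (the values of \(p\), \(\eta\), and the number of trials are arbitrary choices for illustration): the expectation of the minimum of the two uniform estimates is \(p-\eta/3\), strictly below \(p\).

```python
import numpy as np

rng = np.random.default_rng(2)

# Setup from the example: two independent validation estimates, each
# uniform on [p - eta, p + eta]; select the hypothesis with the smaller one
p, eta, trials = 0.5, 0.1, 1_000_000
e1 = rng.uniform(p - eta, p + eta, trials)
e2 = rng.uniform(p - eta, p + eta, trials)

selected = np.minimum(e1, e2)   # validation estimate of the chosen hypothesis
print(selected.mean())          # ~ p - eta/3 = 0.4667 < p: optimistic bias
```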

Data contamination

  • Three estimates of the risk \(R(h)\)
    • \(\widehat{R}_{\textsf{train}}(h)\): totally contaminated
    • \(\widehat{R}_{\textsf{test}}(h)\): totally clean
    • \(\widehat{R}_{\textsf{validation}}(h)\): partially contaminated
  • Dilemma: we would like \(R(h)\approx R(h^{-})\approx\widehat{R}_{\textsf{val}}(h^{-})\), where \(h^{-}\) is the hypothesis trained on the \(N-K\) points left after removing the validation set

  • Can we do this?
    • Leave-one-out cross validation
    • \(k\)-fold cross validation
  • Remarks
    • For \(k\)-fold cross validation, the estimate depends on the particular choice of partition
    • It is common to form several estimates based on different random partitions and then average them
    • When using \(k\)-fold cross validation for classification, you should ensure that each of the sets \(\calD_j\) contains training data from each class in the same proportion as in the full data set (stratification)
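
A minimal sketch of stratified \(k\)-fold cross validation for choosing \(K\) in nearest-neighbor classification, using scikit-learn's StratifiedKFold (the synthetic data set, the candidate values of \(K\), and the choice of 5 folds are assumptions made for illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data set (an assumption for this sketch)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Stratified folds: each fold keeps the class proportions of the full data set
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

for K in [1, 3, 5, 9, 15]:
    fold_errors = []
    for train_idx, val_idx in skf.split(X, y):
        clf = KNeighborsClassifier(n_neighbors=K).fit(X[train_idx], y[train_idx])
        fold_errors.append(1.0 - clf.score(X[val_idx], y[val_idx]))
    # Cross-validation estimate of the risk for this choice of K
    print(K, np.mean(fold_errors).round(3))
```

Averaging over several random partitions, as suggested in the remarks above, further reduces the dependence of the estimate on the particular split.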