A model is a mathematical representation of a function, such as a classifier or a regression function
A model might have several free parameters that are not determined by the learning algorithm
The choice of these free parameters has a huge impact on performance
Choosing the values of the free parameters is the problem of model selection
Method                   Parameter
Polynomial regression    polynomial degree
Ridge regression         regularization parameter
SVMs                     margin violation constraint
Kernel methods           kernel choice
Nearest neighbor         number of neighbors
Model selection issue
We need to select appropriate values for the free parameters
All we have is the training data
We must use the training data to select the parameters
Free parameters usually control the balance between underfitting and overfitting
These parameters are left free because we don't want the training data to determine them: letting the training error select them almost always leads to overfitting
Example 1: what happens if we let training data determine the degree in polynomial regression? (A small sketch of this follows below.)
Example 2: what happens if we let training data set the number of neighbors in NN classification?
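The following small sketch (my own illustration in numpy, not part of the original notes) fits polynomials of increasing degree and prints the training error: because the training error can only decrease as the degree grows, letting the training data choose the degree always selects the most complex model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a simple underlying function (made-up illustrative data)
x = rng.uniform(-1, 1, size=20)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(20)

for degree in range(1, 10):
    coeffs = np.polyfit(x, y, degree)            # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree}: training MSE = {train_mse:.4f}")

# The training MSE is non-increasing in the degree, so "let the training data
# choose the degree" means "always pick the largest degree tried": overfitting.
```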
We have considered two approaches to estimating the out-of-sample error so far
VC approach: bound the out-of-sample error by the training error plus a complexity penalty
Bias-variance approach: decompose the expected out-of-sample error into bias and variance
Validation approach: try to estimate the out-of-sample error directly
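A compact way to contrast the three estimates, written as a sketch in the commonly used notation E_in, E_out, E_val (the original notes may use different symbols):

```latex
% Sketch: three routes to the out-of-sample error E_out(h)
\begin{align*}
\text{VC approach:}            \quad & E_{\mathrm{out}}(h) \le E_{\mathrm{in}}(h) + \Omega(N, \mathcal{H}, \delta)
    && \text{(training error plus a complexity penalty)} \\
\text{Bias-variance approach:} \quad & \mathbb{E}\!\left[E_{\mathrm{out}}\right] = \text{bias} + \text{variance}
    && \text{(decomposition of the expected error)} \\
\text{Validation approach:}    \quad & E_{\mathrm{val}}(h) \approx E_{\mathrm{out}}(h)
    && \text{(direct estimate on held-out data)}
\end{align*}
```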
Principle of validation
In addition to the training data, suppose we have access to a separate validation set of K points
Assume a hypothesis h has been selected using the training set only, and use the validation set to form an estimate of its out-of-sample error
How accurate is the estimate?
In general, the estimate is unbiased (the validation data played no role in selecting h) and its fluctuation shrinks at the rate O(1/√K)
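In symbols, assuming the K validation points are drawn independently and σ² denotes the variance of a single point error (notation assumed here, not taken from the notes):

```latex
% E_val(h) is the average of K independent point errors e(h(x_k), y_k);
% each is an unbiased estimate of E_out(h) because h was chosen without the validation data.
\begin{align*}
E_{\mathrm{val}}(h) &= \frac{1}{K} \sum_{k=1}^{K} e\big(h(x_k), y_k\big), \\
\mathbb{E}\big[E_{\mathrm{val}}(h)\big] &= E_{\mathrm{out}}(h), \qquad
\operatorname{Var}\big[E_{\mathrm{val}}(h)\big] = \frac{\sigma^2}{K}
\;\Longrightarrow\;
E_{\mathrm{val}}(h) = E_{\mathrm{out}}(h) \pm O\!\left(\tfrac{1}{\sqrt{K}}\right).
\end{align*}
```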
Question: where does the validation set come from?
Split the original set of N data points into a training set (N − K points) and a validation set (K points)?
The hypothesis is then trained on the N − K points and evaluated on the K held-out points
Small K leads to poor estimation accuracy
Large K leads to high estimation accuracy… of what? The hypothesis is now trained on only N − K points, which may behave quite differently from one trained on all N points (see the sketch below)
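A minimal sketch of the split and the resulting estimate (my own illustration with made-up data and a fixed degree-3 polynomial model):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data set of N points (not from the notes)
N, K = 100, 25                      # K points held out for validation
x = rng.uniform(-1, 1, size=N)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(N)

perm = rng.permutation(N)
train_idx, val_idx = perm[: N - K], perm[N - K :]

# Train a degree-3 polynomial on the N - K training points only
coeffs = np.polyfit(x[train_idx], y[train_idx], 3)

# Validation estimate: mean squared error on the K held-out points,
# whose fluctuation around the true error is O(1/sqrt(K))
e_val = np.mean((np.polyval(coeffs, x[val_idx]) - y[val_idx]) ** 2)
print(f"validation estimate of the error: {e_val:.4f}")
```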
Validation vs testing
How is validation different from testing?
The validation set can be used to make learning choices (for example, selecting a free parameter)
If an estimate of the out-of-sample error affects learning, it is no longer testing
A test set is unbiased; a validation set has an optimistic bias
Assume two hypotheses h1 and h2 such that E_out(h1) = E_out(h2) = 0.5.
Assume the error estimates e1 and e2 are independent and distributed according to the uniform distribution on [0, 1].
Pick the hypothesis with the smaller estimate, i.e. report e = min(e1, e2). Then E[e] = 1/3 < 0.5: the reported error is optimistically biased (see the simulation below).
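A quick simulation of this example (my own code, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# e1, e2: independent error estimates, uniform on [0, 1], each with mean 0.5
e1 = rng.uniform(0.0, 1.0, size=1_000_000)
e2 = rng.uniform(0.0, 1.0, size=1_000_000)

# Always keep the hypothesis with the smaller estimated error
selected = np.minimum(e1, e2)

# E[min(e1, e2)] = 1/3 < 0.5: the reported error of the chosen hypothesis is
# optimistically biased even though each estimate is individually unbiased.
print(selected.mean())   # approximately 0.333
```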
Using validation for model selection
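As a sketch of the idea (my own code, with an arbitrary range of candidate degrees), choose the polynomial degree that minimizes the validation error rather than the training error:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative data (not from the notes)
N, K = 100, 25
x = rng.uniform(-1, 1, size=N)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(N)

perm = rng.permutation(N)
train_idx, val_idx = perm[: N - K], perm[N - K :]

best_degree, best_e_val = None, np.inf
for degree in range(1, 10):
    coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
    e_val = np.mean((np.polyval(coeffs, x[val_idx]) - y[val_idx]) ** 2)
    if e_val < best_e_val:
        best_degree, best_e_val = degree, e_val

# After selection, it is common to retrain the chosen model on all N points.
print(f"selected degree: {best_degree} (validation MSE {best_e_val:.4f})")
```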
Effect of bias
Data contamination
Three estimates of the risk (out-of-sample error)
E_in (training error): totally contaminated, since the data was used to select the hypothesis itself
E_test (test error): totally clean, since the data made no learning choices
E_val (validation error): partially contaminated, since the data was used to make some learning choices (e.g., model selection)
Dilemma: we would like K to be small (so the hypothesis trained on N − K points is close to the one trained on all N points) and at the same time large (so the validation estimate is accurate)
Can we do this?
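One way to write the tension, assuming g denotes the hypothesis trained on all N points and g^- the one trained on the N − K training points (notation assumed here, not taken from the notes):

```latex
% Chain of approximations behind the dilemma:
%   the left link needs K small, the right link needs K large.
\[
E_{\mathrm{out}}(g)
\;\underset{\text{small } K}{\approx}\;
E_{\mathrm{out}}(g^{-})
\;\underset{\text{large } K}{\approx}\;
E_{\mathrm{val}}(g^{-})
\]
```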
Leave-one-out cross validation: the extreme case K = 1, repeated over the whole data set; train on N − 1 points, validate on the single left-out point, and average the N resulting estimates
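A minimal sketch of the procedure, reusing the toy polynomial setup from the earlier examples:

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative data (not from the notes)
N = 30
x = rng.uniform(-1, 1, size=N)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(N)

errors = []
for i in range(N):
    mask = np.ones(N, dtype=bool)
    mask[i] = False                                   # leave point i out
    coeffs = np.polyfit(x[mask], y[mask], 3)          # train on the other N - 1 points
    errors.append((np.polyval(coeffs, x[i]) - y[i]) ** 2)

# The leave-one-out estimate is the average of the N single-point errors
print(f"LOOCV estimate: {np.mean(errors):.4f}")
```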
k-fold cross validation: partition the data into k disjoint folds of roughly equal size; for each fold, train on the other k − 1 folds, validate on the held-out fold, and average the k estimates
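A corresponding sketch of k-fold cross validation, with k = 5 chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative data (not from the notes)
N, k = 100, 5
x = rng.uniform(-1, 1, size=N)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(N)

folds = np.array_split(rng.permutation(N), k)         # k disjoint folds of ~equal size

fold_errors = []
for val_idx in folds:
    train_idx = np.setdiff1d(np.arange(N), val_idx)   # the other k - 1 folds
    coeffs = np.polyfit(x[train_idx], y[train_idx], 3)
    fold_errors.append(np.mean((np.polyval(coeffs, x[val_idx]) - y[val_idx]) ** 2))

# The k-fold estimate averages the k validation errors
print(f"{k}-fold CV estimate: {np.mean(fold_errors):.4f}")
```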
Remarks
For k-fold cross validation, the estimate depends on the particular choice of partition
It is common to form several estimates based on different random partitions and then average them
When using k-fold cross validation for classification, you should ensure that each of the k folds contains data from each class in the same proportion as in the full data set (stratification); a sketch follows below
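A sketch of stratified fold assignment for classification (my own helper function; libraries such as scikit-learn provide ready-made versions, e.g. StratifiedKFold, but the idea is just per-class splitting):

```python
import numpy as np

def stratified_folds(labels, k, rng):
    """Assign each point to one of k folds so that class proportions are preserved."""
    fold_of = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        # Deal this class's points out to the folds round-robin
        fold_of[idx] = np.arange(len(idx)) % k
    return [np.flatnonzero(fold_of == f) for f in range(k)]

rng = np.random.default_rng(6)
labels = rng.choice([0, 1], size=30, p=[0.7, 0.3])    # imbalanced toy labels
for f, idx in enumerate(stratified_folds(labels, k=3, rng=rng)):
    proportions = np.bincount(labels[idx], minlength=2) / len(idx)
    print(f"fold {f}: class proportions {proportions}")
```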