Bias-Variance Tradeoff

Matthieu Bloch

April 7, 2020

Approximation-generalization tradeoff

  • For a fixed hypothesis set \(\calH\) and dataset \(\calD\), the goal is to find a function \(h\in\calH\) that minimizes the true risk \(R(h)\)

  • Hope to approximate the Bayes classifier (classification) or underlying function (regression)

  • Tension between two desirable characteristics
    • a more complex \(\calH\) gives a better chance of approximating the ideal classifier/function
    • a less complex \(\calH\) gives a better chance of generalizing to unseen data
  • Regularization plays a similar role to limiting complexity, by biasing the answer away from complex classifiers/functions

  • Complexity must be carefully limited to avoid overfitting

Quantifying the tradeoff

  • VC generalization bound \[R(h)\leq \widehat{R}_N(h)+\epsilon(\calH,N)\quad\textsf{w.h.p.}\]

  • Bias-variance decomposition \[R(h) \approx \textsf{bias}^2+\textsf{variance}\]
    • bias: how well \(\calH\) can approximate the true \(h^*\)
    • variance: how likely we are to pick a good \(h\in\calH\) from the data
    • extends more naturally to regression than the VC analysis
  • Setup for analysis:
    • \(h:\bbR^d\to\bbR\): unknown target function
    • Dataset \(\calD=\{(\bfx_i,y_i)\}_{i=1}^N\), \(\bfx_i\in\bbR^d\), \(y_i=h(\bfx_i)+\varepsilon_i\in\bbR\) (regression, \(\varepsilon_i\sim\calN(0,\sigma_\varepsilon^2)\))
    • \(\hat{h}_{\calD}:\bbR^d\to\bbR\): choice of function in \(\calH\) selected using \(\calD\)
    • Squared-error risk for a fixed \(\hat{h}_{\calD}\): \[R(\hat{h}_{\calD})=\E[XY]{(\hat{h}_{\calD}(X)-Y)^2}\]
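
To make this setup concrete, here is a minimal numerical sketch; the target \(h\), the dimension \(d\), the noise level, and the choice of an affine least-squares fit for \(\hat{h}_\calD\) are hypothetical illustrations, not taken from the slides. It draws one dataset \(\calD\), fits \(\hat{h}_\calD\), and estimates the true risk \(R(\hat{h}_\calD)\) by Monte Carlo over fresh \((X,Y)\) pairs.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical instantiation of the setup (these choices are illustrative):
# d = 1, target h(x) = 2x - 1, Gaussian noise, H = affine functions fit by least squares.
d, N, sigma_eps = 1, 30, 0.3
h = lambda x: 2.0 * x - 1.0

# Training set D = {(x_i, y_i)} with y_i = h(x_i) + eps_i
X_train = rng.uniform(-1, 1, (N, d))
y_train = h(X_train[:, 0]) + sigma_eps * rng.standard_normal(N)

# hat{h}_D: least-squares fit of an affine model on D
A = np.hstack([X_train, np.ones((N, 1))])
w, *_ = np.linalg.lstsq(A, y_train, rcond=None)
h_hat = lambda x: np.hstack([x, np.ones((x.shape[0], 1))]) @ w

# Monte Carlo estimate of R(hat{h}_D) = E_{XY}[(hat{h}_D(X) - Y)^2] over fresh pairs
X_test = rng.uniform(-1, 1, (100_000, d))
Y_test = h(X_test[:, 0]) + sigma_eps * rng.standard_normal(100_000)
risk = np.mean((h_hat(X_test) - Y_test) ** 2)
print(f"estimated R(h_D) ~ {risk:.3f}  (noise floor sigma_eps^2 = {sigma_eps**2:.3f})")
```

The estimated risk never drops below the noise floor \(\sigma_\varepsilon^2\), which reappears explicitly in the decomposition on the next slide.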

Bias-variance decomposition

  • \[\E[\calD]{R(\hat{h}_{\calD})}=\sigma^2_{\varepsilon} + \E[X]{\Var{\hat{h}_{\calD}(X)}} + \E[X]{\textsf{Bias}(\hat{h}_\calD(X))^2}\] with \[\Var{\hat{h}_\calD(X)}\eqdef \E[\calD]{\left(\hat{h}_\calD(X)-\E[\calD]{\hat{h}_\calD(X)}\right)^2}\] \[\textsf{Bias}(\hat{h}_\calD(X))\eqdef \E[\calD]{\hat{h}_\calD(X)}-h(X)\] (a derivation is sketched after this list)

  • Intuition: role of bias and variance

  • Demo: learning a sinusoid
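
A minimal derivation sketch of the decomposition above, substituting \(Y=h(X)+\varepsilon\) as in the setup, assuming \(\varepsilon\) is zero-mean with variance \(\sigma_\varepsilon^2\) and independent of both \(X\) and \(\calD\), and writing \(\bar{h}(X)\eqdef\E[\calD]{\hat{h}_\calD(X)}\):

\[\begin{aligned}
\E[\calD]{R(\hat{h}_\calD)} &= \E[\calD]{\E[XY]{\left(\hat{h}_\calD(X)-h(X)-\varepsilon\right)^2}}\\
&= \sigma_\varepsilon^2 + \E[X]{\E[\calD]{\left(\hat{h}_\calD(X)-h(X)\right)^2}}\\
&= \sigma_\varepsilon^2 + \E[X]{\E[\calD]{\left(\hat{h}_\calD(X)-\bar{h}(X)\right)^2}+\left(\bar{h}(X)-h(X)\right)^2}\\
&= \sigma_\varepsilon^2 + \E[X]{\Var{\hat{h}_\calD(X)}} + \E[X]{\textsf{Bias}(\hat{h}_\calD(X))^2}.
\end{aligned}\]

The cross terms vanish: in the second line because \(\varepsilon\) has zero mean and is independent of \(\hat{h}_\calD(X)\), and in the third line because \(\E[\calD]{\hat{h}_\calD(X)-\bar{h}(X)}=0\) by definition of \(\bar{h}\).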
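
The sinusoid demo itself is not reproduced here, but a minimal numpy sketch along the same lines follows; the target \(\sin(\pi x)\), the dataset size, the noise level, and the polynomial hypothesis sets are hypothetical choices, not the slide's. Averaging the fitted hypotheses over many independent datasets estimates the two terms of the decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical demo parameters (illustrative, not the slide's exact demo):
# target sin(pi x), N points per dataset, many independent datasets,
# and polynomial hypothesis sets of increasing complexity.
target = lambda x: np.sin(np.pi * x)
N, num_datasets, sigma_eps = 5, 2000, 0.1
x_test = np.linspace(-1, 1, 200)

def fit_poly(deg):
    """Fit a degree-`deg` polynomial on each dataset; return predictions on x_test."""
    preds = np.empty((num_datasets, x_test.size))
    for k in range(num_datasets):
        x = rng.uniform(-1, 1, N)
        y = target(x) + sigma_eps * rng.standard_normal(N)
        coeffs = np.polyfit(x, y, deg)        # least-squares fit within H
        preds[k] = np.polyval(coeffs, x_test)
    return preds

for deg in (0, 1, 3):
    preds = fit_poly(deg)
    h_bar = preds.mean(axis=0)                              # average hypothesis E_D[h_D(x)]
    bias2 = np.mean((h_bar - target(x_test)) ** 2)          # E_X[Bias^2]
    var = np.mean(preds.var(axis=0))                        # E_X[Var_D(h_D(X))]
    print(f"degree {deg}: bias^2 = {bias2:.3f}, variance = {var:.3f}")
```

With so few points per dataset, the higher-degree fits typically show a much larger variance term, while the constant fit carries the largest bias.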

Wrapping up

  • VC bound: keep the model complexity small enough compared to \(N\) and we can learn any \(h^*\)
    • analysis independent of learning algorithm
    • more useful as a conceptual tool than as a practical technique
  • Bias-variance decomposition: for any specific target \(h^*\), we do best by matching the model complexity to the data resources (not to the complexity of \(h^*\))
    • analysis dependent on learning algorithm
    • tension: increasing the model complexity reduces the bias, while decreasing the model complexity reduces the variance
    • more useful as a conceptual tool than as a practical technique
  • Illustration of learning curves
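
As a stand-in for that illustration, the following hypothetical sketch (target, noise level, and model degrees are my own choices, not the slides') tracks the average in-sample and out-of-sample squared error as \(N\) grows, for a simple and a complex polynomial model.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical learning-curve sketch (target, noise, and degrees are illustrative):
# compare a simple (degree-1) and a complex (degree-10) polynomial model as N grows.
target = lambda x: np.sin(np.pi * x)
sigma_eps, trials = 0.2, 200
x_test = rng.uniform(-1, 1, 2000)
y_test = target(x_test) + sigma_eps * rng.standard_normal(x_test.size)

def avg_errors(deg, N):
    """Average in-sample and out-of-sample squared error over many datasets of size N."""
    e_in = e_out = 0.0
    for _ in range(trials):
        x = rng.uniform(-1, 1, N)
        y = target(x) + sigma_eps * rng.standard_normal(N)
        c = np.polyfit(x, y, deg)
        e_in += np.mean((np.polyval(c, x) - y) ** 2)
        e_out += np.mean((np.polyval(c, x_test) - y_test) ** 2)
    return e_in / trials, e_out / trials

for N in (20, 40, 80, 160, 320):
    for deg in (1, 10):
        ein, eout = avg_errors(deg, N)
        print(f"N={N:4d}  degree={deg:2d}  E_in={ein:.3f}  E_out={eout:.3f}")
```

The complex model typically starts with a large gap between the in-sample and out-of-sample errors that closes as \(N\) grows, while the simple model converges quickly to a higher error floor.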

RECAP: Insights to develop good models

  • Bias-variance decomposition more useful as a conceptual tool than as a practical technique

  • Bias-variance decomposition still gives useful insights for developing improved learning models
    • Reduce variance (without significantly increasing the bias)
      • limiting model complexity (e.g. polynomial order in regression), regularization
      • not necessarily intuitive
    • Reduce bias (without significantly increasing the variance)
      • exploit prior information to steer the model in the correct direction
      • typically application specific
  • Example
    • Least-squares is an unbiased estimator, but can have high variance
    • Tikhonov regularization deliberately introduces bias into the estimator
    • The trick is figuring out just how much bias to introduce
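
A hypothetical numerical sketch of this example (the design matrix, dimensions, noise level, and \(\lambda\) grid are illustrative choices, not from the slides): refit the estimator on many noisy copies of the same ill-conditioned linear model and compare ordinary least squares (\(\lambda=0\)) with the Tikhonov/ridge estimate \((X^\top X+\lambda I)^{-1}X^\top y\).

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical sketch (sizes, design, and lambda grid are illustrative):
# the least-squares estimate of w is unbiased but has high variance on an
# ill-conditioned design; Tikhonov (ridge) regularization trades bias for variance.
d, N, sigma_eps, trials = 10, 30, 1.0, 5000
w_true = rng.standard_normal(d)
X = rng.standard_normal((N, d)) @ np.diag(np.linspace(1.0, 0.05, d))  # nearly collinear columns

def bias_variance(lmbda):
    """Squared bias and variance of w_hat = (X^T X + lambda I)^{-1} X^T y over noise draws."""
    w_hats = np.empty((trials, d))
    for t in range(trials):
        y = X @ w_true + sigma_eps * rng.standard_normal(N)
        w_hats[t] = np.linalg.solve(X.T @ X + lmbda * np.eye(d), X.T @ y)
    w_bar = w_hats.mean(axis=0)                      # E_D[w_hat]
    bias2 = np.sum((w_bar - w_true) ** 2)            # squared norm of the bias
    var = np.sum(w_hats.var(axis=0))                 # total variance
    return bias2, var

for lmbda in (0.0, 0.01, 0.1, 1.0, 10.0):
    b2, v = bias_variance(lmbda)
    print(f"lambda={lmbda:6.2f}  bias^2={b2:8.3f}  variance={v:8.3f}  MSE={b2 + v:8.3f}")
```

A small \(\lambda\) typically cuts the variance sharply while adding little bias; a large \(\lambda\) lets the bias dominate, which is exactly the tradeoff the example describes.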