Bias-Variance Tradeoff

Matthieu Bloch

April 7, 2020

Approximation-generalization tradeoff

  • For a fixed hypothesis set \(\calH\) and dataset \(\calD\), the goal is to find a function \(h\in\calH\) that minimizes the true risk \(R(h)\)

  • Hope to approximate the Bayes classifier (classification) or underlying function (regression)

  • Tension between two desirable characteristics
    • a more complex \(\calH\) gives a better chance of approximating the ideal classifier/function
    • a less complex \(\calH\) gives a better chance of generalizing to unseen data
  • Regularization plays a similar role to limiting complexity, by biasing the answer away from complex classifiers/functions

  • Complexity must be carefully limited to avoid overfitting

Quantifying the tradeoff

  • VC generalization bound \[R(h)\leq \widehat{R}_N(h)+\epsilon(\calH,N)\quad\textsf{w.h.p.}\]

  • Bias-variance decomposition \[R(h) \approx \textsf{bias}^2+\textsf{variance}\]
    • bias: how well \(\calH\) can approximate the true \(h^*\)
    • variance: how likely we are to pick a good \(h\in\calH\) from the data
    • extends more naturally to regression than the VC analysis
  • Setup for analysis:
    • \(h:\bbR^d\to\bbR\): unknown target function
    • Dataset \(\calD=\{(\bfx_i,y_i)\}_{i=1}^N\), \(\bfx_i\in\bbR^d\), \(y_i=h(\bfx_i)+\varepsilon_i\in\bbR\) (regression, \(\varepsilon_i\sim\calN(0,\sigma_\varepsilon^2)\))
    • \(\hat{h}_{\calD}:\bbR^d\to\bbR\): choice of function in \(\calH\) selected using \(\calD\)
    • Squared-error risk for a fixed \(\hat{h}_{\calD}\): \[R(\hat{h}_{\calD})=\E[XY]{(\hat{h}_{\calD}(X)-Y)^2}\]
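
To make this setup concrete, here is a minimal numerical sketch; the target \(h\), the dimension \(d\), the noise level, and the choice of an affine least-squares fit for \(\hat{h}_\calD\) are hypothetical illustrations, not taken from the slides. It draws one dataset \(\calD\), fits \(\hat{h}_\calD\), and estimates the true risk \(R(\hat{h}_\calD)\) by Monte Carlo over fresh \((X,Y)\) pairs.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical instantiation of the setup (these choices are illustrative):
# d = 1, target h(x) = 2x - 1, Gaussian noise, H = affine functions fit by least squares.
d, N, sigma_eps = 1, 30, 0.3
h = lambda x: 2.0 * x - 1.0

# Training set D = {(x_i, y_i)} with y_i = h(x_i) + eps_i
X_train = rng.uniform(-1, 1, (N, d))
y_train = h(X_train[:, 0]) + sigma_eps * rng.standard_normal(N)

# hat{h}_D: least-squares fit of an affine model on D
A = np.hstack([X_train, np.ones((N, 1))])
w, *_ = np.linalg.lstsq(A, y_train, rcond=None)
h_hat = lambda x: np.hstack([x, np.ones((x.shape[0], 1))]) @ w

# Monte Carlo estimate of R(hat{h}_D) = E_{XY}[(hat{h}_D(X) - Y)^2] over fresh pairs
X_test = rng.uniform(-1, 1, (100_000, d))
Y_test = h(X_test[:, 0]) + sigma_eps * rng.standard_normal(100_000)
risk = np.mean((h_hat(X_test) - Y_test) ** 2)
print(f"estimated R(h_D) ~ {risk:.3f}  (noise floor sigma_eps^2 = {sigma_eps**2:.3f})")
```

The estimated risk never drops below the noise floor \(\sigma_\varepsilon^2\), which reappears explicitly in the decomposition on the next slide.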

Bias-variance decomposition

  • \[\E[\calD]{R(\hat{h}_{\calD})}=\sigma^2_{\varepsilon} + \E[X]{\Var{\hat{h}_{\calD}(X)}} + \E[X]{\textsf{Bias}(\hat{h}_\calD(X))^2}\] with \[\Var{\hat{h}_\calD(X)}\eqdef \E[\calD]{\left(\hat{h}_\calD(X)-\E[\calD]{\hat{h}_\calD(X)}\right)^2}\] \[\textsf{Bias}(\hat{h}_\calD(X))\eqdef \E[\calD]{\hat{h}_\calD(X)}-h(X)\] (a derivation is sketched after this list)

  • Intuition: role of bias and variance

  • Demo: learning a sinusoid
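
A minimal derivation sketch of the decomposition above, substituting \(Y=h(X)+\varepsilon\) as in the setup, assuming \(\varepsilon\) is zero-mean with variance \(\sigma_\varepsilon^2\) and independent of both \(X\) and \(\calD\), and writing \(\bar{h}(X)\eqdef\E[\calD]{\hat{h}_\calD(X)}\):

\[\begin{aligned}
\E[\calD]{R(\hat{h}_\calD)} &= \E[\calD]{\E[XY]{\left(\hat{h}_\calD(X)-h(X)-\varepsilon\right)^2}}\\
&= \sigma_\varepsilon^2 + \E[X]{\E[\calD]{\left(\hat{h}_\calD(X)-h(X)\right)^2}}\\
&= \sigma_\varepsilon^2 + \E[X]{\E[\calD]{\left(\hat{h}_\calD(X)-\bar{h}(X)\right)^2}+\left(\bar{h}(X)-h(X)\right)^2}\\
&= \sigma_\varepsilon^2 + \E[X]{\Var{\hat{h}_\calD(X)}} + \E[X]{\textsf{Bias}(\hat{h}_\calD(X))^2}.
\end{aligned}\]

The cross terms vanish: in the second line because \(\varepsilon\) has zero mean and is independent of \(\hat{h}_\calD(X)\), and in the third line because \(\E[\calD]{\hat{h}_\calD(X)-\bar{h}(X)}=0\) by definition of \(\bar{h}\).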
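
The sinusoid demo itself is not reproduced here, but a minimal numpy sketch along the same lines follows; the target \(\sin(\pi x)\), the dataset size, the noise level, and the polynomial hypothesis sets are hypothetical choices, not the slide's. Averaging the fitted hypotheses over many independent datasets estimates the two terms of the decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical demo parameters (illustrative, not the slide's exact demo):
# target sin(pi x), N points per dataset, many independent datasets,
# and polynomial hypothesis sets of increasing complexity.
target = lambda x: np.sin(np.pi * x)
N, num_datasets, sigma_eps = 5, 2000, 0.1
x_test = np.linspace(-1, 1, 200)

def fit_poly(deg):
    """Fit a degree-`deg` polynomial on each dataset; return predictions on x_test."""
    preds = np.empty((num_datasets, x_test.size))
    for k in range(num_datasets):
        x = rng.uniform(-1, 1, N)
        y = target(x) + sigma_eps * rng.standard_normal(N)
        coeffs = np.polyfit(x, y, deg)        # least-squares fit within H
        preds[k] = np.polyval(coeffs, x_test)
    return preds

for deg in (0, 1, 3):
    preds = fit_poly(deg)
    h_bar = preds.mean(axis=0)                              # average hypothesis E_D[h_D(x)]
    bias2 = np.mean((h_bar - target(x_test)) ** 2)          # E_X[Bias^2]
    var = np.mean(preds.var(axis=0))                        # E_X[Var_D(h_D(X))]
    print(f"degree {deg}: bias^2 = {bias2:.3f}, variance = {var:.3f}")
```

With so few points per dataset, the higher-degree fits typically show a much larger variance term, while the constant fit carries the largest bias.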

Wrapping up

  • VC bound: keep the model complexity small enough compared to \(N\) and we can learn any \(h^*\)
    • analysis independent of learning algorithm
    • more useful as a conceptual tool than as a practical technique
  • Bias-variance decomposition: for any specific target \(h^*\), we do best by matching the model complexity to the data resources (not to the complexity of \(h^*\))
    • analysis dependent on learning algorithm
    • tension: increasing the model complexity reduces the bias, while decreasing the model complexity reduces the variance
    • more useful as a conceptual tool than as a practical technique
  • Illustration of learning curves
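
As a stand-in for that illustration, the following hypothetical sketch (target, noise level, and model degrees are my own choices, not the slides') tracks the average in-sample and out-of-sample squared error as \(N\) grows, for a simple and a complex polynomial model.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical learning-curve sketch (target, noise, and degrees are illustrative):
# compare a simple (degree-1) and a complex (degree-10) polynomial model as N grows.
target = lambda x: np.sin(np.pi * x)
sigma_eps, trials = 0.2, 200
x_test = rng.uniform(-1, 1, 2000)
y_test = target(x_test) + sigma_eps * rng.standard_normal(x_test.size)

def avg_errors(deg, N):
    """Average in-sample and out-of-sample squared error over many datasets of size N."""
    e_in = e_out = 0.0
    for _ in range(trials):
        x = rng.uniform(-1, 1, N)
        y = target(x) + sigma_eps * rng.standard_normal(N)
        c = np.polyfit(x, y, deg)
        e_in += np.mean((np.polyval(c, x) - y) ** 2)
        e_out += np.mean((np.polyval(c, x_test) - y_test) ** 2)
    return e_in / trials, e_out / trials

for N in (20, 40, 80, 160, 320):
    for deg in (1, 10):
        ein, eout = avg_errors(deg, N)
        print(f"N={N:4d}  degree={deg:2d}  E_in={ein:.3f}  E_out={eout:.3f}")
```

The complex model typically starts with a large gap between the in-sample and out-of-sample errors that closes as \(N\) grows, while the simple model converges quickly to a higher error floor.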

RECAP: Insights to develop good models

  • Bias-variance decomposition more useful as a conceptual tool than as a practical technique

  • Bias-variance decomposition still gives useful insights for developing improved learning models
    • Reduce variance (without significantly increasing the bias)
      • limiting model complexity (e.g. polynomial order in regression), regularization
      • not necessarily intuitive
    • Reduce bias (without significantly increasing the variance)
      • exploit prior information to steer the model in the correct direction
      • typically application specific
  • Example
    • Least-squares is an unbiased estimator, but can have high variance
    • Tikhonov regularization deliberately introduces bias into the estimator
    • The trick is figuring out just how much bias to introduce
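
A hypothetical numerical sketch of this example (the design matrix, dimensions, noise level, and \(\lambda\) grid are illustrative choices, not from the slides): refit the estimator on many noisy copies of the same ill-conditioned linear model and compare ordinary least squares (\(\lambda=0\)) with the Tikhonov/ridge estimate \((X^\top X+\lambda I)^{-1}X^\top y\).

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical sketch (sizes, design, and lambda grid are illustrative):
# the least-squares estimate of w is unbiased but has high variance on an
# ill-conditioned design; Tikhonov (ridge) regularization trades bias for variance.
d, N, sigma_eps, trials = 10, 30, 1.0, 5000
w_true = rng.standard_normal(d)
X = rng.standard_normal((N, d)) @ np.diag(np.linspace(1.0, 0.05, d))  # nearly collinear columns

def bias_variance(lmbda):
    """Squared bias and variance of w_hat = (X^T X + lambda I)^{-1} X^T y over noise draws."""
    w_hats = np.empty((trials, d))
    for t in range(trials):
        y = X @ w_true + sigma_eps * rng.standard_normal(N)
        w_hats[t] = np.linalg.solve(X.T @ X + lmbda * np.eye(d), X.T @ y)
    w_bar = w_hats.mean(axis=0)                      # E_D[w_hat]
    bias2 = np.sum((w_bar - w_true) ** 2)            # squared norm of the bias
    var = np.sum(w_hats.var(axis=0))                 # total variance
    return bias2, var

for lmbda in (0.0, 0.01, 0.1, 1.0, 10.0):
    b2, v = bias_variance(lmbda)
    print(f"lambda={lmbda:6.2f}  bias^2={b2:8.3f}  variance={v:8.3f}  MSE={b2 + v:8.3f}")
```

A small \(\lambda\) typically cuts the variance sharply while adding little bias; a large \(\lambda\) lets the bias dominate, which is exactly the tradeoff the example describes.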