Introduction to VC dimension

Matthieu R Bloch

RECAP: Dichotomy, growth function, and Break point

  • For a dataset \(\calD\eqdef \{\bfx_i\}_{i=1}^N\) and set of hypotheses \(\calH\), the set of dichotomies generated by \(\calH\) on \(\calD\) is \[\calH(\{\bfx_i\}_{i=1}^N)\eqdef \{\{h(\bfx_i)\}_{i=1}^N:h\in\calH\}\]

  • By definition \(\card{\calH(\{\bfx_i\}_{i=1}^N)}\leq 2^N\) and in general \(\card{\calH(\{\bfx_i\}_{i=1}^N)}\ll\card{\calH}\)

  • For a set of hypotheses \(\calH\), the growth function of \(\calH\) is \[m_\calH(N)\eqdef\max_{\{\bfx_i\}_{i=1}^N}\card{\calH(\{\bfx_i\}_{i=1}^N)}\]

  • The growth function does not depend on the datapoints \(\{\bfx_i\}_{i=1}^N\) and \(m_{\calH}(N)\leq 2^N\)

  • If not data set of size \(k\) can be shattered by \(\calH\), then \(k\) is a break point for \(\calH\)

  • If \(k\) is a break point then \(m_\calH(k)<2^k\)

Sauer’s lemma

  • \(B(N,1)=1\), \(B(1,k)=2\) for \(k>1\), \[\forall k>1\quad B(N,k)\leq B(N-1,k) + B(N-1,k-1)\]

  • \[B(N,k)\leq \sum_{i=0}^{k-1}{N\choose i}\]

  • \(B(N,k)\) is polynomial in \(N\) of degree \(k-1\)

  • Conclusion: if \(\calH\) has a break point, \(m_\calH(N)\) is polynomial in \(N\)

VC generalization bound

  • \[\P{\sup_{h\in\calH}\abs{R(h)-\widehat{R}_N(h)}>\epsilon}\leq 4 m_\calH(2N)e^{-\frac{1}{8}\epsilon^2 N}\]

  • Compare this with our previous generalization bound that assumed \(\card{\calH}<\infty\) \[\P{\max_{h\in\calH}\abs{R(h)-\widehat{R}_N(h)}>\epsilon}\leq 2 \card{\calH}e^{-2\epsilon^2 N}\]
    • We replace the \(\max\) by \(\sup\) and \(\card{\calH}\) by \(m_\calH(2N)\)
    • We can now handle infinite hypothesis classes!
  • With probability at least \(1-\delta\) \[R(h^*)\leq \widehat{R}_N(h^*) + \sqrt{\frac{8}{N}\left(\log m_\calH(2N) + \log \frac{4}{\delta}\right)}\]

  • Key insight behind proof is how to relate \(\sup_{h\in\calH}\) to \(\max_{h\in\calH'}\) with \(\calH'\subset\calH\) and \(\card{\calH'}<\infty\)

  • Approach developed by Vapnik and Chervonenkis in 1971

Key insights of VC bound

  • The growth function \(m_\calH\) plays a role
    • There may be infinitely many \(h\in\calH\), but they generate a finite number of unique dichotomies
    • Hence, \(\{\widehat{R}_N(h):h\in\calH\}\) is finite
    • Unfortunately \(R(h)\) still potentialy takes infinitely many different values
  • Key insight: use a second ghost dataset of size \(N\) with empirical risk \(\widehat{R}'_N(h)\)
    • Hope that we can squeeze \(R(h)\) between \(\widehat{R}'_N(h)\) and \(\widehat{R}_N(h)\)
Image
  • We will try to relate \(\P{\abs{R(h)-\widehat{R}_N(h)}>\epsilon}\) to \(\P{\abs{\widehat{R}'_N(h)-\widehat{R}_N(h)}>\epsilon'}\) with \(\epsilon'=f(\epsilon)\)
    • \(\P{\abs{\widehat{R}_N(h)-\widehat{R}'_N(h)}>\epsilon}\) only depends on the finite number of unique dichotomies

Intuition

  • Assume that \(X\), \(X'\) be i.i.d. random variables with symmetric distribution around their mean \(\mu\)
    • Let \(\calA\eqdef \{\abs{X-\mu}>\epsilon\}\)
    • Let \(\calB\eqdef \{\abs{X-X'}>\epsilon\}\)
  • \[\P{\calA}\leq 2\P{\calB}\]

  • If \(X\eqdef \widehat{R}_N(h)\) and \(X'\eqdef \widehat{R}'_N(h)\) had symmetric distributions, we would obtain \[\P{\abs{R(h)-\widehat{R}_N(h)}>\epsilon}\leq 2 \P{\abs{\widehat{R}_N(h)-\widehat{R}'_N(h)}>\epsilon}\]

  • Not quite true, but close

Proof of VC bound

  • If \(N\geq 4\epsilon^{-2}\ln 2\), \[\P{\sup_{h\in\calH}\abs{R(h)-\widehat{R}_N(h)}>\epsilon}\leq 2 \P{\sup_{h\in\calH}\abs{\widehat{R}'_N(h)-\widehat{R}_N(h)}>\frac{\epsilon}{2}}\]

  • Let \(\calS\eqdef\{(\bfx_i,y_i)\}_{i=1}^{2N}\) be a dataset partitioned into two subsets \(\calS_1\) and \(\calS_2\) of \(N\) points. Assume that \(\widehat{R}_N(h)\) is computed on \(\calS_1\) while \(\widehat{R}'_N(h)\) is computed on \(\calS_2\). \[ \P{\sup_{h\in\calH}\abs{\widehat{R}'_N(h)-\widehat{R}_N(h)}>\frac{\epsilon}{2}} \leq m_{\calH}(2N) \sup_{\calS_1,\calS_2} \sup_{h\in\calH} \P{\abs{\widehat{R}'_N(h)-\widehat{R}_N(h)}>\frac{\epsilon}{2}|\calS_1,\calS_2} \]

  • For any \(h\in\calH\) and any partition \(\calS_1,\calS_2\), we have \[ \P{\abs{\widehat{R}'_N(h)-\widehat{R}_N(h)}>\frac{\epsilon}{2}|\calS_1,\calS_2} \leq 2e^{-\frac{1}{8}\epsilon^2 N} \]