Introduction to VC dimension

Matthieu R Bloch

RECAP: VC generalization bound

  • \[\P{\sup_{h\in\calH}\abs{R(h)-\widehat{R}_N(h)}>\epsilon}\leq 4 m_\calH(2N)e^{-\frac{1}{8}\epsilon^2 N}\]

  • Compare this with our previous generalization bound that assumed \(\card{\calH}<\infty\) \[\P{\max_{h\in\calH}\abs{R(h)-\widehat{R}_N(h)}>\epsilon}\leq 2 \card{\calH}e^{-2\epsilon^2 N}\]
    • We replace the \(\max\) by \(\sup\) and \(\card{\calH}\) by \(m_\calH(2N)\)
    • We can now handle infinite hypothesis classes!
  • With probability at least \(1-\delta\) \[R(h^*)\leq \widehat{R}_N(h^*) + \sqrt{\frac{8}{N}\left(\log m_\calH(2N) + \log \frac{4}{\delta}\right)}\] (illustrated numerically at the end of this recap)

  • The key insight behind the proof is how to relate \(\sup_{h\in\calH}\) to a \(\max_{h\in\calH'}\) over a finite subset \(\calH'\subset\calH\) with \(\card{\calH'}<\infty\)

  • Approach developed by Vapnik and Chervonenkis in 1971
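
  • As a rough numerical illustration (my own sketch, not part of the lecture), the Python snippet below evaluates the width of the confidence interval \(\sqrt{\frac{8}{N}\left(\log m_\calH(2N)+\log\frac{4}{\delta}\right)}\) using the polynomial bound \(m_\calH(2N)\leq (2N)^{d_{\mathrm{VC}}}+1\) derived later; the values of \(d_{\mathrm{VC}}\), \(N\), and \(\delta\) are arbitrary choices.

```python
import numpy as np

def vc_bound_width(N, delta, d_vc):
    """Width of the VC confidence interval sqrt(8/N * (log m_H(2N) + log(4/delta))),
    using the polynomial bound m_H(2N) <= (2N)**d_vc + 1 on the growth function."""
    m_H = (2.0 * N) ** d_vc + 1.0
    return np.sqrt(8.0 / N * (np.log(m_H) + np.log(4.0 / delta)))

# Example: d_vc = 3 (linear classifiers in R^2) and delta = 0.05
for N in [100, 1_000, 10_000, 100_000]:
    print(f"N = {N:>6d}, bound width = {vc_bound_width(N, 0.05, 3):.3f}")
```

  For small \(N\) the interval is wider than 1, i.e. vacuous for the 0-1 loss, and it only shrinks like \(O(\sqrt{\log N/N})\); this slow decay is typical of worst-case bounds.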

Proof of VC bound

  • If \(N\geq 4\epsilon^{-2}\ln 2\), \[\P{\sup_{h\in\calH}\abs{R(h)-\widehat{R}_N(h)}>\epsilon}\leq 2 \P{\sup_{h\in\calH}\abs{\widehat{R}'_N(h)-\widehat{R}_N(h)}>\frac{\epsilon}{2}}\] where \(\widehat{R}'_N(h)\) is the empirical risk of \(h\) computed on an independent "ghost" sample of \(N\) additional points

  • Let \(\calS\eqdef\{(\bfx_i,y_i)\}_{i=1}^{2N}\) be a dataset partitioned into two subsets \(\calS_1\) and \(\calS_2\) of \(N\) points. Assume that \(\widehat{R}_N(h)\) is computed on \(\calS_1\) while \(\widehat{R}'_N(h)\) is computed on \(\calS_2\). \[ \P{\sup_{h\in\calH}\abs{\widehat{R}'_N(h)-\widehat{R}_N(h)}>\frac{\epsilon}{2}} \leq m_{\calH}(2N) \sup_{\calS_1,\calS_2} \sup_{h\in\calH} \P{\abs{\widehat{R}'_N(h)-\widehat{R}_N(h)}>\frac{\epsilon}{2}|\calS_1,\calS_2} \]

  • For any \(h\in\calH\) and any partition \(\calS_1,\calS_2\), we have \[ \P{\abs{\widehat{R}'_N(h)-\widehat{R}_N(h)}>\frac{\epsilon}{2}|\calS_1,\calS_2} \leq 2e^{-\frac{1}{8}\epsilon^2 N} \]
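
  • Chaining the three steps (for \(N\geq 4\epsilon^{-2}\ln 2\)) recovers the bound from the recap: \[\P{\sup_{h\in\calH}\abs{R(h)-\widehat{R}_N(h)}>\epsilon}\leq 2\, m_\calH(2N)\cdot 2e^{-\frac{1}{8}\epsilon^2 N}= 4 m_\calH(2N)e^{-\frac{1}{8}\epsilon^2 N}\]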

VC dimension

  • The VC dimension of a class \(\calH\), denoted \(d_{\mathrm{VC}}(\calH)\), is the largest \(n\) such that \(m_\calH(n)=2^n\).

  • By definition, \(d_{\mathrm{VC}}\) is one less than the smallest break point

  • For linear classifiers in \(\bbR^2\), \(d_{\mathrm{VC}}=3\): any three non-collinear points can be shattered, but no set of four points can, so the smallest break point is 4

  • If \(k\) is a break point for \(\calH\), \[m_\calH(N)\leq \sum_{i=0}^{k-1}{N\choose i}\leq N^{k-1}+1\]

  • In particular \[ R(h^*)\leq \widehat{R}_N(h^*)+\sqrt{\frac{8}{N}\log \frac{4((2N)^{d_{\mathrm{VC}}}+1)}{\delta}}\stackrel{(d_{\mathrm{VC}}\geq 2)}{\leq}\widehat{R}_N(h^*)+\sqrt{\frac{8d_{\mathrm{VC}}}{N}\log \frac{8N}{\delta}}\]
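
  • To get a feel for how slowly the Sauer bound grows compared to \(2^N\), the short Python check below (my own sketch; the break point \(k\) is a free parameter) tabulates \(\sum_{i=0}^{k-1}{N\choose i}\), \(N^{k-1}+1\), and \(2^N\).

```python
from math import comb

def sauer_bound(N, k):
    """Sauer bound sum_{i=0}^{k-1} C(N, i) on the growth function m_H(N)
    when k is a break point for the hypothesis class H."""
    return sum(comb(N, i) for i in range(k))

# Break point k = 4, e.g. linear classifiers in R^2 (d_VC = 3)
for N in [5, 10, 20, 50]:
    print(N, sauer_bound(N, 4), N ** 3 + 1, 2 ** N)
```

  This polynomial growth of \(m_\calH\) is exactly what makes \(\log m_\calH(2N)\) in the generalization bound grow like \(d_{\mathrm{VC}}\log N\) rather than linearly in \(N\).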

VC Dimension for linear classifiers

  • The VC dimension for linear classifiers in \(\bbR^d\) is \(d+1\) (a shattering construction is sketched in the code below)

  • The number of parameters of a linear classifier in \(\bbR^d\) is also \(d+1\)
    • This is misleading: more parameters does not necessarily mean a higher VC dimension
    • Computing the VC dimension exactly requires hard work in general
  • VC dimension of 1-NN: \(d_{\mathrm{VC}}=\infty\)
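
  • To make the lower bound \(d_{\mathrm{VC}}\geq d+1\) concrete, here is a minimal Python sketch (my own illustration, not from the lecture): affine classifiers shatter the origin together with the \(d\) standard basis vectors, since the augmented point matrix is invertible and every labeling can be realized exactly. The matching upper bound \(d_{\mathrm{VC}}\leq d+1\) requires a separate argument not shown here.

```python
import numpy as np

def shatter_with_affine_classifiers(d):
    """Check that affine classifiers sign(w.x + b) shatter d+1 points in R^d.

    Points: the origin and the d standard basis vectors. For each of the
    2**(d+1) labelings y, solving the invertible (d+1)x(d+1) system A w = y
    on the augmented points [x_i, 1] yields sign(A @ w) = y exactly.
    """
    X = np.vstack([np.zeros(d), np.eye(d)])       # d+1 points in R^d
    A = np.hstack([X, np.ones((d + 1, 1))])       # augmented coordinates [x_i, 1]
    for bits in range(2 ** (d + 1)):
        y = np.array([1.0 if (bits >> i) & 1 else -1.0 for i in range(d + 1)])
        w = np.linalg.solve(A, y)                 # last entry of w is the bias b
        assert np.array_equal(np.sign(A @ w), y)  # this labeling is realized
    return True

print(shatter_with_affine_classifiers(3))         # 4 points in R^3 are shattered
```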

  • Through fairly involved arguments, one can show two important facts:
    1. SVMs with large margin have a small VC dimension
    2. SVMs with large margin have a small generalization error
  • The result sort of carries over to soft-margin SVMs, but the math is much more involved

  • Parting thoughts
    • VC dimension is a measure of the complexity of infinite hypothesis classes
    • Hard to work with in practice
    • Fails to explain deep neural networks (DNNs), which do generalize well in practice!
    • Useful as a theoretical tool to develop insight… but we will see others