Dr. Matthieu R Bloch
Wednesday, December 1, 2021
General announcements
Assignment 6 posted (last assignment)
Due December 7, 2021 for bonus, deadline December 10, 2021
2 lectures left
Let me know what’s missing
Assignment 5 grades posted
Reviewing Midterm 2 grades one last time
The learning problem and why we need probabilities.
Lecture notes 17 and 23
Flip a biased coin, lands on head with unknown probability \(p\in[0,1]\)
\(\P{\text{head}}=p\) and \(\P{\text{tail}}=1-p\)
Say we flip the coin \(N\) times, can we estimate \(p\)?
\[ \hat{p} = \frac{\text{\# heads}}{N} \]
Can we relate \(\hat{p}\) to \(p\)?
It is possible that \(\hat{p}\) is completely off, but it is not probable
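A minimal simulation sketch (with an arbitrary bias \(p=0.3\), not from the lecture) shows that \(\hat{p}\) concentrates around \(p\) as \(N\) grows, even though any single estimate can be off:

```python
# Sketch: estimate the bias of a coin from N flips (p = 0.3 is an arbitrary choice).
import numpy as np

rng = np.random.default_rng(0)
p = 0.3                                  # unknown in practice; fixed here to simulate
for N in [10, 100, 1000, 10000]:
    flips = rng.random(N) < p            # True = head, with probability p
    p_hat = flips.mean()                 # \hat{p} = (# heads) / N
    print(f"N = {N:5d}   p_hat = {p_hat:.3f}   |p_hat - p| = {abs(p_hat - p):.3f}")
```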
An unknown function \(f:\calX\to\calY:\bfx\mapsto y=f(\bfx)\) to learn
A dataset \(\calD\eqdef\{(\bfx_1,y_1),\cdots,(\bfx_N,y_N)\}\)
A set of hypotheses \(\calH\) as to what the function could be
An algorithm \(\texttt{ALG}\) to find the best \(h\in\calH\) that explains \(f\)
A dataset \(\calD\eqdef\{(\bfx_1,y_1),\cdots,(\bfx_N,y_N)\}\)
An unknown conditional distribution \(P_{y|\bfx}\)
A set of hypotheses \(\calH\) as to what the function could be
A loss function \(\ell:\calY\times\calY\rightarrow\bbR^+\) capturing the “cost” of prediction
An algorithm \(\texttt{ALG}\) to find the best \(h\in\calH\) that explains the data
Learning is not memorizing
Consider hypothesis \(h\in\calH\). We can easily compute the empirical risk (a.k.a. in-sample error) \[\widehat{R}_N(h)\eqdef\frac{1}{N}\sum_{i=1}^N\ell(y_i,h(\bfx_i))\]
What we really care about is the true risk (a.k.a. out-of-sample error) \(R(h)\eqdef\E[\bfx y]{\ell(y,h(\bfx))}\)
Question #1: Can we generalize?
Question #2: Can we learn well?
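To make Question #1 concrete, here is a toy sketch (my own example: a single fixed hypothesis and the squared loss) comparing the in-sample error on a small dataset to a Monte Carlo estimate of the out-of-sample error:

```python
# Sketch: empirical risk of one fixed hypothesis vs. an estimate of its true risk.
import numpy as np

rng = np.random.default_rng(1)

def f(x):                 # hypothetical unknown target (for simulation only)
    return np.sin(x)

def h(x):                 # one fixed hypothesis, chosen arbitrarily
    return 0.8 * x

def loss(y, y_pred):      # squared loss as the "cost" of a prediction
    return (y - y_pred) ** 2

x_train = rng.uniform(-2, 2, size=50)        # small dataset D
x_test  = rng.uniform(-2, 2, size=200_000)   # large sample to approximate the true risk

R_hat = loss(f(x_train), h(x_train)).mean()  # empirical risk (in-sample error)
R_est = loss(f(x_test),  h(x_test)).mean()   # ~ true risk (out-of-sample error)
print(f"empirical risk = {R_hat:.3f}   estimated true risk = {R_est:.3f}")
```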
Consider a special case of the general supervised learning problem
Dataset \(\calD\eqdef\{(\bfx_1,y_1),\cdots,(\bfx_N,y_N)\}\)
Unknown \(f:\calX\to\calY\), no noise.
Finite set of hypotheses \(\calH\), \(\card{\calH}=M<\infty\)
Binary loss function \(\ell:\calY\times\calY\rightarrow\bbR^+:(y_1,y_2)\mapsto \indic{y_1\neq y_2}\)
In this very specific case, the true risk simplifies to \[ R(h)\eqdef\E[\bfx y]{\indic{h(\bfx)\neq y}} = \P[\bfx y]{h(\bfx)\neq y} \]
The empirical risk becomes \[ \widehat{R}_N(h)=\frac{1}{N}\sum_{i=1}^{N} \indic{h(\bfx_i)\neq y_i} \]
Our objective is to find a hypothesis \(h^*=\argmin_{h\in\calH}\widehat{R}_N(h)\) that ensures a small risk
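A small sketch of empirical risk minimization in this special case (a toy hypothesis set of threshold classifiers and a toy target of my own choosing, not from the lecture):

```python
# Sketch: ERM over a finite set of threshold classifiers with the binary (0-1) loss.
import numpy as np

rng = np.random.default_rng(2)

def f(x):                               # hypothetical noiseless target
    return (x > 0.37).astype(int)

thresholds = np.linspace(0.0, 1.0, 21)  # finite H: h_t(x) = 1{x > t}, so M = 21
x = rng.random(200)                     # dataset of size N = 200
y = f(x)

def emp_risk(t):                        # \hat{R}_N(h_t) with the binary loss
    return np.mean((x > t).astype(int) != y)

risks = np.array([emp_risk(t) for t in thresholds])
t_star = thresholds[risks.argmin()]     # h^* = argmin over H of the empirical risk
print(f"selected threshold = {t_star:.2f}   empirical risk = {risks.min():.3f}")
```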
For a fixed \(h_j\in\calH\), how does \(\widehat{R}_N(h_j)\) compare to \({R}(h_j)\)?
Observe that for \(h_j\in\calH\)
The empirical risk is a sum of iid random variables \[ \widehat{R}_N(h_j)=\frac{1}{N}\sum_{i=1}^{N} \indic{h_j(\bfx_i)\neq y_i} \]
\(\E{\widehat{R}_N(h_j)} = R(h_j)\)
\(\P{\abs{\widehat{R}_N(h_j)-{R}(h_j)}>\epsilon}\) is a statement about the deviation of a normalized sum of iid random variables from its mean
We’re in luck! Such bounds, known as concentration inequalities, are a well-studied subject
Markov's inequality: Let \(X\) be a non-negative real-valued random variable. Then for all \(t>0\) \[\P{X\geq t}\leq \frac{\E{X}}{t}.\]
Chebyshev's inequality: Let \(X\) be a real-valued random variable. Then for all \(t>0\) \[\P{\abs{X-\E{X}}\geq t}\leq \frac{\Var{X}}{t^2}.\]
Weak law of large numbers: Let \(\{X_i\}_{i=1}^N\) be i.i.d. real-valued random variables with finite mean \(\mu\) and finite variance \(\sigma^2\). Then \[\P{\abs{\frac{1}{N}\sum_{i=1}^N X_i-\mu}\geq\epsilon}\leq\frac{\sigma^2}{N\epsilon^2}\qquad\lim_{N\to\infty}\P{\abs{\frac{1}{N}\sum_{i=1}^N X_i-\mu}\geq \epsilon}=0.\]
By the weak law of large numbers, we know that \[ \forall\epsilon>0\quad\P[\{(\bfx_i,y_i)\}]{\abs{\widehat{R}_N(h_j)-{R}(h_j)}\geq\epsilon}\leq \frac{\Var{\indic{h_j(\bfx_1)\neq y_1}}}{N\epsilon^2}\leq \frac{1}{N\epsilon^2}\]
Given enough data, we can generalize
How much data? \(N\geq\lceil\frac{1}{\delta\epsilon^2}\rceil\) suffices to ensure \(\P{\abs{\widehat{R}_N(h_j)-{R}(h_j)}\geq\epsilon}\leq \delta\).
That’s not quite enough! We care about \(\widehat{R}_N(h^*)\) where \(h^*=\argmin_{h\in\calH}\widehat{R}_N(h)\)
\[ \P{\abs{\widehat{R}_N(h^*)-{R}(h^*)}\geq\epsilon} \leq \P{\exists j:\abs{\widehat{R}_N(h_j)-{R}(h_j)}\geq\epsilon} \] since \(h^*\) is one of the \(h_j\)
By the union bound and the Chebyshev-based bound for each \(h_j\), \[ \P{\abs{\widehat{R}_N(h^*)-{R}(h^*)}\geq\epsilon} \leq \frac{M}{N\epsilon^2} \]
If we choose \(N\geq\lceil\frac{M}{\delta\epsilon^2}\rceil\) we can ensure \(\P{\abs{\widehat{R}_N(h^*)-{R}(h^*)}\geq\epsilon}\leq \delta\).
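A quick simulation (toy numbers of my own choosing, with errors drawn independently across hypotheses for simplicity) illustrates why the factor \(M\) is needed: if every hypothesis is a pure random guesser with true risk \(1/2\), the empirically best one still looks much better than \(1/2\) on a small sample.

```python
# Sketch: selection bias over M hypotheses, i.e., why the union bound matters.
import numpy as np

rng = np.random.default_rng(3)
N, M = 50, 10_000                   # sample size and number of hypotheses
errors = rng.random((M, N)) < 0.5   # errors[j, i] = 1{h_j(x_i) != y_i}, each w.p. 1/2
emp_risks = errors.mean(axis=1)     # \hat{R}_N(h_j) for every hypothesis
print("true risk of every h_j: 0.500")
print(f"min_j empirical risk:  {emp_risks.min():.3f}")   # typically well below 0.5
```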
We can obtain much better bounds than with Chebyshev
Hoeffding's inequality: Let \(\{X_i\}_{i=1}^N\) be i.i.d. real-valued zero-mean random variables such that \(X_i\in[a_i;b_i]\) with \(a_i<b_i\). Then for all \(\epsilon>0\) \[\P{\abs{\frac{1}{N}\sum_{i=1}^N X_i}\geq\epsilon}\leq 2\exp\left(-\frac{2N^2\epsilon^2}{\sum_{i=1}^N(b_i-a_i)^2}\right).\]
In our learning problem \[ \forall\epsilon>0\quad\P{\abs{\widehat{R}_N(h_j)-{R}(h_j)}\geq\epsilon}\leq 2\exp(-2N\epsilon^2)\]
\[ \forall\epsilon>0\quad\P{\abs{\widehat{R}_N(h^*)-{R}(h^*)}\geq\epsilon}\leq 2M\exp(-2N\epsilon^2)\]
We can now choose \(N\geq \lceil\frac{1}{2\epsilon^2}\left(\ln \frac{2M}{\delta}\right)\rceil\) to ensure \(\P{\abs{\widehat{R}_N(h^*)-{R}(h^*)}\geq\epsilon}\leq \delta\)
Since \(N\) only needs to grow with \(\ln M\), \(M\) can be quite large (almost exponential in \(N\)) and, with enough data, we can still generalize \(h^*\).
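To see the gap between the two approaches, a short calculation (with arbitrary illustrative values of \(\epsilon\), \(\delta\), and \(M\)) of the sample sizes required by the Chebyshev-based and Hoeffding-based bounds:

```python
# Sketch: sample sizes guaranteeing P(|R_hat(h*) - R(h*)| >= eps) <= delta.
import math

eps, delta, M = 0.05, 0.05, 10**6
N_chebyshev = math.ceil(M / (delta * eps**2))                    # from M/(N eps^2) <= delta
N_hoeffding = math.ceil(math.log(2 * M / delta) / (2 * eps**2))  # from 2M exp(-2N eps^2) <= delta
print(f"Chebyshev + union bound: N >= {N_chebyshev:,}")   # ~ 8 billion
print(f"Hoeffding + union bound: N >= {N_hoeffding:,}")   # ~ 3.5 thousand
```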
How about learning \(h^{\sharp}\eqdef\argmin_{h\in\calH}R(h)\)?
If \(\forall h_j\in\calH\,\abs{\widehat{R}_N(h_j)-{R}(h_j)}\leq\epsilon\) then \(\abs{R(h^*)-{R}(h^\sharp)}\leq 2\epsilon\): indeed \(R(h^*)\leq\widehat{R}_N(h^*)+\epsilon\leq\widehat{R}_N(h^\sharp)+\epsilon\leq R(h^\sharp)+2\epsilon\), and \(R(h^\sharp)\leq R(h^*)\) by definition of \(h^\sharp\).
How do we make \(R(h^\sharp)\) small?
The function \(N_{\calH}(\epsilon,\delta)\) is called the sample complexity: the number of samples sufficient to guarantee accuracy \(\epsilon\) with confidence \(1-\delta\)
We have effectively already proved the following result
A finite hypothesis set \(\calH\) is PAC learnable with the Empirical Risk Minimization algorithm and with sample complexity \[N_\calH(\epsilon,\delta)={\lceil{\frac{2\ln(2\card{\calH}/\delta)}{\epsilon^2}}\rceil}\] (the extra factor of 4 compared to the earlier choice of \(N\) comes from requiring a uniform deviation of \(\epsilon/2\), so that \(R(h^*)\leq R(h^\sharp)+\epsilon\))
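As an illustration (numbers chosen arbitrarily): for \(\card{\calH}=1000\), \(\epsilon=0.1\), and \(\delta=0.05\), the theorem gives \(N_\calH(\epsilon,\delta)=\lceil 2\ln(40000)/0.01\rceil=2120\) samples.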
Ideally we want \(\card{\calH}\) small so that \(R(h^*)\approx R(h^\sharp)\) and get lucky so that \(R(h^*)\approx 0\)
In general this is not possible
Remember, we usually have to learn \(P_{y|\bfx}\), not a function \(f\)
Questions