Concentration inequalities are statements about the probability that some random variable deviates from its mean. If this probability is small, one can argue that actual realizations of the random variable are not too far from the average, which is often much easier to analyze. Concentration inequalities can be used as proof techniques to establish fairly fundamental results, such as the Johnson-Lindenstrauss Lemma used in compressed sensing or Shannon's channel coding theorem in information theory, or as a tool to analyze the performance of randomized algorithms, such as in statistical learning.
We outline here one such application in the context of risk minimization, which is particularly useful in machine learning. Assume that we are given some data and associated labels as a set of pairs $\{(x_i,y_i)\}_{i=1}^N$, obtained as iid realizations according to a distribution $p_{XY}$. The objective is to fit a model that explains how $y$ relates to $x$, described by a function $f:\calX\to\calY$ in a set $\calF$ with $\card{\calF}<\infty$. The quality of the fit is measured through a loss function $\ell:\calY\times\calY\rightarrow\mathbb{R}^+$. The true risk of a model $f$ is defined as
\begin{equation}
R(f) \triangleq \E{\ell(f(X),Y)},\label{eq:true_risk}
\end{equation}
where the expectation is with respect to $p_{XY}$. In practice, we usually do not have access to the distribution $p_{XY}$ and replace the true risk in \eqref{eq:true_risk} by the empirical risk, defined as
\begin{equation}
\widehat{R}_N(f) \triangleq \frac{1}{N}\sum_{i=1}^N \ell(f(x_i),y_i).\label{eq:empirical_risk}
\end{equation}
We then attempt to ``learn'' the model by finding the best model $f^* = \argmin_{f\in\calF}\widehat{R}_N(f)$. Since the learning is performed using the empirical risk, it is natural to wonder under which conditions $\widehat{R}_N(f^*)$ is not too different from $R(f^*)$. Intuitively, this should be true if the number of samples $N$ is large enough; concentration inequalities allow us to make this statement precise.
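The setup above can be made concrete with a minimal numerical sketch. The toy problem below (all names and parameters are hypothetical, not from the text) uses the 0-1 loss and a finite class of threshold classifiers: the empirical risk minimizer recovers a threshold close to the one that generated the labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: x uniform on [0, 1], labels y = 1{x >= 0.3}
# flipped with probability 0.1, and F the finite class of threshold
# classifiers f_c(x) = 1{x >= c} over 21 candidate thresholds c.
N = 2000
x = rng.uniform(size=N)
y = (x >= 0.3).astype(int)
flip = rng.uniform(size=N) < 0.1
y[flip] = 1 - y[flip]

thresholds = np.linspace(0.0, 1.0, 21)  # the finite class F, |F| = 21
# Empirical risk of each f in F under the 0-1 loss.
emp_risk = np.array([np.mean((x >= c).astype(int) != y) for c in thresholds])
c_star = thresholds[np.argmin(emp_risk)]  # empirical risk minimizer f*
```

With $N=2000$ samples the empirical minimizer lands on or next to the true threshold $0.3$, and its empirical risk is close to the $0.1$ label-noise floor.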
If we could show, and we will under some mild conditions, that there exists a constant $c>0$ such that for every $f\in\calF$ and $t>0$
\begin{equation}
\P{\abs{\widehat{R}_N(f)-R(f)}\geq t} \leq 2e^{-cNt^2},\label{eq:example_concentration_ineq}
\end{equation}
then a union bound ensures that
\begin{equation}
\P{\exists f\in\calF:\abs{\widehat{R}_N(f)-R(f)}\geq t} \leq 2\card{\calF}e^{-cNt^2}.
\end{equation}
Equivalently, for $\delta\in[0,1]$, with probability at least $1-\delta$,
\begin{equation}
\forall f\in\calF,\quad \abs{\widehat{R}_N(f)-R(f)} \leq \sqrt{\frac{1}{cN}\log\frac{2\card{\calF}}{\delta}}.
\end{equation}
This result is ``good news,'' because it tells us that the error we make when using the empirical risk instead of the true risk decays as $\sqrt{\frac{\log\card{\calF}}{N}}$; in other words, even if the family $\calF$ is large, a reasonable number of samples leads to a small error.
With our assumption on the data, notice that $\E{\ell(f(X_i),Y_i)} = R(f)$; therefore, \eqref{eq:example_concentration_ineq} is really a statement about the concentration of the sample average of $\ell(f(X_i),Y_i)$ around its mean $R(f)$. The crucial aspect of \eqref{eq:example_concentration_ineq} that makes it useful is the exponential concentration with the number of samples $N$ in the right-hand side, which is typically what concentration inequalities are about.
\begin{remark} This inequality only addresses part of the problem. In machine learning, we would not only like to ensure that $\widehat{R}_N(f^*)$ is close to $R(f^*)$ but also that $R(f^*)$ is small. For more on this, consider taking ECE 6254 ``Statistical Machine Learning.'' \end{remark}
Most concentration inequalities are astute specializations of Markov's inequality stated below.
Let $Z \in \mathbb{R}^+$ be such that $\E{\abs{Z}}<\infty$. Then, $\forall t > 0$,
\begin{equation}
\P{Z\geq t} \leq \frac{\E{Z}}{t}.\label{eq:MarkovInequality}
\end{equation}
For $t>0$, let $\indic{Z \geq t}$ be the indicator function of the event $\{Z\geq t\}$. Then,
\begin{equation}
t\indic{Z\geq t} \leq Z\indic{Z\geq t} \leq Z,
\end{equation}
and taking expectations on both sides yields $t\P{Z\geq t} \leq \E{Z}$.
By choosing $t=\delta\E{Z}$ for $\delta>0$ in \eqref{eq:MarkovInequality}, we obtain $\P{Z\geq\delta\E{Z}}\leq\frac{1}{\delta}$, which is consistent with the intuition that it is unlikely that a random variable takes a value very far away from its mean.
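As a quick sanity check, the following sketch (a hypothetical setup, not from the text) compares a Monte Carlo estimate of the tail of an exponential random variable with the Markov bound; the bound holds, though with a wide margin.

```python
import numpy as np

rng = np.random.default_rng(0)

# Nonnegative random variable: Z ~ Exp(1), so E[Z] = 1 and P[Z >= t] = e^{-t}.
z = rng.exponential(scale=1.0, size=100_000)

for t in [1.0, 2.0, 5.0]:
    tail = np.mean(z >= t)    # Monte Carlo estimate of P[Z >= t]
    bound = np.mean(z) / t    # Markov bound E[Z] / t (with the empirical mean)
    assert tail <= bound      # the bound holds, here with a wide margin
```

The looseness of the bound ($e^{-t}$ versus $1/t$) is exactly what the boosting discussion that follows addresses.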
In spite of its relative simplicity, Markov's inequality is a powerful tool because it can be ``boosted.'' For $Z \in \calZ \subset \mathbb{R}$, consider $\phi:\calZ\rightarrow\mathbb{R}^+$ increasing on $\calZ$ such that $\E{\abs{\phi(Z)}}<\infty$. Then,
\begin{equation}
\phi(t)\indic{Z\geq t} \leq \phi(Z)\indic{Z\geq t} \leq \phi(Z),
\end{equation}
where we have used the fact that $\phi$ is increasing and the fact that an indicator function is upper bounded by one. Applying Markov's inequality we obtain
\begin{equation}
\P{Z\geq t} \leq \P{\phi(Z)\geq \phi(t)} \leq \frac{\E{\phi(Z)}}{\phi(t)},
\end{equation}
which is potentially a better bound than \eqref{eq:MarkovInequality}. Of course, the difficulty is in choosing the appropriate function $\phi$ to make the result meaningful. The most well-known application of this concept leads to Chebyshev's inequality.
Let $X \in \mathbb{R}$ with $\Var{X}<\infty$. Then, for all $t>0$,
\begin{equation}
\P{\abs{X-\E{X}}\geq t} \leq \frac{\Var{X}}{t^2}.
\end{equation}
Define $Y \triangleq \abs{X - \E{X}}$ and $\phi: \mathbb{R}^+\to\mathbb{R}^+: t \mapsto t^2$. Then, by the boosted Markov's inequality, we obtain
\begin{equation}
\P{\abs{X-\E{X}}\geq t} = \P{Y\geq t} \leq \frac{\E{Y^2}}{t^2} = \frac{\Var{X}}{t^2}.
\end{equation}
As an application of Chebyshev's inequality, we derive the weak law of large numbers.
Let $X_i \sim p_{X_i}$ be independent with $\E{\abs{X_i}} < \infty$ and $\Var{X_i} \leq M$ for some $M\in\mathbb{R}^+$. Then $\frac{1}{n}\sum_{i=1}^n X_i - \frac{1}{n}\sum_{i=1}^n\E{X_i}$ converges in probability to $0$ as $n\to\infty$.
First observe that
\begin{equation}
\Var{\frac{1}{n}\sum_{i=1}^n X_i} = \frac{1}{n^2}\sum_{i=1}^n \Var{X_i} \leq \frac{M}{n}.
\end{equation}
Therefore, by Chebyshev's inequality, for any $t>0$,
\begin{equation}
\P{\abs{\frac{1}{n}\sum_{i=1}^n \left(X_i - \E{X_i}\right)}\geq t} \leq \frac{M}{nt^2} \xrightarrow[n\to\infty]{} 0.
\end{equation}
The weak law of large numbers essentially states that $\frac{1}{n} \sum_{i=1}^n X_i$ concentrates around its average. Note, however, that the convergence we proved is rather slow, on the order of $1/n$, and far from the exponential concentration suggested in \eqref{eq:example_concentration_ineq}. We could attempt to further improve upon Chebyshev's inequality by choosing a boosting function $\phi:t\mapsto t^q$ for $q \in \mathbb{N}\setminus\{0,1\}$. If $\forall q \in \mathbb{N}\setminus\{0,1\}$, $\E{\abs{X - \E{X}}^q} < \infty$, we obtain
\begin{equation}
\P{\abs{X-\E{X}}\geq t} \leq \min_{q} \frac{\E{\abs{X-\E{X}}^q}}{t^q}.
\end{equation}
This might come in handy if one has access to higher order absolute moments, but it will not give us exponential concentration.
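The $1/n$ decay of the Chebyshev bound for sample means can be observed numerically. The sketch below (a hypothetical setup with uniform variables) estimates the deviation probability of the sample mean and checks it against the bound $\sigma^2/(nt^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
t = 0.1  # deviation threshold

# X_i uniform on [-1, 1]: mean 0, variance 1/3; the sample mean has
# variance 1/(3n), so Chebyshev gives P[|mean| >= t] <= 1/(3 n t^2).
for n in [100, 1000]:
    means = rng.uniform(-1.0, 1.0, size=(5000, n)).mean(axis=1)
    tail = np.mean(np.abs(means) >= t)
    bound = 1.0 / (3 * n * t**2)
    assert tail <= min(bound, 1.0)
```

The bound shrinks by a factor of 10 when $n$ grows by a factor of 10, while the true tail (roughly Gaussian here) shrinks much faster, hinting that exponential bounds are attainable.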
The trick to obtain exponential concentration is to boost Markov's inequality with functions of the form $\phi_{\lambda}:t\mapsto e^{\lambda t}$ for $\lambda \in \mathbb{R}^+$. The resulting bounds are often known as Chernoff bounds. Note that for a real-valued random variable $Z$ and $t>0$,
\begin{equation}
\{\abs{Z-\E{Z}}\geq t\} = \{Z-\E{Z}\geq t\} \cup \{\E{Z}-Z\geq t\}.
\end{equation}
Applying the union bound,
\begin{equation}
\P{\abs{Z-\E{Z}}\geq t} \leq \P{Z-\E{Z}\geq t} + \P{\E{Z}-Z\geq t}.
\end{equation}
Setting $\tilde{Z} \triangleq Z - \E{Z}$ or $\tilde{Z} \triangleq \E{Z} - Z$, the problem of deriving concentration inequalities is tantamount to studying $\P{\tilde{Z} \geq t}$ where $\tilde{Z} \in \mathbb{R}$ is centered. We will make this assumption from now on, which simplifies the analysis and notation without losing generality.
If $Z$ is centered and real-valued, we have for all $\lambda\in\mathbb{R}^+$ and $t>0$
\begin{equation}
\P{Z\geq t} \leq \frac{\E{e^{\lambda Z}}}{e^{\lambda t}} = e^{-(\lambda t - \psi_Z(\lambda))}.\label{eq:chernoff_bound}
\end{equation}
For $\lambda\in\mathbb{R}$, $\E{e^{\lambda Z}}$ is the moment generating function (MGF) of $Z$, and $\psi_Z(\lambda) \triangleq \log \E{e^{\lambda Z}}$ is the cumulant generating function (CGF) of $Z$. We recall some of the properties of the CGF.
\label{prop:properties-cumulant} Let $Z$ be centered and real-valued such that $\E{e^{\lambda Z}} < \infty$ for all $\abs{\lambda} < \epsilon$ for some $\epsilon>0$. Then, the CGF satisfies the following properties.
$\psi_Z$ is infinitely differentiable on $]-\epsilon,\epsilon[$. In particular, $\psi_Z'(0) = \psi_Z(0) = 0$;
$\psi_Z(\lambda)\geq \lambda\E{Z}=0$;
If $Z = \sum_{i=1}^n X_i$ with $X_i$ independent with well-defined CGFs, then $\psi_Z(\lambda)=\sum_{i=1}^n\psi_{X_i}(\lambda)$.
We skip the subtleties behind the proof of differentiability, which essentially follows from the dominated convergence theorem. We will also happily swap derivatives and integrals without worrying too much. By definition, $\psi_Z(0) = \log \E{1}=0$. In addition,
\begin{equation}
\psi_Z'(\lambda) = \frac{\E{Z e^{\lambda Z}}}{\E{e^{\lambda Z}}},
\end{equation}
so that $\psi_Z'(0)=0$ since $\E{Z}=0$. For the second part, note that by Jensen's inequality
\begin{equation}
\psi_Z(\lambda) = \log \E{e^{\lambda Z}} \geq \E{\log e^{\lambda Z}} = \lambda\E{Z} = 0.
\end{equation}
For the third part, we have
\begin{equation}
\psi_Z(\lambda) = \log \E{e^{\lambda \sum_{i=1}^n X_i}} = \log \prod_{i=1}^n \E{e^{\lambda X_i}} = \sum_{i=1}^n \psi_{X_i}(\lambda),
\end{equation}
where the second equality uses the independence of the $X_i$.
Since our goal is to find the best upper bound in the right-hand side of \eqref{eq:chernoff_bound}, it is natural to maximize over all $\lambda\in\mathbb{R}^+$ to obtain
\begin{equation}
\P{Z\geq t} \leq e^{-\sup_{\lambda\in\mathbb{R}^+}\left(\lambda t - \psi_Z(\lambda)\right)}.\label{eq:chernoff_bound_optimized}
\end{equation}
\label{def:Cramer-transform} For a real-valued centered random variable $Z$ with cumulant generating function $\psi_Z$, the Cramer transform of $\psi_Z$ is $\psi_Z^*$ defined as
\begin{equation}
\psi_Z^*(t) \triangleq \sup_{\lambda\in\mathbb{R}^+}\left(\lambda t - \psi_Z(\lambda)\right).\label{eq:cramer_transform}
\end{equation}
Note that $\psi_Z^*(t)\geq 0\cdot t-\psi_Z(0)=0$, so that $\psi_Z^*(t) \in \mathbb{R}^+$. In general, the Cramer transform only takes simple values in trivial cases. If $\psi_Z(\lambda) = \infty$ for all $\lambda > 0$, then $\psi_Z^*(t) = 0$. If $t < \E{Z}=0$, then $\psi_Z^*(t)=0$. If $t\geq \E{Z}$, then, for all $\lambda<0$, $\lambda t-\psi_Z(\lambda)\leq 0$. Consequently, the bound in \eqref{eq:chernoff_bound_optimized} is only useful when $t\geq \E{Z}$, in which case we can equivalently maximize over $\lambda\in\mathbb{R}$ in \eqref{eq:cramer_transform}. When the supremum is attained, we can write $\psi_Z^*(t) = \lambda_t t -\psi_Z(\lambda_t)$ with $\lambda_t$ such that $\psi_Z'(\lambda_t) = t$.
Let $Z \sim \mathcal{N}(0,\sigma^2)$. Then,
\begin{equation}
\E{e^{\lambda Z}} = \int_{-\infty}^{\infty} e^{\lambda z} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{z^2}{2\sigma^2}}\,dz = e^{\frac{\lambda^2 \sigma^2}{2}}.
\end{equation}
Hence $\psi_Z(\lambda) = \log e^{\frac{\lambda^2 \sigma^2}{2}} = \frac{\lambda^2 \sigma^2}{2}$. Then $\psi_Z'(\lambda) = \lambda \sigma^2$, so that $\psi_Z'(\lambda_t) = t \Leftrightarrow \lambda_t \sigma^2 = t \Leftrightarrow \lambda_t = \frac{t}{\sigma^2}$ and
\begin{equation}
\psi_Z^*(t) = \lambda_t t - \psi_Z(\lambda_t) = \frac{t^2}{\sigma^2} - \frac{t^2}{2\sigma^2} = \frac{t^2}{2\sigma^2}.
\end{equation}
Hence, $\P{Z \geq t} \leq e^{-\frac{t^2}{2 \sigma^2}}$.
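The Gaussian Chernoff bound $\P{Z\geq t}\leq e^{-t^2/2\sigma^2}$ is easy to check by simulation; the sketch below (hypothetical parameters) compares it with a Monte Carlo estimate of the tail.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0

# Z ~ N(0, sigma^2); the Chernoff bound gives P[Z >= t] <= exp(-t^2 / (2 sigma^2)).
z = rng.normal(0.0, sigma, size=1_000_000)

for t in [1.0, 2.0, 4.0]:
    tail = np.mean(z >= t)                   # Monte Carlo estimate of P[Z >= t]
    bound = np.exp(-t**2 / (2 * sigma**2))   # Chernoff bound
    assert tail <= bound
```

The bound captures the exponential decay in $t^2$, although it is off by the usual polynomial prefactor of the exact Gaussian tail.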
The pleasingly simple form of the Chernoff bound for $Z\sim\calN(0,\sigma^2)$ stems from the simple form of the CGF. This naturally leads to the following definition.
$Z \in \mathbb{R}$ is sub-Gaussian if $\exists \sigma^2 \in \mathbb{R}_*^+$ such that $\forall \lambda\in\mathbb{R}$, $\psi_Z(\lambda) \leq \frac{\lambda^2 \sigma^2}{2}$.
If $Z$ is sub-Gaussian, then $\forall \lambda \in \mathbb{R}$, $\lambda t - \psi_Z(\lambda) \geq \lambda t - \frac{\lambda^2 \sigma^2}{2}$. In this case,
\begin{equation}
\psi_Z^*(t) \geq \sup_{\lambda\in\mathbb{R}}\left(\lambda t - \frac{\lambda^2 \sigma^2}{2}\right) = \frac{t^2}{2\sigma^2},
\end{equation}
so that $\P{Z\geq t}\leq e^{-\frac{t^2}{2\sigma^2}}$. Consequently, proving sub-Gaussianity is a proxy for obtaining exponential concentration.
As an application, we establish the celebrated Hoeffding's inequality. We start by proving that some variables are sub-Gaussian.
\label{lm:hoeffding} Let $Y$ be a random variable such that $\E{Y}=0$ and $Y \in [a,b]$. Then $Y$ is sub-Gaussian, and more specifically $\psi_{Y}(\lambda) \leq \frac{\lambda^2 (b-a)^2}{8}$.
Recall that $\psi_Y(\lambda) = \log \E{e^{\lambda Y}}$; we bound $\E{e^{\lambda Y}}$. Since $f:x\mapsto e^{\lambda x}$ is convex, for every $y\in[a,b]$ we can write
\begin{equation}
e^{\lambda y} \leq \gamma e^{\lambda a} + (1 - \gamma) e^{\lambda b} \quad\text{with}\quad \gamma = \frac{b-y}{b-a},
\end{equation}
so that $y = \gamma a + (1 - \gamma) b$. Hence, taking expectations and using $\E{Y}=0$,
\begin{equation}
\E{e^{\lambda Y}} \leq \frac{b}{b-a} e^{\lambda a} - \frac{a}{b-a} e^{\lambda b},
\end{equation}
and, setting $p \triangleq \frac{-a}{b-a} \in [0,1]$ and $u \triangleq \lambda(b-a)$,
\begin{equation}
\log\E{e^{\lambda Y}} \leq \ln\left(1-p+p e^{u}\right) - p u.
\end{equation}
Consider the function $g_p:x\mapsto \ln(1-p +p e^x) - p x$; one can show that $g_p(x) \leq \frac{x^2}{8}$ for all $x\in\mathbb{R}$. Hence $\E{e^{\lambda Y}} \leq \exp\left( \frac{\lambda^2 (b-a)^2}{8} \right)$.
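The inequality $g_p(x)\leq x^2/8$ at the heart of this proof can be verified numerically over a grid (a sanity-check sketch, not a proof):

```python
import numpy as np

# Numerical check of the key inequality in Hoeffding's lemma:
# g_p(x) = ln(1 - p + p e^x) - p x <= x^2 / 8 for p in [0, 1] and x real.
x = np.linspace(-10.0, 10.0, 2001)
for p in np.linspace(0.01, 0.99, 50):
    g = np.log(1.0 - p + p * np.exp(x)) - p * x
    assert np.all(g <= x**2 / 8 + 1e-12)  # small tolerance for roundoff
```

Equality holds at $x=0$, which is why the constant $1/8$ cannot be improved.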
Here is an alternative proof. Let $Y \sim p_Y$ be a random variable such that $\E{Y} = 0$ and $Y\in [a,b]$. Then $a \leq Y \leq b$ and $\frac{a-b}{2} \leq Y - \frac{a+b}{2} \leq \frac{b-a}{2}$, so that $\abs{ Y - \frac{a+b}{2} } \leq \frac{b-a}{2}$ and $\Var{Y} \leq \frac{(b-a)^2}{4}$. Define $Z \in [a,b]$ with the tilted distribution $p_Z(y) = \frac{e^{\lambda y}}{\E{e^{\lambda Y}}} p_Y(y)$. Since $Z\in[a,b]$, we also have $\Var{Z} \leq \frac{(b-a)^2}{4}$. On the other hand,
\begin{equation}
\psi_Y'(\lambda) = \frac{\E{Y e^{\lambda Y}}}{\E{e^{\lambda Y}}} = \E{Z}.
\end{equation}
Note that
\begin{equation}
\psi_Y''(\lambda) = \frac{\E{Y^2 e^{\lambda Y}}}{\E{e^{\lambda Y}}} - \left(\frac{\E{Y e^{\lambda Y}}}{\E{e^{\lambda Y}}}\right)^2 = \E{Z^2} - \E{Z}^2 = \Var{Z}.
\end{equation}
Then $\psi_Y''(\lambda) \leq \frac{(b-a)^2}{4}$, and $\psi_Y(0) = \psi_Y'(0) = 0$. From Taylor's theorem, $\exists c \in [0,\lambda]$ such that
\begin{equation}
\psi_Y(\lambda) = \psi_Y(0) + \lambda \psi_Y'(0) + \frac{\lambda^2}{2}\psi_Y''(c) \leq \frac{\lambda^2}{2}\cdot\frac{(b-a)^2}{4}.
\end{equation}
Therefore, $\psi_Y(\lambda) \leq \frac{\lambda^2(b-a)^2}{8}$.
Consider independent random variables $X_i$ with $\E{X_i} = 0$ and $X_i \in [a_i,b_i]$. Let $Y = \sum_{i=1}^n X_i$. Then, for every $t>0$,
\begin{equation}
\P{\abs{Y} \geq t} \leq 2\exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i-a_i)^2}\right).
\end{equation}
The proof follows by combining Lemma \ref{lm:hoeffding} with Proposition \ref{prop:properties-cumulant} to obtain
\begin{equation}
\psi_Y(\lambda) = \sum_{i=1}^n \psi_{X_i}(\lambda) \leq \frac{\lambda^2\sigma^2}{2} \quad\text{with}\quad \sigma^2 = \frac{1}{4}\sum_{i=1}^n (b_i-a_i)^2,
\end{equation}
so that $Y$ is sub-Gaussian and $\P{\abs{Y}\geq t}\leq 2e^{-\frac{t^2}{2\sigma^2}}$.
If the $X_i \sim \mathcal{B}\left( \frac{1}{2} \right)$ are uniform on $\{-1,1\}$, then
\begin{equation}
\P{\abs{Y} \geq t} \leq 2e^{-\frac{t^2}{2n}}.
\end{equation}
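This Rademacher case is easy to simulate. The sketch below (hypothetical sample sizes) estimates the two-sided tail of $Y=\sum_{i=1}^n X_i$ by Monte Carlo and checks it against the Hoeffding bound $2e^{-t^2/2n}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 20_000

# Y = sum of n iid uniform {-1, +1} variables; Hoeffding gives
# P[|Y| >= t] <= 2 exp(-t^2 / (2n)).
x = rng.integers(0, 2, size=(trials, n), dtype=np.int8) * 2 - 1
y = x.sum(axis=1, dtype=np.int64)

for t in [50, 100]:
    tail = np.mean(np.abs(y) >= t)        # Monte Carlo estimate of P[|Y| >= t]
    bound = 2 * np.exp(-t**2 / (2 * n))   # Hoeffding bound
    assert tail <= bound
```

Since $Y$ has standard deviation $\sqrt{n}\approx 31.6$ here, $t=100$ is a roughly $3\sigma$ deviation, and the exponential bound is already within an order of magnitude of the true tail.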
We have only touched upon concentration inequalities; there is an entire field of research devoted to proving such results in intricate situations. Two great references are (Boucheron, Lugosi, & Massart, 2016), from which I borrowed most of the ideas, and (Raginsky & Sason, 2014).
@book{ConcentrationInequalities, title = {Concentration Inequalities: A Nonasymptotic Theory of Independence}, publisher = {Oxford University Press}, year = {2016}, author = {Boucheron, St\'ephane and Lugosi, G\'abor and Massart, Pascal}, edition = {1st}, month = apr }
@article{Raginsky2014, author = {Raginsky, Maxim and Sason, Igal}, title = {Concentration of Measure Inequalities in Information Theory, Communications, and Coding}, journal = {Foundations and Trends in Communications and Information Theory}, year = {2014}, month = sep, doi = {10.1561/0100000064}, publisher = {Now Publishers} }