# Concentration inequalities: why should you care?

Concentration inequalities are statements about the probability that some random variable deviates from its mean. If this probability is small, one can argue that actual realizations of the random variable are not too far from the average, which is often much easier to analyze. Concentration inequalities can be used as proof techniques to establish fairly fundamental results, such as the Johnson-Lindenstrauss Lemma used in compressed sensing or Shannon’s channel coding theorem in information theory, or as a tool to analyze the performance of randomized algorithms, such as in statistical learning.

We outline here one such application in the context of risk minimization, which is particularly useful in machine learning. Assume that we are given some data and associated labels as a set of pairs $\{(x_i,y_i)\}_{i=1}^N$, obtained as iid realizations according to a distribution $p_{XY}$. The objective is to fit a model to explain how $y$ relates to $x$, described by a function $f:\calX\to\calY$ in a set $\calF$, $\card{\calF}<\infty$. The quality of the fit is measured through a loss function $\ell:\calY\times\calY\rightarrow\mathbb{R}^+$. The true risk function is defined as \begin{align} R(f) \eqdef \E[p_{XY}]{\ell(f(X),Y)}\label{eq:true_risk}\end{align} In practice, we usually do not have access to the distribution $p_{XY}$ and replace the true risk in \eqref{eq:true_risk} [eq:true_risk]{reference-type=”eqref” reference=”eq:true_risk”} by the empirical risk, defined as \begin{align} \widehat{R}_N(f) \eqdef \frac{1}{N}\sum_{i=1}^N\ell(f(x_i),y_i).\label{eq:empirical_risk}\end{align} We then attempt to “learn” the model by finding the best model $f^* = \argmin_{f\in\mathcal{F}}\widehat{R}_N(f)$. Since the learning is performed using the empirical risk, it is natural to wonder under which conditions $\widehat{R}_N(f^*)$ is not too different from $R(f^*)$. Intuitively, this should be true if the number of samples $N$ is large enough; concentration inequalities allow us to make this statement precise.

If we could show, and we will under some mild conditions, that there exists a constant $c>0$ such that \begin{align} \forall f\in\calF \quad\forall \epsilon>0\quad\P{\abs{R(f)-\widehat{R}_N(f)}>\epsilon}\leq \exp\left(-c \epsilon^2 N\right),\label{eq:example_concentration_ineq}\end{align} then a union bound ensures that \begin{align} \P{\exists f\in\calF:\abs{R(f)-\widehat{R}_N(f)}>\epsilon}\leq \card{\calF}\exp\left(-c \epsilon^2 N\right).\end{align} Equivalently, for $\delta\in[0,1]$, with probability at least $1-\delta$ % This result is “good news,” because it tells us that the error we make when using the empirical risk instead of the true risk decays as $\frac{\log\card{\calF}}{N}$; in other words, even if the family $\calF$ is large, a reasonable number of samples leads to a small error.

With our assumption on the data, notice that $\E{\ell(f(X_i),Y_i)} = R(f)$; therefore, [eq:example_concentration_ineq]{reference-type=”eqref” reference=”eq:example_concentration_ineq”} is really a statement about the concentration of sample average of $\ell(f(X_i),Y_i)$ around the mean $R(f)$. The crucial aspect of [eq:example_concentration_ineq]{reference-type=”eqref” reference=”eq:example_concentration_ineq”} that makes it useful is the exponential concentration with the number of samples $N$ in the right-hand side, which is typically what concentration inequalities are about.

\begin{remark} This inequality only addresses part of the problem. In machine learning, we would not only like to ensure that $\widehat{R}_N(f^*)$ is close to $R(f^*)$ but also that $R(f^*)$ is small. For more on this, consider taking ECE 6254 “Statistical Machine Learning.” \end{remark}

# Markov’s inequality and improvements

Most concentration inequalities are astute specializations of Markov’s inequality stated below.

Let $Z \in \mathbb{R}^+$ be such that $\E{\abs{Z}}<\infty$. Then $\forall t \geq 0$ $\mathbb{P}[Z \geq t] \leq \frac{\mathbb{E}[Z]}{t}.\label{eq:MarkovInequality}$

For $t\in\mathbb{R}$, let $\indic{Z \geq t}$ be the indicator function of the event ${Z\geq t}$. Then, \begin{align} \mathbb{E}[Z] \geq \mathbb{E}[Z \indic{Z \geq t}] \geq t \mathbb{P}[Z \geq t].\end{align}

By choosing $t=\delta\E{Z}$ for $\delta>0$ in [eq:MarkovInequality]{reference-type=”eqref” reference=”eq:MarkovInequality”}, we obtain $\P{Z\geq\delta\E{Z}}\leq\frac{1}{\delta}$, which is consistent with the intuition that it is unlikely that a random variable takes a value very far away from its mean.

In spite of its relative simplicity, Markov’s inequality is a powerful tool because it can be “boosted.” For $Z \in \calZ \subset \mathbb{R}$, consider $\phi:\calZ\rightarrow\mathbb{R}^+$ increasing on $\calZ$ such that $\E{\abs{\phi(Z)}}<\infty$. Then, $\mathbb{P}[Z \geq t] = \mathbb{E}[\indic{Z \geq t}] = \mathbb{E}[{\indic{Z \geq t}} \indic{\phi(Z) \geq \phi(t)}] \leq \mathbb{P}[\phi(Z) \geq \phi(t)],$ where we have used the definition of $\phi$ and the fact that an indicator function is upper bounded by one. Applying Markov’s inequality we obtain $\mathbb{P}[Z \geq t] \leq \frac{\mathbb{E}[\phi(Z)]}{\phi(t)},$ which is potentially a better bound than [eq:MarkovInequality]{reference-type=”eqref” reference=”eq:MarkovInequality”}. Of course, the difficulty is in choosing the appropriate function $\phi$ to make the result meaningful. The most well-known application of this concept leads to Chebyshev’s inequality.

Let $X \in \mathbb{R}$. Then, $\mathbb{P}[|X - \E{X}| \geq t] \leq \frac{\Var{X}}{t^2}.$

Define $Y \triangleq |X - \E{X}|$ and $\phi: \mathbb{R}^+\to\mathbb{R}^+: t \mapsto t^2$. Then, by the boosted Markov’s inequality we obtain $\mathbb{P}[Y \geq t] \leq \frac{\mathbb{E}[Y^2]}{t^2}=\frac{\Var{X}}{t^2}.$

As an application of Chebychev’s inequality, we derive the weak law of large numbers.

Let $X_i \sim p_{X_i}$ be independent with $\mathbb{E}[\abs{X_i}] < \infty$ and $\Var{X_i} < M$ for some $M\in\mathbb{R}^+$. Define $Z=\sum_{i=1}^n X_i$ for $n \in \mathbb{N}$. Then $Z$ converges in probability to $\frac{1}{n}\sum_{i=1}^n\E{X_i}$.

First observe that $\mathbb{E}[Z] = \sum_{i=1}^n \mathbb{E}[X_i] \quad\text{and}\quad \Var{Z} = \sum_{i=1}^n \Var{X_i}.$ Therefore, $% $

The weak law of large numbers is essentially stating that $\frac{1}{n} \sum_{i=1}^n X_i$ concentrates around its average. Note, however, that the convergence we proved is rather slow, on the order of $1/n$, and far from the exponential concentration suggested in [eq:example_concentration_ineq]{reference-type=”eqref” reference=”eq:example_concentration_ineq”}. We could attempt to further improve upon Chebyshev’s inequality by choosing a boosting function $\phi:t\rightarrow t^q$ for $q \in \mathbb{N}\setminus {0,1}$. Then, $\mathbb{P}[|X - \E{X}| \geq t] \leq \frac{\mathbb{E}[|X - \E{X}|^q]}{t^q}.$ If $\forall q \in \mathbb{N}\setminus {0,1}$, $\mathbb{E}[|X - \E{X}|^q] < \infty$, we obtain $\mathbb{P}[|X - \E{X}| \geq t] \leq \inf_{q\in \mathbb{N}\setminus \{0,1\}} \frac{\mathbb{E}[|X - \E{X}|^q]}{t^q}.$ This might come in handy if one has access to higher order absolute moments, but this will not give us exponential concentration.

# Chernoff bounds

The trick to obtain exponential concentration is to boost Markov’s inequality with functions of the form $\phi_{\lambda}:t\rightarrow e^{\lambda t}$ for $\lambda \in \mathbb{R}^+$. The resulting bounds are often known as Chernoff bounds. Note that for a real-valued random variable $Z$ $\forall \lambda \in \mathbb{R}^+ \qquad \mathbb{P}[|Z-\mathbb{E}[Z]| \geq t] \leq \frac{\E{\phi_\lambda(\abs{Z-\E{Z}})}}{\phi_\lambda(t)}=e^{-\lambda t} \mathbb{E}\left[e^{\lambda |Z-\mathbb{E}[Z]|}\right]$ Applying the union bound $\mathbb{P}[|Z-\mathbb{E}[Z]| \geq t] = \mathbb{P}[Z-\mathbb{E}[Z] \geq t] + \mathbb{P}[\mathbb{E}[Z]-Z \geq t].$ Setting $\tilde{Z} \triangleq Z - \mathbb{E}[Z]$ or $\tilde{Z} \triangleq \mathbb{E}[Z] - Z$, the problem of deriving concentration inequalities is tantamount to studying $P[\tilde{Z} \geq t]$ where $\tilde{Z} \in \mathbb{R}$ is centered. We will make this assumption from now on to simplify analysis and notation without losing generality.

If $Z$ centered and real-valued, we have $\forall \lambda\in\mathbb{R}^+\quad \mathbb{P}[Z \geq t] \leq e^{-\lambda t} \mathbb{E}[e^{\lambda Z}] \label{eq:chernoff_bound}.$ For $\lambda\in\mathbb{R}$, $\mathbb{E}[e^{\lambda Z}]$ is the [MGF]{acronym-label=”MGF” acronym-form=”singular+short”} of $Z$, and $\psi_Z(\lambda) \triangleq \log \mathbb{E}\left[ e^{\lambda Z} \right]$ is the [CGF]{acronym-label=”CGF” acronym-form=”singular+short”} of $Z$. We recall some of the properties of the [CGF]{acronym-label=”CGF” acronym-form=”singular+short”}.

[[prop:properties-cumulant]]{#prop:properties-cumulant label=”prop:properties-cumulant”} Let $Z$ be centered and real-valued such that $\mathbb{E}[e^{\lambda Z}] < \infty$ for all $\abs{\lambda} < \epsilon$ for some $\epsilon>0$. Then, the [CGF]{acronym-label=”CGF” acronym-form=”singular+short”} satisfies the following properties.

1. $\psi_Z$ is infinitely differentiable on $]-\epsilon,\epsilon[$. In particular, $\psi_Z’(0) = \psi_Z(0) = 0$;

2. $\psi_Z(\lambda)\geq \lambda\E{Z}=0$;

3. If $Z = \sum_{i=1}^n X_i$ with $X_i$ independent with well defined [CGFs]{acronym-label=”CGF” acronym-form=”plural+short”}, $\psi_Z(\lambda)=\sum_{i=1}^n\psi_{X_i}(\lambda)$.

We skip the subtleties behind proof of differentiability, which essentially follows from the dominated convergence theorem. We will also happily swap derivatives an integrals without worrying too much. By definition, $\psi_Z(0) = \log \E{1}=0$. In addition, \begin{align} \frac{d\psi}{d\lambda}(\lambda) = \frac{\E{Ze^{\lambda Z}}}{\E{e^{\lambda Z}}}, \end{align} so that $\frac{d\psi}{d\lambda}(0)=0$ since $\E{Z}=0$. For the second part, note that by Jensen’s inequality \begin{align} \psi_{Z}(\lambda) = \log \E{e^{\lambda Z}} \geq \E{\log e^{\lambda Z}} = \lambda\E{Z}. \end{align} For the third part, we have $\mathbb{E}\left[e^{\lambda Z}\right] = \mathbb{E}\left[e^{\lambda \sum_{i=1}^n X_i}\right] = \mathbb{E}\left[\prod_{i=1}^n e^{\lambda X_i} \right] = \prod_{i=1}^n \mathbb{E}\left[ e^{\lambda X_i} \right] = e^{ \sum_{i=1}^n\log \mathbb{E}\left[ e^{\lambda X_i} \right] }$

Since our goal is to find the best upper bound in the right-hand side of [eq:chernoff_bound]{reference-type=”eqref” reference=”eq:chernoff_bound”}, it is natural to maximize over all $\lambda\in\mathbb{R}^+$ to obtain \begin{align} \mathbb{P}[Z \geq t] \leq \exp\left(-\sup_{\lambda\in\mathbb{R}^+}(\lambda t - \psi_Z(\lambda))\right)\label{eq:chernoff_bound_optimized}.\end{align}

[[def:Cramer-transform]]{#def:Cramer-transform label=”def:Cramer-transform”} For a real-valued centered random variable $Z$ with cumulant generating function $\psi_Z$, the Cramer transform of $\psi_Z$ is $\psi_Z^*$ defined as $\forall t \in \mathbb{R}^+ \qquad \psi_Z^*(t) \triangleq \sup_{\lambda\in\mathbb{R}^+} \left( \lambda t - \psi_Z(\lambda) \right).\label{eq:cramer_transform}$

Note that $\psi_Z^*(t)\geq -\psi_Z(0)=0$ so that $\psi_Z^*(t) \in \mathbb{R}^+$. in general, the Cramer transform only takes simple values in trivial cases. If $\forall \lambda \in \mathbb{R}^+$, $\psi_Z(\lambda) = \infty$ then $\psi_Z^*(t) = 0$. If $t < \E{Z}=0$ then $\psi_Z^*(t)=0$. If $t\geq \E{Z}$ then, for all $\lambda<0$, $\lambda t-\psi_Z(\lambda)\leq 0$. Consequently, the bound in [eq:chernoff_bound_optimized]{reference-type=”eqref” reference=”eq:chernoff_bound_optimized”} is only useful when $t\geq \E{Z}$, in which case we can maximize over $\lambda\in\mathbb{R}$ in [eq:cramer_transform]{reference-type=”eqref” reference=”eq:cramer_transform”}. Note that we can write $\psi_Z^*(t)$ as $\psi_Z^*(t) = \lambda_t t -\psi_Z(\lambda_t)$ with $\lambda_t$ such that $\psi_Z’(\lambda_t) = t$.

Let $Z \sim \mathcal{N}(0,\sigma^2)$. Then, $\mathbb{E}[e^{\lambda Z}] = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi} \sigma} e^{-\frac{z^2}{2\sigma^2}} e^{\lambda z} \text{d}z = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi} \sigma} e^{-\frac{(z-\lambda \sigma^2)^2}{2\sigma^2}} e^{\frac{\lambda^2 \sigma^2}{2}} \text{d}z = e^{\frac{\lambda^2 \sigma^2}{2}}.$ Hence $\psi_Z(\lambda) = \log e^{\frac{\lambda^2 \sigma^2}{2}} = \frac{\lambda^2 \sigma^2}{2}$. Then $\psi_Z’(\lambda) = \lambda \sigma^2$ so that $\psi_Z’(\lambda_t) = t \Leftrightarrow \lambda_t \sigma^2 = t \Leftrightarrow \lambda_t = \frac{t}{\sigma^2}$ and \begin{align} \psi_Z^\*(t)=\frac{t^2}{\sigma^2} - \frac{t^2}{2\sigma^2} = \frac{t^2}{2\sigma^2}.\end{align} Hence, $\mathbb{P}[Z \geq t] \leq e^{-\frac{t^2}{2 \sigma^2}}$.

The pleasingly simple form of the Chernoff bound for $Z\sim\calN(0,1)$ stems from the simple form of the [CGF]{acronym-label=”CGF” acronym-form=”singular+short”}. This naturally leads to the following definition.

$Z \in \mathbb{R}$ is subgaussian if $\exists \sigma^2 \in \mathbb{R}_*^+$ such that $\forall \lambda\in\mathbb{R}$, $\psi_Z(\lambda) \leq \frac{\lambda^2 \sigma^2}{2}$

If $Z$ is subgaussian then $\forall \lambda \in \mathbb{R}$, $\lambda t - \psi_Z(\lambda) \geq \lambda t - \frac{\lambda^2 \sigma^2}{2}$. In this case $\psi_Z^\*(\lambda) \geq \sup_{\lambda\in\mathbb{R}}\left( \lambda t - \frac{\lambda^2 \sigma^2}{2} \right) = \frac{t^2}{2 \sigma^2}.$ Consequently, proving sub-Gaussianity is a proxy for obtaining exponential concentration.

# Hoeffding’s inequality

As an application, we establish the celebrated Hoeffding’s inequality. We start by proving that some variables are sub-Gaussian.

[[lm:hoeffding]]{#lm:hoeffding label=”lm:hoeffding”} Let a random variable $Y$ such that $\mathbb{E}[Y]=0$ and $Y \in [a,b]$. Then $Y$ is sub-Gaussian, and more specifically $\psi_{Y}(\lambda) \leq \lambda^2 \frac{(b-a)^2}{8}$.

Recall that $\psi_Y(\lambda) = \log \mathbb{E}[e^{\lambda Y}]$. We bound $\mathbb{E}[e^{\lambda Y}]$. Since $f:x\rightarrow e^{\lambda x}$ is convex, note that we can write $\forall y\in[a,b] \qquad y = \underbrace{\frac{b-y}{b-a}}_{0 \leq \gamma \leq 1} a + \underbrace{\frac{y-a}{b-a}}_{1-\gamma} b.$ Then $y = \gamma a + (1 - \gamma) b$, hence $e^{\lambda y} \leq \gamma e^{\lambda a} + (1-\gamma) e^{\lambda b} = \frac{b-y}{b-a} e^{\lambda a} + \frac{y-a}{b-a} e^{\lambda b}$ and % Consider the function $g_p:x\rightarrow \ln(1-\rho +\rho e^x) - p x$, then $g_p(x) \leq \frac{x^2}{8}$. Hence $\mathbb{E}[e^{\lambda Y}] \leq \exp\left( \frac{\lambda^2 (b-a)^2}{8} \right)$.

Let a random variable $Y \sim p_Y$ such that $\mathbb{E}[Y] = 0$ and $Y\in [a,b]$. Then $a \leq Y \leq b$ and $\frac{a-b}{2} \leq Y - \frac{a+b}{2} \leq \frac{b-a}{2}$, so that $\left| Y - \frac{a+b}{2} \right| \leq \frac{b-a}{2}$ and $\Var{Y} \leq \frac{(b-a)^2}{4}$. Define $Z \in [a,b]$ such that $p_Z(y) = \frac{e^{\lambda y}}{\mathbb{E}[e^{\lambda Y}]} p_Y(y)$. Then we also have $\Var{Z} \leq \frac{(b-a)^2}{4}$. On the other hand % Note that $\psi_Y(\lambda) = \log \mathbb{E}[e^{\lambda Y}] \qquad \psi_Y(0) = 0$ Then $\psi_Y'(\lambda) = \frac{\mathbb{E}[Y e^{\lambda Y}]}{\mathbb{E}[e^{\lambda Y}]} \qquad \psi_Y'(0) = 0$ and $\psi_Y''(\lambda) = \frac{\mathbb{E}[Y^2 e^{\lambda Y}]}{\mathbb{E}[e^{\lambda Y}]} - \frac{\mathbb{E}[Y e^{\lambda Y}]^2}{\mathbb{E}[e^{\lambda Y}]^2} = \Var{Z} \leq \frac{(b-a)^2}{4}$ From Taylor’s theorem, $\exists c \in [0,\lambda]$ such that $\psi_Y(\lambda) = \psi_Y(0) + \lambda \psi_Y'(0) + \frac{\lambda^2}{2} \psi_Y''(c)$ Therefore, $\psi_Y(\lambda) \leq \frac{\lambda^2(b-a)^2}{8}$.

Consider independent random variables $X_i$ with $\mathbb{E}[X_i] = 0$ and $X_i \in [a_i,b_i]$. Let $Y = \sum_{i=1}^n X_i$. Then \begin{align} \mathbb{P}\left[ \sum_{i=1}^n X_i \geq t \right]\leq \exp \left( - \frac{2 t^2}{\sum_{i=1}^n (a_i-b_i)^2} \right)\end{align}

The proof follows by combining Lemma [lm:hoeffding]{reference-type=”ref” reference=”lm:hoeffding”} with Proposition [prop:properties-cumulant]{reference-type=”ref” reference=”prop:properties-cumulant”} to obtain \begin{align}\psi_Y(\lambda) = \sum_{i=1}^n \psi_{X_i}(\lambda) \leq \sum_{i=1}^n \frac{\lambda^2}{8} (b_i-a_i)^2 = \frac{\lambda^2 \sigma^2}{2}\end{align} with $\sigma^2 \eqdef \frac{1}{4} \sum_{i=1}^n (b_i-a_i)^2.$

If $X_i \sim \mathcal{B}\left( \frac{1}{2} \right)$ on ${-1,1}$ then \begin{align}\mathbb{P}\left[ \sum_{i=1}^n X_i \geq t \right] \leq \exp \left( - \frac{t^2}{2 n} \right)\end{align}

# To go further

We have only touched upon concentration inequalities, there is an entire field of research devoted to proving such results in intricate situations. Two great references are (Boucheron, Lugosi, & Massart, 2016), from which I borrowed most of the ideas, and (Raginsky & Sason, 2014).

1. S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence, 1st ed. Oxford University Press, 2016.

@book{ConcentrationInequalities,
title = {Concentration Inequalities: A Nonasymptotic Theory of Independence},
publisher = {Oxford University Press},
year = {2016},
author = {Boucheron, St\'ephane and Lugosi, Gabor and Massart, Pascal},
edition = {1st},
month = apr,
file = {:Books/Boucheron et al - Concentration inequalities.pdf:PDF}
}


2. M. Raginsky and I. Sason, “Concentration of Measure Inequalities in Information Theory, Communications, and Coding,” Foundations and Trends in Communications and Information Theory, Sep. 2014.

@article{Raginsky2014,
author = {Raginsky, Maxim and Sason, Igal},
title = {Concentration of Measure Inequalities in Information Theory, Communications, and Coding},
journal = {Foundations and Trends in Communications and Information Theory},
year = {2014},
month = sep,
doi = {http://dx.doi.org/10.1561/0100000064},
publisher = {Now Publisher}
}