# Concentration inequalities: why should you care?

Concentration inequalities are statements about the probability that some random variable deviates from its mean. If this probability is small, one can argue that actual realizations of the random variable are not too far from the mean, which is often much easier to analyze. Concentration inequalities can be used as proof techniques to establish fairly fundamental results, such as the Johnson-Lindenstrauss lemma used in compressed sensing or Shannon's channel coding theorem in information theory, or as tools to analyze the performance of randomized algorithms, such as in statistical learning.

We outline here one such application in the context of risk minimization, which is particularly useful in machine learning. Assume that we are given some data and associated labels as a set of pairs $\{(x_i,y_i)\}_{i=1}^N$, obtained as i.i.d. realizations according to a distribution $p_{XY}$. The objective is to fit a model that explains how $y$ relates to $x$, described by a function $f:\calX\to\calY$ in a set $\calF$ with $\card{\calF}<\infty$. The quality of the fit is measured through a loss function $\ell:\calY\times\calY\rightarrow\mathbb{R}^+$. The true risk of a model $f$ is defined as \begin{align} R(f) \eqdef \E[p_{XY}]{\ell(f(X),Y)}.\label{eq:true_risk}\end{align} In practice, we usually do not have access to the distribution $p_{XY}$ and replace the true risk in \eqref{eq:true_risk} by the empirical risk, defined as \begin{align} \widehat{R}_N(f) \eqdef \frac{1}{N}\sum_{i=1}^N\ell(f(x_i),y_i).\label{eq:empirical_risk}\end{align} We then attempt to "learn" the model by finding the best model $f^* = \argmin_{f\in\mathcal{F}}\widehat{R}_N(f)$. Since the learning is performed using the empirical risk, it is natural to wonder under which conditions $\widehat{R}_N(f^*)$ is not too different from $R(f^*)$. Intuitively, this should be true if the number of samples $N$ is large enough; concentration inequalities allow us to make this statement precise.
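As a purely illustrative sketch of this setup, the snippet below performs empirical risk minimization over a small finite class of threshold classifiers on synthetic data; the class, the data distribution, and the 10% label noise are all assumptions made for the example, not part of the general framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite model class: threshold classifiers f_c(x) = 1{x >= c}.
thresholds = np.linspace(-2.0, 2.0, 41)

# Synthetic data: y = 1{x >= 0}, with each label flipped with probability 0.1.
N = 2000
x = rng.normal(size=N)
base = (x >= 0.0).astype(int)
flip = rng.random(N) < 0.1
y = np.where(flip, 1 - base, base)

# Empirical risk (0-1 loss) of every model in the class, and its minimizer.
emp_risk = np.array([np.mean((x >= c).astype(int) != y) for c in thresholds])
f_star = thresholds[np.argmin(emp_risk)]
```

With enough samples, the empirical minimizer lands near the true threshold $0$ and its empirical risk near the $0.1$ noise level, which is exactly the kind of statement the concentration argument makes precise.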

If we could show, and we will under some mild conditions, that there exists a constant $c>0$ such that \begin{align} \forall f\in\calF \quad\forall \epsilon>0\quad\P{\abs{R(f)-\widehat{R}_N(f)}>\epsilon}\leq \exp\left(-c \epsilon^2 N\right),\label{eq:example_concentration_ineq}\end{align} then a union bound ensures that \begin{align} \P{\exists f\in\calF:\abs{R(f)-\widehat{R}_N(f)}>\epsilon}\leq \card{\calF}\exp\left(-c \epsilon^2 N\right).\end{align} Equivalently, for $\delta\in(0,1]$, with probability at least $1-\delta$, \begin{align} \forall f\in\calF\quad \abs{R(f)-\widehat{R}_N(f)}\leq\sqrt{\frac{\log\card{\calF}+\log\frac{1}{\delta}}{cN}}.\end{align} This result is "good news," because it tells us that the squared error we make when using the empirical risk instead of the true risk decays as $\frac{\log\card{\calF}}{N}$; in other words, even if the family $\calF$ is large, a reasonable number of samples leads to a small error.
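To get a feel for the numbers, one can invert the union bound: solving $\card{\calF}\exp(-c\epsilon^2 N)\leq\delta$ for $N$ gives the function below. The constant $c=2$ matches Hoeffding's inequality for a loss bounded in $[0,1]$; both this constant and the example figures are assumptions chosen for illustration.

```python
import math

def sample_complexity(card_F: int, eps: float, delta: float, c: float = 2.0) -> int:
    """Smallest N such that card_F * exp(-c * eps**2 * N) <= delta."""
    return math.ceil(math.log(card_F / delta) / (c * eps ** 2))

# Even a class of a million candidate models only needs a few thousand samples
# for a uniform deviation of 0.05 with probability 0.99.
N = sample_complexity(card_F=10**6, eps=0.05, delta=0.01)  # about 3.7e3
```

The logarithmic dependence on $\card{\calF}$ is what makes even very large finite classes learnable from a modest number of samples.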

With our assumption on the data, notice that $\E{\ell(f(X_i),Y_i)} = R(f)$; therefore, \eqref{eq:example_concentration_ineq} is really a statement about the concentration of the sample average of the $\ell(f(X_i),Y_i)$ around their mean $R(f)$. The crucial feature of \eqref{eq:example_concentration_ineq} that makes it useful is the exponential decay with the number of samples $N$ on the right-hand side, which is typically what concentration inequalities are about.

\begin{remark} This inequality only addresses part of the problem. In machine learning, we would not only like to ensure that $\widehat{R}_N(f^*)$ is close to $R(f^*)$ but also that $R(f^*)$ is small. For more on this, consider taking ECE 6254 "Statistical Machine Learning." \end{remark}

# Markov's inequality and improvements

Most concentration inequalities are astute specializations of Markov's inequality, stated below.

Let $Z \in \mathbb{R}^+$ be such that $\E{\abs{Z}}<\infty$. Then $\forall t > 0$ $\mathbb{P}[Z \geq t] \leq \frac{\mathbb{E}[Z]}{t}.\label{eq:MarkovInequality}$

For $t\in\mathbb{R}$, let $\indic{Z \geq t}$ be the indicator function of the event $\{Z\geq t\}$. Then, \begin{align} \mathbb{E}[Z] \geq \mathbb{E}[Z \indic{Z \geq t}] \geq t \mathbb{P}[Z \geq t].\end{align}

By choosing $t=\delta\E{Z}$ for $\delta>0$ in \eqref{eq:MarkovInequality}, we obtain $\P{Z\geq\delta\E{Z}}\leq\frac{1}{\delta}$, which is consistent with the intuition that it is unlikely that a random variable takes a value very far away from its mean.
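A quick numerical sanity check of this bound, using an Exponential(1) variable as an arbitrary nonnegative example:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.exponential(scale=1.0, size=1_000_000)  # nonnegative, mean 1

# Empirical frequency of {Z >= delta * E[Z]} versus the Markov bound 1/delta.
deltas = np.array([2.0, 5.0, 10.0])
freqs = np.array([np.mean(z >= d * z.mean()) for d in deltas])
markov = 1.0 / deltas
```

The exact tail here is $e^{-\delta}$, far below $1/\delta$: Markov's inequality is valid but often very loose, which is precisely what motivates the boosting discussed next.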

In spite of its relative simplicity, Markov's inequality is a powerful tool because it can be "boosted." For $Z \in \calZ \subset \mathbb{R}$, consider $\phi:\calZ\rightarrow\mathbb{R}^+$ increasing on $\calZ$ such that $\E{\abs{\phi(Z)}}<\infty$. Then, $\mathbb{P}[Z \geq t] = \mathbb{E}[\indic{Z \geq t}] = \mathbb{E}[{\indic{Z \geq t}} \indic{\phi(Z) \geq \phi(t)}] \leq \mathbb{P}[\phi(Z) \geq \phi(t)],$ where we have used the monotonicity of $\phi$ and the fact that an indicator function is upper bounded by one. Applying Markov's inequality, we obtain $\mathbb{P}[Z \geq t] \leq \frac{\mathbb{E}[\phi(Z)]}{\phi(t)},$ which is potentially a better bound than \eqref{eq:MarkovInequality}. Of course, the difficulty lies in choosing an appropriate function $\phi$ that makes the result meaningful. The most well-known application of this idea leads to Chebyshev's inequality.

Let $X \in \mathbb{R}$ with $\Var{X}<\infty$. Then, $\forall t>0$, $\mathbb{P}[|X - \E{X}| \geq t] \leq \frac{\Var{X}}{t^2}.$

Define $Y \triangleq |X - \E{X}|$ and $\phi: \mathbb{R}^+\to\mathbb{R}^+: t \mapsto t^2$. Then, by the boosted Markov's inequality, we obtain $\mathbb{P}[Y \geq t] \leq \frac{\mathbb{E}[Y^2]}{t^2}=\frac{\Var{X}}{t^2}.$

As an application of Chebyshev's inequality, we derive the weak law of large numbers.

Let $X_i \sim p_{X_i}$ be independent with $\mathbb{E}[\abs{X_i}] < \infty$ and $\Var{X_i} < M$ for some $M\in\mathbb{R}^+$. Then $\frac{1}{n}\sum_{i=1}^n X_i - \frac{1}{n}\sum_{i=1}^n\E{X_i}$ converges in probability to $0$ as $n\to\infty$.

First observe that the sample mean $Z_n \eqdef \frac{1}{n}\sum_{i=1}^n X_i$ satisfies $\mathbb{E}[Z_n] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}[X_i] \quad\text{and}\quad \Var{Z_n} = \frac{1}{n^2}\sum_{i=1}^n \Var{X_i} \leq \frac{M}{n},$ where the expression for the variance uses the independence of the $X_i$. Therefore, by Chebyshev's inequality, for any $\epsilon>0$, \begin{align} \P{\abs{Z_n - \E{Z_n}} \geq \epsilon} \leq \frac{\Var{Z_n}}{\epsilon^2} \leq \frac{M}{n\epsilon^2} \xrightarrow[n\rightarrow\infty]{} 0.\end{align}

The weak law of large numbers essentially states that $\frac{1}{n} \sum_{i=1}^n X_i$ concentrates around its average. Note, however, that the convergence we proved is rather slow, on the order of $1/n$, and far from the exponential concentration suggested in \eqref{eq:example_concentration_ineq}. We could attempt to further improve upon Chebyshev's inequality by choosing a boosting function $\phi:t\rightarrow t^q$ for $q \in \mathbb{N}\setminus \{0,1\}$. Then, $\mathbb{P}[|X - \E{X}| \geq t] \leq \frac{\mathbb{E}[|X - \E{X}|^q]}{t^q}.$ If $\forall q \in \mathbb{N}\setminus \{0,1\}$, $\mathbb{E}[|X - \E{X}|^q] < \infty$, we obtain $\mathbb{P}[|X - \E{X}| \geq t] \leq \inf_{q\in \mathbb{N}\setminus \{0,1\}} \frac{\mathbb{E}[|X - \E{X}|^q]}{t^q}.$ This might come in handy if one has access to higher-order absolute moments, but it will not give us exponential concentration.
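The $1/n$ decay of the deviation probability is easy to observe numerically; the Uniform$[-1,1]$ distribution below (variance $1/3$) is an arbitrary choice satisfying the assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, trials = 0.1, 5_000

# Frequency of {|sample mean| >= eps} versus Chebyshev's bound Var/(n * eps^2),
# for i.i.d. Uniform[-1, 1] samples with variance 1/3.
results = {}
for n in (100, 400, 1600):
    means = rng.uniform(-1.0, 1.0, size=(trials, n)).mean(axis=1)
    freq = np.mean(np.abs(means) >= eps)
    bound = (1.0 / 3.0) / (n * eps ** 2)
    results[n] = (freq, bound)
```

In this experiment the empirical frequency drops much faster than the Chebyshev bound, hinting that exponential concentration is achievable, as the next section shows.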

# Chernoff bounds

The trick to obtaining exponential concentration is to boost Markov's inequality with functions of the form $\phi_{\lambda}:t\rightarrow e^{\lambda t}$ for $\lambda \in \mathbb{R}^+$. The resulting bounds are often known as Chernoff bounds. Note that for a real-valued random variable $Z$, $\forall \lambda \in \mathbb{R}^+ \qquad \mathbb{P}[|Z-\mathbb{E}[Z]| \geq t] \leq \frac{\E{\phi_\lambda(\abs{Z-\E{Z}})}}{\phi_\lambda(t)}=e^{-\lambda t} \mathbb{E}\left[e^{\lambda |Z-\mathbb{E}[Z]|}\right].$ Moreover, since the two events below are disjoint for $t>0$, $\mathbb{P}[|Z-\mathbb{E}[Z]| \geq t] = \mathbb{P}[Z-\mathbb{E}[Z] \geq t] + \mathbb{P}[\mathbb{E}[Z]-Z \geq t].$ Setting $\tilde{Z} \triangleq Z - \mathbb{E}[Z]$ or $\tilde{Z} \triangleq \mathbb{E}[Z] - Z$, the problem of deriving concentration inequalities is tantamount to studying $\mathbb{P}[\tilde{Z} \geq t]$, where $\tilde{Z} \in \mathbb{R}$ is centered. We will make this centering assumption from now on to simplify analysis and notation without losing generality.

If $Z$ is centered and real-valued, we have $\forall \lambda\in\mathbb{R}^+\quad \mathbb{P}[Z \geq t] \leq e^{-\lambda t} \mathbb{E}[e^{\lambda Z}] \label{eq:chernoff_bound}.$ For $\lambda\in\mathbb{R}$, $\mathbb{E}[e^{\lambda Z}]$ is the moment generating function (MGF) of $Z$, and $\psi_Z(\lambda) \triangleq \log \mathbb{E}\left[ e^{\lambda Z} \right]$ is the cumulant generating function (CGF) of $Z$. We recall some of the properties of the CGF.

Let $Z$ be centered and real-valued such that $\mathbb{E}[e^{\lambda Z}] < \infty$ for all $\abs{\lambda} < \epsilon$, for some $\epsilon>0$. Then, the CGF satisfies the following properties.

1. $\psi_Z$ is infinitely differentiable on $(-\epsilon,\epsilon)$. In particular, $\psi_Z'(0) = \psi_Z(0) = 0$;

2. $\psi_Z(\lambda)\geq \lambda\E{Z}=0$;

3. If $Z = \sum_{i=1}^n X_i$ with the $X_i$ independent with well-defined CGFs, then $\psi_Z(\lambda)=\sum_{i=1}^n\psi_{X_i}(\lambda)$.

We skip the subtleties behind the proof of differentiability, which essentially follows from the dominated convergence theorem; we will also happily swap derivatives and integrals without worrying too much. By definition, $\psi_Z(0) = \log \E{1}=0$. In addition, \begin{align} \frac{d\psi_Z}{d\lambda}(\lambda) = \frac{\E{Ze^{\lambda Z}}}{\E{e^{\lambda Z}}}, \end{align} so that $\frac{d\psi_Z}{d\lambda}(0)=0$ since $\E{Z}=0$. For the second part, note that by Jensen's inequality \begin{align} \psi_{Z}(\lambda) = \log \E{e^{\lambda Z}} \geq \E{\log e^{\lambda Z}} = \lambda\E{Z}. \end{align} For the third part, we have $\mathbb{E}\left[e^{\lambda Z}\right] = \mathbb{E}\left[e^{\lambda \sum_{i=1}^n X_i}\right] = \mathbb{E}\left[\prod_{i=1}^n e^{\lambda X_i} \right] = \prod_{i=1}^n \mathbb{E}\left[ e^{\lambda X_i} \right] = e^{ \sum_{i=1}^n\log \mathbb{E}\left[ e^{\lambda X_i} \right] },$ where the third equality uses the independence of the $X_i$; taking logarithms yields the claim.

Since our goal is to find the best upper bound on the right-hand side of \eqref{eq:chernoff_bound}, it is natural to optimize over all $\lambda\in\mathbb{R}^+$ to obtain \begin{align} \mathbb{P}[Z \geq t] \leq \exp\left(-\sup_{\lambda\in\mathbb{R}^+}(\lambda t - \psi_Z(\lambda))\right)\label{eq:chernoff_bound_optimized}.\end{align}

For a real-valued centered random variable $Z$ with cumulant generating function $\psi_Z$, the Cramér transform of $\psi_Z$ is $\psi_Z^*$ defined as $\forall t \in \mathbb{R}^+ \qquad \psi_Z^*(t) \triangleq \sup_{\lambda\in\mathbb{R}^+} \left( \lambda t - \psi_Z(\lambda) \right).\label{eq:cramer_transform}$

Note that $\psi_Z^*(t)\geq -\psi_Z(0)=0$, so that $\psi_Z^*(t) \in \mathbb{R}^+$. In general, the Cramér transform only takes simple values in trivial cases. If $\psi_Z(\lambda) = \infty$ for all $\lambda \in \mathbb{R}^+\setminus\{0\}$, then $\psi_Z^*(t) = 0$. If $t < \E{Z}=0$, then $\psi_Z^*(t)=0$. If $t\geq \E{Z}$, then, for all $\lambda<0$, $\lambda t-\psi_Z(\lambda)\leq 0$. Consequently, the bound in \eqref{eq:chernoff_bound_optimized} is only useful when $t\geq \E{Z}$, in which case we may equivalently maximize over $\lambda\in\mathbb{R}$ in \eqref{eq:cramer_transform}. Note that, when the supremum is attained at an interior point, we can write $\psi_Z^*(t) = \lambda_t t -\psi_Z(\lambda_t)$ with $\lambda_t$ such that $\psi_Z'(\lambda_t) = t$.

Let $Z \sim \mathcal{N}(0,\sigma^2)$. Then, $\mathbb{E}[e^{\lambda Z}] = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi} \sigma} e^{-\frac{z^2}{2\sigma^2}} e^{\lambda z} \text{d}z = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi} \sigma} e^{-\frac{(z-\lambda \sigma^2)^2}{2\sigma^2}} e^{\frac{\lambda^2 \sigma^2}{2}} \text{d}z = e^{\frac{\lambda^2 \sigma^2}{2}}.$ Hence $\psi_Z(\lambda) = \log e^{\frac{\lambda^2 \sigma^2}{2}} = \frac{\lambda^2 \sigma^2}{2}$. Then $\psi_Z'(\lambda) = \lambda \sigma^2$, so that $\psi_Z'(\lambda_t) = t \Leftrightarrow \lambda_t \sigma^2 = t \Leftrightarrow \lambda_t = \frac{t}{\sigma^2}$ and \begin{align} \psi_Z^*(t)=\frac{t^2}{\sigma^2} - \frac{t^2}{2\sigma^2} = \frac{t^2}{2\sigma^2}.\end{align} Hence, $\mathbb{P}[Z \geq t] \leq e^{-\frac{t^2}{2 \sigma^2}}$.
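A numerical comparison of the Gaussian tail against this Chernoff bound (the sample size and test points are arbitrary choices for the experiment):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
z = rng.normal(0.0, sigma, size=1_000_000)

# Empirical tail P[Z >= t] versus the Chernoff bound exp(-t^2 / (2 sigma^2)).
ts = np.array([1.0, 2.0, 3.0])
tails = np.array([np.mean(z >= t) for t in ts])
chernoff = np.exp(-(ts ** 2) / (2 * sigma ** 2))
```

The bound is loose by a polynomial factor in $t$, but it captures the correct $e^{-t^2/(2\sigma^2)}$ exponential decay.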

The pleasingly simple form of the Chernoff bound for $Z\sim\calN(0,\sigma^2)$ stems from the simple form of the CGF. This naturally leads to the following definition.

$Z \in \mathbb{R}$ is sub-Gaussian with parameter $\sigma^2$ if $\exists \sigma^2 \in \mathbb{R}_*^+$ such that $\forall \lambda\in\mathbb{R}$, $\psi_Z(\lambda) \leq \frac{\lambda^2 \sigma^2}{2}$.

If $Z$ is sub-Gaussian with parameter $\sigma^2$, then $\forall \lambda \in \mathbb{R}$, $\lambda t - \psi_Z(\lambda) \geq \lambda t - \frac{\lambda^2 \sigma^2}{2}$. In this case, $\psi_Z^*(t) \geq \sup_{\lambda\in\mathbb{R}}\left( \lambda t - \frac{\lambda^2 \sigma^2}{2} \right) = \frac{t^2}{2 \sigma^2},$ so that $\mathbb{P}[Z \geq t] \leq e^{-\frac{t^2}{2\sigma^2}}$. Consequently, proving sub-Gaussianity is a proxy for obtaining exponential concentration.

# Hoeffding's inequality

As an application, we establish the celebrated Hoeffding's inequality. We start by proving that bounded random variables are sub-Gaussian.

Let $Y$ be a random variable such that $\mathbb{E}[Y]=0$ and $Y \in [a,b]$. Then $Y$ is sub-Gaussian; more specifically, $\psi_{Y}(\lambda) \leq \lambda^2 \frac{(b-a)^2}{8}$.

Recall that $\psi_Y(\lambda) = \log \mathbb{E}[e^{\lambda Y}]$; we bound $\mathbb{E}[e^{\lambda Y}]$. Note that we can write $\forall y\in[a,b] \qquad y = \underbrace{\frac{b-y}{b-a}}_{0 \leq \gamma \leq 1} a + \underbrace{\frac{y-a}{b-a}}_{1-\gamma} b.$ Since $f:x\rightarrow e^{\lambda x}$ is convex and $y = \gamma a + (1 - \gamma) b$, we have $e^{\lambda y} \leq \gamma e^{\lambda a} + (1-\gamma) e^{\lambda b} = \frac{b-y}{b-a} e^{\lambda a} + \frac{y-a}{b-a} e^{\lambda b},$ and, taking expectations with $\E{Y}=0$, \begin{align} \mathbb{E}[e^{\lambda Y}] \leq \frac{b}{b-a} e^{\lambda a} - \frac{a}{b-a} e^{\lambda b} = e^{g_p(\lambda(b-a))} \quad\text{with}\quad p \eqdef \frac{-a}{b-a}.\end{align} The function $g_p:x\rightarrow \ln(1-p +p e^x) - p x$ satisfies $g_p(0)=g_p'(0)=0$ and $g_p''(x)\leq\frac{1}{4}$, so that a Taylor expansion shows $g_p(x) \leq \frac{x^2}{8}$. Hence $\mathbb{E}[e^{\lambda Y}] \leq \exp\left( \frac{\lambda^2 (b-a)^2}{8} \right)$.

The following alternative proof relies on Taylor's theorem. Let $Y \sim p_Y$ be such that $\mathbb{E}[Y] = 0$ and $Y\in [a,b]$. Then $a \leq Y \leq b$ and $\frac{a-b}{2} \leq Y - \frac{a+b}{2} \leq \frac{b-a}{2}$, so that $\left| Y - \frac{a+b}{2} \right| \leq \frac{b-a}{2}$ and $\Var{Y} \leq \E{\left(Y-\tfrac{a+b}{2}\right)^2} \leq \frac{(b-a)^2}{4}$. For fixed $\lambda$, define the tilted random variable $Z \in [a,b]$ with distribution $p_Z(y) = \frac{e^{\lambda y}}{\mathbb{E}[e^{\lambda Y}]} p_Y(y)$; by the same argument, we also have $\Var{Z} \leq \frac{(b-a)^2}{4}$. Note that $\psi_Y(0) = 0$ and \begin{align} \psi_Y'(\lambda) = \frac{\mathbb{E}[Y e^{\lambda Y}]}{\mathbb{E}[e^{\lambda Y}]}, \qquad \psi_Y'(0) = \E{Y} = 0,\end{align} while \begin{align} \psi_Y''(\lambda) = \frac{\mathbb{E}[Y^2 e^{\lambda Y}]}{\mathbb{E}[e^{\lambda Y}]} - \frac{\mathbb{E}[Y e^{\lambda Y}]^2}{\mathbb{E}[e^{\lambda Y}]^2} = \Var{Z} \leq \frac{(b-a)^2}{4}.\end{align} From Taylor's theorem, $\exists c \in [0,\lambda]$ such that $\psi_Y(\lambda) = \psi_Y(0) + \lambda \psi_Y'(0) + \frac{\lambda^2}{2} \psi_Y''(c)$. Therefore, $\psi_Y(\lambda) \leq \frac{\lambda^2(b-a)^2}{8}$.

Consider independent random variables $X_i$ with $\mathbb{E}[X_i] = 0$ and $X_i \in [a_i,b_i]$. Let $Y = \sum_{i=1}^n X_i$. Then \begin{align} \mathbb{P}\left[ \sum_{i=1}^n X_i \geq t \right]\leq \exp \left( - \frac{2 t^2}{\sum_{i=1}^n (a_i-b_i)^2} \right)\end{align}

The proof follows by combining the previous lemma with the properties of the CGF to obtain \begin{align}\psi_Y(\lambda) = \sum_{i=1}^n \psi_{X_i}(\lambda) \leq \sum_{i=1}^n \frac{\lambda^2}{8} (b_i-a_i)^2 = \frac{\lambda^2 \sigma^2}{2}\end{align} with $\sigma^2 \eqdef \frac{1}{4} \sum_{i=1}^n (b_i-a_i)^2$. In other words, $Y$ is sub-Gaussian with parameter $\sigma^2$, and the Chernoff bound yields the result.

If the $X_i \sim \mathcal{B}\left( \frac{1}{2} \right)$ are uniform on $\{-1,1\}$, then \begin{align}\mathbb{P}\left[ \sum_{i=1}^n X_i \geq t \right] \leq \exp \left( - \frac{t^2}{2 n} \right).\end{align}
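Checking this corollary by simulation (the sample sizes and threshold are arbitrary choices for the experiment):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 100, 100_000, 20.0

# Sums of n fair +/-1 coin flips versus Hoeffding's bound exp(-t^2 / (2n)).
sums = rng.choice([-1, 1], size=(trials, n)).sum(axis=1)
freq = np.mean(sums >= t)
hoeffding = np.exp(-(t ** 2) / (2 * n))
```

Here the empirical frequency of large sums sits comfortably below the Hoeffding bound, which is exactly the exponential concentration promised at the start of these notes.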

# To go further

We have only touched upon concentration inequalities; there is an entire field of research devoted to proving such results in intricate situations. Two great references are (Boucheron, Lugosi, & Massart, 2016), from which I borrowed most of the ideas, and (Raginsky & Sason, 2014).

1. S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence, 1st ed. Oxford University Press, 2016.



2. M. Raginsky and I. Sason, "Concentration of Measure Inequalities in Information Theory, Communications, and Coding," Foundations and Trends in Communications and Information Theory, Sep. 2014.
