Prof. Matthieu Bloch
Tuesday August 29, 2023
A function \(f:[a,b]\to \mathbb{R}\) is convex if for all \(x_1,x_2\in[a,b]\) and all \(\lambda \in [0,1]\), \(f(\lambda x_1 + (1-\lambda)x_2 ) \leq \lambda f(x_1) + (1 - \lambda) f(x_2)\). A function \(f\) is strictly convex if the inequality above is strict whenever \(x_1\neq x_2\) and \(\lambda\in(0,1)\). A function \(f\) is (strictly) concave if \(-f\) is (strictly) convex.
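For example, \(x\mapsto x^2\) is strictly convex on any interval, since for \(x_1\neq x_2\) and \(\lambda\in(0,1)\) we have \(\lambda x_1^2+(1-\lambda)x_2^2-(\lambda x_1+(1-\lambda)x_2)^2=\lambda(1-\lambda)(x_1-x_2)^2>0\); similarly, \(x\mapsto \ln x\) is strictly concave on \((0,\infty)\).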
Let \(X\) be a real-valued random variable defined on some interval \([a,b]\) and with PDF \(p_X\). Let \(f:[a,b]\rightarrow \mathbb{R}\) be a real valued function that is convex in \([a,b]\). Then,
\[\begin{align*} f(\E{X})\leq \E{f(X)}. \end{align*}\]
For any strictly convex function, equality holds if and only if \(X\) is a constant. The result also holds more generally for continuous random variables.
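As a sanity check, here is a minimal Python sketch (not from the notes; the PMF is a made-up example) verifying Jensen's inequality for the strictly convex function \(f(x)=x^2\):

```python
# Numerical check of Jensen's inequality f(E[X]) <= E[f(X)]
# for the strictly convex function f(x) = x^2 (illustrative values).
xs = [0.0, 1.0, 2.0, 4.0]          # support of X
pmf = [0.1, 0.4, 0.3, 0.2]         # a made-up probability mass function

mean = sum(p * x for p, x in zip(pmf, xs))               # E[X] = 1.8
mean_of_square = sum(p * x**2 for p, x in zip(pmf, xs))  # E[X^2] = 4.8

assert mean**2 <= mean_of_square
print(mean**2, mean_of_square)     # 3.24 <= 4.8 (strict since X is not constant)
```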
Let \(\{a_i\}_{i=1}^n\in\bbR_{+}^n\) and \(\{b_i\}_{i=1}^n\in\bbR_{+}^n\). Then,
\[\begin{align*} \sum_{i=1}^n a_i\ln \frac{a_i}{b_i} \geq \left(\sum_{i=1}^n a_i\right)\ln\frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n b_i}. \end{align*}\]
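A quick numerical check of the log-sum inequality, with made-up positive sequences:

```python
from math import log

# Log-sum inequality: sum_i a_i ln(a_i/b_i) >= (sum a_i) ln(sum a_i / sum b_i).
a = [0.2, 0.5, 0.3]    # made-up positive reals (need not sum to 1)
b = [0.5, 0.5, 0.5]

lhs = sum(ai * log(ai / bi) for ai, bi in zip(a, b))
rhs = sum(a) * log(sum(a) / sum(b))

assert lhs >= rhs
print(lhs, rhs)        # ~ -0.337 >= -0.405
```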
Let \(f:\bbR\to\bbR\) be a convex function such that \(f(1)=0\). The \(f\)-divergence between two distributions \(P,Q \in\Delta(\calX)\) is
\[\begin{align*} \bbD_f(P\Vert Q)\eqdef \sum_{x\in\calX}f\left(\frac{P(x)}{Q(x)}\right)Q(x), \end{align*}\]
with the convention that \(\bbD_f(P\Vert Q)=\infty\) if \(P\) is not absolutely continuous with respect to \(Q\).
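The definition translates directly into code. Below is a minimal Python sketch; `f_divergence` is a hypothetical helper, and the absolute-continuity convention is handled explicitly:

```python
from math import log

def f_divergence(f, P, Q):
    """D_f(P||Q) = sum_x Q(x) f(P(x)/Q(x)); returns +inf when P is not
    absolutely continuous w.r.t. Q (hypothetical helper, finite alphabets)."""
    total = 0.0
    for p, q in zip(P, Q):
        if q == 0.0:
            if p > 0.0:
                return float("inf")  # P not absolutely continuous w.r.t. Q
            continue                  # both zero: contributes nothing
        total += q * f(p / q)
    return total

# f(t) = t ln t (with 0 ln 0 = 0) recovers the relative entropy defined below.
kl_f = lambda t: t * log(t) if t > 0 else 0.0
print(f_divergence(kl_f, [0.5, 0.5], [0.9, 0.1]))  # ~0.5108 nats
```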
A function \(W:\calX\times\calY\to\bbR:(x,y)\mapsto W(y|x)\) is a Markov kernel (also called a transition probability) from \(\calX\) to \(\calY\) if \(W(y|x)\geq 0\) for all \((x,y)\in\calX\times\calY\) and \(\sum_{y\in\calY}W(y|x)=1\) for all \(x\in\calX\).
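For example, the binary symmetric channel with crossover probability \(\delta\in[0,1]\) is the Markov kernel from \(\{0,1\}\) to \(\{0,1\}\) with \(W(y|x)=1-\delta\) if \(y=x\) and \(W(y|x)=\delta\) otherwise.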
For any \(P,Q \in \Delta(\calX)\), \(\bbD_f(P\Vert Q)\geq 0\), with equality if and only if \(P=Q\) provided \(f\) is strictly convex at \(1\).
Let \(P,Q\in\Delta(\calX)\) and let \(W\) be a Markov kernel from \(\calX\) to \(\calY\); write \(W\cdot P\) for the joint distribution \((x,y)\mapsto P(x)W(y|x)\) and \(W\circ P\) for the output marginal \(y\mapsto \sum_{x\in\calX}P(x)W(y|x)\). Then,
\[\begin{align*} \bbD_f(W\circ P\Vert W\circ Q) \leq \bbD_f(W\cdot P\Vert W\cdot Q) \leq \bbD_f(P\Vert Q). \end{align*}\]
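A small numerical illustration of the outer inequality, using the \(f\)-divergence with \(f:t\mapsto t\ln t\) (the relative entropy defined below) and a made-up kernel; `push` computes the output marginal \(W\circ P\):

```python
from math import log

def kl(P, Q):
    """Relative entropy in nats; assumes full support of Q, with 0 ln 0 = 0."""
    return sum(p * log(p / q) for p, q in zip(P, Q) if p > 0)

def push(W, P):
    """Output marginal (W o P)(y) = sum_x W(y|x) P(x), with W[x][y] = W(y|x)."""
    return [sum(W[x][y] * P[x] for x in range(len(P))) for y in range(len(W[0]))]

W = [[0.9, 0.1], [0.2, 0.8]]       # a made-up kernel on {0,1}
P, Q = [0.5, 0.5], [0.8, 0.2]

assert kl(push(W, P), push(W, Q)) <= kl(P, Q)
print(kl(push(W, P), push(W, Q)), kl(P, Q))  # ~0.105 <= ~0.223
```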
The \(f\)-divergence \(\bbD_f:\Delta(\calX)\times\Delta(\calX)\to\bbR^+:(P,Q)\mapsto\mathbb{D}_f(P\Vert Q)\) is jointly convex, i.e., for all \(P_1,Q_1,P_2,Q_2\in\Delta(\calX)\) and \(\lambda\in[0,1]\)
\[\begin{align*} \mathbb{D}_f\left(\lambda P_1+(1-\lambda) P_2\Vert \lambda Q_1+(1-\lambda) Q_2\right)\leq \lambda\mathbb{D}_f(P_1\Vert Q_1)+(1-\lambda)\mathbb{D}_f(P_2\Vert Q_2). \end{align*}\]
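The same one-line `kl` helper can be reused to check joint convexity numerically for the relative entropy (made-up distributions):

```python
from math import log

def kl(P, Q):
    return sum(p * log(p / q) for p, q in zip(P, Q) if p > 0)

def mix(lam, A, B):
    """Convex combination lam*A + (1-lam)*B of two PMFs."""
    return [lam * a + (1 - lam) * b for a, b in zip(A, B)]

P1, Q1 = [0.5, 0.5], [0.9, 0.1]    # made-up distribution pairs
P2, Q2 = [0.3, 0.7], [0.4, 0.6]
lam = 0.25

lhs = kl(mix(lam, P1, P2), mix(lam, Q1, Q2))
rhs = lam * kl(P1, Q1) + (1 - lam) * kl(P2, Q2)
assert lhs <= rhs
print(lhs, rhs)                    # ~0.062 <= ~0.144
```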
For \(P,Q\in\Delta(\calX)\), the total variation between \(P\) and \(Q\) is the \(f\)-divergence for \(f:t\mapsto\frac{1}{2}\abs{t-1}\), i.e.,
\[\begin{align} \mathbb{V}\left(P,Q\right)\eqdef \frac{1}{2}\norm[1]{P-Q}\eqdef \frac{1}{2}\sum_{x\in\calX}\abs{P(x)-Q(x)}. \end{align}\]
For \(P,Q\in\Delta(\calX)\), the relative entropy, also called Kullback-Leibler divergence, between \(P\) and \(Q\) is the \(f\)-divergence for \(f:t\mapsto t\ln t\), i.e.,
\[\begin{align} \bbD(P\Vert Q)\eqdef \sum_{x\in\calX}P(x)\ln\frac{P(x)}{Q(x)}, \end{align}\]
with the convention that \(\bbD(P\Vert Q) = \infty\) if \(P\) is not absolutely continuous with respect to \(Q\).
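Both divergences are one-liners; a minimal sketch with a made-up pair of distributions:

```python
from math import log

def tv(P, Q):
    """Total variation V(P,Q) = (1/2) * sum_x |P(x) - Q(x)|."""
    return 0.5 * sum(abs(p - q) for p, q in zip(P, Q))

def kl(P, Q):
    """Relative entropy D(P||Q) in nats, with 0 ln 0 = 0 and full-support Q."""
    return sum(p * log(p / q) for p, q in zip(P, Q) if p > 0)

P, Q = [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]  # made-up distributions
print(tv(P, Q))   # 0.1
print(kl(P, Q))   # ~0.0253
```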
The only functions that satisfy these axioms are of the form \(H(X) = -C\sum_xP_X(x)\log P_X(x)\) for some constant \(C>0\).
Let \(X\in\calX\) be a discrete random variable with \(\card{\calX}<\infty\). The Shannon entropy of \(X\) is defined as
\[\begin{align} \mathbb{H}(X) \eqdef \E[X]{-\ln P_X(X)} = - \sum_{x \in {\calX}} P_X(x) \ln P_X(x), \end{align}\]
with the convention that \(0\ln 0 = 0\).
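A direct implementation of the definition, honoring the \(0\ln 0=0\) convention (the PMFs are made-up; the second line anticipates the bound \(\mathbb{H}(X)\leq\ln\card{\calX}\) below):

```python
from math import log

def entropy(P):
    """Shannon entropy in nats, with the convention 0 ln 0 = 0."""
    return -sum(p * log(p) for p in P if p > 0)

print(entropy([0.5, 0.25, 0.125, 0.125]))  # ~1.2131 nats
print(entropy([0.25] * 4), log(4))         # uniform attains ln|X| ~ 1.3863
```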
The entropy \(\bbH:\Delta(\calX)\to\bbR:P\mapsto \bbH(P)\) is a concave function.
Let \(X\in\calX\) be a discrete random variable. Then \(\mathbb{H}(X) \geq 0\) with equality iff \(X\) is a constant, i.e., there exists \(x^*\in\calX\) such that \(\P{X=x^*}=1\).
Let \(X \in {\calX}\) be a discrete random variable. Then \(\mathbb{H}(X)\leq \ln \card{\calX}\) with equality if and only if \(X\) is uniform over \({\calX}\), i.e., \(\forall x \in {\calX}, P_X(x) = \frac{1}{\card{\calX}}\).
The joint entropy of two discrete random variables \(X\in\calX\) and \(Y\in\calY\) with joint PMF \(P_{XY}\) is
\[\begin{align*} \mathbb{H}(X,Y) \eqdef \E[XY]{-\ln P_{XY}(X,Y)} = - \sum_{x \in {\calX}} \sum_{y \in \calY}P_{XY}(x, y) \ln P_{XY}(x, y). \end{align*}\]
Furthermore, the conditional entropy of \(Y\) given \(X\) is
\[\begin{align*} \mathbb{H}(Y|X) \eqdef \E[XY]{-\ln P_{Y|X}(Y|X)} = - \sum_{x \in {\calX}} \sum_{y \in \calY} P_{XY}(x,y) \ln P_{Y|X}(y|x). \end{align*}\]
Let \(X, Y\) be discrete random variables with joint PMF \(P_{XY}\). Then \(\mathbb{H}(Y|X) \geq 0\) with equality if and only if \(Y\) is a function of \(X\).
Let \(X\) and \(Y\) be discrete random variables with joint PMF \(P_{XY}\). Then \(\mathbb{H}(X|Y) \leq \mathbb{H}(X)\) with equality if and only if \(X\) is independent of \(Y\).
Let \(X\), \(Y\), and \(Z\) be discrete random variables with joint PMF \(P_{XYZ}\). Then \[ \mathbb{H}(XY|Z) = \mathbb{H}(X|Z) + \mathbb{H}(Y|XZ) = \mathbb{H}(Y|Z) + \mathbb{H}(X|YZ). \] More generally, if \(\bfX\eqdef\set{X_i}_{i=1}^n\) and \(Z\) are jointly distributed random variables we have \[ \mathbb{H}(\bfX|Z) = \sum_{i=1}^n\mathbb{H}(X_i|X^{1:i-1}Z) \] with the convention \(X^{1:0}\eqdef \emptyset\).
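The chain rule can be checked mechanically from a joint PMF; a minimal Python sketch with a made-up \(P_{XY}\) (taking \(Z\) trivial):

```python
from math import log

# Joint PMF P_XY as a matrix: P[x][y] (made-up example).
P = [[0.3, 0.2],
     [0.1, 0.4]]

PX = [sum(row) for row in P]                      # marginal of X
HXY = -sum(p * log(p) for row in P for p in row if p > 0)
HX  = -sum(p * log(p) for p in PX if p > 0)
# H(Y|X) = -sum_{x,y} P(x,y) ln P(y|x), with P(y|x) = P(x,y)/P(x).
HY_X = -sum(P[x][y] * log(P[x][y] / PX[x])
            for x in range(2) for y in range(2) if P[x][y] > 0)

assert abs(HXY - (HX + HY_X)) < 1e-12             # chain rule H(X,Y) = H(X) + H(Y|X)
```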
Let \(X, Y\) be two random variables with joint PMF \(P_{XY}\). The mutual information between \(X\) and \(Y\) is \[ \bbI(X;Y)\eqdef \bbH(X)-\bbH(X|Y)=\bbD(P_{XY}\Vert P_XP_Y). \]
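The two expressions for \(\bbI(X;Y)\) agree, as the following sketch verifies on a made-up joint PMF:

```python
from math import log

P = [[0.3, 0.2],     # joint PMF P_XY[x][y] (made-up example)
     [0.1, 0.4]]
PX = [sum(row) for row in P]
PY = [sum(P[x][y] for x in range(2)) for y in range(2)]

# I(X;Y) as D(P_XY || P_X P_Y)
I_kl = sum(P[x][y] * log(P[x][y] / (PX[x] * PY[y]))
           for x in range(2) for y in range(2) if P[x][y] > 0)

# I(X;Y) as H(X) - H(X|Y)
HX = -sum(p * log(p) for p in PX if p > 0)
HX_Y = -sum(P[x][y] * log(P[x][y] / PY[y])
            for x in range(2) for y in range(2) if P[x][y] > 0)

assert abs(I_kl - (HX - HX_Y)) < 1e-12
print(I_kl)   # ~0.0863 nats
```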
If \(\Delta(\calX\to\calY)\) denotes the set of Markov kernels from \(\calX\) to \(\calY\), then the function \[ I:\Delta(\calX)\times \Delta(\calX\to\calY)\to\bbR:(P,W)\mapsto I(P;W), \] where \(I(P;W)\eqdef\bbI(X;Y)\) for \(X\sim P\) and \(P_{Y|X}=W\), is a concave function of \(P\) for fixed \(W\) and a convex function of \(W\) for fixed \(P\).