ECE 6605 - Information Theory

Prof. Matthieu Bloch

Tuesday August 29, 2023

Announcements

  • Make sure you're familiar with the course logistics
    • Canvas: read the syllabus
    • Piazza: make sure you register, this is not automatic
    • Gradescope: make sure you login to try
  • Office hours
    • Conflict on Thursday August 31, 2023 because of seminar
    • Moved to Friday, September 1, 2023, at 11am in TSRB 423
    • Live-streamed and recorded

Last time…

  • Notation
    • Make sure you're comfortable with my notation
    • Ask for clarifications if needed!
  • Concepts in applied probability
    • "Mathematics of ECE" to be announced this week: optional but good opportunity for review
    • Review notes from last week
    • Make sure you're comfortable with writing proofs
  • Today
    • Jensen's inequality
    • \(f\)-divergence
    • Shannon entropy (and more…)

Convexity and Jensen’s inequality

A function \(f:[a,b]\to \mathbb{R}\) is convex if for all \(x_1,x_2\in[a,b]\) and all \(\lambda \in [0,1]\), \(f(\lambda x_1 + (1-\lambda)x_2 ) \leq \lambda f(x_1) + (1 - \lambda) f(x_2)\). A function \(f\) is strictly convex if the inequality above is strict whenever \(x_1\neq x_2\) and \(\lambda\in(0,1)\). A function \(f\) is (strictly) concave if \(-f\) is (strictly) convex.

Let \(X\) be a real-valued random variable taking values in some interval \([a,b]\) and with PMF \(p_X\). Let \(f:[a,b]\rightarrow \mathbb{R}\) be a real-valued function that is convex on \([a,b]\). Then,

\[\begin{align*} f(\E{X})\leq \E{f(X)}. \end{align*}\]

For any strictly convex function \(f\), equality holds if and only if \(X\) is constant. The result also holds more generally for continuous random variables.
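
As a quick numerical sanity check (not from the lecture), the sketch below evaluates both sides of Jensen's inequality for a small discrete distribution and the convex function \(f(x)=x^2\); the support and probabilities are arbitrary illustrative choices.

```python
import numpy as np

# Illustrative discrete random variable: support and PMF are arbitrary choices.
x = np.array([0.0, 1.0, 2.0, 4.0])
p = np.array([0.1, 0.4, 0.3, 0.2])   # sums to 1

f = lambda t: t**2                   # a convex function

lhs = f(np.dot(p, x))                # f(E[X])
rhs = np.dot(p, f(x))                # E[f(X)]
print(lhs, rhs, lhs <= rhs)          # Jensen: f(E[X]) <= E[f(X)]
```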

Let \(\{a_i\}_{i=1}^n\in\bbR_{+}^n\) and \(\{b_i\}_{i=1}^n\in\bbR_{+}^n\). Then the log-sum inequality holds:

\[\begin{align*} \sum_{i=1}^n a_i\ln \frac{a_i}{b_i} \geq \left(\sum_{i=1}^n a_i\right)\ln\frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n b_i}. \end{align*}\]
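
A minimal numerical illustration of the log-sum inequality, with arbitrary positive sequences; this is just a sketch, not course code.

```python
import numpy as np

# Arbitrary positive sequences for illustration.
a = np.array([0.2, 1.5, 0.7])
b = np.array([1.0, 0.3, 0.9])

lhs = np.sum(a * np.log(a / b))
rhs = np.sum(a) * np.log(np.sum(a) / np.sum(b))
print(lhs, rhs, lhs >= rhs)   # log-sum inequality
```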

\(f\)-divergence

Let \(f:\bbR\to\bbR\) be a convex function such that \(f(1)=0\). The \(f\)-divergence between two distributions \(P,Q \in\Delta(\calX)\) is

\[\begin{align*} \bbD_f(P\Vert Q)\eqdef \sum_{x\in\calX}f\left(\frac{P(x)}{Q(x)}\right)Q(x), \end{align*}\]

with the convention that \(\bbD_f(P\Vert Q)=\infty\) if \(P\) is not absolutely continuous with respect to \(Q\).
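
A minimal sketch of this definition for finite alphabets, assuming PMFs are stored as arrays indexed by the symbols of \(\calX\); the helper name `f_divergence` and the example distributions are my illustrative choices.

```python
import numpy as np

def f_divergence(f, P, Q):
    """Compute D_f(P||Q) = sum_x f(P(x)/Q(x)) Q(x) for finite PMFs P, Q."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    if np.any((Q == 0) & (P > 0)):   # P not absolutely continuous w.r.t. Q
        return np.inf
    mask = Q > 0                     # terms with Q(x) = 0 (and P(x) = 0) contribute 0
    return np.sum(f(P[mask] / Q[mask]) * Q[mask])

# Example with the convex function f(t) = t ln t, f(1) = 0 (strictly positive PMFs).
P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]
print(f_divergence(lambda t: t * np.log(t), P, Q))   # >= 0, equals 0 iff P == Q
```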

A function \(W:\calX\times\calY\to\bbR:(x,y)\mapsto W(y|x)\) is a Markov kernel (also called a transition probability) from \(\calX\) to \(\calY\) if \(W(y|x)\geq 0\) for all \((x,y)\in\calX\times\calY\) and \(\sum_{y\in\calY}W(y|x)=1\) for all \(x\in\calX\).
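
For concreteness, a finite Markov kernel can be stored as a row-stochastic matrix with rows indexed by \(x\) and columns by \(y\); the small example below just checks the two defining conditions (the values are arbitrary).

```python
import numpy as np

# W[x, y] = W(y|x): rows indexed by x in X, columns by y in Y (illustrative values).
W = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5]])

print(np.all(W >= 0))                    # W(y|x) >= 0 for all (x, y)
print(np.allclose(W.sum(axis=1), 1.0))   # sum_y W(y|x) = 1 for every x
```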

For any \(P,Q \in \Delta(\calX)\), \(\bbD_f(P\Vert Q)\geq 0\); if \(f\) is strictly convex at \(1\), equality holds if and only if \(P=Q\).

More properties of \(f\)-divergence

Let \(P,Q\in\Delta(\calX)\) and let \(W\) be a Markov kernel from \(\calX\) to \(\calY\). Then the data-processing inequality holds:

\[\begin{align*} \bbD_f(W\circ P\Vert W\circ Q) \leq \bbD_f(W\cdot P\Vert W\cdot Q) \leq \bbD_f(P\Vert Q). \end{align*}\]
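
The sketch below checks this chain of inequalities numerically for the case \(f(t)=t\ln t\); I read \(W\circ P\) as the output distribution \(\sum_x P(x)W(\cdot|x)\) and \(W\cdot P\) as the joint distribution \(P(x)W(y|x)\), which is an assumption on my part, and the helper `kl` is mine.

```python
import numpy as np

def kl(P, Q):
    """Relative entropy D(P||Q) for strictly positive finite PMFs (flattened)."""
    P, Q = np.ravel(P).astype(float), np.ravel(Q).astype(float)
    return np.sum(P * np.log(P / Q))

P = np.array([0.6, 0.3, 0.1])
Q = np.array([0.2, 0.5, 0.3])
W = np.array([[0.9, 0.1],      # W[x, y] = W(y|x), rows sum to 1
              [0.2, 0.8],
              [0.5, 0.5]])

joint_P, joint_Q = P[:, None] * W, Q[:, None] * W   # joints W . P and W . Q
out_P, out_Q = P @ W, Q @ W                         # outputs W o P and W o Q

print(kl(out_P, out_Q), kl(joint_P, joint_Q), kl(P, Q))   # non-decreasing left to right
```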

The \(f\)-divergence \(\bbD_f:\Delta(\calX)\times\Delta(\calX)\to\bbR^+:(P,Q)\mapsto\bbD_f(P\Vert Q)\) is jointly convex, i.e., for all \(P_1,Q_1,P_2,Q_2\in\Delta(\calX)\) and \(\lambda\in[0,1]\)

\[\begin{align*} \mathbb{D}_f\left(\lambda P_1+(1-\lambda) P_2\Vert \lambda Q_1+(1-\lambda) Q_2\right)\leq \lambda\mathbb{D}_f(P_1\Vert Q_1)+(1-\lambda)\mathbb{D}_f(P_2\Vert Q_2). \end{align*}\]
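
A quick numerical illustration of joint convexity, again for \(f(t)=t\ln t\); the distributions, the value of \(\lambda\), and the helper `kl` are arbitrary illustrative choices.

```python
import numpy as np

def kl(P, Q):
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return np.sum(P * np.log(P / Q))

P1, Q1 = np.array([0.7, 0.2, 0.1]), np.array([0.3, 0.4, 0.3])
P2, Q2 = np.array([0.2, 0.5, 0.3]), np.array([0.6, 0.2, 0.2])
lam = 0.3

lhs = kl(lam * P1 + (1 - lam) * P2, lam * Q1 + (1 - lam) * Q2)
rhs = lam * kl(P1, Q1) + (1 - lam) * kl(P2, Q2)
print(lhs, rhs, lhs <= rhs)   # joint convexity of the f-divergence
```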

Special cases of \(f\)-divergence

For \(P,Q\in\Delta(\calX)\), the total variation between \(P\) and \(Q\) is the \(f\)-divergence for \(f:t\mapsto\frac{1}{2}\abs{t-1}\), i.e.,

\[\begin{align} \mathbb{V}\left(P,Q\right)\eqdef \frac{1}{2}\norm[1]{P-Q}\eqdef \frac{1}{2}\sum_{x\in\calX}\abs{P(x)-Q(x)}. \end{align}\]
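
As a sanity check on this special case (using \(f(t)=\tfrac{1}{2}\abs{t-1}\), consistent with the \(\tfrac{1}{2}\) in the norm), the two expressions below agree; the distributions are illustrative.

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.4, 0.4])

tv_direct = 0.5 * np.sum(np.abs(P - Q))             # (1/2) ||P - Q||_1
tv_as_f_div = np.sum(0.5 * np.abs(P / Q - 1) * Q)   # f-divergence with f(t) = |t-1|/2
print(tv_direct, tv_as_f_div)                       # both equal 0.3 here
```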

For \(P,Q\in\Delta(\calX)\), the relative entropy, also called Kullback-Leibler divergence, between \(P\) and \(Q\) is the \(f\)-divergence for \(f:t\mapsto t\ln t\), i.e.,

\[\begin{align} \bbD(P\Vert Q)\eqdef \sum_{x\in\calX}P(x)\ln\frac{P(x)}{Q(x)}, \end{align}\]

with the convention that \(\bbD(P\Vert Q) = \infty\) if \(P\) is not absolutely continuous with respect to \(Q\).
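
A small illustration of the relative-entropy formula and of the absolute-continuity convention; the helper `kl` and the example distributions are mine.

```python
import numpy as np

def kl(P, Q):
    """D(P||Q) in nats for finite PMFs, with 0 ln 0 = 0 and D = inf if P is not
    absolutely continuous with respect to Q."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    if np.any((Q == 0) & (P > 0)):
        return np.inf
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

print(kl([0.5, 0.5, 0.0], [0.25, 0.25, 0.5]))   # finite, > 0 (equals ln 2)
print(kl([0.5, 0.5], [1.0, 0.0]))               # inf: P not absolutely continuous w.r.t. Q
print(kl([0.3, 0.7], [0.3, 0.7]))               # 0 if and only if P == Q
```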

Entropy

  • Axiomatic construction of entropy H
    1. If \(m>n\), with \(m,n\in\mathbb{N}\), \(X\sim\calU([1;m])\), and \(Y\sim\calU([1;n])\), then \(H(X)>H(Y)\)
    2. If \(m,n\in\mathbb{N}\), \(X\sim\calU([1;m])\), \(Y\sim\calU([1;n])\), and \(X\) is independent of \(Y\), then \(H(XY)=H(X)+H(Y)\)
    3. Let \(\calA,\calB\subset\calX\) with \(\calA\cap\calB=\emptyset\) and \(\calA\cup\calB=\calX\); then \[ H(P_X) = H\left(\P{A},\P{B}\right) + \P{A}H(P_{X|A}) + \P{B}H(P_{X|B}) \]
    4. \(H(X)\) is a continuous function of \(p\) if \(X\sim B(p)\)

The only functions that satisfy the axioms are \(H(X) = -C\sum_xP_X(x)\log P_X(x)\) for some \(C>0\).

Let \(X\in\calX\) be a discrete random variable with \(\card{\calX}<\infty\). The Shannon entropy of \(X\) is defined as

\[\begin{align} \mathbb{H}(X) \eqdef \E[X]{-\ln P_X(X)} = - \sum_{x \in {\calX}} P_X(x) \ln P_X(x), \end{align}\]

with the convention that \(0\ln 0 = 0\).
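
A direct transcription of the definition into code, in nats; the helper name `entropy` is mine, not part of the lecture.

```python
import numpy as np

def entropy(P):
    """Shannon entropy H(X) = -sum_x P(x) ln P(x) in nats, with 0 ln 0 = 0."""
    P = np.asarray(P, dtype=float)
    P = P[P > 0]                 # drop zero-probability symbols (0 ln 0 = 0)
    return -np.sum(P * np.log(P))

print(entropy([0.5, 0.5]))       # ln 2 ~ 0.693 nats for a fair coin flip
print(entropy([1.0, 0.0]))       # 0 for a constant
```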

Properties of entropy

The entropy \(H:\Delta(\calX)\to\bbR:P\mapsto H(P)\) is a concave function.

Let \(X\in\calX\) be a discrete random variable. Then \(\mathbb{H}(X) \geq 0\) with equality iff \(X\) is a constant, i.e., there exists \(x^*\in\calX\) such that \(\P{X=x^*}=1\).

Let \(X \in {\calX}\) be a discrete random variable. Then \(\mathbb{H}(X)\leq \ln \card{\calX}\) with equality if and only if \(X\) is uniform over \({\calX}\), i.e., \(\forall x \in {\calX}, P_X(x) = \frac{1}{\card{\calX}}\).
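
A numerical illustration of both bounds: a uniform distribution meets the upper bound \(\ln\card{\calX}\), while a non-uniform one falls strictly between \(0\) and \(\ln\card{\calX}\); the alphabet size and PMFs are arbitrary.

```python
import numpy as np

def entropy(P):
    P = np.asarray(P, float)
    P = P[P > 0]
    return -np.sum(P * np.log(P))

n = 5
print(entropy(np.ones(n) / n), np.log(n))     # uniform achieves the upper bound ln|X|
print(entropy([0.7, 0.1, 0.1, 0.05, 0.05]))   # strictly between 0 and ln 5
```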

Joint and conditional entropy

The joint entropy of two discrete random variables \(X\in\calX\) and \(Y\in\calY\) with joint PMF \(P_{XY}\) is

\[\begin{align*} \mathbb{H}(X,Y) \eqdef \E[XY]{-\ln P_{XY}(X,Y)} = - \sum_{x \in {\calX}} \sum_{y \in \calY}P_{XY}(x, y) \ln P_{XY}(x, y). \end{align*}\]

Furthermore, the conditional entropy of \(Y\) given \(X\) is

\[\begin{align*} \mathbb{H}(Y|X) \eqdef \E[XY]{-\ln P_{Y|X}(Y|X)} = - \sum_{x \in {\calX}} \sum_{y \in \calY} P_{XY}(x,y) \ln P_{Y|X}(y|x). \end{align*}\]

Let \(X, Y\) be discrete random variables with joint PMF \(P_{XY}\). Then \(\mathbb{H}(Y|X) \geq 0\) with equality if and only if \(Y\) is a function of \(X\).

Let \(X\) and \(Y\) be discrete random variables with joint PMF \(P_{XY}\). Then \(\mathbb{H}(X|Y) \leq \mathbb{H}(X)\) with equality if and only if \(X\) is independent of \(Y\).
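
The sketch below computes \(\mathbb{H}(X,Y)\), \(\mathbb{H}(Y|X)\), and \(\mathbb{H}(X|Y)\) directly from the definitions for a small joint PMF, and checks that conditioning does not increase entropy; the joint PMF is an arbitrary example.

```python
import numpy as np

# Joint PMF P_XY[x, y] over X = {0,1}, Y = {0,1,2} (illustrative values, sums to 1).
P_XY = np.array([[0.20, 0.15, 0.05],
                 [0.10, 0.10, 0.40]])
P_X, P_Y = P_XY.sum(axis=1), P_XY.sum(axis=0)

H = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))           # entropy in nats

H_XY = H(P_XY.ravel())                                       # H(X,Y)
H_Y_given_X = -np.sum(P_XY * np.log(P_XY / P_X[:, None]))    # H(Y|X) from the definition
H_X_given_Y = -np.sum(P_XY * np.log(P_XY / P_Y[None, :]))    # H(X|Y) from the definition

print(H_XY, H_Y_given_X, H_X_given_Y)
print(H_X_given_Y <= H(P_X))    # conditioning does not increase entropy
```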

Chain rule of entropy

Let \(X\), \(Y\), and \(Z\) be discrete random variables with joint PMF \(P_{XYZ}\). Then \[ \mathbb{H}(XY|Z) = \mathbb{H}(X|Z) + \mathbb{H}(Y|XZ) = \mathbb{H}(Y|Z) + \mathbb{H}(X|YZ). \] More generally, if \(\bfX\eqdef\set{X_i}_{i=1}^n\) and \(Z\) are jointly distributed random variables, we have \[ \mathbb{H}(\bfX|Z) = \sum_{i=1}^n\mathbb{H}(X_i|X^{1:i-1}Z) \] with the convention \(X^{1:0}\eqdef \emptyset\).
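
A numerical check of the two-variable chain rule on a randomly generated joint PMF over three binary alphabets; all conditional entropies below are computed directly from the definitions, and the helpers are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((2, 2, 2))                   # joint PMF P_XYZ[x, y, z], strictly positive
P /= P.sum()

P_Z  = P.sum(axis=(0, 1))                   # marginal P_Z[z]
P_XZ = P.sum(axis=1)                        # marginal P_XZ[x, z]

# Conditional entropies from H(A|B) = -E[ln P_{A|B}(A|B)].
H_XY_given_Z = -np.sum(P * np.log(P / P_Z[None, None, :]))
H_X_given_Z  = -np.sum(P_XZ * np.log(P_XZ / P_Z[None, :]))
H_Y_given_XZ = -np.sum(P * np.log(P / P_XZ[:, None, :]))

print(np.isclose(H_XY_given_Z, H_X_given_Z + H_Y_given_XZ))   # chain rule holds
```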

Mutual information

Let \(X, Y\) be two random variables with joint PMF \(P_{XY}\). The mutual information between \(X\) and \(Y\) is \[ \bbI(X;Y)\eqdef \bbH(X)-\bbH(X|Y)=\bbD(P_{XY}\Vert P_XP_Y). \]
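
The following sketch checks the stated identity numerically on a small joint PMF: \(\bbI(X;Y)\) computed as \(\bbH(X)-\bbH(X|Y)\) agrees with \(\bbD(P_{XY}\Vert P_XP_Y)\); the joint PMF is arbitrary and the helpers are mine.

```python
import numpy as np

P_XY = np.array([[0.30, 0.10],   # joint PMF over X = {0,1,2}, Y = {0,1} (illustrative)
                 [0.05, 0.25],
                 [0.15, 0.15]])
P_X, P_Y = P_XY.sum(axis=1), P_XY.sum(axis=0)

H = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))           # entropy in nats

H_X_given_Y = -np.sum(P_XY * np.log(P_XY / P_Y[None, :]))    # H(X|Y) from the definition
I_via_entropy = H(P_X) - H_X_given_Y                         # H(X) - H(X|Y)
I_via_kl = np.sum(P_XY * np.log(P_XY / np.outer(P_X, P_Y)))  # D(P_XY || P_X P_Y)

print(I_via_entropy, I_via_kl)   # the two expressions coincide
```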

If \(\Delta(\calX\to\calY)\) denotes the set of Markov kernels from \(\calX\) to \(\calY\), then the function \[ I:\Delta(\calX)\times \Delta(\calX\to\calY)\to\bbR^+:(P,W)\mapsto I(P;W) \] is a concave function of \(P\) for fixed \(W\) and a convex function of \(W\) for fixed \(P\).