Chapter 3 Concentration Inequalities

(This chapter was scribed by Paul Barber. Proofread and polished by Baozhen Wang.)

So-called “concentration inequalities” are technical tools to show how much

  1. \(x_{1}+x_{2}+ \cdots + x_{n}\) concentrates around \(E( x_{1}+x_{2}+ \cdots +x_{n})\)
  2. \(f(x_{1},x_{2},\ldots,x_{n})\) concentrates around \(E(f(x_{1},\ldots , x_{n}))\)

3.1 Basic Inequalities

Definition 3.1 (Big O Notation) \(O(x)\) is a placeholder for some function \(f(x)\) such that, for some constant \(c>0\), \(|f(x)| \le cx\) for all \(x\).

Unlike the traditional asymptotic use, we do not take a limit such as \(n \to \infty\): this simply defines an upper bound, and \(x\) may be replaced by any other term, e.g. we might have \(O(x^2), \ O( \log x)\), etc. Moreover, for \(a,b \ge 0\), \(a \lesssim b\) means there is some constant \(c>0\) such that \(a \le c b\).

Theorem 3.1 (Markov's Inequality) For \(X\ge 0\) a.s., \(EX < \infty\), and \(t>0\), \[P(X\ge t)\le \frac{EX}{t}.\]

Proof. \[\begin{align*} EX &\ge \int 1 \cdot X \, dP \\ &\ge \int \mathbb{1}[X \ge t]\, X \, dP \\ &\ge \int \mathbb{1}[X \ge t]\, t \, dP \\ &= t \int \mathbb{1}[X \ge t] \, dP\\ &= t P(X\ge t) .\square \end{align*}\]
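To see the inequality in action, here is a minimal Python sketch (the exponential distribution, its scale, and the sample size are arbitrary choices for illustration) comparing a Monte Carlo estimate of \(P(X\ge t)\) with the Markov bound \(EX/t\):

```python
import numpy as np

rng = np.random.default_rng(0)
# X >= 0 a.s. with EX = 2 (arbitrary illustrative choice)
X = rng.exponential(scale=2.0, size=1_000_000)

for t in [1.0, 2.0, 5.0, 10.0]:
    empirical = (X >= t).mean()   # Monte Carlo estimate of P(X >= t)
    markov = X.mean() / t         # Markov bound EX / t
    print(f"t = {t:4.1f}   P(X >= t) ~ {empirical:.4f}   Markov bound = {markov:.4f}")
```

The bound always holds but can be quite loose, which motivates the sharper inequalities below.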

Theorem 3.2 (The Moment Inequality) For \(EX< \infty\), \(f: [0, \infty) \to [0, \infty)\) strictly monotonically increasing, \(Ef(|X-EX|)< \infty\), \(t>0\), \[P(|X-EX|\ge t) = P(f(|X-EX|) \ge f(t)) \le \frac{Ef(|X-EX|)}{f(t)} \]

Proof. Use Markov’s Inequality with \(X\) replaced by \(f(|X-EX|)\). \(\square\)

As an example application, setting \(f(a)=a^2\) gives Chebyshev’s inequality:

\[P\left(|X-EX|\ge t\right) \le \frac{E(|X-EX|^2)}{t^2} = \frac{Var (X)}{t^2}.\]

Another example, taking \(f(a)=a^{k}\) for \(k>0\):

\[P\left(|X-EX|\ge t\right) \le \frac{E|X-EX|^{k}}{t ^{k}}.\]

Remarks on Chebyshev’s inequality:

  1. It requires only that \(X\) has finite variance.
  2. As we move into the tails of the distribution of \(X\), the tail probability \(P(|X-EX|\ge t)\) decreases at least as fast as \(\frac{1}{t^2}\).
  3. For small \(\delta \in (0,1]\), set \(\delta = \frac{Var(X)}{t^2}\). This implies \(t= \frac{sd (X)}{\sqrt{\delta} }\). Then we have

\[P\left(|X-EX|\le \frac{sd (X)}{\sqrt{\delta} }\right) \ge 1-\delta.\]

  4. Unfortunately, it is rather weak, e.g. consider \(X \sim N(0,1)\). Then we can show that

\[P(X>t) = \int_{t}^{\infty} \frac{1}{\sqrt{2 \pi} } e^{-\frac{x^2}{2}} dx\]

Since \(x / t \ge 1\) on the domain of integration, we have

\[\begin{align*} P(X>t) &\le \int_{t}^{\infty} \frac{x}{t} \frac{1}{\sqrt{2\pi} }e^{-\frac{x^2}{2}} dx \\ &= -\int_{t}^{\infty} \frac{1}{t} \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}} d \left(-\frac{x^2}{2}\right)\\ &= -\int_{t}^{\infty} \frac{1}{t} \frac{1}{\sqrt{2\pi}} d\left(e^{-\frac{x^2}{2}}\right) \\ &= \frac{1}{t \sqrt{2 \pi} } e^{-\frac{t^2}{2}} \\ &\le e^{- \frac{t^2}{2}} \qquad \left(\text{for } t \ge \tfrac{1}{\sqrt{2\pi}}\right). \end{align*}\]

If \(X \sim N(\mu, \sigma^2)\), then

\[P(|X-\mu| \ge t \cdot \sigma) = P\left( \left| \frac{X-\mu}{\sigma} \right| \ge t\right) \le 2e^{-\frac{t^2}{2}}.\] Setting \(2e^{-\frac{t^2}{2}} = \delta\) gives \(t = \sqrt{2 \log \left( 2/\delta \right) }\).

So \(|X-\mu| \le \sqrt{2 \log \left( 2/\delta \right) } \cdot \sigma\) with probability at least \(1-\delta\), which is a tighter bound than Chebyshev’s

\[\frac{SD(X)}{\sqrt{\delta} } = \frac{\sigma}{\sqrt{\delta} }.\]

To see this, let \(\delta = n^{-c}\). Then Chebyshev’s inequality gives a width of \(\sigma \cdot\delta^{-\frac{1}{2}} = \sigma \cdot O(n^{\frac{c}{2}})\), while the Gaussian tail bound gives \(\sigma \cdot \sqrt{2 \log \left( 2/\delta \right) } = \sigma \cdot O(\sqrt{\log n} )\) (see the numerical sketch after these remarks).

  5. Without any knowledge about the distribution of \(X\), Chebyshev’s inequality is indeed optimal, i.e. there is a distribution for which it is tight.
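The following short Python sketch makes the comparison in remark 4 concrete by computing the two high-probability widths for \(\delta = n^{-c}\) with \(c=1\) and \(\sigma = 1\) (both arbitrary illustrative choices):

```python
import numpy as np

sigma = 1.0
for n in [10, 100, 1_000, 10_000]:
    delta = 1.0 / n                                           # delta = n^{-c} with c = 1
    chebyshev_width = sigma / np.sqrt(delta)                  # sd(X) / sqrt(delta)
    gaussian_width = sigma * np.sqrt(2 * np.log(2 / delta))   # sigma * sqrt(2 log(2/delta))
    print(f"n = {n:6d}   Chebyshev width = {chebyshev_width:8.2f}   Gaussian width = {gaussian_width:5.2f}")
```

The Chebyshev width grows like \(\sqrt{n}\), while the Gaussian width grows only like \(\sqrt{\log n}\).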

3.2 Chernoff bounds: a useful technique

Comment: Take the moment inequality with \(f(a) = e^{\lambda a}, \ (\lambda >0)\).

Theorem 3.3 For \(EX< \infty\), \(E e^{\lambda (X-EX)}< \infty\), and \(t>0\), \[\begin{align*} P(X-EX \ge t) &= P(e^{\lambda(X-EX)}\ge e^{\lambda t}) \\ &\le E[e^{\lambda (X-EX)} ]/e^{\lambda t} \\ &= e^{-\lambda t } \cdot M_{X-\mu}(\lambda), \end{align*}\]

where \(\mu := EX\) and \(M_{X-\mu}\) is the moment generating function of \(X-\mu\).

Example 3.1 (Gaussian) For \(X \sim N(\mu, \sigma^2)\), substituting \(y= x - \mu\) gives

\[\begin{align*} M_{X- \mu} (\lambda) &= E(e^{\lambda(X-\mu)}) \\ &= \int \frac{1}{\sqrt{2 \pi \sigma^2} } e^{-(x - \mu)^2/ 2 \sigma^2} e^{\lambda(x-\mu)} dx\\ &(\text{ let } y=x-\mu) \\ &= \int \frac{1}{\sqrt{2 \pi \sigma^2} } e^{-\frac{y^2}{2\sigma^2}+\lambda y} dy\\ &= \frac{1}{\sqrt{2 \pi \sigma^2} } \int e^{-\frac{1}{2\sigma^2}(y^2-2\sigma^2 \lambda y + \sigma^{4} \lambda^2)+\frac{\sigma^2 \lambda^2}{2}} dy\\ &= \frac{1}{\sqrt{2 \pi \sigma^2} } e^{\frac{\sigma^2\lambda^2}{2}} \int e^{-\frac{(y-\sigma^2\lambda)^2}{2\sigma^2}} dy\\ &= e^{\frac{\sigma^2 \lambda^2}{2}} .\end{align*}\]
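As a quick numerical check of this closed form (the parameter values and sample size below are arbitrary illustrative choices), one can compare a Monte Carlo estimate of \(E e^{\lambda(X-\mu)}\) with \(e^{\sigma^2\lambda^2/2}\):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 1.0, 2.0                        # arbitrary illustrative parameters
X = rng.normal(mu, sigma, size=2_000_000)

for lam in [0.1, 0.5, 1.0]:
    mgf_mc = np.exp(lam * (X - mu)).mean()          # Monte Carlo estimate of E e^{lambda (X - mu)}
    mgf_closed = np.exp(sigma**2 * lam**2 / 2)      # closed form e^{sigma^2 lambda^2 / 2}
    print(f"lambda = {lam:.1f}   MC ~ {mgf_mc:.4f}   closed form = {mgf_closed:.4f}")
```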

The Chernoff Bound:

\[P(X-\mu \ge t) \le e^{-\lambda t} e^{\frac{\sigma^2 \lambda^2}{2}} = e^{-\lambda t+\frac{\sigma^2 \lambda^2}{2}} ,\ \forall \lambda > 0 \]

Take \(\lambda = \frac{t}{\sigma^2}\), which minimizes the exponent:

\[\begin{align*} P(X-\mu \ge t) &\le e^{-\frac{t^2}{\sigma^2}+\frac{t^2 \sigma^2}{2\sigma^{4}}}\\ &= e^{\frac{-t^2}{2\sigma^2}}. \\ \end{align*}\]

For i.i.d. \(X_{i} \sim N(\mu, \sigma^2)\), \[\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_{i} \sim N\left( \mu, \frac{\sigma^2}{n}\right), \] so we have \[P(\bar{X}-\mu \ge t) \le e^{-\frac{t^2 n}{2 \sigma^2}}.\]
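Below is a minimal simulation sketch (assuming i.i.d. standard normal data, with arbitrary choices of \(n\) and the number of repetitions) comparing the empirical tail probability of the sample mean with the Chernoff bound \(e^{-t^2 n/(2\sigma^2)}\):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 0.0, 1.0, 50, 200_000
# Each row is one sample of size n; its mean is distributed N(mu, sigma^2 / n)
Xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

for t in [0.1, 0.2, 0.3]:
    empirical = (Xbar - mu >= t).mean()                 # Monte Carlo estimate of P(Xbar - mu >= t)
    chernoff = np.exp(-t**2 * n / (2 * sigma**2))       # Chernoff bound e^{-t^2 n / (2 sigma^2)}
    print(f"t = {t:.1f}   P(Xbar - mu >= t) ~ {empirical:.4f}   Chernoff bound = {chernoff:.4f}")
```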

Theorem 3.4 (Hoeffding's Lemma (Bound)) For a r.v. \(Z \in [a,b]\) a.s. with \(\mu = E(Z)\) and \(s>0\),

\[E e^{s(Z-\mu)} \le e^{\frac{s^2(b-a)^2}{8}}.\]

Proof. We focus on the version with \(\mu =0\); otherwise replace \(Z\) by \(Z-\mu\), which still lies in an interval of length \(b-a\). That is, we show \(Ee^{s Z } \le e^{\frac{s^2(b-a)^2}{8}}\) when \(E(Z)=0\). Consider \(\psi (s) = \log E (e ^{s Z})\). It suffices to prove that \(\psi(s) \le \frac{s^2(b-a)^2}{8}\). We start by computing the first two derivatives of \(\psi\):

\[\begin{align*} \psi'(s) &= (\log E e^{sZ} )' = \frac{E(Ze^{sZ})}{E(e^{sZ})} \\ \psi''(s) &= \frac{E(Z^2 e^{sZ})E (e^{sZ}) -(E(Z e^{sZ} ))^2}{(Ee^{sZ})^2} \\ &= E\left(Z^2 \frac{e^{sZ}}{E(e^{sZ})} \right) -\left(E\left( Z \frac{e^{sZ}}{E(e^{sZ})} \right)\right)^2\\ &= Var_{F}(Z) \quad \left(\text{where } dF := \frac{e^{sZ}}{E(e^{sZ})}\, dP \text{ is a tilted probability measure}\right)\\ &= Var_{F}\left(Z- \frac{a+b}{2}\right) \\ &\le E_{F}\left(\left(Z-\frac{a+b}{2}\right)^2\right) \\ &\le \left( \frac{b-a}{2} \right)^2 \quad \left(\text{since } \left|Z-\tfrac{a+b}{2}\right| \le \tfrac{b-a}{2} \text{ a.s.}\right) \end{align*}\]

Finally, since \(\psi(0) = 0\) and \(\psi'(0) = E(Z) = 0\), the fundamental theorem of calculus gives

\[\psi(s) = \int_{0}^{s} \int_{0}^{v} \psi''(u) \, du\, dv \le \int_0^s \left( \frac{b-a}{2} \right)^2 v \, dv = \frac{s^2(b-a)^2}{8}.\square\]
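As a sanity check of the lemma, consider a Bernoulli variable, used here only as a convenient bounded example; its MGF is available in closed form, so no simulation is needed:

```python
import numpy as np

# Z ~ Bernoulli(p) takes values in [a, b] = [0, 1], with mu = E(Z) = p.
p = 0.3
for s in [0.5, 1.0, 2.0, 4.0]:
    lhs = np.exp(-s * p) * ((1 - p) + p * np.exp(s))   # E e^{s(Z - mu)}, computed exactly
    rhs = np.exp(s**2 * (1 - 0)**2 / 8)                # Hoeffding's lemma bound e^{s^2 (b-a)^2 / 8}
    print(f"s = {s:.1f}   E e^(s(Z - mu)) = {lhs:.4f}   bound = {rhs:.4f}")
```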

3.3 Hoeffding’s Inequality

This is a concentration inequality for bounded random variables with an exponential tail bound.

Theorem 3.5 (Hoeffding's Inequality) Let \(X_{1}, X_{2}, \ldots, X_{n}\) be independent (but not necessarily identically distributed) r.v.s such that \(a_{i}\le X_{i} \le b_{i}\) a.s. Define \(\bar{X} := \frac{1}{n} \sum_{i=1}^{n} X_{i}\) and \(\mu := E(\bar{X})\). Then for any \(\epsilon >0\), \[P(|\bar{X}-\mu|\le \epsilon) \ge 1- 2 \exp \left( - \frac{2n^2 \epsilon^2}{\sum_{i=1}^{n} (b_{i}-a_{i})^2} \right) .\]

Remarks:

  • Equivalently, the theorem bounds \(| \sum_{i=1}^{n} X_{i} - n \mu |\) by \(t = n \epsilon\) with the same probability.
  • Version for \([a_{i},b_{i}]=[0,1]\) :

\[P(|\bar{X}-\mu| \le \epsilon) \ge 1- 2 e^{-2 n \epsilon^2}.\]

  • Each \((b_{i}-a_{i})^2\) can be thought of as an upper bound or proxy for \(Var(X_{i})= E(X_{i} -EX_{i})^2\), so

\[\begin{align*} Var(\bar{X}) &= Var\left( \frac{1}{n} \sum_{i=1}^{n} X_{i}\right) \\ &= \frac{1}{n^2} \sum_{i=1}^{n} E(X_{i}-EX_{i})^2 \\ &\le \frac{1}{n^2} \sum_{i=1}^{n} (b_{i}-a_{i})^2\\ &\triangleq \sigma^2 .\end{align*}\]

  • With this notation, the RHS becomes \(1-2\exp(-2 \epsilon^2 / \sigma^2)\). Let \(\delta = 2 \exp( - 2 \epsilon^2 / \sigma^2)\); then

\[\epsilon = \sqrt{ \frac{1}{2} \sigma^2 \log\left( \frac{2}{\delta} \right) }.\]

So \[|\bar{X}-\mu| \le \sqrt{\frac{1}{2} \sigma^2 \log \left( \frac{2}{\delta}\right)} \] with probability at least \(1-\delta\).
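For a given confidence level \(\delta\), this width \(\epsilon\) can be computed directly. The sketch below does so for the \([0,1]\) case, where \(\sigma^2 = 1/n\); the helper name `hoeffding_width` and the values of \(n\) and \(\delta\) are arbitrary illustrative choices:

```python
import numpy as np

def hoeffding_width(n, delta):
    """Width eps such that |Xbar - mu| <= eps with probability >= 1 - delta, for X_i in [0, 1]."""
    sigma2 = 1.0 / n                      # (1/n^2) * sum (b_i - a_i)^2 with [a_i, b_i] = [0, 1]
    return np.sqrt(0.5 * sigma2 * np.log(2.0 / delta))

for n in [100, 1_000, 10_000]:
    print(f"n = {n:6d}   eps = {hoeffding_width(n, delta=0.05):.4f}")
```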

Proof. Define \(Z_{i} = X_{i} - E X_{i}\). It suffices to show that

\[P\left(\frac{1}{n} \sum_{i=1}^{n} Z_{i} \ge \epsilon \right) \le \exp\left(\frac{-2 n^2 \epsilon^2}{\sum_{i=1}^{n} (b_{i}-a_{i})^2}\right),\]

as translation doesn’t change \(b_{i}-a_{i}\), and the other tail follows by applying the same bound to \(-Z_{i}\). We use the Chernoff bound technique: for any \(s>0\),

\[\begin{align*} P\left(\frac{1}{n}\sum_{i=1}^{n} Z_{i} \ge \epsilon\right) &= P \left( \exp \left(s \sum_{i=1}^{n} Z_{i}\right)\ge \exp(sn\epsilon)\right) \\ &\le e^{-sn\epsilon}\, E\left[\exp\left(s \sum_{i=1}^{n} Z_{i}\right)\right] \\ &= e^{-sn\epsilon} \prod_{i=1}^{n} E\left[ \exp(sZ_{i})\right] \quad (\text{by independence})\\ &\le e^{-sn\epsilon} \prod_{i=1}^{n} e^{\frac{s^2(b_i-a_i)^2}{8}}\\ &= \exp\left(-s n \epsilon + \frac{s^2}{8} \sum_{i=1}^{n}(b_{i}-a_{i})^2\right) .\end{align*}\]

Define \(g(s) := -sn\epsilon + \frac{s^2}{8} \sum_{i=1}^n (b_{i}-a_{i})^2\). Then

\[0 = g'(s) = -n \epsilon + \frac{s}{4} \sum_{i=1}^n (b_{i}-a_{i})^2 \implies s = \frac{4 n \epsilon}{\sum_{i=1}^n (b_{i}-a_{i})^2}.\] Plugging in this \(s\), we have \[P\left(\frac{1}{n}\sum_{i=1}^{n} Z_{i} \ge \epsilon\right)\le \exp\left(-\frac{2 n^2 \epsilon^2}{\sum_{i=1}^n (b_{i}-a_{i})^2}\right).\square\]
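To close, here is a minimal simulation sketch of Theorem 3.5 for the \([0,1]\) case (uniform random variables, \(n\), and the number of repetitions are arbitrary illustrative choices), comparing the empirical two-sided tail probability with the bound \(2e^{-2n\epsilon^2}\):

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 100, 100_000
X = rng.uniform(0, 1, size=(reps, n))   # X_i in [0, 1] a.s., so mu = E(Xbar) = 1/2
Xbar = X.mean(axis=1)

for eps in [0.05, 0.10, 0.15]:
    empirical = (np.abs(Xbar - 0.5) >= eps).mean()   # Monte Carlo estimate of P(|Xbar - mu| >= eps)
    hoeffding = 2 * np.exp(-2 * n * eps**2)          # Hoeffding bound 2 e^{-2 n eps^2}
    print(f"eps = {eps:.2f}   P(|Xbar - mu| >= eps) ~ {empirical:.4f}   bound = {hoeffding:.4f}")
```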

In the following chapter, we will use Hoeffding’s inequality to bound the excess risk of a binary classifier given a finite hypothesis class.