Chapter 5 Advanced concentration inequalities

(This chapter was scribed by Xiyong Yan.)

With hoeffding’s bound \(|\hat{L_n}({h}) - {L_n}|\) for given h, which has high probability \(1-\frac{ \delta}{M}\)
After union bound for all \(h\in \mathcal{H}\), Where \(\mathcal{H}\) is finite. We have
\({L}(\hat{h}) - {L(\overline{h})} \leq 2\max\limits_{h\in \mathcal{H}}|\hat{L_n}({h}) - {L_n}|\leq \sqrt{\frac{2log(2M/\delta)}{n}}\) with probability \(1-\delta\).
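
To get a feel for the scale of this bound, here is a minimal sketch (the class size \(M\), sample sizes, and \(\delta\) below are made-up values) that evaluates \(\sqrt{2\log(2M/\delta)/n}\):

```python
import math

def excess_risk_bound(M, n, delta):
    """sqrt(2*log(2M/delta)/n): uniform-convergence bound for a finite class."""
    return math.sqrt(2 * math.log(2 * M / delta) / n)

# Hypothetical values: M = 1000 hypotheses, 95% confidence.
for n in [100, 1_000, 10_000, 100_000]:
    print(f"n = {n:>7}: bound = {excess_risk_bound(1000, n, 0.05):.4f}")
```

Note how the bound shrinks only like \(n^{-1/2}\), which is the point of the comparison below.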

Comparison with asymptotic bounds: \({L}(\hat{h}_n) - {L(\overline{h})}\leq \frac{c}{n}+o(\frac{1}{n})\), where the \(o(\frac{1}{n})\) term can depend significantly on the problem. For example, both \(n^{-2}\) and \(p^{100}n^{-2}\) are \(o(\frac{1}{n})\), but the second converges much more slowly. We no longer have this \(o(\frac{1}{n})\) term in the non-asymptotic bound.

However, we now have a slower rate of \(O(n^{-1/2})\) instead of \(O(n^{-1})\). \(O(\frac{1}{n})\) is typically called the fast rate; \(O(\frac{1}{n^{1/2}})\) is typically called the slow rate.
It may be possible to obtain the fast rate by uniform convergence under additional conditions. Unfortunately, the union bound technique cannot be used if \(\mathcal{H}\) is infinite. We will need some different techniques.

In the next few lectures, we will cover:

  1. Additional concentration inequalities as tools.
  2. A detour: how to obtain the fast rate under some conditions on the noise (with \(\mathcal{H}\) finite).
  3. Back to the main track: how to bound the excess risk using a complexity measure called Rademacher complexity.

5.1 Sub-Gaussian random variables

Definition: A random variable \(X\) with mean \(\mu\) is sub-Gaussian with parameter \(\sigma\) if \(M_{X-\mu}(\lambda)=E[\exp(\lambda(X-\mu))]\leq \exp(\sigma^2\lambda^2/2)\) for all \(\lambda \in \mathbb{R}\). We say that \(X\) is \(\sigma\)-sub-Gaussian with variance proxy \(\sigma^2\).

Recall the Chernoff bound:
\(P(X-\mu\geq t)=P[\exp(\lambda(X-\mu))\geq \exp(\lambda t)] \leq \exp(-\lambda t)M_{X-\mu}(\lambda)\) for all \(\lambda>0\).
So if \(X\) is sub-Gaussian, this allows us to obtain an exponential tail bound.
If \(X\sim N(\mu,\sigma^2)\), then \(M_{X-\mu}(\lambda)=e^{\sigma^2\lambda^2/2}\) exactly. Hence, a Gaussian is sub-Gaussian with variance proxy \(\sigma^2\).
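
For a Gaussian, the sub-Gaussian MGF bound is attained with equality, which we can check numerically. A minimal sketch (sample size and \(\lambda\) grid are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
x = rng.normal(0.0, sigma, size=1_000_000)  # centered Gaussian, mu = 0

# Empirical MGF E[exp(lambda*X)] vs the sub-Gaussian bound exp(sigma^2*lambda^2/2);
# for a Gaussian the two should agree (the bound is tight).
for lam in [0.1, 0.5, 1.0]:
    empirical = np.mean(np.exp(lam * x))
    bound = np.exp(sigma**2 * lam**2 / 2)
    print(f"lambda={lam}: empirical MGF = {empirical:.4f}, bound = {bound:.4f}")
```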

Theorem: If \(X\) is \(\sigma\)-sub-Gaussian with mean \(\mu\), then \(P(|X-\mu|\geq t)\leq 2\exp(-t^2/(2\sigma^2))\).
Proof:

\(P(X-\mu\geq t)=P[\exp(\lambda(X-\mu))\geq \exp(\lambda t)] \leq \exp(-\lambda t)E[\exp(\lambda(X-\mu))]\leq \exp(-\lambda t)\exp(\sigma^2 \lambda^2/2)=\exp(-\lambda t+\sigma^2 \lambda^2/2)\triangleq K(\lambda) \quad \forall \lambda>0\).

To get the tightest bound, minimize the exponent \(g(\lambda)=-\lambda t+\sigma^2 \lambda^2/2\) over \(\lambda\):
\(g'(\lambda)=-t+\sigma^2 \lambda=0\)
gives the minimizer \(\lambda =t/\sigma^2\).
Thus, \(K(t/\sigma^2)=\exp(-t^2/(2\sigma^2))\).

Going through the same line of reasoning for \(-X\) and \(-t\), we have \(P(X-\mu \leq -t)\leq \exp(-t^2/(2\sigma^2))\). By the union bound,
\(P[|X-\mu|\geq t]\leq P(X-\mu\geq t)+P(X-\mu \leq -t)\leq 2\exp(-t^2/(2\sigma^2))\), as required.
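
The bound can be checked by simulation. A minimal sketch, assuming a standard Gaussian (which is 1-sub-Gaussian) and an arbitrary grid of thresholds:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0
x = rng.normal(0.0, sigma, size=1_000_000)

# Empirical two-sided tail P(|X - mu| >= t) vs the bound 2*exp(-t^2/(2*sigma^2)).
for t in [0.5, 1.0, 2.0, 3.0]:
    empirical = np.mean(np.abs(x) >= t)
    bound = 2 * np.exp(-t**2 / (2 * sigma**2))
    print(f"t={t}: empirical={empirical:.5f}, bound={bound:.5f}")
```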

A sum of independent sub-Gaussian r.v.s is sub-Gaussian.
Theorem: If \(X_1,...,X_n\) are independent sub-Gaussian r.v.s with variance proxies \(\sigma_i^2, i=1,...,n\), then \(Z \triangleq \sum_{i=1}^n X_i\) is sub-Gaussian with variance proxy \(\sum_{i=1}^n\sigma_i^2\), and consequently, \(P[|Z-E(Z)|\geq t]\leq 2\exp(-\frac{t^2}{2\sum_{i=1}^n \sigma_i^2})\).
Proof:
For the first part, \[\begin{aligned} E[\exp(\lambda(Z-E(Z)))] =E\Big[\exp\Big(\lambda\sum_{i=1}^n(X_i-\mu_i)\Big)\Big] =\prod_{i=1}^nE[\exp(\lambda(X_i-\mu_i))]\leq \prod_{i=1}^n \exp(\lambda^2\sigma_i^2/2) =\exp\Big(\lambda^2 \sum_{i=1}^n \sigma_i^2/2\Big),\end{aligned}\] where the second equality uses independence. By the definition of sub-Gaussian with variance proxy \(\sum_{i=1}^n\sigma_i^2\), the second part follows from the previous theorem. As required.

Examples of sub-Gaussian r.v.s:

1. Rademacher r.v.
A Rademacher r.v. \(\epsilon\) takes values \(1\) and \(-1\) with probability \(1/2\) each. To see that \(\epsilon\) is sub-Gaussian:
\[\begin{aligned} E[\exp(\lambda\epsilon)] &=(\exp(\lambda)+\exp(-\lambda))/2\\ &=\Big(\sum_{k=0}^\infty\frac{\lambda^k}{k!}+\sum_{k=0}^\infty \frac{(-\lambda)^k}{k!}\Big)\Big/2 =\sum_{k=0}^\infty\frac{\lambda^{2k}}{(2k)!} \end{aligned}\] (the odd-\(k\) terms cancel)
\(\leq 1+\sum_{k=1}^\infty\frac{(\lambda^2)^k}{2^k k!}=\exp(\lambda^2/2)\), using \((2k)!\geq 2^k k!\). Hence \(\epsilon\) is sub-Gaussian with \(\sigma^2=1\).
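
A quick numerical check of this calculation (the \(\lambda\) grid is an arbitrary choice): \(E[\exp(\lambda\epsilon)]=\cosh(\lambda)\) should sit below \(e^{\lambda^2/2}\) for every \(\lambda\).

```python
import numpy as np

# cosh(lambda) = E[exp(lambda * eps)] for a Rademacher eps; compare to exp(lambda^2/2).
for lam in np.linspace(0.0, 3.0, 7):
    mgf = np.cosh(lam)
    bound = np.exp(lam**2 / 2)
    print(f"lambda={lam:.1f}: cosh={mgf:.4f} <= bound={bound:.4f}: {mgf <= bound}")
```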

2. Bounded r.v.
If \(X\) satisfies \(a \leq X\leq b\) a.s., then \(E[\exp(\lambda(X-\mu))]\leq \exp(\lambda^2(b-a)^2/8)\) (Hoeffding's lemma).
This implies that \(X\) is sub-Gaussian with \(\sigma^2=(b-a)^2/4\). Note that if the \(X_i \in [a_i,b_i]\) are independent, then each \(X_i\) is sub-Gaussian with \(\sigma_i^2=(b_i-a_i)^2/4\). This implies \(\sum_{i=1}^nX_i\) is sub-Gaussian with \(\sigma^2=\sum_{i=1}^n(b_i-a_i)^2/4\).

Hence, \(P[|\sum_{i=1}^nX_i-E(\sum_{i=1}^nX_i)|\geq t]\leq 2\exp(-2t^2/\sum_{i=1}^n(b_i-a_i)^2)\) (Hoeffding's inequality).
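
A minimal simulation sketch of Hoeffding's inequality, assuming i.i.d. Uniform\([0,1]\) variables (so \(b_i-a_i=1\) and \(\sum_i(b_i-a_i)^2=n\)):

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 100, 100_000

x = rng.uniform(size=(trials, n))       # X_i in [0, 1], so b_i - a_i = 1
dev = np.abs(x.sum(axis=1) - n * 0.5)   # |sum_i X_i - E sum_i X_i|

for t in [5.0, 10.0, 15.0]:
    empirical = np.mean(dev >= t)
    bound = 2 * np.exp(-2 * t**2 / n)   # sum_i (b_i - a_i)^2 = n
    print(f"t={t}: empirical={empirical:.5f}, bound={bound:.5f}")
```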

5.2 Concentration of functions of r.v.s

We now introduce inequalities related to the second of our two goals, namely, to show that \(f(X_1,...,X_n)\) concentrates around \(Ef(X_1,...,X_n)\).

Definition: Bounded difference condition. Let \(g:\mathcal{X}^n \rightarrow \mathbb{R}\) and constants \(C_i\) be given. Then \(g\) is said to satisfy the bounded difference condition if \(\sup\limits_{x_1,...,x_n,x_i'}|g(x_1,...,x_i,...,x_n)-g(x_1,...,x_i',...,x_n)|\leq C_i \ \forall i\).

Intuitively, \(g\) satisfies the bounded difference condition if changing one coordinate of \(g\) at a time cannot change the value of \(g\) too much.
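
For a concrete illustration (the functions and parameters below are our own choices, not from the lecture): the empirical mean of \(n\) variables in \([0,1]\) satisfies the condition with \(C_i=1/n\), while the maximum only satisfies it with \(C_i=1\). The sketch below estimates these constants from below by random perturbation:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10

def observed_diff(g, trials=10_000):
    """Largest observed change in g when one coordinate is resampled;
    a lower estimate of the true bounded-difference constant."""
    worst = 0.0
    for _ in range(trials):
        x = rng.uniform(size=n)
        y = x.copy()
        y[rng.integers(n)] = rng.uniform()  # change a single coordinate
        worst = max(worst, abs(g(x) - g(y)))
    return worst

print(f"mean: observed {observed_diff(np.mean):.3f}, theory C_i = 1/n = {1/n}")
print(f"max : observed {observed_diff(np.max):.3f}, theory C_i = 1")
```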

Theorem: Bounded difference inequality (a.k.a. McDiarmid's inequality).
Suppose \(f\) satisfies the bounded difference condition with constants \(C_1,...,C_n\). Then for any independent r.v.s \(X_1,...,X_n\), \(P[|f(X_1,..,X_n)-E(f(X_1,..,X_n))|\geq t]\leq 2\exp(-2t^2/\sum_{i=1}^nC_i^2)\).

Moreover, \(f(X_1,..,X_n)\) is sub-Gaussian with variance proxy \(\sum_{i=1}^n C_i^2/4\).
Note: Hoeffding's inequality is a special case of McDiarmid's inequality with \(f(X_1,..,X_n)=\sum_{i=1}^n X_i\) and \(C_i=b_i-a_i\).

Proof: The idea is to break the quantity \(f(X_1,..,X_n)-E(f(X_1,..,X_n))\) into manageable components by conditioning on portions of the sample. Define
\(Z_0 \triangleq E(f(X_1,..,X_n))\) (a constant)
\(Z_1 \triangleq E(f(X_1,..,X_n)|X_1)\) (a function of \(X_1\))
\(Z_2 \triangleq E(f(X_1,..,X_n)|X_1,X_2)\) (a function of \(X_1,X_2\))
...
\(Z_n\triangleq E(f(X_1,..,X_n)|X_1,...,X_n)=f(X_1,..,X_n)\) (a function of \(X_1,..,X_n\))
Note that for each \(i\geq 1\), \(E(Z_i)=E[E(f(X_1,..,X_n)|X_1,...,X_i)]=E(f(X_1,..,X_n))=Z_0\) by the tower property.

Let \(D_i \triangleq Z_i-Z_{i-1}\); then \(E(D_i)=0\). Next, \(Z_n-Z_0\) can be written as a telescoping sum:

\(Z_n-Z_0=(Z_n-Z_{n-1})+(Z_{n-1}-Z_{n-2})+...+(Z_1-Z_0)=\sum_{i=1}^nD_i\)
Next, we show that, conditional on \(X_1,...,X_{i-1}\), \(D_i\) is a bounded r.v. To see this, let

\(A_i\triangleq\inf\limits_{x}E(f(X_1,..,X_n)|X_1,...,X_{i-1},X_i=x)-E(f(X_1,..,X_n)|X_1,...,X_{i-1})\)
\(B_i\triangleq \sup\limits_{x}E(f(X_1,..,X_n)|X_1,...,X_{i-1},X_i=x)-E(f(X_1,..,X_n)|X_1,...,X_{i-1})\)
Since \(D_i=E(f(X_1,..,X_n)|X_1,...,X_{i-1},X_i)-E(f(X_1,..,X_n)|X_1,...,X_{i-1})\), it is clear that \(A_i\leq D_i\leq B_i\), and \(E(D_i|X_1,...,X_{i-1})=0\).

Furthermore, \(B_i-A_i\leq \sup\limits_{x_1,...,x_{i-1}}\sup\limits_{x,x'}\int \big(f(x_1,...,x,...,x_n)-f(x_1,...,x',...,x_n)\big)\,dP(x_{i+1},...,x_n)\leq C_i\), where \(x,x'\) are in the \(i^{th}\) position.
Using this, we have \[\begin{aligned} E[\exp(\lambda(Z_n-Z_0))] &=E\Big[\exp\Big(\lambda \sum_{i=1}^n D_i\Big)\Big]\\ &=E\Big[E\Big(\exp\Big(\lambda\sum_{i=1}^n D_i\Big)\Big|X_1,...,X_{n-1}\Big)\Big]\\ &=E\Big[\exp\Big(\lambda\sum_{i=1}^{n-1} D_i\Big)E(e^{\lambda D_n}|X_1,...,X_{n-1})\Big]\\ &\leq E\Big[\exp\Big(\lambda\sum_{i=1}^{n-1} D_i\Big)\exp(\lambda^2 C_n^2/8)\Big] \quad \text{(Hoeffding's lemma: } A_n\leq D_n\leq B_n,\ B_n-A_n\leq C_n\text{)}\\ &=\exp(\lambda^2 C_n^2/8)\,E\Big[\exp\Big(\lambda\sum_{i=1}^{n-1} D_i\Big)\Big]\\ &\leq \exp(\lambda^2 C_n^2/8)\exp(\lambda^2 C_{n-1}^2/8)\,E\Big[\exp\Big(\lambda\sum_{i=1}^{n-2} D_i\Big)\Big]\\ &\leq \cdots \leq \exp\Big(\lambda^2 \sum_{i=1}^n C_i^2/8\Big). \end{aligned}\] Hence, \(Z_n-Z_0\) is sub-Gaussian with variance proxy \(\sum_{i=1}^n C_i^2/4\).
Thus, \(P(|Z_n-Z_0|\geq t)\leq 2\exp(-2t^2/\sum_{i=1}^n C_i^2)\), as required.
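
A classic application, sketched here with made-up parameters: throw \(n\) balls into \(n\) bins and let \(f\) be the number of empty bins. Moving one ball changes \(f\) by at most 1, so \(C_i=1\) and McDiarmid's inequality gives \(P(|f-Ef|\geq t)\leq 2e^{-2t^2/n}\).

```python
import numpy as np

rng = np.random.default_rng(4)
n, trials = 100, 50_000

# f = number of empty bins after throwing n balls into n bins uniformly.
# Moving one ball changes f by at most 1, so C_i = 1 for every i.
throws = rng.integers(0, n, size=(trials, n))
empty = np.array([np.sum(np.bincount(row, minlength=n) == 0) for row in throws])

for t in [5, 10, 15]:
    empirical = np.mean(np.abs(empty - empty.mean()) >= t)
    bound = 2 * np.exp(-2 * t**2 / n)   # sum_i C_i^2 = n
    print(f"t={t}: empirical={empirical:.5f}, bound={bound:.5f}")
```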


Unfortunately, the bounded difference condition is often only satisfied by bounded r.v.s or bounded functions \(f\). To get similar inequalities for unbounded r.v.s, we need special conditions, for example, Gaussianity.
Example: If \(f\) is \(L\)-Lipschitz with respect to the \(\ell^2\)-norm, i.e., \(\forall x,x': |f(x)-f(x')|\leq L\|x-x'\|_2\), and \(X_1,\ldots,X_n \stackrel{iid}{\sim} N(0,1)\), then \(P(|f(X)-Ef(X)|\geq t)\leq 2\exp(-t^2/(2L^2))\).
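
For instance (a sketch of our own, not from the lecture), \(f(x)=\|x\|_2\) is 1-Lipschitz by the reverse triangle inequality, so the norm of a standard Gaussian vector concentrates with a dimension-free tail:

```python
import numpy as np

rng = np.random.default_rng(5)
n, trials = 200, 50_000

# f(x) = ||x||_2 is 1-Lipschitz w.r.t. the l2 norm, so L = 1 and the
# tail bound 2*exp(-t^2/2) holds regardless of the dimension n.
norms = np.linalg.norm(rng.normal(size=(trials, n)), axis=1)
mean_norm = norms.mean()  # empirical proxy for E f(X), close to sqrt(n)

for t in [1.0, 2.0, 3.0]:
    empirical = np.mean(np.abs(norms - mean_norm) >= t)
    bound = 2 * np.exp(-t**2 / 2)
    print(f"t={t}: empirical={empirical:.5f}, bound={bound:.5f}")
```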

5.3 Bernstein's inequality

Hoeffding's inequality assumes bounded r.v.s; McDiarmid's assumes \(f\) satisfies bounded differences. However, both completely ignore the variances of the r.v.s. When the r.v.s in question have some known (small) variance, we have improved bounds.

Theorem: (Berstein’s Inequality)
Let \(X_1,...,X_n\) be independent centered r.v. with \(|X_i|\leq C\) for every i, write \(\sigma^2 \triangleq \sum_{i=1}^nVar(X_i)/n\). Then,
\(P(\frac{1}{n}\sum_{i=1}^nX_i>t)\leq exp(-\frac{nt^2}{2\sigma^2+2tC/3})\)
Note that Hoeffding's inequality gives \(P(\frac{1}{n}\sum_{i=1}^nX_i>t)\leq \exp(-\frac{2n^2t^2}{\sum_{i=1}^n(b_i-a_i)^2})\).
If \(\sigma^2\) is small (say \(\sigma^2 \ll tC\)), the exponent on the right-hand side of Bernstein's inequality is approximately \(-\frac{nt^2}{2tC/3}=-\frac{3nt}{2C}\), so the bound behaves like \(e^{-\Theta(nt)}\), whereas Hoeffding's bound behaves like \(e^{-\Theta(nt^2)}\). For small \(t\), say \(t=1/10\) with \(C=1\), this is roughly \(e^{-n/10}\) versus \(e^{-n/100}\).
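
The sketch below makes this comparison concrete, assuming centered Bernoulli(\(p\)) variables with small \(p\) (so the variance \(p(1-p)\) is much smaller than the range):

```python
import math

# Centered Bernoulli(p): X_i = B_i - p, so |X_i| <= C = 1 and Var(X_i) = p(1-p).
n, p, C = 1_000, 0.01, 1.0
sigma2 = p * (1 - p)

for t in [0.01, 0.02, 0.05]:
    bernstein = math.exp(-n * t**2 / (2 * sigma2 + 2 * t * C / 3))
    hoeffding = math.exp(-2 * n * t**2)  # (b_i - a_i)^2 = 1 for every i
    print(f"t={t}: Bernstein = {bernstein:.2e}, Hoeffding = {hoeffding:.2e}")
```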

Proof:
The idea is to use the Chernoff bound as usual, but to use the assumption on the variance to provide a slightly better bound on the MGF. For \(s>0\),
\[\begin{aligned} E[\exp(sX_i)]&=1+E(sX_i)+E\Big(\sum_{k=2}^\infty\frac{(sX_i)^k}{k!}\Big)\\ &\leq 1+0+E\Big(\sum_{k=2}^\infty X_i^2\cdot\frac{s^k|X_i|^{k-2}}{k!}\Big)\leq 1+E\Big(\sum_{k=2}^\infty X_i^2\cdot\frac{s^kC^{k-2}}{k!}\Big)\\ &=1+\mathrm{Var}(X_i)\sum_{k=2}^\infty \frac{s^kC^{k-2}}{k!}=1+\mathrm{Var}(X_i)g(s)\leq \exp(\mathrm{Var}(X_i)g(s)), \end{aligned}\] where \(g(s)=\frac{e^{sC}-1-sC}{C^2}\). Now, use the Chernoff bound:

\(P(\frac{1}{n}\sum_{i=1}^nX_i>t) \leq e^{-snt}E(e^{s\sum_{i=1}^nX_i})\leq \exp\Big(\sum_{i=1}^n\mathrm{Var}(X_i)\frac{e^{sC}-1-sC}{C^2}-snt\Big) \ \forall s>0.\)
Choose \(s=\frac{1}{C}\log(1+\frac{tC}{\sigma^2})\), so that \(sC=\log(1+\frac{tC}{\sigma^2})\) and \(e^{sC}=1+\frac{tC}{\sigma^2}\).
The exponent on the right-hand side becomes \[\begin{aligned} n\sigma^2\frac{1+tC/\sigma^2-1-\log(1+tC/\sigma^2)}{C^2}-\frac{nt}{C}\log(1+tC/\sigma^2) &=-\frac{n\sigma^2}{C^2}\Big[-\frac{tC}{\sigma^2}+\log(1+tC/\sigma^2)+\frac{tC}{\sigma^2}\log(1+tC/\sigma^2)\Big]\\ &=-\frac{n\sigma^2}{C^2}h(tC/\sigma^2), \end{aligned}\] where \(h(u)=(1+u)\log(1+u)-u\) and \(u=\frac{tC}{\sigma^2}\).
Since \(h(u)\geq \frac{u^2}{2+2u/3}\), the exponent is at most \(-\frac{n\sigma^2}{C^2}\cdot\frac{t^2C^2/\sigma^4}{2+2tC/(3\sigma^2)}=-\frac{nt^2}{2\sigma^2+2tC/3}\).
Hence, \(P(\frac{1}{n}\sum_{i=1}^nX_i>t) \leq \exp(-\frac{nt^2}{2\sigma^2+2tC/3})\), as required.
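
As a final sanity check (a sketch; the parameters are arbitrary), Bernstein's bound can be verified empirically on centered Bernoulli variables:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, C, trials = 1_000, 0.01, 1.0, 200_000
sigma2 = p * (1 - p)

# Sample means of n centered Bernoulli(p) variables, via a Binomial shortcut.
means = rng.binomial(n, p, size=trials) / n - p

for t in [0.005, 0.01, 0.02]:
    empirical = np.mean(means > t)
    bound = np.exp(-n * t**2 / (2 * sigma2 + 2 * t * C / 3))
    print(f"t={t}: empirical={empirical:.5f}, Bernstein bound={bound:.5f}")
```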