Chapter 7 Rademacher complexity
7.1 General framework
So far we have relied on concentration inequalities (e.g. Hoeffding’s inequality) along with the union bound to derive the generalization bound on \(L(\hat h)-L(\bar h)\) via uniform convergence of \(\sup\limits_{h\in\mathcal{H}}|\hat L_n(h)-L(h)|\) for a finite hypothesis class \(\mathcal{H}\). This approach would not work when \(\mathcal{H}\) is infinite. Instead, we seek some bounds that potentially depend on \(n\) and the complexity of \(\mathcal{H}\). One framework to measure the complexity of \(\mathcal{H}\) is known as the Rademacher complexity. It has a few nice properties and has connections with other complexity measures such as VC dimension and covering number.
Mainly, we have a 3-line roadmap:
Define \(G_n\overset{\triangle}{=}\sup\limits_{h\in\mathcal{H}}L(h)-\hat L_n(h)\).
Show that \(G_n\) concentrates around \(\mathbb{E}[G_n]\) using McDiarmid’s inequality (Bounded difference).
Bound \(\mathbb{E}[G_n]\) by a quantity known as the Rademacher complexity using a technique called “symmetrization.”
Step 1: Define \(G_n\overset{\triangle}{=}\sup\limits_{h\in\mathcal{H}}L(h)-\hat L_n(h)\) as the largest difference between the expected and empirical risk over all \(h\in\mathcal{H}\). Note that \(G_n\) is a random variable depending on the dataset \(D\). Since \[\begin{aligned} L(\hat h)- L(\bar h)&=\hat L_n(\hat h)-\hat L_n(\bar h)+L(\hat h)-\hat L_n(\hat h)+[-(L(\bar h)-\hat L_n(\bar h))]\\ &\leq 0 + L(\hat h)-\hat L_n(\hat h)+[-(L(\bar h)-\hat L_n(\bar h))], \end{aligned}\] thus \[\{L(\hat h)- L(\bar h)\geq \varepsilon\}\subset \{\underbrace{L(\hat h)-\hat L_n(\hat h)}_{G_n}\geq\varepsilon/2\}\cup\{\underbrace{-(L(\bar h)-\hat L_n(\bar h))}_{G^\prime_n}\geq\varepsilon/2\},\] where \(G_n^\prime\) is the counterpart of \(G_n\) defined for negative loss \(\ell^\prime\overset{\triangle}{=}-\ell\). Therefore, \[\begin{equation} \mathbb{P}[L(\hat h)- L(\bar h)\geq\varepsilon]\leq \mathbb{P}\left[G_n\geq\varepsilon/2\right]+\mathbb{P}[G_n^\prime\geq\varepsilon/2].\tag{7.1} \end{equation}\]
Step 2: Bound \(G_n\) by \(\mathbb{E}[G_n]+\varepsilon\) (\(G_n\) concentrates around \(\mathbb{E}[G_n]\)).
Write \(G_n\) as \(g(Z_1, \ldots, Z_n)\), where \(Z_i=(X_i, Y_i)\) and \(D=\{Z_1, \cdots, Z_n\}\). When we use 0-1 loss \(\ell(Z_i, h)=\mathbb{1}\{h(X_i)\neq Y_i\}\), \(g\) satisfies the bounded difference condition \[|g(Z_1, \ldots, Z_i, \ldots, Z_n)-g(Z_1, \ldots, Z_i^\prime, \ldots, Z_n)|\leq\frac{1}{n}.\] The difference between \(L(h)-\hat L_n(h)\) and the perturbed \(L(h)-\hat L_n(h, D^\prime)\) is at most \(\frac{1}{n}\). Since our loss function is bounded, “sup” cannot increase the difference.
From the McDiarmid’s inequality we have \[\begin{equation} \begin{split} \mathbb{P}\left[G_n-\mathbb{E}[G_n]\geq\varepsilon\right]&\leq\exp\left(-\frac{2\varepsilon^2}{\sum_{i=1}^n(\frac{1}{n})^2}\right)=\exp(-2n\varepsilon^2)\\ \Leftrightarrow~~ \mathbb{P}[G_n\geq\mathbb{E}[G_n]+\varepsilon]&\leq\exp(-2n\varepsilon^2). \end{split} \tag{7.2} \end{equation}\]
Setting \(\exp(-2n\varepsilon^2)=\frac{\delta}{2}\), we obtain \(\varepsilon=\sqrt{\frac{1}{2n}\log\frac{2}{\delta}}\). Equivalently, we conclude that with probability at least \(1-\delta\), \[G_n\leq \mathbb{E}[G_n]+\sqrt{\frac{1}{2n}\log\frac{2}{\delta}},\] where there is no cardinality of hypothesis space in the second term of above R.H.S. quantity. Recall there is an \(M\) measuring the complexity in the bound \(\sqrt{\frac{2}{n}\log\frac{M}{\delta}}\) for the finite \(\mathcal{H}\) case.
Step 3: Symmetrization.
We need to bound \(\mathbb{E}_{(X_i, Y_i)\sim\mathcal{P}}[G_n]\) but it is quite difficult since it depends on \(L(h)=\mathbb{E}[\hat L_n(h)]\), where \(\mathcal{P}\) is unknown. The goal of symmetrization is to remove this strong dependency on \(\mathcal{P}\) and replace it with a quantity that only indirectly depends on \(\mathcal{P}\) through \(D=\{Z_i\}_{i=1}^n=\{(X_i, Y_i)\}_{i=1}^n\). The key idea is to introduce another set of i.i.d. dataset \(D^\prime=\{Z_i^\prime\}_{i=1}^n=\{(X_i^\prime, Y_i^\prime)\}_{i=1}^n\) from \(\mathcal{P}\). This sample only exists in the proof and for the proof so that it is sometimes referred to as the ghost sample.
Define \(\hat L_n^\prime(h)=\frac{1}{n}\sum\limits_{i=1}^n\ell(X_i^\prime, Y_i^\prime, h)=\frac{1}{n}\sum\limits_{i=1}^n\mathbb{1}\{h(X_i^\prime)\neq Y_i^\prime\}\). Then rewrite \(L(h)\) in \(\mathbb{E}[G_n]\) using the ghost sample \[\begin{aligned} \mathbb{E}[G_n]=\mathbb{E}\bigg[\sup\limits_{h\in\mathcal{H}}\underbrace{L(h)}_{\mathbb{E}[\hat L_n^\prime(h)]}-\hat L_n(h)\bigg]&=\mathbb{E}\left[\sup\limits_{h\in\mathcal{H}}\mathbb{E}\left[\hat L_n^\prime(h)\mid D\right]-\hat L_n(h)\right] ~~~\mbox{due to}~D\perp D^\prime\\ &=\mathbb{E}\left[\sup\limits_{h\in\mathcal{H}}\mathbb{E}\left[\hat L_n^\prime(h)-\hat L_n(h)\mid D\right]\right]\\ &\leq \mathbb{E}\left[\mathbb{E}\left[ \sup\limits_{h\in\mathcal{H}}\hat L_n^\prime(h)-\hat L_n(h)\mid D\right]\right]\\ &= \mathbb{E}\left[\sup\limits_{h\in\mathcal{H}}\hat L_n^\prime(h)-\hat L_n(h)\right]. \end{aligned}\]
Now introduce i.i.d. Rademacher random variables \(\sigma_i,i=1,\ldots,n\) such that \(\sigma_i\) takes values on \(\{1, -1\}\) uniformly. Since \(\mathbb{1}\{h(X_i^\prime)\neq Y_i^\prime\}-\mathbb{1}\{h(X_i)\neq Y_i\}\) is symmetric around \(0\), multiplying \(\sigma_i\) does not change its distribution. Based on this property and together with above inequality, we have \[\begin{aligned} \mathbb{E}[G_n]\leq \mathbb{E}\left[\sup\limits_{h\in\mathcal{H}}\hat L_n^\prime(h)-\hat L_n(h)\right] &=\mathbb{E}\left[\sup\limits_{h\in\mathcal{H}}\frac{1}{n}\sum\limits_{i=1}^n\sigma_i\left(\mathbb{1}\{h(X_i^\prime)\neq Y_i^\prime\}-\mathbb{1}\{h(X_i)\neq Y_i\}\right)\right]\\ &\leq \mathbb{E}\left[\sup\limits_{h\in\mathcal{H}}\frac{1}{n}\sum\limits_{i=1}^n\sigma_i\mathbb{1}\{h(X_i^\prime)\neq Y_i^\prime\}\right] +\mathbb{E}\left[\sup\limits_{h\in\mathcal{H}}\frac{1}{n}\sum\limits_{i=1}^n-\sigma_i\mathbb{1}\{h(X_i)\neq Y_i\}\right]\\ &=2\cdot\underbrace{\mathbb{E}\left[\sup\limits_{h\in\mathcal{H}}\frac{1}{n}\sum\limits_{i=1}^n\sigma_i\mathbb{1}\{h(X_i)\neq Y_i\}\right]}_{\mbox{Rademacher complexity}}. \end{aligned}\]
7.2 Rademacher complexity
Let \(\mathcal{F}\) be a class of real-valued functions \(f:\mathcal{Z}\mapsto\mathbb{R}\), e.g. \(\mathcal{Z}=\mathcal{X}\times \mathcal{Y}\) and \(z=(x,y)\). Define the Rademacher complexity of \(\mathcal{F}\) as \[\mbox{Rad}_n(\mathcal{F})\overset{\triangle}{=}\underset{\sigma_i, Z_i}{\mathbb{E}}\left[\sup\limits_{f\in\mathcal{F}}\frac{1}{n}\sum\limits_{i=1}^n\sigma_i f(Z_i)\right],\] where \(Z_1, \ldots, Z_n\) are i.i.d. from \(\mathcal{P}\) and \(\sigma_1, \ldots, \sigma_n\) are i.i.d. Rademacher random variables.
Interpretation of Rademacher complexity:
- Consider binary classification inputs \(z_1, \ldots, z_n\). \(f(z_1),\ldots, f(z_n)\) are predicted labels and \(\sigma_1, \ldots, \sigma_n\) are outputs (observed labels). Clearly, this is meaningless as \(\sigma_i\)’s are noise.
- \(\sup\limits_{f\in\mathcal{F}}\frac{1}{n}\sum\limits_{i=1}^n\sigma_if(z_i)\) measures how well the best \(f\in\mathcal{F}\) aligns with \(\sigma_i\)’s.
- The more complexity (larger) \(\mathcal{F}\) is, the more likely some \(f\) in \(\mathcal{F}\) can replicate the noise sign pattern.
- In our application, we do not want \(\mbox{Rad}_n(\mathcal{F})\) to be large since we do not want to learn noise.
We also define the empirical Rademacher complexity of \(\mathcal{F}\) to be \[\widehat{\mbox{Rad}}_n(\mathcal{F})\overset{\triangle}{=}\mathbb{E}\left[\sup\limits_{f\in\mathcal{F}}\frac{1}{n}\sum\limits_{i=1}^n\sigma_i f(Z_i)\mid Z_1, \ldots, Z_n\right],\] which is a random variable (a function of the dataset \(\{Z_1,\ldots, Z_n\}\)). Note that \(\mbox{Rad}_n(\mathcal{F})=\mathbb{E}_{\{Z_i\}_{i=1}^n}\left[\widehat{\mbox{Rad}}_n(\mathcal{F})\right]\).
Now we can use the Rademacher complexity defined on a special class of functions to bound the excess risk.
Theorem 7.1 (Generalization Bounded based on Rademacher) Let \(\mathcal{A}=\{z\mapsto \mathbb{1}\{h(x)\neq y\}: h\in\mathcal{H}\}\) be the 0-1 loss class consisting of composition of the loss function with \(h\in\mathcal{H}\). Thus with probability at least \(1-\delta\), we have \[L(\hat h)-L(\bar h)\leq 4\mbox{Rad}_n(\mathcal{A})+\sqrt{\frac{2}{n}\log\frac{2}{\delta}}.\]
Proof. Note we know that \(\mathbb{E}[G_n]\leq 2\cdot\mathbb{E}\left[\sup\limits_{h\in\mathcal{H}}\frac{1}{n}\sum\limits_{i=1}^n\sigma_i\mathbb{1}\{h(X_i)\neq Y_i\}\right]=2 \mbox{Rad}_n(\mathcal{A})\). From the Inequality (7.2), we have \[\begin{aligned} \mathbb{P}[G_n\geq\mathbb{E}[G_n]+t]&\leq \exp(-2n t^2)\\ \Rightarrow~~~\mathbb{P}[G_n\geq\varepsilon/2]&\leq \exp\left[-2n\left(\frac{\varepsilon}{2}-\mathbb{E}[G_n]\right)^2\right] ~~~~\mbox{let}~ \mathbb{E}[G_n]+t=\varepsilon/2\\ &\leq\exp\left[-2n\left(\frac{\varepsilon}{2}-2\mbox{Rad}_n(\mathcal{A})\right)^2\right]~~~\mbox{if}~\varepsilon>4\mbox{Rad}_n(\mathcal{A}). \end{aligned} \]
Likewise, \[\begin{aligned} \mathbb{P}[G_n^\prime\geq\varepsilon/2]&\leq \exp\left[-2n\left(\frac{\varepsilon}{2}-2\mbox{Rad}_n(-\mathcal{A})\right)^2\right]~~\mbox{where}~ -\mathcal{A}\overset{\triangle}{=}\{-f: f\in\mathcal{A}\}\\ &=\exp\left[-2n\left(\frac{\varepsilon}{2}-2\mbox{Rad}_n(\mathcal{A})\right)^2\right]. \end{aligned}\] The above equality holds since negation does not change the \(\mbox{Rad}_n\) (\(\mbox{Rad}_n(\mathcal{A})=\mbox{Rad}_n(-\mathcal{A})\)).
Now let \(\exp\left[-2n(\frac{\varepsilon}{2}-2\mbox{Rad}_n(\mathcal{A}))^2\right]=\frac{\delta}{2}\). We obtain \(\varepsilon=4\mbox{Rad}_n(\mathcal{A})+\sqrt{\frac{2}{n}\log\frac{2}{\delta}}\). Together with Inequality (7.1), with probability at least \(1-\delta\), we have \[\begin{aligned} L(\hat h)-L(\bar h)\leq 4\mbox{Rad}_n(\mathcal{A})+\sqrt{\frac{2}{n}\log\frac{2}{\delta}}. \end{aligned}\]
Remark: The essence of generalization error is captured in the Rademacher complexity of the loss class \(\mbox{Rad}_n(\mathcal{A})\).
We will study the Rademacher complexity \(\mbox{Rad}_n(\mathcal{F})\) of various function class \(\mathcal{F}\). First let’s discuss some basic composite properties that Rademacher complexity enjoys.
- Boundness: \(\mbox{Rad}_n(\mathcal{F})\leq\max\limits_{f\in\mathcal{F}}\max\limits_{z}f(z)\). This is trivial, not impressive and not very useful. Ideally, we would like \(\mbox{Rad}_n(\mathcal{F})\rightarrow0\) as \(n\rightarrow\infty\).
- Singleton: \(\mbox{Rad}_n(\{f\})=0\).
- Monotonicity: \(\mbox{Rad}_n(\mathcal{F}_1)\leq \mbox{Rad}_n(\mathcal{F}_2)\) if \(\mathcal{F}_1\subset \mathcal{F}_2\).
- Linear combination: \(\mbox{Rad}_n(\mathcal{F}_1+\mathcal{F}_2)\leq \mbox{Rad}_n(\mathcal{F}_1) + \mbox{Rad}_n(\mathcal{F}_2)\), where \(\mathcal{F}_1+\mathcal{F}_2\overset{\triangle}{=}\{f_1+f_2: f_1\in\mathcal{F}_1, f_2\in\mathcal{F}_2\}\).
- Scaling: \(\mbox{Rad}_n(c\mathcal{F})=|c|\cdot\mbox{Rad}_n(\mathcal{F})\), where \(c\mathcal{F}\overset{\triangle}{=}\{cf: f\in \mathcal{F}\}\).
- Lipschitz: \(\mbox{Rad}_n(\phi\circ\mathcal{F})\leq c_\phi\mbox{Rad}_n(\mathcal{F})\), where \(\phi\circ\mathcal{F}\overset{\triangle}{=}\{z\mapsto \phi(f(z)): f\in\mathcal{F}\}\) and \(c_\phi\) is the Lipschitz constant of \(\phi\): \(\forall z, z^\prime, |\phi(z)-\phi(z^\prime)|\leq c_\phi|z-z^\prime|\).
- If \(f\) is differentiable, then \(c_\phi\) is just a bound on the \(L_2\) norm of gradients of \(\phi\).
- This property is very useful. Because we can first compute \(\mbox{Rad}_n(\mathcal{H})\) and then compute with \(\ell\) to get the Rademacher complexity of the loss class.
- Convex hull: \(\mbox{Rad}_n(\mbox{convex hull}(\mathcal{F}))=\mbox{Rad}_n(\mathcal{F})\) for finite \(\mathcal{F}\).
7.3 Finite hypothesis class
Let’s instantiate Rademacher complexity for the case where the hypothesis class is finite. This may seem like an overkill, since we already analyzed finite \(\mathcal{H}\) case. Recall in Chapter 4.3, the generalization bound was \(L(\hat h)-L(\bar h)\leq \sqrt{\frac{2}{n}\log\frac{2M}{\delta}}\) with probability at least \(1-\delta\). But here we can do this as a sanity check.
Lemma 7.1 (Massart's finite lemma) Condition on \(\{Z_1,\ldots, Z_n\}\) (treat them as constants). Let \(\mathcal{F}\) be a finite class of functions \(\{f_1, \ldots, f_M\}\). Let \(c^2\) satisfy \(\sup\limits_{f\in\mathcal{F}}\frac{1}{n} \sum\limits_{i=1}^nf^2(Z_i)\leq c^2\). Then \[\widehat{\mbox{Rad}}_n(\mathcal{F})\leq \sqrt{\frac{2c^2}{n}\log M}.\]
Proof. Let \(W_f\overset{\triangle}{=}\frac{1}{n}\sum\limits_{i=1}^n\sigma_if(Z_i)\). We are interested in bounding \(\widehat{\mbox{Rad}}_n(\mathcal{F})=\mathbb{E}\big[\sup\limits_{f\in\mathcal{F}} W_f\mid Z_1,\ldots, Z_n\big]\).
\[\begin{aligned} \exp\left(t\cdot\widehat{\mbox{Rad}}_n(\mathcal{F})\right)&=\exp\left(t\mathbb{E}\bigg[\sup\limits_{f\in\mathcal{F}} W_f\mid Z_1,\ldots, Z_n\bigg]\right)~~~~\mbox{take}~t>0\\ &\leq \mathbb{E}\bigg[\exp\left(t\sup\limits_{f\in\mathcal{F}} W_f\right)\mid Z_1,\ldots, Z_n\bigg]~~~\mbox{Jensen's inequality}\\ &= \mathbb{E}\bigg[\sup\limits_{f\in\mathcal{F}}\exp\left(t W_f\right)\mid Z_1,\ldots, Z_n\bigg]\\ &\leq \sum\limits_{f\in\mathcal{F}}\mathbb{E}\left[\exp\left(t W_f\right)\mid Z_1,\ldots, Z_n\right]. \end{aligned}\] Note that the randomness in \(W_f\) is from \(\sigma_i\) not from \(f(Z_i)\) (which is conditioned on). By Hoeffding’s lemma, \(\sigma_i\in\{-1, 1\}\) is bounded and hence is sub-Gaussian with variance proxy 1. Thereby, \(W_f=\frac{1}{n}\sum\limits_{i=1}^n\sigma_if(Z_i)\) is a linear combination of \(\sigma_i\), which is sub-Gaussian with variance proxy \(\frac{1}{n}\sum\limits_{i=1}^nf^2(Z_i)\leq \frac{c^2}{n}\). By the definition of sub-Gaussianity, \(\mathbb{E}\left[\exp\left(t W_f\right)\mid Z_1,\ldots, Z_n\right]\leq \exp\left(\frac{t^2\sigma^2}{2n}\right)\) and hence \[\begin{aligned} \exp\left(t\cdot\widehat{\mbox{Rad}}_n(\mathcal{F})\right)\leq \sum\limits_{f\in\mathcal{F}}\mathbb{E}\left[\exp\left(t W_f\right)\mid Z_1,\ldots, Z_n\right]\leq \sum\limits_{f\in\mathcal{F}}\exp\left(\frac{t^2\sigma^2}{2n}\right)=M\cdot \exp\left(\frac{t^2\sigma^2}{2n}\right), \end{aligned} \] which yields \(\widehat{\mbox{Rad}}_n(\mathcal{F})\leq \frac{\log M}{t}+\frac{tc^2}{2n}\) for all \(t>0\). To obtain the tightest bound, we take \(t=\sqrt{\frac{2n}{c^2}\log M}\). Therefore, \[\widehat{\mbox{Rad}}_n(\mathcal{F})\leq\sqrt{\frac{2c^2}{n}\log M}.\]
Remark: We can immediately apply Massart’s lemma to any finite loss class. For example, 0-1 loss \(f(Z_i)=\ell(X_i, Y_i, h)=\mathbb{1}\{h(X_i)\neq Y_i\}\leq1\) implies \(\frac{1}{n}\sum\limits_{i=1}^nf^2(Z_i)\leq1\). Then we can take \(c^2=1\) and have \(\widehat{\mbox{Rad}}_n(\mathcal{A})\leq\sqrt{\frac{2}{n}\log M}\) by Massart’s lemma 7.1, which gives us \(\mbox{Rad}_n(\mathcal{A})=\mathbb{E}[\widehat{\mbox{Rad}}_n(\mathcal{A})]\leq \sqrt{\frac{2}{n}\log M}\). Combining with Theorem 7.1, with probability at least \(1-\delta\), we have \[\begin{aligned} L(\hat h)-L(\bar h)&\leq 4\mbox{Rad}_n(\mathcal{A})+\sqrt{\frac{2}{n}\log\frac{2}{\delta}}\\ &\leq 4\sqrt{\frac{2}{n}\log M}+\sqrt{\frac{2}{n}\log\frac{2}{\delta}}\\ &\leq\sqrt{\frac{1}{n}\left(64\log M+4\log\frac{4}{\delta}\right)}~~~\mbox{due to}~\sqrt{a}+\sqrt{b}\leq\sqrt{2}\cdot\sqrt{a+b}. \end{aligned}\] Comparing with previous rate \(\sqrt{\frac{2}{n}\log\frac{2M}{\delta}}=\sqrt{\frac{1}{n}\left(2\log M+2\log\frac{2}{\delta}\right)}\), we can see both have the rate \(O\left(\frac{1}{\sqrt{n}}\right)\). However, the new bound is worse due to a larger constant.