Chapter 9 Norm-constrained Hypothesis Class
(This chapter was scribed by Baozhen Wang.)
We started in chapter 7 by establishing Rademacher complexity \(\text{Rad}_n (\mathcal{F})\) as a measure of complexity of a function class that yielded generalization bounds when applied to a loss class \(\mathcal{A}\). We then specialized to boolean function and 0-1 loss, leading to combinatoric notion such as shattering coefficient and VC Dimension to bound the Rademacher complexity.
- However, VC Dimension has a few imitation -For one, the bound will depend some how on the dimensionality. -For linear models, our bound will look like \(\tilde{o}(\sqrt{\frac{d}{n}})\) This is not good for high-dimensional models.
- In machine learning/ high-dimensional statistics, it is common to have a large number of varables/features (large \(d\)), but we could regularize or constrain the norm of \(\beta\), such as \(\|\beta\|_2\) or \(\|\beta\|_1\). In this case, the VC Dimension \(d\) doesn’t really capture the true complexity of the hypothesis class.
- Rademacher Complexity will actually be easier for the norm-constrained setting.
We will discuss three examples.
- \(\text{Rad}_n\) of linear function class with \(l_2\) norm constrain.
- \(\text{Rad}_n\) of linear function class with \(l_1\) norm constrain.
- \(\text{Rad}_n\) of a surrogate class with an underlying linear function class with \(l_2\) norm constrain.
Example 9.1 \(\text{Rad}_n\) of linear function with \(\beta\) within \(l_2\) ball
Theorem 9.1 \(\text{Rad}_n\) of \(l_2\) ball: Let \(\mathcal{F}=\left\{z\mapsto \beta\cdot z :\|\beta\|_2 \leq B_2, \beta\in \mathbb{R}^p \right\},\) bounded on \(\|\beta\|_2\). Assume \(z\in\mathbb{R}^p, E(\|z\|_2^2)\leq C_2^2\) bounded on data. Then \[\text{Rad}_n(\mathcal{F})\le \frac{C_2B_2}{\sqrt{n}}\]
Proof. \[\begin{align*} \text{Rad}_n(\mathcal{F}) &=\frac{1}{n}\cdot E_{\beta_2}\left[\sup_{\|\beta\|_2\le B_2} \sum_{i=1}^n \sigma_i (\beta\cdot z_i)\right]\\ &=\frac{1}{n}\cdot E_{\beta_2}\left[\sup_{\|\beta\|_2\le B_2} \beta\cdot (\sum_{i=1}^n \sigma_i\cdot z_i)\right]\\ \text{ (Cauchy Schwartz) }&\le \frac{1}{n} E\left[\sup_{\|\beta\|_2\le B_2} \|\beta\|_2 \| \sum_{i=1}^n \sigma_i\cdot z_i\|_2\right]\\ &\le \frac{B_2}{n} E\left[\sqrt{\| \sum_{i=1}^n \sigma_i\cdot z_i\|_2^2} \right]\\ (\sqrt{\cdot}\text{ is concave }) &\le \frac{B_2}{n} \sqrt{E\left(\|\sum_{i=1}^n \sigma_i\cdot z_i\|_2^2\right)} \\ & = \frac{B_2}{n} \sqrt{E\left(\sum_{i=1}^n \|\sigma_i\cdot z_i\|_2^2\right)} \\ (\sigma_i \text{ only changes sign }) & = \frac{B_2}{n} \sqrt{E\left(\sum_{i=1}^n \|z_i\|_2^2\right)} \\ & \le \frac{B_2}{n} \sqrt{n\cdot C_2^2}\\ &= \frac{B_2 C_2}{\sqrt{n}} \square.\end{align*}\]
Example 9.2 \(\text{Rad}_n\) of linear function with \(\beta\) within \(l_1\) ball
Theorem 9.2 \(\text{Rad}_n\) of \(l_1\) ball: Assume that every coordinate of \(z_i\) is bounded up, \[\|z_i\|_\infty \le C_\infty \text{ w.p. 1 } ,\forall i.\] Let \(\mathcal{F}=\left\{z\mapsto \beta\cdot z: \|\beta_1\|_1\le B_1, z\in\mathbb{R}^p\right\},\) then \[\text{Rad}_n(\mathcal{F})\le \frac{B_1 C_\infty}{\sqrt{n}}\cdot \sqrt{2\log(2p)}.\]
Proof. The key step is to realize that the \(l_1\) ball \[\left\{ \beta: \|\beta\|_1\le B_1\right\}\] is convex hull of the following \(2p\) many vectors \[w=\cup_{j=1}^p\left\{B_1\vec{e}_j, -B_1 \vec{e}_j \right\},\ \ \vec{e}_j= (0,0,\dots, \underbrace{1}_{\text{j-th}},\dots,0)^T.\] Since \(\text{Rad}_n\) of a class of functions is also the \(\text{Rad}_n\) of it’s convex hall, we just need to look at the finite class \[\text{Rad}_n(\mathcal{F})=E\left[\sup_{\beta\in w}\frac{1}{n}\sum_{i=1}^n \sigma_i (\beta \cdot z_i)\right].\] By Holder’s Inequality, \[\beta\cdot z_i \leq \|\beta\|_1\cdot\|z_i\|_\infty \le B_1\cdot C_\infty.\] Applying the Massart’s finite Lemma to the function class, specified by those \(\beta\) in \(w\), which has \(2p\) elements, with \(M^2=B_1^2 C_\infty^2\), we have \[\begin{align*} \text{Rad}_n (\mathcal{F}) &\le \sqrt{\frac{2 B_1^2 C_\infty^2 \log (2p)}{n}}\\ &=\frac{B_1 C_\infty}{\sqrt{n}}\cdot \sqrt{2\log(2p)}\square. \end{align*}\]
- Comparision of these two results:
- The \(l_2\) ball has the advantage that it applies to Kernel methods. More over, the dimensionality can be \(\infty\), as long as the norm is bounded.
- The \(l_1\) ball is suitable when we have a large dimensionality, and we believe only at a small subset of the variables are relevant to our learning task.
- Recall that as \(p\) goes up,
- \(p\)-norm decrease, \(\|x\|_p\ge\|x\|_q\), for \(q\ge p\).
- which means that the size of the \(l_p\) ball increase: \[\left\{\beta:\|\beta\|_p\le B \right\}\subseteq \left\{\beta:\|\beta\|_q\le B\right\}\]
- Because \(\|\beta\|_1\ge \|\beta\|_2\), a bound on \(\|\beta\|_1\) is a much stronger constrain than on the \(\|\beta\|_2\). The benefit is that we only need to measure the \(l_\infty\) norm of the data, which is much easier than \(l_2\) norm, since \[\|x\|_1\ge\|x\|_2\ge\cdots\ge\|x\|_\infty.\]
(The note below was scribed by Haoyu Zhang)
The cost however, is there a \(\log(2p)\) in the resultant bound on Rademacher complexity.
Ramification of sparsity
\(L_1\) regularization is often used when we believe in sparsity; formally the true \(\beta\) has \(s<<p\) non-zero entries.
In Math 535, you might have seen that \(L_1\) ball has sharp corners which encourages \(\overrightarrow{\beta}\) to be \(0\). But this does not say anything about the generalization error. We need a justification of \(L_1\) regularization in presence of sparsity through the lense of generalization error.
For simplicity, assume \(\|\beta\|_\infty\leq1\), \(\|x\|_\infty\leq1\). In this case, suffice to consider \(\|\beta\|_1\leq B_1=S\) (number of nonzeros). If indeed there are only \(S\) many non-zeros in \(\overrightarrow{\beta}\), then \(\|\beta\|_1\leq S\) holds. In this case, we can show Rademacher complexity and hence generalization error less than or equal to \(O(\frac{S\sqrt{\log p}}{\sqrt{n}})\). Only \(S\) matters and \(p\) can be as large as \(e^c\) and we will still be fine.
In contrast, if we have \(L_2\) ball, \(\|\beta\|_2\leq=\sqrt{S}\) and \(C_2=\sqrt{p}\). Then bound on Rademacher becomes \(O(\frac{\sqrt{S}\sqrt{p}}{\sqrt{n}})=O(\frac{S\sqrt{p/s}}{\sqrt{n}})\).
Compare the two:\ If \(S<<p\), then \(\sqrt{\log p}<<\sqrt{p/S}\) and \(L_1\) regularization is better. \ If \(S\sim p\), then \(\sqrt{\log p}>\sqrt{p/S}\) and \(L_1\) regularization is worsen by a factor of \(\sqrt{\log p}\).
Note that these are heuristic argument, since we are comparing upper bounds. In general, need to exercise some cautions.
Case III From hypothesis class to loss class for binary classification.
So far, we have Rademacher of \({\cal F}\). But we still need to turn that into Rademacher of \({\cal A}\).
Our function class \({\cal F}=\{z\rightarrow \beta z: \|\beta\|_2\leq B_2\}\). We have bounded its Rademacher complexity.
This is not our hypothesis class but close which normally looks like \({\cal H}=\{x\rightarrow \text{sign}(f(x)):f\in {\cal F}\}\)
However, we don’t really need the \(\text{Radn}({\cal H})\) but instead \(\text{Radn}({\cal A})\) where \(\{{\cal A}=\{(x,y)\rightarrow \mathbb{1}(h(x)\neq y):h\in {\cal H}\}\). If \({\cal H}\) is defined in terms of linear functions in \({\cal F}\), becomes \({\cal A}=\{(x,y)\rightarrow \mathbb{1}(y(\overrightarrow\beta\overrightarrow x)\leq 0): x\in \mathcal{R}^d, y\in\{-1,1\},\|\beta\|_2\leq B_2\}\).
To leverage the bound on \({\cal F}\), we think of each data point as \(z\triangleq y\cdot x\), so that each \(f\) in \({\cal F}\) maps \(z=yx\) to \(\beta\circ(y\cdot x)=y(\beta\cdot x)\).
Therefore, the loss class \({\cal A}\) can simply be expressed as a composition: \({\cal A}=\phi\circ{\cal F}\) where \(\phi(u)=\mathbb{1}(u\leq0)\). Then we would be able to use the composition rule of Rademacher complexity.
However, the problem is that we cannot use the composition rule directly, since \(\phi\) is not Lipschitz because it’s not continuous at zero.
Our strategy is to use a surrogate loss function to replace the \(\phi(u)=\mathbb{1}(u\leq 0)\) by the truncated hinge loss, which upper bound the \(0-1\) loss \(\phi(u)\). \[\phi_{\text{surrogate}}(u)\triangleq \min(1,\max(0,1-u))\] In terms of surrogate loss, we define the loss class \({\cal A}_{\text{surrogate}}=\phi_{\text{surrogate}}\circ{\cal F}\), and the expected loss \(L_{\text{surrogate}}(h)=\E[\phi_{\text{surrogate}}(Y\cdot h(x))]\). In comparison, \(L(h)\) under \(0-1\) loss, \[\begin{aligned} L(h)&=\E[\phi(Y\cdot h(x))]\\ &=\E[\mathbb{1}(Y\cdot h(x)\leq 0)]\\ &=\E[\mathbb{1}(Y\neq h(x))] \end{aligned}\]
Note that the Lipschitz constant of \(\phi_{\text{surrogate}}\) is \(1\). So we can use the composition rule now. \[\begin{aligned} \text{Radn}({\cal A}_{\text{surrogate}})&=\text{Radn}(\phi_{\text{surrogate}}\circ{\cal F})\\ &=\mathbb{1}\cdot\text{Radn}({\cal F}) \end{aligned}\]
Since \(\phi_{\text{surrogate}}(u)\ge\phi(u)\) for any \(u\) and \(L(h)\leq L_{\text{surrogate}}(h)\) for any \(h\) and we further have \[\begin{aligned} L(\hat h)&\leq L_{\text{surrogate}}(\hat h)\\ &\leq L_{\text{surrogate}}(\bar h)+4\text{Radn}({\cal A}_{\text{surrogate}})+\sqrt{\frac{2\log(\frac{2}{\delta})}{n}}\\ &\leq L_{\text{surrogate}}(\bar h)+4\text{Radn}({\cal F})+\sqrt{\frac{2\log(\frac{2}{\delta})}{n}} \end{aligned}\] where \(L(\hat h)\) is \(0-1\) risk of the ERM, \(L_{\text{surrogate}}(\bar h)\) is minimize \(L_{\text{surrogate}}(h)\) within \({\cal H}\), \(\text{Radn}({\cal F})\) is small when the norm is constrained and \(\sqrt{\frac{2\log(\frac{2}{\delta})}{n}}\) is going to \(0\) at rate of \(\sqrt{n}\). One should have low \(L(\hat h)\) if
we obtain low expected surrogate risk
norm is constrained so that \(\text{Radn}({\cal F})\) is small
Remarks:
These bounds either do not depend (e.g. \(L_2\) constrained) or depend weakly (e.g. \(L_1\) constrained) on dimension \(p\). This supports the prevailing wisdom that it doesn’t matter how many vars you have; as long as you regularize properly, you have good generalization.
We make no assumption about the convexity of the loss function, only that it should be Lipschitz. This allows us to work with truncated hinge loss (nonconvex but Lipschitz) rather that the \(0-1\) loss (not Lipschitz), of course, compatation/optimization of nonconvex loss is hard. We will focus on convexity in a few lecture.