Chapter 10 Covering Numbers (Metric Entropy)

(This chapter was scribed by Baozhen Wang.)

We have so far measured the complexity of a finite hypothesis class simply by counting its cardinality. For infinite hypothesis classes, we observed that the shattering coefficient is a useful measure, since all that mattered was the behavior of a function class on n data points.

However, the shattering coefficient only works for functions that return a finite number of values (e.g., we have focused on boolean functions so far).

Can we retain the combinatorial nature of the shattering coefficient, but generalize to real-valued functions (with an infinite number of possible values), for example, to handle regression problems?

The key measure we will explore in this chapter is the covering number, which counts the number of balls of radius \epsilon needed to cover the hypothesis class. This allows us to reduce the complexity of the class to a finite number of representatives.

Then we can use Massart’s finite lemma to control the representatives; the behavior of the other functions in {\cal F} is controlled by virtue of being close to some representative. In essence, covering numbers allow us to discretize the problem.

10.1 Definitions

First, we introduce the general notion of a metric space.

Definition: Metric Space

-A metric space ({\cal F},\rho) is defined over a set {\cal F} equipped with a metric \rho.

-A metric \rho:{\cal F}\times{\cal F}\to\mathbb{R} must be nonnegative, symmetric, satisfy the triangle inequality, and evaluate to 0 if and only if its two arguments are equal.

-\rho is said to be a pseudometric if \rho(f,f')=0 is possible for some f\neq f'.

Examples of metrics:

  1. Euclidean distance

-For real vectors, where {\cal F}=\mathbb{R}^d, the metric is \rho(f,f')=\|f-f'\|_2=\sqrt{\sum_{i=1}^d(f_i-f_i')^2}

  2. L_2(P_n)

-The L_2 distance between functions with respect to the empirical measure over the n data points, P_n=\frac{1}{n}\sum_{i=1}^n\delta_{z_i}, where \delta_z denotes the point mass at z.

-Let {\cal F} be a family of functions mapping Z to \mathbb{R}.

-The metric is \rho(f,f')=\left[\int(f-f')^2dP_n\right]^{\frac{1}{2}}=\left[\frac{1}{n}\sum_{i=1}^n(f(z_i)-f'(z_i))^2\right]^{\frac{1}{2}}

-Remark: Since we are restricting the functions to their evaluations on n points, we can really think of these functions as n-dimensional vectors, with the Euclidean distance rescaled by \frac{1}{\sqrt{n}}.
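To make this concrete, here is a minimal sketch (in Python; the function values below are hypothetical) showing that the L_2(P_n) distance is exactly the Euclidean distance of the evaluation vectors rescaled by \frac{1}{\sqrt{n}}:

```python
import numpy as np

def l2_pn_distance(f_vals, g_vals):
    """Empirical L2(P_n) distance between two functions, given their
    evaluations on the same n data points z_1, ..., z_n."""
    f_vals, g_vals = np.asarray(f_vals), np.asarray(g_vals)
    return np.sqrt(np.mean((f_vals - g_vals) ** 2))

# Two (hypothetical) functions evaluated on n = 5 points.
f = [0.1, 0.4, 0.5, 0.9, 1.0]
g = [0.2, 0.3, 0.5, 0.8, 1.0]
print(l2_pn_distance(f, g))                             # ~0.0775
print(np.linalg.norm(np.subtract(f, g)) / np.sqrt(5))   # same value
```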

Definition: \epsilon-\text{ball}

Let ({\cal F},\rho) be a metric space. Then the ball of radius \epsilon>0 centered at f\in{\cal F} is

B_{\epsilon}(f)=\{f':\rho(f,f')\leq\epsilon\}

Definition: covering numbers and metric entropy

-An \epsilon-cover of a set {\cal F} with respect to a metric \rho is a finite subset C=\{f_1,f_2,\dots,f_m\}\subset{\cal F} such that if we place a ball of radius \epsilon at each f_i\in C, the union of the balls is a superset of {\cal F}: {\cal F}\subset\cup_{i=1}^mB_{\epsilon}(f_i).

-Equivalently, every point in {\cal F} is at most \epsilon away (under the metric \rho) from some point in the \epsilon-\text{cover} C.

-The \epsilon-\text{covering} number of {\cal F} with respect to \rho is the size of the smallest \epsilon-\text{cover} of {\cal F}: {\cal N}(\epsilon,{\cal F},\rho)\triangleq\min\{m:\exists\{f_1,f_2,\dots,f_m\}\subset{\cal F}\text{ such that }{\cal F}\subset\cup_{i=1}^mB_{\epsilon}(f_i)\}

-The metric entropy of {\cal F} is \log{\cal N}(\epsilon,{\cal F},\rho).

Remark: as \epsilon decreases, the points f_i in the cover become better approximations to the points in B_{\epsilon}(f_i), but {\cal N}(\epsilon,{\cal F},\rho) also increases. What does this trade-off remind you of?
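Computing the minimal cover is hard in general, but a greedy pass gives a valid (if not minimal) \epsilon-cover, and hence a certificate that {\cal N}(\epsilon,{\cal F},\rho)\leq|C|. A sketch, assuming each function is represented by its evaluations on the n data points (the random class and sizes are illustrative):

```python
import numpy as np

def greedy_epsilon_cover(F_vals, eps):
    """Greedily build an eps-cover of a finite set of functions under
    the L2(P_n) metric. F_vals has one row per function (its evaluations
    on the n data points). The result is a valid cover but not
    necessarily minimal, so N(eps, F, rho) <= len(cover)."""
    dist = lambda a, b: np.sqrt(np.mean((a - b) ** 2))
    cover = []
    for f in F_vals:
        if all(dist(f, c) > eps for c in cover):
            cover.append(f)   # f is not within eps of any center yet
    return cover

# 200 random "functions" evaluated on n = 10 points (illustration only)
rng = np.random.default_rng(0)
F_vals = rng.uniform(0, 1, size=(200, 10))
print(len(greedy_epsilon_cover(F_vals, eps=0.3)))
```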

Example: (all functions from \mathbb{R} to [0,1])

-Let {\cal F} be all functions mapping \mathbb{R} to [0,1]. We use L_2(P_n), so that only the evaluations on the n data points matter.

-In order to cover {\cal F}, fix any f\in{\cal F}. Our strategy is to construct some g\in{\cal F} such that \rho(f,g)\leq\epsilon, and then count how many different g’s are needed as we sweep f across {\cal F}.

-For each data point z_i, we cover the range [0,1] with the set of discrete points Y\triangleq\{\epsilon,3\epsilon,5\epsilon,\dots,1\}. For any value f(z_i)\in[0,1], we can pick g(z_i)\in Y closest to f(z_i), so that |f(z_i)-g(z_i)|\leq\epsilon. Do this for each of z_1,\dots,z_n; for z not among the z_i’s, the value g(z) may be chosen arbitrarily. Averaging over the z_i’s, \rho(f,g)=\sqrt{\frac{1}{n}\sum_{i=1}^n(f(z_i)-g(z_i))^2}\leq\epsilon

-Furthermore, |Y|\approx\lceil\frac{1}{2\epsilon}\rceil

So, {\cal N}(\epsilon,{\cal F},L_2(P_n))\leq\left(\frac{1}{2\epsilon}\right)^n, since g is determined by independently choosing one of the |Y| values at each of the n points.

The metric entropy is thus O(n\log\frac{1}{\epsilon}), which intuitively is too large. To see this, ignore the \log\frac{1}{\epsilon} factor (due to the discretization error) for the moment; by Massart’s finite lemma, the Rademacher complexity of the cover is

O\left(\sqrt{\frac{n\log(1/\epsilon)}{n}}\right)\nrightarrow0\quad\text{as }n\to\infty,

since the n in the numerator cancels the n in the denominator.

-Clearly, the class of all functions is too large. We need to consider something smaller, but still rich enough.
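Here is a minimal sketch of the grid-rounding construction from the example above, verifying \rho(f,g)\leq\epsilon for a randomly drawn f (the sample size and seed are arbitrary):

```python
import numpy as np

def discretize_to_grid(f_vals, eps):
    """Round each f(z_i) in [0,1] to the nearest point of the grid
    Y = {eps, 3*eps, 5*eps, ...} used in the example. Each value moves
    by at most eps, hence rho(f, g) <= eps."""
    f_vals = np.asarray(f_vals)
    cell = np.floor(f_vals / (2 * eps))   # which width-2*eps cell f(z_i) lies in
    return np.minimum((2 * cell + 1) * eps, 1.0)

eps = 0.1
f_vals = np.random.default_rng(1).uniform(0, 1, size=8)
g_vals = discretize_to_grid(f_vals, eps)
print(np.sqrt(np.mean((f_vals - g_vals) ** 2)) <= eps)  # True
```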

Example: non-decreasing functions

-Let {\cal F} be all non-decreasing functions from \mathbb{R} to [0,1].

-Let z_1\leq z_2\leq...\leq z_n be n data points.

-Similar to before, we define Y=\{\epsilon,2\epsilon,3\epsilon,...,1\}.

-Fix any function f\in{\cal F}. We will construct a piecewise constant, non-decreasing function g such that \rho(f,g)\leq\epsilon. To do so, for each level y\in Y, identify all the points z_i for which f(z_i)\in(y-\epsilon,y], and set g(z_i)=y for these points. It is clear that |f(z_i)-g(z_i)|\leq\epsilon, so

\rho(f,g)=\sqrt{\frac{1}{n}\sum_{i=1}^n[f(z_i)-g(z_i)]^2}\leq\epsilon

-Now let’s count how many g’s we need. For each of the |Y|=\lceil\frac{1}{\epsilon}\rceil levels, we choose one of the n points to serve as the leftmost z_i for which f(z_i)\in(y-\epsilon,y]; this is a stars-and-bars count, placing \lceil\frac{1}{\epsilon}\rceil level boundaries among the n ordered points. Therefore,

{\cal N}(\epsilon,{\cal F},L_2(P_n))=\binom{\lceil\frac{1}{\epsilon}\rceil+n}{\lceil\frac{1}{\epsilon}\rceil}\leq O(n^{\lceil\frac{1}{\epsilon}\rceil})

and the metric entropy is O(\frac{1}{\epsilon}\log n), which is much better than in the previous example. The main difference is how we construct g: we now choose a point for each level, rather than a level for each point as in the previous example.
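One concrete way to realize this construction is to round each f(z_i) up to its level in Y, which keeps g non-decreasing. A minimal sketch, using a sigmoid as a stand-in for an arbitrary non-decreasing f:

```python
import numpy as np

def round_up_to_levels(f_vals, eps):
    """For a non-decreasing f, map each f(z_i) to the smallest level
    y in Y = {eps, 2*eps, ..., 1} with f(z_i) in (y - eps, y]. The
    resulting g is still non-decreasing and |f(z_i) - g(z_i)| <= eps."""
    return np.minimum(np.ceil(np.asarray(f_vals) / eps) * eps, 1.0)

eps = 0.1
z = np.sort(np.random.default_rng(2).uniform(-3, 3, size=10))
f_vals = 1 / (1 + np.exp(-z))            # a non-decreasing f (sigmoid)
g_vals = round_up_to_levels(f_vals, eps)
assert np.all(np.diff(g_vals) >= 0)      # g is piecewise constant, non-decreasing
print(np.max(np.abs(f_vals - g_vals)) <= eps)  # True, so rho(f, g) <= eps
```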

10.2 Metric Entropy bounds Rademacher Complexity

We will show that metric entropy leads to an upper bound on the Rademacher complexity. This will be done in two different ways:

  1. We use a simple discretization argument that applies the covering number at a single resolution \epsilon.

  2. We use a more sophisticated argument called “chaining,” which uses many resolutions at the same time, yielding tighter bounds.

Theorem: (Simple Discretization)

Without loss of generality, let {\cal F} be a family of functions mapping Z to [-1,1]. Then

\hat{\text{Rad}}_n({\cal F})\leq\inf_{\epsilon>0}\left(\sqrt{\frac{2\log{\cal N}(\epsilon,{\cal F},L_2(P_n))}{n}}+\epsilon\right)

Remarks. The bound involves two terms:

*The first term involves the covering number, which increases as \epsilon decreases. It comes from Massart’s finite lemma.

*The second term \epsilon is the approximation/discretization error. This is the price we pay for the discretization.
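Before the proof, here is a Monte Carlo sanity check of the bound’s structure on a random finite class: for every draw of \sigma, \sup_f\langle\sigma,f\rangle\leq\sup_{g\in C}\langle\sigma,g\rangle+\epsilon whenever C is an \epsilon-cover, so the estimated Rademacher complexities must satisfy the same inequality. The class, sizes, and seed below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, eps = 300, 20, 0.25
F_vals = rng.uniform(-1, 1, size=(m, n))     # m functions on n points

# Greedy eps-cover of the class under L2(P_n)
dist = lambda a, b: np.sqrt(np.mean((a - b) ** 2))
C = []
for f in F_vals:
    if all(dist(f, c) > eps for c in C):
        C.append(f)
C = np.array(C)

def rad_hat(V, sigma):
    """Estimate E[sup_f <sigma, f>] for a finite class V (rows are
    functions), averaging over the given draws of sigma."""
    return np.mean(np.max(sigma @ V.T, axis=1)) / V.shape[1]

sigma = rng.choice([-1.0, 1.0], size=(3000, n))   # shared sign draws
print(rad_hat(F_vals, sigma) <= rad_hat(C, sigma) + eps)  # always True
```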

Proof:

-We will always condition on z_1,\dots,z_n (treating them as fixed) and suppress the explicit conditioning. For example, we write E[A] instead of E[A|z_1,\dots,z_n].

-To simplify notation, write \langle f,g\rangle for

\begin{align} \langle f,g\rangle_{L_2(P_n)}&=\int f\cdot g\,dP_n\\ &=\frac{1}{n}\sum_{i=1}^nf(z_i)\cdot g(z_i) \end{align}

Moreover, write \|f\| for

\begin{align} \|f\|_{L_2(P_n)}&=\sqrt{\int f^2dP_n}\\ &=\sqrt{\frac{1}{n}\sum_{i=1}^nf(z_i)^2} \end{align}

-Overloading our notation:

Let \sigma:Z\to\{-1,1\} be the function which, when evaluated on z_i, returns the random sign \sigma_i. This allows us to write the empirical Rademacher complexity succinctly as

\begin{align} \hat{\text{Rad}}_n({\cal F})&=E[\sup_{f\in{\cal F}}\frac{1}{n}\sum_{i=1}^n\sigma_if(z_i)]\\ &=E[\sup_{f\in{\cal F}}\langle\sigma,f\rangle] \end{align}

-Note that \|\sigma\|=\sqrt{\frac{1}{n}\sum_{i=1}^n\sigma_i^2}=1

-Now, fix \epsilon>0 and let C be an \epsilon-\text{cover} of {\cal F}. Then

\begin{align} \hat{\text{Rad}}_n({\cal F})&=E[\sup_{f\in{\cal F}}\langle\sigma,f\rangle]\\ &=E[\sup_{g\in C}\sup_{f\in{\cal F}\cap B_{\epsilon}(g)}\langle\sigma,f\rangle]\\ &=E[\sup_{g\in C}\sup_{f\in{\cal F}\cap B_{\epsilon}(g)}\langle\sigma,g\rangle+\langle\sigma,f-g\rangle]\\ &\text{by Cauchy-Schwarz, }\langle\sigma,f-g\rangle\leq\|\sigma\|\cdot\|f-g\|\leq1\cdot\epsilon\\ &\leq E[\sup_{g\in C}\langle\sigma,g\rangle]+\epsilon\\ &\text{by Massart's finite lemma on }C\text{ (each }g\in C\text{ has }\|g\|\leq1)\\ &\leq\sqrt{\frac{2\cdot1\cdot\log{\cal N}(\epsilon,{\cal F},L_2(P_n))}{n}}+\epsilon \end{align}

Example: simple discretization of non-decreasing functions

Let {\cal F} be the non-decreasing functions from Z=\mathbb{R} to [0,1]. Plugging the covering number of {\cal F}, O(n^{\frac{1}{\epsilon}}), into the above theorem,

\begin{align} \hat{\text{Rad}}_n({\cal F})&\leq\inf_{\epsilon>0}\left(\sqrt{\frac{2\log(O(n^{\frac{1}{\epsilon}}))}{n}}+\epsilon\right)\\ &=\inf_{\epsilon>0}\left(\sqrt{\frac{2\cdot O(\frac{\log n}{\epsilon})}{n}}+\epsilon\right) \end{align}

The right-hand side is minimized when the two terms are equal. Solving for \epsilon:

\begin{align} \sqrt{\frac{2\,O(\log n)}{n\epsilon}}&=\epsilon\\ \frac{2\,O(\log n)}{n}&=\epsilon^3\\ \epsilon&=\left(\frac{2\,O(\log n)}{n}\right)^{\frac{1}{3}} \end{align}

Thus, \hat{\text{Rad}}_n({\cal F})\leq O\left(\left(\frac{\log n}{n}\right)^{\frac{1}{3}}\right).
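As a sanity check on this balancing step, one can minimize the bound numerically over \epsilon (with the constants hidden inside O(\cdot) set to 1, an arbitrary choice): the grid minimizer and the terms-equal heuristic agree up to a constant factor, and both scale as (\log n/n)^{1/3}.

```python
import numpy as np

# Minimize sqrt(2 * log(n) / (n * eps)) + eps over eps numerically,
# with the constants hidden in O(.) set to 1 for illustration.
for n in [10**3, 10**4, 10**5]:
    eps_grid = np.linspace(1e-3, 1.0, 100_000)
    bound = np.sqrt(2 * np.log(n) / (n * eps_grid)) + eps_grid
    eps_star = eps_grid[np.argmin(bound)]
    eps_eq = (2 * np.log(n) / n) ** (1 / 3)   # terms-equal heuristic
    print(f"n={n}: argmin ~ {eps_star:.4f}, heuristic ~ {eps_eq:.4f}")
```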

-Note that this bound gives a rate worse than the usual \frac{1}{\sqrt{n}} rate. Is this because the class of non-decreasing functions is inherently too rich? Not in this case. We will see shortly that it is just an artifact of the analysis being too weak, because it only works with a single level of resolution \epsilon.

Next we will introduce a clever technique called “chaining” which fixes the problem.

Theorem: Chaining (Dudley’s theorem)

Let {\cal F} be a family of functions mapping Z to [-1,1]. Then

\hat{\text{Rad}}_n({\cal F})\leq12\int_0^{\infty}\sqrt{\frac{\log{\cal N}(\epsilon,{\cal F},L_2(P_n))}{n}}\,d\epsilon

Remark: Compared with the simple discretization, this bound involves an integral that sweeps across different resolutions \epsilon and, importantly, removes the additive \epsilon penalty.

Proof: The key is to work at multiple levels of resolution.

-Let \epsilon_0=\sup_{f\in {\cal F}}\|f\| be the maximum norm of a function f\in {\cal F}, which is the coarsest resolution.

-Let \epsilon_j=\epsilon_{j-1}/2=2^{-j}\cdot\epsilon_0 for j=1,2,..,m be successively finer resolutions.

-For each j=0,\dots,m, let C_j be an \epsilon_j-\text{cover} of {\cal F}.

-Fix any f\in{\cal F} and let g_j\in C_j be such that \|f-g_j\|\leq\epsilon_j; we may take g_0=0, since \{0\} is an \epsilon_0-cover by the definition of \epsilon_0. Note that the g_j’s depend on f.

-Let us decompose f as follows:

f=(f-g_m)+g_0+\sum_{j=1}^m(g_j-g_{j-1}),\quad g_0=0

-Restating Massart’s finite lemma:

If \|f\|\leq M for all f\in{\cal F} and {\cal F} is finite, then

\hat{\text{Rad}}_n({\cal F})\leq M\cdot\sqrt{\frac{2\log|{\cal F}|}{n}}

-Let’s bound some norms:

*\|f-g_m\|\leq\epsilon_m

*\begin{align} \|g_j-g_{j-1}\|&\leq\|g_j-f\|+\|f-g_{j-1}\|\\ &\leq\epsilon_j+\epsilon_{j-1}\\ &=3\epsilon_j \end{align}

-Now,

\begin{align} \hat{\text{Rad}}_n({\cal F})&=E[\sup_{f\in{\cal F}}\langle\sigma,f\rangle]\\ &\text{decompose }f\text{ (note }\langle\sigma,g_0\rangle=0\text{ since }g_0=0)\\ &=E[\sup_{f\in{\cal F}}\langle\sigma,f-g_m\rangle+\sum_{j=1}^m\langle\sigma,g_j-g_{j-1}\rangle]\\ &\text{Cauchy-Schwarz: }\langle\sigma,f-g_m\rangle\leq\|\sigma\|\cdot\|f-g_m\|\leq\epsilon_m\\ &\leq\epsilon_m+E[\sup_{f\in{\cal F}}\sum_{j=1}^m\langle\sigma,g_j-g_{j-1}\rangle]\\ &\text{push sup inside}\\ &\leq\epsilon_m+\sum_{j=1}^mE[\sup_{f\in{\cal F}}\langle\sigma,g_j-g_{j-1}\rangle]\\ &\leq\epsilon_m+\sum_{j=1}^mE[\sup_{g_j\in C_j,g_{j-1}\in C_{j-1}}\langle\sigma,g_j-g_{j-1}\rangle]\\ &\text{Massart's finite lemma on }\{g_j-g_{j-1}\}\text{, a class of size at most }|C_j|\cdot|C_{j-1}|\text{ with norm at most }3\epsilon_j\\ &\leq\epsilon_m+\sum_{j=1}^m3\epsilon_j\sqrt{\frac{2\log(|C_j|\cdot|C_{j-1}|)}{n}}\\ &\text{since }|C_{j-1}|\leq|C_j|\text{, we have }2\log(|C_j|\cdot|C_{j-1}|)\leq4\log|C_j|\\ &\leq\epsilon_m+\sum_{j=1}^m6\epsilon_j\sqrt{\frac{\log|C_j|}{n}}\\ &\text{write }\epsilon_j=2(\epsilon_j-\epsilon_{j+1})\\ &=\epsilon_m+\sum_{j=1}^m12(\epsilon_j-\epsilon_{j+1})\sqrt{\frac{\log|C_j|}{n}}\\ &\text{let }m\to\infty\text{ and bound the sum by an integral }(\log{\cal N}(\epsilon,{\cal F},L_2(P_n))\text{ is non-increasing in }\epsilon)\\ &\leq12\int_0^{\infty}\sqrt{\frac{\log{\cal N}(\epsilon,{\cal F},L_2(P_n))}{n}}\,d\epsilon \end{align}

(The rest of this chapter was scribed by Xinhai Zhang.)

Remark. Why is chaining better than simple discretization?

  • In simple discretization, we apply Massart’s finite lemma to functions whose range is [-1,1], while in chaining, we apply Massart’s finite lemma to the functions \{g_j-g_{j-1}\}, whose norms are at most 3\epsilon_j=O(\epsilon_j). This leads to a scaling of \epsilon_j, which decreases as j increases.

  • This scaling is beneficial. When j increases by 1, \epsilon_j is multiplied by \frac{1}{2}, and as a consequence the covering number |C_j| (and hence \log|C_j|) increases. However, in many examples (e.g., the next one), \log|C_j| only doubles, and because it sits under a square root, \sqrt{\log|C_j|} is multiplied by a factor of \sqrt{2}, not 2; the net factor per level is \frac{\sqrt{2}}{2}=\frac{1}{\sqrt{2}}<1, so we indeed benefit from this scaling (see the numerical sketch below).
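A numerical sketch of this effect for the non-decreasing class, where \log|C_j|\approx\frac{\log n}{\epsilon_j} (constants inside O(\cdot) dropped, an assumption for illustration): each chaining term \epsilon_j\sqrt{\log|C_j|/n} equals \sqrt{\log n/n}\cdot2^{-j/2}, so consecutive terms shrink by 1/\sqrt{2} and the sum stays O(\sqrt{\log n/n}).

```python
import numpy as np

# Chaining terms eps_j * sqrt(log|C_j| / n) with log|C_j| = log(n) / eps_j:
# each term equals sqrt(eps_j * log(n) / n) = sqrt(log(n)/n) * 2**(-j/2).
n, eps0 = 10**4, 1.0
terms = []
for j in range(1, 16):
    eps_j = eps0 * 2.0 ** (-j)
    terms.append(eps_j * np.sqrt(np.log(n) / (n * eps_j)))
ratios = np.array(terms[1:]) / np.array(terms[:-1])
print(np.round(ratios, 3))                 # all ~0.707 = 1/sqrt(2)
print(sum(terms), np.sqrt(np.log(n) / n))  # sum is O(sqrt(log n / n))
```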

Example 10.1 (Chaining on non-decreasing functions). Let \mathcal{F} be all non-decreasing functions from \mathbb{R} to [0,1]. Note that \|f\|\leq 1 for all f\in\mathcal{F}, so the coarsest resolution is \epsilon_0=1. We only need to integrate up to 1, since the integrand \log\mathcal{N}(\epsilon,\mathcal{F},L_2(P_n))=0 for all \epsilon>1.
Plugging \mathcal{N}(\epsilon,\mathcal{F},L_2(P_n))=O(n^{\frac{1}{\epsilon}}) into Dudley's theorem (the chaining theorem), \begin{aligned} \hat{\text{Rad}}_n(\mathcal{F})&\leq 12\int_0^1\sqrt{\frac{1}{n}\log(O(n^{\frac{1}{\epsilon}}))}\,d\epsilon\\ &=O\left(\sqrt{\frac{\log n}{n}}\right)\int_0^1\sqrt{\frac{1}{\epsilon}}\,d\epsilon\\ &=O\left(\sqrt{\frac{\log n}{n}}\right), \end{aligned} since \int_0^1\epsilon^{-\frac{1}{2}}\,d\epsilon=2.

Remark. The \frac{1}{\sqrt{n}} rate (up to logarithmic factors) is better than the \frac{1}{\sqrt[3]{n}} rate from simple discretization.
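To see the gap numerically, here is a sketch evaluating both rates with the constants inside O(\cdot) dropped (an arbitrary choice), using scipy's quad for the Dudley integral; the chaining rate wins for every n shown:

```python
import numpy as np
from scipy.integrate import quad

# Compare the two rates for the non-decreasing class, dropping the
# constants hidden in O(.): chaining gives sqrt(log n / n) times
# Int_0^1 eps^(-1/2) d eps = 2; simple discretization gives (log n / n)^(1/3).
for n in [10**3, 10**5, 10**7]:
    chaining, _ = quad(lambda e: np.sqrt(np.log(n) / (n * e)), 0, 1)
    simple = (np.log(n) / n) ** (1 / 3)
    print(f"n={n:>8}: chaining ~ {chaining:.4f}, simple ~ {simple:.4f}")
```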

Summary:
- The covering number is a powerful technique that allows us to handle infinite, real-valued function classes.
- The essence is discretization, so that we can use Massart's finite lemma.
- The resolution \epsilon can be refined (shrunk) via chaining to get tighter bounds.
- Covering numbers have nice properties not shared by Rademacher complexity; e.g., the covering number of \mathcal{F}_1\cup\mathcal{F}_2 is at most the sum of the covering numbers of the two classes.