Chapter 10 Covering Numbers (Metric Entropy)

(This chapter was scribed by Baozhen Wang.)

We have so far measured the complexity of a finite hypothesis class simply by counting its cardinality. For infinite hypothesis classes, we observed that the shattering coefficient is a useful measure, since all that mattered was the behavior of a function class on n data points.

However, the shattering coefficient only works for functions that return a finite number of values (e.g., we have focused on boolean functions so far).

Can we retain the combinatorial nature of the shattering coefficient, but generalize to real-valued functions (with an infinite number of possible values), for example, to handle regression problems?

The key measure we will explore in this chapter is the covering number, which counts the number of balls of radius \epsilon needed to cover the hypothesis class. This allows us to reduce the complexity of the class to a finite number of representatives.

Then we can use Massart’s finite lemma to control the representatives; the behavior of the other functions in {\cal F} is controlled by virtue of being close to some representative. In essence, covering numbers allow us to discretize the problem.

10.1 Definitions

First, we introduce the general notion of a metric space.

Definition: Metric Space

-A metric space ({\cal F},\rho) is defined over a set {\cal F} equipped with a metric \rho.

-A metric \rho:{\cal F}\times{\cal F}\to\mathbb{R} must be nonnegative, symmetric, satisfy the triangle inequality, and evaluate to 0 if and only if its two arguments are equal.

-\rho is said to be a pseudometric if \rho(f,f')=0 is possible for some f\neq f'.

Examples of metrics:

  1. Euclidean distance

-For real vectors, where {\cal F}=\mathbb{R}^d, the metric is \rho(f,f')=\|f-f'\|_2=\sqrt{\sum_{i=1}^d(f_i-f_i')^2}

  2. L_2(P_n)

-The L_2 distance between functions with respect to the empirical measure over the n data points, P_n=\frac{1}{n}\sum_{i=1}^n\delta_{z_i}, where \delta_z denotes the point mass at z.

-Let {\cal F} be a family of functions mapping Z to \mathbb{R}.

-The metric is \rho(f,f')=\left[\int(f-f')^2dP_n\right]^{\frac{1}{2}}=\left[\frac{1}{n}\sum_{i=1}^n(f(z_i)-f'(z_i))^2\right]^{\frac{1}{2}}

-Remark: Since we are restricting the functions to their evaluations on n points, we can really think of these functions as n-dimensional vectors, with the Euclidean distance rescaled by \frac{1}{\sqrt{n}}.
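To make this concrete, here is a minimal sketch (in Python; the function values below are hypothetical) showing that the L_2(P_n) distance is exactly the Euclidean distance of the evaluation vectors rescaled by \frac{1}{\sqrt{n}}:

```python
import numpy as np

def l2_pn_distance(f_vals, g_vals):
    """Empirical L2(P_n) distance between two functions, given their
    evaluations on the same n data points z_1, ..., z_n."""
    f_vals, g_vals = np.asarray(f_vals), np.asarray(g_vals)
    return np.sqrt(np.mean((f_vals - g_vals) ** 2))

# Two (hypothetical) functions evaluated on n = 5 points.
f = [0.1, 0.4, 0.5, 0.9, 1.0]
g = [0.2, 0.3, 0.5, 0.8, 1.0]
print(l2_pn_distance(f, g))                             # ~0.0775
print(np.linalg.norm(np.subtract(f, g)) / np.sqrt(5))   # same value
```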

Definition: \epsilon-\text{ball}

Let ({\cal F},\rho) be a metric space. Then the ball of radius \epsilon>0 centered at f\in{\cal F} is

B_{\epsilon}(f)=\{f':\rho(f,f')\leq\epsilon\}

Definition: covering numbers and metric entropy

-An \epsilon-cover of a set {\cal F} with respect to a metric \rho is a finite subset C=\{f_1,f_2,\dots,f_m\}\subset{\cal F} such that if we place a ball of radius \epsilon at each f_i\in C, the union of the balls is a superset of {\cal F}: {\cal F}\subset\cup_{i=1}^mB_{\epsilon}(f_i).

-Equivalently, every point in {\cal F} is at most \epsilon away (under the metric \rho) from some point in the \epsilon-\text{cover} C.

-The \epsilon-\text{covering} number of {\cal F} with respect to \rho is the size of the smallest \epsilon-\text{cover} of {\cal F}: {\cal N}(\epsilon,{\cal F},\rho)\triangleq\min\{m:\exists\{f_1,f_2,\dots,f_m\}\subset{\cal F}\text{ such that }{\cal F}\subset\cup_{i=1}^mB_{\epsilon}(f_i)\}

-The metric entropy of {\cal F} is \log{\cal N}(\epsilon,{\cal F},\rho).

Remark: as \epsilon decreases, the points f_i in the cover become better approximations to the points in B_{\epsilon}(f_i), but {\cal N}(\epsilon,{\cal F},\rho) also increases. What does this trade-off remind you of?
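Computing the minimal cover is hard in general, but a greedy pass gives a valid (if not minimal) \epsilon-cover, and hence a certificate that {\cal N}(\epsilon,{\cal F},\rho)\leq|C|. A sketch, assuming each function is represented by its evaluations on the n data points (the random class and sizes are illustrative):

```python
import numpy as np

def greedy_epsilon_cover(F_vals, eps):
    """Greedily build an eps-cover of a finite set of functions under
    the L2(P_n) metric. F_vals has one row per function (its evaluations
    on the n data points). The result is a valid cover but not
    necessarily minimal, so N(eps, F, rho) <= len(cover)."""
    dist = lambda a, b: np.sqrt(np.mean((a - b) ** 2))
    cover = []
    for f in F_vals:
        if all(dist(f, c) > eps for c in cover):
            cover.append(f)   # f is not within eps of any center yet
    return cover

# 200 random "functions" evaluated on n = 10 points (illustration only)
rng = np.random.default_rng(0)
F_vals = rng.uniform(0, 1, size=(200, 10))
print(len(greedy_epsilon_cover(F_vals, eps=0.3)))
```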

Example: (all functions from \mathbb{R} to [0,1])

-Let {\cal F} be all functions mapping \mathbb{R} to [0,1]. We use L_2(P_n), so that only the evaluations on the n data points matter.

-In order to cover {\cal F}, fix any f\in{\cal F}. Our strategy is to construct some g\in{\cal F} such that \rho(f,g)\leq\epsilon, and then count how many different g’s are needed as we sweep f across {\cal F}.

-For each data point z_i, we cover the range [0,1] with the set of discrete points Y\triangleq\{\epsilon,3\epsilon,5\epsilon,\dots,1\}. For any value f(z_i)\in[0,1], we can pick g(z_i)\in Y closest to f(z_i), so that |f(z_i)-g(z_i)|\leq\epsilon. Do this for each of z_1,\dots,z_n; for z not among the z_i’s, the value g(z) may be chosen arbitrarily. Averaging over the z_i’s, \rho(f,g)=\sqrt{\frac{1}{n}\sum_{i=1}^n(f(z_i)-g(z_i))^2}\leq\epsilon

-Furthermore, |Y|\approx\lceil\frac{1}{2\epsilon}\rceil

So, {\cal N}(\epsilon,{\cal F},L_2(P_n))\leq\left(\frac{1}{2\epsilon}\right)^n, since g is determined by independently choosing one of the |Y| values at each of the n points.

The metric entropy is thus O(n\log\frac{1}{\epsilon}), which intuitively is too large. To see this, ignore the \log\frac{1}{\epsilon} factor (due to the discretization error) for the moment; by Massart’s finite lemma, the Rademacher complexity of the cover is

O\left(\sqrt{\frac{n\log(1/\epsilon)}{n}}\right)\nrightarrow0\quad\text{as }n\to\infty,

since the n in the numerator cancels the n in the denominator.

-Clearly, the class of all functions is too large. We need to consider something smaller, but still rich enough.
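Here is a minimal sketch of the grid-rounding construction from the example above, verifying \rho(f,g)\leq\epsilon for a randomly drawn f (the sample size and seed are arbitrary):

```python
import numpy as np

def discretize_to_grid(f_vals, eps):
    """Round each f(z_i) in [0,1] to the nearest point of the grid
    Y = {eps, 3*eps, 5*eps, ...} used in the example. Each value moves
    by at most eps, hence rho(f, g) <= eps."""
    f_vals = np.asarray(f_vals)
    cell = np.floor(f_vals / (2 * eps))   # which width-2*eps cell f(z_i) lies in
    return np.minimum((2 * cell + 1) * eps, 1.0)

eps = 0.1
f_vals = np.random.default_rng(1).uniform(0, 1, size=8)
g_vals = discretize_to_grid(f_vals, eps)
print(np.sqrt(np.mean((f_vals - g_vals) ** 2)) <= eps)  # True
```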

Example: non-decreasing functions

-Let {\cal F} be all non-decreasing functions from \mathbb{R} to [0,1].

-Let z_1\leq z_2\leq...\leq z_n be n data points.

-Similar to before, we define Y=\{\epsilon,2\epsilon,3\epsilon,...,1\}.

-Fix any function f\in{\cal F}. We will construct a piecewise constant, non-decreasing function g such that \rho(f,g)\leq\epsilon. To do so, for each level y\in Y, identify all the points z_i for which f(z_i)\in(y-\epsilon,y], and set g(z_i)=y for these points. It is clear that |f(z_i)-g(z_i)|\leq\epsilon, so

\rho(f,g)=\sqrt{\frac{1}{n}\sum_{i=1}^n[f(z_i)-g(z_i)]^2}\leq\epsilon

-Now let’s count how many g’s we need. For each of the |Y|=\lceil\frac{1}{\epsilon}\rceil levels, we choose one of the n points to serve as the leftmost z_i for which f(z_i)\in(y-\epsilon,y]; this is a stars-and-bars count, placing \lceil\frac{1}{\epsilon}\rceil level boundaries among the n ordered points. Therefore,

{\cal N}(\epsilon,{\cal F},L_2(P_n))=\binom{\lceil\frac{1}{\epsilon}\rceil+n}{\lceil\frac{1}{\epsilon}\rceil}\leq O(n^{\lceil\frac{1}{\epsilon}\rceil})

and the metric entropy is O(\frac{1}{\epsilon}\log n), which is much better than in the previous example. The main difference is how we construct g: we now choose a point for each level, rather than a level for each point as in the previous example.
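One concrete way to realize this construction is to round each f(z_i) up to its level in Y, which keeps g non-decreasing. A minimal sketch, using a sigmoid as a stand-in for an arbitrary non-decreasing f:

```python
import numpy as np

def round_up_to_levels(f_vals, eps):
    """For a non-decreasing f, map each f(z_i) to the smallest level
    y in Y = {eps, 2*eps, ..., 1} with f(z_i) in (y - eps, y]. The
    resulting g is still non-decreasing and |f(z_i) - g(z_i)| <= eps."""
    return np.minimum(np.ceil(np.asarray(f_vals) / eps) * eps, 1.0)

eps = 0.1
z = np.sort(np.random.default_rng(2).uniform(-3, 3, size=10))
f_vals = 1 / (1 + np.exp(-z))            # a non-decreasing f (sigmoid)
g_vals = round_up_to_levels(f_vals, eps)
assert np.all(np.diff(g_vals) >= 0)      # g is piecewise constant, non-decreasing
print(np.max(np.abs(f_vals - g_vals)) <= eps)  # True, so rho(f, g) <= eps
```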

10.2 Metric Entropy bounds Rademacher Complexity

We will show that metric entropy leads to an upper bound on the Rademacher complexity. This will be done in two different ways:

  1. We use a simple discretization argument that applies the covering number at a single resolution \epsilon.

  2. We use a more sophisticated argument called “chaining,” which uses many resolutions at the same time, yielding tighter bounds.

Theorem: (Simple Discretization)

Without loss of generality, let {\cal F} be a family of functions mapping Z to [-1,1]. Then

\hat{\text{Rad}}_n({\cal F})\leq\inf_{\epsilon>0}\left(\sqrt{\frac{2\log{\cal N}(\epsilon,{\cal F},L_2(P_n))}{n}}+\epsilon\right)

Remarks. The bound involves two terms:

*The first term involves the covering number, which increases as \epsilon decreases. It comes from Massart’s finite lemma.

*The second term \epsilon is the approximation/discretization error. This is the price we pay for the discretization.
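Before the proof, here is a Monte Carlo sanity check of the bound’s structure on a random finite class: for every draw of \sigma, \sup_f\langle\sigma,f\rangle\leq\sup_{g\in C}\langle\sigma,g\rangle+\epsilon whenever C is an \epsilon-cover, so the estimated Rademacher complexities must satisfy the same inequality. The class, sizes, and seed below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, eps = 300, 20, 0.25
F_vals = rng.uniform(-1, 1, size=(m, n))     # m functions on n points

# Greedy eps-cover of the class under L2(P_n)
dist = lambda a, b: np.sqrt(np.mean((a - b) ** 2))
C = []
for f in F_vals:
    if all(dist(f, c) > eps for c in C):
        C.append(f)
C = np.array(C)

def rad_hat(V, sigma):
    """Estimate E[sup_f <sigma, f>] for a finite class V (rows are
    functions), averaging over the given draws of sigma."""
    return np.mean(np.max(sigma @ V.T, axis=1)) / V.shape[1]

sigma = rng.choice([-1.0, 1.0], size=(3000, n))   # shared sign draws
print(rad_hat(F_vals, sigma) <= rad_hat(C, sigma) + eps)  # always True
```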

Proof:

-We will always condition on z_1,\dots,z_n (treating them as fixed) and suppress the explicit conditioning. For example, we write E[A] instead of E[A|z_1,\dots,z_n].

-To simplify notation, write \langle f,g\rangle for

\begin{align} \langle f,g\rangle_{L_2(P_n)}&=\int f\cdot g\,dP_n\\ &=\frac{1}{n}\sum_{i=1}^nf(z_i)\cdot g(z_i) \end{align}

Moreover, write \|f\| for

\begin{align} \|f\|_{L_2(P_n)}&=\sqrt{\int f^2dP_n}\\ &=\sqrt{\frac{1}{n}\sum_{i=1}^nf(z_i)^2} \end{align}

-Overloading our notation:

Let \sigma:Z\to\{-1,1\} be the function which, when evaluated on z_i, returns the random sign \sigma_i. This allows us to write the empirical Rademacher complexity succinctly as

\begin{align} \hat{\text{Rad}}_n({\cal F})&=E[\sup_{f\in{\cal F}}\frac{1}{n}\sum_{i=1}^n\sigma_if(z_i)]\\ &=E[\sup_{f\in{\cal F}}\langle\sigma,f\rangle] \end{align}

-Note that \|\sigma\|=\sqrt{\frac{1}{n}\sum_{i=1}^n\sigma_i^2}=1

-Now, fix \epsilon>0 and let C be an \epsilon-\text{cover} of {\cal F}. Then

\begin{align} \hat{\text{Rad}}_n({\cal F})&=E[\sup_{f\in{\cal F}}\langle\sigma,f\rangle]\\ &=E[\sup_{g\in C}\sup_{f\in{\cal F}\cap B_{\epsilon}(g)}\langle\sigma,f\rangle]\\ &=E[\sup_{g\in C}\sup_{f\in{\cal F}\cap B_{\epsilon}(g)}\langle\sigma,g\rangle+\langle\sigma,f-g\rangle]\\ &\text{by Cauchy-Schwarz, }\langle\sigma,f-g\rangle\leq\|\sigma\|\cdot\|f-g\|\leq1\cdot\epsilon\\ &\leq E[\sup_{g\in C}\langle\sigma,g\rangle]+\epsilon\\ &\text{by Massart's finite lemma on }C\text{ (each }g\in C\text{ has }\|g\|\leq1)\\ &\leq\sqrt{\frac{2\cdot1\cdot\log{\cal N}(\epsilon,{\cal F},L_2(P_n))}{n}}+\epsilon \end{align}

Example: simple discretization of non-decreasing functions

Let {\cal F} be the non-decreasing functions from Z=\mathbb{R} to [0,1]. Plugging the covering number of {\cal F}, O(n^{\frac{1}{\epsilon}}), into the above theorem,

\begin{align} \hat{\text{Rad}}_n({\cal F})&\leq\inf_{\epsilon>0}\left(\sqrt{\frac{2\log(O(n^{\frac{1}{\epsilon}}))}{n}}+\epsilon\right)\\ &=\inf_{\epsilon>0}\left(\sqrt{\frac{2\cdot O(\frac{\log n}{\epsilon})}{n}}+\epsilon\right) \end{align}

The right-hand side is minimized when the two terms are equal. Solving for \epsilon:

\begin{align} \sqrt{\frac{2\,O(\log n)}{n\epsilon}}&=\epsilon\\ \frac{2\,O(\log n)}{n}&=\epsilon^3\\ \epsilon&=\left(\frac{2\,O(\log n)}{n}\right)^{\frac{1}{3}} \end{align}

Thus, \hat{\text{Rad}}_n({\cal F})\leq O\left(\left(\frac{\log n}{n}\right)^{\frac{1}{3}}\right).
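As a sanity check on this balancing step, one can minimize the bound numerically over \epsilon (with the constants hidden inside O(\cdot) set to 1, an arbitrary choice): the grid minimizer and the terms-equal heuristic agree up to a constant factor, and both scale as (\log n/n)^{1/3}.

```python
import numpy as np

# Minimize sqrt(2 * log(n) / (n * eps)) + eps over eps numerically,
# with the constants hidden in O(.) set to 1 for illustration.
for n in [10**3, 10**4, 10**5]:
    eps_grid = np.linspace(1e-3, 1.0, 100_000)
    bound = np.sqrt(2 * np.log(n) / (n * eps_grid)) + eps_grid
    eps_star = eps_grid[np.argmin(bound)]
    eps_eq = (2 * np.log(n) / n) ** (1 / 3)   # terms-equal heuristic
    print(f"n={n}: argmin ~ {eps_star:.4f}, heuristic ~ {eps_eq:.4f}")
```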

-Note that this bound gives a rate worse than the usual \frac{1}{\sqrt{n}} rate. Is this because the class of non-decreasing functions is inherently too rich? Not in this case. We will see shortly that it is just an artifact of the analysis being too weak, because it only works with a single level of resolution \epsilon.

Next we will introduce a clever technique called “chaining” which fixes the problem.

Theorem: Chaining (Dudley’s theorem)

Let {\cal F} be a family of functions mapping Z to [-1,1]. Then

\hat{\text{Rad}}_n({\cal F})\leq12\int_0^{\infty}\sqrt{\frac{\log{\cal N}(\epsilon,{\cal F},L_2(P_n))}{n}}\,d\epsilon

Remark: Compared with the simple discretization, this bound involves an integral that sweeps across different resolutions \epsilon and, importantly, removes the additive \epsilon penalty.

Proof: The key is to work at multiple levels of resolution.

-Let \epsilon_0=\sup_{f\in {\cal F}}\|f\| be the maximum norm of a function f\in {\cal F}, which is the coarsest resolution.

-Let \epsilon_j=\epsilon_{j-1}/2=2^{-j}\cdot\epsilon_0 for j=1,2,..,m be successively finer resolutions.

-For each j=0,\dots,m, let C_j be an \epsilon_j-\text{cover} of {\cal F}.

-Fix any f\in{\cal F} and let g_j\in C_j be such that \|f-g_j\|\leq\epsilon_j; we may take g_0=0, since \{0\} is an \epsilon_0-cover by the definition of \epsilon_0. Note that the g_j’s depend on f.

-Let us decompose f as follows:

f=(f-g_m)+g_0+\sum_{j=1}^m(g_j-g_{j-1}),\quad g_0=0

-Restating Massart’s finite lemma:

If \|f\|\leq M for all f\in{\cal F} and {\cal F} is finite, then

\hat{\text{Rad}}_n({\cal F})\leq M\cdot\sqrt{\frac{2\log|{\cal F}|}{n}}

-Let’s bound some norms:

*\|f-g_m\|\leq\epsilon_m

*\begin{align} \|g_j-g_{j-1}\|&\leq\|g_j-f\|+\|f-g_{j-1}\|\\ &\leq\epsilon_j+\epsilon_{j-1}\\ &=3\epsilon_j \end{align}

-Now,

\begin{align} \hat{\text{Rad}}_n({\cal F})&=E[\sup_{f\in{\cal F}}\langle\sigma,f\rangle]\\ &\text{decompose }f\text{ (note }\langle\sigma,g_0\rangle=0\text{ since }g_0=0)\\ &=E[\sup_{f\in{\cal F}}\langle\sigma,f-g_m\rangle+\sum_{j=1}^m\langle\sigma,g_j-g_{j-1}\rangle]\\ &\text{Cauchy-Schwarz: }\langle\sigma,f-g_m\rangle\leq\|\sigma\|\cdot\|f-g_m\|\leq\epsilon_m\\ &\leq\epsilon_m+E[\sup_{f\in{\cal F}}\sum_{j=1}^m\langle\sigma,g_j-g_{j-1}\rangle]\\ &\text{push sup inside}\\ &\leq\epsilon_m+\sum_{j=1}^mE[\sup_{f\in{\cal F}}\langle\sigma,g_j-g_{j-1}\rangle]\\ &\leq\epsilon_m+\sum_{j=1}^mE[\sup_{g_j\in C_j,g_{j-1}\in C_{j-1}}\langle\sigma,g_j-g_{j-1}\rangle]\\ &\text{Massart's finite lemma on }\{g_j-g_{j-1}\}\text{, a class of size at most }|C_j|\cdot|C_{j-1}|\text{ with norm at most }3\epsilon_j\\ &\leq\epsilon_m+\sum_{j=1}^m3\epsilon_j\sqrt{\frac{2\log(|C_j|\cdot|C_{j-1}|)}{n}}\\ &\text{since }|C_{j-1}|\leq|C_j|\text{, we have }2\log(|C_j|\cdot|C_{j-1}|)\leq4\log|C_j|\\ &\leq\epsilon_m+\sum_{j=1}^m6\epsilon_j\sqrt{\frac{\log|C_j|}{n}}\\ &\text{write }\epsilon_j=2(\epsilon_j-\epsilon_{j+1})\\ &=\epsilon_m+\sum_{j=1}^m12(\epsilon_j-\epsilon_{j+1})\sqrt{\frac{\log|C_j|}{n}}\\ &\text{let }m\to\infty\text{ and bound the sum by an integral }(\log{\cal N}(\epsilon,{\cal F},L_2(P_n))\text{ is non-increasing in }\epsilon)\\ &\leq12\int_0^{\infty}\sqrt{\frac{\log{\cal N}(\epsilon,{\cal F},L_2(P_n))}{n}}\,d\epsilon \end{align}

(The rest of this chapter was scribed by Xinhai Zhang.)

Remark. Why is chaining better than simple discretization?

  • In simple discretization, we apply Massart’s finite lemma to functions whose range is [-1,1], while in chaining, we apply Massart’s finite lemma to the functions \{g_j-g_{j-1}\}, whose norms are at most 3\epsilon_j=O(\epsilon_j). This leads to a scaling of \epsilon_j, which decreases as j increases.

  • This scaling is beneficial. When j increases by 1, \epsilon_j is multiplied by \frac{1}{2}, and as a consequence the covering number |C_j| (and hence \log|C_j|) increases. However, in many examples (e.g., the next one), \log|C_j| only doubles, and because it sits under a square root, \sqrt{\log|C_j|} is multiplied by a factor of \sqrt{2}, not 2; the net factor per level is \frac{\sqrt{2}}{2}=\frac{1}{\sqrt{2}}<1, so we indeed benefit from this scaling (see the numerical sketch below).
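A numerical sketch of this effect for the non-decreasing class, where \log|C_j|\approx\frac{\log n}{\epsilon_j} (constants inside O(\cdot) dropped, an assumption for illustration): each chaining term \epsilon_j\sqrt{\log|C_j|/n} equals \sqrt{\log n/n}\cdot2^{-j/2}, so consecutive terms shrink by 1/\sqrt{2} and the sum stays O(\sqrt{\log n/n}).

```python
import numpy as np

# Chaining terms eps_j * sqrt(log|C_j| / n) with log|C_j| = log(n) / eps_j:
# each term equals sqrt(eps_j * log(n) / n) = sqrt(log(n)/n) * 2**(-j/2).
n, eps0 = 10**4, 1.0
terms = []
for j in range(1, 16):
    eps_j = eps0 * 2.0 ** (-j)
    terms.append(eps_j * np.sqrt(np.log(n) / (n * eps_j)))
ratios = np.array(terms[1:]) / np.array(terms[:-1])
print(np.round(ratios, 3))                 # all ~0.707 = 1/sqrt(2)
print(sum(terms), np.sqrt(np.log(n) / n))  # sum is O(sqrt(log n / n))
```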

Example 10.1 (Chaining on non-decreasing functions). Let \mathcal{F} be all non-decreasing functions from \mathbb{R} to [0,1]. Note that \|f\|\leq 1 for all f\in\mathcal{F}, so the coarsest resolution is \epsilon_0=1. We only need to integrate up to 1, since the integrand \log\mathcal{N}(\epsilon,\mathcal{F},L_2(P_n))=0 for all \epsilon>1.
Plugging \mathcal{N}(\epsilon,\mathcal{F},L_2(P_n))=O(n^{\frac{1}{\epsilon}}) into Dudley's theorem (the chaining theorem), \begin{aligned} \hat{\text{Rad}}_n(\mathcal{F})&\leq 12\int_0^1\sqrt{\frac{1}{n}\log(O(n^{\frac{1}{\epsilon}}))}\,d\epsilon\\ &=O\left(\sqrt{\frac{\log n}{n}}\right)\int_0^1\sqrt{\frac{1}{\epsilon}}\,d\epsilon\\ &=O\left(\sqrt{\frac{\log n}{n}}\right), \end{aligned} since \int_0^1\epsilon^{-\frac{1}{2}}\,d\epsilon=2.

Remark. The \frac{1}{\sqrt{n}} rate (up to logarithmic factors) is better than the \frac{1}{\sqrt[3]{n}} rate from simple discretization.
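To see the gap numerically, here is a sketch evaluating both rates with the constants inside O(\cdot) dropped (an arbitrary choice), using scipy's quad for the Dudley integral; the chaining rate wins for every n shown:

```python
import numpy as np
from scipy.integrate import quad

# Compare the two rates for the non-decreasing class, dropping the
# constants hidden in O(.): chaining gives sqrt(log n / n) times
# Int_0^1 eps^(-1/2) d eps = 2; simple discretization gives (log n / n)^(1/3).
for n in [10**3, 10**5, 10**7]:
    chaining, _ = quad(lambda e: np.sqrt(np.log(n) / (n * e)), 0, 1)
    simple = (np.log(n) / n) ** (1 / 3)
    print(f"n={n:>8}: chaining ~ {chaining:.4f}, simple ~ {simple:.4f}")
```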

Summary:
- The covering number is a powerful technique that allows us to handle infinite, real-valued function classes.
- The essence is discretization, so that we can use Massart's finite lemma.
- The resolution \epsilon can be refined (shrunk) via chaining to get tighter bounds.
- Covering numbers have nice properties not shared by Rademacher complexity; e.g., the covering number of \mathcal{F}_1\cup\mathcal{F}_2 is at most the sum of the covering numbers of the two classes.