Chapter 8 Vapnik-Chervonenkis (VC) Theory

(This chapter was scribed by Baozhen Wang.)

In 7.3, we discussed Massart’s finite lemma, which bounds the empirical Rademacher complexity (which in turn bounds the Rademacher complexity). Note that the empirical \(\widehat{\text{Rad}}_n(\mathcal{F})\) depends on \(n\) data points, and \(\mathcal{F}\) was a finite class. If an infinite function class \(\mathcal{F}'\) produces the same outputs on the \(n\) data points as a finite function class \(\mathcal{F}\), then \(\mathcal{F}\) and \(\mathcal{F}'\) have the same empirical Rademacher complexity. More formally, if \[\left\{ \left(f(z_1),\dots,f(z_n)\right): f\in \mathcal{F} \text{ finite } \right\}=\left\{ \left(f'(z_1),\dots,f'(z_n)\right): f'\in \mathcal{F}' \text{ infinite } \right\},\] then \(\widehat{\text{Rad}}_n(\mathcal{F})=\widehat{\text{Rad}}_n(\mathcal{F}').\)

Therefore, an infinite class poses no problem: as far as the empirical Rademacher complexity is concerned, all that matters is the behavior of the function class on the \(n\) data points.

This key observation is useful when we analyze boolean functions (those that return 0/1). An important example is the loss class \[\left\{ z \mapsto l(z,h): h\in\mathcal{H} \right\},\] where \(l\) is the 0-1 loss: \(l(z,h)=\mathbb{1}[h(x)\neq y]\) for \(z=(x,y)\). In this case, the number of different patterns \((f(z_1),f(z_2),\dots,f(z_n))\) is finite (at most \(2^n\)). The pattern (or behavior) of \(\mathcal{F}\) on \(n\) data points is captured by the shattering coefficient.

8.1 Shattering Coefficient

Definition 8.1 (Shattering Coefficient - growth function) Let \(\mathcal{F}\) be a collection of functions that map \(\mathcal{Z}\) to a finite set (usually \(\left\{0,1\right\}\)). The shattering coefficient of \(\mathcal{F}\) is the maximum number of patterns on \(n\) points \[S(\mathcal{F},n)\triangleq \max_{z_1,\dots,z_n\in\mathcal{Z}} |\underbrace{\left\{(f(z_1),\dots,f(z_n)): f\in\mathcal{F} \right\}}_{\triangleq B}|, \text{ where }|\cdot| \text{ denotes the cardinality}.\]
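To make the definition concrete, here is a small Python sketch (an illustration, not part of the notes) that counts the distinct patterns a function class realizes on fixed points. The threshold class and the grid of cut values are our own illustrative choices; on 3 fixed points, thresholds realize only 4 of the 8 possible patterns.

```python
def patterns(fs, zs):
    """Collect the distinct behavior patterns (f(z_1), ..., f(z_n))
    realized by a collection of functions fs on points zs."""
    return {tuple(f(z) for z in zs) for f in fs}

# Thresholds h_a(z) = 1[z >= a] as a stand-in function class.
# Only the position of a relative to the data matters, so a finite
# grid of a-values already realizes every achievable pattern.
zs = [0.5, 1.5, 2.5]
thresholds = [0.0, 1.0, 2.0, 3.0]
fs = [lambda z, a=a: int(z >= a) for a in thresholds]

print(len(patterns(fs, zs)))  # → 4 distinct patterns on these 3 points
```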

  • How to use this definition? Via Massart’s finite lemma, we can bound the empirical Rademacher complexity using the shattering coefficient. Specifically, if \(\mathcal{F}\) contains boolean functions, then \(C^2=1\) (the bound from Massart’s finite lemma) and \[\widehat{\text{Rad}}_n(\mathcal{F})\leq \sqrt{\frac{2\log S(\mathcal{F},n)}{n}},\] where \(M=|\mathcal{F}|\) is replaced by \(S(\mathcal{F},n).\) To modify the proof, \(\sum\limits_{f\in\mathcal{F}}\) is simply replaced by \(\sum\limits_{(f(z_1),\dots,f(z_n))\in B}\). The significance is that we can now apply Massart’s finite lemma to an infinite function class with a finite shattering coefficient. In order to do this, it is important to condition on the data, so that \(z_1,z_2,\dots,z_n\) are fixed. Taking expectations after applying Massart’s finite lemma, we obtain \[\text{Rad}_n(\mathcal{F})=E(\widehat{\text{Rad}}_n(\mathcal{F}))\leq \sqrt{\frac{2\log S(\mathcal{F},n)}{n}}.\]
  • Intuition of shattering coefficient
    • For boolean functions, if \(S(\mathcal{F},n)=2^n\) (i.e. \(\mathcal{F}\) recovers all possible 0-1 patterns on \(n\) data points), then we say that \(\mathcal{F}\) shatters the \(n\) data points.
    • If \(\mathcal{F}\) shatters \(n\) data points for all \(n\) (i.e. \(S(\mathcal{F},n)=2^n\) for all \(n\)), then the bound above does not converge to 0, since \(\sqrt{\frac{2\log S(\mathcal{F},n)}{n}}=\sqrt{2\log 2}\) for every \(n\).
    • The fact that \(S(\mathcal{F},n)=2^n\) for all \(n\) leads to undesirable classifiers is expected: if \(\mathcal{F}\) can recover all patterns for all \(n\), then we can perfectly fit any label pattern of the training data, leading to massive overfitting.
  • Relation between \(S(\mathcal{A},n)\) and \(S(\mathcal{H},n)\) (where \(\mathcal{A}\) denotes the loss class and \(\mathcal{H}\) the hypothesis class)
    • The bound so far depends on the shattering coefficient of the loss class \(\mathcal{A}\). However, it’s more intuitive to talk about the shattering coefficient of the hypothesis class \(\mathcal{H}\). We now show that they are the same for 0-1 loss and binary classifiers.

Corollary 8.1 Let \(\mathcal{H}\) be a class of binary classifiers: \[\mathcal{H}=\left\{ h: \mathcal{X}\mapsto \left\{0, 1\right\}\right\}.\] Let \(\mathcal{A}\) be the class of 0-1 losses induced by \(\mathcal{H}\): \[\mathcal{A}=\left\{ (x,y)\mapsto \mathbb{1}[y\neq h(x)]: h\in \mathcal{H}\right\}.\] Note that \(h\) is defined on \(\mathcal{X}\), while functions in \(\mathcal{A}\) are defined on \(\mathcal{X}\times\left\{0,1\right\}\).

Result: \(S(\mathcal{H},n)=S(\mathcal{A},n),\forall n\).

Proof. For any \(n\) data points \((z_1,\dots,z_n)=((x_1,y_1),(x_2,y_2),\dots,(x_n,y_n))\), there is a bijection between the patterns of the 0-1 loss and those of the hypothesis class: \[\left(\mathbb{1}[h(x_1)\neq y_1],\mathbb{1}[h(x_2)\neq y_2],\dots, \mathbb{1}[h(x_n)\neq y_n]\right)\Leftrightarrow \left(h(x_1), h(x_2),\dots,h(x_n)\right),\] where the relation between the two is obtained by taking the elementwise \(XOR\) with \((y_1,y_2,\dots,y_n)\): \[XOR(a,b)=\begin{cases} 1, & \text{if } a\neq b\\ 0, & \text{if } a=b \end{cases}\] For example, \[\begin{bmatrix} h(x_1)\\ h(x_2)\\ \vdots\\ h(x_n) \end{bmatrix} XOR \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_n \end{bmatrix} = \begin{bmatrix} \mathbb{1}[h(x_1)\neq y_1]\\ \mathbb{1}[h(x_2)\neq y_2]\\ \vdots\\ \mathbb{1}[h(x_n)\neq y_n] \end{bmatrix}\] The reverse direction is similar, since \(XOR\)-ing with \((y_1,\dots,y_n)\) again inverts the map. Therefore, the number of patterns we obtain from the 0-1 loss as \(h\) ranges over \(\mathcal{H}\) is the same as the number we obtain from the hypothesis class.
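The XOR bijection in the proof can be checked mechanically. The sketch below (illustrative; the label vector `ys` is an arbitrary choice of ours) verifies for \(n=3\) that XOR-ing with the labels maps the \(2^3\) hypothesis patterns bijectively onto the \(2^3\) loss patterns, and that applying it twice is the identity.

```python
from itertools import product

def loss_pattern(h_pattern, ys):
    # 0-1 loss pattern = elementwise XOR of predictions with labels
    return tuple(h ^ y for h, y in zip(h_pattern, ys))

ys = (1, 0, 1)  # arbitrary fixed labels
h_patterns = list(product([0, 1], repeat=3))  # all 2^3 prediction patterns
l_patterns = [loss_pattern(p, ys) for p in h_patterns]

# Same set of patterns on both sides, so the counts agree:
assert sorted(l_patterns) == sorted(h_patterns)
# XOR-ing with ys again inverts the map (bijection):
assert all(loss_pattern(loss_pattern(p, ys), ys) == p for p in h_patterns)
```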

8.2 VC Dimension

In 8.1, we saw that the shattering coefficient upper-bounds the Rademacher complexity, which in turn bounds the excess risk. Although the shattering coefficient nicely captures the behavior of an infinite \(\mathcal{H}\), it is not necessarily the most convenient quantity to get a handle on.

In 8.2, we will use a concept called VC Dimension to gain more insight about the shattering coefficient.

Definition 8.2 (VC Dimension) The VC Dimension of a family of functions \(\mathcal{H}\) with boolean outputs is the maximum number of data points that can be shattered by \(\mathcal{H}\): \[\text{VC}(\mathcal{H})\triangleq\sup\left\{n: S(\mathcal{H},n)=2^n\right\}.\]

  • Intuition: The VC Dimension of \(\mathcal{H}\) is the maximum number of points whose label patterns can all be recovered by elements of \(\mathcal{H}.\) Intuitively, it is easier to shatter \(n\) points when \(n\) is small than when \(n\) is large; for large \(n\), it is very difficult for a given function class \(\mathcal{H}\) to shatter \(n\) points.
  • To understand the definitions of the shattering coefficient and the VC Dimension better, we can think of each \(h\in\mathcal{H}\) (which returns \(0\) or \(1\) on \(\mathcal{X}\)) as the subset of \(\mathcal{X}\) on which \(h\) returns 1. Each \(h\) is uniquely determined by a subset of \(\mathcal{X}\); likewise, each subset of \(\mathcal{X}\) is uniquely determined by a boolean function.
  • To this end, we may consider \(\mathcal{H}\) not as a family of functions, but as a family of subsets of \(\mathcal{X}\).

Example 8.1 \(\mathcal{X}=\mathbb{R}\), \(\mathcal{H}=\left\{\text{all half-lines} \right\}=\left\{(-\infty,a]: a\in\mathbb{R} \right\}\cup\left\{[a,+\infty): a\in\mathbb{R} \right\}.\) Clearly, \(|\mathcal{H}|=\infty.\) 1. If there is only \(1\) data point \(z_1\in\mathbb{R},\) then \(\mathcal{H}\) shatters it: a left half-line with \(a<z_1\) gives \(h(z_1)=0\), and one with \(a\ge z_1\) gives \(h(z_1)=1\). Hence \[S(\mathcal{H},1)=2=2^1.\] 2. If there are \(2\) data points \(z_1<z_2\in\mathbb{R},\) all \(4\) patterns can be recovered: \[(h(z_1),h(z_2))=\begin{cases} (0,0), & \text{left half-line with } a<z_1<z_2\\ (0,1), & \text{right half-line with } z_1<a\le z_2\\ (1,0), & \text{left half-line with } z_1\le a<z_2\\ (1,1), & \text{left half-line with } z_1<z_2\le a \end{cases}\] Therefore \(S(\mathcal{H},2)=4=2^2.\) 3. If there are \(3\) data points \(z_1<z_2<z_3\in\mathbb{R},\) \[(h(z_1),h(z_2),h(z_3))=\begin{cases} (0,0,0), & \text{left half-line with } a<z_1<z_2<z_3\\ (1,0,0), & \text{left half-line with } z_1\le a<z_2<z_3\\ (1,1,0), & \text{left half-line with } z_1<z_2\le a<z_3\\ (1,1,1), & \text{left half-line with } z_1<z_2<z_3\le a\\ (0,0,1), & \text{right half-line with } z_1<z_2<a\le z_3\\ (0,1,1), & \text{right half-line with } z_1<a\le z_2<z_3\\ (1,0,1), & \text{cannot be recovered}\\ (0,1,0), & \text{cannot be recovered} \end{cases}\] Therefore the number of patterns that half-lines can recover for \(3\) data points is \(6\), so \(S(\mathcal{H},3)=6<2^3,\) and hence \(\text{VC}(\mathcal{H})=2.\)
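Example 8.1 can also be checked by brute force (an illustrative sketch, not part of the notes): since only the position of \(a\) relative to the data matters, a finite grid of cut values suffices to enumerate every half-line pattern.

```python
def halfline_patterns(zs):
    """Enumerate the 0/1 patterns that half-lines realize on points zs."""
    zs = sorted(zs)
    # One cut below the data, one between each consecutive pair, one above:
    cuts = [zs[0] - 1] + [(u + v) / 2 for u, v in zip(zs, zs[1:])] + [zs[-1] + 1]
    pats = set()
    for a in cuts:
        pats.add(tuple(int(z <= a) for z in zs))  # left half-line (-inf, a]
        pats.add(tuple(int(z >= a) for z in zs))  # right half-line [a, +inf)
    return pats

for n in [1, 2, 3]:
    print(n, len(halfline_patterns(list(range(n)))))
# matches S(H,1)=2, S(H,2)=4, S(H,3)=6, hence VC(H)=2
```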

Example 8.2 \(\mathcal{X}=\mathbb{R}\), \(\mathcal{H}=\left\{\text{intervals} \right\}=\left\{[a,b]: a\le b,\ a,b \in \mathbb{R} \right\}.\)

  • \(S(\mathcal{H},1)=2=2^1\) (can shatter 1 point)
  • \(S(\mathcal{H},2)=4=2^2\) (can shatter 2 points)
  • \(S(\mathcal{H},3)=7<2^3\) (can’t shatter 3 points), because we can’t recover the \((1,0,1)\) pattern. Note that we can recover \((0,1,0)\) this time, in contrast to the half-lines example.
  • In general, \(S(\mathcal{H},n)=\binom{n+1}{2}+1<2^n\) for \(n\ge 3\) (choose a nonempty contiguous run of points to label 1, or the all-zero pattern).
  • \(\text{VC}(\text{intervals})=2\).
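The interval counts above can be verified the same way. This sketch (illustrative; the endpoint grid is our own construction) enumerates interval patterns and checks \(S(\mathcal{H},n)=\binom{n+1}{2}+1\) for small \(n\).

```python
from math import comb

def interval_patterns(zs):
    """Enumerate the 0/1 patterns that closed intervals [a, b] realize on zs."""
    zs = sorted(zs)
    # Endpoints between/around the points capture all interval behaviors:
    grid = [zs[0] - 1] + [(u + v) / 2 for u, v in zip(zs, zs[1:])] + [zs[-1] + 1]
    pats = set()
    for i, a in enumerate(grid):
        for b in grid[i:]:
            pats.add(tuple(int(a <= z <= b) for z in zs))
    return pats

for n in range(1, 7):
    assert len(interval_patterns(list(range(n)))) == comb(n + 1, 2) + 1

print(len(interval_patterns([0, 1, 2])))  # → 7, matching S(H,3)=7
```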

Important notes:

  • To show that a class \(\mathcal{H}\) has VC Dimension \(d\), one needs to:
    • show that no \(d+1\) points can be shattered, i.e. for any \(d+1\) points, there is some pattern that cannot be recovered by any element of \(\mathcal{H}\);
    • show that there exist \(d\) points that can be shattered by \(\mathcal{H}\), i.e. all \(2^d\) patterns can be recovered.
  • One can verify that when \(\mathcal{X}=\mathbb{R}^2,\) the VC Dimension of all rectangles is 4. Based on this, one may be tempted to claim that the VC Dimension of an \(\mathcal{H}\) with \(d\) parameters is \(d\).

However the following counter example shows that it’s in general not the case.

Example 8.3 \(\mathcal{X}=\mathbb{R}\), \(\mathcal{H}=\left\{ \left\{x:\sin (\theta x)\ge 0\right\}:\theta\in\mathbb{R} \right\}\). This hypothesis class has only \(1\) parameter, yet one can verify that \(\text{VC}(\mathcal{H})=\infty.\)

It turns out that what matters is not the number of parameters, but the dimension of the function space.

Theorem 8.1 (Finite-dimensional function space) Let \(\mathcal{F}\) be a function class containing functions \[f:\mathcal{X}\rightarrow \mathbb{R}.\] Recall that the dimension of \(\mathcal{F}\) is the number of elements in a basis of \(\mathcal{F}\). Let \[\mathcal{H}=\left\{ \left\{x:f(x)\ge 0\right\}: f\in\mathcal{F}\right\},\] or equivalently \[\mathcal{H}=\left\{ x\mapsto \mathbb{1}[f(x)\ge 0]: f\in\mathcal{F}\right\}.\] Then we have \[\text{VC}(\mathcal{H})\le \text{dim}(\mathcal{F}).\]

Remark: this theorem allows us to connect a linear algebraic property of \(\mathcal{F}\) (i.e. Dimension) with a combinatorial property of \(\mathcal{H}\) (i.e. VC Dimension).

Proof. Take any \(n\) points \(x_1,x_2,\dots,x_n\) with \(n>\dim(\mathcal{F})\); we will show that these \(n\) points cannot be shattered by \(\mathcal{H}\). Consider the map \(H(f)\triangleq (f(x_1),\dots,f(x_n))\), which maps each function in \(\mathcal{F}\) to the vector in \(\mathbb{R}^n\) of its values on the \(n\) data points. Since function evaluation is linear, \(H\) is indeed a linear map, i.e. \[H(f_1+tf_2)=H(f_1)+tH(f_2).\] The vector space \(\left\{ H(f): f\in\mathcal{F}\right\}\subset \mathbb{R}^n\) has dimension at most \(\dim(\mathcal{F})\), since applying a linear map cannot increase dimension.

Now since \(n>\dim(\mathcal{F})\ge \dim(\left\{H(f):f\in\mathcal{F}\right\})\), there exists some non-zero \(c\in\mathbb{R}^n\) orthogonal to this subspace, i.e. for every \(f\in\mathcal{F}\), \[\left\langle H(f),c \right\rangle = 0,\] which implies \[f(x_1)\cdot c_1+f(x_2)\cdot c_2+\cdots+f(x_n)\cdot c_n = 0,\] or equivalently \[\begin{equation} \sum_{i:\, c_i\ge 0} f(x_i)\cdot c_i+\sum_{i:\, c_i< 0} f(x_i)\cdot c_i=0. \tag{8.1} \end{equation}\] (WLOG, assume at least one \(c_i<0\); otherwise replace \(c\) by \(-c\).) This suggests a sign pattern for \((f(x_1),\dots,f(x_n))\). Now, for the purpose of contradiction, suppose \(\mathcal{H}\) shatters \(\left\{ x_1,\dots, x_n\right\}.\) Then we could find a set \(A\in\mathcal{H}\), given by \(A=\{x: f(x)\ge 0\}\) for some \(f\in\mathcal{F}\), such that:

  1. whenever \(c_i\ge 0\), we have \(x_i\in A\), i.e. \(f(x_i)\ge 0\), so the first sum in (8.1) is nonnegative;

  2. whenever \(c_i< 0\), we have \(x_i\notin A\), i.e. \(f(x_i)< 0\), so the second sum in (8.1) is strictly positive.

Together, 1 and 2 make the left-hand side of (8.1) strictly positive, a contradiction. Therefore \(\mathcal{H}\) cannot shatter \(\left\{x_1,\dots,x_n\right\}\) for any choice of \(\left\{x_1,\dots,x_n\right\}\) with \(n>\dim(\mathcal{F})\). Hence \(\text{VC}(\mathcal{H})\le \dim(\mathcal{F})\). \(\square\)

Example 8.4 (half-spaces passing through the origin) Let \(\mathcal{X}=\mathbb{R}^d\), and \[\mathcal{H}=\left\{ \left\{ x: \vec{w}\cdot\vec{x}\ge 0 \right\}:\vec{w}\in\mathbb{R}^d \right\},\] or alternatively \[\mathcal{H}=\left\{ x\mapsto\mathbb{1}[f(x)\ge 0]:f(x)=\vec{w}\cdot\vec{x},\ \vec{w}\in\mathbb{R}^d \right\}.\] Since \(\mathcal{F}=\left\{x\mapsto w\cdot x: w\in\mathbb{R}^d\right\}\) is a linear function class of dimension \(d\), with basis \(\left\{x\mapsto e_1\cdot x,\ \dots,\ x\mapsto e_d\cdot x\right\}\), Theorem 8.1 gives \(\text{VC}(\mathcal{H})\le d.\)

In general, this upper bound on \(\text{VC}(\mathcal{H})\) is sufficient for the purpose of a generalization bound. Let us nevertheless show the matching lower bound \(\text{VC}(\mathcal{H})\ge d.\)

  • Comment: Showing that \(\text{VC}(\mathcal{H})\ge d\) is easier than showing an upper bound of \(\text{VC}(\mathcal{H})\), because we only have to construct some set of \(d\) points that can be shattered by \(\mathcal{H}\), rather than showing that \(\mathcal{H}\) can’t shatter any \(d+1\) points.

  • The construction: \[\begin{align*} x_1 & = (1, 0, \dots, 0)\\ x_2 & = (0, 1, \dots, 0)\\ & \ \ \ \vdots\\ x_d & = (0, 0, \dots, 1) \end{align*}\] Given any \(I\subset\left\{1,\dots,d\right\}\) (which corresponds to a pattern for \((h(x_1),\dots,h(x_d))\) or \((f(x_1),\dots,f(x_d))\)), we can construct \(\vec{w}\) as follows: \[\begin{cases} w_i = 1, & \text{for } i\in I\\ w_i = -1, & \text{for } i\notin I \end{cases}\] In this way, for \(i\in I\), \[f(x_i) = \vec{w}\cdot \vec{x}_i=w_i = 1 > 0 \Rightarrow h(x_i)=1,\] and for \(i\notin I\), \[f(x_i) = \vec{w}\cdot \vec{x}_i=w_i = -1 < 0 \Rightarrow h(x_i)=0.\] This means that all \(2^d\) patterns can be recovered by some \(h\in\mathcal{H}\), i.e. \(\mathcal{H}\) shatters these \(d\) points.

Therefore \(\text{VC}(\mathcal{H})\ge d\), and hence \(\text{VC}(\mathcal{H})= d.\)
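The shattering construction can be checked directly. This sketch (illustrative; \(d=4\) is an arbitrary choice) runs over all \(2^d\) target patterns and confirms that the sign choice \(w_i=\pm 1\) recovers each one on the standard basis points.

```python
from itertools import product

d = 4
xs = [[1 if j == i else 0 for j in range(d)] for i in range(d)]  # e_1, ..., e_d

def h(w, x):
    # half-space through the origin: 1[w . x >= 0]
    return int(sum(wi * xi for wi, xi in zip(w, x)) >= 0)

# For every target 0/1 pattern, w_i = +1 on I and -1 off I realizes it:
for target in product([0, 1], repeat=d):
    w = [1 if t == 1 else -1 for t in target]
    assert tuple(h(w, x) for x in xs) == target  # all 2^d patterns recovered
```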

8.3 Sauer’s Lemma

So far, we have shown that the generalization error is bounded by \[G_n\triangleq \sup_{h\in\mathcal{H}} (L(h)-\hat{L}_n(h)),\] that \(G_n\) concentrates around \(E(G_n),\) and that \(E(G_n)\) is bounded by \(\text{Rad}_n(\mathcal{A})\) (or \(\text{Rad}_n(\mathcal{H})\)):

  • for finite \(\mathcal{H}\), \(\text{Rad}_n(\mathcal{H})\) is bounded via \(|\mathcal{H}|\);
  • for infinite \(\mathcal{H}\), \(\text{Rad}_n(\mathcal{H})\) is bounded via the shattering coefficient of \(\mathcal{H}\).

In 8.2, we have defined VC Dimension in terms of the shattering coefficient and have seen some examples of VC Dimension. Now the last missing piece is how to bound the shattering coefficient in terms of the VC Dimension.

Lemma 8.1 (Sauer’s Lemma) For a class \(\mathcal{H}\) with \(\text{VC}(\mathcal{H})=d\), we have \[S(\mathcal{H},n)\le \sum_{i=0}^d \binom{n}{i} \le \begin{cases} 2^n, & \text{if } n\le d\\ \left(\frac{en}{d}\right)^d, & \text{if } n>d \end{cases}\]
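Both inequalities in the lemma are easy to spot-check numerically (an illustrative sketch; the function name is ours):

```python
from math import comb, e

def sauer_bound(n, d):
    # sum_{i=0}^{d} C(n, i), the middle quantity in Sauer's Lemma
    return sum(comb(n, i) for i in range(d + 1))

d = 3
for n in [4, 10, 50, 200]:
    assert sauer_bound(n, d) <= (e * n / d) ** d  # polynomial cap for n > d

# For n <= d, the sum is just 2^n, since C(n, i) = 0 once i > n:
assert all(sauer_bound(n, n + 1) == 2 ** n for n in range(1, 8))
```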

  • Intuitions
    • The VC Dimension \(d\) is a single integer summary of the shattering coefficient.
    • When \(n\le d\), the shattering coefficient grows exponentially, as \(2^n\).
    • When \(n>d\), the shattering coefficient grows only polynomially in \(n\) \((O(n^d))\).
    • In some sense, given a large sample of \(n\) points, \(\mathcal{H}\) can perfectly recover a subset of up to \(d\) points.
    • The consequence of Sauer’s Lemma is that we now have a new upper bound on the Rademacher complexity (and thus on the generalization error) whenever \(n>d\).

\[\begin{align*} \widehat{\text{Rad}}_n(\mathcal{A}) &\le \sqrt{\frac{2\log (S(\mathcal{A},n))}{n}}\\ &=\sqrt{\frac{2\log (S(\mathcal{H},n))}{n}}\\ (\text{ by Sauer's Lemma })&\le \sqrt{\frac{2\log \left(\frac{en}{d}\right)^d}{n}}\\ &=\sqrt{\frac{2d \left(\log n + \log \frac{e}{d}\right) }{n}}\\ &\le \sqrt{\frac{2d(\log n + 1)}{n}} \end{align*}\]
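Plugging numbers into the final bound (an illustrative sketch; \(d=10\) is an arbitrary choice) shows the \(O\!\left(\sqrt{d\log n/n}\right)\) decay:

```python
from math import log, sqrt

def rad_bound(n, d):
    # sqrt(2 d (log n + 1) / n), the last line of the chain; valid for n > d
    return sqrt(2 * d * (log(n) + 1) / n)

d = 10
for n in [100, 1_000, 10_000, 100_000]:
    print(n, round(rad_bound(n, d), 4))
# the bound shrinks steadily toward 0 as n grows, unlike the sqrt(2 log 2)
# plateau of a class that shatters every sample size
```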

The proof of Sauer’s Lemma is technical and specific, so we omit it here. One proof is based on “shifting”, an important technique in enumerative combinatorics.