Chapter 11 Algorithmic Stability
(This chapter was scribed by Xinhai Zhang.)
Let’s shelve the generalization error \(L(\hat h)-L(\bar h)\) for the moment, and consider \(L(\hat h)-\hat L_n(\hat h)\), the gap between the expected and empirical risk of \(\hat h\).
It’s easy to see that this can be bounded by \(\sup_{h\in\mathcal{H}} L(h)-\hat L_n(h)\), i.e., \[\begin{equation} P\left(L(\hat h)-\hat L_n(\hat h) \geq \epsilon\right) \leq P \left(\sup_{h\in\mathcal{H}} L(h)-\hat L_n(h)\geq\epsilon\right).\tag{11.1} \end{equation}\]
So far, our analysis has relied on \(\hat h\) being the ERM (i.e., \(\hat h\) minimizes \(\hat L_n(h)\)). But (11.1) actually holds for any estimator \(\hat h\in\mathcal{H}\) (not just the ERM).
It is actually useful to think about \(\hat h\) as the result of applying an algorithm \(A\) on the training samples \((Z_i,\ldots, Z_n)\), i.e., \(\hat h = A(Z_i,\ldots, Z_n)\).
Colloquially, a bound on \(L(\hat h)-\hat L_n(\hat h)\) helps to answer this question:
“If I get 5% error on my training data \([\hat L_n(\hat h)]\), what should I expect the test error \([L(\hat h)]\) to be?”
our analysis so far, which relies on uniform convergence, focuses on studying the “size” of \(\mathcal{H}\) (e.g., Rademacher complexity, VC Dimension, covering number, etc). However, what if a learning algorithm \(A\) does not “use” all of \(\mathcal{H}\)? (Not try to search the entire \(\mathcal{H}\)).
For example, what if we use naive Bayes, Lasso regression, Ridge regression, early stopping, or drop-out? All of these could in principle return any \(h\in \mathcal{H}\), but are somehow constrained by their objective function or their algorithm. We might expect better generalization performance than simply looking at the complexity of \(\mathcal{H}\) without looking at the algorithm.As a concrete example, let’s analyze \(\mathcal{H}=\{\boldsymbol{x}\mapsto \boldsymbol{\beta}\boldsymbol{x}: \boldsymbol{\beta}\in\mathbb{R}^p\}\), the class of linear functions with no bounds on \(\|\boldsymbol{\beta}\|_2\).
Define the norm \(\|h\|_{\mathcal{H}}\) to be the \(\|\boldsymbol{\beta}\|_2\) of the associated coefficient vector.
Define the regularized ERM as our algorithm: \[\hat h=\mathop{\mathrm{\mathop{\mathrm{argmin}}\limits}}_{h\in\mathcal{H}} \hat L_n(h)+\frac{\lambda}{2}\|h\|^2_{\mathcal{H}}\]
Of course, it’s well-known to students in Math 535 that \(\hat h\) can be obtained by searching from a constrained class: \(\{h\in\mathcal{H}:\|h\|_\mathcal{H}\leq B\}\) for some \(B\).
However, to find out the \(B\) corresponding to \(\lambda\) required knowing the training data \(Z_i,\ldots, Z_n\).
How to analyze the regularized ERM \(\hat h\) directly?
In this chapter, we introduce a notion of algorithmic stability, where we view a learning algorithm as a function that takes the data as inputs, and outputs some function \(\hat h\), we will study the generalization performance of \(\hat h\) as a function of how stable the algorithm is. Note that the focus is on the algorithm rather than on the hypothesis class.
Stability will be measured in terms of the difference in the behavior of the algorithm given a traning data \(D\) and a perturbed version \(D^i\):
- \(D=\{Z_1,\ldots,Z_n\}\) i.i.d. \(\sim P\)
- \(D^i=\{Z_1,\ldots,Z_i',\ldots,Z_n\}\) i.i.d. \(\sim P\) (Identical to \(D\) except for the \(i\)-th data point.)
- \(Z_i\sim P\) is a new test data point
Definition 11.1 (Uniform Stability) An algorithm \(A:\mathcal{Z}^n\mapsto \mathcal{H}\) has uniform stability \(\alpha\) (or is \(\alpha\)-stable) with respect to a loss function \(\ell\) if for all training data set \(D\in\mathcal{Z}^n\), perturbed set \(D^i\in \mathcal{Z}^n\) and test data \(Z_0\in \mathcal{Z}\), \[|\ell(Z_0,A(D))-\ell(Z_0,A(D^i))|\leq\alpha.\]
Remark. Note that this is a very strong condition, since the bound must hold true uniformly for all \(Z_0, D,D^i\) and is not reliant on the probability distribution.
Example 11.1 (Stability of mean estimation) Assume \(Z\in\mathbb{R}^d\) and \(\|Z\|_2\leq B\) (bounded), consider the squared error loss: \[\ell(z,h)=\frac{1}{2}\|z-h\|^2\] Define our algorithm as follows: \[\begin{aligned} A(D)&\triangleq\mathop{\mathrm{\mathop{\mathrm{argmin}}\limits}}_{h\in\mathbb{R}^d} \hat L_n(h)+\frac{\lambda}{2}\|h\|_2^2\\ &=\mathop{\mathrm{\mathop{\mathrm{argmin}}\limits}}_{h\in\mathbb{R}^d} \frac{1}{2n}\sum^n_{i=1}\|Z_i-h\|^2+\frac{\lambda}{2}\|h\|_2^2\\ &\triangleq\mathop{\mathrm{\mathop{\mathrm{argmin}}\limits}}_{h\in\mathbb{R}^d}f(h)\end{aligned}.\] Note \(\nabla_nf(h)=-\frac{1}{n}\sum^n_{i=1}Z_i-h+\lambda h=0\) yields \(h=\frac{1}{1+\lambda}\frac{1}{n}\sum^n_{i=1}Z_i\), then \(A\) has uniform stability \(\alpha=\frac{6B^2}{(1+\lambda)n}\). To see this, define \[\begin{aligned} V_1&=A(D)-Z_0\leq 2B \\ V_2&=\frac{1}{(1+\lambda)n}(Z_i'-Z_i)\leq \frac{2B}{(1+\lambda)n}\leq 2B \\ \end{aligned}\] Then \[\begin{aligned} |\ell(Z_0,A(D))-\ell(Z_0,A(D^i))| &=\left|\frac{1}{2}\|A(D)-Z_0\|^2-\frac{1}{2}\|A(D^i)-Z_0\|^2\right| \\ &=\left|\frac{1}{2}\|V_1\|^2-\frac{1}{2}\|V_1+V_2\|^2\right| \\ &=\left|V_1\cdot V_2-\frac{1}{2}\|V_2\|^2\right| \\ &\leq \|V_1\|\|V_2\|+\frac{1}{2}\|V_2\|^2 \\ &= \|V_2\|\left(\|V_1\|+\frac{1}{2}\|V_2\|\right) \\ &\leq \frac{2B}{(1+\lambda)n}\left(\frac{B}{(1+\lambda)n}+2B\right)\\ &\leq \frac{6B^2}{(1+\lambda)n} \end{aligned}\] We can see that as we increase the regularization (increase \(\lambda\)), or amount of data (increase \(n\)), we have more stability (\(\alpha\) decreases).
Example 11.2 (Stability of linear predictors) Let \(\mathcal{H}=\{\boldsymbol{x}\mapsto\boldsymbol{\beta}\cdot\boldsymbol{x}:\boldsymbol{\beta}\in\mathbb{R}^d\}\). Assume the loss is 1-Lipschitz, i.e., for all \(Z_0\in\mathcal{Z}\) and for all \(h,h'\in\mathcal{H}\),
\[\left|\ell(Z_0,h)-\ell(Z_0,h')\right|\leq\|h-h'\|_{\infty}\triangleq\sup_{\boldsymbol{x}\in\mathbb{R}^d}|h(\boldsymbol{x})-h'(\boldsymbol{x})|\]
For example, for classification with \(y\in\{-1,1\}\), this holds for the hinge loss \(\ell((\boldsymbol{x},y),h)=\max\{1-y\cdot h(\boldsymbol{x}),0\}\).
Define the algorithm:
\[A(D)\triangleq\mathop{\mathrm{\mathop{\mathrm{argmin}}\limits}}_{h\in\mathcal{H}}\hat L_n(h)+\frac{\lambda}{2}\|h\|^2_{\mathcal{H}}\]
Then \(A\) has stability \(\alpha=\frac{C_2^2}{\lambda n}\). See Bousquet & Elisseeff, JMLR (2002) for the proof.
Theorem 11.1 (Generalization under Uniform Stability) Let \(A\) be an \(\alpha\)-stable algorithm. Assume the loss is bounded, i.e., \(\sup_{z_0,h}|\ell(z_0,h)|\leq M\), then w.p. at least \(1-\delta\), \[L(A(D))-\hat L_n(A(D))\leq\alpha+(\alpha n+M)\sqrt{\frac{2}{n}\log(\frac{1}{\delta})}\].
Remark. Due to the term \(\alpha n\sqrt{\frac{1}{n}}\), for this bound to not diverge, we must have \(\alpha=o(\sqrt{\frac{1}{n}})\). Generally, we will have \(\alpha=O(\frac{1}{n})\) for many algorithms, which makes the RHS \(O(\frac{1}{\sqrt{n}})\).
Proof. Define \(\delta(D)\triangleq L(A(D))-\hat L_n(A(D))\), the proof consists of three steps:
- Bound \(\mathbb{E}[\delta(D)]\)
- Show that \(\delta(D)\) satisfies Bounded Different property
- Use McDiarmid’s Inequality to show that \(\delta(D)\) is near \(\mathbb{E}[\delta(D)]\)
Step 1: \[\begin{aligned} \mathbb{E}[\delta(D)]&=\mathbb{E}_D\left[\mathbb{E}_{Z_0}\left[\ell(Z_0,A(D))-\frac{1}{n}\sum_{i=1}^n\ell(Z_i,A(D))\right] \right]\\ &=\mathbb{E}_{D,Z_0}\left[\frac{1}{n}\sum_{i=1}^n\left[\ell(Z_0,A(D))-\ell(Z_i,A(D))\right]\right]\\ \text{(rename)}&=\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n\left[\ell(Z_i',A(D))-\ell(Z_i',A(D^i))\right]\right]\\ &\leq\alpha \end{aligned}\] The trick is that since \(D=(Z_1,\ldots,Z_n)\), \(D^i=(Z_1,\ldots,Z_i',\ldots,Z_n)\) and \(Z_0\) are all independent and therefore exchangeable. We just rename variables in a way that allow us to preserve the dependency structure and therefore preserve the expectation \(\mathbb{E}[\delta(D)]\). In the second line, \(Z_0 \not\in D\) and \(Z_i\in D\) and this is preserved after the renaming. In the third line, \(Z_i'\not\in D\) and \(Z_i'\in D\).
Step 2: Given the uniform stability, something like the bounded difference property seems to be reasonable. However, we are interested in whether \(\delta(D)\), not \(\ell(Z_0,A(D))\), is bounded difference.
The trick is to use triangle inequality to break down the difference into a chain of comparisons, each involving a perturbation. \[\begin{aligned} &\quad |\delta(D)-\delta(D^i)|\\ &=|L(A(D))-\hat L_n(A(D)) -L(A(D^i))-\hat L_n^i(A(D^i))| \\ &\leq |L(A(D))-L(A(D^i))|+|\hat L_n(A(D))-\hat L_n(A(D^i))| +|\hat L_n(A(D^i))-\hat L_n^i(A(D^i))|\\ &\leq\alpha +\alpha +\frac{2M}{n} \\ &\leq 2\alpha+\frac{2M}{n} \end{aligned}\] where \(\hat L_n^i(A(D^i))\) is the empirical risk w.r.t. the perturbed \(D^i\). The bounds for the first two difference are both \(\alpha\) by uniform stability. To see the bound for the last difference, note that their \(h\) are the same and they differ by one term only, which can differ by at most \(\frac{2M}{n}\).
It’s important that the \(\alpha\)-stability holds uniformly for any argument of \(\ell(Z_0,h)\), regardless of dependency between \(Z_0\) and \(h\).Step 3: Apply McDiarmid’s Inequality on \(\delta(D)\): \[\begin{aligned} \mathbb{P}(\delta(D)-\mathbb{E}[\delta(D)]\leq\epsilon)&\leq \exp\left(-\frac{2\epsilon^2}{n\left(2\alpha+\frac{2M}{n}\right)^2} \right)\\ &=\exp\left(-\frac{n\epsilon^2}{2(\alpha n+M)^2} \right)\triangleq \delta\\ \epsilon&=(\alpha n+M)\sqrt{\frac{2}{n}\log(\frac{1}{\delta})}\end{aligned}\] Then w.p, at least \(1-\delta\) \[\begin{aligned} \delta(D)&\leq \mathbb{E}[\delta(D)]+(\alpha n+M)\sqrt{\frac{2}{n}\log(\frac{1}{\delta})}\\ &\leq \alpha+(\alpha n+M)\sqrt{\frac{2}{n}\log(\frac{1}{\delta})}\end{aligned}\]
- Application to linear functions
Recall the example of regularized ERM on linear functions with \(1\)-Lipschitz loss (e.g. hinge loss) with stability \(\alpha=\frac{C_2^2}{\lambda n}\) where \(\lambda\) is the regularization parameter, and \(\|\boldsymbol{x}\|_2\leq C_2\). Plug this stability into the above theorem, we have w.p. at least \(1-\delta\), \[\delta(D) \leq \frac{C_2^2}{\lambda n}+\left(\frac{C_2^2}{\lambda }+M\right)\sqrt{\frac{2}{n}\log(\frac{1}{\delta})}=O(\frac{C_2^2+M}{\lambda \sqrt{n}})\] Note that this bound has the same dependency on \(n\), but a worse dependency on \(C_2\), compared to Rademacher Complexity bound, which was \(\frac{B_2C_2}{\sqrt{n}}=O(\frac{C_2}{\sqrt{n}})\).