Chapter 2 Asymptotic Analysis

(This chapter was scribed by Nicholas P Demarco.)

2.1 Asymptotics of ERM

We let \(n \to \infty\) in order to obtain a bound on the excess risk with respect to \(\mathcal{H}\). In particular, we want \(L(\hat{\theta})-\inf_{\theta \in \Theta} L(\theta) \leq \frac{C}{n} + o(\frac{1}{n})\), where \(C\) is independent of \(n\) and the \(o(\frac{1}{n})\) term hides every dependency other than the one on \(n\). Recall that \(b_n = o(a_n)\) means \(\frac{b_n}{a_n} \to 0\) as \(n \to \infty\); in particular, \(o(\frac{1}{n})\) denotes a term that goes to zero faster than \(\frac{1}{n}\). The main point is that we can ignore how the \(o(\frac{1}{n})\) term depends on everything else.
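For instance, \(b_n = \frac{1}{n^{3/2}}\) satisfies \(b_n = o(\frac{1}{n})\) because \(\frac{b_n}{1/n} = \frac{1}{\sqrt{n}} \to 0\), whereas \(b_n = \frac{5}{n}\) does not, since the ratio stays at \(5\).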

Theorem 2.1: Let \((X_1, Y_1), \dots, (X_n, Y_n)\) be i.i.d. data, \(\hat{\theta} = \text{argmin}_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n \ell (X_i, Y_i, \theta)\), and \(\overline{\theta}= \text{argmin}_{\theta \in \Theta} E_{(X,Y)}[\ell (X,Y, \theta)]\). Suppose that

  1. \(\hat{\theta} \xrightarrow{\mathcal{P}}\overline{\theta}\),

  2. \(\nabla^2 L(\overline{\theta})\) is full rank,

  3. other regularity conditions.

Then:

  1. \(\sqrt{n}(\hat{\theta} - \overline{\theta})=O_p(1)\).

  2. \(\sqrt{n}(\hat{\theta} - \overline{\theta}) \xrightarrow{\mathcal{D}} N(0, \mathbf{\Sigma})\) where \(\mathbf{\Sigma}=[\nabla^2 L(\overline{\theta})]^{-1} \text{cov}[\nabla_\theta \ell (\boldsymbol{X}, \boldsymbol{Y}, \overline{\theta}) ][\nabla ^2 L(\overline{\theta})]^{-1}\).

  3. \(n(L(\hat{\theta})- L(\overline{\theta}))=O_p(1)\).

  4. \(n(L(\hat{\theta})-L(\overline{\theta})) \xrightarrow{\mathcal{D}} \frac{1}{2}||S||_2^2\) where \(S\sim N(0, [\nabla^2 L(\overline{\theta})]^{-\frac{1}{2}} \text{cov}[\nabla_\theta \ell (\boldsymbol{X}, \boldsymbol{Y}, \overline{\theta})][\nabla ^2 L(\overline{\theta})]^{-\frac{1}{2}})\).

  5. \(\lim_{n \to \infty} E[ n(L(\hat{\theta})-L(\overline{\theta})) ] = \frac{1}{2} \text{tr} \big( [\nabla^2 L(\overline{\theta} )]^{-1} \text{cov} ( \nabla \ell ( \boldsymbol{X}, \boldsymbol{Y}, \overline{\theta})) \big)\).

Recall that \(X_n \xrightarrow{\mathcal{P}} X\) if and only if for every \(\epsilon >0\), \(P(||X_n -X|| > \epsilon) \to 0\) as \(n \to \infty\), and \(X_n \xrightarrow{\mathcal{D}} X\) if and only if \(P(X_n \leq t) \to P(X \leq t)\) at all continuity points \(t\) of the map \(t \mapsto P(X \leq t)\).
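As an illustration of parts 1 and 2 (a minimal simulation sketch, not part of the original notes), the snippet below assumes NumPy and uses a heteroskedastic least-squares model chosen only so that the sandwich covariance differs from the plain inverse Hessian.

```python
# Minimal simulation sketch of Theorem 2.1, parts 1-2 (illustration only).
import numpy as np

rng = np.random.default_rng(0)
p, n, reps = 3, 2000, 2000
theta_bar = np.array([1.0, -0.5, 2.0])  # population minimizer of the squared loss

def sample(m):
    X = rng.standard_normal((m, p))
    eps = (1.0 + np.abs(X[:, 0])) * rng.standard_normal(m)  # heteroskedastic noise
    return X, X @ theta_bar + eps

# Monte Carlo draws of sqrt(n) * (theta_hat - theta_bar)
Z = np.empty((reps, p))
for r in range(reps):
    X, Y = sample(n)
    theta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]  # ERM for the squared loss
    Z[r] = np.sqrt(n) * (theta_hat - theta_bar)

# Sandwich covariance [H^-1] G [H^-1], with H = 2 E[X X^T] and G = cov(grad of loss)
X, Y = sample(200_000)
eps = Y - X @ theta_bar
H = 2 * X.T @ X / len(Y)
G = 4 * (X * eps[:, None] ** 2).T @ X / len(Y)
Sigma = np.linalg.inv(H) @ G @ np.linalg.inv(H)

print(np.cov(Z.T))  # empirical covariance of sqrt(n)(theta_hat - theta_bar)
print(Sigma)        # sandwich covariance from part 2; the two should be close
```

The empirical covariance of the \(\sqrt{n}(\hat{\theta}-\overline{\theta})\) draws should match the sandwich matrix \(\mathbf{\Sigma}\) up to Monte Carlo error.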

Key Takeaways:

  1. If we know \(\sqrt{n} ( \hat{\theta} - \overline{\theta})\) is asymptotically Gaussian, then the limiting distribution of the excess risk can be worked out.

  2. \(\hat{\theta} - \overline{\theta}\) decreases at rate \(\frac{1}{\sqrt{n}}\), and the excess risk with respect to \(\mathcal{H}\) decreases at rate \(\frac{1}{n}\).

  3. Caveats: (1) We assume all the regularity conditions hold. (2) We assume \(n\) is large enough that the limiting behavior is an accurate description.

Key Steps in the Proof:

  1. Taylor expansion of \(\nabla \hat{L}(\theta)\) around \(\overline{\theta}\).

  2. A Law of Large Numbers: \(\hat{L}(\theta) \xrightarrow{\mathcal{P}} L(\theta)\), \(\nabla \hat{L} (\theta) \xrightarrow{\mathcal{P}} \nabla L(\theta)\), \(\nabla^2 \hat{L}(\theta) \xrightarrow{\mathcal{P}} \nabla^2 L(\theta)\).

  3. The normality results in the two lemmas below.

Lemma: If \(\boldsymbol{Z}\in \mathbb{R}^p \sim N(0, \mathbf{\Sigma})\) then \(\mathbf{A}\boldsymbol{Z}\sim N(0,\mathbf{A}\mathbf{\Sigma}\mathbf{A}^T)\).

Lemma: If \(\boldsymbol{Z}\in \mathbb{R}^p \sim N(0, \mathbf{\Sigma})\) then \(\boldsymbol{Z}^T \boldsymbol{\Sigma}^{-1} \boldsymbol{Z}\sim \chi^2_p\).
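To see why the second lemma follows from the first, take \(\mathbf{A} = \mathbf{\Sigma}^{-1/2}\): then \[\boldsymbol{Z}^T \boldsymbol{\Sigma}^{-1} \boldsymbol{Z} = ||\mathbf{\Sigma}^{-1/2}\boldsymbol{Z}||_2^2, \qquad \mathbf{\Sigma}^{-1/2}\boldsymbol{Z} \sim N(0, \mathbf{\Sigma}^{-1/2}\mathbf{\Sigma}\mathbf{\Sigma}^{-1/2}) = N(0, \mathbf{I}_p),\] so the quadratic form is a sum of \(p\) independent squared standard normals, i.e. a \(\chi^2_p\) random variable.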

Proof: We have that:

\(0 = \nabla \hat{L}( \hat{\theta}) = \nabla \hat{L} ( \overline{\theta}) + \nabla^2 \hat{L}(\overline{\theta})(\hat{\theta}-\overline{\theta}) +O(||\hat{\theta}-\overline{\theta}||_2^2).\)

So \[\hat{\theta} - \overline{\theta} = -[\nabla^2 \hat{L}(\overline{\theta})]^{-1} \nabla\hat{L}(\overline{\theta}) + O(||\hat{\theta}-\overline{\theta}||_2^2)\] and \[\sqrt{n}(\hat{\theta} - \overline{\theta}) = -\sqrt{n}[\nabla^2 \hat{L}(\overline{\theta})]^{-1} \nabla\hat{L}(\overline{\theta})+O(\sqrt{n}||\hat{\theta}-\overline{\theta}||_2^2);\] we’ll label this as equation 2.1. On the other hand, let \(\boldsymbol{Z}_i = \nabla \ell (\boldsymbol{X}_i, \boldsymbol{Y}_i, \overline{\theta})\) and \(\overline{\boldsymbol{Z}}= \frac{1}{n}\sum_{i=1}^n \boldsymbol{Z}_i = \nabla \hat{L}(\overline{\theta})\). Note that \(E[\nabla \hat{L} ( \overline{\theta})] = \nabla L(\overline{\theta})=0\).

Now we can apply the CLT to obtain \[\sqrt{n}(\overline{\boldsymbol{Z}}- E[\boldsymbol{Z}_i]) = \sqrt{n} ( \nabla \hat{L}(\overline{\theta}) - 0 ) \xrightarrow{\mathcal{D}} N( 0 , \text{cov}(\boldsymbol{Z}_i)),\]

we’ll call this equation 2.2. Note that \(\text{cov}(\boldsymbol{Z}_i)= \text{cov}( \nabla \ell (\boldsymbol{X}_i, \boldsymbol{Y}_i, \overline{\theta}))\). Next, we combine (2.1) and (2.2) and apply the Law of Large Numbers and Slutsky’s Theorem to obtain \[\sqrt{n} (\hat{\theta} - \overline{\theta}) \approx -[\nabla ^2 \hat{L}( \overline{\theta})]^{-1} \sqrt{n} ( \nabla \hat{L} (\overline{\theta})- \nabla L(\overline{\theta}) ) \xrightarrow{\mathcal{D}} -[\nabla^2 L(\overline{\theta})]^{-1} N(0, \text{cov}(\boldsymbol{Z}_i)) = N(0,[\nabla^2 L(\overline{\theta})]^{-1}\text{cov}(\boldsymbol{Z}_i) [\nabla^2 L(\overline{\theta})]^{-1}),\] where the last equality uses the first lemma with \(\mathbf{A} = -[\nabla^2 L(\overline{\theta})]^{-1}\).

This proves part 2. Part 1 follows from part 2 immediately since the sequence converges in distribution to one that is stochastically bounded. For Parts 3 and 4, we’ll work with the Taylor expansion of \(L\) at \(\overline{\theta}\). We can write \(L(\hat{\theta})\) as \[L(\hat{\theta}) = L(\overline{\theta}) + \langle\nabla L(\overline{\theta}), \hat{\theta}- \overline{\theta}\rangle + \frac{1}{2} \langle \hat{\theta} - \overline{\theta}, \nabla^2 L(\overline{\theta})(\hat{\theta}- \overline{\theta})\rangle +o(||\hat{\theta}-\overline{\theta}||^2_2)\]

Since \(\nabla L(\overline{\theta} ) = 0\), \(L(\hat{\theta}) = L(\overline{\theta}) + \frac{1}{2} \langle \hat{\theta} - \overline{\theta}, \nabla^2 L(\overline{\theta})(\hat{\theta}- \overline{\theta})\rangle +o(||\hat{\theta}-\overline{\theta}||^2_2)\). So

\(n(L(\hat{\theta}) - L(\overline{\theta})) = \frac{n}{2}\langle \hat{\theta} - \overline{\theta}, \nabla^2 L(\overline{\theta})(\hat{\theta}- \overline{\theta})\rangle + o (n || \hat{\theta} - \overline{\theta}||_2^2)\)

\(= \frac{1}{2} \langle \sqrt{n} ( \hat{\theta}- \overline{\theta}), \nabla^2 L(\overline{\theta}) \cdot \sqrt{n} (\hat{\theta}- \overline{\theta})\rangle + o_p(1)\) (using \(n || \hat{\theta} - \overline{\theta}||_2^2 = O_p(1)\), which follows from part 1)

\(= \frac{1}{2} || [\nabla ^2 L(\overline{\theta})]^\frac{1}{2} \cdot \sqrt{n} ( \hat{\theta}-\overline{\theta}) ||_2^2 + o_p(1) \xrightarrow{\mathcal{D}} \frac{1}{2}||S||_2^2\), where \(S \sim N(0, [\nabla^2 L(\overline{\theta})]^{-\frac{1}{2}} \text{cov}(\nabla \ell(\boldsymbol{X}, \boldsymbol{Y}, \overline{\theta}))[\nabla^2 L(\overline{\theta})]^{-\frac{1}{2}})\), by part 2, the first lemma with \(\mathbf{A} = [\nabla^2 L(\overline{\theta})]^{\frac{1}{2}}\), and the continuous mapping theorem.

Hence Part 4 holds. Part 3 follows immediately from Part 4 since the sequence converges in distribution to one that is stochastically bounded.

Now we need Part 5. Let’s consider the following limit.

\(\lim_{n \to \infty} E[ n (L (\hat{\theta}) - L (\overline{\theta}))] = E[{\frac{1}{2} ||S||^2_2}]\) \(= \frac{1}{2} E(S^T S)\) \(= \frac{1}{2} E( \text{tr}(S^TS))\) \(= \frac{1}{2} E( \text{tr}(SS^T))\) \(= \frac{1}{2} \text{tr} (E(SS^T))\) \(= \frac{1}{2} \text{tr}([\nabla^2 L(\overline{\theta})]^{-\frac{1}{2}} \text{cov}(\nabla \ell(\boldsymbol{X}, \boldsymbol{Y}, \overline{\theta}))[\nabla^2 L(\overline{\theta})]^{-\frac{1}{2}})\) \(= \frac{1}{2} \text{tr} ([\nabla^2 L(\overline{\theta})]^{-1} \text{cov}(\nabla \ell(\boldsymbol{X}, \boldsymbol{Y}, \overline{\theta}))).\) Here exchanging the limit and the expectation in the first equality is justified by the regularity conditions, and the last equality uses the cyclic property of the trace.

So Part 5 holds and the proof is complete.
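As a quick numerical sanity check on part 5 (again a sketch, not from the original notes), the snippet below assumes NumPy and uses mean estimation under the squared loss \(\ell(y,\theta) = ||y-\theta||_2^2\), chosen because \(\nabla^2 L(\overline{\theta}) = 2\mathbf{I}\), \(\text{cov}(\nabla \ell) = 4\,\text{cov}(Y)\), and the excess risk \(||\hat{\theta}-\overline{\theta}||_2^2\) are all available in closed form.

```python
# Numerical check of part 5 (illustration only): mean estimation under squared loss,
# where the predicted limit (1/2) tr(H^{-1} G) equals tr(cov(Y)).
import numpy as np

rng = np.random.default_rng(1)
p, n, reps = 4, 500, 10_000
A = rng.standard_normal((p, p))
cov_Y = A @ A.T                  # covariance of Y
mu = rng.standard_normal(p)      # theta_bar = E[Y]

# For this loss, L(theta) - L(theta_bar) = ||theta - theta_bar||_2^2 exactly.
excess = np.empty(reps)
for r in range(reps):
    Y = rng.multivariate_normal(mu, cov_Y, size=n)
    theta_hat = Y.mean(axis=0)   # ERM = sample mean
    excess[r] = n * np.sum((theta_hat - mu) ** 2)

H = 2 * np.eye(p)                # Hessian of the population loss
G = 4 * cov_Y                    # covariance of the gradient of the loss at theta_bar
print(excess.mean())                         # Monte Carlo estimate of E[n * excess risk]
print(0.5 * np.trace(np.linalg.inv(H) @ G))  # limit predicted by part 5
```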

Example: When \(\ell\) is the (conditional) negative log-likelihood, the empirical risk minimizer \(\hat{\theta}\) is the MLE.

Accordingly, we assume the “true model” belongs to a parametric family.

Theorem: Assume \(Y|X \sim P(Y|X, \theta_*)\) is the true model. Consider the MLE where \(\ell(X_i , Y_i, \theta)= - \log P(Y_i|X_i, \theta)\). Then:

  1. \(\overline{\theta}= \theta_*\) (population MLE recovers true \(\theta_*\))

  2. \(E[\nabla \ell(X,Y, \overline{\theta})]=0\)

  3. \(\text{cov}(\nabla \ell(X,Y, \overline{\theta})) = \nabla^2 L(\overline{\theta})\)

  4. \(\sqrt{n} (\hat{\theta}- \overline{\theta}) \xrightarrow{\mathcal{D}} N(0, [\nabla^2 L(\overline{\theta})]^{-1})\)

Proof:

  1. The population loss is \(L(\theta) = E[\ell(X_i, Y_i, \theta)]= E[-\log P(Y|X, \theta)]\) (we drop the subscript \(i\) because all the \((X_i, Y_i)\) have the same distribution). So we have,

\[E[-\log P(Y|X, \theta)] = E[ - \log P(Y|X, \theta)+ \log P(Y|X, \theta_*)]- E[\log P(Y|X, \theta_*)]= E\Big[ \log \frac{P(\theta_*)}{P(\theta)} \Big]+C,\] where we write \(P(\theta)\) for \(P(Y|X,\theta)\) and \(C = -E[\log P(Y|X, \theta_*)]\) does not depend on \(\theta\).

By the Law of Iterated Expectation, we have,

\[E\Big[ \log \frac{P(\theta_*)}{P(\theta)} \Big]+C= E\Big[E\Big[\log \frac{P(\theta_*)}{P(\theta)}\Big|X\Big] \Big]+C.\]

The inner conditional expectation is exactly the KL divergence between \(P(\cdot|X,\theta_*)\) and \(P(\cdot|X,\theta)\), so the above equals \(E[KL(P(\theta_*)\,\|\,P(\theta))]+C\geq C\), with equality when \(\theta=\theta_*\). Hence \(L(\theta)\) is minimized at \(\theta_*\), i.e. \(\overline{\theta}=\theta_*\).

  2. \(0=\nabla L(\overline{\theta}) = \nabla E[\ell(X,Y,\overline{\theta})] = E[\nabla \ell (X,Y, \overline{\theta})]\).

  3. Using part 2 (the gradient has mean zero), \(\text{cov} (\nabla \ell(X,Y, \overline{\theta}))= E[ \nabla \ell (X,Y, \overline{\theta})(\nabla \ell (X,Y, \overline{\theta}))^T]\)

\(=E_X [ E_{Y|X}(\nabla \ell(X,Y, \overline{\theta}) \nabla \ell^T (X,Y, \overline{\theta})|X)]\)

\(=E_X[ \int P(Y|X, \overline{\theta})\, \nabla \ell(X,Y, \overline{\theta})\, \nabla \ell^T(X,Y, \overline{\theta})\, dy]\)

\(=E_X[ \int P(Y|X, \overline{\theta})\nabla \log P(Y|X, \overline{\theta}) (\nabla \log P(Y|X, \overline{\theta}) )^T dy]\)

\(=E_X[ \int \frac{\nabla P(Y|X, \overline{\theta}) \nabla^T P(Y|X, \overline{\theta})}{P(Y|X, \overline{\theta})} dy],\)

where the second equality uses that, by part 1, the true conditional density of \(Y\) given \(X\) is \(P(Y|X, \overline{\theta})\), and the third uses \(\nabla \ell = -\nabla \log P\).

Now consider the right hand side of the conclusion in part 3:

\(\nabla^2 L(\overline{\theta}) = \nabla^2 E[ \ell(\overline{\theta})]\)

\(= E[ \nabla^2 (\ell (\overline{\theta}))]\)

\(=E[ \nabla^2 ( - \log P(Y|X, \overline{\theta}))]\)

\(= E_X[ \int P(Y|X, \overline{\theta}) \nabla^2 ( - \log P(Y|X, \overline{\theta})) dy]\)

\(= E_X[ \int - \nabla \Big( \frac{\nabla^T P(Y|X, \overline{\theta})}{P(Y|X, \overline{\theta})}\Big) P(Y|X, \overline{\theta})dy]\)

\(= E_X[ \int - \Big( \frac{\nabla^2 P(Y|X, \overline{\theta}) \cdot P(Y|X, \overline{\theta}) - \nabla P(Y|X, \overline{\theta}) \nabla^T P(Y|X, \overline{\theta})}{P(Y|X, \overline{\theta})^2}\Big) P(Y|X, \overline{\theta})\, dy]\)

\(=E_X[\int -\nabla^2P(Y|X, \overline{\theta}) + \frac{\nabla P(Y|X, \overline{\theta}) \nabla ^T P(Y|X, \overline{\theta})}{P(Y|X, \overline{\theta})} dy ].\)

Note that \[\int -\nabla^2P(Y|X, \overline{\theta})\, dy = -\nabla^2 \int P(Y|X, \overline{\theta}) dy = -\nabla^2(1)=0.\]

So the left hand side equals \(E_X[\int \frac{\nabla P(Y|X, \overline{\theta}) \nabla ^T P(Y|X, \overline{\theta})}{P(Y|X, \overline{\theta})} dy ]\), which is exactly the expression obtained above for \(\text{cov}(\nabla \ell(X,Y, \overline{\theta}))\); this proves part 3.

  4. Part 4 follows from part 3 and Theorem 2.1: substituting \(\text{cov}(\nabla \ell(X,Y, \overline{\theta})) = \nabla^2 L(\overline{\theta})\) into the sandwich covariance \(\mathbf{\Sigma}\) gives \([\nabla^2 L(\overline{\theta})]^{-1}\).

Thus the proof is complete.
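The information identity in part 3 is easy to check numerically. The sketch below (not from the original notes) assumes NumPy and uses a well-specified logistic model, for which \(\nabla \ell(X,Y,\theta) = (\sigma(X^T\theta) - Y)X\) with \(\sigma\) the sigmoid.

```python
# Numerical check of cov(grad of loss) = Hessian of L for a well-specified
# logistic model (illustration only).
import numpy as np

rng = np.random.default_rng(2)
p, N = 3, 500_000
theta_star = np.array([0.5, -1.0, 2.0])

X = rng.standard_normal((N, p))
prob = 1.0 / (1.0 + np.exp(-X @ theta_star))   # P(Y = 1 | X, theta_star)
Y = rng.binomial(1, prob)

# Gradient of l(X, Y, theta) = -log P(Y | X, theta) evaluated at theta_star
grad = (prob - Y)[:, None] * X
cov_grad = grad.T @ grad / N                   # covariance, since E[grad] = 0

# Hessian of the population loss, estimated by E[sigma(1 - sigma) X X^T]
w = prob * (1.0 - prob)
hess = (X * w[:, None]).T @ X / N

print(cov_grad)
print(hess)   # the two matrices should agree up to Monte Carlo error
```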

2.2 Limitations of Asymptotic Analysis

We have no idea how large \(n\) has to be before we can ignore the remainder term. For example, suppose one analysis yields \(\frac{p}{2n} + \frac{1}{n^2}\) and another yields \(\frac{p}{2n} + \frac{p^{100}}{n^2}\). Asymptotic analysis treats both as the same: the dimension \(p\) is held fixed, and the dependency of the remainder term on other quantities is ignored. But for the first, \(\frac{1}{n^2} \leq 10^{-4}\) once \(n \geq 100\), whereas for the second, even \(p=5\) makes \(p^{100}\) astronomically large, so \(n\) must be enormous before the remainder is negligible. This is where non-asymptotic analysis will be useful, meaning we no longer let \(n \to \infty\).
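To make this concrete, here is a tiny calculation (illustrative only, standard library Python): for each remainder we search over powers of two for the smallest \(n\) at which the remainder is at most 10% of the leading \(\frac{p}{2n}\) term.

```python
# Smallest n (over powers of two) at which the remainder is at most 10% of the
# leading p/(2n) term, for p = 5 (illustration only).
p = 5

def smallest_n(remainder):
    n = 1
    while remainder(n) > 0.1 * p / (2 * n):
        n *= 2
    return n

print(smallest_n(lambda n: 1 / n ** 2))                # remainder 1/n^2: prints 4
print(smallest_n(lambda n: float(p) ** 100 / n ** 2))  # remainder p^100/n^2: astronomically large
```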