Chapter 2 Asymptotic Analysis

(This chapter was scribed by Nicholas P Demarco.)

2.1 Asymptotics of ERM

We let \(n \to \infty\) in order to obtain a bound on the excess risk with respect to \(\mathcal{H}\). In particular, we want \(L(\hat{\theta})-\inf_{\theta \in \Theta} L(\theta) \leq \frac{C}{n} + o(\frac{1}{n})\), where \(C\) is independent of \(n\) and the \(o(\frac{1}{n})\) term hides every dependency other than the one on \(n\). Recall that \(b_n = o(a_n)\) means \(\frac{b_n}{a_n} \to 0\) as \(n \to \infty\); in particular, \(o(\frac{1}{n})\) denotes a term that goes to zero faster than \(\frac{1}{n}\). The main point is that we can ignore how the \(o(\frac{1}{n})\) term depends on everything else.
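For instance, \(b_n = \frac{1}{n^{3/2}}\) satisfies \(b_n = o(\frac{1}{n})\) because \(\frac{b_n}{1/n} = \frac{1}{\sqrt{n}} \to 0\), whereas \(b_n = \frac{5}{n}\) does not, since the ratio stays at \(5\).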

Theorem 2.1: Let \((X_1, Y_1), \dots, (X_n, Y_n)\) be i.i.d. data, \(\hat{\theta} = \text{argmin}_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n \ell (X_i, Y_i, \theta)\), and \(\overline{\theta}= \text{argmin}_{\theta \in \Theta} E_{(X,Y)}[\ell (X,Y, \theta)]\). Suppose that

  1. \(\hat{\theta} \xrightarrow{\mathcal{P}}\overline{\theta}\),

  2. \(\nabla^2 L(\overline{\theta})\) is full rank,

  3. other regularity conditions.

Then:

  1. \(\sqrt{n}(\hat{\theta} - \overline{\theta})=O_p(1)\).

  2. \(\sqrt{n}(\hat{\theta} - \overline{\theta}) \xrightarrow{\mathcal{D}} N(0, \mathbf{\Sigma})\) where \(\mathbf{\Sigma}=[\nabla^2 L(\overline{\theta})]^{-1} \text{cov}[\nabla_\theta \ell (\boldsymbol{X}, \boldsymbol{Y}, \overline{\theta}) ][\nabla ^2 L(\overline{\theta})]^{-1}\).

  3. \(n(L(\hat{\theta})- L(\overline{\theta}))=O_p(1)\).

  4. \(n(L(\hat{\theta})-L(\overline{\theta})) \xrightarrow{\mathcal{D}} \frac{1}{2}||S||_2^2\) where \(S\sim N(0, [\nabla^2 L(\overline{\theta})]^{-\frac{1}{2}} \text{cov}[\nabla_\theta \ell (\boldsymbol{X}, \boldsymbol{Y}, \overline{\theta})][\nabla ^2 L(\overline{\theta})]^{-\frac{1}{2}})\).

  5. \(\lim_{n \to \infty} E[ n(L(\hat{\theta})-L(\overline{\theta})) ] = \frac{1}{2} \text{tr} \big( [\nabla^2 L(\overline{\theta} )]^{-1} \text{cov} ( \nabla \ell ( \boldsymbol{X}, \boldsymbol{Y}, \overline{\theta})) \big)\).

Recall that \(X_n \xrightarrow{\mathcal{P}} X\) if and only if for every \(\epsilon >0\), \(P(||X_n -X|| > \epsilon) \to 0\) as \(n \to \infty\), and \(X_n \xrightarrow{\mathcal{D}} X\) if and only if \(P(X_n \leq t) \to P(X \leq t)\) at all continuity points \(t\) of the map \(t \mapsto P(X \leq t)\).
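As an illustration of parts 1 and 2 (a minimal simulation sketch, not part of the original notes), the snippet below assumes NumPy and uses a heteroskedastic least-squares model chosen only so that the sandwich covariance differs from the plain inverse Hessian.

```python
# Minimal simulation sketch of Theorem 2.1, parts 1-2 (illustration only).
import numpy as np

rng = np.random.default_rng(0)
p, n, reps = 3, 2000, 2000
theta_bar = np.array([1.0, -0.5, 2.0])  # population minimizer of the squared loss

def sample(m):
    X = rng.standard_normal((m, p))
    eps = (1.0 + np.abs(X[:, 0])) * rng.standard_normal(m)  # heteroskedastic noise
    return X, X @ theta_bar + eps

# Monte Carlo draws of sqrt(n) * (theta_hat - theta_bar)
Z = np.empty((reps, p))
for r in range(reps):
    X, Y = sample(n)
    theta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]  # ERM for the squared loss
    Z[r] = np.sqrt(n) * (theta_hat - theta_bar)

# Sandwich covariance [H^-1] G [H^-1], with H = 2 E[X X^T] and G = cov(grad of loss)
X, Y = sample(200_000)
eps = Y - X @ theta_bar
H = 2 * X.T @ X / len(Y)
G = 4 * (X * eps[:, None] ** 2).T @ X / len(Y)
Sigma = np.linalg.inv(H) @ G @ np.linalg.inv(H)

print(np.cov(Z.T))  # empirical covariance of sqrt(n)(theta_hat - theta_bar)
print(Sigma)        # sandwich covariance from part 2; the two should be close
```

The empirical covariance of the \(\sqrt{n}(\hat{\theta}-\overline{\theta})\) draws should match the sandwich matrix \(\mathbf{\Sigma}\) up to Monte Carlo error.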

Key Takeaways:

  1. If we know \(\sqrt{n} ( \hat{\theta} - \overline{\theta})\) is asymptotically Gaussian, then the limiting distribution of the excess risk can be worked out.

  2. \(\hat{\theta} - \overline{\theta}\) decreases at rate \(\frac{1}{\sqrt{n}}\), and the excess risk with respect to \(\mathcal{H}\) decreases at rate \(\frac{1}{n}\).

  3. Caveats: (1) We assume all the regularity conditions hold. (2) We assume \(n\) is large enough that the limiting behavior is an accurate description.

Key Steps in the Proof:

  1. Taylor expansion of \(\nabla \hat{L}(\theta)\) around \(\overline{\theta}\).

  2. A Law of Large Numbers: \(\hat{L}(\theta) \xrightarrow{\mathcal{P}} L(\theta)\), \(\nabla \hat{L} (\theta) \xrightarrow{\mathcal{P}} \nabla L(\theta)\), \(\nabla^2 \hat{L}(\theta) \xrightarrow{\mathcal{P}} \nabla^2 L(\theta)\).

  3. The normality results in the two lemmas below.

Lemma: If \(\boldsymbol{Z}\in \mathbb{R}^p \sim N(0, \mathbf{\Sigma})\) then \(\mathbf{A}\boldsymbol{Z}\sim N(0,\mathbf{A}\mathbf{\Sigma}\mathbf{A}^T)\).

Lemma: If \(\boldsymbol{Z}\in \mathbb{R}^p \sim N(0, \mathbf{\Sigma})\) then \(\boldsymbol{Z}^T \boldsymbol{\Sigma}^{-1} \boldsymbol{Z}\sim \chi^2_p\).
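To see why the second lemma follows from the first, take \(\mathbf{A} = \mathbf{\Sigma}^{-1/2}\): then \[\boldsymbol{Z}^T \boldsymbol{\Sigma}^{-1} \boldsymbol{Z} = ||\mathbf{\Sigma}^{-1/2}\boldsymbol{Z}||_2^2, \qquad \mathbf{\Sigma}^{-1/2}\boldsymbol{Z} \sim N(0, \mathbf{\Sigma}^{-1/2}\mathbf{\Sigma}\mathbf{\Sigma}^{-1/2}) = N(0, \mathbf{I}_p),\] so the quadratic form is a sum of \(p\) independent squared standard normals, i.e. a \(\chi^2_p\) random variable.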

Proof: We have that:

\(0 = \nabla \hat{L}( \hat{\theta}) = \nabla \hat{L} ( \overline{\theta}) + \nabla^2 \hat{L}(\overline{\theta})(\hat{\theta}-\overline{\theta}) +O(||\hat{\theta}-\overline{\theta}||_2^2).\)

So \[\hat{\theta} - \overline{\theta} = -[\nabla^2 \hat{L}(\overline{\theta})]^{-1} \nabla\hat{L}(\overline{\theta}) + O(||\hat{\theta}-\overline{\theta}||_2^2)\] and \[\sqrt{n}(\hat{\theta} - \overline{\theta}) = -\sqrt{n}[\nabla^2 \hat{L}(\overline{\theta})]^{-1} \nabla\hat{L}(\overline{\theta})+O(\sqrt{n}||\hat{\theta}-\overline{\theta}||_2^2);\] we’ll label this as equation 2.1. On the other hand, let \(\boldsymbol{Z}_i = \nabla \ell (\boldsymbol{X}_i, \boldsymbol{Y}_i, \overline{\theta})\) and \(\overline{\boldsymbol{Z}}= \frac{1}{n}\sum_{i=1}^n \boldsymbol{Z}_i = \nabla \hat{L}(\overline{\theta})\). Note that \(E[\nabla \hat{L} ( \overline{\theta})] = \nabla L(\overline{\theta})=0\).

Now we can apply the CLT to obtain \[\sqrt{n}(\overline{\boldsymbol{Z}}- E[\boldsymbol{Z}_i]) = \sqrt{n} ( \nabla \hat{L}(\overline{\theta}) - 0 ) \xrightarrow{\mathcal{D}} N( 0 , \text{cov}(\boldsymbol{Z}_i)),\]

we’ll call this equation 2.2. Note that \(\text{cov}(\boldsymbol{Z}_i)= \text{cov}( \nabla \ell (\boldsymbol{X}_i, \boldsymbol{Y}_i, \overline{\theta}))\). Next, we combine (2.1) and (2.2) and apply the Law of Large Numbers and Slutsky’s Theorem to obtain \[\sqrt{n} (\hat{\theta} - \overline{\theta}) \approx -[\nabla ^2 \hat{L}( \overline{\theta})]^{-1} \sqrt{n} ( \nabla \hat{L} (\overline{\theta})- \nabla L(\overline{\theta}) ) \xrightarrow{\mathcal{D}} -[\nabla^2 L(\overline{\theta})]^{-1} N(0, \text{cov}(\boldsymbol{Z}_i)) = N(0,[\nabla^2 L(\overline{\theta})]^{-1}\text{cov}(\boldsymbol{Z}_i) [\nabla^2 L(\overline{\theta})]^{-1}),\] where the last equality uses the first lemma with \(\mathbf{A} = -[\nabla^2 L(\overline{\theta})]^{-1}\).

This proves part 2. Part 1 follows from part 2 immediately since the sequence converges in distribution to one that is stochastically bounded. For Parts 3 and 4, we’ll work with the Taylor expansion of \(L\) at \(\overline{\theta}\). We can write \(L(\hat{\theta})\) as \[L(\hat{\theta}) = L(\overline{\theta}) + \langle\nabla L(\overline{\theta}), \hat{\theta}- \overline{\theta}\rangle + \frac{1}{2} \langle \hat{\theta} - \overline{\theta}, \nabla^2 L(\overline{\theta})(\hat{\theta}- \overline{\theta})\rangle +o(||\hat{\theta}-\overline{\theta}||^2_2)\]

Since \(\nabla L(\overline{\theta} ) = 0\), \(L(\hat{\theta}) = L(\overline{\theta}) + \frac{1}{2} \langle \hat{\theta} - \overline{\theta}, \nabla^2 L(\overline{\theta})(\hat{\theta}- \overline{\theta})\rangle +o(||\hat{\theta}-\overline{\theta}||^2_2)\). So

\(n(L(\hat{\theta}) - L(\overline{\theta})) = \frac{n}{2}\langle \hat{\theta} - \overline{\theta}, \nabla^2 L(\overline{\theta})(\hat{\theta}- \overline{\theta})\rangle + o (n || \hat{\theta} - \overline{\theta}||_2^2)\)

\(= \frac{1}{2} \langle \sqrt{n} ( \hat{\theta}- \overline{\theta}), \nabla^2 L(\overline{\theta}) \cdot \sqrt{n} (\hat{\theta}- \overline{\theta})\rangle + o_p(1)\) (using \(n || \hat{\theta} - \overline{\theta}||_2^2 = O_p(1)\), which follows from part 1)

\(= \frac{1}{2} || [\nabla ^2 L(\overline{\theta})]^\frac{1}{2} \cdot \sqrt{n} ( \hat{\theta}-\overline{\theta}) ||_2^2 + o_p(1) \xrightarrow{\mathcal{D}} \frac{1}{2}||S||_2^2\), where \(S \sim N(0, [\nabla^2 L(\overline{\theta})]^{-\frac{1}{2}} \text{cov}(\nabla \ell(\boldsymbol{X}, \boldsymbol{Y}, \overline{\theta}))[\nabla^2 L(\overline{\theta})]^{-\frac{1}{2}})\), by part 2, the first lemma with \(\mathbf{A} = [\nabla^2 L(\overline{\theta})]^{\frac{1}{2}}\), and the continuous mapping theorem.

Hence Part 4 holds. Part 3 follows immediately from Part 4 since the sequence converges in distribution to one that is stochastically bounded.

Now we need Part 5. Let’s consider the following limit.

\(\lim_{n \to \infty} E[ n (L (\hat{\theta}) - L (\overline{\theta}))] = E[{\frac{1}{2} ||S||^2_2}]\) \(= \frac{1}{2} E(S^T S)\) \(= \frac{1}{2} E( \text{tr}(S^TS))\) \(= \frac{1}{2} E( \text{tr}(SS^T))\) \(= \frac{1}{2} \text{tr} (E(SS^T))\) \(= \frac{1}{2} \text{tr}([\nabla^2 L(\overline{\theta})]^{-\frac{1}{2}} \text{cov}(\nabla \ell(\boldsymbol{X}, \boldsymbol{Y}, \overline{\theta}))[\nabla^2 L(\overline{\theta})]^{-\frac{1}{2}})\) \(= \frac{1}{2} \text{tr} ([\nabla^2 L(\overline{\theta})]^{-1} \text{cov}(\nabla \ell(\boldsymbol{X}, \boldsymbol{Y}, \overline{\theta}))).\) Here exchanging the limit and the expectation in the first equality is justified by the regularity conditions, and the last equality uses the cyclic property of the trace.

So Part 5 holds and the proof is complete.
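As a quick numerical sanity check on part 5 (again a sketch, not from the original notes), the snippet below assumes NumPy and uses mean estimation under the squared loss \(\ell(y,\theta) = ||y-\theta||_2^2\), chosen because \(\nabla^2 L(\overline{\theta}) = 2\mathbf{I}\), \(\text{cov}(\nabla \ell) = 4\,\text{cov}(Y)\), and the excess risk \(||\hat{\theta}-\overline{\theta}||_2^2\) are all available in closed form.

```python
# Numerical check of part 5 (illustration only): mean estimation under squared loss,
# where the predicted limit (1/2) tr(H^{-1} G) equals tr(cov(Y)).
import numpy as np

rng = np.random.default_rng(1)
p, n, reps = 4, 500, 10_000
A = rng.standard_normal((p, p))
cov_Y = A @ A.T                  # covariance of Y
mu = rng.standard_normal(p)      # theta_bar = E[Y]

# For this loss, L(theta) - L(theta_bar) = ||theta - theta_bar||_2^2 exactly.
excess = np.empty(reps)
for r in range(reps):
    Y = rng.multivariate_normal(mu, cov_Y, size=n)
    theta_hat = Y.mean(axis=0)   # ERM = sample mean
    excess[r] = n * np.sum((theta_hat - mu) ** 2)

H = 2 * np.eye(p)                # Hessian of the population loss
G = 4 * cov_Y                    # covariance of the gradient of the loss at theta_bar
print(excess.mean())                         # Monte Carlo estimate of E[n * excess risk]
print(0.5 * np.trace(np.linalg.inv(H) @ G))  # limit predicted by part 5
```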

Example: When \(\ell\) is the (conditional) negative log-likelihood, the empirical risk minimizer \(\hat{\theta}\) is the MLE.

Accordingly, we assume the “true model” belongs to a parametric family.

Theorem: Assume \(Y|X \sim P(Y|X, \theta_*)\) is the true model. Consider the MLE where \(\ell(X_i , Y_i, \theta)= - \log P(Y_i|X_i, \theta)\). Then:

  1. \(\overline{\theta}= \theta_*\) (population MLE recovers true \(\theta_*\))

  2. \(E[\nabla \ell(X,Y, \overline{\theta})]=0\)

  3. \(\text{cov}(\nabla \ell(X,Y, \overline{\theta})) = \nabla^2 L(\overline{\theta})\)

  4. \(\sqrt{n} (\hat{\theta}- \overline{\theta}) \xrightarrow{\mathcal{D}} N(0, [\nabla^2 L(\overline{\theta})]^{-1})\)

Proof:

  1. The population loss is \(L(\theta) = E[\ell(X_i, Y_i, \theta)]= E[-\log P(Y|X, \theta)]\) (we drop the subscript \(i\) because all the \((X_i, Y_i)\) have the same distribution). So we have,

\[E[-\log P(Y|X, \theta)] = E[ - \log P(Y|X, \theta)+ \log P(Y|X, \theta_*)]- E[\log P(Y|X, \theta_*)]= E\Big[ \log \frac{P(\theta_*)}{P(\theta)} \Big]+C,\] where we write \(P(\theta)\) for \(P(Y|X,\theta)\) and \(C = -E[\log P(Y|X, \theta_*)]\) does not depend on \(\theta\).

By the Law of Iterated Expectation, we have,

\[E\Big[ \log \frac{P(\theta_*)}{P(\theta)} \Big]+C= E\Big[E\Big[\log \frac{P(\theta_*)}{P(\theta)}\Big|X\Big] \Big]+C.\]

The inner conditional expectation is exactly the KL divergence between \(P(\cdot|X,\theta_*)\) and \(P(\cdot|X,\theta)\), so the above equals \(E[KL(P(\theta_*)\,\|\,P(\theta))]+C\geq C\), with equality when \(\theta=\theta_*\). Hence \(L(\theta)\) is minimized at \(\theta_*\), i.e. \(\overline{\theta}=\theta_*\).

  2. \(0=\nabla L(\overline{\theta}) = \nabla E[\ell(X,Y,\overline{\theta})] = E[\nabla \ell (X,Y, \overline{\theta})]\).

  3. Using part 2 (the gradient has mean zero), \(\text{cov} (\nabla \ell(X,Y, \overline{\theta}))= E[ \nabla \ell (X,Y, \overline{\theta})(\nabla \ell (X,Y, \overline{\theta}))^T]\)

\(=E_X [ E_{Y|X}(\nabla \ell(X,Y, \overline{\theta}) \nabla \ell^T (X,Y, \overline{\theta})|X)]\)

\(=E_X[ \int P(Y|X, \overline{\theta})\, \nabla \ell(X,Y, \overline{\theta})\, \nabla \ell^T(X,Y, \overline{\theta})\, dy]\)

\(=E_X[ \int P(Y|X, \overline{\theta})\nabla \log P(Y|X, \overline{\theta}) (\nabla \log P(Y|X, \overline{\theta}) )^T dy]\)

\(=E_X[ \int \frac{\nabla P(Y|X, \overline{\theta}) \nabla^T P(Y|X, \overline{\theta})}{P(Y|X, \overline{\theta})} dy],\)

where the second equality uses that, by part 1, the true conditional density of \(Y\) given \(X\) is \(P(Y|X, \overline{\theta})\), and the third uses \(\nabla \ell = -\nabla \log P\).

Now consider the right hand side of the conclusion in part 3:

\(\nabla^2 L(\overline{\theta}) = \nabla^2 E[ \ell(\overline{\theta})]\)

\(= E[ \nabla^2 (\ell (\overline{\theta}))]\)

\(=E[ \nabla^2 ( - \log P(Y|X, \overline{\theta}))]\)

\(= E_X[ \int P(Y|X, \overline{\theta}) \nabla^2 ( - \log P(Y|X, \overline{\theta})) dy]\)

\(= E_X[ \int - \nabla \Big( \frac{\nabla^T P(Y|X, \overline{\theta})}{P(Y|X, \overline{\theta})}\Big) P(Y|X, \overline{\theta})dy]\)

\(= E_X[ \int - \Big( \frac{\nabla^2 P(Y|X, \overline{\theta}) \cdot P(Y|X, \overline{\theta}) - \nabla P(Y|X, \overline{\theta}) \nabla^T P(Y|X, \overline{\theta})}{P(Y|X, \overline{\theta})^2}\Big) P(Y|X, \overline{\theta})\, dy]\)

\(=E_X[\int -\nabla^2P(Y|X, \overline{\theta}) + \frac{\nabla P(Y|X, \overline{\theta}) \nabla ^T P(Y|X, \overline{\theta})}{P(Y|X, \overline{\theta})} dy ].\)

Note that \[\int -\nabla^2P(Y|X, \overline{\theta})\, dy = -\nabla^2 \int P(Y|X, \overline{\theta}) dy = -\nabla^2(1)=0.\]

So the left hand side equals \(E_X[\int \frac{\nabla P(Y|X, \overline{\theta}) \nabla ^T P(Y|X, \overline{\theta})}{P(Y|X, \overline{\theta})} dy ]\), which is exactly the expression obtained above for \(\text{cov}(\nabla \ell(X,Y, \overline{\theta}))\); this proves part 3.

  4. Part 4 follows from part 3 and Theorem 2.1: substituting \(\text{cov}(\nabla \ell(X,Y, \overline{\theta})) = \nabla^2 L(\overline{\theta})\) into the sandwich covariance \(\mathbf{\Sigma}\) gives \([\nabla^2 L(\overline{\theta})]^{-1}\).

Thus the proof is complete.
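The information identity in part 3 is easy to check numerically. The sketch below (not from the original notes) assumes NumPy and uses a well-specified logistic model, for which \(\nabla \ell(X,Y,\theta) = (\sigma(X^T\theta) - Y)X\) with \(\sigma\) the sigmoid.

```python
# Numerical check of cov(grad of loss) = Hessian of L for a well-specified
# logistic model (illustration only).
import numpy as np

rng = np.random.default_rng(2)
p, N = 3, 500_000
theta_star = np.array([0.5, -1.0, 2.0])

X = rng.standard_normal((N, p))
prob = 1.0 / (1.0 + np.exp(-X @ theta_star))   # P(Y = 1 | X, theta_star)
Y = rng.binomial(1, prob)

# Gradient of l(X, Y, theta) = -log P(Y | X, theta) evaluated at theta_star
grad = (prob - Y)[:, None] * X
cov_grad = grad.T @ grad / N                   # covariance, since E[grad] = 0

# Hessian of the population loss, estimated by E[sigma(1 - sigma) X X^T]
w = prob * (1.0 - prob)
hess = (X * w[:, None]).T @ X / N

print(cov_grad)
print(hess)   # the two matrices should agree up to Monte Carlo error
```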

2.2 Limitations of Asymptotic Analysis

We have no idea how large \(n\) has to be before we can ignore the remainder term. For example, suppose one analysis yields \(\frac{p}{2n} + \frac{1}{n^2}\) and another yields \(\frac{p}{2n} + \frac{p^{100}}{n^2}\). Asymptotic analysis treats both as the same: the dimension \(p\) is held fixed, and the dependency of the remainder term on other quantities is ignored. But for the first, \(\frac{1}{n^2} \leq 10^{-4}\) once \(n \geq 100\), whereas for the second, even \(p=5\) makes \(p^{100}\) astronomically large, so \(n\) must be enormous before the remainder is negligible. This is where non-asymptotic analysis will be useful, meaning we no longer let \(n \to \infty\).
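To make this concrete, here is a tiny calculation (illustrative only, standard library Python): for each remainder we search over powers of two for the smallest \(n\) at which the remainder is at most 10% of the leading \(\frac{p}{2n}\) term.

```python
# Smallest n (over powers of two) at which the remainder is at most 10% of the
# leading p/(2n) term, for p = 5 (illustration only).
p = 5

def smallest_n(remainder):
    n = 1
    while remainder(n) > 0.1 * p / (2 * n):
        n *= 2
    return n

print(smallest_n(lambda n: 1 / n ** 2))                # remainder 1/n^2: prints 4
print(smallest_n(lambda n: float(p) ** 100 / n ** 2))  # remainder p^100/n^2: astronomically large
```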