Chapter 6 Fast Rates under the Margin Condition

(The majority of this chapter was scribed by Xiyong Yan.)

6.1 Noiseless Case

A natural (albeit somewhat naive) case is the completely noiseless case. Here we have \(\eta(X)\in \{0,1\}\) almost surely, so \(Var(Y|X)=0\), and for all \(h\), \(L(h)-L(h^*)=E[|2\eta(X)-1|\cdot\mathbb{1}\{h(X)\neq h^*(X)\}]=P(h(X)\neq h^*(X))\).
Let us now denote \(Z_i(h)=\mathbb{1}\{h^*(X_i)\neq y_i\}-\mathbb{1}\{h(X_i)\neq y_i\}\).
[Note: the 0-1 loss of the fixed function \(h^*\) minus our usual 0-1 loss.]
Notice \(|Z_i(h)|=\mathbb{1}\{h^*(X_i)\neq h(X_i)\}\) and \(Var(Z_i(h))\leq E[Z_i^2(h)]=E[|Z_i(h)|]=P(h^*(X_i)\neq h(X_i))=L(h)-L(h^*)\).
We will make use of Bernstein’s inequality, since in this case \(Var(Z_i(h))\) will be small. To this end, \(\frac{1}{n}\sum_{i=1}^n Var(Z_i(h))\leq P(h^*(X)\neq h(X))\triangleq\sigma_h^2\).
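For reference, the form of Bernstein’s inequality we will use: if \(W_1,\dots,W_n\) are independent with \(E[W_i]=0\), \(|W_i|\leq C\), and \(\frac{1}{n}\sum_{i=1}^n Var(W_i)\leq \sigma^2\), then for all \(t>0\),
\[P\Big[\frac{1}{n}\sum_{i=1}^n W_i>t\Big]\leq \exp\Big(-\frac{nt^2}{2\sigma^2+2Ct/3}\Big).\]
With \(C=2\), as will be the case below, the denominator is \(2\sigma^2+4t/3\).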
We will make a fairly strong assumption on \(\mathcal{H}\): that \(h^*\in \mathcal{H}\). This eliminates the approximation error; in this case \(\bar h=h^*\).

Theorem: Assume \(\mathcal{H}=\{h_1,\dots,h_M\}\) with \(h^*\in \mathcal{H}\), so that
\(h^*=\bar h=\mathop{argmin}_{h\in\mathcal{H}} L(h)\), and let
\(\hat h=\mathop{argmin}_{h\in\mathcal{H}} \hat L_n(h)\).

Assume noiseless: \(\eta(X)\in\{0,1\}\) almost surely. Then, with probability at least \(1-\delta\), \(L(\hat h)-L(h^*)\leq \frac{4\log(M/\delta)}{n}\).

Proof:
Write \(\tilde Z_i(h)\triangleq Z_i(h)-E[Z_i(h)]\). Note \(E[\tilde Z_i(h)]=0\),
\(\frac{1}{n}\sum_{i=1}^n Var(\tilde Z_i(h))\leq \sigma_h^2\), and
\(|\tilde Z_i(h)|\leq 2\triangleq C\).
We apply Bernstein’s inequality to \(\{\tilde Z_i(h_j)\}_{i=1,...,n}\) for a given \(h_j \in \mathcal{H}\): \(P[\frac{1}{n}\sum_{i=1}^n (Z_i(h_j)-E[Z_i(h_j)])>t]\leq \exp(-\frac{nt^2}{2\sigma_{h_j}^2+4t/3})\). To bound the RHS by \(\delta/M\), we need \(\exp(-\frac{nt^2}{2\sigma_{h_j}^2+4t/3})\leq \delta/M\). It is sufficient to let \(nt^2\geq \log(M/\delta)\cdot 2\cdot 2\sigma_{h_j}^2\); solving for \(t\), we have \(t\geq \sqrt{\frac{4\sigma_{h_j}^2\log(M/\delta)}{n}}\). Also, \(nt^2 \geq \log(M/\delta)\cdot 2\cdot 4t/3\), so \(t\geq\frac{8}{3n}\log(M/\delta)\). (If both hold, then \(2\sigma_{h_j}^2+4t/3\leq nt^2/\log(M/\delta)\).) Hence,
\(t\geq \max\Big\{\sqrt{\frac{4\sigma_{h_j}^2\log(M/\delta)}{n}},\frac{8}{3n}\log(M/\delta)\Big\}\triangleq t_0(h_j)\).
Then, for each \(j\), \(\frac{1}{n}\sum_{i=1}^n (Z_i(h_j)-E[Z_i(h_j)])\geq t_0(h_j)\) with probability at most \(\delta/M\). By a union bound, the probability that there exists \(j\in \{1,...,M\}\) with \(\frac{1}{n}\sum_{i=1}^n (Z_i(h_j)-E[Z_i(h_j)])\geq t_0(h_j)\) is at most \(\delta\). In other words, \(\frac{1}{n}\sum_{i=1}^n (Z_i(h_j)-E[Z_i(h_j)])\leq t_0(h_j)\) for all \(j\in\{1,...,M\}\) with probability at least \(1-\delta\). Next, since \(\hat L_n(\hat h)\leq \hat L_n(h^*)\), \(\frac{1}{n}\sum_{i=1}^n Z_i(h)=\hat L_n(h^*)-\hat L_n(h)\), and \(E[Z_i(h)]=L(h^*)-L(h)\), we get \(L(\hat h)-L(h^*)\leq [L(\hat h)-L(h^*)]-[\hat L_n(\hat h)-\hat L_n(h^*)]= \frac{1}{n}\sum_{i=1}^n (Z_i(\hat h)-E[Z_i(\hat h)])\). Thus, \(P[L(\hat h)-L(h^*)\leq t_0(\hat h)]\geq P[\forall h_j,\ \frac{1}{n}\sum_{i=1}^n (Z_i(h_j)-E[Z_i(h_j)])\leq t_0(h_j)]\geq 1-\delta\).

Moreover, \(\sigma_{\hat h}^2=P(h^*(X)\neq \hat h(X))=L(\hat h)-L(h^*)\) in the noiseless case. Thus, with high probability, \[\begin{aligned} L(\hat h)-L(h^*)\leq t_0(\hat h)=\max\Big\{\sqrt{\frac{4\sigma_{\hat h}^2\log(M/\delta)}{n}},\frac{8}{3n}\log(M/\delta)\Big\}=\max\Big\{\sqrt{\frac{4(L(\hat h)-L(h^*))\log(M/\delta)}{n}},\frac{8}{3n}\log(M/\delta)\Big\}. \end{aligned}\] Hence, \[\begin{aligned} L(\hat h)-L(h^*)\leq \frac{8}{3n}\log(M/\delta) \quad\text{or}\quad L(\hat h)-L(h^*)\leq \frac{4\log(M/\delta)}{n}.\end{aligned}\] Therefore, \[\begin{aligned} L(\hat h)-L(h^*)\leq \max\Big\{\frac{8}{3n}\log(M/\delta),\frac{4\log(M/\delta)}{n}\Big\}=\frac{4\log(M/\delta)}{n}. \end{aligned}\]
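To make the bound concrete, here is a minimal simulation sketch of the noiseless case, under an assumed toy setup (threshold classifiers on \(\mathrm{Unif}(0,1)\) data, with \(h^*\) the threshold at 1/2); it checks that the ERM excess risk stays below \(\frac{4\log(M/\delta)}{n}\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: X ~ Uniform(0,1), noiseless labels Y = 1{X >= 0.5},
# and a finite class of threshold classifiers h_t(x) = 1{x >= t}, with h* at t = 0.5.
thresholds = np.linspace(0.0, 1.0, 11)          # M = 11 classifiers, includes t = 0.5
M, n, delta = len(thresholds), 2000, 0.05

def erm_excess_risk():
    X = rng.uniform(size=n)
    Y = (X >= 0.5).astype(int)
    # empirical 0-1 risk of each classifier in H
    emp_risk = [np.mean((X >= t).astype(int) != Y) for t in thresholds]
    t_hat = thresholds[int(np.argmin(emp_risk))]
    # in this setup, L(h_t) - L(h*) = P(h_t(X) != h*(X)) = |t - 0.5|
    return abs(t_hat - 0.5)

excess = [erm_excess_risk() for _ in range(200)]
bound = 4 * np.log(M / delta) / n
print(np.mean(excess) <= bound)
```

With noiseless labels the ERM essentially always recovers the \(t=0.5\) threshold, so the average excess risk sits well inside the \(O(\log(M/\delta)/n)\) bound.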

6.2 Noise Condition

The noiseless condition is too strong and unrealistic. Next we consider the case where noise is present but can be controlled. For example, instead of \(\eta(X)\in \{0,1\}\), \(\eta\) is simply bounded away from 1/2.

Definition: Massart’s noise condition.
For some \(\gamma\in (0,1/2]\), \(|\eta(X)-1/2|\geq \gamma\) almost surely. We can still obtain a fast rate under this noise condition.

Theorem: Suppose \(\mathcal{H}\) is finite with \(|\mathcal{H}|=M\) and \(h^*\in \mathcal{H}\). If Massart’s noise condition is satisfied with constant \(\gamma\), then \(L(\hat h)-L(h^*)\leq \frac{2\log(M/\delta)}{\gamma n}\) with probability at least \(1-\delta\). Note \(\gamma=1/2\) recovers the noiseless bound.

Proof:
The beginning of the proof is identical to the noiseless case. In particular, define \(Z_i(h)= \mathbb{1}\{h^*(X_i)\neq y_i\}-\mathbb{1}\{h(X_i)\neq y_i\}\); then \(L(\hat h)-L(h^*)\leq \frac{1}{n}\sum_{i=1}^n (Z_i(\hat h)-E[Z_i(\hat h)])\).
We only need to bound the RHS using Bernstein’s inequality. For a given \(j\), \[\begin{aligned} P\Big[\frac{1}{n}\sum_{i=1}^n (Z_i(h_j)-E[Z_i(h_j)])\geq t\Big]\leq \exp\Big(-\frac{nt^2}{2\sigma_{h_j}^2+4t/3}\Big). \end{aligned}\] Take \(t=t_0(h_j) \triangleq \max\Big\{\sqrt{\frac{4\sigma_{h_j}^2\log(M/\delta)}{n}},\frac{8}{3n}\log(M/\delta)\Big\}\). We have, for each \(j\), \(\frac{1}{n}\sum_{i=1}^n (Z_i(h_j)-E[Z_i(h_j)])\geq t_0(h_j)\)

with probability at most \(\delta/M\). So there exists \(j\) with \(\frac{1}{n}\sum_{i=1}^n (Z_i(h_j)-E[Z_i(h_j)])\geq t_0(h_j)\) with probability at most \(\delta\). This implies that, for all \(j\), \(\frac{1}{n}\sum_{i=1}^n (Z_i(h_j)-E[Z_i(h_j)])\leq t_0(h_j)\) with probability at least \(1-\delta\). Thus, \(L(\hat h)-L(h^*)\leq t_0(\hat h)\) with probability at least \(1-\delta\). Now, since \(|2\eta(X)-1|=2|\eta(X)-1/2|\geq2\gamma\), \[\begin{aligned} L(\hat h)-L(h^*)=E[|2\eta(X)-1|\cdot\mathbb{1}\{\hat h(X)\neq h^*(X)\}]\geq 2\gamma P(\hat h(X)\neq h^*(X)). \end{aligned}\] Note that \[\begin{aligned} P(\hat h(X)\neq h^*(X))=\sigma_{\hat h}^2\leq(L(\hat h)-L(h^*))/(2\gamma). \end{aligned}\] Hence, \[\begin{aligned} L(\hat h)-L(h^*)\leq \max\Big\{\sqrt{\frac{4(L(\hat h)-L(h^*))\log(M/\delta)/(2\gamma)}{n}},\frac{8}{3n}\log(M/\delta)\Big\} \end{aligned}\] with high probability. Then, \[\begin{aligned} L(\hat h)-L(h^*)\leq \max\Big\{\frac{2\log(M/\delta)}{n\gamma},\frac{8}{3n}\log(M/\delta)\Big\}=\frac{2\log(M/\delta)}{\gamma n}, \end{aligned}\] as required (using \(\gamma\leq 1/2\), so \(\frac{2}{\gamma n}\geq \frac{4}{n}\geq \frac{8}{3n}\)).
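As with the noiseless case, a minimal simulation sketch can illustrate the Massart bound, under an assumed toy setup (threshold classifiers, \(\eta\) jumping from \(1/2-\gamma\) to \(1/2+\gamma\) at \(x=1/2\)):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy setup under Massart's condition: X ~ Uniform(0,1),
# eta(x) = 1/2 + gamma for x >= 1/2 and 1/2 - gamma otherwise,
# so |eta(X) - 1/2| >= gamma holds with gamma = 0.4.
# H is a finite class of threshold classifiers h_t(x) = 1{x >= t}, h* at t = 0.5.
thresholds = np.linspace(0.0, 1.0, 11)
gamma, n, delta, M = 0.4, 5000, 0.05, len(thresholds)

def erm_excess_risk():
    X = rng.uniform(size=n)
    eta = np.where(X >= 0.5, 0.5 + gamma, 0.5 - gamma)
    Y = (rng.uniform(size=n) < eta).astype(int)   # noisy labels, Y = 1 w.p. eta(X)
    emp_risk = [np.mean((X >= t).astype(int) != Y) for t in thresholds]
    t_hat = thresholds[int(np.argmin(emp_risk))]
    # in this setup, L(h_t) - L(h*) = E[|2 eta - 1| 1{h_t != h*}] = 2 * gamma * |t - 0.5|
    return 2 * gamma * abs(t_hat - 0.5)

excess = [erm_excess_risk() for _ in range(100)]
bound = 2 * np.log(M / delta) / (gamma * n)
print(np.mean(excess) <= bound)
```

Even with label noise, the margin \(\gamma\) keeps the ERM pinned to the correct threshold, so the average excess risk stays within the \(\frac{2\log(M/\delta)}{\gamma n}\) bound.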

Massart’s condition is still somewhat strong, since it requires that no \(X\) leads to \(\eta(X)\) near 1/2. The “margin condition” below relaxes it.
Definition: Mammen-Tsybakov’s noise/margin condition. There exist \(\alpha \in(0,1)\), \(C_0>0\), \(t_0\in(0,1/2]\) s.t. \(P[|\eta(X)-1/2|\leq t]\leq C_0\, t^{\frac{\alpha}{1-\alpha}}\) for all \(t\in(0,t_0]\).

Note: This is sometimes written as \(P[|\eta(X)-1/2|\leq t]\leq Ct^\alpha, \ \alpha>0\).
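As a quick illustrative example of the condition, consider \(X\sim\mathrm{Unif}(0,1)\) with \(\eta(x)=x\). Then for \(t\in(0,1/2]\),
\[P[|\eta(X)-1/2|\leq t]=P[1/2-t\leq X\leq 1/2+t]=2t,\]
so the condition holds with \(\frac{\alpha}{1-\alpha}=1\), i.e. \(\alpha=1/2\), \(C_0=2\), \(t_0=1/2\).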


(The below lecture notes were scribed by Zhou Wang.)

::: {.lemma #01nLloss} Under Tsybakov’s noise condition with constants \(\alpha, c_0, t_0\), \(\forall~h\), we have \[\mathbb{P}[h(X)\neq h^*(X)]\leq C\cdot [L(h)-L(h^*)]^\alpha.\] :::

Proof. Note that \(\mathbb{P}[|\eta(X)-\frac{1}{2}|\leq t]\leq c_0\cdot t^{\frac{\alpha}{1-\alpha}}, ~\forall~ t\in(0, t_0]\). Moreover, \[\begin{aligned} L(h)-L(h^*)&=\mathbb{E}\left[|2\eta(X)-1|\cdot\mathbb{1}\{h(X)\neq h^*(X)\}\right]\\ &=\mathbb{E}\left[|2\eta(X)-1|\cdot\mathbb{1}\{|\eta(X)-1/2|>t\}\cdot\mathbb{1}\{h(X)\neq h^*(X)\}\right]\\ &~+\mathbb{E}\left[|2\eta(X)-1|\cdot\mathbb{1}\{|\eta(X)-1/2|\leq t\}\cdot\mathbb{1}\{h(X)\neq h^*(X)\}\right]\\ &\geq \mathbb{E}\left[|2\eta(X)-1|\cdot\mathbb{1}\{|\eta(X)-1/2|>t\}\cdot\mathbb{1}\{h(X)\neq h^*(X)\}\right]\\ &\geq 2t\cdot \mathbb{E}\left[\mathbb{1}\{|\eta(X)-1/2|>t, h(X)\neq h^*(X)\}\right]\\ &\geq 2t\cdot \mathbb{P}[h(X)\neq h^*(X)]-2t\cdot \mathbb{P}[|\eta(X)-1/2|\leq t]\\ &\geq 2t\cdot \mathbb{P}[h(X)\neq h^*(X)]-2tc_0\cdot t^{\frac{\alpha}{1-\alpha}}\\ &=2t\cdot \mathbb{P}[h(X)\neq h^*(X)]-2c_0\cdot t^{\frac{1}{1-\alpha}}. \end{aligned}\]

Now we can choose \(t\) to be any value in \((0, t_0]\) of the form \(t=c\cdot \left(\mathbb{P}[h(X)\neq h^*(X)]\right)^{\frac{1-\alpha}{\alpha}}\), where \(c\) is to be determined. In particular, \[\begin{aligned} L(h)-L(h^*)&\geq 2t\cdot \mathbb{P}[h(X)\neq h^*(X)]-2c_0\cdot t^{\frac{1}{1-\alpha}}\\ &=2c\cdot \left(\mathbb{P}[h(X)\neq h^*(X)]\right)^{\frac{1-\alpha}{\alpha}}\cdot \mathbb{P}[h(X)\neq h^*(X)]-2c_0c^{\frac{1}{1-\alpha}}\left(\mathbb{P}[h(X)\neq h^*(X)]\right)^{\frac{1-\alpha}{\alpha}\cdot\frac{1}{1-\alpha}}\\ &=2c\cdot \left(\mathbb{P}[h(X)\neq h^*(X)]\right)^{\frac{1}{\alpha}}-2c_0c^{\frac{1}{1-\alpha}}\left(\mathbb{P}[h(X)\neq h^*(X)]\right)^{\frac{1}{\alpha}}\\ &\geq c\cdot \left(\mathbb{P}[h(X)\neq h^*(X)]\right)^{\frac{1}{\alpha}} ~~~\mbox{for some}~c~\mbox{small enough}. \end{aligned}\]

Therefore, with \(C=c^{-\alpha}\), we conclude \[\mathbb{P}[h(X)\neq h^*(X)]\leq C\cdot [L(h)-L(h^*)]^\alpha.\]
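A small numerical sanity check of the lemma’s relation, using an assumed closed-form example: for \(X\sim\mathrm{Unif}(0,1)\) with \(\eta(x)=x\) (where Tsybakov’s condition holds with \(\alpha=1/2\)) and threshold classifiers \(h_t(x)=\mathbb{1}\{x\geq t\}\), one can compute \(\mathbb{P}[h_t\neq h^*]=|t-1/2|\) and \(L(h_t)-L(h^*)=\int_{\min(t,1/2)}^{\max(t,1/2)}|2x-1|\,dx=(t-1/2)^2\), so the lemma holds with equality and \(C=1\):

```python
import numpy as np

# Illustrative check (assumed setup): X ~ Unif(0,1), eta(x) = x, alpha = 1/2.
# For h_t(x) = 1{x >= t}:
#   disagreement probability P[h_t != h*] = |t - 1/2|
#   excess risk L(h_t) - L(h*)            = (t - 1/2)^2
# The lemma predicts P[h != h*] <= C * (excess risk)^alpha; here equality with C = 1.
alpha = 0.5
for t in np.linspace(0.05, 0.95, 19):
    disagreement = abs(t - 0.5)
    excess_risk = (t - 0.5) ** 2
    assert abs(disagreement - excess_risk ** alpha) < 1e-12
print("lemma relation verified for eta(x) = x")
```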


The preceding proofs under the noiseless condition or Massart’s condition can be adapted to obtain a fast rate using this lemma, which relates \(\sigma^2_{\hat h}\) to \(L(\hat h)-L(h^*)\).

Theorem 6.1 If Tsybakov’s condition is satisfied with constants \(\alpha, c_0, t_0\), then there exists a constant \(c\) such that with probability at least \(1-\delta\), \[L(\hat h) -L(h^*)\leq c\left[\frac{2}{n}\log\frac{M}{\delta}\right]^{\frac{1}{2-\alpha}}+\frac{8}{3n}\log\frac{M}{\delta}.\]

This is not a fast rate, but an interpolation between the slow rate (\(\alpha\rightarrow0\)) and the fast rate (\(\alpha\rightarrow1\)). Note that \(\hat h\) does not depend on \(\alpha\) at all: it automatically adapts to the noise level \(\alpha\), which is a nice feature.
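The interpolation can be checked directly from the exponent of \(\frac{1}{n}\log\frac{M}{\delta}\) in the bound (a trivial numeric sketch, illustrative values of \(\alpha\) only):

```python
# Exponent of (1/n) in the Tsybakov bound is 1 / (2 - alpha):
# alpha -> 0 recovers the slow rate n^{-1/2}; alpha -> 1 the fast rate n^{-1}.
for alpha in [0.0, 0.5, 1.0]:
    print(alpha, 1 / (2 - alpha))
```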

Proof. The majority of the proof remains the same, until the last few steps. After establishing that \[L(\hat h)-L(h^*)\leq t_0(\hat h)=\max\left\{\sqrt{\frac{4}{n}\sigma^2_{\hat h}\log\frac{M}{\delta}}, \frac{8}{3n}\log\frac{M}{\delta}\right\},\] by the lemma above we have \(\sigma^2_{\hat h}=\mathbb{P}[\hat h(X)\neq h^*(X)]\leq C[L(\hat h)-L(h^*)]^\alpha\), thus \[\begin{aligned} L(\hat h)-L(h^*)\leq \max\left\{\sqrt{\frac{4}{n}C\left[L(\hat h)-L(h^*)\right]^\alpha\log\frac{M}{\delta}}, ~\frac{8}{3n}\log\frac{M}{\delta}\right\}, \end{aligned} \] which implies \[L(\hat h)-L(h^*)\leq \sqrt{\frac{4}{n}C\left[L(\hat h)-L(h^*)\right]^\alpha\log\frac{M}{\delta}} ~~\Rightarrow~~ L(\hat h)-L(h^*)\leq \left(\frac{4C}{n}\log\frac{M}{\delta}\right)^{\frac{1}{2-\alpha}},\] and hence \[L(\hat h)-L(h^*)\leq\max\left\{\left(\frac{4C}{n}\log\frac{M}{\delta}\right)^{\frac{1}{2-\alpha}}, ~\frac{8}{3n}\log\frac{M}{\delta}\right\}\leq \left(\frac{4C}{n}\log\frac{M}{\delta}\right)^{\frac{1}{2-\alpha}}+ \frac{8}{3n}\log\frac{M}{\delta}.\]


Remark: The noiseless case and the noisy case under Massart’s condition share the same fast-rate order, but the latter has a larger constant due to \(\gamma\). For the noisy case under Tsybakov’s condition, the rate interpolates between the slow rate and the fast rate.