Chapter 13 Bouning the excess \(\varphi\)-risk
(This chapter was scribed by Zifan Huang.)
In the last chapter, we introduced the general framework of convex relaxation. We also introduced the notion of classification calibration and Fisher consistency which imply that minimizing \(0-1\) risk and \(\varphi\) risk lead to the same Bayes rule at the population level. Furthermore, BJM and Zhang’s results imply that the excess \(\varphi\) risk can bound the excess \(0-1\) risk for certain convex surrogate loss functions.
In the current chapter, we show how to bound the excess \(\varphi\)-risk.
13.1 General case
Let \(\hat{f}\) be the soft classifer minimizing the empirical \(\varphi\)-risk.
\[\hat{f}=\mathop{\mathrm{\mathop{\mathrm{argmin}}\limits}}_{f\in\mathcal{F}}~\underbrace{\frac{1}{n}\sum_{i=1}^n \varphi(y_if(x_i))}_{\overset{\triangle}{=}~\hat{L}_{\varphi,n}~(f)}.\] After we got that, our goal is to bound \(L_\varphi(\hat{f})=L_\varphi(\bar{f})\), where \(\bar{f}=\mathop{\mathrm{\mathop{\mathrm{argmin}}\limits}}_{f\in\mathcal{F}} \mathbb{E}[\varphi(yf(x))]\).
Note that as before we are focusing on the estimation error only.
\[\begin{aligned} L_\varphi(\hat{f})~~~~~~~-\underbrace{L_{\varphi^*}}_{\varphi~\mbox{risk for Bayes rule}}=\underbrace{L_\varphi(\hat{f})-L_\varphi(\bar{f})}_{\mbox{estimation error}}+\underbrace{L_\varphi(\bar{f})-L_\varphi(f_{\varphi^*})}_{\mbox{approximation error}}. \end{aligned}\]
Note that our analysis below will be very similar to the analysis in chapter 7 under \(0-1\) loss. Use this opportunity as a refresher.
Assumptions:
\(f\) maps to \([-1,1]\) and \(\varphi\) is \(L\)-Lipschitz i.e. \(|\varphi(z)-\varphi(z')|\le L||z-z'||\), \(\forall\) \(z,z'\).
Road map:
Show that \(L_\varphi(\hat{f})-L_\varphi(\bar{f})\) is bounded by \(2~\sup_{f\in\mathcal{F}}|\hat{L}_{\varphi,n}(f)-L_\varphi(f)|.\)
Define \(g(z_1,...,z_n)\overset{\triangle}{=}\sup_{f\in\mathcal{F}}|\hat{L}_{\varphi,n}(f)-L_\varphi (f)|\) and show that \(g(z_1:z_n)\) concentrates near \(\mathbb{E}[g(z_1,...,z_n)]\) using bounded difference inequality.
Use the symmetrization trick to show that \[\mathbb{E}[g(z_1,...,z_n)]\le~2~Rad(\varphi\circ\widetilde{\mathcal{F}}).\]
Assume \(\varphi\) is \(L\)-Lipschitz and show that \[Rad(\varphi\circ\widetilde{\mathcal{F}})\le L\cdot Rad(\mathcal{F}).\]
Combine all and show that \[L_\varphi(\hat{f})-L_\varphi(\bar{f})\le 4\cdot L\cdot Rad(\mathcal{F})+O(\frac{1}{\sqrt{n}}).\]
Details
As before, we have \[\begin{aligned} L_\varphi(\hat{f})-L_\varphi(\bar{f})&=L_\varphi(\hat{f})-L_{\varphi,n}(\hat{f})\\ &+\hat{L}_{\varphi,n}(\bar{f})-L_\varphi(\bar{f})\\ &+\underbrace{L_{\varphi,n}(\hat{f})-\hat{L}_{\varphi,n}(\bar{f})}_{\le 0}\\ &\le |L_\varphi(\hat{f})-L_{\varphi,n}(\hat{f})|+|\hat{L}_{\varphi,n}(\bar{f})-L_\varphi(\bar{f})|\\ &\le 2\cdot\sup|L_{\varphi}(f)-\hat{L}_{\varphi,n}(\bar{f})|. \end{aligned}\]
Define \[\begin{aligned} g(z_1,...,z_n)&\overset{\triangle}{=}\sup_{f\in\mathcal{F}}|\hat{L}_{\varphi,n}(f)-L_\varphi (f)|\\ &=\sup_{f\in\mathcal{F}}|\frac{1}{n}\sum_{i=1}^n\varphi(y_if(x_i))-\mathbb{E}[\varphi(yf(x))]| \end{aligned}\]
Note that \(\varphi\) is non-increasing and assume the \(\varphi\) is \(L\)-Lipschitz. Then
\[\begin{aligned} &|g(z_1,...,z_i,...,z_n)-g(z_1,...,z_i',...,z_n)|\\ &\le \frac{1}{n}[\varphi(-1)-\varphi(1)]~~~~~~~~~~~~(\mbox{due to}~f\in[-1,1],~\varphi(-1)>\varphi(1)~)\\ &\le\frac{2\cdot L}{n}~~~~~~~~~~~~(\mbox{due to}~\varphi~\mbox{is L-Lipschitz}). \end{aligned}\]
Apply bounded difference inequality. \[\begin{aligned} &P(|g(z_1,...,z_n)-\mathbb{E}[g(z_1,...,z_n)]|>t)\\ &\le2\cdot \underbrace{\exp(-\frac{2t^2}{\sum_{i=1}^n(\frac{2L}{n})^2})}_{\mbox{let this}~=~\delta}. \end{aligned}\]
Then \(\log(\frac{2}{\delta})=\frac{2t^2}{4L^2/n}\Longrightarrow t=L\sqrt{\frac{2}{n}\log(\frac{2}{\delta})}\).
\(\Rightarrow\) w.p. at least \(1-\delta\),
\[\underbrace{g(z_1,...,z_n)}_{=\sup_{f\in\mathcal{F}}|\hat{L}_{\varphi,n}(f)-L_\varphi (f)|}\le\mathbb{E}[g(z_1,...,z_n)]+L\sqrt{\frac{2}{n}\log(\frac{2}{\delta})}.\]
- Use symmetrization to bound \(\mathbb{E}[g(z_1,...,z_n)]\) by the Rademacher Complexity.
Recall \(D=\{z_1,...,z_n\}\), \(D'=\{z_1',...,z_n'\}\) are the original and ghost samples and \(\sigma_i\) is Rademacher random variable.
\[\begin{aligned} \mathbb{E}[g(z_1,...,z_n)]&~~~~~~~~~=~~~~~~~~~~\mathbb{E}[\sup_{f\in\mathcal{F}}|\hat{L}_{\varphi,n}(f)-\underbrace{L_\varphi (f)}_{\mathbb{E}[\hat{L}_{\varphi,n}'(f)]}|]\\ &\overset{Conditional~on~D}{=} \mathbb{E}[\sup_{f\in\mathcal{F}}|\hat{L}_{\varphi,n}(f)-\mathbb{E}[\hat{L}_{\varphi,n}'(f|D)]|]~~~~(\mbox{due to}~D'~\mbox{and}~D~\mbox{are independent})\\ &~~~~~~~~~=~~~~~~~~~\mathbb{E}[\sup_{f\in\mathcal{F}}|\mathbb{E}(\hat{L}_{\varphi,n}(f)-\hat{L}_{\varphi,n}'(f)|D)|]\\ &~~~~~~~~~\le~~~~~~~~~\mathbb{E}[\sup_{f\in\mathcal{F}}|\mathbb{E}(\hat{L}_{\varphi,n}(f)-\hat{L}_{\varphi,n}'(f))||D]~~~~(|u|~is~convex\Rightarrow\mbox{Jensen's ineq.})\\ &~~~~~~~~~\le~~~~~~~~~\mathbb{E}[\mathbb{E}(\sup_{f\in\mathcal{F}}|(\hat{L}_{\varphi,n}(f)-\hat{L}_{\varphi,n}'(f)|)|D]~~~~(\mbox{due to push "sup" inside})\\ &~~~~~~~~~=~~~~~~~~~\mathbb{E}[\sup_{f\in\mathcal{F}}|(\hat{L}_{\varphi,n}(f)-\hat{L}_{\varphi,n}'(f)|]\\ &~~~~~~~~~=~~~~~~~~~\mathbb{E}[\sup_{f\in\mathcal{F}}\frac{1}{n}|\sum_{i=1}^n\varphi(y_if(x_i))-\sum_{i=1}^n\varphi(y_i'f(x_i'))|]\\ &~~\overset{D~and~D'~i.i.d.}{=}~\mathbb{E}[\sup_{f\in\mathcal{F}}\frac{1}{n}|\sum_{i=1}^n\{\sigma_i\varphi(y_if(x_i))-\sigma_i\varphi(y_i'f(x_i'))\}|]\\ &~~\underset{triangle~ineq.}{\le}~~2\cdot \mathbb{E}[\sup_{f\in\mathcal{F}}|\frac{1}{n}\sum_{i=1}^n\sigma_i\varphi(y_if(x_i))|]\\ &~~~~~~~~~=~~~~~~~~~2\cdot Rad_n(\varphi\circ\widetilde{\mathcal{F}}), \end{aligned}\] Where \(\widetilde{\mathcal{F}}=\{(x,y)\mapsto y\cdot f(x):f\in\mathcal{F}\}\).
- Since we assumed \(\varphi\) to be \(L\)-Lipschitz.
\[Rad_n(\varphi\circ\widetilde{\mathcal{F}})\le L\cdot Rad_n(\widetilde{\mathcal{F}}),\] \[\begin{aligned} Rad_n(\widetilde{\mathcal{F}}) &= \mathbb{E}[\sup_{f\in\mathcal{F}}|\frac{1}{n}\sum_{i=1}^n\sigma_i\varphi(y_if(x_i))|]\\ &=\mathbb{E}[\sup_{f\in\mathcal{F}}|\frac{1}{n}\sum_{i=1}^n\widetilde\sigma_if(x_i)|]~~~~~~~\mbox{where} ~\widetilde\sigma_i~\mbox{is also Radmacher r.v.}\\ &=Rad_n(\mathcal{F}). \end{aligned}\]
So \(Rad_n(\varphi\circ\widetilde{\mathcal{F}})\le L\cdot Rad_n(\widetilde{\mathcal{F}})\).
- Combine all.
\[\begin{aligned} &L_\varphi(\hat{f})-L_\varphi(\bar{f})\\ &\le 2\cdot\sup|L_{\varphi}(f)-\hat{L}_{\varphi,n}(\bar{f})|\\ &\le 2\cdot(\mathbb{E}[\sup|L_{\varphi}(f)-\hat{L}_{\varphi,n}(\bar{f})|+L\sqrt{\frac{2}{n}\log(\frac{2}{\delta})}])~~~~~~~~\mbox{w.p. at least}~1-\delta\\ &\le 4\cdot Rad_n(\varphi\circ\widetilde{\mathcal{F}})+2L\sqrt{\frac{2}{n}\log(\frac{2}{\delta})}\\ &\le 4L\cdot Rad_n({\mathcal{F}})+2L\sqrt{\frac{2}{n}\log(\frac{2}{\delta})}\\ \end{aligned}\]
13.2 Analysis of Boosting
In this section, we specialize the above analysis to a particular learn model:
Boosting: The basic idea of boosting is to convert a set of weak learners (weak classifiers) into a strong one by using the weighted average of the weak learners.
More precisely, we consider the following function class is boosting.
\[\mathcal{F}=\{\sum_{j=1}^M\theta_j h_j(\cdot)=|\vec{\theta}|\le 1,~\theta_j\ge 0,~h_j:x\mapsto\{1,-1\},~j=1,...,M~\mbox{are the given weak learners}\}.\]
Let’s first compute the Rademacher complexity of \(\mathcal{F}\), which combined with results from the last section, gives a bound on \(L_\varphi(\hat{f})-L_\varphi(\bar{f})\). Further more, using results from the last chapter, we can obtain a bound on \(L(sign(f))-L(h^*)\).
Note that \(\mathcal{F}\) is a convex hull of
\[\mathcal{H}\overset{\triangle}{=}\{h_1(\cdot),h_2(\cdot),...,h_M(\cdot)\}.\] Therefore, \(Rad_n(\mathcal{F})=Rad_n(\mathcal{H})\).
Note that \(\mathcal{H}\) is finite and \(\sup_{h_j\in\mathcal{H}}\frac{1}{n}\sum_{i=1}^n h_j^2(x_i)=1\).
Therefore \(Rad_n(\mathcal{H})\le\sqrt{\frac{2\cdot\log(|\mathcal{H}|)}{n}}=\sqrt{\frac{2\cdot\log(M)}{n}}\) due to Massart’s finite lemma.
Using the results from the last section, we have, w.p. at least \(1-\delta\), \[L_\varphi(\hat{f})-L_\varphi(\bar{f})\le 4\cdot L\cdot \sqrt{\frac{2\cdot\log(M)}{n}}+2\cdot L\sqrt{\frac{2}{n}\log(\frac{2}{\delta})}.\]
But this is still not the excess risk under \(0-1\) loss, which is \(L(sign(f))-L(h^*)\). This is provided by the next theorem.
Theorem: excess \(0-1\) risk of boosting.
\(\varphi\) is a convex nonincreasing surrogate loss with \(\varphi(0)=1\), \(L\)-Lipschitz and satisfies \[|\eta-\frac{1}{2}|\le c\cdot(1-H_\varphi(\eta))^\nu.~\forall \eta\in(0,1),~\mbox{for}~c>0~\mbox{and}~ \nu\in[0,1].\] Same as in Zhang’s lemma.
Then, w.p. at least \(1-\delta\), \[L(sign(\hat{f}))-L(h^*)\le 2c\cdot(4L\sqrt{\frac{2\cdot\log(M)}{n}})^\nu+2c(2L\sqrt{\frac{2}{n}\log(\frac{2}{\delta})})^\nu+2c(\underbrace{L_\varphi(\bar{f})-L_\varphi^*}_{approx.~\varphi~error})^\nu.\]
Proof. \[\begin{aligned} L(sign(\hat{f}))-L(h^*)&\le 2C(L_\varphi(\hat{f}))-L_\varphi^*)^\nu\\ &\le 2\cdot c(\underbrace{L_\varphi(\hat{f}))-L_\varphi(\bar{f})}_{\mbox{est. error}}+\underbrace{L_\varphi(\bar{f})-L_\varphi^*}_{\mbox{approx. error}})^\nu\\ &\le 2\cdot c(4L\sqrt{\frac{2\cdot\log(M)}{n}}+2L\sqrt{\frac{2}{n}\log(\frac{2}{\delta})}+\mbox{approx. error})^\nu\\ &\le 2c(4L\sqrt{\frac{2\cdot\log(M)}{n}})^\nu+2c(2L\sqrt{\frac{2}{n}\log(\frac{2}{\delta})})^\nu+2c(\mbox{approx. error})^\nu~~~~~~~~~~~~~~~~~\mbox{w.p. at least}~1-\deta. \end{aligned}\]
(Due to for \(\nu\in[0,1]\), \(a_i\ge 0\), \((a_1,a_2,a_3)^\nu\ge a_1^\nu+a_2^\nu+a_3^\nu\))
Remark:
We could also use BJM instead of Zhang’s lemma. Both relate the excess of risk to the excess \(\varphi-\)risk.
“approx. error in \(\varphi-\)risk” is an argument of the bound
exp loss is natural for boosting.