Chapter 12 Convex Relaxation of ERM
(This chapter was scribed by Xinhai Zhang.)
So far in this class, we have proved upper bounds on the excess risk \(L(\hat h_{erm})-L(h^*)\) of the ERM \[\begin{equation} \hat h_{erm}\triangleq \mathop{\mathrm{\mathop{\mathrm{argmin}}\limits}}_{h\in\mathcal{H}}\frac{1}{n}\sum^n_{i=1}\mathbb{1}(Y_i\neq h(X_i)). \tag{12.1} \end{equation}\]
However, due to the non-convexity of the objective function in (12.1), the optimization problem cannot be solved efficiently. For some choices of \(\mathcal{H}\) and loss functions, it can even be NP-hard. To overcome this computational issue, the basic idea is to minimize a convex upper bound of the \(0\)-\(1\) loss function. We should also require that \(\mathcal{H}\) is a convex set. The minimization (12.1) then becomes a convex optimization problem which can be solved efficiently.
12.1 Definitions
Definition 12.1 (convex set) A set \(C\) is convex if \(\forall x,y\in C\) and \(\lambda\in[0,1]\), \[\lambda x+(1-\lambda)y\in C\]
Definition 12.2 (convex function) A function \(f: D\mapsto\mathbb{R}\) on a convex domain \(D\) is convex if \(\forall x,y\in D\) and \(\lambda\in[0,1]\), \[f(\lambda x+(1-\lambda)y)\leq \lambda f(x)+(1-\lambda)f(y).\]
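As a quick sanity check of Definition 12.2 (an illustration added here, not part of the scribed notes; the helper name `is_convex_on_samples` is made up), the defining inequality can be tested numerically on random points:

```python
import numpy as np

def is_convex_on_samples(f, n_trials=10000, scale=5.0, tol=1e-9, seed=0):
    """Heuristically test f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y)
    on random points; a sanity check, not a proof of convexity."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-scale, scale, n_trials)
    y = rng.uniform(-scale, scale, n_trials)
    lam = rng.uniform(0.0, 1.0, n_trials)
    lhs = f(lam * x + (1 - lam) * y)
    rhs = lam * f(x) + (1 - lam) * f(y)
    return bool(np.all(lhs <= rhs + tol))

print(is_convex_on_samples(np.exp))  # True: exp is convex
print(is_convex_on_samples(np.sin))  # False (with high probability): sin is not
```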
The convex relaxation takes 3 steps:
Relabeling
Previously, for binary classification we had \(Y\in\{0,1\}\). Using the mapping \(Y\mapsto 2Y-1\), the class label \(Y\) is recoded so that \(Y\in\{-1,1\}\). Correspondingly, we seek \(h:\mathcal{X}\mapsto\{-1,1\}\).
Because “\(h(x)\neq Y\)” \(\iff\) “\(h(x)\cdot Y\leq 0\),” we rewrite the \(0\)-\(1\) loss \(\mathbb{1}(Y\neq h(x))\) as \(\mathbb{1}(Yh(x)\leq 0)\).

Soft-Classifier (discriminant function)
The hypothesis class \(\mathcal{H}\) only contains functions taking values in \(\{-1,1\}\). As a result, if \(\mathcal{H}\) contains more than one classifier, say \(h_1\neq h_2\), then \(\mathcal{H}\) is non-convex, since \(\frac{1}{2}h_1+\frac{1}{2}h_2\not\in\mathcal{H}\) (it takes the value \(0\) wherever \(h_1\) and \(h_2\) disagree). Soft classifiers bypass this issue.
Definition 12.3 (soft classifier) A soft classifier is a measurable function \(f:\mathcal{X}\mapsto[-\infty,\infty]\). The hard classifier (or simply classifier) associated to a soft-classifier \(f\) is given by \(h=\text{sign} (f)\).
We will denote by \(\mathcal{F}\) a convex set of soft classifiers. Some common choices of \(\mathcal{F}\) are:
Linear functions
\[\mathcal{F}\triangleq\{\boldsymbol{\beta}\cdot\boldsymbol{x}:\boldsymbol{\beta}\in B\}\] where \(B\) is some convex subset of \(\mathbb{R}^d\), e.g. \(L_1\) ball or \(L_2\) ball in \(\mathbb{R}^d\).
The hard classifier \(h(x)=\text{sign}(f(x))\), \(f\in\mathcal{F}\), splits \(\mathbb{R}^d\) into two half-spaces. Their boundary is a hyperplane through the origin with normal vector \(\boldsymbol{\beta}\).

Majority voting
Given weak classifiers \(h_1,\ldots, h_M\), let \[\mathcal{F}\triangleq \left\{\sum_{j=1}^Mw_jh_j: w_j\geq 0, \sum_{j=1}^Mw_j=1\right\}.\] This is the class of soft classifiers used by boosting.

Basis expansion

Let \(\varphi_j\), \(j=1,2,\ldots\) be a family of orthogonal basis functions, e.g. a Fourier or wavelet basis, and let \[\mathcal{F}\triangleq \left\{\sum_{j=1}^\infty \theta_j\varphi_j:(\theta_1,\theta_2,\ldots)\in\mathcal{\Theta}\right\}\] for some convex parameter set \(\Theta\). (A small code sketch of these soft-classifier classes follows.)
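To make these choices concrete, here is a minimal sketch (my own illustration; the helper names are made up) of a linear soft classifier and a weighted majority vote over weak classifiers, with the associated hard classifier obtained by taking the sign:

```python
import numpy as np

def linear_soft_classifier(beta):
    """Soft classifier f(x) = beta . x; beta is assumed to lie in a convex set B."""
    return lambda x: float(np.dot(x, beta))

def majority_vote_soft_classifier(weak_classifiers, weights):
    """Convex combination sum_j w_j h_j(x) of weak classifiers with values in {-1, +1}."""
    w = np.asarray(weights, dtype=float)
    assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)
    return lambda x: sum(wj * hj(x) for wj, hj in zip(w, weak_classifiers))

def hard_classifier(f):
    """Associated hard classifier h = sign(f); ties at 0 are sent to +1 by convention."""
    return lambda x: 1 if f(x) >= 0 else -1

# Example: two toy weak classifiers on R^2 combined by a weighted vote.
h1 = lambda x: 1 if x[0] > 0 else -1
h2 = lambda x: 1 if x[1] > 0 else -1
f = majority_vote_soft_classifier([h1, h2], [0.7, 0.3])
print(hard_classifier(f)([1.0, -2.0]))  # the weak classifiers disagree; h1 carries more weight
```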
Convex surrogate loss

Given the convex set \(\mathcal{F}\) of soft classifiers, our goal is to find \(f\in\mathcal{F}\) that minimizes \(\frac{1}{n}\sum^n_{i=1}\mathbb{1}(Y_if(X_i)\leq 0)\).
Our strategy, however, is to replace the \(\mathbb{1}(\cdot\leq 0)\) function by a convex upper bound, known as the surrogate loss.
Definition 12.4 (convex surrogate) A function \(\varphi: \mathbb{R}\mapsto\mathbb{R^+}\) is called a convex surrogate if it is a convex, non-increasing function such that \(\varphi(0)=1\) and for all \(z\in\mathbb{R}\), \(\varphi(z)\geq\mathbb{1}(z\leq 0)\).
Some popular convex surrogate loss functions are:
Hinge loss (SVM) \[\varphi(z)=\max(1-z,0)=(1-z)_+\]
Exponential loss (boosting) \[\varphi(z)=\exp(-z)\]
Logistic loss (logistic regression) \[\varphi(z)=\log (1+\exp (-z))\]
To bypass the issue of minimizing a non-convex objective, we consider minimizing the empirical \(\varphi\)-risk instead, \[\hat L_{\varphi,n}(f)\triangleq \frac{1}{n}\sum^n_{i=1}\varphi(y_if(x_i)).\] It is the empirical counterpart of the expected \(\varphi\)-risk, defined as \(L_\varphi(f)=\mathbb{E}[\varphi(Yf(\boldsymbol{X}))]\).
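To make this concrete, the following sketch (my own illustration with simulated data; helper names like `minimize_logistic_risk` are made up) codes the three surrogate losses and minimizes the empirical logistic \(\varphi\)-risk over linear soft classifiers by plain gradient descent; any convex-optimization routine could be substituted.

```python
import numpy as np

# Convex surrogates phi(z), evaluated at margins z = y * f(x)
hinge    = lambda z: np.maximum(1.0 - z, 0.0)
exp_loss = lambda z: np.exp(-z)
logistic = lambda z: np.log1p(np.exp(-z))

def empirical_phi_risk(phi, beta, X, y):
    """(1/n) sum_i phi(y_i * beta.x_i) for the linear soft classifier f(x) = beta.x."""
    return np.mean(phi(y * (X @ beta)))

def minimize_logistic_risk(X, y, lr=0.1, n_iters=2000):
    """Gradient descent on the empirical logistic phi-risk, which is convex in beta."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        margins = y * (X @ beta)
        # gradient of (1/n) sum_i log(1 + exp(-margin_i)) with respect to beta
        grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        beta -= lr * grad
    return beta

# Simulated data with labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X @ np.array([2.0, -1.0]) + 0.3 * rng.normal(size=200))

beta_hat = minimize_logistic_risk(X, y)
print(empirical_phi_risk(logistic, beta_hat, X, y))  # empirical phi-risk of f(x) = beta_hat . x
print(np.mean(y * (X @ beta_hat) <= 0))              # empirical 0-1 risk of sign(f)
```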
12.2 Classification Calibration and Fisher Consistency
(The rest of this chapter was scribed by Wenshu Dai.)
In this section, we will derive the relation between the excess \(\varphi\)-risk of a soft classifier \(f\) and the excess risk under the 0-1 loss of its associated hard classifier \(h=\text{sign}(f)\).
Recall \[\begin{align*} h^*&=\mathop{\mathrm{\mathop{\mathrm{argmin}}\limits}}_{h\in\{-1,1\}^\mathcal{X}} E\big[1(Y\not=h(X))\big]\quad\Big(=\mathop{\mathrm{\mathop{\mathrm{argmin}}\limits}}_{h\in\{-1,1\}^\mathcal{X}} E\big[1(Yh(X)\le 0)\big]\Big),\\ h^*(x)&=\text{sign}\Big(\eta(x)-\frac{1}{2}\Big), \end{align*}\]
where \(\eta(x)=P(Y=1|X=x)\). The Bayes rule \(h^*\) minimizes the 0-1 risk, and \(L(h^*)\) is called the Bayes risk.
For any classifier \(h\), \(L(h)-L(h^*)\) is called the excess risk under the 0-1 loss.
Suppose \(f\) is a soft classifier associated with \(h\), i.e. \(h=\text{sign}(f)\). Then, re-expressing the excess risk, \[L(h)-L(h^*)=E\big[1(Yf(X)\le 0)\big]-L(h^*).\]
Similarly, let
\[L_\varphi^*\triangleq L_\varphi(f_\varphi^*)=\inf_f L_\varphi(f)\] be the \(\varphi\)-risk of the optimal soft classifier \[f_\varphi^*\triangleq\text{argmin}_f L_\varphi(f). \] Then the excess \(\varphi\)-risk of any soft classifier \(f\) is \[L_\varphi(f)-L_\varphi^*.\] We will relate the excess 0-1 risk to the excess \(\varphi\)-risk. But as an initial/minimum requirement, we first show that \(f_\varphi^*\) is equivalent to the Bayes rule \(h^*\). That is, minimizing the \(\varphi\)-risk amounts to the same result as minimizing the 0-1 risk at the population level. This is related to the notion of classification calibration, which we state below.
Recall that \[\begin{align*} L_\varphi^*&=\inf_f L_\varphi(f)\\ &=\inf_f E[\varphi(Yf(x))]\\ &=\inf_f E\big[E[\varphi(Yf(x))\mid X]\big]\\ &=\inf_f E[P(Y=1|X)\varphi(f(x))+P(Y=-1|X)\varphi(-f(x))]\\ &=\inf_f E[\eta(x)\varphi(f(x))+(1-\eta(x))\varphi(-f(x))]. \end{align*}\]
Think of a fixed \(x\), so that both \(\eta(x)\) and \(f(x)\) are fixed. Consider the pointwise version of the above equation.
\[H_\varphi(\eta)\triangleq\inf_{\alpha\in\mathbb R}\{\eta\cdot\varphi(\alpha)+(1-\eta)\varphi(-\alpha)\}.\]
Note \(\eta\) and \(\alpha\) are \(\eta(x)\) and \(f(x)\) with \(x\) fixed.
By construction (the infimum over all measurable \(f\) can be taken pointwise in \(x\)), \[L_\varphi^*=E\big(H_\varphi(\eta(x))\big).\] This construction suggests that for \(x\) fixed and \(\eta(x)\) given, there is an optimal value \(f(x)=\alpha\) that minimizes the conditional \(\varphi\)-risk. Recall that the Bayes rule is \(h^*(x)=\text{sign}(\eta(x)-\frac{1}{2})\). Hence \(\eta(x)-\frac{1}{2}\) can be viewed as a soft classifier associated with \(h^*\). Intuitively, a good soft classifier should behave like the Bayes rule, e.g. \[f(x)\cdot \Big(\eta(x)-\frac{1}{2}\Big)\ge 0,\] that is, at \(x\), \(f(x)\) and \(\eta(x)-\frac{1}{2}\) agree in sign.
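For concreteness (a standard calculation added here, not part of the scribed text), minimizing the conditional \(\varphi\)-risk \(\eta\varphi(\alpha)+(1-\eta)\varphi(-\alpha)\) in closed form for the three surrogates of Section 12.1 gives, for \(\eta=\eta(x)\in(0,1)\), \[f^*_\varphi(x)=\begin{cases} \text{sign}\big(\eta(x)-\frac12\big) & \text{hinge loss (for }\eta(x)\neq\frac12\text{)},\\ \frac12\log\frac{\eta(x)}{1-\eta(x)} & \text{exponential loss},\\ \log\frac{\eta(x)}{1-\eta(x)} & \text{logistic loss}. \end{cases}\] In every case \(f^*_\varphi(x)\) has the same sign as \(\eta(x)-\frac12\), so the associated hard classifier is the Bayes rule; this is the Fisher consistency property discussed below.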
Hence, we also consider \(H_\varphi^-(\eta)\) below, defined with the constraint \(\alpha(\eta-\frac{1}{2})\le 0\), i.e., \(f\) and \(\eta-\frac{1}{2}\) disagree:

\[H_\varphi^-(\eta)\triangleq\inf_{\alpha\in\mathbb R:\ \alpha(\eta-\frac{1}{2})\le 0}\{\eta\varphi(\alpha)+(1-\eta)\varphi(-\alpha)\}.\]
Intuitively, a classifier that disagrees with the Bayes rule \(\text{sign}(\eta(x)-\frac{1}{2})\) will perform worse; classification calibration formalizes this intuition.
Definition: Classification Calibration
We say that a loss function \(\varphi\) is classification calibrated if for all \(\eta\not=\frac{1}{2}\), \[H_\varphi^-(\eta)>H_\varphi(\eta).\]
In other words, minimizing the \(\varphi\)-risk subject to disagreeing with the Bayes rule strictly worsens the (conditional) \(\varphi\)-risk.
**Example: Hinge loss is classification calibrated.**
The hinge loss is
\[\begin{align*} \varphi(z) &= \begin{cases} 1-z & \text{ if}\ z\le 1,\\ 0 & \ \text{if}\ z>1; \end{cases}\\ &=(1-z)_+\\ &=\max(1-z,0) \end{align*}\]
So we have \[\begin{align*} H_\varphi(\eta)&=\inf_{\alpha\in\mathbb R}\{\eta(1-\alpha)_++(1-\eta)(1+\alpha)_+\}\\ &= \inf_{\alpha\in[-1,1]}\{\eta(1-\alpha)_++(1-\eta)(1+\alpha)_+\}\\ &= \inf_{\alpha\in[-1,1]}\{\eta(1-\alpha)+(1-\eta)(1+\alpha)\}\\ &= \inf_{\alpha\in[-1,1]}\{1+\alpha(1-2\eta)\}\\ &=\begin{cases} 2\eta & \text{if}\ \eta\le \frac{1}{2}\ (\text{attained at}\ \alpha=-1),\\ 2(1-\eta) & \text{if}\ \eta\ge \frac{1}{2}\ (\text{attained at}\ \alpha=1), \end{cases}\\ &=2\min(\eta,1-\eta). \end{align*}\]
(The second equality holds because for \(\alpha>1\) the objective equals \((1-\eta)(1+\alpha)\ge 2(1-\eta)\), and for \(\alpha<-1\) it equals \(\eta(1-\alpha)\ge 2\eta\), so restricting \(\alpha\) to \([-1,1]\) does not change the infimum.) This is the minimized conditional \(\varphi\)-risk without any constraint.
\[\begin{align*} H_\varphi^-(\eta)&=\inf_{\alpha:\alpha(\eta-\frac{1}{2})\le 0} \{\eta(1-\alpha)_++(1-\eta)(1+\alpha)_+\}\\ &=1\qquad(\text{attained at}\ \alpha=0). \end{align*}\]
(For \(\eta>\frac12\), the feasible \(\alpha\le 0\) make the term \(\alpha(1-2\eta)\) non-negative on \([-1,0]\), so the objective is at least \(1\), while for \(\alpha<-1\) it equals \(\eta(1-\alpha)>2\eta>1\); the case \(\eta<\frac12\) is symmetric.)
So \(H^-_\varphi(\eta)=1>2\min(\eta,1-\eta)=H_\varphi(\eta)\) for every \(\eta\neq\frac12\), and the hinge loss is classification calibrated.
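As a numerical cross-check of the displays above (an added sketch; the grid search is a simple approximation, not part of the notes), one can approximate \(H_\varphi\) and \(H^-_\varphi\) for the hinge loss and compare with the closed forms \(2\min(\eta,1-\eta)\) and \(1\):

```python
import numpy as np

phi = lambda z: np.maximum(1.0 - z, 0.0)   # hinge loss
alphas = np.linspace(-5, 5, 20001)         # grid of candidate soft-classifier values

def conditional_risk(eta, alpha):
    return eta * phi(alpha) + (1 - eta) * phi(-alpha)

def H(eta):
    return conditional_risk(eta, alphas).min()

def H_minus(eta):
    # restrict to alpha * (eta - 1/2) <= 0, i.e. disagreeing with the Bayes rule
    feasible = alphas[alphas * (eta - 0.5) <= 0]
    return conditional_risk(eta, feasible).min()

for eta in [0.1, 0.3, 0.49, 0.7, 0.9]:
    print(eta, H(eta), 2 * min(eta, 1 - eta), H_minus(eta))  # H ~ 2*min(eta,1-eta), H_minus ~ 1
```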
**Example: Exponential loss is also classification calibrated.**
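For completeness (this computation is an addition to the scribed notes), the analogous calculation for the exponential loss \(\varphi(z)=e^{-z}\) runs as follows. The conditional \(\varphi\)-risk is \(\eta e^{-\alpha}+(1-\eta)e^{\alpha}\); setting its derivative to zero gives \(e^{2\alpha}=\frac{\eta}{1-\eta}\), i.e. \(\alpha^*=\frac12\log\frac{\eta}{1-\eta}\), and plugging back in, \[H_\varphi(\eta)=\eta\sqrt{\frac{1-\eta}{\eta}}+(1-\eta)\sqrt{\frac{\eta}{1-\eta}}=2\sqrt{\eta(1-\eta)}\qquad\text{for }\eta\in(0,1).\] Under the constraint \(\alpha(\eta-\frac12)\le 0\), the unconstrained minimizer \(\alpha^*\) is infeasible for \(\eta\neq\frac12\) (it has the same sign as \(\eta-\frac12\)), and the constrained objective is minimized at \(\alpha=0\), so \(H^-_\varphi(\eta)=1\). Since \(2\sqrt{\eta(1-\eta)}<1\) for all \(\eta\neq\frac12\) (the cases \(\eta\in\{0,1\}\) are immediate, with \(H_\varphi=0\)), the exponential loss is classification calibrated.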
Theorem
For \(\varphi\) convex, \(\varphi\) is classification calibrated (C.C.) if and only if
\(\varphi\) is differentiable at 0,
\(\varphi'(0)<0\).
See Bartlett, Jordan & McAuliffe (2006, JASA), Theorem 2, for the complete proof.
We can easily see that the hinge loss, exponential loss, and logistic loss are all differentiable at \(0\) with \(\varphi'(0)<0\) (\(\varphi'(0)=-1\), \(-1\), and \(-\frac{1}{2}\), respectively), so they are all C.C.
C.C. essentially says that at each \(x\) (pointwise), one is better off agreeing with the Bayes soft classifier \(\eta(x)-\frac{1}{2}\) in order to minimize the \(\varphi\)-risk. This can be viewed as the pointwise version of Fisher consistency, which we state below.
Definition: Fisher consistency for classification
A classification procedure based on a loss \(\varphi\) is Fisher consistent if \[f_\varphi^*\triangleq\text{argmin}_fE[\varphi(Yf(X))]\] has the same sign as \(\eta(x)-\frac{1}{2}\) at every \(x\).
In other words, \[\text{sign}(\text{argmin}_{f(x)}E[\varphi(Yf(x))|X=x])=\text{sign}(\eta(x)-\frac{1}{2})\ \ \forall x\]
The next theorem provides conditions for F.C. for a general family of loss functions.
Theorem
Consider a function \(\varphi\) satisfying the following assumptions:
\(\varphi(z)<\varphi(-z)\ \ \forall z>0,\)
\(\varphi'(0)\) exists and \(\varphi'(0)\not=0\).
If \(E(\varphi(Yf(X))|X=x)\) has a global minimizer \(\bar f(x)\), and \(\eta(x)\not=\frac{1}{2}\), then \[\text{sign}(\bar f(x))=\text{sign}\Big(\eta(x)-\frac12\Big).\] Note the theorem does not assume convexity, monotonicity, or differentiability beyond the existence of \(\varphi'(0)\). It is clear that all three popular convex surrogate loss functions satisfy these conditions.
Proof: \[E(\varphi(Yf(X))|X=x)=\eta(x)\varphi(f(x))+(1-\eta(x))\varphi(-f(x)).\] Hence, defining \[C(\alpha)=\eta(x)\varphi(\alpha)+(1-\eta(x))\varphi(-\alpha),\] we have \[C(\bar f(x))\le C(\alpha)\ \ \forall\alpha,\] since \(\bar f(x)=\text{argmin}_\alpha C(\alpha)\).
- \(C(\alpha)\) is not minimized at \(\alpha=0\).
To see this, note that since \(\varphi'(0)\) exists,

\[\begin{align*} C'(\alpha)\big|_{\alpha=0}&=\big[\eta(x)\varphi'(\alpha)-(1-\eta(x))\varphi'(-\alpha)\big]\Big|_{\alpha=0}\\ &=[2\eta(x)-1]\varphi'(0)\not=0, \end{align*}\] so \(\alpha=0\) is not a stationary point of \(C\) and hence not a minimizer.
- If \(\bar f(x)\) is a global minimizer of \(C(\alpha)\), then
\[C(\bar f(x))\le C(-\bar f(x))\] \[\begin{align*} 0&\ge C(\bar f(x))-C(-\bar f(x))\\ &=[\eta(x)\varphi(\bar f(x))+(1-\eta(x))\varphi(-\bar f(x))]-[\eta(x)\varphi(-\bar f(x))+(1-\eta(x))\varphi(\bar f(x))]\\ &=[2\eta(x)-1][\varphi(\bar f(x))-\varphi(-\bar f(x))] \end{align*}\]
Now if \(\eta(x)>\frac{1}{2}\), then \(\varphi(\bar f(x))-\varphi(-\bar f(x))\le 0\). By assumption 1, \(\bar f(x)<0\) is impossible (it would give \(-\bar f(x)>0\) and hence \(\varphi(-\bar f(x))<\varphi(\bar f(x))\)); since also \(\bar f(x)\neq 0\), we conclude \(\bar f(x)>0\).

Similarly, if \(\eta(x)<\frac{1}{2}\), then \(\varphi(\bar f(x))-\varphi(-\bar f(x))\ge 0\), which rules out \(\bar f(x)>0\); hence \(\bar f(x)<0\), i.e. \(-\bar f(x)>0\).
Therefore \[\text{sign}(\bar f(x))=\text{sign}\Big(\eta(x)-\frac{1}{2}\Big)\ \ \forall \eta(x)\not=\frac12.\ \ \ \ \ \ \ \square\]
Remark: since the hinge loss, exponential loss, and logistic loss satisfy the conditions in the theorem, minimizing \(E(\varphi(Yf(X))|X)\) pointwise yields the same sign as the Bayes rule, i.e. it amounts to the same as minimizing \[E(1(Yf(X)\le 0)|X).\]
If we are allowed to assume that \(\varphi\) is differentiable and convex (e.g. exponential, logistic), then we have a simpler proof of Fisher consistency.
Set the derivative of \(C(\alpha)\) to \(0\) at the minimizer \(\bar\alpha\): \[0=C'(\bar\alpha)=\eta(x)\varphi'(\bar\alpha)-(1-\eta(x))\varphi'(-\bar\alpha)\] \[\Rightarrow\frac{\eta(x)}{1-\eta(x)}=\frac{\varphi'(-\bar\alpha)}{\varphi'(\bar\alpha)}.\]
Since \(\varphi\) is convex, \(\varphi'\) is non-decreasing, and since a convex surrogate is non-increasing, \(\varphi'\le 0\). So (assuming \(\varphi'(\bar\alpha)<0\)) \[\begin{align*} \eta(x)>\frac12&\Leftrightarrow\frac{\eta(x)}{1-\eta(x)}>1\\ &\Leftrightarrow \varphi'(-\bar\alpha)<\varphi'(\bar\alpha)\qquad(\text{dividing by }\varphi'(\bar\alpha)<0\text{ flips the inequality})\\ &\Rightarrow -\bar\alpha<\bar\alpha,\ \text{i.e. }\bar\alpha>0, \end{align*}\] by the monotonicity of \(\varphi'\). Hence \(\bar\alpha\) again has the same sign as \(\eta(x)-\frac12\).
To summarize: classification calibration means \(H^-_\varphi(\eta)>H_\varphi(\eta)\) for every \(\eta\not=\frac12\); by the BJM theorem, for a differentiable, convex, non-increasing \(\varphi\) this is equivalent to \(\varphi'(0)<0\); and Fisher consistency then follows from the first-order condition \(0=C'(\bar\alpha)=\eta(x)\varphi'(\bar\alpha)-(1-\eta(x))\varphi'(-\bar\alpha)\) as above.
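As a numerical illustration of Fisher consistency (an added sketch assuming SciPy is available; for the logistic loss the population minimizer is the log-odds \(\log\frac{\eta}{1-\eta}\)), one can minimize \(C(\alpha)\) for several values of \(\eta\) and check the sign of the minimizer:

```python
import numpy as np
from scipy.optimize import minimize_scalar

logistic = lambda z: np.log1p(np.exp(-z))

def C(alpha, eta):
    """Conditional phi-risk at a fixed x with eta = P(Y=1|X=x)."""
    return eta * logistic(alpha) + (1 - eta) * logistic(-alpha)

for eta in [0.2, 0.4, 0.6, 0.9]:
    alpha_bar = minimize_scalar(lambda a: C(a, eta), bounds=(-20, 20), method="bounded").x
    print(eta, alpha_bar, np.log(eta / (1 - eta)))    # minimizer ~ log-odds
    assert np.sign(alpha_bar) == np.sign(eta - 0.5)   # same sign as eta - 1/2 (Fisher consistency)
```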
12.3 Excess Risk Bounds
In 12.2, we showed conditions under which a soft classifier that minimizes the expected \(\varphi-\) risk leads to the Bayes rule.
In this section, we will show that the excess 0-1 risk of the hard classifier \(\text{sign}(f)\) is controlled by the excess \(\varphi\)-risk of the soft classifier \(f\).
The idea is that if we obtain a classifier \(f\) with \(L_\varphi(f)\to L_\varphi^*=\inf_fL_\varphi(f)\), then we should also expect that \[L(\text{sign}(f))\to L^*=\inf_hL(h).\]
We will present two comparable results, one based on classification calibration, and the other based on a more specific condition.
Theorem (Bartlett, Jordan & McAuliffe, 2006)
Let \(\varphi\) be convex and classification calibrated. Define the \(\psi\)-transform \[\psi(u)\triangleq H_\varphi^-\Big(\frac{1+u}{2}\Big)-H_\varphi\Big(\frac{1+u}{2}\Big).\] Then for all \(f\), \[\underbrace{\psi\big(L(\text{sign}(f))-L^*\big)}_{\psi(\text{excess risk})}\le \underbrace{L_\varphi(f)-L_\varphi^*}_{\text{excess}\ \varphi\text{-risk}}.\]
If \(\varphi\) is classification calibrated and convex, then \[\psi(u)=\varphi(0)-H_\varphi(\frac{1+u}{2})\]
Remark: For example, with hinge loss \(\varphi(z)=(1-z)_+\), we had previously shown that \[H_\varphi(\eta)=2\min (\eta,1-\eta)\] and \[H_\varphi^-(\eta)=1\]
Plug in \(\eta=\frac{1+u}{2}\) to obtain
\[\begin{align*} \psi(u)&=H_\varphi^-(\frac{1+u}{2})-H_\varphi(\frac{1+u}{2})\\ &=1-2\min(\frac{1+u}{2},\frac{1-u}{2})\\ &=1-\min(1+u,1-u)\\ &=1+\max(-1-u,-1+u)\\ &=\max(u,-u)\\ &=|u| \end{align*}\]
Therefore, part 1 in the theorem implies that \[\begin{align*} &|L(\text{sign}(f))-L^*|\le L_\varphi(f)-L_\varphi^*\\ \Rightarrow&L(\text{sign}(f))-L^*\le L_\varphi(f)-L_\varphi^* \end{align*}\]
So for the hinge loss, the excess \(\varphi-\)risk gives us an upper bound on the excess 0-1 risk.
Proof: \[\begin{align*} &L(\text{sign}(f))-L^*\\ =&L(\text{sign}(f))-L(h^*)\qquad (h^*=\text{sign}(\eta(\cdot)-\tfrac{1}{2})\ \text{is the Bayes rule})\\ =&E\big[|2\eta(x)-1|\cdot 1\big(\text{sign}(f(x))\not=\text{sign}(\eta(x)-\tfrac{1}{2})\big)\big]\\ \triangleq&E(g(x)), \end{align*}\] where the second equality is the standard pointwise expression of the excess 0-1 risk.
By Jensen’s inequality, if \(\psi\) is convex, then \[\begin{align*} &\psi(L(\text{sign}(f))-L^*)\\ =&\psi(E(g(x)))\\ \le&E(\psi(g(x))) \end{align*}\]
Let’s verify that \(\psi\) is convex.
First, since \(\varphi\) is convex and \(\varphi'(0)<0\) (by classification calibration and the theorem in Section 12.2), we have \[\begin{align*} H_\varphi^-(\eta)&=\inf_{\alpha:\alpha(\eta-\frac{1}{2})\le 0}\{\eta\varphi(\alpha)+(1-\eta)\varphi(-\alpha)\}\\ &\ge\inf_{\alpha:\alpha(\eta-\frac{1}{2})\le 0}\varphi\big(\eta\alpha+(1-\eta)(-\alpha)\big)\qquad\text{(Jensen's inequality)}\\ &=\inf_{\alpha:\alpha(2\eta-1)\le 0}\varphi\big((2\eta-1)\alpha\big)\\ &\ge \varphi(0), \end{align*}\] where the last inequality holds because, for \(z\le 0\), convexity gives \(\varphi(z)\ge\varphi(0)+\varphi'(0)z\ge\varphi(0)\). On the other hand, \(\alpha=0\) is feasible and attains the value \(\varphi(0)\), so in fact \(H^-_\varphi(\eta)=\varphi(0)\) and \[\psi(u)=\varphi(0)-H_\varphi\Big(\frac{1+u}{2}\Big).\]
By definition, \(H_\varphi(\eta)=\inf_\alpha\{\eta\varphi(\alpha)+(1-\eta)\varphi(-\alpha)\}\) is a pointwise infimum of linear functions of \(\eta\) (indexed by \(\alpha\)); so \(H_\varphi\) is concave, and \(\psi(u)=\varphi(0)-H_\varphi(\frac{1+u}{2})\) is convex!
So indeed we have \[\begin{align*} &\psi\big(L(\text{sign}(f))-L^*\big)\\ \le & E[\psi(g(x))]\\ =&E\Big[\psi\Big(|2\eta(x)-1|\cdot 1\big(\text{sign}(f(x))\not=\text{sign}(\eta(x)-\tfrac{1}{2})\big)\Big)\Big]\\ =&E\Big[\psi\big(|2\eta(x)-1|\big)\cdot 1\big(\text{sign}(f(x))\not=\text{sign}(\eta(x)-\tfrac{1}{2})\big)\Big]\qquad(\text{since }\psi(0)=0)\\ =&E\Big[\big(H^-_\varphi(\eta(x))-H_\varphi(\eta(x))\big)\cdot 1\big(\text{sign}(f(x))\not=\text{sign}(\eta(x)-\tfrac{1}{2})\big)\Big]\\ \le & E\big[E[\varphi(Yf(x))\mid X]-H_\varphi(\eta(x))\big]\\ =&L_\varphi(f)-L_\varphi^*, \end{align*}\] where \(\psi(u)=H^-_\varphi(\frac{1+u}{2})-H_\varphi(\frac{1+u}{2})\). It is easy to verify that \(\psi(0)=H^-_\varphi(\frac12)-H_\varphi(\frac12)=0\) (at \(\eta=\frac12\) the constraint in \(H^-_\varphi\) is vacuous), which is what lets us pull the indicator out of \(\psi\). The fourth line uses \(\psi(|2\eta-1|)=H^-_\varphi(\eta)-H_\varphi(\eta)\), which holds because \(H_\varphi\) and \(H^-_\varphi\) are symmetric about \(\eta=\frac12\) (replace \(\alpha\) by \(-\alpha\) in their definitions).

For the final inequality, note that given \(x\), if \(\text{sign}(f(x))\not=\text{sign}(\eta(x)-\frac{1}{2})\), i.e. \(f(x)(\eta(x)-\frac{1}{2})\le 0\), then \[E[\varphi(Yf(x))|X=x]\ge H_\varphi^-(\eta(x))\] by the definition of \(H^-_\varphi\); and when \(\text{sign}(f(x))=\text{sign}(\eta(x)-\frac{1}{2})\), the indicator vanishes and we only need \[E[\varphi(Yf(x))|X=x]\ge H_\varphi(\eta(x)),\] which holds by the definition of \(H_\varphi\). \(\square\)
Remark: the theorem implies that for such surrogates, there exists a \(\psi\)-transform relating the excess \(\varphi\)-risk to the excess 0-1 risk.
In practice, to use the result of this theorem, we need to derive the form of the \(\psi-\)transform. Recall that for hinge loss, \(\psi(u)=|u|\).
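When a closed form is less convenient, the \(\psi\)-transform can also be obtained numerically from its definition. The sketch below (an added illustration using a plain grid search) does this for the exponential loss and compares with the closed form \(\psi(u)=1-\sqrt{1-u^2}\), which follows from \(H_\varphi(\eta)=2\sqrt{\eta(1-\eta)}\) and \(H^-_\varphi(\eta)=1\) computed earlier:

```python
import numpy as np

phi = lambda z: np.exp(-z)            # exponential surrogate
alphas = np.linspace(-10, 10, 40001)  # grid of candidate soft-classifier values

def H(eta):
    return (eta * phi(alphas) + (1 - eta) * phi(-alphas)).min()

def H_minus(eta):
    a = alphas[alphas * (eta - 0.5) <= 0]
    return (eta * phi(a) + (1 - eta) * phi(-a)).min()

def psi(u):
    eta = (1 + u) / 2
    return H_minus(eta) - H(eta)

for u in [0.1, 0.3, 0.5, 0.9]:
    print(u, psi(u), 1 - np.sqrt(1 - u**2))  # numerical psi vs closed form
```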
The next lemma, due to Tong Zhang (2004), parallels the above theorem and is based on a different condition.
Lemma (Zhang's Lemma)
Let \(\varphi:\mathbb R\to \mathbb R^+\) be a convex non-increasing function such that \(\varphi(0)=1\). If there exist \(c>0\) and \(\gamma\in[0,1]\) such that \[\Big|\eta-\frac{1}{2}\Big|\le c\big(1-H_\varphi(\eta)\big)^\gamma\quad\forall\eta\in[0,1],\] then \[L(\text{sign}(f))-L^*\le 2c\big(L_\varphi(f)-L_\varphi^*\big)^\gamma.\]
Remark: for the hinge loss, we derived earlier that \[H_\varphi(\eta)=2\min(\eta,1-\eta)=1-|2\eta-1|=1-2\Big|\eta-\frac{1}{2}\Big|.\]
\[\Rightarrow 1-H_\varphi(\eta)=|1-2\eta|=2\Big|\eta-\frac{1}{2}\Big|\] \[\Rightarrow \Big|\eta-\frac{1}{2}\Big|=\frac{1}{2}\big(1-H_\varphi(\eta)\big)^1\ \Rightarrow c=\frac{1}{2},\ \gamma=1.\] Hence, \[L(\text{sign}(f))-L^*\le L_\varphi(f)-L_\varphi^*\] for the hinge loss, which matches the result of BJM 2006.
Proof: first, note that \[\begin{align*} H_\varphi(\eta)&=\inf_\alpha\{\eta\varphi(\alpha)+(1-\eta)\varphi(-\alpha)\}\\ &\le \eta\varphi(0)+(1-\eta)\varphi(0)\\ &=\varphi(0)\\ &=1 \end{align*}\]
So \(1-H_\varphi(\eta)\ge 0\) and the lemma statement is well-defined.
Next, \[\begin{align*} L(\text{sign}(f))-L^*&=E\Big[|2\eta(x)-1|\cdot 1\big(\text{sign}(f(x))\not=\text{sign}(\eta(x)-\tfrac{1}{2})\big)\Big]\\ &\le 2cE\Big[\big(1-H_\varphi(\eta(x))\big)^\gamma\, 1\big(\text{sign}(f(x))\not=\text{sign}(\eta(x)-\tfrac{1}{2})\big)\Big]\\ &=2cE\Big[\Big(\big(1-H_\varphi(\eta(x))\big)\, 1\big(\text{sign}(f(x))\not=\text{sign}(\eta(x)-\tfrac{1}{2})\big)\Big)^\gamma\Big]\\ &\le 2c\Big\{E\Big[\big(1-H_\varphi(\eta(x))\big)\, 1\big(\text{sign}(f(x))\not=\text{sign}(\eta(x)-\tfrac{1}{2})\big)\Big]\Big\}^\gamma, \end{align*}\] where the third line uses that the indicator equals its \(\gamma\)-th power, and the last step is Jensen's inequality, since \(u\mapsto u^\gamma\) is concave for \(\gamma\in[0,1]\).
To this end, given each \(x\in \mathcal X\), we will show that \[\big(1-H_\varphi(\eta(x))\big)\, 1\big(\text{sign}(f(x))\not=\text{sign}(\eta(x)-\tfrac{1}{2})\big) \le E(\varphi(Yf(x))|X=x)-H_\varphi(\eta(x)). \tag{*}\] This implies the lemma: taking the expectation of (*) over \(x\) bounds the quantity inside the braces above by \(L_\varphi(f)-L_\varphi^*\).
On the RHS, note that \[E(\varphi(Yf(x))|X=x)=\eta(x)\varphi(f(x))+(1-\eta(x))\varphi(-f(x))\] and \[H_\varphi(\eta(x))=\inf_{\alpha}\{\eta(x)\varphi(\alpha)+(1-\eta(x))\varphi(-\alpha)\}.\]
Hence the RHS of (*) is non-negative. Therefore the case \(\text{sign}(f(x))=\text{sign}(\eta(x)-\frac{1}{2})\), in which the LHS equals \(0\), follows trivially.
For the case \(\text{sign}(f(x))\not=\text{sign}(\eta(x)-\frac{1}{2})\), we have \[\begin{align*} E(\varphi(Yf(x))|X=x)&=\eta(x)\varphi(f(x))+(1-\eta(x))\varphi(-f(x))\\ &\ge\varphi\big(\eta(x)f(x)+(1-\eta(x))(-f(x))\big)\qquad(\text{Jensen's inequality, }\varphi\text{ convex})\\ &=\varphi\big((2\eta(x)-1)f(x)\big)\\ &\ge\varphi(0)\\ &=1, \end{align*}\] where the last inequality holds since \((2\eta(x)-1)f(x)\le 0\) when the signs disagree and \(\varphi\) is non-increasing. Hence \(E(\varphi(Yf(x))|X=x)-H_\varphi(\eta(x))\ge 1-H_\varphi(\eta(x))\), which is exactly the LHS of (*) in this case. This completes the proof of (*) and hence the lemma. \(\square\)
Remarks:
It is not hard to check the \(c\) and \(\gamma\) values for the other common surrogate loss functions (exponential, logistic); a numerical check for the exponential loss is sketched below.

Comparing the BJM theorem and Zhang's lemma, the BJM theorem holds under weaker conditions than Zhang's lemma.
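For instance, for the exponential loss one can check that \(c=\frac{1}{\sqrt 2}\) and \(\gamma=\frac12\) satisfy the condition of Zhang's lemma; the snippet below (an added numerical check, using \(H_\varphi(\eta)=2\sqrt{\eta(1-\eta)}\) derived earlier) verifies this on a grid of \(\eta\) values.

```python
import numpy as np

c, gamma = 1 / np.sqrt(2), 0.5
etas = np.linspace(0.0, 1.0, 10001)
H = 2 * np.sqrt(etas * (1 - etas))      # H_phi(eta) for the exponential loss
lhs = np.abs(etas - 0.5)
rhs = c * (1 - H) ** gamma
print(bool(np.all(lhs <= rhs + 1e-9)))  # True; tolerance guards floating-point error near eta = 1/2
```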