Chapter 12 Convex Relaxation of ERM

(This chapter was scribed by Xinhai Zhang.)

So far in this class, we have proved upper bounds on the excess risk \(L(\hat h_{erm})-L(h^*)\) of the ERM \[\begin{equation} \hat h_{erm}\triangleq \mathop{\mathrm{argmin}}_{h\in\mathcal{H}}\frac{1}{n}\sum^n_{i=1}\mathbb{1}(Y_i\neq h(X_i)). \tag{12.1} \end{equation}\]

However, due to the non-convexity of the objective function in (12.1), the optimization cannot be solved efficiently. For some choices of \(\mathcal{H}\) and loss functions, the optimization problem can be NP-hard. To overcome this computational issue, the basic idea is to minimize a convex upper bound of the \(0\)-\(1\) loss function. We also require that \(\mathcal{H}\) be a convex set. The minimization (12.1) then becomes a convex optimization problem which can be solved efficiently.

12.1 Definitions

Definition 12.1 (convex set) A set \(C\) is convex if \(\forall x,y\in C\) and \(\lambda\in[0,1]\), \[\lambda x+(1-\lambda)y\in C\]

Definition 12.2 (convex function) A function \(f: D\mapsto\mathbb{R}\) on a domain \(D\) is convex if \(\forall x,y\in D\) and \(\lambda\in[0,1]\), \[f(\lambda x+(1-\lambda)y)\leq \lambda f(x)+(1-\lambda)f(y).\]

The convex relaxation takes 3 steps:

  1. Relabeling
    Previously, for binary classification, \(Y\in\{0,1\}\). Using the mapping \(Y\mapsto 2Y-1\), the class label \(Y\) is relabeled so that \(Y\in\{-1,1\}\). Correspondingly, we seek a classifier \(h:\mathcal{X}\mapsto\{-1,1\}\).
    Because “\(h(x)\neq Y\)” \(\iff\) “\(Yh(x)\leq 0\)” when both take values in \(\{-1,1\}\), we rewrite the \(0\)-\(1\) loss \(\mathbb{1}(Y\neq h(x))\) as \(\mathbb{1}(Yh(x)\leq 0)\).

  2. Soft-Classifier (discriminant function)
    The hypothesis class \(\mathcal{H}\) only contains functions taking values in \(\{-1,1\}\). As a result, if \(\mathcal{H}\) contains more than one classifier, say \(h_1\) and \(h_2\), then \(\mathcal{H}\) is non-convex since \(\frac{1}{2}h_1+\frac{1}{2}h_2\not\in\mathcal{H}\). Soft classifiers bypass this issue.

Definition 12.3 (soft classifier) A soft classifier is a measurable function \(f:\mathcal{X}\mapsto[-\infty,\infty]\). The hard classifier (or simply classifier) associated to a soft-classifier \(f\) is given by \(h=\text{sign} (f)\).

We will denote by \(\mathcal{F}\) a convex set of soft classifiers. Some common choices of \(\mathcal{F}\) are:

  • linear functions
    \[\mathcal{F}\triangleq\{\boldsymbol{\beta}\cdot\boldsymbol{x}:\boldsymbol{\beta}\in B\}\] where \(B\) is some convex subset of \(\mathbb{R}^d\), e.g. \(L_1\) ball or \(L_2\) ball in \(\mathbb{R}^d\).
    The hard classifier \(h(x)=\text{sign}(f(x))\), \(f\in\mathcal{F}\), splits \(\mathbb{R}^d\) into two half-spaces. Their boundary is a hyperplane through the origin with normal vector \(\boldsymbol{\beta}\).

  • Majority voting
    Given weak classifiers \(h_1,\ldots, h_M\), let \[\mathcal{F}\triangleq \left\{\sum_{j=1}^Mw_jh_j: w_j\geq 0, \sum_{j=1}^Mw_j=1\right\}\] This is the class of soft classifiers used in boosting.

  • Let \(\varphi_j\), \(j=1,2,\ldots\), be a family of orthogonal basis functions, e.g. the Fourier basis or a wavelet basis, and let \[\mathcal{F}\triangleq \left\{\sum_{j=1}^\infty \theta_j\varphi_j:(\theta_1,\theta_2,\ldots)\in\mathcal{\Theta}\right\}\]

  3. Surrogate loss
    Given the convex set \(\mathcal{F}\) of soft classifiers, our goal is to find \(f\in\mathcal{F}\) minimizing \(\frac{1}{n}\sum^n_{i=1}\mathbb{1}(Y_if(X_i)\leq 0)\).
    Our strategy, however, is to replace the \(\mathbb{1}(\cdot\leq 0)\) function by a convex upper bound, known as a surrogate loss.

Definition 12.4 (convex surrogate) A function \(\varphi: \mathbb{R}\mapsto\mathbb{R^+}\) is called a convex surrogate if it is a convex, non-increasing function such that \(\varphi(0)=1\) and for all \(z\in\mathbb{R}\), \(\varphi(z)\geq\mathbb{1}(z\leq 0)\).

Some popular convex surrogate loss functions are:

  • Hinge loss (SVM) \[\varphi(z)=\max(1-z,0)=(1-z)_+\]

  • Exponential loss (boosting) \[\varphi(z)=\exp(-z)\]

  • Logistic loss (logistic regression) \[\varphi(z)=\log (1+\exp (-z))\] (Note: as written, \(\varphi(0)=\log 2\); rescaling to \(\varphi(z)=\log_2(1+\exp(-z))\) makes \(\varphi(0)=1\), as required by Definition 12.4.)
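
As a quick sanity check (a small illustrative script, not part of the original notes; the function names are our own), the following verifies numerically that each surrogate dominates the \(0\)-\(1\) loss \(\mathbb{1}(z\leq 0)\) and satisfies \(\varphi(0)=1\), using the rescaled logistic loss mentioned above.

```python
# Numerical check that each convex surrogate upper bounds the 0-1 loss and has phi(0) = 1.
import numpy as np

def hinge(z):
    return np.maximum(1.0 - z, 0.0)

def exponential(z):
    return np.exp(-z)

def logistic(z):
    return np.log1p(np.exp(-z)) / np.log(2.0)  # log_2(1 + e^{-z}), rescaled so phi(0) = 1

def zero_one(z):
    return (z <= 0).astype(float)              # the 0-1 loss 1(z <= 0)

z = np.linspace(-5, 5, 2001)
for name, phi in [("hinge", hinge), ("exponential", exponential), ("logistic", logistic)]:
    assert np.all(phi(z) >= zero_one(z) - 1e-12), name   # phi dominates the 0-1 loss
    assert abs(phi(np.array([0.0]))[0] - 1.0) < 1e-12, name  # phi(0) = 1
print("all three surrogates dominate the 0-1 loss and satisfy phi(0) = 1")
```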

To bypass the issue of minimizing a non-convex objective, we consider minimizing the empirical \(\varphi\)-risk instead, \[\hat L_{\varphi,n}(f)\triangleq \frac{1}{n}\sum^n_{i=1}\varphi(Y_if(X_i)).\] It is the empirical counterpart of the expected \(\varphi\)-risk, defined as \(L_\varphi(f)=\mathbb{E}[\varphi(Yf(X))]\).
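
To make the relaxation concrete, here is a minimal sketch (on synthetic data, with illustrative names and step sizes of our choosing) of minimizing the empirical hinge \(\varphi\)-risk over linear soft classifiers \(f(\boldsymbol{x})=\boldsymbol{\beta}\cdot\boldsymbol{x}\), with \(\boldsymbol{\beta}\) constrained to an \(L_2\) ball, using projected subgradient descent.

```python
# Sketch: minimize the empirical hinge phi-risk over a convex set of linear soft classifiers.
import numpy as np

rng = np.random.default_rng(0)
n, d, radius = 200, 5, 2.0
beta_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
Y = np.where(X @ beta_true + 0.3 * rng.normal(size=n) > 0, 1, -1)  # labels relabeled to {-1, +1}

def empirical_hinge_risk(beta):
    return np.mean(np.maximum(1.0 - Y * (X @ beta), 0.0))

beta = np.zeros(d)
for t in range(1, 501):
    margins = Y * (X @ beta)
    # subgradient of (1/n) sum_i (1 - Y_i beta.X_i)_+ with respect to beta
    grad = -(X * (Y * (margins < 1.0))[:, None]).mean(axis=0)
    beta -= (1.0 / np.sqrt(t)) * grad        # decaying step size
    norm = np.linalg.norm(beta)
    if norm > radius:                        # project back onto the L2 ball B
        beta *= radius / norm

print("empirical hinge risk:", empirical_hinge_risk(beta))
print("training 0-1 error  :", np.mean(np.sign(X @ beta) != Y))
```

Because the hinge objective is convex in \(\boldsymbol{\beta}\) and the \(L_2\) ball is a convex set, this is a convex program, in contrast to directly minimizing the \(0\)-\(1\) loss in (12.1).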

12.2 Classification Calibration and Fisher Consistency

(The rest of this chapter was scribed by Wenshu Dai.)

In this section, we will derive the relation between the excess \(\varphi\)-risk of a soft classifier \(f\) and the excess risk under the 0-1 loss of its associated hard classifier \(h=\text{sign}(f)\).

Recall that the Bayes classifier is \[\begin{align*} h^*&=\mathop{\mathrm{argmin}}_{h:\mathcal{X}\mapsto\{-1,1\}} \mathbb{E}[\mathbb{1}(Y\not=h(X))]\\ &\ \big(=\mathop{\mathrm{argmin}}_{h:\mathcal{X}\mapsto\{-1,1\}} \mathbb{E}[\mathbb{1}(Yh(X)\le 0)]\big), \end{align*}\] which is given by \(h^*(x)=\text{sign}\left(\eta(x)-\frac{1}{2}\right),\)

where \(\eta(x)=P(Y=1|X=x)\); \(h^*\) minimizes the 0-1 risk, and \(L(h^*)\) is called the Bayes risk.

For any classifier \(h\), \(L(h)-L(h^*)\) is called the excess risk under the 0-1 loss.

Suppose \(f\) is a soft classifier associated with \(h\), i.e. \(h=\text{sign}(f)\). Then, re-expressing the excess risk, \[L(h)-L(h^*)=\mathbb{E}[\mathbb{1}(Yf(X)\le 0)]-L(h^*).\]

Similarly, let

\[L_\varphi^*\triangleq L_\varphi(f_\varphi^*)=\inf_f L_\varphi(f)\] be the \(\varphi\)-risk of the optimal soft classifier \[f_\varphi^*\triangleq\mathop{\mathrm{argmin}}_f L_\varphi(f). \] Then the excess \(\varphi\)-risk of any soft classifier \(f\) is \[L_\varphi(f)-L_\varphi^*.\] We will relate the excess risk to the excess \(\varphi\)-risk. But as an initial/minimum requirement, we first show that \(f_\varphi^*\) is equivalent to the Bayes rule \(h^*\). That is, minimizing the \(\varphi\)-risk amounts to the same result as minimizing the 0-1 risk at the population level. This is related to the notion of classification calibration, which we state below.

Recall that \[\begin{align*} L_\varphi^*&=\inf_f L_\varphi(f)\\ &=\inf_f \mathbb{E}[\varphi(Yf(X))]\\ &=\inf_f \mathbb{E}\big[\mathbb{E}[\varphi(Yf(X))\mid X]\big]\\ &=\inf_f \mathbb{E}[P(Y=1\mid X)\varphi(f(X))+P(Y=-1\mid X)\varphi(-f(X))]\\ &=\inf_f \mathbb{E}[\eta(X)\varphi(f(X))+(1-\eta(X))\varphi(-f(X))]. \end{align*}\]

Think of a fixed \(x\), so that both \(\eta(x)\) and \(f(x)\) are fixed. Consider the pointwise version of the above equation.

\[H_\varphi(\eta)\triangleq\inf_{\alpha\in\mathbb R}\{\eta\cdot\varphi(\alpha)+(1-\eta)\varphi(-\alpha)\}.\]

Note \(\eta\) and \(\alpha\) are \(\eta(x)\) and \(f(x)\) with \(x\) fixed.

By construction, \[L_\varphi^*=\mathbb{E}[H_\varphi(\eta(X))].\] This construction suggests that for \(x\) fixed and \(\eta(x)\) given, there is an optimal value \(f(x)=\alpha\) that minimizes the conditional \(\varphi\)-risk. Recall that the Bayes rule is \(h^*(x)=\text{sign}(\eta(x)-\frac{1}{2})\). Hence \(\eta(x)-\frac{1}{2}\) can be viewed as a soft classifier associated with \(h^*\). Intuitively, a good soft classifier should behave like the Bayes rule, e.g. \[f(x)\cdot \left(\eta(x)-\frac{1}{2}\right)\ge 0,\] that is, at \(x\), \(f(x)\) and \(\eta(x)-\frac{1}{2}\) agree in sign.

Hence, we define \(H_\varphi^-(\eta)\) below with the constraint \(\alpha(\eta-\frac{1}{2})\le 0\), i.e., \(f\) and \(\eta-\frac{1}{2}\) do not agree in sign:

\[H_\varphi^-(\eta)=\inf_{\alpha\in\mathbb R:\ \alpha(\eta-\frac{1}{2})\le 0}\{\eta\varphi(\alpha)+(1-\eta)\varphi(-\alpha)\}.\]

Intuitively, a classifier that disagrees with the sign of \(\eta(x)-\frac{1}{2}\) should perform worse; classification calibration formalizes this intuition.

Definition: Classification Calibration

We say that a loss function \(\varphi\) is classification-calibrated if for all \(\eta\not=\frac{1}{2}\), \[H_\varphi^-(\eta)>H_\varphi(\eta).\]

In other words, minimizing \(\varphi-\)risk but subject to disagreeing with the Bayes rule will worsen the \(\varphi-\)risk.

**Example: Hinge loss is classification-calibrated.**

The hinge loss is

\[\begin{align*} \varphi(z) &= \begin{cases} 1-z & \text{ if}\ z\le 1,\\ 0 & \ \text{if}\ z>1; \end{cases}\\ &=(1-z)_+\\ &=\max(1-z,0) \end{align*}\]

So we have \[\begin{align*} H_\varphi(\eta)&=\inf_\alpha\{\eta(1-\alpha)_++(1-\eta)(1+\alpha)_+\}\\ &\le \inf_{\alpha\in[-1,1]}\{\eta(1-\alpha)_++(1-\eta)(1+\alpha)_+\}\\ &= \inf_{\alpha\in[-1,1]}\{\eta(1-\alpha)+(1-\eta)(1+\alpha)\}\\ &=\inf_{\alpha\in[-1,1]}\{1+\alpha(1-2\eta)\}\\ &=\begin{cases} 1-(1-2\eta)=2\eta & \text{if } \eta\le \frac{1}{2}\ (\text{at } \alpha=-1),\\ 1+(1-2\eta)=2(1-\eta) & \text{if } \eta\ge \frac{1}{2}\ (\text{at } \alpha=1), \end{cases}\\ &=2\min(\eta,1-\eta). \end{align*}\]

This gives an upper bound on \(H_\varphi(\eta)\), the minimized conditional \(\varphi\)-risk without the constraint.

\[\begin{align*} H_\varphi^-(\eta)&=\inf_{\alpha:\,\alpha(\eta-\frac{1}{2})\le 0} \{\eta(1-\alpha)_++(1-\eta)(1+\alpha)_+\}\\ &=1\qquad (\text{attained at } \alpha=0). \end{align*}\]

So the hinge loss is classification-calibrated, since \(H^-_\varphi(\eta)=1>2\min(\eta,1-\eta)\ge H_\varphi(\eta)\) for all \(\eta\not=\frac12\).
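
The hinge-loss computations above can also be checked numerically over a grid of \(\alpha\) values (an illustrative script, not part of the derivation):

```python
# Numerical check of H_phi(eta) = 2*min(eta, 1-eta) and H_phi^-(eta) = 1 for the hinge loss.
import numpy as np

def cond_risk(eta, alpha):   # eta*phi(alpha) + (1-eta)*phi(-alpha) with phi the hinge loss
    return eta * np.maximum(1 - alpha, 0) + (1 - eta) * np.maximum(1 + alpha, 0)

alphas = np.linspace(-5, 5, 100001)
for eta in [0.1, 0.3, 0.5, 0.7, 0.9]:
    H = cond_risk(eta, alphas).min()                      # unconstrained infimum
    disagree = alphas * (eta - 0.5) <= 0                  # alpha "disagrees" with eta - 1/2
    H_minus = cond_risk(eta, alphas[disagree]).min()      # constrained infimum
    print(f"eta={eta}: H={H:.4f} (2*min={2*min(eta, 1-eta):.4f}), H^-={H_minus:.4f}")
```

The printed values match \(H_\varphi(\eta)=2\min(\eta,1-\eta)\) and \(H^-_\varphi(\eta)=1\), so the gap \(H^-_\varphi(\eta)-H_\varphi(\eta)\) is strictly positive whenever \(\eta\not=\frac12\).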

**Example: Exponential loss is also classification-calibrated.**

Theorem

For \(\varphi\) convex, \(\varphi\) is classification-calibrated (C.C.) if and only if

  • \(\varphi\) is differentiable at 0,

  • \(\varphi'(0)<0\).

See Bartlett, Jordan & McAuliffe (2006, JASA), Theorem 2, for the complete proof.

We can easily see that the hinge loss, exponential loss, and logistic loss all satisfy the conditions that \(\varphi\) is differentiable at 0 and \(\varphi'(0)<0\), so they are all classification-calibrated, as shown below.
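
Concretely, a quick check of the derivatives at \(0\):
\[\varphi'_{\text{hinge}}(0)=-1,\qquad \varphi'_{\text{exp}}(0)=-1,\qquad \varphi'_{\text{logistic}}(0)=-\frac{1}{2}\ \Big(\text{or } -\frac{1}{2\log 2}\text{ after the rescaling noted in Section 12.1}\Big),\]
all strictly negative.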

Classification calibration essentially says that at each \(x\) (pointwise), it is better to agree with the sign of the Bayes soft classifier \(\eta(x)-\frac{1}{2}\) in order to minimize the \(\varphi\)-risk. This can be viewed as a pointwise version of Fisher consistency, which we state below.

Definition: Fisher consistency for classification

A classification procedure based on a loss \(\varphi\) is Fisher consistent if \[f_\varphi^*\triangleq\mathop{\mathrm{argmin}}_f\mathbb{E}[\varphi(Yf(X))]\] has the same sign as \(\eta(x)-\frac{1}{2}\) at every \(x\).

In other words, \[\text{sign}(\text{argmin}_{f(x)}E[\varphi(Yf(x))|X=x])=\text{sign}(\eta(x)-\frac{1}{2})\ \ \forall x\]

The next theorem provides conditions for Fisher consistency (F.C.) for a general family of loss functions.

Theorem

Consider a function \(\varphi\) satisfying the following assumptions:

  • \(\varphi(z)<\varphi(-z)\ \ \forall z>0,\)

  • \(\varphi'(0)\) exists and \(\varphi'(0)\not=0\).

If \(\mathbb{E}(\varphi(Yf(X))|X=x)\) has a global minimizer \(\bar f(x)\), and \(\eta(x)\not=\frac{1}{2}\), then \[\text{sign}(\bar f(x))=\text{sign}\left(\eta(x)-\frac12\right).\] Note the theorem does not assume convexity, differentiability, or monotonicity of \(\varphi\) (beyond the existence of \(\varphi'(0)\)). It is clear that all three popular convex surrogate loss functions satisfy these conditions.

Proof: \[\mathbb{E}(\varphi(Yf(X))|X=x)=\eta(x)\varphi(f(x))+(1-\eta(x))\varphi(-f(x)).\] Hence, defining \[C(\alpha)=\eta(x)\varphi(\alpha)+(1-\eta(x))\varphi(-\alpha),\] we have \[C(\bar f(x))\le C(\alpha)\ \ \forall\alpha\] since \(\bar f(x)=\mathop{\mathrm{argmin}}_\alpha C(\alpha)\).

  1. \(C(\alpha)\) is not minimized at \(\alpha=0\).

To see this, since \(\varphi'(0)\) exists,

\[\begin{align*} C'(\alpha)\big|_{\alpha=0}&=\big[\eta(x)\varphi'(\alpha)-(1-\eta(x))\varphi'(-\alpha)\big]\Big|_{\alpha=0}\\ &=[2\eta(x)-1]\varphi'(0)\not=0, \end{align*}\] since \(\eta(x)\not=\frac12\) and \(\varphi'(0)\not=0\). So \(\alpha=0\) cannot be a minimizer of \(C\).

  2. If \(\bar f(x)\) is a global minimizer of \(C(\alpha)\), then

\[C(\bar f(x))\le C(-\bar f(x))\] \[\begin{align*} 0&\ge C(\bar f(x))-C(-\bar f(x))\\ &=[\eta(x)\varphi(\bar f(x))+(1-\eta(x))\varphi(-\bar f(x))]-[\eta(x)\varphi(-\bar f(x))+(1-\eta(x))\varphi(\bar f(x))]\\ &=[2\eta(x)-1][\varphi(\bar f(x))-\varphi(-\bar f(x))] \end{align*}\]

Now if \(\eta(x)>\frac{1}{2}\), then \(\varphi(\bar f(x))-\varphi(-\bar f(x))\le 0\), i.e. \(\varphi(\bar f(x))\le \varphi(-\bar f(x))\). Since \(\varphi(z)<\varphi(-z)\) for all \(z>0\), this rules out \(\bar f(x)<0\); combined with Step 1 (\(\bar f(x)\not=0\)), we get \(\bar f(x)>0\).

Similarly, if \(\eta(x)<\frac{1}{2}\), then \(\varphi(\bar f(x))\ge\varphi(-\bar f(x))\), which rules out \(\bar f(x)>0\); again using Step 1, \(\bar f(x)<0\).

Therefore \[\text{sign}(\bar f(x))=\text{sign}\left(\eta(x)-\frac{1}{2}\right)\ \ \forall \eta(x)\not=\frac12.\ \ \ \ \ \ \ \square\]

Remark: since the hinge loss, exponential loss, and logistic loss satisfy the conditions in the theorem, minimizing \(\mathbb{E}(\varphi(Yf(X))|X=x)\) amounts to the same as minimizing \[\mathbb{E}(\mathbb{1}(Yf(X)\le 0)|X=x)\] in terms of the sign of the minimizer.

If we are allowed to assume that \(\varphi\) is differentiable and convex (e.g. exponential, logistic), then we have a simpler proof of Fisher consistency.

Set the derivative of \(C(\alpha)\) to 0 at the minimizer \(\bar\alpha=\bar f(x)\): \[0=C'(\bar\alpha)=\eta(x)\varphi'(\bar\alpha)-(1-\eta(x))\varphi'(-\bar\alpha)\] \[\Rightarrow\frac{\eta(x)}{1-\eta(x)}=\frac{\varphi'(-\bar\alpha)}{\varphi'(\bar\alpha)}.\]

Since \(\varphi\) is convex, \(\varphi'\) is non-decreasing; and since \(\varphi\) is non-increasing, \(\varphi'\le 0\). So \[\begin{align*} \eta(x)>\frac12&\Leftrightarrow\frac{\eta(x)}{1-\eta(x)}>1\\ &\Leftrightarrow \varphi'(-\bar\alpha)<\varphi'(\bar\alpha)\ \ (\text{both are negative})\\ &\Rightarrow -\bar\alpha<\bar\alpha,\ \text{i.e.}\ \bar\alpha>0. \end{align*}\] Similarly, \(\eta(x)<\frac12\) implies \(\bar\alpha<0\), so again \(\text{sign}(\bar f(x))=\text{sign}(\eta(x)-\frac12)\).

Remark: this first-order-condition argument is closely related to classification calibration. By the BJM theorem above, when \(\varphi\) is convex, differentiable at 0, non-increasing, and \(\varphi'(0)<0\) (as is the case for the hinge, exponential, and logistic losses), we have \(H^-_\varphi(\eta)>H_\varphi(\eta)\) for every \(\eta\not=\frac12\).
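
For example, for the exponential loss \(\varphi(z)=e^{-z}\), this first-order condition can be solved in closed form (a standard computation, included here for illustration):
\[C(\alpha)=\eta(x)e^{-\alpha}+(1-\eta(x))e^{\alpha},\qquad 0=C'(\bar\alpha)=-\eta(x)e^{-\bar\alpha}+(1-\eta(x))e^{\bar\alpha}\ \Rightarrow\ \bar\alpha=\frac{1}{2}\log\frac{\eta(x)}{1-\eta(x)},\]
which is positive exactly when \(\eta(x)>\frac12\), so \(\text{sign}(\bar f(x))=\text{sign}(\eta(x)-\frac12)\), as the theorem predicts.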

12.3 Excess risk bounds

In Section 12.2, we showed conditions under which a soft classifier that minimizes the expected \(\varphi\)-risk leads to the Bayes rule.

In this section, we will show that the excess 0-1 risk of the hard classifier \(\text{sign}(f)\) is controlled by the excess \(\varphi\)-risk of the soft classifier \(f\).

The idea is that if we obtain a soft classifier \(f\) with \(L_\varphi(f)\to L_\varphi^*=\inf_fL_\varphi(f)\), then we should also expect that \[L(\text{sign}(f))\to L^*=\inf_hL(h).\]

We will present two comparable results, one based on classification calibration, and the other based on a more specific condition.

Theorem (Bartlett, Jordan & McAuliffe 2006, “BJM”)

  1. Let \(\varphi\) be convex and classification-calibrated. Define \[\psi(u)\triangleq H_\varphi^-\left(\frac{1+u}{2}\right)-H_\varphi\left(\frac{1+u}{2}\right).\] Then for all \(f\), \[\underbrace{\psi\big(L(\text{sign}(f))-L^*\big)}_{\psi(\text{excess risk})}\le \underbrace{L_\varphi(f)-L_\varphi^*}_{\text{excess}\ \varphi\text{-risk}}.\]

  2. If \(\varphi\) is classification calibrated and convex, then \[\psi(u)=\varphi(0)-H_\varphi(\frac{1+u}{2})\]

Remark: For example, with hinge loss \(\varphi(z)=(1-z)_+\), we had previously shown that \[H_\varphi(\eta)=2\min (\eta,1-\eta)\] and \[H_\varphi^-(\eta)=1\]

Plugging in \(\eta=\frac{1+u}{2}\), we obtain

\[\begin{align*} \psi(u)&=H_\varphi^-(\frac{1+u}{2})-H_\varphi(\frac{1+u}{2})\\ &=1-2\min(\frac{1+u}{2},\frac{1-u}{2})\\ &=1-\min(1+u,1-u)\\ &=1+\max(-1-u,-1+u)\\ &=\max(u,-u)\\ &=|u| \end{align*}\]

Therefore, part 1 in the theorem implies that \[\begin{align*} &|L(\text{sign}(f))-L^*|\le L_\varphi(f)-L_\varphi^*\\ \Rightarrow&L(\text{sign}(f))-L^*\le L_\varphi(f)-L_\varphi^* \end{align*}\]

So for the hinge loss, the excess \(\varphi-\)risk gives us an upper bound on the excess 0-1 risk.

Proof: \[\begin{align*} L(\text{sign}(f))-L^* &=L(\text{sign}(f))-L(h^*)\qquad \big(h^*(x)=\text{sign}(\eta(x)-\tfrac{1}{2})\ \text{is the Bayes rule}\big)\\ &=\mathbb{E}\left[|2\eta(X)-1|\cdot \mathbb{1}\left(\text{sign}(f(X))\not=\text{sign}(\eta(X)-\tfrac{1}{2})\right)\right]\\ &\triangleq \mathbb{E}(g(X)). \end{align*}\]

By Jensen’s inequality, if \(\psi\) is convex, then \[\begin{align*} &\psi(L(\text{sign}(f))-L^*)\\ =&\psi(E(g(x)))\\ \le&E(\psi(g(x))) \end{align*}\]

Let’s verify that \(\psi\) is convex.

First, since \(\varphi\) is convex (and classification-calibrated, so \(\varphi'(0)<0\)), Jensen's inequality gives, for any \(\alpha\) with \(\alpha(\eta-\frac{1}{2})\le 0\), \[\eta\varphi(\alpha)+(1-\eta)\varphi(-\alpha)\ \ge\ \varphi\big(\eta\alpha+(1-\eta)(-\alpha)\big)=\varphi\big((2\eta-1)\alpha\big)\ \ge\ \varphi(0),\] where the last inequality uses that \((2\eta-1)\alpha\le 0\) and \(\varphi\) is non-increasing. Taking \(\alpha=0\) shows the value \(\varphi(0)\) is attained, hence \[H_\varphi^-(\eta)=\varphi(0).\]

So \(\psi(u)=H^-_\varphi\left(\frac{1+u}{2}\right)-H_\varphi\left(\frac{1+u}{2}\right)=\varphi(0)-H_\varphi\left(\frac{1+u}{2}\right)\), which proves part 2 of the theorem.

By definition, \(H_\varphi(\eta)=\inf_\alpha\{\eta\varphi(\alpha)+(1-\eta)\varphi(-\alpha)\}\), which is a pointwise infimum of linear functions of \(\eta\) (indexed by \(\alpha\)); so \(H_\varphi(\cdot)\) is concave, and hence \(\psi(u)=\varphi(0)-H_\varphi(\frac{1+u}{2})\) is convex!

So indeed we have \[\begin{align*} \psi(L(\text{sign}(f))-L^*) &\le \mathbb{E}[\psi(g(X))]\\ &=\mathbb{E}\left[\psi(|2\eta(X)-1|)\,\mathbb{1}\left(\text{sign}(f(X))\not=\text{sign}(\eta(X)-\tfrac{1}{2})\right)\right]\\ &=\mathbb{E}\left[\big(H^-_\varphi(\eta(X))-H_\varphi(\eta(X))\big)\,\mathbb{1}\left(\text{sign}(f(X))\not=\text{sign}(\eta(X)-\tfrac{1}{2})\right)\right]\\ &\le \mathbb{E}\left[\mathbb{E}[\varphi(Yf(X))\mid X]-H_\varphi(\eta(X))\right]\\ &=L_\varphi(f)-L_\varphi^*, \end{align*}\]

where \(\psi(u)=H^-_\varphi(\frac{1+u}{2})-H_\varphi(\frac{1+u}{2})\). The first equality uses \(\psi(0)=0\), which is easy to verify; the second uses that \(H_\varphi\) and \(H^-_\varphi\) are unchanged under \(\eta\mapsto 1-\eta\) (replace \(\alpha\) by \(-\alpha\)), so \(\psi(|2\eta-1|)=H^-_\varphi(\eta)-H_\varphi(\eta)\).

Note that given \(x\), if \(\text{sign}(f(x))\not=\text{sign}(\eta(x)-\frac{1}{2})\), i.e. \(f(x)(\eta(x)-\frac{1}{2})\le 0\), we have \[\mathbb{E}[\varphi(Yf(X))|X=x]\ge H_\varphi^-(\eta(x))\] due to the definition of \(H^-_\varphi\); this justifies the last inequality on the event where the indicator equals 1.

To drop the indicator \(\mathbb{1}(\text{sign}(f(X))\not=\text{sign}(\eta(X)-\frac{1}{2}))\), note that \[\mathbb{E}(\varphi(Yf(X))|X)\ge H_\varphi(\eta(X))\] always holds, due to the definition of \(H_\varphi\); so on the event where the indicator equals 0, the integrand is \(0\le \mathbb{E}(\varphi(Yf(X))|X)-H_\varphi(\eta(X))\) as well.

Remark: the theorem implies that for such surrogates, there exists a \(\psi\)-transform relating the excess \(\varphi\)-risk and the excess 0-1 risk.

In practice, to use the result of this theorem, we need to derive the form of the \(\psi\)-transform. Recall that for the hinge loss, \(\psi(u)=|u|\).
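
As another example (a computation along the same lines, included here for illustration rather than taken from the lecture), consider the exponential loss \(\varphi(z)=e^{-z}\):
\[H_\varphi(\eta)=\inf_\alpha\{\eta e^{-\alpha}+(1-\eta)e^{\alpha}\}=2\sqrt{\eta(1-\eta)}\ \ \Big(\text{at } \alpha=\tfrac12\log\tfrac{\eta}{1-\eta}\Big),\qquad H_\varphi^-(\eta)=1\ \ (\text{at }\alpha=0),\]
so
\[\psi(u)=H_\varphi^-\Big(\tfrac{1+u}{2}\Big)-H_\varphi\Big(\tfrac{1+u}{2}\Big)=1-2\sqrt{\tfrac{1+u}{2}\cdot\tfrac{1-u}{2}}=1-\sqrt{1-u^2}\ \ge\ \frac{u^2}{2}.\]
Hence the theorem yields \(\frac12\big(L(\text{sign}(f))-L^*\big)^2\le L_\varphi(f)-L_\varphi^*\), i.e. \(L(\text{sign}(f))-L^*\le \sqrt{2\,(L_\varphi(f)-L_\varphi^*)}\) for the exponential loss.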

The next lemma, due to Tong Zhang (2004), parallels the above theorem and is based on a different condition.

Lemma. Zhang’s Lemma.

Let \(\varphi:\mathbb R\to \mathbb R^+\) be a convex, non-increasing function such that \(\varphi(0)=1\). If there exist \(c>0\) and \(\gamma\in[0,1]\) such that \[\left|\eta-\frac{1}{2}\right|\le c\,(1-H_\varphi(\eta))^\gamma\qquad \forall\eta\in[0,1],\] then \[L(\text{sign}(f))-L^*\le 2c\,(L_\varphi(f)-L_\varphi^*)^\gamma.\]

Remark: for the hinge loss, we derived earlier that \[\begin{align*} H_\varphi(\eta)&=2\min(\eta,1-\eta)\\ &=1-(1-2\min(\eta,1-\eta))\\ &=1-\max(1-2\eta,\,2\eta-1)\\ &=1-|2\eta-1|\\ &=1-2\left|\eta-\frac{1}{2}\right|. \end{align*}\]

\[\Rightarrow 1-H_\varphi(\eta)=|1-2\eta|=2\left|\eta-\frac{1}{2}\right|\] \[\Rightarrow \left|\eta-\frac{1}{2}\right|=\frac{1}{2}\big(1-H_\varphi(\eta)\big)^1\ \Rightarrow\ c=\frac{1}{2},\ \gamma=1.\] Hence, \[L(\text{sign}(f))-L^*\le L_\varphi(f)-L_\varphi^*\] for the hinge loss, which matches the result of BJM 2006.

Proof: first, note that \[\begin{align*} H_\varphi(\eta)&=\inf_\alpha\{\eta\varphi(\alpha)+(1-\eta)\varphi(-\alpha)\}\\ &\le \eta\varphi(0)+(1-\eta)\varphi(0)\\ &=\varphi(0)\\ &=1 \end{align*}\]

So \(1-H_\varphi(\eta)\ge 0\) and the lemma statement is well-defined.

Next, \[\begin{align*} L(\text{sign}(f))-L^* &=\mathbb{E}\left[|2\eta(X)-1|\cdot \mathbb{1}\left(\text{sign}(f(X))\not=\text{sign}(\eta(X)-\tfrac{1}{2})\right)\right]\\ &\le 2c\,\mathbb{E}\left[(1-H_\varphi(\eta(X)))^\gamma\,\mathbb{1}\left(\text{sign}(f(X))\not=\text{sign}(\eta(X)-\tfrac{1}{2})\right)\right]\\ &=2c\,\mathbb{E}\left[\Big((1-H_\varphi(\eta(X)))\,\mathbb{1}\left(\text{sign}(f(X))\not=\text{sign}(\eta(X)-\tfrac{1}{2})\right)\Big)^\gamma\right]\\ &\le 2c\left\{\mathbb{E}\left[(1-H_\varphi(\eta(X)))\,\mathbb{1}\left(\text{sign}(f(X))\not=\text{sign}(\eta(X)-\tfrac{1}{2})\right)\right]\right\}^\gamma, \end{align*}\]

where the last step follows from Jensen's inequality, since \(u\mapsto u^\gamma\) is concave on \([0,\infty)\) for \(\gamma\in[0,1]\).

To this end, given each \(x\in \mathcal X\), we will show that \[(1-H_\varphi(\eta(x)))\,\mathbb{1}\left(\text{sign}(f(x))\not=\text{sign}\left(\eta(x)-\frac{1}{2}\right)\right)\le \mathbb{E}(\varphi(Yf(x))\mid X=x)-H_\varphi(\eta(x)), \tag{*}\]

since integrating (*) over \(x\) then yields the desired bound.

On the RHS, note that \[\mathbb{E}(\varphi(Yf(x))\mid X=x)=\eta(x)\varphi(f(x))+(1-\eta(x))\varphi(-f(x))\] and \[H_\varphi(\eta(x))=\inf_{\alpha}\{\eta(x)\varphi(\alpha)+(1-\eta(x))\varphi(-\alpha)\}.\]

Hence the RHS of (*) is non-negative. Therefore the case \(\text{sign}(f(x))=\text{sign}(\eta(x)-\frac{1}{2})\) [where the LHS of (*) equals 0] follows trivially.

For the case \(\text{sign}(f(x))\not=\text{sign}(\eta(x)-\frac{1}{2})\), we have \[\begin{align*} \mathbb{E}(\varphi(Yf(x))\mid X=x)&=\eta(x)\varphi(f(x))+(1-\eta(x))\varphi(-f(x))\\ &\ge\varphi\big(\eta(x)f(x)+(1-\eta(x))(-f(x))\big)\qquad\text{(Jensen, since }\varphi\text{ is convex)}\\ &=\varphi\big((2\eta(x)-1)f(x)\big)\\ &\ge\varphi(0)\\ &=1, \end{align*}\]

since \(\varphi\) is non-increasing and \((2\eta(x)-1)f(x)\le 0\) when the signs disagree. Hence in this case the RHS of (*) is at least \(1-H_\varphi(\eta(x))\), which equals the LHS. This completes the proof of (*) and hence the lemma.     \(\square\)

Remarks:

  • It’s not hard to check the \(c\) and \(\gamma\) values for common surrogate loss functions (exponential, logistic); a worked check for the exponential loss is given after these remarks.

  • Comparing the BJM theorem and Zhang's lemma, the BJM theorem holds under weaker conditions than Zhang's lemma.
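
For instance, for the exponential loss, using \(H_\varphi(\eta)=2\sqrt{\eta(1-\eta)}\) from the example in the previous section (again a quick check rather than part of the lecture):
\[1-H_\varphi(\eta)=1-2\sqrt{\eta(1-\eta)}=\big(\sqrt{\eta}-\sqrt{1-\eta}\big)^2,\]
and
\[\left|\eta-\frac12\right|=\frac12\big|\sqrt{\eta}-\sqrt{1-\eta}\big|\big(\sqrt{\eta}+\sqrt{1-\eta}\big)\le \frac{\sqrt2}{2}\big|\sqrt{\eta}-\sqrt{1-\eta}\big|=\frac{1}{\sqrt2}\big(1-H_\varphi(\eta)\big)^{1/2},\]
so \(c=\frac{1}{\sqrt2}\) and \(\gamma=\frac12\), and Zhang's lemma gives \[L(\text{sign}(f))-L^*\le \sqrt{2}\,\big(L_\varphi(f)-L_\varphi^*\big)^{1/2},\] consistent with the \(\psi\)-transform bound derived from the BJM theorem.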