Chapter 4 Binary Classification
(This chapter was scribed by Paul Barber. Proofread and polished by Baozhen Wang.)
In this chapter, we focus on analyzing a particular problem: binary classification. Focusing on binary classification is justified because
- It encompasses much of what we have to do in practice
- \(Y\) is bounded.
In particular, there are some nasty surprises lurking in multicategory classification, so we avoid more complicated general classification here.
4.1 Bayes Classification (Rule)
Suppose \((X_{1},Y_{1}) , \ldots , (X_{n}, Y_{n})\) are iid draws from \(P(X,Y)\), where \(X\) lies in some feature space and \(Y \in \left\{ 0,1\right\}\). Denote the marginal distribution of \(X\) by \(P_{X}\). The conditional distribution is \(Y\mid X = x \sim \text{Ber}(\eta (x))\), where \(\eta(x)\triangleq P(Y=1\mid X=x)=E(Y\mid X=x)\); \(\eta\) is sometimes called the regression function.
Next, we consider an optimal classifier that knows \(\eta\) perfectly, that is as if we had perfect access to the distribution of \(Y|X\).
Definition 4.1 (Bayes Classifier) \[h^{\ast}(x) = \begin{cases} 1 & \text{ if } \eta(x)> \frac{1}{2} \\ 0 & \text{ if } \eta(x) \le \frac{1}{2} \end{cases} .\]
In other words,
\[h^{\ast} (x)=1 \iff P(Y=1|X=x)>P(Y=0|X=x)\]
The performance metric of \(h\) is the classification error \(L(h)=P(h(X)\neq Y)\), i.e., the risk under \(0\text{-}1\) loss. The Bayes risk is \(L^{\ast}=L(h^{\ast})\), the classification error of \(h^{\ast}\).
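A minimal numerical sketch of the Bayes classifier and its error, assuming for illustration that \(X\sim \mathrm{Unif}(0,1)\) and \(\eta(x)=x\) (so that \(h^{\ast}(x)=\mathbb{1}[x>\frac{1}{2}]\) and the Bayes risk works out to \(1/4\)):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative model (an assumption, not from the notes): X ~ Uniform(0, 1)
# and eta(x) = P(Y = 1 | X = x) = x, so the Bayes rule is 1[x > 1/2].
def eta(x):
    return x

def bayes_classifier(x):
    return (eta(x) > 0.5).astype(int)

# Monte Carlo estimate of L(h*) = P(h*(X) != Y)
n = 200_000
X = rng.uniform(0, 1, size=n)
Y = rng.binomial(1, eta(X))
L_star_mc = np.mean(bayes_classifier(X) != Y)
print(L_star_mc)  # close to the exact Bayes risk 0.25 for this model
```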
Theorem 4.1 For any binary classifier \(h\), the following identity holds:
\[L(h)-L(h^{\ast}) = \int_{\left\{ h \neq h^{\ast} \right\} } |2 \eta(x)-1|\, dP_{X}(x) = E_{X \sim P_{X}} \left( |2 \eta(x) -1| \mathbb{1}[h(x) \neq h^{\ast}(x)] \right)\]
where \(\left\{h\neq h^{\ast}\right\}\triangleq\left\{x\in X:\ h(x)\neq h^{\ast}(x)\right\}\).
In particular, the integrand is non-negative, which implies \(L(h^{\ast}) \le L(h),\ \forall h\). Moreover,
\[L(h^{\ast}) = E_{X}[\text{min}(\eta(x),1-\eta(x))] \le \frac{1}{2}\]
Proof. We first establish the last claim. For any classifier \(h\),
\[\begin{align*} L(h) &= P(Y \neq h(X)) \\ &= P(Y=0, h(X)=1)+P(Y=1, h(X)=0) \\ &= E(\mathbb{1}[Y=0,h(X)=1])+E(\mathbb{1}[Y=1,h(X)=0]) \\ &= E_X\left\{E\left(\mathbb{1}[Y=0,h(X)=1]\mid X\right) \right\}+ E_X\left\{E\left(\mathbb{1}[Y=1,h(X)=0]\mid X\right) \right\} .\end{align*}\]
Now, conditionally on \(X=x\), \(h(x)\) is a fixed number, so
\[E\left(\mathbb{1}[Y=0,h(X)=1]\mid X=x\right)=\mathbb{1}[h(x)=1]\,P(Y=0\mid X=x)=\mathbb{1}[h(x)=1](1-\eta(x)),\]
and similarly \(E\left(\mathbb{1}[Y=1,h(X)=0]\mid X=x\right)=\mathbb{1}[h(x)=0]\,\eta(x)\).
Then \(\forall h\),
\[\begin{equation} L(h)= E_{X}\left[\mathbb{1}[h(x)=1] (1-\eta(x))\right]+ E_{X}\left[\mathbb{1}[h(x)=0] \eta(x)\right] \tag{4.1} \end{equation}\]
In particular, taking \(h=h^{\ast}\) and noting that \(h^{\ast}\) always picks the smaller of \(\eta(x)\) and \(1-\eta(x)\),
\[L(h^{\ast})=E[\mathbb{1}[h^{\ast}(x)=1](1-\eta(x))+\mathbb{1}[h^{\ast}(x)=0] \eta(x)] = E[\min(\eta(x),1-\eta(x))]\le \frac{1}{2}\]
Now apply (4.1) to both \(h\) and \(h^{\ast}\),
\[\begin{align*} L(h)-L(h^{\ast})&=E[\mathbb{1}[h(x)=1](1-\eta(x))+\mathbb{1}[h(x)=0]\eta(x)-\mathbb{1}[h^{\ast}(x)=1](1-\eta(x))-\mathbb{1}[h^{\ast}(x)=0] \eta(x)]\\ &=E\left\{[\mathbb{1}[h(x)=1]-\mathbb{1}[h^{\ast}(x)=1]](1-\eta(x)) + [\mathbb{1}[h^{\ast}(x)=0]-\mathbb{1}[h(x)=0]]\eta(x)\right\}\\ &=E\left\{[\mathbb{1}[h(x)=1]-\mathbb{1}[h^{\ast}(x)=1]](1-2\eta(x))\right\}. \end{align*}\] The difference \(\mathbb{1}[h(x)=1]-\mathbb{1}[h^{\ast}(x)=1]\) vanishes on \(\{h=h^{\ast}\}\), while on \(\{h\neq h^{\ast}\}\) it equals \[\begin{cases} 1 & \text{ if } \eta(x)\le\frac{1}{2} \\ -1 & \text{ if } \eta(x)>\frac{1}{2} \end{cases}\] which agrees with \(\text{sgn}(1-2\eta(x))\) except where the factor \(1-2\eta(x)\) is zero. Hence \[\begin{align*} L(h)-L(h^{\ast})&=E[\mathbb{1}[h^{\ast}(x)\neq h(x)]\cdot \text{sgn}(1-2\eta(x))(1-2\eta(x))]\\ &=E[\mathbb{1}[h^{\ast}(x)\neq h(x)]\cdot |1-2\eta(x)|]. \end{align*}\] This implies \(L(h) \ge L(h^{\ast})\). \(\square\)
Remarks:
- \(L(h)-L(h^{\ast})\) is the excess risk.
- \(L(h^{\ast})=\frac{1}{2}\iff\eta(x)=\frac{1}{2}\text{ a.s. }\iff\) \(X\) contains no useful information about \(Y\).
- Excess risk weights the discrepancy between \(h\) and \(h^{\ast}\) according to how far \(\eta(x)\) is from \(\frac{1}{2}\).
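The identity in Theorem 4.1 is easy to check by simulation. The sketch below again assumes, purely for illustration, \(X\sim\mathrm{Unif}(0,1)\) and \(\eta(x)=x\), and compares an arbitrary competitor \(h(x)=\mathbb{1}[x>0.3]\) against \(h^{\ast}\):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative model: X ~ Uniform(0, 1), eta(x) = x.
eta = lambda x: x
h_star = lambda x: (eta(x) > 0.5).astype(int)   # Bayes rule
h = lambda x: (x > 0.3).astype(int)             # an arbitrary competitor

n = 500_000
X = rng.uniform(0, 1, size=n)
Y = rng.binomial(1, eta(X))

# Left-hand side: L(h) - L(h*), estimated directly.
lhs = np.mean(h(X) != Y) - np.mean(h_star(X) != Y)

# Right-hand side: E[|2 eta(X) - 1| 1{h(X) != h*(X)}].
rhs = np.mean(np.abs(2 * eta(X) - 1) * (h(X) != h_star(X)))

print(lhs, rhs)
```

Both estimates should be close to the exact excess risk \(\int_{0.3}^{0.5}(1-2x)\,dx=0.04\) for this model.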
As noted earlier, LDA puts a model (distribution) on the data, e.g.
\[X| Y=y \sim N(\mu_{Y}, \Sigma)\]
Generally, one can compute the Bayes rule by applying Bayes' theorem:
\[\eta(x)=P(Y=1|X=x) = \frac{P(X|Y=1)P(Y=1)}{P(X|Y=1)P(Y=1)+P(X|Y=0)P(Y=0)}\]
where \(P(X|Y=y)\) is density or pmf. For \(\pi\triangleq P(Y=1)\), \(P_j\triangleq P(X|Y=j)\) :
\[\eta(x) = \frac{P_{1}(x) \pi}{P_{1}(x) \pi + P_{0}(x) (1-\pi) }\]
The Bayes rule is \(h^{\ast} (x) = \mathbb{1}[\eta (x) > \frac{1}{2}]\), and
\[\frac{P_{1}(x) \pi}{P_{1}(x) \pi + P_{0}(x)(1-\pi)} > \frac{1}{2} \iff \frac{\pi P_{1}(x)}{(1-\pi)P_{0}(x)} >1 \iff \frac{P_{1}(x)}{P_{0}(x)} > \frac{1-\pi}{\pi}\]
When \(\pi=\frac{1}{2}\), Bayes rule amounts to comparing \(P_{1}(x)\) with \(P_{0}(x)\).
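For a one-dimensional Gaussian model, the likelihood-ratio form of the Bayes rule is easy to write down explicitly. The sketch below uses arbitrary illustrative values of \(\mu_0,\mu_1,\sigma,\pi\) and checks that thresholding the likelihood ratio at \((1-\pi)/\pi\) agrees with thresholding \(\eta(x)\) at \(\frac{1}{2}\):

```python
import numpy as np

# Illustrative 1-D LDA-type setup (arbitrary choices): X | Y = y ~ N(mu_y, sigma^2).
mu0, mu1, sigma, pi1 = -1.0, 1.0, 1.0, 0.3   # pi1 = P(Y = 1), written pi in the notes

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def bayes_rule(x):
    # h*(x) = 1 iff the likelihood ratio P_1(x)/P_0(x) exceeds (1 - pi)/pi.
    lr = gauss_pdf(x, mu1, sigma) / gauss_pdf(x, mu0, sigma)
    return (lr > (1 - pi1) / pi1).astype(int)

def eta(x):
    # eta(x) computed via Bayes' theorem, as a sanity check against the rule above.
    p1, p0 = gauss_pdf(x, mu1, sigma), gauss_pdf(x, mu0, sigma)
    return p1 * pi1 / (p1 * pi1 + p0 * (1 - pi1))

x = np.linspace(-4, 4, 9)
assert np.array_equal(bayes_rule(x), (eta(x) > 0.5).astype(int))
print(bayes_rule(x))
```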
4.2 Empirical Risk Minimization (ERM)
Given data \(D_{n} = \left\{ (X_{1},Y_{1}) , \ldots, (X_{n},Y_{n}) \right\}\) we build a classifier \(\hat{h}_{n} (x)\) which is random in two ways:
- \(X\) is a random variable
- \(\hat{h}_{n}\) depends on the “random” data \(D_{n}\) explicitly.
Our performance metric is still \(L(\hat{h}_{n}) =P(\hat{h}_{n}(X) \neq Y)\). Although we have integrated out \((X,Y)\), \(L(\hat{h}_{n})\) still depends on the data \(D_{n}\). Since \(D_{n}\) is random, we will consider bounding both \(E(L(\hat{h}_{n})-L(h^{\ast}))\) and, with high probability, \(L(\hat{h}_{n})-L(h^{\ast})\).
4.2.1 Plug-in rule
Earlier we discussed two approaches to classification: generative vs. discriminative. The middle ground is the plug-in rule.
Recall \(h^{\ast}(x) = \mathbb{1}[\eta(x) > \frac{1}{2}]\). The plug-in approach estimates \(\eta(x)\) by \(\hat{\eta}_{n}(x)\) and plugs it into the Bayes rule to produce \(\hat{h}_{n}(x)\triangleq\mathbb{1}[\hat{\eta}_{n}(x) >\frac{1}{2}]\).
Many possibilities:
- If \(\eta(x)\) is smooth, use non-parametric regression to estimate \(E(Y|X=x)\):
- Nadaraya-Watson kernel regression
- \(k\)-nearest neighbor.
- Wavelets
- If \(\eta(x)\) has parametric form, logistic regression
\[\log\left( \frac{\eta(x)}{1-\eta(x)} \right) = x^{T} \beta\]
Logistic regression is widely used, performs well, and is easy to compute, but it is not our focus here.
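A minimal plug-in sketch, assuming for illustration \(X\sim\mathrm{Unif}(0,1)\), \(\eta(x)=x\), and a \(k\)-nearest-neighbor estimate of \(\eta\) (the sample size and \(k\) are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data: X ~ Uniform(0, 1), eta(x) = x, so the Bayes rule is 1[x > 1/2].
n, k = 2000, 51
X_train = rng.uniform(0, 1, size=n)
Y_train = rng.binomial(1, X_train)

def eta_hat(x):
    """k-NN estimate of eta(x): average Y over the k nearest training points."""
    idx = np.argsort(np.abs(X_train - x))[:k]
    return Y_train[idx].mean()

def h_hat(x):
    """Plug-in classifier: threshold the estimated regression function at 1/2."""
    return int(eta_hat(x) > 0.5)

print([h_hat(x) for x in (0.2, 0.4, 0.6, 0.8)])  # most likely [0, 0, 1, 1]
```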
4.3 Learning with finite hypothesis class \(\mathcal{H}\).
Recall the following definitions (a small numerical sketch follows the list):
- \(L(h) = P(Y \neq h(X))\).
- \(\hat{L}_{n}(h) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[Y_{i} \neq h(X_{i})]\).
- \(\hat{h}_{n} = \text{argmin}_{h \in \mathcal{H}} \hat{L}_{n}(h)\).
- \(\bar{h} = \text{argmin}_{h \in \mathcal{H}} L(h)\).
- Excess risk wrt \(\mathcal{H}\): \(L(\hat{h}_{n})-L(\bar{h})\).
- Excess risk: \(L (\hat{h}_{n})-L(h^{\ast})\)
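A small numerical sketch of these definitions, using an illustrative finite class of threshold classifiers \(h_t(x)=\mathbb{1}[x>t]\) under the arbitrary model \(X\sim\mathrm{Unif}(0,1)\), \(\eta(x)=x\):

```python
import numpy as np

rng = np.random.default_rng(3)

# Finite hypothesis class: M threshold classifiers h_t(x) = 1[x > t].
thresholds = np.linspace(0.1, 0.9, 9)      # M = 9

# Illustrative data: X ~ Uniform(0, 1), eta(x) = x; the Bayes rule is the t = 0.5 threshold.
n = 200
X = rng.uniform(0, 1, size=n)
Y = rng.binomial(1, X)

def empirical_risk(t):
    """L_hat_n(h_t): average 0-1 loss of the threshold classifier on the sample."""
    return np.mean((X > t).astype(int) != Y)

risks = np.array([empirical_risk(t) for t in thresholds])
t_hat = thresholds[np.argmin(risks)]       # threshold of the empirical risk minimizer h_hat_n
print(t_hat, risks.min())
```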
Ideally, we want to bound the excess risk \[ L(\hat{h}_{n}) -L(h^{\ast}) = \underbrace{[L(\hat{h}_{n})-L(\bar{h})]}_{\text{estimation error}}+\underbrace{[L(\bar{h})-L(h^{\ast})]}_{\text{approximation error}}. \] We focus on bounding the first (estimation) term.
Goal: \(P\left(L(\hat{h}_{n})-L(\bar{h}) \le \Delta_{n , \delta} (\mathcal{H})\right)\ge 1- \delta\). How? Try to bound \(|\hat{L}_{n}(h)-L(h)|\) for all \(h \in \mathcal{H}\). For a fixed \(h\),
\[\begin{align*} |\hat{L}_{n}(h)-L(h)| &= \left| \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[h(X_{i}) \neq Y_{i}] - E(\mathbb{1}[h(X) \neq Y]) \right| \\ &= | \bar{Z}-\mu|, \end{align*}\]
where \(Z_{i}\triangleq\mathbb{1}[h(X_{i})\neq Y_{i}]\) and \(\mu\triangleq E(Z_{i})=L(h)\). \(Z_{i} \in [0,1]\) a.s. by definition. Applying Hoeffding’s inequality (with \(a_i=0\), \(b_i=1\)), we have
\[P(|\bar{Z}-\mu| \le \epsilon) \ge 1- 2 \exp\left( - \frac{2n^2 \epsilon^2}{\sum_{i=1}^{n}(b_i-a_i)^2} \right) = 1- 2 \exp\left( -2n \epsilon^2\right)\]
Let \(\delta=2\exp(-2n\epsilon^2)\); then \(\epsilon=\sqrt{\frac{ \log\left( 2/\delta \right)}{2n}}\), so that for each fixed \(h\), \[|\hat{L}_{n}(h)-L(h)| \le \sqrt{\frac{ \log\left( 2/\delta \right)}{2n} }\] with probability at least \(1-\delta\).
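A quick Monte Carlo sanity check of this fixed-\(h\) Hoeffding bound, under the same illustrative model \(X\sim\mathrm{Unif}(0,1)\), \(\eta(x)=x\), with \(h(x)=\mathbb{1}[x>\frac{1}{2}]\):

```python
import numpy as np

rng = np.random.default_rng(4)

# Fixed classifier h(x) = 1[x > 0.5] under X ~ Uniform(0, 1), eta(x) = x.
n, delta, reps = 500, 0.05, 5000
eps = np.sqrt(np.log(2 / delta) / (2 * n))
L_h = 0.25                                  # true risk L(h) for this illustrative model

violations = 0
for _ in range(reps):
    X = rng.uniform(0, 1, size=n)
    Y = rng.binomial(1, X)
    L_hat = np.mean((X > 0.5).astype(int) != Y)
    violations += int(abs(L_hat - L_h) > eps)
print(violations / reps)  # should be below delta = 0.05
```

The empirical violation frequency is typically far below \(\delta\), reflecting the slack in Hoeffding's inequality for this model.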
Theorem 4.2 If \(\mathcal{H} = \left\{ h_{1},h_{2},\ldots, h_{M} \right\}\), then \[L(\hat{h}_{n})-L(\bar{h}) \le \sqrt{ \frac{2 \log \left( 2M/\delta \right)}{n} }\] with probability at least \(1-\delta\), and \[E(L(\hat{h}_{n})-L(\bar{h})) \le \sqrt{ \frac{2 \log (2M)}{n}}.\]
Proof. From the definition of \(\hat{h}_{n}\), \(\hat{L}_{n}(\hat{h}_{n}) \le \hat{L}_{n}(\bar{h})\), i.e., \(\hat{L}_{n}(\hat{h}_{n})-\hat{L}_{n}(\bar{h}) \le 0\).
So
\[\begin{align*} L(\hat{h}_n)-L(\bar{h}) &=\hat{L}_n (\hat{h}_n) - \hat{L}_n(\bar{h})\\ &\quad+L(\hat{h}_n)-\hat{L}_n(\hat{h}_n)\\ &\quad+\hat{L}_n(\bar{h})-L(\bar{h})\\ &\le |L(\hat{h}_n)-\hat{L}_n(\hat{h}_n)|+|\hat{L}_n(\bar{h})-L(\bar{h})|\\ &\le 2 \max_{h \in \mathcal{H}} |L(h)-\hat{L}_n(h)| \end{align*}\]
For each \(h_{j} \in \mathcal{H}\), with probability at most \(\frac{\delta}{M}\),
\[|\hat{L}_{n}(h_{j})-L(h_{j})| \ge \sqrt{\frac{\log\left( \frac{2 M}{\delta} \right) }{2n}}\]
Note that the event
\[\max_{j\in\left\{1, \dots, M\right\}} |\hat{L}_{n}(h_{j})-L(h_{j})| \ge \sqrt{ \frac{\log \left( \frac{2M}{\delta} \right) }{2n} }\]
is the union of the events
\[E_{j}\triangleq\left\{|\hat{L}_{n} (h_{j})-L( h_{j}) | \ge \sqrt{ \frac{\log \left( \frac{2M}{\delta} \right) }{2n}}\right\}.\]
By the union bound,
\[P\left( \bigcup_{j=1}^{M} E_{j}\right) \le \sum_{j =1 }^{M} P(E_{j}) \le M\cdot\frac{\delta}{M}=\delta.\]
Therefore
\[P\left( \max_{j\in\left\{1, \ldots, M\right\}} |\hat{L}_{n}(h_{j})-L(h_{j})| < \sqrt{ \frac{\log \left(2M/\delta \right) }{2n} }\right) \ge 1-\delta \implies P\left(L(\hat{h}_{n})-L( \bar{h}) \le 2 \sqrt{\frac{\log\left( 2M/\delta \right) }{2n}} \right)\ge 1- \delta\]
which completes part 1.
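A Monte Carlo illustration of part 1, again with an arbitrary finite class of thresholds under \(X\sim\mathrm{Unif}(0,1)\), \(\eta(x)=x\): the maximal deviation \(\max_j|\hat{L}_n(h_j)-L(h_j)|\) rarely exceeds \(\sqrt{\log(2M/\delta)/(2n)}\).

```python
import numpy as np

rng = np.random.default_rng(5)

# Finite class: M threshold classifiers under X ~ Uniform(0, 1), eta(x) = x.
thresholds = np.linspace(0.05, 0.95, 19)    # M = 19
n, delta, reps = 500, 0.05, 2000
bound = np.sqrt(np.log(2 * len(thresholds) / delta) / (2 * n))

def true_risk(t):
    # L(h_t) for h_t(x) = 1[x > t]: integrate (1 - x) over (t, 1) and x over (0, t).
    return (1 - t) ** 2 / 2 + t ** 2 / 2

violations = 0
for _ in range(reps):
    X = rng.uniform(0, 1, size=n)
    Y = rng.binomial(1, X)
    devs = [abs(np.mean((X > t).astype(int) != Y) - true_risk(t)) for t in thresholds]
    violations += int(max(devs) > bound)
print(violations / reps)  # should be below delta = 0.05
```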
To bound \(L(\hat{h}_{n})-L(\bar{h})\) in expectation, we use a trick from probability theory which bounds the expectation of a maximum by a sum in a slightly more clever way. Here let \(Z_{j} \triangleq \hat{L}_{n}(h_{j})-L(h_{j})\), which are centered random variables. For any \(s>0\),
\[E\left( \max_{j\in\left\{1,\dots, M\right\}}|Z_{j}|\right)= \frac{1}{s} \log \exp\left( s\cdot E\left(\max_{j}|Z_{j}|\right)\right)\]
Recall Jensen’s Inequality:
\[f(E(X))\le E(f(X)) \text{ for } f \text{ convex}\]
So we have
\[\begin{align*} E\left(\max_{j\in\left\{1,\dots, M\right\}}|Z_{j}|\right) &\le \frac{1}{s} \log E\left( \exp \left(s \max_{j} |Z_{j}|\right)\right) \\ &= \frac{1}{s} \log E\left(\exp\left(s \max_{j\in\left\{1,\dots, 2M\right\}} Z_{j}\right)\right) \\ &\le \frac{1}{s} \log \sum_{j=1}^{2M} E\left(\exp(s Z_j)\right)\\ &\le \frac{1}{s} \log \sum_{j=1}^{2M} \exp\left( \frac{s^2}{8n} \right) \\ &= \frac{1}{s} \log \left(2M \exp\left(\frac{s^2}{8n}\right)\right) \\ &= \frac{\log(2M)}{s} +\frac{s}{8n}, \end{align*}\]
where the second line writes \(|Z_j| = \max(Z_j,-Z_j)\), so the maximum runs over the \(2M\) centered variables \(\{Z_1,\dots,Z_M,-Z_1,\dots,-Z_M\}\), and the fourth line uses Hoeffding's lemma: each \(Z_j\) is an average of \(n\) independent terms bounded in an interval of length \(1\), so \(E(\exp(s Z_j)) \le \exp(s^2/(8n))\), and the same holds for \(-Z_j\).
Let \(g(s) = \frac{\log(2M)}{s} +\frac{s}{8n}\). Then \(0=g'(s) = -\frac{\log (2M)}{s^2}+\frac{1}{8n}\) gives \(s = \sqrt{8n \log(2M)}\). Plugging in this \(s\), we obtain \[E\left(\max_{j\in\left\{1,\dots, M\right\}}|Z_{j}|\right) \le \sqrt{\frac{\log(2M)}{2n}}.\] Combining this with \(E(L(\hat{h}_{n})-L(\bar{h})) \le 2\, E\left(\max_{j}|Z_{j}|\right)\) from the first part of the proof yields \(E(L(\hat{h}_{n})-L(\bar{h})) \le \sqrt{\frac{2\log(2M)}{n}}\). \(\square\)
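A numerical check of the final maximal inequality, with the same illustrative threshold class: the Monte Carlo estimate of \(E(\max_j|Z_j|)\), \(Z_j=\hat{L}_n(h_j)-L(h_j)\), should sit below \(\sqrt{\log(2M)/(2n)}\).

```python
import numpy as np

rng = np.random.default_rng(6)

# Same illustrative setup: X ~ Uniform(0, 1), eta(x) = x, M threshold classifiers.
thresholds = np.linspace(0.05, 0.95, 19)    # M = 19
n, reps = 500, 2000
bound = np.sqrt(np.log(2 * len(thresholds)) / (2 * n))

def true_risk(t):
    # L(h_t) for h_t(x) = 1[x > t] under this model.
    return (1 - t) ** 2 / 2 + t ** 2 / 2

max_devs = []
for _ in range(reps):
    X = rng.uniform(0, 1, size=n)
    Y = rng.binomial(1, X)
    devs = [abs(np.mean((X > t).astype(int) != Y) - true_risk(t)) for t in thresholds]
    max_devs.append(max(devs))
print(np.mean(max_devs), bound)  # Monte Carlo E[max_j |Z_j|] vs. sqrt(log(2M)/(2n))
```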