Chapter 4 Binary Classification
(This chapter was scribed by Paul Barber. Proofread and polished by Baozhen Wang.)
In this chapter, we focus on analyzing a particular problem: binary classification. Focusing on binary classification is justified because
- It encompasses much of what we have to do in practice
- \(Y\) is bounded.
In particular, there are some nasty surprises lurking in multicategory classification, so we avoid more complicated general classification here.
4.1 Bayes Classification (Rule)
Suppose \((X_{1},Y_{1}) , \ldots , (X_{n}, Y_{n})\) are iid draws from \(P(X,Y)\), where \(X\) lies in some feature space and \(Y \in \left\{ 0,1\right\}\). Denote the marginal distribution of \(X\) by \(P_{X}\). The conditional distribution is \(Y\mid X = x \sim \text{Ber}(\eta (x))\), where \(\eta(x)\triangleq P(Y=1\mid X=x)=E(Y\mid X=x)\); \(\eta\) is sometimes called the regression function.
Next, we consider an optimal classifier that knows \(\eta\) perfectly, that is as if we had perfect access to the distribution of \(Y|X\).
Definition 4.1 (Bayes Classifier) \[h^{\ast}(x) = \begin{cases} 1 & \text{ if } \eta(x)> \frac{1}{2} \\ 0 & \text{ if } \eta(x) \le \frac{1}{2} \end{cases} .\]
In other words,
\[h^{\ast} (x)=1 \iff P(Y=1|X=x)>P(Y=0|X=x)\]
The performance metric of \(h\) is the classification error \(L(h)=P(h(X)\neq Y)\), i.e., the risk under \(0\text{-}1\) loss. The Bayes risk is \(L^{\ast}=L(h^{\ast})\), the classification error of \(h^{\ast}\).
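A minimal numerical sketch of the Bayes classifier and its error, assuming for illustration that \(X\sim \mathrm{Unif}(0,1)\) and \(\eta(x)=x\) (so that \(h^{\ast}(x)=\mathbb{1}[x>\frac{1}{2}]\) and the Bayes risk works out to \(1/4\)):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative model (an assumption, not from the notes): X ~ Uniform(0, 1)
# and eta(x) = P(Y = 1 | X = x) = x, so the Bayes rule is 1[x > 1/2].
def eta(x):
    return x

def bayes_classifier(x):
    return (eta(x) > 0.5).astype(int)

# Monte Carlo estimate of L(h*) = P(h*(X) != Y)
n = 200_000
X = rng.uniform(0, 1, size=n)
Y = rng.binomial(1, eta(X))
L_star_mc = np.mean(bayes_classifier(X) != Y)
print(L_star_mc)  # close to the exact Bayes risk 0.25 for this model
```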
Theorem 4.1 For any binary classifier \(h\), the following identity holds:
\[L(h)-L(h^{\ast}) = \int_{\left\{ h \neq h^{\ast} \right\} } |2 \eta(x)-1|\, dP_{X}(x) = E_{X \sim P_{X}} \left( |2 \eta(x) -1| \mathbb{1}[h(x) \neq h^{\ast}(x)] \right)\]
where \(\left\{h\neq h^{\ast}\right\}\triangleq\left\{x\in X:\ h(x)\neq h^{\ast}(x)\right\}\).
In particular, the integrand is non-negative, which implies \(L(h^{\ast}) \le L(h),\ \forall h\). Moreover,
\[L(h^{\ast}) = E_{X}[\text{min}(\eta(x),1-\eta(x))] \le \frac{1}{2}\]
Proof. We first establish the last claim. For any classifier \(h\),
\[\begin{align*} L(h) &= P(Y \neq h(X)) \\ &= P(Y=0, h(X)=1)+P(Y=1, h(X)=0) \\ &= E(\mathbb{1}[Y=0,h(X)=1])+E(\mathbb{1}[Y=1,h(X)=0]) \\ &= E_X\left\{E\left(\mathbb{1}[Y=0,h(X)=1]\mid X\right) \right\}+ E_X\left\{E\left(\mathbb{1}[Y=1,h(X)=0]\mid X\right) \right\} .\end{align*}\]
Now, conditionally on \(X=x\), \(h(x)\) is a fixed number, so
\[E\left(\mathbb{1}[Y=0,h(X)=1]\mid X=x\right)=\mathbb{1}[h(x)=1]\,P(Y=0\mid X=x)=\mathbb{1}[h(x)=1](1-\eta(x)),\]
and similarly \(E\left(\mathbb{1}[Y=1,h(X)=0]\mid X=x\right)=\mathbb{1}[h(x)=0]\,\eta(x)\).
Then \(\forall h\),
\[\begin{equation} L(h)= E_{X}\left[\mathbb{1}[h(x)=1] (1-\eta(x))\right]+ E_{X}\left[\mathbb{1}[h(x)=0] \eta(x)\right] \tag{4.1} \end{equation}\]
In particular, taking \(h=h^{\ast}\) and noting that \(h^{\ast}\) always picks the smaller of \(\eta(x)\) and \(1-\eta(x)\),
\[L(h^{\ast})=E[\mathbb{1}[h^{\ast}(x)=1](1-\eta(x))+\mathbb{1}[h^{\ast}(x)=0] \eta(x)] = E[\min(\eta(x),1-\eta(x))]\le \frac{1}{2}\]
Now apply (4.1) to both \(h\) and \(h^{\ast}\),
\[\begin{align*} L(h)-L(h^{\ast})&=E[\mathbb{1}[h(x)=1](1-\eta(x))+\mathbb{1}[h(x)=0]\eta(x)-\mathbb{1}[h^{\ast}(x)=1](1-\eta(x))-\mathbb{1}[h^{\ast}(x)=0] \eta(x)]\\ &=E\left\{[\mathbb{1}[h(x)=1]-\mathbb{1}[h^{\ast}(x)=1]](1-\eta(x)) + [\mathbb{1}[h^{\ast}(x)=0]-\mathbb{1}[h(x)=0]]\eta(x)\right\}\\ &=E\left\{[\mathbb{1}[h(x)=1]-\mathbb{1}[h^{\ast}(x)=1]](1-2\eta(x))\right\}. \end{align*}\] The difference \(\mathbb{1}[h(x)=1]-\mathbb{1}[h^{\ast}(x)=1]\) vanishes on \(\{h=h^{\ast}\}\), while on \(\{h\neq h^{\ast}\}\) it equals \[\begin{cases} 1 & \text{ if } \eta(x)\le\frac{1}{2} \\ -1 & \text{ if } \eta(x)>\frac{1}{2} \end{cases}\] which agrees with \(\text{sgn}(1-2\eta(x))\) except where the factor \(1-2\eta(x)\) is zero. Hence \[\begin{align*} L(h)-L(h^{\ast})&=E[\mathbb{1}[h^{\ast}(x)\neq h(x)]\cdot \text{sgn}(1-2\eta(x))(1-2\eta(x))]\\ &=E[\mathbb{1}[h^{\ast}(x)\neq h(x)]\cdot |1-2\eta(x)|]. \end{align*}\] This implies \(L(h) \ge L(h^{\ast})\). \(\square\)
Remarks:
- \(L(h)-L(h^{\ast})\) is the excess risk.
- \(L(h^{\ast})=\frac{1}{2}\iff\eta(x)=\frac{1}{2}\text{ a.s. }\iff\) \(X\) contains no useful information about \(Y\).
- Excess risk weights the discrepancy between \(h\) and \(h^{\ast}\) according to how far \(\eta(x)\) is from \(\frac{1}{2}\).
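The identity in Theorem 4.1 is easy to check by simulation. The sketch below again assumes, purely for illustration, \(X\sim\mathrm{Unif}(0,1)\) and \(\eta(x)=x\), and compares an arbitrary competitor \(h(x)=\mathbb{1}[x>0.3]\) against \(h^{\ast}\):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative model: X ~ Uniform(0, 1), eta(x) = x.
eta = lambda x: x
h_star = lambda x: (eta(x) > 0.5).astype(int)   # Bayes rule
h = lambda x: (x > 0.3).astype(int)             # an arbitrary competitor

n = 500_000
X = rng.uniform(0, 1, size=n)
Y = rng.binomial(1, eta(X))

# Left-hand side: L(h) - L(h*), estimated directly.
lhs = np.mean(h(X) != Y) - np.mean(h_star(X) != Y)

# Right-hand side: E[|2 eta(X) - 1| 1{h(X) != h*(X)}].
rhs = np.mean(np.abs(2 * eta(X) - 1) * (h(X) != h_star(X)))

print(lhs, rhs)
```

Both estimates should be close to the exact excess risk \(\int_{0.3}^{0.5}(1-2x)\,dx=0.04\) for this model.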
As noted earlier, LDA puts a model (distribution) on the data, e.g.
\[X| Y=y \sim N(\mu_{Y}, \Sigma)\]
Generally, one can compute the Bayes rule by applying Bayes' theorem:
\[\eta(x)=P(Y=1|X=x) = \frac{P(X|Y=1)P(Y=1)}{P(X|Y=1)P(Y=1)+P(X|Y=0)P(Y=0)}\]
where \(P(X|Y=y)\) is density or pmf. For \(\pi\triangleq P(Y=1)\), \(P_j\triangleq P(X|Y=j)\) :
\[\eta(x) = \frac{P_{1}(x) \pi}{P_{1}(x) \pi + P_{0}(x) (1-\pi) }\]
The Bayes rule is \(h^{\ast} (x) = \mathbb{1}[\eta (x) > \frac{1}{2}]\), and
\[\frac{P_{1}(x) \pi}{P_{1}(x) \pi + P_{0}(x)(1-\pi)} > \frac{1}{2} \iff \frac{\pi P_{1}(x)}{(1-\pi)P_{0}(x)} >1 \iff \frac{P_{1}(x)}{P_{0}(x)} > \frac{1-\pi}{\pi}\]
When \(\pi=\frac{1}{2}\), Bayes rule amounts to comparing \(P_{1}(x)\) with \(P_{0}(x)\).
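For a one-dimensional Gaussian model, the likelihood-ratio form of the Bayes rule is easy to write down explicitly. The sketch below uses arbitrary illustrative values of \(\mu_0,\mu_1,\sigma,\pi\) and checks that thresholding the likelihood ratio at \((1-\pi)/\pi\) agrees with thresholding \(\eta(x)\) at \(\frac{1}{2}\):

```python
import numpy as np

# Illustrative 1-D LDA-type setup (arbitrary choices): X | Y = y ~ N(mu_y, sigma^2).
mu0, mu1, sigma, pi1 = -1.0, 1.0, 1.0, 0.3   # pi1 = P(Y = 1), written pi in the notes

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def bayes_rule(x):
    # h*(x) = 1 iff the likelihood ratio P_1(x)/P_0(x) exceeds (1 - pi)/pi.
    lr = gauss_pdf(x, mu1, sigma) / gauss_pdf(x, mu0, sigma)
    return (lr > (1 - pi1) / pi1).astype(int)

def eta(x):
    # eta(x) computed via Bayes' theorem, as a sanity check against the rule above.
    p1, p0 = gauss_pdf(x, mu1, sigma), gauss_pdf(x, mu0, sigma)
    return p1 * pi1 / (p1 * pi1 + p0 * (1 - pi1))

x = np.linspace(-4, 4, 9)
assert np.array_equal(bayes_rule(x), (eta(x) > 0.5).astype(int))
print(bayes_rule(x))
```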
4.2 Empirical Risk Minimization (ERM)
Given data \(D_{n} = \left\{ (X_{1},Y_{1}) , \ldots, (X_{n},Y_{n}) \right\}\) we build a classifier \(\hat{h}_{n} (x)\) which is random in two ways:
- \(X\) is a random variable
- \(\hat{h}_{n}\) depends on the “random” data \(D_{n}\) explicitly.
Our performance metric is still \(L(\hat{h}_{n}) =P(\hat{h}_{n}(X) \neq Y)\). Although we have integrated out \((X,Y)\), \(L(\hat{h}_{n})\) still depends on the data \(D_{n}\). Since \(D_{n}\) is random, we will consider bounding both \(E(L(\hat{h}_{n})-L(h^{\ast}))\) and, with high probability, \(L(\hat{h}_{n})-L(h^{\ast})\).
4.2.1 Plug-in rule
Earlier we discussed two approaches to classification: generative vs. discriminative. The middle ground is the plug-in rule.
Recall \(h^{\ast}(x) = \mathbb{1}[\eta(x) > \frac{1}{2}]\). The plug-in approach estimates \(\eta(x)\) by \(\hat{\eta}_{n}(x)\) and plugs it into the Bayes rule to produce \(\hat{h}_{n}(x)\triangleq\mathbb{1}[\hat{\eta}_{n}(x) >\frac{1}{2}]\).
Many possibilities:
- If \(\eta(x)\) is smooth, use non-parametric regression to estimate \(E(Y|X=x)\):
- Nadaraya-Watson kernel regression
- \(k\)-nearest neighbor.
- Wavelets
- If \(\eta(x)\) has parametric form, logistic regression
\[\log\left( \frac{\eta(x)}{1-\eta(x)} \right) = x^{T} \beta\]
Logistic regression is widely used, performs well, and is easy to compute, but it is not our focus here.
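A minimal plug-in sketch, assuming for illustration \(X\sim\mathrm{Unif}(0,1)\), \(\eta(x)=x\), and a \(k\)-nearest-neighbor estimate of \(\eta\) (the sample size and \(k\) are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data: X ~ Uniform(0, 1), eta(x) = x, so the Bayes rule is 1[x > 1/2].
n, k = 2000, 51
X_train = rng.uniform(0, 1, size=n)
Y_train = rng.binomial(1, X_train)

def eta_hat(x):
    """k-NN estimate of eta(x): average Y over the k nearest training points."""
    idx = np.argsort(np.abs(X_train - x))[:k]
    return Y_train[idx].mean()

def h_hat(x):
    """Plug-in classifier: threshold the estimated regression function at 1/2."""
    return int(eta_hat(x) > 0.5)

print([h_hat(x) for x in (0.2, 0.4, 0.6, 0.8)])  # most likely [0, 0, 1, 1]
```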
4.3 Learning with finite hypothesis class \(\mathcal{H}\).
Recall the following definitions (a small numerical sketch follows the list):
- \(L(h) = P(Y \neq h(X))\).
- \(\hat{L}_{n}(h) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[Y_{i} \neq h(X_{i})]\).
- \(\hat{h}_{n} = \text{argmin}_{h \in \mathcal{H}} \hat{L}_{n}(h)\).
- \(\bar{h} = \text{argmin}_{h \in \mathcal{H}} L(h)\).
- Excess risk wrt \(\mathcal{H}\): \(L(\hat{h}_{n})-L(\bar{h})\).
- Excess risk: \(L (\hat{h}_{n})-L(h^{\ast})\)
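A small numerical sketch of these definitions, using an illustrative finite class of threshold classifiers \(h_t(x)=\mathbb{1}[x>t]\) under the arbitrary model \(X\sim\mathrm{Unif}(0,1)\), \(\eta(x)=x\):

```python
import numpy as np

rng = np.random.default_rng(3)

# Finite hypothesis class: M threshold classifiers h_t(x) = 1[x > t].
thresholds = np.linspace(0.1, 0.9, 9)      # M = 9

# Illustrative data: X ~ Uniform(0, 1), eta(x) = x; the Bayes rule is the t = 0.5 threshold.
n = 200
X = rng.uniform(0, 1, size=n)
Y = rng.binomial(1, X)

def empirical_risk(t):
    """L_hat_n(h_t): average 0-1 loss of the threshold classifier on the sample."""
    return np.mean((X > t).astype(int) != Y)

risks = np.array([empirical_risk(t) for t in thresholds])
t_hat = thresholds[np.argmin(risks)]       # threshold of the empirical risk minimizer h_hat_n
print(t_hat, risks.min())
```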
Ideally, we want to bound the excess risk \[ L(\hat{h}_{n}) -L(h^{\ast}) = \underbrace{[L(\hat{h}_{n})-L(\bar{h})]}_{\text{estimation error}}+\underbrace{[L(\bar{h})-L(h^{\ast})]}_{\text{approximation error}}. \] We focus on bounding the first (estimation) term.
Goal: \(P\left(L(\hat{h}_{n})-L(\bar{h}) \le \Delta_{n , \delta} (\mathcal{H})\right)\ge 1- \delta\). How? Try to bound \(|\hat{L}_{n}(h)-L(h)|\) for all \(h \in \mathcal{H}\). For a fixed \(h\),
\[\begin{align*} |\hat{L}_{n}(h)-L(h)| &= \left| \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[h(X_{i}) \neq Y_{i}] - E(\mathbb{1}[h(X) \neq Y]) \right| \\ &= | \bar{Z}-\mu|, \end{align*}\]
where \(Z_{i}\triangleq\mathbb{1}[h(X_{i})\neq Y_{i}]\) and \(\mu\triangleq E(Z_{i})=L(h)\). \(Z_{i} \in [0,1]\) a.s. by definition. Applying Hoeffding’s inequality (with \(a_i=0\), \(b_i=1\)), we have
\[P(|\bar{Z}-\mu| \le \epsilon) \ge 1- 2 \exp\left( - \frac{2n^2 \epsilon^2}{\sum_{i=1}^{n}(b_i-a_i)^2} \right) = 1- 2 \exp\left( -2n \epsilon^2\right)\]
Let \(\delta=2\exp(-2n\epsilon^2)\); then \(\epsilon=\sqrt{\frac{ \log\left( 2/\delta \right)}{2n}}\), so that for each fixed \(h\), \[|\hat{L}_{n}(h)-L(h)| \le \sqrt{\frac{ \log\left( 2/\delta \right)}{2n} }\] with probability at least \(1-\delta\).
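A quick Monte Carlo sanity check of this fixed-\(h\) Hoeffding bound, under the same illustrative model \(X\sim\mathrm{Unif}(0,1)\), \(\eta(x)=x\), with \(h(x)=\mathbb{1}[x>\frac{1}{2}]\):

```python
import numpy as np

rng = np.random.default_rng(4)

# Fixed classifier h(x) = 1[x > 0.5] under X ~ Uniform(0, 1), eta(x) = x.
n, delta, reps = 500, 0.05, 5000
eps = np.sqrt(np.log(2 / delta) / (2 * n))
L_h = 0.25                                  # true risk L(h) for this illustrative model

violations = 0
for _ in range(reps):
    X = rng.uniform(0, 1, size=n)
    Y = rng.binomial(1, X)
    L_hat = np.mean((X > 0.5).astype(int) != Y)
    violations += int(abs(L_hat - L_h) > eps)
print(violations / reps)  # should be below delta = 0.05
```

The empirical violation frequency is typically far below \(\delta\), reflecting the slack in Hoeffding's inequality for this model.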
Theorem 4.2 If \(\mathcal{H} = \left\{ h_{1},h_{2},\ldots, h_{M} \right\}\), then \[L(\hat{h}_{n})-L(\bar{h}) \le \sqrt{ \frac{2 \log \left( 2M/\delta \right)}{n} }\] with probability at least \(1-\delta\), and \[E(L(\hat{h}_{n})-L(\bar{h})) \le \sqrt{ \frac{2 \log (2M)}{n}}.\]
Proof. From the definition of \(\hat{h}_{n}\), \(\hat{L}_{n}(\hat{h}_{n}) \le \hat{L}_{n}(\bar{h})\), i.e., \(\hat{L}_{n}(\hat{h}_{n})-\hat{L}_{n}(\bar{h}) \le 0\).
So
\[\begin{align*} L(\hat{h}_n)-L(\bar{h}) &=\hat{L}_n (\hat{h}_n) - \hat{L}_n(\bar{h})\\ &\quad+L(\hat{h}_n)-\hat{L}_n(\hat{h}_n)\\ &\quad+\hat{L}_n(\bar{h})-L(\bar{h})\\ &\le |L(\hat{h}_n)-\hat{L}_n(\hat{h}_n)|+|\hat{L}_n(\bar{h})-L(\bar{h})|\\ &\le 2 \max_{h \in \mathcal{H}} |L(h)-\hat{L}_n(h)| \end{align*}\]
For each \(h_{j} \in \mathcal{H}\), with probability at most \(\frac{\delta}{M}\),
\[|\hat{L}_{n}(h_{j})-L(h_{j})| \ge \sqrt{\frac{\log\left( \frac{2 M}{\delta} \right) }{2n}}\]
Note that the event
\[\max_{j\in\left\{1, \dots, M\right\}} |\hat{L}_{n}(h_{j})-L(h_{j})| \ge \sqrt{ \frac{\log \left( \frac{2M}{\delta} \right) }{2n} }\]
is the union of the events
\[E_{j}\triangleq\left\{|\hat{L}_{n} (h_{j})-L( h_{j}) | \ge \sqrt{ \frac{\log \left( \frac{2M}{\delta} \right) }{2n}}\right\}.\]
By the union bound,
\[P\left( \bigcup_{j=1}^{M} E_{j}\right) \le \sum_{j =1 }^{M} P(E_{j}) \le M\cdot\frac{\delta}{M}=\delta.\]
Therefore
\[P\left( \max_{j\in\left\{1, \ldots, M\right\}} |\hat{L}_{n}(h_{j})-L(h_{j})| < \sqrt{ \frac{\log \left(2M/\delta \right) }{2n} }\right) \ge 1-\delta \implies P\left(L(\hat{h}_{n})-L( \bar{h}) \le 2 \sqrt{\frac{\log\left( 2M/\delta \right) }{2n}} \right)\ge 1- \delta\]
which completes part 1.
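A Monte Carlo illustration of part 1, again with an arbitrary finite class of thresholds under \(X\sim\mathrm{Unif}(0,1)\), \(\eta(x)=x\): the maximal deviation \(\max_j|\hat{L}_n(h_j)-L(h_j)|\) rarely exceeds \(\sqrt{\log(2M/\delta)/(2n)}\).

```python
import numpy as np

rng = np.random.default_rng(5)

# Finite class: M threshold classifiers under X ~ Uniform(0, 1), eta(x) = x.
thresholds = np.linspace(0.05, 0.95, 19)    # M = 19
n, delta, reps = 500, 0.05, 2000
bound = np.sqrt(np.log(2 * len(thresholds) / delta) / (2 * n))

def true_risk(t):
    # L(h_t) for h_t(x) = 1[x > t]: integrate (1 - x) over (t, 1) and x over (0, t).
    return (1 - t) ** 2 / 2 + t ** 2 / 2

violations = 0
for _ in range(reps):
    X = rng.uniform(0, 1, size=n)
    Y = rng.binomial(1, X)
    devs = [abs(np.mean((X > t).astype(int) != Y) - true_risk(t)) for t in thresholds]
    violations += int(max(devs) > bound)
print(violations / reps)  # should be below delta = 0.05
```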
To bound \(L(\hat{h}_{n})-L(\bar{h})\) in expectation, we use a trick from probability theory which bounds the expectation of a maximum by a sum in a slightly more clever way. Here let \(Z_{j} \triangleq \hat{L}_{n}(h_{j})-L(h_{j})\), which are centered random variables. For any \(s>0\),
\[E\left( \max_{j\in\left\{1,\dots, M\right\}}|Z_{j}|\right)= \frac{1}{s} \log \exp\left( s\cdot E\left(\max_{j}|Z_{j}|\right)\right)\]
Recall Jensen’s Inequality:
\[f(E(X))\le E(f(X)) \text{ for } f \text{ convex}\]
So we have
\[\begin{align*} E\left(\max_{j\in\left\{1,\dots, M\right\}}|Z_{j}|\right) &\le \frac{1}{s} \log E\left( \exp \left(s \max_{j} |Z_{j}|\right)\right) \\ &= \frac{1}{s} \log E\left(\exp\left(s \max_{j\in\left\{1,\dots, 2M\right\}} Z_{j}\right)\right) \\ &\le \frac{1}{s} \log \sum_{j=1}^{2M} E\left(\exp(s Z_j)\right)\\ &\le \frac{1}{s} \log \sum_{j=1}^{2M} \exp\left( \frac{s^2}{8n} \right) \\ &= \frac{1}{s} \log \left(2M \exp\left(\frac{s^2}{8n}\right)\right) \\ &= \frac{\log(2M)}{s} +\frac{s}{8n}, \end{align*}\]
where the second line writes \(|Z_j| = \max(Z_j,-Z_j)\), so the maximum runs over the \(2M\) centered variables \(\{Z_1,\dots,Z_M,-Z_1,\dots,-Z_M\}\), and the fourth line uses Hoeffding's lemma: each \(Z_j\) is an average of \(n\) independent terms bounded in an interval of length \(1\), so \(E(\exp(s Z_j)) \le \exp(s^2/(8n))\), and the same holds for \(-Z_j\).
Let \(g(s) = \frac{\log(2M)}{s} +\frac{s}{8n}\). Then \(0=g'(s) = -\frac{\log (2M)}{s^2}+\frac{1}{8n}\) gives \(s = \sqrt{8n \log(2M)}\). Plugging in this \(s\), we obtain \[E\left(\max_{j\in\left\{1,\dots, M\right\}}|Z_{j}|\right) \le \sqrt{\frac{\log(2M)}{2n}}.\] Combining this with \(E(L(\hat{h}_{n})-L(\bar{h})) \le 2\, E\left(\max_{j}|Z_{j}|\right)\) from the first part of the proof yields \(E(L(\hat{h}_{n})-L(\bar{h})) \le \sqrt{\frac{2\log(2M)}{n}}\). \(\square\)
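A numerical check of the final maximal inequality, with the same illustrative threshold class: the Monte Carlo estimate of \(E(\max_j|Z_j|)\), \(Z_j=\hat{L}_n(h_j)-L(h_j)\), should sit below \(\sqrt{\log(2M)/(2n)}\).

```python
import numpy as np

rng = np.random.default_rng(6)

# Same illustrative setup: X ~ Uniform(0, 1), eta(x) = x, M threshold classifiers.
thresholds = np.linspace(0.05, 0.95, 19)    # M = 19
n, reps = 500, 2000
bound = np.sqrt(np.log(2 * len(thresholds)) / (2 * n))

def true_risk(t):
    # L(h_t) for h_t(x) = 1[x > t] under this model.
    return (1 - t) ** 2 / 2 + t ** 2 / 2

max_devs = []
for _ in range(reps):
    X = rng.uniform(0, 1, size=n)
    Y = rng.binomial(1, X)
    devs = [abs(np.mean((X > t).astype(int) != Y) - true_risk(t)) for t in thresholds]
    max_devs.append(max(devs))
print(np.mean(max_devs), bound)  # Monte Carlo E[max_j |Z_j|] vs. sqrt(log(2M)/(2n))
```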