Chapter 4 Binary Classification
(This chapter was scribed by Paul Barber. Proofread and polished by Baozhen Wang.)
In this chapter, we focus on analyzing a particular problem: binary classification. This focus is justified because
- It encompasses much of what we have to do in practice
- Y is bounded.
In particular, there are some nasty surprises lurking in multicategory classification, so we avoid more complicated general classification here.
4.1 Bayes Classification (Rule)
Suppose (X_1,Y_1),\ldots,(X_n,Y_n) are i.i.d. from P_{(X,Y)}, where X takes values in some feature space \mathcal{X} and Y \in \{0,1\}. Denote the marginal distribution of X by P_X. The conditional distribution is Y|X=x \sim \mathrm{Ber}(\eta(x)), where \eta(x) \triangleq P(Y=1|X=x) = E(Y|X=x). \eta is sometimes called the regression function.
Next, we consider an optimal classifier that knows \eta perfectly, that is, as if we had perfect access to the distribution of Y|X.
Definition 4.1 (Bayes Classifier) h^{\ast}(x) = \begin{cases} 1 & \text{ if } \eta(x)> \frac{1}{2} \\ 0 & \text{ if } \eta(x) \le \frac{1}{2} \end{cases} .
In other words,
h^{\ast} (x)=1 \iff P(Y=1|X=x)>P(Y=0|X=x)
The performance metric of h is the classification error L(h)=P(h(X)\neq Y), i.e., the risk under 0-1 loss. The Bayes risk is L^{\ast}\triangleq L(h^{\ast}), the classification error of the Bayes classifier h^{\ast}.
Theorem 4.1 For any binary classifier h, the following identity holds:
L(h)-L(h^{\ast}) = \int_{\left\{ h \neq h^{\ast} \right\} } |2 \eta(x)-1| \, dP_{X}(x) = E_{X \sim P_{X}} \left( |2 \eta(x) -1| \mathbb{1}[h(x) \neq h^{\ast}(x)] \right)
where \left\{h\neq h^{\ast}\right\}\triangleq\left\{x\in \mathcal{X}:\ h(x)\neq h^{\ast}(x)\right\}.
In particular, the integrand is non-negative, which implies L(h^{\ast}) \le L(h),\ \forall h. Moreover,
L(h^{\ast}) = E_{X}[\text{min}(\eta(x),1-\eta(x))] \le \frac{1}{2}
Proof. We begin with the last claim. For any classifier h,
\begin{align*} L(h) &= P(Y \neq h(X)) \\ &= P(Y=0, h(X)=1)+P(Y=1, h(X)=0) \\ &= E(\mathbb{1}[Y=0,h(X)=1])+E(\mathbb{1}[Y=1,h(X)=0]) \\ &= E_X\left\{E_{Y|X}\left( \mathbb{1}[Y=0,h(X)=1] \mid X\right) \right\}+ E_X\left\{E_{Y|X}\left( \mathbb{1}[Y=1,h(X)=0] \mid X \right)\right\} .\end{align*}
Now, h is measurable, so conditionally on X=x the value h(x) is fixed, and
E_{Y|X}\left( \mathbb{1}[Y=0,h(X)=1] \mid X=x\right) = \mathbb{1}[h(x)=1]\,P(Y=0|X=x) = \mathbb{1}[h(x)=1](1-\eta(x)),
and similarly E_{Y|X}\left( \mathbb{1}[Y=1,h(X)=0] \mid X=x\right) = \mathbb{1}[h(x)=0]\,\eta(x).
Then \forall h,
\begin{equation} L(h)= E_{X}\left[\mathbb{1}[h(x)=1] (1-\eta(x))\right]+ E_{X}\left[\mathbb{1}[h(x)=0] \eta(x)\right] \tag{4.1} \end{equation}
Applying this to h^{\ast}, and noting that h^{\ast}(x)=1 exactly when \eta(x)>\frac{1}{2},
L(h^{\ast})=E[\mathbb{1}[h^{\ast}(x)=1](1-\eta(x))+\mathbb{1}[h^{\ast}(x)=0] \eta(x)] = E[\min(\eta(x),1-\eta(x))]\le \frac{1}{2},
which proves the last claim.
Now apply (4.1) to both h and h^{\ast},
\begin{align*} L(h)-L(h^{\ast})&=E[\mathbb{1}[h(x)=1](1-\eta(x))+\mathbb{1}[h(x)=0]\eta(x)-\mathbb{1}[h^{\ast}(x)=1](1-\eta(x))-\mathbb{1}[h^{\ast}(x)=0] \eta(x)]\\ &=E\left\{[\mathbb{1}[h(x)=1]-\mathbb{1}[h^{\ast}(x)=1]](1-\eta(x)) + [\mathbb{1}[h(x)=0]-\mathbb{1}[h^{\ast}(x)=0]]\eta(x)\right\}\\ &=E\left\{[\mathbb{1}[h(x)=1]-\mathbb{1}[h^{\ast}(x)=1]]\cdot (1-2\eta(x))\right\}, \end{align*}
where the last step uses \mathbb{1}[h(x)=0]-\mathbb{1}[h^{\ast}(x)=0]=-\left[\mathbb{1}[h(x)=1]-\mathbb{1}[h^{\ast}(x)=1]\right]. On the event \left\{h(x)\neq h^{\ast}(x)\right\}, the difference \mathbb{1}[h(x)=1]-\mathbb{1}[h^{\ast}(x)=1] agrees with
\mathrm{sgn}(1-2\eta(x))=\begin{cases} 1 & \text{ if } \eta(x)<\frac{1}{2} \\ 0 & \text{ if } \eta(x) =\frac{1}{2} \\ -1 & \text{ if } \eta(x)>\frac{1}{2} \end{cases}
except possibly where \eta(x)=\frac{1}{2}, in which case the factor 1-2\eta(x) vanishes anyway. Hence
\begin{align*} L(h)-L(h^{\ast})&=E[\mathbb{1}[h^{\ast}(x)\neq h(x)]\cdot \mathrm{sgn}(1-2\eta(x))(1-2\eta(x))]\\ &=E[\mathbb{1}[h^{\ast}(x)\neq h(x)]\cdot |1-2\eta(x)|]. \end{align*} This implies L(h) \ge L(h^{\ast}).\ \square
Remarks:
- L(h)-L(h^{\ast}) is the excess risk.
- L(h^{\ast})=\frac{1}{2}\iff\eta(x)=\frac{1}{2}\text{ a.s. }\iff X contains no useful information about Y.
- Excess risk weights the discrepancy between h and h^{\ast} according to how far \eta(x) is from \frac{1}{2}.
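As a quick numerical sanity check of Theorem 4.1, one can simulate from a model where \eta is known and compare the empirical excess risk with the identity above. The following sketch is not part of the original notes: the logistic \eta, the standard normal marginal P_X, and the competing classifier h are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

eta = lambda x: 1 / (1 + np.exp(-3 * x))          # assumed regression function eta(x)
h_star = lambda x: (eta(x) > 0.5).astype(int)     # Bayes classifier h*(x) = 1[eta(x) > 1/2]
h = lambda x: (x > 0.4).astype(int)               # an arbitrary competing classifier

n = 1_000_000
X = rng.normal(size=n)                            # X ~ N(0, 1) (assumed marginal P_X)
Y = rng.binomial(1, eta(X))                       # Y | X = x ~ Ber(eta(x))

L_h = np.mean(h(X) != Y)                          # Monte Carlo estimate of L(h)
L_star = np.mean(h_star(X) != Y)                  # Monte Carlo estimate of L(h*)
rhs = np.mean(np.abs(2 * eta(X) - 1) * (h(X) != h_star(X)))   # E[|2 eta - 1| 1[h != h*]]

print(L_h - L_star, rhs)                                       # excess risk vs. identity
print(L_star, np.mean(np.minimum(eta(X), 1 - eta(X))))         # L(h*) vs. E[min(eta, 1-eta)]
```

Up to Monte Carlo error, each printed pair agrees, matching the identity and the expression for the Bayes risk.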
As noted earlier, a generative method such as LDA puts a model (distribution) on the data, e.g.
X| Y=y \sim N(\mu_{y}, \Sigma).
Generally, one can compute the Bayes rule by applying Bayes' theorem:
\eta(x)=P(Y=1|X=x) = \frac{P(X|Y=1)P(Y=1)}{P(X|Y=1)P(Y=1)+P(X|Y=0)P(Y=0)}
where P(X|Y=y) is a density or pmf. For \pi\triangleq P(Y=1) and P_j\triangleq P(X|Y=j):
\eta(x) = \frac{P_{1}(x) \pi}{P_{1}(x) \pi + P_{0}(x) (1-\pi) }
The Bayes rule is h^{\ast} (x) = \mathbb{1}[\eta (x) > \frac{1}{2}], and
\frac{P_{1}(x) \pi}{P_{1}(x) \pi + P_{0}(x)(1-\pi)} > \frac{1}{2} \iff \frac{\pi P_{1}(x)}{(1-\pi)P_{0}(x)} >1 \iff \frac{P_{1}(x)}{P_{0}(x)} > \frac{1-\pi}{\pi}.
When \pi=\frac{1}{2}, Bayes rule amounts to comparing P_{1}(x) with P_{0}(x).
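To make the likelihood-ratio form concrete, here is a minimal sketch with one-dimensional Gaussian class-conditional densities; the means, variance, and prior \pi below are illustrative assumptions, not values from the notes.

```python
import numpy as np
from scipy.stats import norm

# Assumed generative model: X | Y=1 ~ N(1, 1), X | Y=0 ~ N(-1, 1), pi = P(Y=1).
mu1, mu0, sigma, pi = 1.0, -1.0, 1.0, 0.3

def eta(x):
    """Regression function eta(x) = P(Y=1 | X=x) via Bayes' theorem."""
    num = pi * norm.pdf(x, loc=mu1, scale=sigma)
    den = num + (1 - pi) * norm.pdf(x, loc=mu0, scale=sigma)
    return num / den

def bayes_rule(x):
    """Classify as 1 iff the likelihood ratio P1(x)/P0(x) exceeds (1 - pi)/pi."""
    ratio = norm.pdf(x, loc=mu1, scale=sigma) / norm.pdf(x, loc=mu0, scale=sigma)
    return (ratio > (1 - pi) / pi).astype(int)

x = np.linspace(-4, 4, 9)
# Thresholding eta at 1/2 and the likelihood-ratio test give the same labels.
print(np.array_equal(bayes_rule(x), (eta(x) > 0.5).astype(int)))
```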
4.2 Empirical Risk Minimization (E.R.M.)
Given data D_{n} = \left\{ (X_{1},Y_{1}) , \ldots, (X_{n},Y_{n}) \right\} we build a classifier \hat{h}_{n} (x) which is random in two ways:
- X is a random variable
- \hat{h}_{n} depends on the random data D_{n} explicitly.
Our performance metric is still L(\hat{h}_{n}) =P(\hat{h}_{n}(X) \neq Y). Although we have integrated over (X,Y), L(\hat{h}_{n}) still depends on the data D_{n}. Since this is random, we will consider bounding both E(L(\hat{h}_{n})-L(h^{\ast})) and L(\hat{h}_{n})-L(h^{\ast}) with high probability.
4.2.1 Plug-in rule
Earlier we discussed two approaches to classification: generative vs. discriminative. The middle ground is the plug-in rule.
Recall h^{\ast}(x) = \mathbb{1}[\eta(x) > \frac{1}{2}]. The plug-in rule estimates \eta(x) by some \hat{\eta}_{n}(x) and then plugs this estimate into the Bayes rule to produce \hat{h}_{n}(x)\triangleq\mathbb{1}[\hat{\eta}_{n}(x) >\frac{1}{2}].
Many possibilities:
- If \eta(x) is smooth, use non-parametric regression to estimate E(Y|X=x), e.g.:
- Nadaraya–Watson kernel regression (see the sketch below)
- k-nearest neighbor.
- Wavelets
- If \eta(x) has a parametric form, use logistic regression:
\log\left( \frac{\eta(x)}{1-\eta(x)} \right) = x^{T} \beta
Logistic regression is widely used, performs well, and is easy to compute, but it is not our focus here.
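For illustration, here is a minimal sketch of the plug-in rule using a Nadaraya–Watson kernel regression estimate \hat{\eta}_{n}; the Gaussian kernel, the bandwidth, and the synthetic data below are arbitrary choices made only for this example.

```python
import numpy as np

def nadaraya_watson_eta(x_query, X_train, Y_train, bandwidth=0.3):
    """Kernel regression estimate of eta(x) = E[Y | X = x] with a Gaussian kernel."""
    diffs = (x_query[:, None] - X_train[None, :]) / bandwidth
    weights = np.exp(-0.5 * diffs**2)              # K((x - X_i)/h), unnormalized
    return weights @ Y_train / weights.sum(axis=1)

def plug_in_classifier(x_query, X_train, Y_train, bandwidth=0.3):
    """Plug-in rule: threshold the estimated regression function at 1/2."""
    return (nadaraya_watson_eta(x_query, X_train, Y_train, bandwidth) > 0.5).astype(int)

# Example usage on synthetic one-dimensional data with eta(x) = sigmoid(3x).
rng = np.random.default_rng(1)
X_train = rng.uniform(-2, 2, size=500)
Y_train = rng.binomial(1, 1 / (1 + np.exp(-3 * X_train)))
print(plug_in_classifier(np.array([-1.0, 0.0, 1.0]), X_train, Y_train))
```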
4.3 Learning with a finite hypothesis class \mathcal{H}
Recall the following definition:
- L(h) = P(Y \neq h(X)).
- \hat{L}_{n}(h) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[Y_{i} \neq h(X_{i})].
- \hat{h}_{n} = \text{argmin}_{h \in \mathcal{H}} \hat{L}_{n}(h) (the empirical risk minimizer; see the sketch after this list).
- \bar{h} = \text{argmin}_{h \in \mathcal{H}} L(h).
- Excess risk wrt \mathcal{H}: L(\hat{h}_{n})-L(\bar{h}).
- Excess risk: L (\hat{h}_{n})-L(h^{\ast})
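Before the analysis, here is a minimal sketch of ERM over a finite class, using a hypothetical family of threshold classifiers h_t(x)=\mathbb{1}[x>t]; the class, the sample size, and the data-generating model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Finite hypothesis class: threshold rules h_t(x) = 1[x > t], one per grid point.
thresholds = np.linspace(-2, 2, 21)               # M = 21 hypotheses

# Synthetic training data D_n (assumed model: X ~ N(0,1), eta(x) = sigmoid(3x)).
n = 200
X = rng.normal(size=n)
Y = rng.binomial(1, 1 / (1 + np.exp(-3 * X)))

# Empirical risk L_hat_n(h_t) for every t, and the empirical risk minimizer.
empirical_risks = np.array([np.mean((X > t).astype(int) != Y) for t in thresholds])
t_hat = thresholds[np.argmin(empirical_risks)]    # \hat{h}_n(x) = 1[x > t_hat]
print(t_hat, empirical_risks.min())
```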
Ideally, we want to bound the excess risk \begin{align*} L(\hat{h}_{n}) -L(h^{\ast}) &= [L(\hat{h}_{n})-L(\bar{h})]+[L(\bar{h})-L(h^{\ast})]. \end{align*}
Goal: P\left(L(\hat{h}_{n})-L(\bar{h}) \le \Delta_{n , \delta} (\mathcal{H})\right)\ge 1- \delta. How? Try to bound |\hat{L}_{n}(h)-L(h)| \ \forall h \in \mathcal{H}. But
\begin{align*} |\hat{L}_{n}(h)-L(h)| &= \left| \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[h(X_{i}) \neq Y_{i}] - E(\mathbb{1}[h(X_{1}) \neq Y_{1}]) \right| \\ &= | \bar{Z}-\mu| ,\end{align*}
where Z_{i} \triangleq \mathbb{1}[h(X_{i}) \neq Y_{i}] \in [0,1] a.s., \bar{Z}=\frac{1}{n}\sum_{i=1}^{n} Z_{i}, and \mu = E(Z_{1}) = L(h). Applying Hoeffding's inequality (with a_i=0, b_i=1), we have
P(|\bar{Z}-\mu| \le \epsilon) \ge 1- 2 \exp\left( - \frac{2n^2 \epsilon^2}{\sum_{i=1}^{n}(b_i-a_i)^2} \right) = 1- 2 \exp\left( -2n \epsilon^2\right)
Let \delta=2\exp(-2n\epsilon^2), so that \epsilon=\sqrt{\frac{ \log\left( 2/\delta \right)}{2n}}. Then, for each fixed h, |\bar{Z}-\mu| = |\hat{L}_{n}(h)-L(h)| \le \sqrt{\frac{ \log\left( 2/\delta \right)}{2n} } with probability at least 1-\delta. To control all h \in \mathcal{H} simultaneously, we need a union bound.
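For a concrete (purely illustrative) sense of scale: with n = 2000 and \delta = 0.05, \epsilon = \sqrt{\frac{\log(40)}{4000}} \approx 0.03, so a single fixed classifier's empirical error is within about three percentage points of its true error with probability at least 0.95.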
Theorem 4.2 If \mathcal{H} = \left\{ h_{1},h_{2},\ldots, h_{M} \right\} with |\mathcal{H}|=M, then L(\hat{h}_{n})-L(\bar{h}) \le \sqrt{ \frac{2 \log \left( 2M/\delta \right)}{n} } with probability at least 1-\delta, and E(L(\hat{h}_{n})-L(\bar{h})) \le \sqrt{ \frac{2 \log (2M)}{n}}.
Proof. From the definition of \hat{h}_{n}, \hat{L}_{n}(\hat{h}_{n}) \le \hat{L}_{n}(\bar{h}), which implies \hat{L}_{n}(\hat{h}_{n})-\hat{L}_{n}(\bar{h}) \le 0.
So
\begin{align*} L(\hat{h}_n)-L(\bar{h}) &=\underbrace{\hat{L}_n (\hat{h}_n) - \hat{L}_n(\bar{h})}_{\le 0}+\left[L(\hat{h}_n)-\hat{L}_n(\hat{h}_n)\right]+\left[\hat{L}_n(\bar{h})-L(\bar{h})\right]\\ &\le |L(\hat{h}_n)-\hat{L}_n(\hat{h}_n)|+|\hat{L}_n(\bar{h})-L(\bar{h})|\\ &\le 2 \max_{h \in \mathcal{H}} |\hat{L}_n(h)-L(h)|. \end{align*}
For each h_{j} \in \mathcal{H}, with probability at most \frac{\delta}{M},
|\hat{L}_{n}(h_{j})-L(h_{j})| \ge \sqrt{\frac{\log\left( \frac{2 M}{\delta} \right) }{2n}}.
Note that the event
\max_{j\in\left\{1, \dots, M\right\}} |\hat{L}_{n}(h_{j})-L(h_{j})| \ge \sqrt{ \frac{\log \left( \frac{2M}{\delta} \right) }{2n} }
is the union of the events
E_{j} \triangleq \left\{|\hat{L}_{n} (h_{j})-L( h_{j}) | \ge \sqrt{ \frac{\log \left( \frac{2M}{\delta} \right) }{2n}}\right\}, \qquad j=1,\ldots,M.
By the union bound,
P\left( \bigcup_{j=1}^{M} E_{j}\right) \le \sum_{j =1 }^{M} P(E_{j}) \le M\cdot\frac{\delta}{M}=\delta.
Therefore
P\left( \max_{j\in\left\{1, \ldots, M\right\}} |\hat{L}_{n}(h_{j})-L(h_{j})| < \sqrt{ \frac{\log \left(2M/\delta \right) }{2n} }\right) \ge 1-\delta \implies P\left(L(\hat{h}_{n})-L( \bar{h}) \le 2 \sqrt{\frac{\log\left( 2M/\delta \right) }{2n}} \right)\ge 1- \delta
which proves the first claim, since 2 \sqrt{\frac{\log\left( 2M/\delta \right) }{2n}} = \sqrt{\frac{2\log\left( 2M/\delta \right)}{n}}.
To bound L(\hat{h}_{n})-L(\bar{h}) in expectation, we use a trick from probability theory which bounds a maximum by a sum in a slightly more clever way. Let Z_{j} \triangleq \hat{L}_{n}(h_{j})-L(h_{j}) denote the centered random variables from above. For any s>0,
E\left( \max_{j\in\left\{1,\dots, M\right\}}|Z_{j}|\right)= \frac{1}{s} \log \exp\left( s\cdot E\left(\max_{j}|Z_{j}|\right)\right)
Recall Jensen’s Inequality:
f(E(X))\le E(f(X)) \text{ for } f \text{ convex}
So we have
\begin{align*} E\left(\max_{j\in\left\{1,\dots, M\right\}}|Z_{j}|\right) &\le \frac{1}{s} \log E\left( \exp \left(s \max_{j} |Z_{j}|\right)\right) \\ &= \frac{1}{s} \log E\left(\exp\left(s \max_{j\in\left\{1,\dots, 2M\right\}} Z_{j}\right)\right) && (\text{setting } Z_{M+j}\triangleq -Z_{j}) \\ &\le \frac{1}{s} \log E \sum_{j=1}^{2M} \exp(s\cdot Z_j)\\ &\le \frac{1}{s} \log \sum_{j=1}^{2M} \exp\left( \frac{s^2}{8n} \right) && (\text{Hoeffding's lemma: } E\, e^{s Z_j} \le e^{s^2/(8n)}) \\ &= \frac{1}{s} \log \left(2M \exp\left( \frac{s^2}{8n}\right)\right) \\ &= \frac{\log(2M)}{s} +\frac{s}{8n} .\end{align*}
Let g(s) = \frac{\log(2M)}{s} +\frac{s}{8n}. Then 0=g'(s) = -\frac{\log (2M)}{s^2}+\frac{1}{8n} implies s = \sqrt{8n \log(2M)}. Plugging in this s, we have E\left(\max_{j\in\left\{1,\dots, M\right\}}|Z_{j}|\right) \le \sqrt{\frac{\log(2M)}{2n}}, and hence E(L(\hat{h}_{n})-L(\bar{h})) \le 2\sqrt{\frac{\log(2M)}{2n}} = \sqrt{\frac{2\log(2M)}{n}}.\ \square
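As an optional sanity check of the uniform deviation bound behind Theorem 4.2, one can simulate repeated training sets and compare the average of \max_{j}|\hat{L}_{n}(h_{j})-L(h_{j})| with \sqrt{\log(2M)/(2n)}. The sketch below reuses the illustrative threshold class and data-generating model assumed earlier; the true risks are themselves approximated by one large Monte Carlo sample.

```python
import numpy as np

rng = np.random.default_rng(3)
eta = lambda x: 1 / (1 + np.exp(-3 * x))          # assumed regression function
thresholds = np.linspace(-2, 2, 21)               # finite class, M = 21

def risks(X, Y):
    """Classification error of every threshold rule h_t(x) = 1[x > t] on (X, Y)."""
    preds = X[None, :] > thresholds[:, None]       # shape (M, len(X))
    return np.mean(preds != Y[None, :], axis=1)

# Approximate the true risks L(h_j) with one very large sample.
X_big = rng.normal(size=2_000_000)
Y_big = rng.binomial(1, eta(X_big))
L_true = risks(X_big, Y_big)

# Average max_j |L_hat_n(h_j) - L(h_j)| over repeated training sets of size n.
n, reps = 500, 200
max_devs = []
for _ in range(reps):
    X = rng.normal(size=n)
    Y = rng.binomial(1, eta(X))
    max_devs.append(np.max(np.abs(risks(X, Y) - L_true)))

M = len(thresholds)
print(np.mean(max_devs), np.sqrt(np.log(2 * M) / (2 * n)))   # observed vs. bound
```

The observed average deviation should sit below the bound, consistent with the expectation bound used in the proof.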