Theory of Machine Learning
2022-05-26
Chapter 1 Introduction
This is a book of lecture notes scribed by students in the graduate course Math 605, Theory of Machine Learning, in Spring 2022.
(This chapter was scribed by Nicholas P Demarco.)
1.1 What is Machine Learning?
The main focus of this course will be to examine statistical learning theory, namely the mathematical theory behind machine learning. Roughly speaking, we shall examine this question:
how much data is needed in order to achieve a certain prediction accuracy?
In order to answer this question, we shall determine the desirable accuracy and use the theory developed to identify how much training data are needed.
Statistics and machine learning are obviously related to each other. Both rely upon using data to make decisions and inferences. However, the techniques used for estimation are what sets them apart. Note that statistics had already been a developed discipline by the 1920’s, which was much earlier than machine learning; the maximum likelihood estimation (MLE) is the primary tool used for statistical inference. However, in order to consider MLE’s, we are required to know the probability density function (PDF) for which the data is obtained from. In other words, when considering the likelihood function, we must know the PDF \(f(x)\) up to some parameter value \(\theta\). For instance, in the classical case from Math 502 where we consider an estimate for \(\theta\) from the data set \(X_1, X_2,..., X_n\) which are iid \(N(\theta, 1)\) for \(\theta \in \mathbb{R}\), we are able to maximize the associated likelihood function to find an estimator for \(\theta\).
The example above and implementing maximum likelihood estimation in general is an example of modeling our data using a known model. More recently however, new data types are found in fields such as social sciences and the life science which become more challenging to model since we don’t understand the data generating process. Therefore, we need to be able to consider distribution-free methods, which machine learning algorithms primarily concern. An example of this is the image classification problem in which we have some image as the input \(X\); through some “black-box” algorithm, we are able to provide a label of the image as the output \(Y\).
The difference between the two fields becomes smaller and smaller: statistics concerns more finite-sample properties rather than asymptotics than it used to, it embraces mis-specification, and becomes more computationally heavy. On the other hand, machine learning has adapted more and more probabilistic modeling. At their intersection lies statistical learning theory, which is the focus of our course.
1.2 Supervised Learning
Consider the data set \(D=\{ (X_i, Y_i), i=1,...,n\}\) where \(X_i\) is the feature (input) and \(Y_i\) is the label/response (output). The main goal is to use \(D\) to learn a function \(h: \mathcal{X}\to \mathcal{Y}\) where \(\mathcal{X}\) is the input space and \(\mathcal{Y}\) is the label space. Given two such functions \(h_1\) and \(h_2\), which function is better? To answer this question, we shall define a loss function \(\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}\). Typically, \(\ell(\hat{y}, y) \geq 0\) (\(\hat{y}=h(x)\) is the predicted label and \(y\) is the true label). Thus the loss function of a model \(h\) on space \((\mathcal{X},\mathcal{Y})\) is \(\ell (h(X),Y)\) which is a random quantity even when \(h\) is given, since \(X\) and \(Y\) are drawn from a population with a probability distribution.
Ultimately, we wish to find a model \(h\): \[h=\text{argmin}_h L(h)\triangleq E[\ell(h(X),Y)]\] where \((X,Y) \sim P\) for some distribution \(P\) and \(L(h)\triangleq E[\ell(h(x),y)]\) is known as the risk function with respect to the loss function \(\ell\).
One common example for \(\ell\) is the squared error loss \(\ell (\hat{y},y) = (\hat{y}-y)^2\) in repression analysis. In classification, we can use the 0-1 loss \(\ell (\hat{y},y)= \mathbb{1}[\hat{y} \neq y]\).
Hypothesis Class
In practice, we don’t have a way to optimize over arbitrary function. So, we must constrain the set of functions instead. Let’s call this set \(\mathcal{H}\), which is known as the hypothesis family or class. Often, each \(h \in \mathcal{H}\) can be parameterized by a \(\theta\) so each \(h\) is uniquely identified with a parameter \(\theta\). In other words, we can view \(\ell\) as being a function of \(x,y,\) and \(\theta\) rather than of \(h(x)\) and \(y\).
Given \(h\) and \(\mathcal{H}\), we define the excess risk of \(h\) with respect to \(\mathcal{H}\) as \[\mathcal{E}_\mathcal{H} = L(h)- \inf_{h \in \mathcal{H}} L(h).\] Excess risk measures how much worse we do compared to the optimal \(h\). Note that the \(\inf_{h \in \mathcal{H}} L(h)\) is the best performance for risk over all \(h \in \mathcal{H}\). In general, we can define \(\mathcal{E}\) without \(\mathcal{H}\) as \[\mathcal{E} = L(h)-\inf_{h(\cdot)} L(h)\] for any \(h\). Now \(\mathcal{E}\) measures how much worse we do with a given \(h\) compared to the optimal and \(\inf_{h(\cdot)} L(h)\) the best performance for risk over all arbitrary \(h\).
Empirical Risk Minimization
In practice, we can’t minimize \(L(h)\) since we don’t know the distribution \(p\) to compute \(E(\ell(x,y,h))\) under \((X,Y) \sim p\). Instead, we have training data \(D=\{(X_i, Y_i)\}\) which are iid copies of \((X,Y)\sim p\) and can estimate \(L(h)\). We define empirical risk as \[\hat{L}_n (h _\theta) = \frac{1}{n} \sum_{i=1}^n \ell(x_i, y_i , \theta)\] which is the sample mean of the loss function evaluated at each data point in the training data. ERM is the method of finding \(h_\theta\) to minimize \(\hat{L}_n (h _ \theta)\) which we’ll denote as \(\hat{\theta} = \text{argmin}_{\theta \in \Theta} \hat{L}_n (h_\theta)\) for some parameter set \(\Theta\). Since \((X_i, Y_i)\) and \((X,Y)\) are iid, it is obvious that \(E[\hat{L}_n (h)]=L(h)\).
However, is this a good estimator? What guarantee do we have on the risk for \(\hat{\theta}\)? We hope that \(\min_\theta \hat{L}_n(h_\theta) \approx \min L(h_{\hat{\theta}})\). We can show this by showing that excess risk is bounded by a term that eventually goes to zero as \(n \to \infty\).
1.3 Binary Classifications
Let \((X_i, Y_i)\) be for \(i=1,...,n\) be independent copies of \((X,Y)\) where \((X,Y) \in X \in \{0,1\}\). Under 0-1 loss, risk is \[L(h)= E[ \mathbb{1}[h(X) \neq Y]]=P(h(X) \neq Y)\] which is the probability using class \(h\) we make a misclassification.
Remarks:
Given \(x\), \(Y|X=x\) has a Bernoulli distribution with probability \(\eta (x)=P(Y=1|X=x)\); \(\eta(x)\) is known as the regression function.
If \(\eta (x) = \frac{1}{2}\) then \(Y\) is likely 1 as 0. This also corresponds to the boundary of the input space.
If \(\eta(x) \approx 1\) then \(Y\) is very likely 1 where as if \(\eta(x) \approx 0\), \(Y\) is very likely 0.
The value of \(\eta(x)\) contains information about the distribution of \(Y\).
Define the Bayes optimal rule as \(h^*(x)=\mathbb{1}[\eta(x)\geq \frac{1}{2}]\). Actually, we can prove that \(h^*= \text{argmin}_{h(\cdot)} L(h)\), i.e. \(L(h^*) = \inf_{h(\cdot )} L(h)\) known as the bays risk. We cannot compute Bays rule in practice because we don’t know the underling distribution \((X,Y)\) which is needed to compute \(\eta\).
But, we can compute \(\hat{h} = \text{argmin}_{h \in S} \hat{L}_n(h)\) where \(\hat{L}_n (h) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}[h(x_i) \neq y_i]\) and \(S\) is a hypothesis class to by defined. If \(S\) can be anything, the just let \[\hat{h} (x)= \begin{cases} y_i & x=x_i, (x_i, y_i)\in D \\ 0 & \text{Otherwise} \end{cases}.\]
Then \(\hat{L}_n (\hat{h}) =0\) but \(L(\hat{h}) \neq 0\) so a constrained hypothesis class is needed. When \(\hat{h}\) is constructed from the data, then \(L(\hat{h})= E_{(X,Y)}[ \ell( \hat{h}(X), Y)]\) is random due to \(\hat{h}\) is random while \(X\) and \(Y\) are no longer random. However, when conditioned to the data \(D\), \(L(\hat{h})=E[ \ell (\hat{h(X)},Y)|D ]\), \(\hat{h}\) is fixed.
How to restrict the hypothesis class \(\mathcal{H}\)? We have two schools of thought which examine this question.
Generative School: Restrict set of candidate distribution function; e.g. LDA, QDA, \(X|Y=k \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})\).
Discriminative School: Restriction on \(\mathcal{H}\) directly not on the distribution; SVM, linear classifier, kernel, ridge regression (\(h(\boldsymbol{x}) = \boldsymbol{x}^T \boldsymbol{\beta}+ \boldsymbol{b}\) such that \(\|\boldsymbol{\beta}\|^2 \leq t^2\).
1.4 Estimation and Approximation
Given a hypothesis class, we can decompose \(\mathcal{E}(\hat{h}_n)\) as \[\mathcal{E}(\hat{h}_n) = L(\hat{h}_n)- L(h^*)=L(\hat{h}_n)-\inf_{h \in \mathcal{H}} L(h) + \inf_{h \in \mathcal{H}} L(h) -L(h^*).\] We call \(EE=L(\hat{h}_n)-\inf_{h \in \mathcal{H}} L(h)\) the estimation error and \(AE=\inf_{h \in \mathcal{H}} L(h) -L(h^*)\) the approximation error. Note that excess risk contains estimation error because we use finite samples in our analysis. Additionally, we saw that if \(\mathcal{H}\) is too large then the estimation error will also be large. The size of \(\mathcal{H}\) contributes to the approximation error since if \(\mathcal{H}\) is too small, then it is hard to find \(h \in \mathcal{H}\) with \(L(h) \approx L(h^*)\). However, if \(\mathcal{H}\) is large enough, then there will be no approximation error. Our main focus in this course will be to examine the estimation error since when \(h\) is fixed, approximation error is a constant and estimation error will vary.
The trade-off in both \(AE\) and \(EE\) is to let \(\mathcal{H}\) grow with \(n\). For now, let’s fix \(\mathcal{H}\) and study the "\(n\)-estimate relationship. Some tools for approaching such analysis are concentration inequality, Hoeffding’s Inequalities, and Bernstein’s Inequalities. Analyzing these allow us to bound \(|\hat{L}_n(h)-L(h)|\) with respect to \(h\) and \(p\). More generally, we’ll examine how close \(\frac{1}{n} \sum_{i=1}^n X_i\) is to \(E(X_i)\).
The common idea in this analysis is to recall that \(\hat{h}_n = \text{argmin}_{h \in \mathcal{H}} \hat{L}_n(h)\) which implies that \(\hat{L}_n(\hat{h}_n) \leq \hat{L}_n(h)\) for all \(h \in \mathcal{H}\). We define \(\overline{h} = \text{argmin}_{h \in \mathcal{H}} L(h)\). Note that we can estimate \(\overline{h}\) with \(\hat{h}\) and that \(\hat{L}_n(\hat{h}_n) \leq \hat{L}_n ( \overline{h})\). So \[EE=L(\hat{h}_n)-L(\overline{h})=L(\hat{h}_n)-\inf_{h \in \mathcal{H}} L(h) L(h) = L(\hat{h}_n)-L(\overline{h}).\]
However, how can we bound this quantity? With algebraic manipulations and the use of the Triangle Inequality, we can obtain that
\[\begin{align*} L(\hat{h}_n)-L(\overline{h}) &= \hat{L}_n(\hat{h}_n) - \hat{L}_n(\overline{h}) + L(\hat{h}_n)-\hat{L}_n(\hat{h}_n)+L(\overline{h})-L(\overline{h})\\ &\leq 0 + |L(\hat{h}_n-\hat{L}_n(\hat{h}_n))| + |\hat{L}_n(\overline{h})-L(\overline{h})|\\ &= |L(\hat{h}_n-\hat{L}_n(\hat{h}_n))| + |\hat{L}_n(\overline{h})-L(\overline{h})|. \end{align*}\]
Recall that \(\hat{L}_n(\hat{h}_n) = \frac{1}{n} \sum_{i=1}^n \ell(\hat{h}_n(x_i)y_i)\) and \(\hat{h}\) is random and depends upon the data set \(D\). So, the quantity on the left is hard to bound because \(\hat{L}_n(\hat{h}_n)\) is an average but not of iid random variables. But, the quantity of the right is easy to bound since \(\overline{h}\) is fixed and the estimator \(\hat{L}_n(\overline{h})\) is a sample mean of iid random variables. However, since they are of similar form, we can overcome this difficult by “supping-out” and obtain that \(EE \leq 2 \sup_{h \in \mathcal{H}} |L(h)- \hat{L}_n(h)|\) which is a uniform bound for both quantities. Note that the suprememum is small if \(\mathcal{H}\) is small but \(\mathcal{H}\) should be large enough for the sake of approximation error.