Chapter 16 Convex optimization and gradient descent
(This chapter was scribed by Zhou Wang.)
In the next few chapters, we will discuss the basics of convex optimization as it applies to machine learning.
16.1 Introduction to convex problems
The central objects of our studies are convex functions and convex sets in \(\mathbb{R}^d\).
Definition 16.1 (Convex sets and convex functions) A set \(\mathcal{C}\subset\mathbb{R}^d\) is said to be convex if it contains all of its segments, that is \[\forall~(\boldsymbol x,\boldsymbol y,\gamma)\in\mathcal{C}\times\mathcal{C}\times[0,1], ~(1-\gamma)\boldsymbol x+\gamma \boldsymbol y\in\mathcal{C}.\] A function \(f:\mathcal{C}\to\mathbb{R}\) is said to be convex if it always lies below its chords, that is \[\forall~(\boldsymbol x,\boldsymbol y,\gamma)\in\mathcal{C}\times\mathcal{C}\times[0,1], ~f\left((1-\gamma)\boldsymbol x+\gamma \boldsymbol y\right)\leq(1-\gamma)f(\boldsymbol x)+\gamma f(\boldsymbol y).\]
We are interested in algorithms that take as input a convex set \(\mathcal{C}\) and a convex function \(f\), and return the minimizer of \(f\) over \(\mathcal{C}\), i.e. \(\min f(\boldsymbol x), ~\mbox{s.t.}~ \boldsymbol x\in\mathcal{C}\). Many convex problems in machine learning take the form: \[\min\limits_{\boldsymbol x\in\mathbb{R}^d}\sum\limits_{i=1}^nf_i(\boldsymbol x)+\lambda\mathcal{R}(\boldsymbol x)\] or rather \(\min\limits_{\boldsymbol\beta\in\mathbb{R}^d}\sum\limits_{i=1}^n f_i(\boldsymbol \beta)+\lambda\mathcal{R}(\boldsymbol \beta)\), where \(f_1, \cdots, f_n\) are convex “cost” functions and \(\mathcal{R}\) is a convex regularization. For example:
- SVM: \(f_i(\boldsymbol\beta)=\max(0, 1-y_i\boldsymbol x_i^\top\boldsymbol\beta)\) and \(\mathcal{R}(\boldsymbol\beta)=\|\boldsymbol\beta\|^2_2\).
- Regularized logistic regression: \(f_i(\boldsymbol \beta)=\log\left(1+\exp(-y_i\boldsymbol x_i^\top\boldsymbol\beta)\right)\) and \(\mathcal{R}(\boldsymbol\beta)=\|\boldsymbol\beta\|^2_2\).
- Linear regression: \(f_i(\boldsymbol\beta)=(y_i-\boldsymbol x_i^\top\boldsymbol\beta)^2\) and
- without regularization: \(\mathcal{R}(\boldsymbol\beta)=0\),
- ridge regression: \(\mathcal{R}(\boldsymbol\beta)=\|\boldsymbol\beta\|^2_2\),
- lasso penalty: \(\mathcal{R}(\boldsymbol\beta)=\|\boldsymbol\beta\|_1\).
We now introduce the key notion of subgradient.
Definition 16.2 (Subgradient) Let \(\mathcal{C}\subset\mathbb{R}^d, f: \mathcal{C}\to \mathbb{R}\). Then \(\boldsymbol g\in\mathbb{R}^d\) is a subgradient of \(f\) at \(\boldsymbol x\in\mathcal{C}\) if for any \(\boldsymbol y\in\mathcal{C}\), one has \[f(\boldsymbol x)-f(\boldsymbol y)\leq \boldsymbol g^\top(\boldsymbol x-\boldsymbol y).\] The set of all such vectors \(\boldsymbol g\) is denoted by \(\partial f(\boldsymbol x)\).
Subgradient essentially corresponds to gradients. But unlike gradient, they always exist for convex functions, even when they are not differentiable.
Theorem 16.1
If \(\mathcal{C}\) is convex, then “\(f: \mathcal{C}\to \mathbb{R}\) is convex” concludes “\(\partial f(\boldsymbol x)\neq \emptyset, \forall~\boldsymbol x\in\mbox{int}(\mathcal{C})\).”
If \(\mathcal{C}\) is convex, then “\(\partial f(\boldsymbol x)\neq \emptyset, \forall~\boldsymbol x\in\mbox{int}(\mathcal{C})\)” concludes “\(f: \mathcal{C}\to \mathbb{R}\) is convex.”
If \(f\) is convex and differentiable at \(\boldsymbol x\), then \(\nabla f(\boldsymbol x)\in\partial f(\boldsymbol x)\).
The proof of the theorem is omitted. Note that it requires the “Supporting Hyperplane Theorem,” which states that for convex \(\mathcal{C}\) and \(\boldsymbol x_0\in\partial\mathcal{C}\) (\(\partial\mathcal{C}\) denotes the boundary of set \(\mathcal{C}\)), \(\exists~ \boldsymbol w\in\mathbb{R}^d\), s.t. \(\boldsymbol w^\top\boldsymbol x\geq \boldsymbol w^\top \boldsymbol x_0, ~\forall~\boldsymbol x\in\mathcal{C}\).
Why convexity? The key to the success in minimizing convex function is that they exhibit a “local to global” phenomenon. See below for an example.
Theorem 16.2 (Local minima are global minima) Let \(f\) be convex and \(\mathcal{C}\) be a convex set. If \(\boldsymbol x\) is a local minimizer of \(f\) on \(\mathcal{C}\), then it is also a global minimizer. Furthermore, this happens iff \(~\boldsymbol 0\in\partial f(\boldsymbol x)\).
Proof. If \(\boldsymbol 0\in\partial f(\boldsymbol x)\), then due to the definition of subgradient, \(f(\boldsymbol x)-f(\boldsymbol y)\leq \boldsymbol 0^\top(\boldsymbol x-\boldsymbol y)=0, \forall \boldsymbol y\in\mathcal{C}\). This implies \(\boldsymbol x\) is a global minimizer. The reverse is clearly true (again due to the definition of subgradient).
Next, suppose \(\boldsymbol x\) is a local minimizer, then for any \(y\in \mathcal{C}\), there exists \(\varepsilon\) small enough s.t. \[\begin{aligned} f(\boldsymbol x)&\leq f((1-\varepsilon)\boldsymbol x+\varepsilon\boldsymbol y) ~~~~~f(\boldsymbol x) ~\mbox{is smaller than its neighborhood}\\ &\leq (1-\varepsilon) f(\boldsymbol x)+\varepsilon f(\boldsymbol y) ~~\mbox{due to}~ f~\mbox{is convex}, \end{aligned}\] which yields \(\varepsilon f(\boldsymbol x)\leq\varepsilon f(\boldsymbol y)\) by manipulating the inequality and hence \(f(\boldsymbol x)\leq f(\boldsymbol y) ~\forall y\in\mathcal{C}\).
The theorem also tells us where the minimum can be. If \(\boldsymbol g^\top(\boldsymbol x-\boldsymbol y)<0\), where \(\boldsymbol g\) is a subgradient at \(\boldsymbol x\), then \(f(\boldsymbol x)-f(\boldsymbol y)=\boldsymbol g^\top(\boldsymbol x-\boldsymbol y)<0\), and hence it is impossible for \(\boldsymbol y\) to be the minimizer. Therefore we should narrow down our search to only those \(\boldsymbol y\) with \(\boldsymbol g^\top(\boldsymbol x-\boldsymbol y)\geq 0\), i.e. \(\boldsymbol g^\top \underbrace{(\boldsymbol y-\boldsymbol x)}_{\Delta \boldsymbol x}\leq0\). This concept leads to the idea of gradient descent, in which we choose \(\Delta x=-\boldsymbol g\).
16.2 Gradient descent
We will investigate variants of the gradient descent algorithm. It is the simplest strategy to minimize a differentiable function on \(\mathbb{R}^d\), starting at some initial points \(\boldsymbol x\in\mathbb{R}^d\). It iterates the following equation, e.g. \(\boldsymbol x_{t+1}=\boldsymbol x_t-\eta\cdot \nabla f(\boldsymbol x_t)\), where \(\eta>0\) is a fixed step size. The rationale behind is to make a small step in the direction that minimizes the local first order Taylor approximation of \(f\) at \(\boldsymbol x\) (a.k.a. the steepest descent direction). The returned minimizer can either be \(\bar{\boldsymbol x}=\frac{1}{k}\sum\limits_{s=1}^k\boldsymbol x_s\) or \(\boldsymbol x^0\in \mathop{\mathrm{\mathop{\mathrm{argmin}}\limits}}\limits_{\boldsymbol x\in \{\boldsymbol x_s\}_{s=1}^k}f(\boldsymbol x)\).
The theorem below provides bounds on the optimization error of the gradient descent algorithm.
Theorem 16.3 Let \(f\) be a convex \(L\)-Lipschitz function on \(\mathbb{R}^d\) such that \(\boldsymbol x^*\in\mathop{\mathrm{\mathop{\mathrm{argmin}}\limits}}\limits_{\boldsymbol x\in\mathbb{R}^d}f(\boldsymbol x)\) exists. Assume that \(\|\boldsymbol x_1-\boldsymbol x^*\|_2\leq R\), where \(\boldsymbol x_1\) is an initialization. Then if \(\eta=\frac{R}{L\sqrt{k}}\) for all \(s\geq1\),
(1) \(f\left(\frac{1}{k}\sum\limits_{s=1}^k\boldsymbol x_s\right)-f(\boldsymbol x^*)\leq\frac{LR}{\sqrt{k}}\);
(2) \(f(\boldsymbol x^0)-f(\boldsymbol x^*)\leq \frac{LR}{\sqrt{k}}\).
Proof. Note that \(\boldsymbol g_s\in\partial f(\boldsymbol x_s)\) satisfies \(\boldsymbol g_s=\frac{1}{\eta}(\boldsymbol x_s-\boldsymbol x_{s+1})\) by the gradient updating formula. Note that the equality \(2\boldsymbol a^\top\boldsymbol b=\|\boldsymbol a\|^2+\|\boldsymbol b\|^2-\|\boldsymbol a-\boldsymbol b\|^2\). Then we have \[\begin{aligned} f(\boldsymbol x_s)-f(\boldsymbol x^*)&\leq \boldsymbol g_s^\top (\boldsymbol x_s-\boldsymbol x^*) ~~~\mbox{definition of subgradient}\\ &=\frac{1}{\eta}(\boldsymbol x_s-\boldsymbol x_{s+1})^\top(\boldsymbol x_s-\boldsymbol x^*)\\ &=\frac{1}{2\eta}\biggl(\|\boldsymbol x_s-\boldsymbol x_{s+1}\|^2+\underbrace{\|\boldsymbol x_s-\boldsymbol x^*\|^2}_{\delta_s^2}-\underbrace{\|\boldsymbol x_{s+1}-\boldsymbol x^*\|^2}_{\delta_{s+1}^2}\biggl)\\ &=\frac{1}{2\eta}\biggl(\eta^2\|\boldsymbol g_s\|^2+\delta_s^2-\delta_{s+1}^2\biggl)\\ &=\frac{\eta}{2}\|\boldsymbol g_s\|^2 +\frac{1}{2\eta}\left(\delta_s^2-\delta_{s+1}^2\right). \end{aligned}\] Since \(f\) is \(L\)-Lipschitz, \(f(\boldsymbol x)-f(\boldsymbol y)\leq L\|\boldsymbol x-\boldsymbol y\|, ~\forall~ \boldsymbol x,\boldsymbol y\in\mathbb{R}^d\), then we have \[\begin{equation} f(\boldsymbol x+\boldsymbol g_x)-f(\boldsymbol x)\leq L\|\boldsymbol g_x\|.(\#eq:lip) \end{equation}\] By the definition of subgradient of \(f\) at \(\boldsymbol x\), we have \[\begin{equation} f(\boldsymbol x)-f(\boldsymbol x+\boldsymbol g_x)\leq \boldsymbol g_x^\top(\boldsymbol x-(\boldsymbol x + \boldsymbol g_x)). \tag{16.1} \end{equation}\] We can conclude \(L\geq \|\boldsymbol g_x\|, \forall~\boldsymbol x\in\mathbb{R}^d\) by combining inequalities (??) and (16.1). Therefore, \(L\) is an upper bound of \(\|\boldsymbol g_x\|\) everywhere.
Following this, we have \[\begin{aligned}
f(\boldsymbol x_s)-f(\boldsymbol x^*)&\leq\frac{\eta L^2}{2}+\frac{1}{2\eta}(\delta_s^2-\delta_{s+1}^2)\\
\Rightarrow ~~\frac{1}{k}\sum\limits_{s=1}^k f(\boldsymbol x_s)-f(\boldsymbol x^*)&\leq \frac{\eta L^2}{2} + \frac{1}{2\eta k}(\delta_1^2-\delta_{k+1}^2)\\
&\leq\frac{\eta L^2}{2} + \frac{\delta_1^2}{2\eta k}\\
&\leq \frac{\eta L^2}{2}+\frac{1}{2\eta k}R^2\\
&=\frac{LR}{\sqrt{k}} ~~~~~\mbox{choose}~\eta=\frac{R}{L\sqrt{k}}
\end{aligned}\]
We note that the LHS above satisfies
(1) \(f\left(\frac{1}{k}\sum\limits_{s=1}^k\boldsymbol x_s\right)-f(\boldsymbol x^*)\leq\mbox{LHS}\), and
(2) \(\min\limits_{\boldsymbol x\in\{\boldsymbol x_s\}_{s=1}^k}f(\boldsymbol x)-f(\boldsymbol x^*)\leq \mbox{LHS}\).
Remark.
- In this theorem, and many forthcoming results, we focus on the optimization error, i.e. how far away the solution returned by the algorithm is from what we really intended. See below illustration \[f^0~\underbrace{\xrightarrow{\hspace{50pt}}}_{\mbox{opt. err}}~\hat f\underbrace{\xrightarrow{\hspace{50pt}}}_{\mbox{est. err}}~\bar f \underbrace{\xrightarrow{\hspace{50pt}}}_{\mbox{opprox. err}}~f^*\]
- Intuitively, since we worked very hard to bound the estimation error and the approximation error, we should not allow the optimization error to have a lower convergence rate (or it would ruin the party).
- The dimension \(d\) does not appear anywhere in the proof. However, the dimension does have an effect since for large dimension, \(L\)-Lipschitz and \(\|\boldsymbol x_1-\boldsymbol x^*\|\leq R\) are both very strong assumptions.
- One obvious flaw with the current result is that the step size \(\eta\) depends on the number of steps \(k\), which does not make sense since we often have no idea on how many steps to go before we start.
We would rather have varying step size \(\eta_s\) that is independent on \(k\). Consider a weighted sum of the inequality \[\begin{aligned} f(\boldsymbol x_s)-f(\boldsymbol x^*)&\leq \frac{\eta_s L^2}{2}+\frac{1}{2\eta_s}(\delta_s^2-\delta_{s+1}^2)\\ \Rightarrow~~\sum\limits_{s=1}^k\eta_s\left(f(\boldsymbol x_s)-f(\boldsymbol x^*)\right)&\leq \sum\limits_{s=1}^k\frac{\eta_s^2 L^2}{2}+\frac{1}{2}(\delta_1^2-\delta_{k+1}^2)\\ \Rightarrow~~\sum\limits_{s=1}^k\eta_s\left(f(\boldsymbol x_s)-f(\boldsymbol x^*)\right)&\leq \frac{L^2}{2}\cdot \sum\limits_{s=1}^k\eta_s^2+\frac{R^2}{2} \end{aligned}\]
We will divide both sides by \(\sum\limits_{s=1}^k\eta_s\) to make LHS of above inequality a weighted average. Moreover, we will need the RHS\(\rightarrow0\) as \(k\rightarrow\infty\). To this end, we need \[\frac{\sum\limits_{s=1}^k\eta_s^2}{\sum\limits_{s=1}^k\eta_s}\rightarrow0, ~~~\mbox{and}~ \sum\limits_{s=1}^k\eta_s\rightarrow\infty~~~\mbox{as}~k\rightarrow\infty.\] One candidate for this requirement is \(\eta_s=\frac{G}{\sqrt{s}}\) for some \(G\). Since \(\sum\limits_{s=1}^k\eta_s^2\leq C_1G^2\log(k)\) and \(\sum\limits_{s=1}^k\eta_s\geq C_2G\sqrt{k}\) for some \(C_1, C_2>0\), we get \[\begin{equation} \frac{\sum\limits_{s=1}^k\eta_s\left(f(\boldsymbol x_s)-f(\boldsymbol x^*)\right) }{\sum\limits_{s=1}^k\eta_s}\leq\frac{L^2}{2}\cdot\frac{ \sum\limits_{s=1}^k\eta_s^2}{\sum\limits_{s=1}^k\eta_s}+\frac{R^2}{2\sum\limits_{s=1}^k\eta_s}\leq\frac{C_1GL^2\log(k)}{2C_2\sqrt{k}}+\frac{R^2}{2C_2G\sqrt{k}}.\tag{16.2} \end{equation}\]
Choosing \(G\) appropriately to minimize the RHS of inequality (16.2), we can let RHS\(\rightarrow0\) at the rate of \(LR\sqrt{\frac{\log(k)}{k}}\).
- This above improvement is still not ideal since we get an extra factor \(\sqrt{\log(k)}\). However, if we look at the sum from \(\frac{k}{2}\) to \(k\) (instead of from \(1\) to \(k\)). We have \(\sum\limits_{s=k/2}^k\eta_s^2\leq C_1^\prime G^2\) and \(\sum\limits_{s=k/2}^k\eta_s\geq C_2^\prime G\sqrt{k}\). Now we have \[\min\limits_{1\leq s\leq k}f(\boldsymbol x_s)-f(\boldsymbol x^*)\leq\min\limits_{k/2\leq s\leq k}f(\boldsymbol x_s)-f(\boldsymbol x^*)\leq\frac{\sum\limits_{s=k/2}^k\eta_s\left(f(\boldsymbol x_s)-f(\boldsymbol x^*)\right) }{\sum\limits_{s=k/2}^k\eta_s}\leq\frac{CLR}{\sqrt{k}},\] which is the same rate as in the Theorem 16.3 and step size is independent on \(k\).
- There is a price to pay in the second improvement above! The proof only works if we have \(\|\boldsymbol x_{\frac{k}{2}}-\boldsymbol x^*\|\leq R\) since we replace \(\boldsymbol x_1\) by \(\boldsymbol x_{\frac{k}{2}}\) in the telescoping sum. In general, this is not true for gradient descent (or difficult to guarantee). But it will be true for the projected gradient descent in the next lecture.