Chapter 17 Projected Gradient Descent

(This chapter was scribed by Nicholas P Demarco and Xinhai Zhang.)

In this lecture, we assume that \(f\) is such that for any \(\boldsymbol x \in \mathcal{C}\) and \(g\in\partial f(\boldsymbol x)\) (assuming \(\partial f(\boldsymbol x)\neq\emptyset\)), one has \(\|g\|_2 \leq L\). Note that by the definition of the sub-gradient and the Cauchy-Schwarz inequality, this happens if and only if \(f\) is \(L\)-Lipschitz.

17.1 Convergence Rate of Projected Gradient Descent

The important modification we make in PGD is that every iterate \(\boldsymbol x_s\) is kept inside \(\mathcal{C}\) by projecting it (if necessary) onto \(\mathcal{C}\).

Definition 17.1 (Projection) We define the projection operator \(\pi_\mathcal{C}\) onto \(\mathcal{C}\) by \[\pi_\mathcal{C}(\boldsymbol x)=\mathop{\mathrm{argmin}}_{\boldsymbol y\in\mathcal{C}}\|\boldsymbol y-\boldsymbol x\|\] \(\pi_\mathcal{C}(\boldsymbol x)\) is unique for closed convex \(\mathcal{C}\).

Lemma 17.1 For any \(\boldsymbol x\) and any \(\boldsymbol z\in\mathcal{C}\), \(\langle \pi_\mathcal{C}(\boldsymbol x)-\boldsymbol x, \pi_\mathcal{C}(\boldsymbol x)-\boldsymbol z\rangle\leq 0\), which implies that \[\|\pi_\mathcal{C}(\boldsymbol x)-\boldsymbol x\|^2+\|\pi_\mathcal{C}(\boldsymbol x)-\boldsymbol z\|^2\leq\|\boldsymbol x-\boldsymbol z\|^2\]

Proof. From the definition of the projection, we have \[\|\boldsymbol x-\pi_\mathcal{C}(\boldsymbol x)\|^2\leq\|\boldsymbol x-\boldsymbol v\|^2\quad\quad\forall\boldsymbol v\in\mathcal{C}\] Fix \(\boldsymbol w\in\mathcal{C}\) and define \(\boldsymbol v=(1-t)\pi_\mathcal{C}(\boldsymbol x)+t\boldsymbol w\) for some \(t\in(0,1]\). Since \(\mathcal{C}\) is convex, \(\boldsymbol v\in\mathcal{C}\) as well. Then \[\begin{aligned}\|\boldsymbol x-\pi_\mathcal{C}(\boldsymbol x)\|^2&\leq\|\boldsymbol x-\boldsymbol v\|^2\\ &=\|\boldsymbol x-(1-t)\pi_\mathcal{C}(\boldsymbol x)-t\boldsymbol w\|^2\\ &=\|\boldsymbol x-\pi_\mathcal{C}(\boldsymbol x)-t(\boldsymbol w-\pi_\mathcal{C}(\boldsymbol x))\|^2\\ &=\|\boldsymbol x-\pi_\mathcal{C}(\boldsymbol x)\|^2+t^2\|\boldsymbol w-\pi_\mathcal{C}(\boldsymbol x)\|^2-2t\langle \boldsymbol x-\pi_\mathcal{C}(\boldsymbol x),\boldsymbol w-\pi_\mathcal{C}(\boldsymbol x)\rangle \end{aligned}\] Cancelling \(\|\boldsymbol x-\pi_\mathcal{C}(\boldsymbol x)\|^2\) and dividing by \(t>0\) gives \[2\langle \boldsymbol x-\pi_\mathcal{C}(\boldsymbol x),\boldsymbol w-\pi_\mathcal{C}(\boldsymbol x)\rangle\leq t\|\boldsymbol w-\pi_\mathcal{C}(\boldsymbol x)\|^2\] Letting \(t\to 0\) yields \(\langle \boldsymbol x-\pi_\mathcal{C}(\boldsymbol x),\boldsymbol w-\pi_\mathcal{C}(\boldsymbol x)\rangle\leq 0\), which is the first claim (with \(\boldsymbol z=\boldsymbol w\)). For the second claim, expand \[\|\boldsymbol x-\boldsymbol z\|^2=\|\boldsymbol x-\pi_\mathcal{C}(\boldsymbol x)\|^2+\|\pi_\mathcal{C}(\boldsymbol x)-\boldsymbol z\|^2+2\langle \boldsymbol x-\pi_\mathcal{C}(\boldsymbol x),\pi_\mathcal{C}(\boldsymbol x)-\boldsymbol z\rangle\geq\|\boldsymbol x-\pi_\mathcal{C}(\boldsymbol x)\|^2+\|\pi_\mathcal{C}(\boldsymbol x)-\boldsymbol z\|^2\] where the inequality uses the first claim.

Proof (uniqueness of projection). Assume \(\pi_1,\pi_2\in\mathcal{C}\) both satisfy \[\langle\pi_1-\boldsymbol x,\pi_1-\boldsymbol z\rangle\leq 0\quad\quad \forall \boldsymbol{z}\in\mathcal{C}\] \[\langle\pi_2-\boldsymbol x,\pi_2-\boldsymbol z\rangle\leq 0\quad\quad \forall \boldsymbol{z}\in\mathcal{C}\] Take \(\boldsymbol{z}=\pi_2\) in the first inequality and \(\boldsymbol{z}=\pi_1\) in the second. We have \[\langle\pi_1-\boldsymbol x,\pi_1-\pi_2\rangle\leq 0\] \[\langle\pi_2-\boldsymbol x,\pi_2-\pi_1\rangle\leq 0\quad\quad\Leftrightarrow\quad\quad\langle\boldsymbol x-\pi_2,\pi_1-\pi_2\rangle\leq 0\] Adding the two inequalities yields \[\langle\pi_1-\pi_2,\pi_1-\pi_2\rangle\leq 0\quad\implies\quad\|\pi_1-\pi_2\|^2=0\quad\implies\quad\pi_1=\pi_2\]
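As a quick sanity check (not part of the original notes), the two claims of Lemma 17.1 can be verified numerically for a set whose projection is trivial, e.g. the box \([0,1]^d\), where the Euclidean projection is coordinate-wise clipping. A minimal sketch in Python/NumPy:

```python
import numpy as np

# Numerical sanity check of Lemma 17.1 with C = [0, 1]^d,
# where the Euclidean projection is coordinate-wise clipping.
rng = np.random.default_rng(0)
d = 5
for _ in range(1000):
    x = 3 * rng.normal(size=d)            # arbitrary point, possibly outside C
    z = rng.uniform(0.0, 1.0, size=d)     # arbitrary point inside C
    pi_x = np.clip(x, 0.0, 1.0)           # pi_C(x)
    # first claim: <pi_C(x) - x, pi_C(x) - z> <= 0
    assert (pi_x - x) @ (pi_x - z) <= 1e-12
    # second claim: ||pi_C(x) - x||^2 + ||pi_C(x) - z||^2 <= ||x - z||^2
    assert np.sum((pi_x - x)**2) + np.sum((pi_x - z)**2) <= np.sum((x - z)**2) + 1e-12
```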

The PGD algorithm iterates the following update for \(s\geq 1\): \[\begin{cases} \boldsymbol y_{s+1}= \boldsymbol x_s-\eta_sg_s,\quad\quad g_s\in\partial f(\boldsymbol x_s)\\ \boldsymbol x_{s+1}=\pi_{\mathcal{C}}(\boldsymbol y_{s+1}) \end{cases}\] After \(k\) steps, it returns either \(\bar{\boldsymbol x}=\frac{1}{k}\sum^k_{s=1}\boldsymbol x_s\) or \(\boldsymbol x^o\in\mathop{\mathrm{argmin}}_{\boldsymbol x\in\{\boldsymbol x_1,\ldots,\boldsymbol x_k\}}f(\boldsymbol x)\). A minimal code sketch of this iteration is given below; we then prove a rate of convergence for this new method.
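The update above translates almost line-by-line into code. The following is a minimal sketch (not from the original notes), written generically against a sub-gradient oracle and a projection oracle; the toy problem at the bottom (minimizing an \(\ell_1\)-type objective over a box) is purely hypothetical and only shows the calling convention.

```python
import numpy as np

def pgd(f, subgrad, project, x1, eta, k):
    """Projected (sub)gradient descent with constant step size eta.

    f       : objective (only used to select the best iterate x^o)
    subgrad : returns a sub-gradient g_s of f at x_s
    project : Euclidean projection onto the closed convex set C
    x1      : starting point, assumed to lie in C
    eta     : constant step size, e.g. R / (L * sqrt(k)) as in Theorem 17.1
    """
    x = np.asarray(x1, dtype=float)
    iterates = []
    for _ in range(k):
        iterates.append(x)
        y = x - eta * subgrad(x)     # sub-gradient step: y_{s+1} = x_s - eta * g_s
        x = project(y)               # projection step:   x_{s+1} = pi_C(y_{s+1})
    x_bar = np.mean(iterates, axis=0)        # averaged iterate
    x_best = min(iterates, key=f)            # best iterate x^o
    return x_bar, x_best

# Hypothetical toy problem: minimize f(x) = ||x - 2*1||_1 over the box C = [0, 1]^d.
d, k = 3, 500
f = lambda x: np.sum(np.abs(x - 2.0))            # convex, sqrt(d)-Lipschitz
subgrad = lambda x: np.sign(x - 2.0)             # a valid sub-gradient
project = lambda y: np.clip(y, 0.0, 1.0)         # projection onto the box
eta = 1.0 / np.sqrt(k)                           # = R/(L*sqrt(k)) since here R = L = sqrt(d)
x_bar, x_best = pgd(f, subgrad, project, np.zeros(d), eta, k)
# f(x_bar) should be within L*R/sqrt(k) = d/sqrt(k) of f* = d (attained at x = (1,...,1)).
```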

Theorem 17.1 Let \(\mathcal{C}\) be a closed convex subset of \(\mathbb{R}^d\) with \(\mathrm{diam}(\mathcal{C})\leq R\). Let \(f\) be a convex \(L\)-Lipschitz function on \(\mathcal{C}\) such that \(\boldsymbol x^*\in\mathop{\mathrm{argmin}}_{\boldsymbol x\in\mathcal{C}}f(\boldsymbol x)\) exists. Then if \(\eta_s=\frac{R}{L\sqrt{k}}\) (not depending on \(s\)), we have \[f(\bar{\boldsymbol x})-f(\boldsymbol x^*)\leq\frac{LR}{\sqrt{k}}\quad\quad\text{and}\quad\quad f(\boldsymbol x^o)-f(\boldsymbol x^*)\leq\frac{LR}{\sqrt{k}}\]

Proof. Again, recall the identity \[2\boldsymbol a^T\boldsymbol b=\|\boldsymbol a\|^2+\|\boldsymbol b\|^2-\|\boldsymbol a- \boldsymbol b\|^2\] Then \[\begin{aligned}&\quad f(\boldsymbol x_s)-f(\boldsymbol x^*)\\ \text{(definition of sub-gradient)}\quad&\leq g_s^T(\boldsymbol x_s-\boldsymbol x^*)\quad (\text{where}\quad g_s\in\partial f(\boldsymbol x_s))\\ &=\frac{1}{\eta_s}(\boldsymbol x_s-\boldsymbol y_{s+1})^T(\boldsymbol x_s-\boldsymbol x^*)\\ &=\frac{1}{2\eta_s}(\|\boldsymbol x_s-\boldsymbol y_{s+1}\|^2+\|\boldsymbol x_s-\boldsymbol x^*\|^2-\|\boldsymbol y_{s+1}-\boldsymbol x^*\|^2) \end{aligned}\] Next, \[\begin{aligned} &\quad\|\boldsymbol y_{s+1}-\boldsymbol x^*\|^2\\ &=\|\boldsymbol y_{s+1}-\boldsymbol x_{s+1}\|^2+\|\boldsymbol x_{s+1}-\boldsymbol x^*\|^2+2\langle\boldsymbol y_{s+1}-\pi_{\mathcal{C}}(\boldsymbol y_{s+1}),\pi_{\mathcal{C}}(\boldsymbol y_{s+1})-\boldsymbol x^*\rangle\\ \text{(Lemma 17.1)}\quad&\geq 0+\|\boldsymbol x_{s+1}-\boldsymbol x^*\|^2+0 \end{aligned}\] Also, since \(f\) is \(L\)-Lipschitz, we have \[\|\boldsymbol x_s-\boldsymbol y_{s+1}\|^2=\|\eta_sg_s\|^2=\eta_s^2\|g_s\|^2\leq\eta_s^2L^2\] Combining the above, \[f(\boldsymbol x_s)-f(\boldsymbol x^*)\leq\frac{1}{2\eta_s}(\eta_s^2L^2+\|\boldsymbol x_s-\boldsymbol x^*\|^2-\|\boldsymbol x_{s+1}-\boldsymbol x^*\|^2)\] Summing from \(s=1\) to \(s=k\) (the distance terms telescope, since \(\eta_s\) does not depend on \(s\)) and dividing by \(k\), we have \[\begin{aligned}\frac{1}{k}\sum_{s=1}^kf(\boldsymbol x_s)-f(\boldsymbol x^*)&\leq\frac{1}{2\eta_sk}(k\eta_s^2L^2+\|\boldsymbol x_1-\boldsymbol x^*\|^2-\|\boldsymbol x_{k+1}-\boldsymbol x^*\|^2)\\ &\leq\frac{\eta_sL^2}{2}+\frac{1}{2\eta_sk}\|\boldsymbol x_1-\boldsymbol x^*\|^2\\ &\leq\frac{\eta_sL^2}{2}+\frac{1}{2\eta_sk}R^2 \end{aligned}\] The RHS is minimized at \(\eta_s=\frac{R}{L\sqrt{k}}\), and the minimum is \(\frac{RL}{\sqrt{k}}\). On the other hand, \(f(\bar{\boldsymbol x})-f(\boldsymbol x^*)\) and \(f(\boldsymbol x^o)-f(\boldsymbol x^*)\) are both bounded by the LHS: the former by convexity of \(f\) (Jensen's inequality gives \(f(\bar{\boldsymbol x})\leq\frac{1}{k}\sum_{s=1}^kf(\boldsymbol x_s)\)), and the latter because \(f(\boldsymbol x^o)=\min_sf(\boldsymbol x_s)\leq\frac{1}{k}\sum_{s=1}^kf(\boldsymbol x_s)\).

  • One can show that this rate cannot be improved in general. This is disappointing, since to achieve an optimization error of \(\epsilon\) we need \(O(\frac{1}{\epsilon^2})\) iterations. We can explore more restrictive assumptions on the function \(f\) to get a better rate.
  • The computational bottleneck of the PGD algorithm is often the projection step, which is a convex optimization problem in its own right. Fortunately, in some cases this problem admits an analytical solution (e.g. \(\mathcal{C}\) is a Euclidean ball), or an easy / fast combinatorial algorithm (e.g. \(\mathcal{C}\) is an \(L_1\) ball); a code sketch of both projections is given after this list.
  • Like in the case of GD, the step size depends on the total number of steps \(k\), which is undesirable. We can use the same proof improvement introduced after GD by taking step size \(\eta_s=\frac{R}{L\sqrt{s}}\). This leads to the same convergence rate (up to a constant factor), and the step size no longer depends on \(k\).
  • Moreover, unlike in the GD case, this proof improvement comes without a price to pay in terms of whether \(\|\boldsymbol x_{k/2}-\boldsymbol x^*\|\leq R\), since every iterate \(\boldsymbol x_s\) lies in \(\mathcal{C}\) and is therefore always at most \(R\) away from \(\boldsymbol x^*\), thanks to the projection step.
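To make the second bullet concrete, here is a minimal sketch (with hypothetical function names, not from the original notes) of the two projections mentioned there: the closed-form projection onto a Euclidean ball, and the standard sort-based \(O(d\log d)\) projection onto the \(L_1\) ball.

```python
import numpy as np

def project_l2_ball(x, center, radius):
    """Euclidean projection onto {y : ||y - center||_2 <= radius} (closed form)."""
    diff = x - center
    norm = np.linalg.norm(diff)
    if norm <= radius:
        return x                                   # already inside the ball
    return center + radius * diff / norm           # rescale onto the boundary

def project_l1_ball(v, radius=1.0):
    """Euclidean projection onto {y : ||y||_1 <= radius} via sort-based simplex projection."""
    v = np.asarray(v, dtype=float)
    if np.abs(v).sum() <= radius:
        return v                                   # already inside the ball
    u = np.sort(np.abs(v))[::-1]                   # magnitudes sorted in decreasing order
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - (css - radius) / idx > 0)[0][-1]
    theta = (css[rho] - radius) / (rho + 1)        # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)
```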

17.2 Examples (SVM and Boosting)

As we saw before, the SVM minimization is \[\min_{\boldsymbol\alpha\in\mathbb{R}^n}\frac{1}{n}\sum^n_{i=1}\max(0,1-y_if_{\boldsymbol\alpha}(x_i))\quad\quad\text{s.t.}\quad\boldsymbol\alpha^T\boldsymbol K\boldsymbol\alpha\leq c^2\] where \(f_{\boldsymbol\alpha}(x_i)=\sum_{j=1}^n\alpha_jk(x_j,x_i)=\boldsymbol\alpha^T\boldsymbol K\boldsymbol e_i\), \(\boldsymbol K\) is the kernel matrix with \(\boldsymbol K_{ji}=k(x_j,x_i)\), and \(\boldsymbol e_i=(0,\ldots,0,1,0,\ldots,0)^T\) is the vector whose \(i\)-th element is 1 and 0 elsewhere.
The convex set is \(\mathcal{C}=\{\boldsymbol\alpha:\boldsymbol\alpha^T\boldsymbol K\boldsymbol\alpha\leq c^2\}\), which is an ellipsoid. The projection onto the ellipsoid \(\mathcal{C}\) is not too hard (a sketch of one way to compute it follows below). We can then solve for \(\boldsymbol\alpha\) using PGD.
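As a rough illustration (an assumption-laden sketch, not a method prescribed in the notes), the projection onto the ellipsoid \(\{\boldsymbol\alpha:\boldsymbol\alpha^T\boldsymbol K\boldsymbol\alpha\leq c^2\}\) can be computed from the KKT conditions: the minimizer of \(\|\boldsymbol\alpha-\boldsymbol y\|^2\) subject to the constraint has the form \(\boldsymbol\alpha=(\boldsymbol I+\lambda\boldsymbol K)^{-1}\boldsymbol y\) for some \(\lambda\geq 0\), which can be found by bisection.

```python
import numpy as np

def project_ellipsoid(y, K, c, n_bisect=100):
    """Euclidean projection of y onto {a : a^T K a <= c^2}, with K symmetric PSD.

    From the KKT conditions, the projection is a = (I + lam*K)^{-1} y, with lam >= 0
    chosen so that the constraint is active (when y lies outside the set).
    """
    if y @ K @ y <= c**2:
        return y                                    # y is already feasible
    I = np.eye(len(y))
    def slack(lam):
        a = np.linalg.solve(I + lam * K, y)
        return a @ K @ a - c**2                     # decreasing in lam
    lo, hi = 0.0, 1.0
    while slack(hi) > 0:                            # grow the bracket until feasible
        hi *= 2.0
    for _ in range(n_bisect):                       # bisection on lam
        mid = 0.5 * (lo + hi)
        if slack(mid) > 0:
            lo = mid
        else:
            hi = mid
    return np.linalg.solve(I + hi * K, y)
```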
For convenience, let \(\ell_i(\boldsymbol\alpha)=\max(0,1-y_if_{\boldsymbol\alpha}(x_i))\). In order to obtain the optimization error bound, we need to determine the diameter \(R\) and the Lipschitz constant \(L\).
First, to find \(L\), start with a sub-gradient of \(\ell_i(\boldsymbol\alpha)\): \[\nabla\{\max(0,1-y_i\boldsymbol e_i^T\boldsymbol K\boldsymbol\alpha)\}=\mathbb{1}(1-y_i\boldsymbol e_i^T\boldsymbol K\boldsymbol\alpha> 0)(-1)(y_i\boldsymbol K\boldsymbol e_i)\] So the (sub-)gradient of the objective function \(\frac{1}{n}\sum_{i=1}^n\ell_i(\boldsymbol\alpha)\) has norm \[\begin{aligned} \left\|\frac{1}{n}\sum_{i=1}^n\nabla\ell_i(\boldsymbol\alpha)\right\| &=\left\|\frac{1}{n}\sum_{i=1}^n\mathbb{1}(1-y_i\boldsymbol e_i^T\boldsymbol K\boldsymbol\alpha> 0)(y_i\boldsymbol K\boldsymbol e_i)\right\|\\ &\leq\frac{1}{n}\sum_{i=1}^n\|\boldsymbol K\boldsymbol e_i\|\\ &=\frac{1}{n}\sum_{i=1}^n\sqrt{\sum_{j=1}^nk(\boldsymbol x_j,\boldsymbol x_i)^2}\\ &\leq\frac{1}{n}\sum_{i=1}^n\sqrt{\sum_{j=1}^n|k(\boldsymbol x_j,\boldsymbol x_j)||k(\boldsymbol x_i,\boldsymbol x_i)|}\\ &\leq\frac{1}{n}\sum_{i=1}^n\sqrt{nk_{max}^2}\\ &=k_{max}\sqrt{n}\triangleq L \end{aligned}\] where the second inequality is the Cauchy-Schwarz inequality for the kernel, \(k(\boldsymbol x_j,\boldsymbol x_i)^2\leq k(\boldsymbol x_j,\boldsymbol x_j)k(\boldsymbol x_i,\boldsymbol x_i)\), and \(k_{max}=\max_ik(\boldsymbol x_i,\boldsymbol x_i)\). To find \(R\), we compute \(\mathrm{diam}\{\boldsymbol\alpha:\boldsymbol\alpha^T\boldsymbol K\boldsymbol\alpha\leq c^2\}\) (an ellipsoid), which is \(2\max_{\boldsymbol\alpha:\boldsymbol\alpha^T\boldsymbol K\boldsymbol\alpha\leq c^2}\|\boldsymbol\alpha\|\). Students in Math 570 should know that the maximum is attained when \(\boldsymbol\alpha\) is along the eigenvector for the largest eigenvalue of \(\boldsymbol K^{-1}\) (i.e., the eigenvector of \(\boldsymbol K\) for its smallest eigenvalue).
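The sub-gradient just computed can be written compactly as \(-\frac{1}{n}\boldsymbol K(\boldsymbol a\odot\boldsymbol y)\) with \(a_i=\mathbb{1}(1-y_i\boldsymbol e_i^T\boldsymbol K\boldsymbol\alpha>0)\). Below is a minimal sketch (hypothetical data, linear kernel, not from the notes) that also checks the bound \(\|\cdot\|\leq k_{max}\sqrt{n}\) numerically.

```python
import numpy as np

def svm_subgradient(alpha, K, y):
    """A sub-gradient of (1/n) * sum_i max(0, 1 - y_i (K alpha)_i) with respect to alpha."""
    n = len(y)
    active = (1.0 - y * (K @ alpha)) > 0     # indices where the hinge is active
    return -(K @ (active * y)) / n           # -(1/n) * sum_{active} y_i K e_i

# Hypothetical sanity check of the Lipschitz bound L = k_max * sqrt(n).
rng = np.random.default_rng(0)
n, d = 50, 10
X = rng.normal(size=(n, d))
K = X @ X.T                                   # linear kernel matrix
y = rng.choice([-1.0, 1.0], size=n)
g = svm_subgradient(rng.normal(size=n), K, y)
k_max = np.max(np.diag(K))
assert np.linalg.norm(g) <= k_max * np.sqrt(n) + 1e-8
```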
Alternatively, consider the following inequality: \[\lambda_{min}(\boldsymbol K)\boldsymbol\alpha^T\boldsymbol\alpha\leq\boldsymbol\alpha^T\boldsymbol K\boldsymbol\alpha\leq c^2\] Then we know \[\sqrt{\boldsymbol\alpha^T\boldsymbol\alpha}=\|\boldsymbol\alpha\|\leq\frac{c}{\sqrt{\lambda_{min}(\boldsymbol K)}}\] so we can set \(R=\frac{2c}{\sqrt{\lambda_{min}(\boldsymbol K)}}\).
Q: How small can \(\lambda_{min}(\boldsymbol K)\) be?
Imagine that \(\boldsymbol x_1,\ldots,\boldsymbol x_n\sim N(0,\boldsymbol I_d)\) and consider the linear kernel. Then \[\boldsymbol K_{n\times n}=\boldsymbol X_{n\times d}\boldsymbol X^T \quad\quad\text{where}\quad \boldsymbol X=\begin{bmatrix}\boldsymbol x_1^T\\\boldsymbol x_2^T\\\vdots\\\boldsymbol x_n^T\end{bmatrix}\] is the data matrix.
Using results from random matrix theory, when \(n,d\to\infty\) with \(\frac{n}{d}\to r\in(0,1)\) constant, we have \[\lambda_{min}\left(\tfrac{1}{d}\boldsymbol K\right)\to(1-\sqrt{r})^2\] So for large \(d\) and \(n\) with \(d\gg n\), we can take the approximation \[\lambda_{min}(\boldsymbol K)\approx d\left(1-\sqrt{\tfrac{n}{d}}\right)^2\approx d\left(1-2\sqrt{\tfrac{n}{d}}\right)\geq \frac{d}{2}\] (the last inequality holds once \(d\geq 16n\)). This means that we don't have to worry about \(\lambda_{min}(\boldsymbol K)\) being too small. A quick numerical check is sketched below.
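The following small experiment (hypothetical dimensions, not from the notes) checks the approximation numerically by comparing \(\lambda_{min}(\frac{1}{d}\boldsymbol K)\) to \((1-\sqrt{n/d})^2\) for Gaussian data with the linear kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3200                               # r = n/d = 1/16, so d >> n
X = rng.normal(size=(n, d))                    # rows x_i ~ N(0, I_d)
K = X @ X.T                                    # linear kernel matrix
lam_min = np.linalg.eigvalsh(K).min()
print(lam_min / d, (1 - np.sqrt(n / d))**2)    # should be close for large n, d
print(lam_min >= d / 2)                        # the crude bound used above
```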
Now, collecting the expressions for \(L\) and \(R\) and combining them with the theorem from the last subsection, the optimization error bound is \[\frac{LR}{\sqrt{k}}=\frac{2k_{max}\sqrt{n}c}{\sqrt{k}\sqrt{\lambda_{min}(\boldsymbol K)}}\] Note that the statistical error is often of order \(\frac{1}{\sqrt{n}}\), or the fast rate \(\frac{1}{n}\) in special cases. It is therefore reasonable to drive the optimization error down to \(\frac{1}{n}\), i.e., \[\frac{LR}{\sqrt{k}}=\frac{2k_{max}\sqrt{n}c}{\sqrt{k}\sqrt{\lambda_{min}(\boldsymbol K)}}\leq\frac{1}{n}\quad\implies\quad k\geq \frac{4k_{max}^2n^3c^2}{\lambda_{min}(\boldsymbol K)}\] In other words, we need at least \(\frac{4k_{max}^2n^3c^2}{\lambda_{min}(\boldsymbol K)}\) steps when solving the SVM with PGD. This is not too bad considering that \(n\) is often not large in SVM applications.
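For concreteness, the required step count from the last display can be spelled out as a small helper (a hypothetical function, simply restating the bound):

```python
import numpy as np

def svm_pgd_steps(k_max, n, c, lam_min):
    """Smallest k guaranteeing L*R/sqrt(k) <= 1/n, with L = k_max*sqrt(n) and R = 2c/sqrt(lam_min)."""
    return int(np.ceil(4 * k_max**2 * n**3 * c**2 / lam_min))
```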

Next we analyze boosting, whose minimization is \[\min_{\alpha\in\mathbb{R}^N,\|\alpha\|_1\leq 1}\frac{1}{n}\sum_{i=1}^n\phi(y_if_\alpha(\boldsymbol{x}_i))\] where \(\phi\) is often the exponential loss, \(\phi(z)=e^{-z}\), and \(f_\alpha=\sum^N_{j=1}\alpha_jf_j\) with \(f_j, j=1,\ldots,N\) being the weak learners, each taking values in \([-1,1]\). Since \(\|\alpha\|_1\leq 1\) and \(|f_j|\leq 1\), we have \(y_if_\alpha(\boldsymbol{x}_i)\in[-1,1]\), on which \(\phi\) is \(L\)-Lipschitz with \(L=|\phi'(-1)|=e\). Be mindful that \(N\) is the number of weak learners and \(n\) is the sample size.
Taking the gradient of the loss w.r.t. \(\alpha\), we have \[\nabla\left(\frac{1}{n}\sum^n_{i=1}\phi(y_if_\alpha(\boldsymbol{x}_i))\right)=\frac{1}{n}\sum^n_{i=1}\phi'(y_if_\alpha(\boldsymbol{x}_i))\,y_i\vec F(\boldsymbol{x}_i),\quad\quad\text{where}\quad\vec F(\boldsymbol{x}_i)\triangleq\begin{bmatrix}f_1(\boldsymbol{x}_i)\\\vdots\\f_N(\boldsymbol{x}_i)\end{bmatrix}\] The norm of the above gradient is \[\begin{aligned} &\quad\left\|\nabla\left(\frac{1}{n}\sum^n_{i=1}\phi(y_if_\alpha(\boldsymbol{x}_i))\right)\right\|\\ &\leq\frac{1}{n}\sum^n_{i=1}\|\phi'(y_if_\alpha(\boldsymbol{x}_i))y_i\vec F(\boldsymbol{x}_i)\|\\ &\leq\frac{1}{n}\sum^n_{i=1}L\|\vec F(\boldsymbol{x}_i)\|\\ &\leq L\sqrt{N} \end{aligned}\] We then define \(\tilde L\triangleq L\sqrt{N}\) as the Lipschitz constant of the objective. Note that \(\mathrm{diam}\{\alpha:\|\alpha\|_1\leq 1\}=2\triangleq R\). Therefore the optimization error is \[\frac{R\tilde L}{\sqrt{K}}=\frac{2L\sqrt{N}}{\sqrt{K}}\] where \(K\) is the number of PGD steps. For \(\frac{2L\sqrt{N}}{\sqrt{K}}\leq \frac{1}{n}\), we need \(K\sim Nn^2\). This can be bad for large \(N\) and \(n\).
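A minimal sketch (hypothetical weak-learner output matrix \(F\), not from the notes) of the gradient just derived, together with a numerical check of the bound \(\tilde L = L\sqrt{N}\) with \(L=e\):

```python
import numpy as np

def boosting_gradient(alpha, F, y):
    """Gradient of (1/n) * sum_i exp(-y_i f_alpha(x_i)) with respect to alpha.

    F : (n, N) matrix with F[i, j] = f_j(x_i), entries in [-1, 1]
    y : labels in {-1, +1}
    """
    n = len(y)
    margins = y * (F @ alpha)                     # y_i * f_alpha(x_i)
    phi_prime = -np.exp(-margins)                 # phi'(z) = -e^{-z}
    return (F.T @ (phi_prime * y)) / n            # (1/n) sum_i phi'(.) y_i F(x_i)

# Hypothetical check of the Lipschitz bound L*sqrt(N) with L = e.
rng = np.random.default_rng(0)
n, N = 100, 30
F = rng.choice([-1.0, 1.0], size=(n, N))          # weak-learner outputs
y = rng.choice([-1.0, 1.0], size=n)
raw = rng.normal(size=N)
alpha = raw / max(1.0, np.abs(raw).sum())         # crude rescaling to ensure ||alpha||_1 <= 1
g = boosting_gradient(alpha, F, y)
assert np.linalg.norm(g) <= np.e * np.sqrt(N) + 1e-8
```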