Chapter 18 Mirror Descent

(The first part of this chapter was scribed by Xinhai Zhang.)

18.1 Dual space and mirror map

Recall the proof in the analysis of boosting: \(\vec\alpha\) is chosen from a space with a constrained \(L_1\) norm, but because we used PGD, we bounded the \(L_2\) norm of the gradient.

Boosting is an example where we conduct optimization in a non-Euclidean space (e.g., an \(L_1\) space). While the dual of the \(L_2\) norm is the \(L_2\) norm itself, the dual of the \(L_1\) norm is the \(L_\infty\) norm (i.e., the sup norm).

To better exploit the special geometrical structure of the space, and to obtain a better rate of convergence, we can use the mirror descent algorithm.

To build intuition, let us abstract for a moment. We saw that PGD works in a Hilbert space. Suppose now we are optimizing in a more general Banach space \(B\); in particular, the norm is not derived from an inner product (think of \(B\) as an \(L_1\) space). In this case, the gradients \(\nabla f(\boldsymbol{x})\) are elements of the dual space \(B^*\), so gradient descent does not even make sense: the GD update \(\boldsymbol{x}-\eta\nabla f(\boldsymbol{x})\) cannot be performed. (Note that in a Hilbert space this was not a problem, since the dual space of \(\mathcal{H}\) is isometric to \(\mathcal{H}\) by the Riesz representation theorem.)

Definition 18.1 (Dual Space) Let \(V\) be a vector space over a field \(\mathcal{F}\) (usually \(\mathbb{R}\)). A linear functional on \(V\) is a linear map \(V\mapsto \mathcal{F}\). The dual space of \(V\), denoted by \(V^*\), is the space of all linear functionals on \(V\).

The main idea of mirror descent is the following. Since directly conducting GD in the primal space \(B\) is inappropriate, one can first map a point \(\boldsymbol{x}\in B\) to the dual space \(B^*\), perform the GD update there, and finally map the resulting point back to the primal space \(B\). Note that at each update, the new point in the primal space might fall outside the constraint set \(C\subset B\), in which case it should be projected back onto \(C\).

Ingredients of mirror descent:
- Potential function also known as mirror map \(\Phi : B\mapsto \mathbb{R}\). The gradient of \(\Phi\) is used to map points from the primal to the dual.
- Bregman divergence associated with \(\Phi\), which is a measure of distance between two points in the primal.
- The projection is done via Bregman divergence.

Definition 18.2 (Dual Norm) For any given arbitrary norm \(\|\cdot\|\) on \(\mathbb{R}^d\), the dual norm \(\|\cdot\|_*\) is defined as \[\|g\|_*=\sup_{\boldsymbol{x}\in\mathbb{R}^d, \|\boldsymbol{x}\|\leq 1} g^T\boldsymbol{x}\]

  • The dual of the Euclidean (\(L_2\)) norm is itself. This comes directly from the C-S inequality.
  • The dual of the \(L_\infty\) norm is the \(L_1\) norm, since the inequality \(\boldsymbol{x}^T\boldsymbol{y}\leq\|\boldsymbol{x}\|_\infty\|\boldsymbol{y}\|_1\) holds trivially and is attained when \(\boldsymbol{x}=\mathrm{sign}(\boldsymbol{y})\)
  • The dual of the dual norm is the original norm
  • The dual norm of the \(L_p\) norm is the \(L_q\) norm with \(\frac{1}{p}+\frac{1}{q}=1\) (due to Hölder’s inequality).
  • By definition of the dual norm, we have \(\boldsymbol{x}^T\boldsymbol{y}\leq\|\boldsymbol{x}\|\|\boldsymbol{y}\|_*\), a generalization of the C-S inequality, which corresponds to the \(L_2\) norm.
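A quick numerical sanity check of these duality facts (a sketch using NumPy; the random vectors and the exponents \(p=3\), \(q=3/2\) are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=5)

# Dual of the L-infinity norm is the L1 norm:
# sup_{||x||_inf <= 1} g^T x is attained at x = sign(g).
x_star = np.sign(g)
assert np.isclose(x_star @ g, np.linalg.norm(g, 1))

# Generalized C-S / Holder: x^T g <= ||x||_p ||g||_q with 1/p + 1/q = 1.
p, q = 3.0, 1.5
for _ in range(1000):
    x = rng.normal(size=5)
    assert x @ g <= np.linalg.norm(x, p) * np.linalg.norm(g, q) + 1e-12
```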

Definition 18.3 A convex function \(\Phi\) on a convex set \(B\) is said to be
1. L-Lipschitz w.r.t. \(\|\cdot\|\) if \(\forall \boldsymbol{x}\in B\), \(g\in\partial\Phi(\boldsymbol{x})\), we have \(\|g\|_*\leq L\)
2. \(\alpha\)-strongly convex w.r.t. \(\|\cdot\|\) if \(\forall \boldsymbol{x},\boldsymbol{y}\in B\) and \(\forall g\in\partial\Phi(\boldsymbol{x})\), we have \[\Phi(\boldsymbol{x})-\Phi(\boldsymbol{y})\leq g^T(\boldsymbol{x}-\boldsymbol{y})-\frac{\alpha}{2}\|\boldsymbol{x}-\boldsymbol{y}\|^2\] (i.e. \(\Phi(\boldsymbol{y})\geq \Phi(\boldsymbol{x})+g^T(\boldsymbol{y}-\boldsymbol{x})+\frac{\alpha}{2}\|\boldsymbol{y}-\boldsymbol{x}\|^2\))

We say that \(\Phi: B\mapsto\mathbb{R}\) is a mirror map (potential function) if
1. \(\Phi\) is strictly convex and differentiable
2. \(\|\nabla\Phi(\boldsymbol{x})\|_*\to\infty\) as \(\boldsymbol{x}\to\partial B\) (boundary of \(B\))

In mirror descent, a point \(\boldsymbol{x}\in B\) is mapped to \(\nabla\Phi(\boldsymbol{x})\in B^*\), from which one takes a GD update to get \(\nabla\Phi(\boldsymbol{x})-\eta\nabla f(\boldsymbol{x})\). (Note both are in the dual space.)
Property 2 ensures that there must exist \(y\in B\) s.t. \[\nabla\Phi(\boldsymbol{y})=\nabla\Phi(\boldsymbol{x})-\eta\nabla f(\boldsymbol{x})\] In case \(y\) is outside of the constraint set \(C\), projection is done using Bregman divergence.

Definition 18.4 (Bregman divergence associated to a function) For a given differentiable, \(\alpha\)-strongly convex function \(\Phi: B\to\mathbb{R}\), its associated Bregman divergence is \[D_\Phi(\boldsymbol{x},\boldsymbol{y})=\Phi(\boldsymbol{x})-\Phi(\boldsymbol{y})-\nabla\Phi(\boldsymbol{y})^T(\boldsymbol{x}-\boldsymbol{y})\] or \[D_\Phi(\boldsymbol{y},\boldsymbol{x})=\Phi(\boldsymbol{y})-\Phi(\boldsymbol{x})-\nabla\Phi(\boldsymbol{x})^T(\boldsymbol{y}-\boldsymbol{x})\]

  • Think of this divergence as the approximation error of the first-order Taylor (linear) approximation of \(\Phi\) at \(\boldsymbol{y}\), evaluated at \(\boldsymbol{x}\)
  • In general \(D_\Phi(\boldsymbol{x},\boldsymbol{y})\neq D_\Phi(\boldsymbol{y},\boldsymbol{x})\), since \(\nabla\Phi(\boldsymbol{y})\neq \nabla\Phi(\boldsymbol{x})\)
  • If \(\Phi\) is convex, then \(D_\Phi(\boldsymbol{x},\boldsymbol{y})\geq 0\) by the definition of the subgradient (or because the Hessian of \(\Phi\) is p.s.d.)
  • If \(\Phi\) is \(\alpha\)-strongly convex, then \[D_\Phi(\boldsymbol{y},\boldsymbol{x})\geq\frac{\alpha}{2}\|\boldsymbol{y}-\boldsymbol{x}\|^2\] That is, the Bregman divergence upper-bounds \(\alpha/2\) times the squared norm.
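These properties can be checked numerically. The sketch below uses an illustrative mirror map \(\Phi(x)=\sum_j e^{x_j}\) (our choice for the example, not one used later in the chapter):

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    # D_Phi(x, y) = Phi(x) - Phi(y) - grad Phi(y)^T (x - y)
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

phi = lambda v: np.sum(np.exp(v))      # strictly convex on R^d
grad_phi = lambda v: np.exp(v)

x = np.array([0.5, -1.0])
y = np.array([0.0, 0.3])
d_xy = bregman(phi, grad_phi, x, y)
d_yx = bregman(phi, grad_phi, y, x)
assert d_xy >= 0 and d_yx >= 0         # nonnegative since phi is convex
assert not np.isclose(d_xy, d_yx)      # asymmetric in general
```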

Proposition 18.1 Given convex function \(\Phi\) on \(B\), with \(\boldsymbol{x},\boldsymbol{y},\boldsymbol{z}\in B\), \[[\nabla\Phi(\boldsymbol{x})-\nabla\Phi(\boldsymbol{y})]^T(\boldsymbol{x}-\boldsymbol{z})=D_\Phi(\boldsymbol{x},\boldsymbol{y})+D_\Phi(\boldsymbol{z},\boldsymbol{x})-D_\Phi(\boldsymbol{z},\boldsymbol{y})\]

Proof. \[\begin{aligned} RHS&=\Phi(\boldsymbol{x})-\Phi(\boldsymbol{y})-\nabla\Phi(\boldsymbol{y})^T(\boldsymbol{x}-\boldsymbol{y})\\ &+\Phi(\boldsymbol{z})-\Phi(\boldsymbol{x})-\nabla\Phi(\boldsymbol{x})^T(\boldsymbol{z}-\boldsymbol{x})\\ &-[\Phi(\boldsymbol{z})-\Phi(\boldsymbol{y})-\nabla\Phi(\boldsymbol{y})^T(\boldsymbol{z}-\boldsymbol{y})] \end{aligned}\] The \(\Phi\) terms cancel, and the remaining gradient terms collect to \(\nabla\Phi(\boldsymbol{y})^T(\boldsymbol{z}-\boldsymbol{x})-\nabla\Phi(\boldsymbol{x})^T(\boldsymbol{z}-\boldsymbol{x})=[\nabla\Phi(\boldsymbol{x})-\nabla\Phi(\boldsymbol{y})]^T(\boldsymbol{x}-\boldsymbol{z})\), which is the LHS.
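The three-point identity is also easy to verify numerically for a particular convex \(\Phi\) (again an arbitrary illustrative choice, \(\Phi(x)=\sum_j e^{x_j}\)):

```python
import numpy as np

phi = lambda v: np.sum(np.exp(v))
grad = lambda v: np.exp(v)
D = lambda a, b: phi(a) - phi(b) - grad(b) @ (a - b)

rng = np.random.default_rng(1)
x, y, z = rng.normal(size=(3, 4))
lhs = (grad(x) - grad(y)) @ (x - z)
rhs = D(x, y) + D(z, x) - D(z, y)
assert np.isclose(lhs, rhs)  # three-point identity of Proposition 18.1
```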

Definition 18.5 (Bregman Projection) Given a mirror map \(\Phi\) which is strictly convex and differentiable, and for all \(\boldsymbol{x}\in B\) and closed convex \(C\subset B\), we define \[\pi_C^\Phi(\boldsymbol{x})=\mathop{\mathrm{\mathop{\mathrm{argmin}}\limits}}_{\boldsymbol{z}\in C \cap B}D_\Phi(\boldsymbol{z},\boldsymbol{x})\]

18.2 Mirror Descent

We can now describe the mirror descent algorithm based on a mirror map \(\Phi\).
Let \(\boldsymbol{x}_1=\mathop{\mathrm{\mathop{\mathrm{argmin}}\limits}}_{\boldsymbol{x}\in C\cap B}\Phi(\boldsymbol{x})\). For \(s\geq 1\), do the following:
1. Let \(y_{s+1}\in B\) be such that \[\nabla\Phi(y_{s+1})=\nabla\Phi(\boldsymbol{x}_s)-\eta g_s\] where \(g_s\in\partial f(\boldsymbol{x}_s)\)
2. Let \(\boldsymbol{x}_{s+1}=\pi_C^\Phi(\boldsymbol{y}_{s+1})\).

After \(k\) steps, output either the average iterate \(\bar{\boldsymbol{x}}=\frac{1}{k}\sum^k_{s=1}\boldsymbol{x}_s\) or the best iterate \(\boldsymbol{x}_o=\mathop{\mathrm{argmin}}\limits_{\boldsymbol{x}\in\{\boldsymbol{x}_1,\ldots,\boldsymbol{x}_k\}}f(\boldsymbol{x})\)
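The two-step update can be sketched generically, with the mirror map, its inverse gradient, and the Bregman projection supplied by the caller (all names here are ours). The usage example instantiates the Euclidean setup \(\Phi=\frac{1}{2}\|\cdot\|_2^2\), where mirror descent reduces to PGD; the toy objective and the unit-ball constraint are illustrative choices:

```python
import numpy as np

def mirror_descent(grad_f, grad_phi, grad_phi_inv, project, x1, eta, k):
    """Mirror descent sketch: dual-space GD step, map back, Bregman-project."""
    x = x1
    iterates = [x]
    for _ in range(k):
        theta = grad_phi(x) - eta * grad_f(x)  # GD update in the dual space
        y = grad_phi_inv(theta)                # map back to the primal space
        x = project(y)                         # Bregman projection onto C
        iterates.append(x)
    return np.mean(iterates, axis=0)           # averaged iterate x-bar

# Euclidean instantiation: Phi(x) = 1/2 ||x||^2, so grad Phi = identity and
# the Bregman projection is the Euclidean projection (here, onto the L2 ball).
c = np.array([2.0, 0.0])                       # f(x) = 1/2 ||x - c||^2
proj_ball = lambda y: y / max(1.0, np.linalg.norm(y))
xbar = mirror_descent(lambda x: x - c, lambda x: x, lambda t: t,
                      proj_ball, np.zeros(2), eta=0.1, k=500)
# xbar approaches the constrained minimizer c/||c|| = (1, 0)
```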

Lemma 18.1 Let \(\boldsymbol{z}\in C\cap B\), then \(\forall \boldsymbol{y}\in B\), \[[\nabla\Phi(\pi_C^\Phi(\boldsymbol{y}))-\nabla\Phi(\boldsymbol{y})]^T(\pi_C^\Phi(\boldsymbol{y})-\boldsymbol{z})\leq 0\] Moreover, \(D_\Phi(\boldsymbol{z},\pi_C^\Phi(\boldsymbol{y}))+D_\Phi(\pi_C^\Phi(\boldsymbol{y}),\boldsymbol{y})\leq D_\Phi(\boldsymbol{z},\boldsymbol{y})\)

Proof. Define \(h(t)=D_\Phi(\pi+t(\boldsymbol{z}-\pi),\boldsymbol{y})\) where \(\pi\triangleq \pi_C^\Phi(\boldsymbol{y})\). Since \(\pi=\pi_C^\Phi(\boldsymbol{y})=\mathop{\mathrm{\mathop{\mathrm{argmin}}\limits}}_{\boldsymbol{x}\in C\cap B}D_\Phi(\boldsymbol{x},\boldsymbol{y})\), \(h(t)\) is minimized at \(t=0\), and hence \[h'(0)=\nabla_\boldsymbol{x} D_\Phi(\boldsymbol{x},\boldsymbol{y})^T\vert_{\boldsymbol{x}=\pi}(\boldsymbol{z}-\pi)\geq 0\] (it can’t be negative since \(h(0)\) is the minimum). Recall that \[\begin{aligned} &\quad D_\Phi(\boldsymbol{x},\boldsymbol{y})=\Phi(\boldsymbol{x})-\Phi(\boldsymbol{y})-\nabla\Phi(\boldsymbol{y})^T(\boldsymbol{x}-\boldsymbol{y})\\ &\implies \nabla_\boldsymbol{x} D_\Phi(\boldsymbol{x},\boldsymbol{y})=\nabla\Phi(\boldsymbol{x})-\nabla\Phi(\boldsymbol{y})\\ &\implies [\nabla\Phi(\pi)-\nabla\Phi(\boldsymbol{y})]^T(\pi-\boldsymbol{z})\leq 0\quad\quad\text{(first result)}\\ \text{(proposition)}&\implies D_\Phi(\pi,\boldsymbol{y})+D_\Phi(\boldsymbol{z},\pi)-D_\Phi(\boldsymbol{z},\boldsymbol{y})\leq 0 \end{aligned}\]

Theorem 18.1 Let \(f\) be convex and \(L\)-Lipschitz w.r.t. some arbitrary norm \(\|\cdot\|\). Assume that the mirror map \(\Phi\) is \(\alpha\)-strongly convex on \(C\cap B\) w.r.t. \(\|\cdot\|\). Let \[R^2=\sup_{\boldsymbol{x}\in C\cap B}\Phi(\boldsymbol{x})-\min_{\boldsymbol{x}\in C\cap B}\Phi(\boldsymbol{x})\] Then mirror descent with \(\eta=\frac{R}{L}\sqrt{\frac{2\alpha}{k}}\) gives \[f(\bar{\boldsymbol{x}}\text{ or } \boldsymbol{x}_o)-f(\boldsymbol{x}^*)\leq RL\sqrt{\frac{2}{\alpha k}}\]

Proof. First \[\begin{aligned} &\quad f(\boldsymbol{x}_s)-f(\boldsymbol{x}^*)\\ \text{($f$ is convex)} &\leq g_s^T(\boldsymbol{x}_s-\boldsymbol{x}^*)\\ \text{(def of mirror descent)} &=\frac{1}{\eta}(\nabla\Phi(\boldsymbol{x}_s)-\nabla\Phi(\boldsymbol{y}_{s+1}))^T(\boldsymbol{x}_s-\boldsymbol{x}^*)\\ \text{(proposition)}&=\frac{1}{\eta}[D_\Phi(\boldsymbol{x}_s,\boldsymbol{y}_{s+1})+D_\Phi(\boldsymbol{x}^*,\boldsymbol{x}_s)-D_\Phi(\boldsymbol{x}^*,\boldsymbol{y}_{s+1})]\\ \text{(lemma)} &\leq \frac{1}{\eta}[D_\Phi(\boldsymbol{x}_s,\boldsymbol{y}_{s+1})+D_\Phi(\boldsymbol{x}^*,\boldsymbol{x}_s)\\ &\quad -D_\Phi(\boldsymbol{x}^*,\boldsymbol{x}_{s+1})-D_\Phi(\boldsymbol{x}_{s+1},\boldsymbol{y}_{s+1})] \end{aligned}\] Consider \[\begin{aligned} &\quad D_\Phi(\boldsymbol{x}_s,\boldsymbol{y}_{s+1})-D_\Phi(\boldsymbol{x}_{s+1},\boldsymbol{y}_{s+1})\\ \text{(def of Bregman divergence)}&=[\Phi(\boldsymbol{x}_s)-\Phi(\boldsymbol{y}_{s+1})-\nabla\Phi(\boldsymbol{y}_{s+1})^T(\boldsymbol{x}_s-\boldsymbol{y}_{s+1})]\\ &-[\Phi(\boldsymbol{x}_{s+1})-\Phi(\boldsymbol{y}_{s+1})-\nabla\Phi(\boldsymbol{y}_{s+1})^T(\boldsymbol{x}_{s+1}-\boldsymbol{y}_{s+1})]\\ &=\Phi(\boldsymbol{x}_s)-\Phi(\boldsymbol{x}_{s+1})-\nabla\Phi(\boldsymbol{y}_{s+1})^T(\boldsymbol{x}_s-\boldsymbol{x}_{s+1})\\ \text{($\alpha$-strongly convex)}&\leq \nabla\Phi(\boldsymbol{x}_s)^T(\boldsymbol{x}_s-\boldsymbol{x}_{s+1})-\frac{\alpha}{2}\|\boldsymbol{x}_s-\boldsymbol{x}_{s+1}\|^2-\nabla\Phi(\boldsymbol{y}_{s+1})^T(\boldsymbol{x}_s-\boldsymbol{x}_{s+1})\\ \text{(def of mirror descent)}&=\eta g_s^T(\boldsymbol{x}_s-\boldsymbol{x}_{s+1})-\frac{\alpha}{2}\|\boldsymbol{x}_s-\boldsymbol{x}_{s+1}\|^2\\ \text{(H\"older's inequality and } \|g_s\|_*\leq L)\quad&\leq \eta L\|\boldsymbol{x}_s-\boldsymbol{x}_{s+1}\|-\frac{\alpha}{2}\|\boldsymbol{x}_s-\boldsymbol{x}_{s+1}\|^2\\ &(\text{RHS maximized when $\|\boldsymbol{x}_s-\boldsymbol{x}_{s+1}\|=\frac{\eta L}{\alpha}$})\\ &\leq \frac{(\eta L)^2}{2\alpha} \end{aligned}\]

(The rest of the chapter was scribed by Jingze Liu.)

Combining the two displays,

\[ \begin{align*} f(x_s)-f(x^*) &\leq \frac{1}{\eta}(D_{\Phi}(x^*,x_s)-D_{\Phi}(x^*,x_{s+1}))+\frac{\eta L^2}{2\alpha} \end{align*} \]

As we did before, we sum from \(s=1\) to \(s=k\) and telescope:

\[ \begin{align*} \frac{1}{k}\sum_{s=1}^k(f(x_s)-f(x^*)) &\leq \frac{\eta L^2}{2\alpha}+\frac{1}{k\eta}(D_{\Phi}(x^*,x_1)-D_{\Phi}(x^*,x_{k+1}))\\ &\leq \frac{\eta L^2}{2\alpha}+\frac{R^2}{k\eta} \end{align*} \]

where we use the fact that \[ \begin{align*} D_{\Phi}(x^*,x_1)&=\Phi(x^*)-\Phi(x_1)-\nabla\Phi(x_1)^{\top}(x^*-x_1)\\ &\leq\Phi(x^*)-\Phi(x_1)\\ &\leq R^2 \end{align*} \] The middle inequality holds because \(x_1\) minimizes \(\Phi\) over \(C\cap B\), so \(\nabla\Phi(x_1)^{\top}(x^*-x_1)\geq 0\) by first-order optimality.

Lastly, let \(\eta=\sqrt{\frac{2\alpha R^2}{kL^2}}=\frac{R}{L}\sqrt{\frac{2\alpha}{k}}\), which balances the two terms and gives \(\text{RHS}=RL\sqrt{\frac{2}{\alpha k}}\), as claimed.

Examples:

Euclidean setup (“ball” setup): \(B=\mathbb{R}^d\), \(\Phi(x)=\frac{1}{2}||x||^2\), \(\nabla \Phi(x)=x\)

\[ \begin{align*} D_{\Phi}(x,y)&=\Phi(x)-\Phi(y)-\nabla \Phi(y)^{\top}(x-y)\\ &=\frac{1}{2}||x||^2-\frac{1}{2}||y||^2-y^{\top}(x-y)\\ &=\frac{1}{2}||y-x||^2 \end{align*} \]

Thus in this case the Bregman projection is exactly the Euclidean projection, and mirror descent recovers PGD.

Note that \(\Phi\) is \(\alpha\)-strongly convex with \(\alpha=1\), since \(D_{\Phi}(x,y)=\frac{1}{2}||y-x||^2\), so the strong-convexity lower bound holds with equality.
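A one-line numerical confirmation of this identity (a sketch; the two test points are arbitrary):

```python
import numpy as np

phi = lambda v: 0.5 * v @ v                      # Phi(x) = 1/2 ||x||^2
D = lambda a, b: phi(a) - phi(b) - b @ (a - b)   # grad Phi(y) = y

x = np.array([1.0, -2.0])
y = np.array([0.5, 3.0])
# Bregman divergence equals half the squared Euclidean distance.
assert np.isclose(D(x, y), 0.5 * np.sum((x - y) ** 2))
```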

\(L_1\) setup / simplex setup

\(B=\mathbb{R}^d_+\backslash \{0\}=\{x: x_j\geq0,\ j=1,\ldots,d\}\backslash\{0\}\)

Define \(\Phi\) to be the negative entropy \(\Phi(x)=\sum_{j=1}^d x_j\log(x_j)\)

\[ \nabla \Phi(x) = \begin{bmatrix} 1+\log x_{1} \\ 1+\log x_{2} \\ \vdots \\ 1+\log x_{d} \end{bmatrix} \] Thus the update in the dual space is \(\nabla \Phi(y_{s+1})=\nabla \Phi(x_s)-\eta g_s\) where \(g_s \in \partial f(x_s)\)

\(\Rightarrow 1+\log(y_{s+1,j})=1+\log(x_{s,j})-\eta g_{s,j},\) \(j=1,2,\ldots,d\)

\(\Rightarrow y_{s+1,j}=x_{s,j}\exp(-\eta g_{s,j})\)

So \(y_{s+1}=x_s\exp(-\eta g_s)\), where the exponential and product are taken entrywise.

This is called exponentiated gradient descent, or mirror descent with multiplicative weights.

The Bregman divergence associated to \(\Phi\) is \[ \begin{align*} D_{\Phi}(x,y)&=\Phi(x)-\Phi(y)-\nabla\Phi(y)^{\top}(x-y)\\ &=\sum_{j=1}^d x_j \log(x_j)-\sum_{j=1}^{d} y_j\log(y_j)-\sum_{j=1}^d (1+\log(y_j))(x_j-y_j)\\ &=\sum_{j=1}^d x_j \log\Big(\frac{x_j}{y_j}\Big)-\sum_{j=1}^d(x_j-y_j) \end{align*} \]
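In particular, when \(x\) and \(y\) both lie on the simplex, the \(\sum_j(x_j-y_j)\) term vanishes and \(D_\Phi\) reduces to the Kullback-Leibler divergence. A quick numerical check (sketch; the two distributions are arbitrary):

```python
import numpy as np

neg_entropy = lambda v: np.sum(v * np.log(v))
grad = lambda v: 1.0 + np.log(v)
D = lambda a, b: neg_entropy(a) - neg_entropy(b) - grad(b) @ (a - b)

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.6, 0.1, 0.3])
# On the simplex, D_Phi(x, y) = KL(x || y).
assert np.isclose(D(x, y), np.sum(x * np.log(x / y)))
```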

Consider the constraint set being the simplex

\[\Delta_d=\Big\{x\in \mathbb{R}^d:\sum_{j=1}^d x_j=1,\ x_j\geq0\Big\}\] We can now show that the Bregman projection onto \(\Delta_d\) is simple renormalization \(y \mapsto y/||y||_1\).

To this end, write the Lagrangian \(\mathcal{L}=\sum_{j=1}^d x_j\log(x_j/y_j)-\sum_{j=1}^d(x_j-y_j)-\lambda\big(\sum_{j=1}^d x_j-1\big)\)

For each \(j=1,2,\ldots,d\), we have \(\frac{\partial \mathcal{L}}{\partial x_j}=\log\big(\frac{x_j}{y_j}\big)+1-1-\lambda=\log\big(\frac{x_j}{y_j}\big)-\lambda=0\)

From \(\log\big(\frac{x_j}{y_j}\big)=\lambda\), we get \(x_j=e^\lambda\cdot y_j\)

\[ \begin{align*} \sum_{j=1}^d x_j=1 &\Rightarrow 1=e^\lambda \cdot \sum_{j=1}^d y_j\\ &\Rightarrow e^\lambda=\frac{1}{\sum_{j=1}^d y_j}\\ & \Rightarrow \pi_{\Delta_d}^\Phi(y)=\frac{y}{||y||_1} \end{align*} \]

In summary, mirror descent in this setup is

\(y_{s+1}=x_s exp(-\eta \cdot g_s)\)

\(x_{s+1}=y_{s+1}/||y_{s+1}||_1\)
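Putting the two steps together gives a short, runnable sketch of exponentiated gradient descent (the linear toy objective \(f(x)=c^\top x\) and its cost vector \(c\) are our illustrative choices):

```python
import numpy as np

def exponentiated_gd(grad_f, x1, eta, k):
    # Mirror descent with the negative-entropy mirror map:
    # entrywise multiplicative update, then renormalize onto the simplex.
    x = x1
    iterates = [x]
    for _ in range(k):
        y = x * np.exp(-eta * grad_f(x))   # y_{s+1} = x_s exp(-eta g_s)
        x = y / np.sum(y)                  # Bregman projection: y / ||y||_1
        iterates.append(x)
    return np.mean(iterates, axis=0)

# Minimize f(x) = c^T x over the simplex; the minimum puts all mass
# on the coordinate with the smallest cost (index 1 here).
c = np.array([0.9, 0.2, 0.7])
xbar = exponentiated_gd(lambda x: c, np.ones(3) / 3, eta=0.3, k=2000)
```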