Chapter 18 Mirror Descent

(The first part of this chapter was scribed by Xinhai Zhang.)

18.1 Dual space and mirror map

Recall the proof in the analysis of boosting: \(\vec\alpha\) is chosen from a space with a constrained \(L_1\) norm, but because we used PGD, we bounded the \(L_2\) norm of the gradient.

Boosting is an example where we conduct optimization in a non-Euclidean space (e.g., an \(L_1\) space). While the dual of the \(L_2\) norm is the \(L_2\) norm itself, the dual of the \(L_1\) norm is the \(L_\infty\) norm (i.e., the sup norm).

To better exploit the special geometrical structure of the space, and to obtain a better rate of convergence, we can use the mirror descent algorithm.

To build intuition, let us abstract for a moment. We saw that PGD works in a Hilbert space. Suppose now we are optimizing in a more general Banach space \(B\); in particular, the norm is not derived from an inner product (think of \(B\) as an \(L_1\) space). In this case, the gradients \(\nabla f(\boldsymbol{x})\) are elements of the dual space \(B^*\), so gradient descent does not even make sense: the GD update \(\boldsymbol{x}-\eta\nabla f(\boldsymbol{x})\) cannot be performed. (Note that in a Hilbert space this was not a problem, since the dual space of \(\mathcal{H}\) is isometric to \(\mathcal{H}\) by the Riesz representation theorem.)

Definition 18.1 (Dual Space) Let \(V\) be a vector space over a field \(\mathcal{F}\) (usually \(\mathbb{R}\)). A linear functional on \(V\) is a linear map \(V\mapsto \mathcal{F}\). The dual space of \(V\), denoted by \(V^*\), is the space of all linear functionals on \(V\).

The main idea of mirror descent is the following. Since directly conducting GD in the primal space \(B\) is inappropriate, one can first map a point \(\boldsymbol{x}\in B\) to the dual space \(B^*\), perform the GD update there, and finally map the resulting point back to the primal space \(B\). Note that at each update, the new point in the primal space might fall outside the constraint set \(C\subset B\), in which case it should be projected back onto \(C\).

Ingredients of mirror descent:
- Potential function also known as mirror map \(\Phi : B\mapsto \mathbb{R}\). The gradient of \(\Phi\) is used to map points from the primal to the dual.
- Bregman divergence associated with \(\Phi\), which is a measure of distance between two points in the primal.
- The projection is done via Bregman divergence.

Definition 18.2 (Dual Norm) For any given arbitrary norm \(\|\cdot\|\) on \(\mathbb{R}^d\), the dual norm \(\|\cdot\|_*\) is defined as \[\|g\|_*=\sup_{\boldsymbol{x}\in\mathbb{R}^d, \|\boldsymbol{x}\|\leq 1} g^T\boldsymbol{x}\]

  • The dual of the Euclidean (\(L_2\)) norm is itself. This comes directly from the C-S inequality.
  • The dual of the \(L_\infty\) norm is the \(L_1\) norm, since the inequality \(\boldsymbol{x}^T\boldsymbol{y}\leq\|\boldsymbol{x}\|_\infty\|\boldsymbol{y}\|_1\) holds trivially and is attained when \(\boldsymbol{x}=\mathrm{sign}(\boldsymbol{y})\)
  • The dual of the dual norm is the original norm
  • The dual norm of the \(L_p\) norm is the \(L_q\) norm with \(\frac{1}{p}+\frac{1}{q}=1\) (due to Hölder’s inequality).
  • By definition of the dual norm, we have \(\boldsymbol{x}^T\boldsymbol{y}\leq\|\boldsymbol{x}\|\|\boldsymbol{y}\|_*\), a generalization of the C-S inequality, which corresponds to the \(L_2\) norm.
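A quick numerical sanity check of these duality facts (a sketch using NumPy; the random vectors and the exponents \(p=3\), \(q=3/2\) are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=5)

# Dual of the L-infinity norm is the L1 norm:
# sup_{||x||_inf <= 1} g^T x is attained at x = sign(g).
x_star = np.sign(g)
assert np.isclose(x_star @ g, np.linalg.norm(g, 1))

# Generalized C-S / Holder: x^T g <= ||x||_p ||g||_q with 1/p + 1/q = 1.
p, q = 3.0, 1.5
for _ in range(1000):
    x = rng.normal(size=5)
    assert x @ g <= np.linalg.norm(x, p) * np.linalg.norm(g, q) + 1e-12
```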

Definition 18.3 A convex function \(\Phi\) on a convex set \(B\) is said to be
1. L-Lipschitz w.r.t. \(\|\cdot\|\) if \(\forall \boldsymbol{x}\in B\), \(g\in\partial\Phi(\boldsymbol{x})\), we have \(\|g\|_*\leq L\)
2. \(\alpha\)-strongly convex w.r.t. \(\|\cdot\|\) if \(\forall \boldsymbol{x},\boldsymbol{y}\in B\) and \(\forall g\in\partial\Phi(\boldsymbol{x})\), we have \[\Phi(\boldsymbol{x})-\Phi(\boldsymbol{y})\leq g^T(\boldsymbol{x}-\boldsymbol{y})-\frac{\alpha}{2}\|\boldsymbol{x}-\boldsymbol{y}\|^2\] (i.e. \(\Phi(\boldsymbol{y})\geq \Phi(\boldsymbol{x})+g^T(\boldsymbol{y}-\boldsymbol{x})+\frac{\alpha}{2}\|\boldsymbol{y}-\boldsymbol{x}\|^2\))

We say that \(\Phi: B\mapsto\mathbb{R}\) is a mirror map (potential function) if
1. \(\Phi\) is strictly convex and differentiable
2. \(\|\nabla\Phi(\boldsymbol{x})\|_*\to\infty\) as \(\boldsymbol{x}\to\partial B\) (boundary of \(B\))

In mirror descent, a point \(\boldsymbol{x}\in B\) is mapped to \(\nabla\Phi(\boldsymbol{x})\in B^*\), from which one takes a GD update to get \(\nabla\Phi(\boldsymbol{x})-\eta\nabla f(\boldsymbol{x})\). (Note both are in the dual space.)
Property 2 ensures that there must exist \(y\in B\) s.t. \[\nabla\Phi(\boldsymbol{y})=\nabla\Phi(\boldsymbol{x})-\eta\nabla f(\boldsymbol{x})\] In case \(y\) is outside of the constraint set \(C\), projection is done using Bregman divergence.

Definition 18.4 (Bregman divergence associated to a function) For a given differentiable, \(\alpha\)-strongly convex function \(\Phi: B\to\mathbb{R}\), its associated Bregman divergence is \[D_\Phi(\boldsymbol{x},\boldsymbol{y})=\Phi(\boldsymbol{x})-\Phi(\boldsymbol{y})-\nabla\Phi(\boldsymbol{y})^T(\boldsymbol{x}-\boldsymbol{y})\] or \[D_\Phi(\boldsymbol{y},\boldsymbol{x})=\Phi(\boldsymbol{y})-\Phi(\boldsymbol{x})-\nabla\Phi(\boldsymbol{x})^T(\boldsymbol{y}-\boldsymbol{x})\]

  • Think of this divergence as the approximation error of the first-order Taylor (linear) approximation of \(\Phi\) at \(\boldsymbol{y}\), evaluated at \(\boldsymbol{x}\)
  • In general \(D_\Phi(\boldsymbol{x},\boldsymbol{y})\neq D_\Phi(\boldsymbol{y},\boldsymbol{x})\), since \(\nabla\Phi(\boldsymbol{y})\neq \nabla\Phi(\boldsymbol{x})\)
  • If \(\Phi\) is convex, then \(D_\Phi(\boldsymbol{x},\boldsymbol{y})\geq 0\) by the definition of the subgradient (or because the Hessian of \(\Phi\) is p.s.d.)
  • If \(\Phi\) is \(\alpha\)-strongly convex, then \[D_\Phi(\boldsymbol{y},\boldsymbol{x})\geq\frac{\alpha}{2}\|\boldsymbol{y}-\boldsymbol{x}\|^2\] That is, the Bregman divergence upper-bounds \(\alpha/2\) times the squared norm.
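These properties can be checked numerically. The sketch below uses an illustrative mirror map \(\Phi(x)=\sum_j e^{x_j}\) (our choice for the example, not one used later in the chapter):

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    # D_Phi(x, y) = Phi(x) - Phi(y) - grad Phi(y)^T (x - y)
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

phi = lambda v: np.sum(np.exp(v))      # strictly convex on R^d
grad_phi = lambda v: np.exp(v)

x = np.array([0.5, -1.0])
y = np.array([0.0, 0.3])
d_xy = bregman(phi, grad_phi, x, y)
d_yx = bregman(phi, grad_phi, y, x)
assert d_xy >= 0 and d_yx >= 0         # nonnegative since phi is convex
assert not np.isclose(d_xy, d_yx)      # asymmetric in general
```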

Proposition 18.1 Given convex function \(\Phi\) on \(B\), with \(\boldsymbol{x},\boldsymbol{y},\boldsymbol{z}\in B\), \[[\nabla\Phi(\boldsymbol{x})-\nabla\Phi(\boldsymbol{y})]^T(\boldsymbol{x}-\boldsymbol{z})=D_\Phi(\boldsymbol{x},\boldsymbol{y})+D_\Phi(\boldsymbol{z},\boldsymbol{x})-D_\Phi(\boldsymbol{z},\boldsymbol{y})\]

Proof. \[\begin{aligned} RHS&=\Phi(\boldsymbol{x})-\Phi(\boldsymbol{y})-\nabla\Phi(\boldsymbol{y})^T(\boldsymbol{x}-\boldsymbol{y})\\ &+\Phi(\boldsymbol{z})-\Phi(\boldsymbol{x})-\nabla\Phi(\boldsymbol{x})^T(\boldsymbol{z}-\boldsymbol{x})\\ &-[\Phi(\boldsymbol{z})-\Phi(\boldsymbol{y})-\nabla\Phi(\boldsymbol{y})^T(\boldsymbol{z}-\boldsymbol{y})] \end{aligned}\] The \(\Phi\) terms cancel, and the remaining gradient terms collect to \(\nabla\Phi(\boldsymbol{y})^T(\boldsymbol{z}-\boldsymbol{x})-\nabla\Phi(\boldsymbol{x})^T(\boldsymbol{z}-\boldsymbol{x})=[\nabla\Phi(\boldsymbol{x})-\nabla\Phi(\boldsymbol{y})]^T(\boldsymbol{x}-\boldsymbol{z})\), which is the LHS.
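The three-point identity is also easy to verify numerically for a particular convex \(\Phi\) (again an arbitrary illustrative choice, \(\Phi(x)=\sum_j e^{x_j}\)):

```python
import numpy as np

phi = lambda v: np.sum(np.exp(v))
grad = lambda v: np.exp(v)
D = lambda a, b: phi(a) - phi(b) - grad(b) @ (a - b)

rng = np.random.default_rng(1)
x, y, z = rng.normal(size=(3, 4))
lhs = (grad(x) - grad(y)) @ (x - z)
rhs = D(x, y) + D(z, x) - D(z, y)
assert np.isclose(lhs, rhs)  # three-point identity of Proposition 18.1
```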

Definition 18.5 (Bregman Projection) Given a mirror map \(\Phi\) which is strictly convex and differentiable, and for all \(\boldsymbol{x}\in B\) and closed convex \(C\subset B\), we define \[\pi_C^\Phi(\boldsymbol{x})=\mathop{\mathrm{\mathop{\mathrm{argmin}}\limits}}_{\boldsymbol{z}\in C \cap B}D_\Phi(\boldsymbol{z},\boldsymbol{x})\]

18.2 Mirror Descent

We can now describe the mirror descent algorithm based on a mirror map \(\Phi\).
Let \(\boldsymbol{x}_1=\mathop{\mathrm{\mathop{\mathrm{argmin}}\limits}}_{\boldsymbol{x}\in C\cap B}\Phi(\boldsymbol{x})\). For \(s\geq 1\), do the following:
1. Let \(y_{s+1}\in B\) be such that \[\nabla\Phi(y_{s+1})=\nabla\Phi(\boldsymbol{x}_s)-\eta g_s\] where \(g_s\in\partial f(\boldsymbol{x}_s)\)
2. Let \(\boldsymbol{x}_{s+1}=\pi_C^\Phi(\boldsymbol{y}_{s+1})\).

After \(k\) steps, output either the average iterate \(\bar{\boldsymbol{x}}=\frac{1}{k}\sum^k_{s=1}\boldsymbol{x}_s\) or the best iterate \(\boldsymbol{x}_o=\mathop{\mathrm{argmin}}\limits_{\boldsymbol{x}\in\{\boldsymbol{x}_1,\ldots,\boldsymbol{x}_k\}}f(\boldsymbol{x})\)
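The two-step update can be sketched generically, with the mirror map, its inverse gradient, and the Bregman projection supplied by the caller (all names here are ours). The usage example instantiates the Euclidean setup \(\Phi=\frac{1}{2}\|\cdot\|_2^2\), where mirror descent reduces to PGD; the toy objective and the unit-ball constraint are illustrative choices:

```python
import numpy as np

def mirror_descent(grad_f, grad_phi, grad_phi_inv, project, x1, eta, k):
    """Mirror descent sketch: dual-space GD step, map back, Bregman-project."""
    x = x1
    iterates = [x]
    for _ in range(k):
        theta = grad_phi(x) - eta * grad_f(x)  # GD update in the dual space
        y = grad_phi_inv(theta)                # map back to the primal space
        x = project(y)                         # Bregman projection onto C
        iterates.append(x)
    return np.mean(iterates, axis=0)           # averaged iterate x-bar

# Euclidean instantiation: Phi(x) = 1/2 ||x||^2, so grad Phi = identity and
# the Bregman projection is the Euclidean projection (here, onto the L2 ball).
c = np.array([2.0, 0.0])                       # f(x) = 1/2 ||x - c||^2
proj_ball = lambda y: y / max(1.0, np.linalg.norm(y))
xbar = mirror_descent(lambda x: x - c, lambda x: x, lambda t: t,
                      proj_ball, np.zeros(2), eta=0.1, k=500)
# xbar approaches the constrained minimizer c/||c|| = (1, 0)
```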

Lemma 18.1 Let \(\boldsymbol{z}\in C\cap B\), then \(\forall \boldsymbol{y}\in B\), \[[\nabla\Phi(\pi_C^\Phi(\boldsymbol{y}))-\nabla\Phi(\boldsymbol{y})]^T(\pi_C^\Phi(\boldsymbol{y})-\boldsymbol{z})\leq 0\] Moreover, \(D_\Phi(\boldsymbol{z},\pi_C^\Phi(\boldsymbol{y}))+D_\Phi(\pi_C^\Phi(\boldsymbol{y}),\boldsymbol{y})\leq D_\Phi(\boldsymbol{z},\boldsymbol{y})\)

Proof. Define \(h(t)=D_\Phi(\pi+t(\boldsymbol{z}-\pi),\boldsymbol{y})\) where \(\pi\triangleq \pi_C^\Phi(\boldsymbol{y})\). Since \(\pi=\pi_C^\Phi(\boldsymbol{y})=\mathop{\mathrm{\mathop{\mathrm{argmin}}\limits}}_{\boldsymbol{x}\in C\cap B}D_\Phi(\boldsymbol{x},\boldsymbol{y})\), \(h(t)\) is minimized at \(t=0\), and hence \[h'(0)=\nabla_\boldsymbol{x} D_\Phi(\boldsymbol{x},\boldsymbol{y})^T\vert_{\boldsymbol{x}=\pi}(\boldsymbol{z}-\pi)\geq 0\] (it can’t be negative since \(h(0)\) is the minimum). Recall that \[\begin{aligned} &\quad D_\Phi(\boldsymbol{x},\boldsymbol{y})=\Phi(\boldsymbol{x})-\Phi(\boldsymbol{y})-\nabla\Phi(\boldsymbol{y})^T(\boldsymbol{x}-\boldsymbol{y})\\ &\implies \nabla_\boldsymbol{x} D_\Phi(\boldsymbol{x},\boldsymbol{y})=\nabla\Phi(\boldsymbol{x})-\nabla\Phi(\boldsymbol{y})\\ &\implies [\nabla\Phi(\pi)-\nabla\Phi(\boldsymbol{y})]^T(\pi-\boldsymbol{z})\leq 0\quad\quad\text{(first result)}\\ \text{(proposition)}&\implies D_\Phi(\pi,\boldsymbol{y})+D_\Phi(\boldsymbol{z},\pi)-D_\Phi(\boldsymbol{z},\boldsymbol{y})\leq 0 \end{aligned}\]

Theorem 18.1 Let \(f\) be convex and \(L\)-Lipschitz w.r.t. some arbitrary norm \(\|\cdot\|\). Assume that the mirror map \(\Phi\) is \(\alpha\)-strongly convex on \(C\cap B\) w.r.t. \(\|\cdot\|\). Let \[R^2=\sup_{\boldsymbol{x}\in C\cap B}\Phi(\boldsymbol{x})-\min_{\boldsymbol{x}\in C\cap B}\Phi(\boldsymbol{x})\] Then mirror descent with \(\eta=\frac{R}{L}\sqrt{\frac{2\alpha}{k}}\) gives \[f(\bar{\boldsymbol{x}}\text{ or } \boldsymbol{x}_o)-f(\boldsymbol{x}^*)\leq RL\sqrt{\frac{2}{\alpha k}}\]

Proof. First \[\begin{aligned} &\quad f(\boldsymbol{x}_s)-f(\boldsymbol{x}^*)\\ \text{($f$ is convex)} &\leq g_s^T(\boldsymbol{x}_s-\boldsymbol{x}^*)\\ \text{(def of mirror descent)} &=\frac{1}{\eta}(\nabla\Phi(\boldsymbol{x}_s)-\nabla\Phi(\boldsymbol{y}_{s+1}))^T(\boldsymbol{x}_s-\boldsymbol{x}^*)\\ \text{(proposition)}&=\frac{1}{\eta}[D_\Phi(\boldsymbol{x}_s,\boldsymbol{y}_{s+1})+D_\Phi(\boldsymbol{x}^*,\boldsymbol{x}_s)-D_\Phi(\boldsymbol{x}^*,\boldsymbol{y}_{s+1})]\\ \text{(lemma)} &\leq \frac{1}{\eta}[D_\Phi(\boldsymbol{x}_s,\boldsymbol{y}_{s+1})+D_\Phi(\boldsymbol{x}^*,\boldsymbol{x}_s)\\ &\quad -D_\Phi(\boldsymbol{x}^*,\boldsymbol{x}_{s+1})-D_\Phi(\boldsymbol{x}_{s+1},\boldsymbol{y}_{s+1})] \end{aligned}\] Consider \[\begin{aligned} &\quad D_\Phi(\boldsymbol{x}_s,\boldsymbol{y}_{s+1})-D_\Phi(\boldsymbol{x}_{s+1},\boldsymbol{y}_{s+1})\\ \text{(def of Bregman divergence)}&=[\Phi(\boldsymbol{x}_s)-\Phi(\boldsymbol{y}_{s+1})-\nabla\Phi(\boldsymbol{y}_{s+1})^T(\boldsymbol{x}_s-\boldsymbol{y}_{s+1})]\\ &-[\Phi(\boldsymbol{x}_{s+1})-\Phi(\boldsymbol{y}_{s+1})-\nabla\Phi(\boldsymbol{y}_{s+1})^T(\boldsymbol{x}_{s+1}-\boldsymbol{y}_{s+1})]\\ &=\Phi(\boldsymbol{x}_s)-\Phi(\boldsymbol{x}_{s+1})-\nabla\Phi(\boldsymbol{y}_{s+1})^T(\boldsymbol{x}_s-\boldsymbol{x}_{s+1})\\ \text{($\alpha$-strongly convex)}&\leq \nabla\Phi(\boldsymbol{x}_s)^T(\boldsymbol{x}_s-\boldsymbol{x}_{s+1})-\frac{\alpha}{2}\|\boldsymbol{x}_s-\boldsymbol{x}_{s+1}\|^2-\nabla\Phi(\boldsymbol{y}_{s+1})^T(\boldsymbol{x}_s-\boldsymbol{x}_{s+1})\\ \text{(def of mirror descent)}&=\eta g_s^T(\boldsymbol{x}_s-\boldsymbol{x}_{s+1})-\frac{\alpha}{2}\|\boldsymbol{x}_s-\boldsymbol{x}_{s+1}\|^2\\ \text{(H\"older's inequality and } \|g_s\|_*\leq L)\quad&\leq \eta L\|\boldsymbol{x}_s-\boldsymbol{x}_{s+1}\|-\frac{\alpha}{2}\|\boldsymbol{x}_s-\boldsymbol{x}_{s+1}\|^2\\ &(\text{RHS maximized when $\|\boldsymbol{x}_s-\boldsymbol{x}_{s+1}\|=\frac{\eta L}{\alpha}$})\\ &\leq \frac{(\eta L)^2}{2\alpha} \end{aligned}\]

(The rest of the chapter was scribed by Jingze Liu.)

Combining the two displays,

\[ \begin{align*} f(x_s)-f(x^*) &\leq \frac{1}{\eta}(D_{\Phi}(x^*,x_s)-D_{\Phi}(x^*,x_{s+1}))+\frac{\eta L^2}{2\alpha} \end{align*} \]

As we did before, we sum from \(s=1\) to \(s=k\) and telescope:

\[ \begin{align*} \frac{1}{k}\sum_{s=1}^k(f(x_s)-f(x^*)) &\leq \frac{\eta L^2}{2\alpha}+\frac{1}{k\eta}(D_{\Phi}(x^*,x_1)-D_{\Phi}(x^*,x_{k+1}))\\ &\leq \frac{\eta L^2}{2\alpha}+\frac{R^2}{k\eta} \end{align*} \]

where we use the fact that \[ \begin{align*} D_{\Phi}(x^*,x_1)&=\Phi(x^*)-\Phi(x_1)-\nabla\Phi(x_1)^{\top}(x^*-x_1)\\ &\leq\Phi(x^*)-\Phi(x_1)\\ &\leq R^2 \end{align*} \] The middle inequality holds because \(x_1\) minimizes \(\Phi\) over \(C\cap B\), so \(\nabla\Phi(x_1)^{\top}(x^*-x_1)\geq 0\) by first-order optimality.

Lastly, let \(\eta=\sqrt{\frac{2\alpha R^2}{kL^2}}=\frac{R}{L}\sqrt{\frac{2\alpha}{k}}\), which balances the two terms and gives \(\text{RHS}=RL\sqrt{\frac{2}{\alpha k}}\), as claimed.

Examples:

Euclidean setup (“ball” setup): \(B=\mathbb{R}^d\), \(\Phi(x)=\frac{1}{2}||x||^2\), \(\nabla \Phi(x)=x\)

\[ \begin{align*} D_{\Phi}(x,y)&=\Phi(x)-\Phi(y)-\nabla \Phi(y)^{\top}(x-y)\\ &=\frac{1}{2}||x||^2-\frac{1}{2}||y||^2-y^{\top}(x-y)\\ &=\frac{1}{2}||y-x||^2 \end{align*} \]

Thus in this case the Bregman projection is exactly the Euclidean projection, and mirror descent recovers PGD.

Note that \(\Phi\) is \(\alpha\)-strongly convex with \(\alpha=1\), since \(D_{\Phi}(x,y)=\frac{1}{2}||y-x||^2\), so the strong-convexity lower bound holds with equality.
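A one-line numerical confirmation of this identity (a sketch; the two test points are arbitrary):

```python
import numpy as np

phi = lambda v: 0.5 * v @ v                      # Phi(x) = 1/2 ||x||^2
D = lambda a, b: phi(a) - phi(b) - b @ (a - b)   # grad Phi(y) = y

x = np.array([1.0, -2.0])
y = np.array([0.5, 3.0])
# Bregman divergence equals half the squared Euclidean distance.
assert np.isclose(D(x, y), 0.5 * np.sum((x - y) ** 2))
```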

\(L_1\) setup / simplex setup

\(B=\mathbb{R}^d_+\backslash \{0\}=\{x: x_j\geq0,\ j=1,\ldots,d\}\backslash\{0\}\)

Define \(\Phi\) to be the negative entropy \(\Phi(x)=\sum_{j=1}^d x_j\log(x_j)\)

\[ \nabla \Phi(x) = \begin{bmatrix} 1+\log x_{1} \\ 1+\log x_{2} \\ \vdots \\ 1+\log x_{d} \end{bmatrix} \] Thus the update in the dual space is \(\nabla \Phi(y_{s+1})=\nabla \Phi(x_s)-\eta g_s\) where \(g_s \in \partial f(x_s)\)

\(\Rightarrow 1+\log(y_{s+1,j})=1+\log(x_{s,j})-\eta g_{s,j},\) \(j=1,2,\ldots,d\)

\(\Rightarrow y_{s+1,j}=x_{s,j}\exp(-\eta g_{s,j})\)

So \(y_{s+1}=x_s\exp(-\eta g_s)\), where the exponential and product are taken entrywise.

This is called exponentiated gradient descent, or mirror descent with multiplicative weights.

The Bregman divergence associated to \(\Phi\) is \[ \begin{align*} D_{\Phi}(x,y)&=\Phi(x)-\Phi(y)-\nabla\Phi(y)^{\top}(x-y)\\ &=\sum_{j=1}^d x_j \log(x_j)-\sum_{j=1}^{d} y_j\log(y_j)-\sum_{j=1}^d (1+\log(y_j))(x_j-y_j)\\ &=\sum_{j=1}^d x_j \log\Big(\frac{x_j}{y_j}\Big)-\sum_{j=1}^d(x_j-y_j) \end{align*} \]
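In particular, when \(x\) and \(y\) both lie on the simplex, the \(\sum_j(x_j-y_j)\) term vanishes and \(D_\Phi\) reduces to the Kullback-Leibler divergence. A quick numerical check (sketch; the two distributions are arbitrary):

```python
import numpy as np

neg_entropy = lambda v: np.sum(v * np.log(v))
grad = lambda v: 1.0 + np.log(v)
D = lambda a, b: neg_entropy(a) - neg_entropy(b) - grad(b) @ (a - b)

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.6, 0.1, 0.3])
# On the simplex, D_Phi(x, y) = KL(x || y).
assert np.isclose(D(x, y), np.sum(x * np.log(x / y)))
```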

Consider the constraint set being the simplex

\[\Delta_d=\Big\{x\in \mathbb{R}^d:\sum_{j=1}^d x_j=1,\ x_j\geq0\Big\}\] We can now show that the Bregman projection onto \(\Delta_d\) is simple renormalization \(y \mapsto y/||y||_1\).

To this end, write the Lagrangian \(\mathcal{L}=\sum_{j=1}^d x_j\log(x_j/y_j)-\sum_{j=1}^d(x_j-y_j)-\lambda\big(\sum_{j=1}^d x_j-1\big)\)

For each \(j=1,2,\ldots,d\), we have \(\frac{\partial \mathcal{L}}{\partial x_j}=\log\big(\frac{x_j}{y_j}\big)+1-1-\lambda=\log\big(\frac{x_j}{y_j}\big)-\lambda=0\)

From \(\log\big(\frac{x_j}{y_j}\big)=\lambda\), we get \(x_j=e^\lambda\cdot y_j\)

\[ \begin{align*} \sum_{j=1}^d x_j=1 &\Rightarrow 1=e^\lambda \cdot \sum_{j=1}^d y_j\\ &\Rightarrow e^\lambda=\frac{1}{\sum_{j=1}^d y_j}\\ & \Rightarrow \pi_{\Delta_d}^\Phi(y)=\frac{y}{||y||_1} \end{align*} \]

In summary, mirror descent in this setup is

\(y_{s+1}=x_s exp(-\eta \cdot g_s)\)

\(x_{s+1}=y_{s+1}/||y_{s+1}||_1\)
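Putting the two steps together gives a short, runnable sketch of exponentiated gradient descent (the linear toy objective \(f(x)=c^\top x\) and its cost vector \(c\) are our illustrative choices):

```python
import numpy as np

def exponentiated_gd(grad_f, x1, eta, k):
    # Mirror descent with the negative-entropy mirror map:
    # entrywise multiplicative update, then renormalize onto the simplex.
    x = x1
    iterates = [x]
    for _ in range(k):
        y = x * np.exp(-eta * grad_f(x))   # y_{s+1} = x_s exp(-eta g_s)
        x = y / np.sum(y)                  # Bregman projection: y / ||y||_1
        iterates.append(x)
    return np.mean(iterates, axis=0)

# Minimize f(x) = c^T x over the simplex; the minimum puts all mass
# on the coordinate with the smallest cost (index 1 here).
c = np.array([0.9, 0.2, 0.7])
xbar = exponentiated_gd(lambda x: c, np.ones(3) / 3, eta=0.3, k=2000)
```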