Chapter 15 Learning with kernels

(This chapter was scribed by Jingze Liu.)

We have seen that every kernel defines a space of functions, \(\mathscr{H}\), which is an RKHS. This is exactly the kind of hypothesis class we have been working with since the beginning of the semester. Now let’s talk about how to learn with kernels.

We will start with kernelized ridge regression, where we have a sample \(S=\{(x_i,y_i)\}_{i=1}^{n}\) and want to find a function \(f \in \mathscr{H}\) that fits the data. A natural objective function is the empirical squared error loss plus a penalty for the complexity of \(f\), measured by \(||f||_\mathscr{H}^2\):

\[\hat f \in \arg\min_{f \in \mathscr{H}} \sum_{i=1}^{n}\frac{1}{2}(f(x_i)-y_i)^2 + \frac{\lambda}{2}||f||_{\mathscr{H}}^2 \] More generally, a learning problem can be posed as

\[\hat f \in \arg\min_{f \in \mathscr{H}} \hat L_n(f) + Q(||f||_\mathscr{H}^2) ~~~~~~~~~~~~~~~ (*)\]

where \(\hat L_n(f)= \frac{1}{n}\sum_{i=1}^{n} l(f(x_i),y_i)\) is the empirical risk on the sample, and \(Q:[0,\infty) \mapsto [0,\infty)\) is strictly increasing,

e.g. \[Q(||f||_\mathscr{H}^2) = \frac{\lambda}{2} ||f||_\mathscr{H}^2\]

This optimization problem may seem daunting since it is optimizing over a potentially very large function space \(\mathscr{H}\). But the following representer theorem reassures us that all minimizers can be written as a linear combination of the kernel functions evaluated at the training points.

15.1 Representer theorem

Theorem: (Representer theorem)

Let \(V\) denote the span of the representers of the \(n\) data points in \(\mathscr{H}\):

\[ \begin{align*} V &\triangleq \mathrm{span}\{k(x_i,\cdot): i=1,...,n\}\\ &=\{\sum_{i=1}^{n} \alpha_i k(x_i,\cdot): \vec \alpha \in \mathbb{R}^n\} \end{align*} \] Then any minimizer \[\hat f \in \arg\min_{f \in \mathscr{H}} \hat L_n(f) + Q(||f||_\mathscr{H}^2)\] satisfies \(\hat f \in V\).

Proof. The key is to use the fact that an RKHS has an inner product structure, which allows us to use linear algebra.

Define the orthogonal complement:

\[V_{\bot} \triangleq \{g \in \mathscr{H}: \langle f,g \rangle=0, \forall f \in V\} \] Then \(\hat f \in \mathscr{H}\) can be decomposed into a component \(f \in V\) plus a component \(f_{\bot} \in V_{\bot}\), \[\hat f= f + f_{\bot}\]

The idea is that the loss is unchanged by \(f_{\bot}\), but the penalty grows whenever \(f_{\bot}\neq 0\), so we must have \(f_{\bot}=0\). To see this, note that the loss depends on \(\hat f\) only through \(\{ \hat f(x_i), i=1,...,n\}\), and by the reproducing property

\[\hat f(x_i)= \langle \hat f, k(x_i,\cdot)\rangle_{\mathscr{H}} = f(x_i)+\underbrace{\langle f_{\bot}, k(x_i,\cdot)\rangle_{\mathscr{H}}}_{=0 \text{ since } k(x_i,\cdot)\in V} = f(x_i)\]

So the loss part doesn’t depend on \(f_\bot\).

On the other hand, since \(\langle f, f_\bot \rangle_{\mathscr{H}}=0\), \[Q(||\hat f||_\mathscr{H}^2) = Q(||f||_{\mathscr{H}}^2 +||f_\bot||_{\mathscr{H}}^2)\] Since \(Q\) is strictly increasing, we must have \(f_\bot =0\) to minimize the penalty. Therefore \(\hat f \in V\).

Note that the representer theorem does not require the loss function \(l\) to be convex.

The theorem tells us that the \(\alpha_i\)’s that minimize \((*)\) exist, but how to find them depends on the actual loss function and regularizer. Let’s now look at some examples.
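Before the examples, here is the generic reduction in a nutshell (a sketch using the notation above, with \(K\in\mathbb{R}^{n\times n}\), \(K_{ij}=k(x_i,x_j)\), the kernel matrix of the sample): plugging the representer form into \((*)\) turns the search over \(\mathscr{H}\) into a search over \(\vec\alpha\in\mathbb{R}^n\),

\[f=\sum_{i=1}^{n}\alpha_i k(x_i,\cdot) \;\Longrightarrow\; f(x_j)=(K\vec\alpha)_j, \qquad ||f||_{\mathscr{H}}^2=\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j k(x_i,x_j)=\vec\alpha^{\top}K\vec\alpha,\]

so \((*)\) becomes the \(n\)-dimensional problem of minimizing \(\frac{1}{n}\sum_{i=1}^n l\big((K\vec\alpha)_i, y_i\big)+Q(\vec\alpha^{\top}K\vec\alpha)\) over \(\vec\alpha\).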

15.2 Examples of kernel learning

  • Example (kernelized ridge regression)

Recall the optimization problem for kernelized ridge regression \[ \min_{f \in \mathscr{H}} \sum_{i=1}^{n} \frac{1}{2} (f(x_i)-y_i)^2 + \frac{\lambda}{2}||f||_{\mathscr{H}}^2 \]

where \(\mathscr{H}\) is an RKHS with kernel function \(k(\cdot,\cdot)\). By the representer theorem,

\[\hat f= \sum_{i=1}^{n} \alpha_i k(x_i,\cdot)\]

Hence, we have the equivalent finite-dimensional optimization problem

\[\min_{\alpha_1,...,\alpha_n \in \mathbb{R}} \sum_{i=1}^{n} \frac{1}{2} (\sum_{j=1}^{n} \alpha_j k(x_j,x_i)-y_i)^2 + \frac{\lambda}{2} \vec{\alpha}^{\top} K \vec{\alpha}\] where \(K \in \mathbb{R}^{n \times n}\) is the kernel matrix for the training sample and \(Y \in \mathbb{R}^n\) is the vector of training responses.

Differentiating with respect to \(\alpha\) and setting to zero

\[K(K\alpha - Y) + \lambda K \alpha=0\]

Rearranging:

\[(K^2+ \lambda K)\alpha= KY\]

\[\alpha= (K+\lambda I)^{-1}Y\] To predict at a new data point \(x\), we use \(f(x)=\sum_{j=1}^{n} \alpha_j k(x_j,x)=\vec{\alpha}\cdot c\) where \[\begin{align} c &= \begin{bmatrix} k(x_{1},x) \\ \vdots \\ k(x_{n},x) \end{bmatrix} \end{align}\]

and then predict \[y= c^{\top} \alpha=[k(x_1,x),...,k(x_n,x)] \, (K+\lambda I)^{-1}Y\]
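A minimal numerical sketch of these formulas, assuming a Gaussian (RBF) kernel and synthetic toy data (the kernel choice, the helper `rbf_kernel`, and all parameter values are illustrative, not from the notes):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian (RBF) kernel matrix: entry (i, j) is exp(-gamma * ||A_i - B_j||^2)
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

# toy data (purely illustrative)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)

# fit: alpha = (K + lambda I)^{-1} Y
lam = 0.5
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)

# predict at new points: f(x) = sum_j alpha_j k(x_j, x)
X_new = np.linspace(-3, 3, 100)[:, None]
y_hat = rbf_kernel(X_new, X) @ alpha
```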

(The rest of this chapter was scribed by Baozhen Wang.)

  • Connection between ridge & kernel ridge:

Recall ridge regression: \(\beta = (X^T X + \lambda I)^{-1} X^T Y\), which is equivalent to \[\begin{alignat*}{3} &(X^T X + \lambda &&I)\beta &&= X^T Y\\ (\Leftrightarrow&) X^T X \beta \ + &&\lambda \beta &&= X^T Y\\ (\Leftrightarrow&) &&\lambda \beta &&= X^TY-X^T X \beta\\ (\Leftrightarrow&) &&\ \ \beta && = X^T \left[\frac{1}{\lambda} (Y - X\beta) \right] \end{alignat*}\] Letting \[\vec{\alpha} = \frac{1}{\lambda} (Y - X\beta),\] we have \[\lambda \vec{\alpha} = Y - X\beta = Y - X X^T \vec{\alpha},\] equivalently \[(X X^T + \lambda I) \alpha = Y,\] hence \[\alpha = (X X^T + \lambda I)^{-1} Y.\] Since \(XX^T\) is exactly the kernel matrix \(K\) for the linear kernel \(k(x,x')=x^T x'\), this is the kernel ridge solution \(\alpha=(K+\lambda I)^{-1}Y\), with \(\beta = X^T \vec{\alpha}\).
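A quick numerical sanity check of this equivalence (a sketch with synthetic data; all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 30, 5, 2.0
X = rng.standard_normal((n, d))
Y = rng.standard_normal(n)

# primal ridge: beta = (X^T X + lam I)^{-1} X^T Y
beta = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# dual / kernel form with the linear kernel K = X X^T: alpha = (K + lam I)^{-1} Y
alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), Y)
beta_from_alpha = X.T @ alpha                 # beta = X^T alpha

print(np.allclose(beta, beta_from_alpha))     # expected: True (up to round-off)
```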

Example 15.1 (Kernel PCA) :

  • Review of PCA (but imagine that \(x \mapsto \phi (x)\) using a linear kernel feature map).
    • Suppose we have \(x_1, \dots, x_n \in X\), and a feature map \(\phi: X\mapsto \mathbb{R}^d.\) Assume that the features are centered, and the centered feature matrix is \(\Phi_{n\times d}\). Then the sample covariance matrix is \[S_n \triangleq \frac{1}{n} \Phi^T \Phi.\] Note \[\Phi = \underbrace{\begin{bmatrix}\cdots & \phi(x_1)^T & \cdots \\ \cdots & \phi(x_2)^T & \cdots \\ & \vdots & \\ \cdots & \phi(x_n)^T & \cdots \end{bmatrix}}_{\text{d}} \]
    • PCA seeks an eigendecomposition of \(S_n\): \[S_n\cdot \vec{v}_j = \lambda_j \cdot \vec{v}_j,\] where each \(\vec{v}_j\) is called a loading vector.
    • The \(j\)-th PC score for a data vector \(x\) is \[v_j^T \phi(x) = \left\langle \phi (x), v_j \right\rangle\] The squared PC reconstruction error for \(x\) is \[\| \sum_{j=1}^r \left\langle \phi(x), \vec{v}_j \right\rangle \cdot \vec{v}_j - \phi(x)\|^2.\] In fact, PCA can be viewed as seeking orthonormal \(\vec{v}_1,\dots, \vec{v}_r\) so that \[\sum_{i=1}^n \| \sum_{j=1}^r \left\langle \phi(x_i), \vec{v}_j \right\rangle \cdot \vec{v}_j - \phi(x_i)\|^2\] is minimized.
  • Heuristic derivation of kernel PCA
    • Each loading vector \(\vec{v}\) is a linear combination of the feature vectors, i.e. \[v = \Phi^T \vec{\alpha} = \sum_{i=1}^n \alpha_i \phi (x_i).\] Alternatively, think of PCA as a learning problem; due to the representer theorem, \(v\) lies in the span of the representers, or more precisely, \[v = \sum_{i=1}^n \alpha_i \phi(x_i) = \sum_{i=1}^n \alpha_i k(x_i, \cdot) = \Phi^T \cdot \vec{\alpha},\] which implies \[\Phi \cdot v = \Phi \cdot \Phi^T \cdot \vec{\alpha} = K \cdot \vec{\alpha}.\] Multiplying both sides of \(S_n \cdot v =\lambda \cdot v\) by \(\Phi\) yields \[\begin{alignat*}{2} \frac{1}{n} \Phi \cdot \Phi^T &\Phi \cdot v &&= \lambda \cdot \Phi \cdot v\\ \frac{1}{n} &K^2 \cdot \alpha &&= \lambda \cdot K \cdot \alpha \end{alignat*}\] One solution is \[K \cdot \alpha = (n\lambda) \cdot \alpha,\] which implies \((n\lambda, \alpha)\) is an eigenvalue-eigenvector pair of \(K\). And the PC score is \[\begin{align*} &\left\langle \phi(x), v \right\rangle \\ =& \langle \phi(x), \sum_{i=1}^n \alpha_i \phi(x_i) \rangle\\ =& \sum_{i=1}^n \alpha_i \left\langle \phi(x), \phi(x_i) \right\rangle \\ =& \sum_{i=1}^n \alpha_i k(x, x_i) \end{align*}\]
    • In practice, we first compute the kernel matrix \(K\) and take its eigendecomposition to obtain \(\vec{\alpha}_j, j=1,\dots\), and hence \(v=\sum_{i=1}^n \alpha_i \phi(x_i)\) (symbolically).
    • After that, only kernel evaluations are needed. The derivation for the \(d=\infty\) case is similar and the result is the same (nothing practical changes). Note that the loading vector is now a loading function \[v(x)=\sum_{i=1}^n \alpha_i k(x_i, x) \text{ with } \|v\|_\mathcal{H} = 1\]
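A minimal kernel PCA sketch in code. The notes assume the features are already centered; in practice one typically centers the kernel matrix directly, as done below (the RBF kernel, the helper names, and the centering step are illustrative assumptions):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def kernel_pca(K, r):
    """Return alphas (n x r) and the training PC scores (n x r) from a kernel matrix K."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # Kc = H K H centers the implicit features
    Kc = H @ K @ H
    evals, evecs = np.linalg.eigh(Kc)        # eigenvalues in ascending order
    evals, evecs = evals[::-1][:r], evecs[:, ::-1][:, :r]
    # rescale so that ||v_j||_H^2 = alpha_j^T Kc alpha_j = 1
    alphas = evecs / np.sqrt(np.maximum(evals, 1e-12))
    scores = Kc @ alphas                     # <phi(x_i), v_j> for the training points
    return alphas, scores

# toy usage
rng = np.random.default_rng(2)
X = rng.standard_normal((60, 3))
alphas, scores = kernel_pca(rbf_kernel(X, X), r=2)
```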

Example 15.2 (Support Vector Machine) :

  • Review of linear SVM (derived in Math 535). Here we focus on the abstract formulation with kernels.
    • Primal Problem \[(\star) \ \ \min_{f=\tilde{f} + b:\ \tilde{f}\in\mathcal{H},\ b\in\mathbb{R}} \ \sum_{i=1}^n \varphi(y_i f(x_i))+\frac{\lambda}{2}\|\tilde{f}\|^2_\mathcal{H},\] where \(\varphi\) is the hinge loss, \(\mathcal{H}\) is the RKHS of linear functions, and \(\tilde{f}(x) = \beta^T x\) with inner product \[\left\langle \tilde{f}, \tilde{g} \right\rangle_\mathcal{H} = \beta_{\tilde{f}}^T\cdot \beta_{\tilde{g}}\] and \[\|\tilde{f}\|_{\mathcal{H}}^2 = \|\beta_{\tilde{f}}\|_2^2\] The prediction function is \(f(x) = \tilde{f}(x) + b\), and there is no harm in just thinking of \(\mathcal{H}\) as \(\mathbb{R}^d\) (i.e. the vector space of \(\beta\)).
    • We are going to think of it as an RKHS to reinforce the newly learned concept. It’s clear that for every \(x'\in X,\) its representer in \(\mathcal{H}\) is \[g_{x'}=\left\langle x', \cdot\right\rangle \in \mathcal{H},\] because \[L_{x'}(\tilde{f}) = \underbrace{\tilde{f}(x')}_{\text{$=\left\langle \beta_{\tilde{f}}, x'\right\rangle$}} = \underbrace{\left\langle g_{x'}, \tilde{f}\right\rangle_\mathcal{H}}_{\text{$=\left\langle x', \beta_{\tilde{f}}\right\rangle$}}\] and hence, the feature map is \(x' \mapsto \left\langle x', \cdot \right\rangle \in \mathcal{H}\) and the kernel is \(k(x,x') = \left\langle x, x' \right\rangle\).
    • In Math 535, we showed that using Lagrange duality, the dual problem is \[\begin{alignat*}{2} \max_{\alpha_1,\dots,\alpha_n\in\mathbb{R}} &-\frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j &&x_i^T x_j + \sum_{i=1}^n \alpha_i \\ & \text{s.t.} \sum_{i=1}^n \alpha_i y_i = 0, &&0\leq \alpha_i\leq \frac{1}{\lambda}. \end{alignat*}\] The objective here is quadratic in \(\vec{\alpha}\) and the whole optimization problem can either be solved by existing quadratic programming routines, or by SMO (sequential minimal optimization), which is a variant of coordinate descent. Note that the objective depends on the data through \(x_i^T x_j\) only, and hence for other kernels, we simply replace \(x_i^T x_j\) by \(k(x_i,x_j)\).
    • Now back to the primal problem. Due to the representer theorem, the \(\hat{f}\) that optimizes \((\star)\) has the form \[\hat{f}(x) = \sum_{j=1}^n \alpha_j k(x_j,x) + b\] For linear SVM, i.e. the linear kernel, this is just \[\hat{f}(x) = \underbrace{\sum_{j=1}^n \alpha_j\cdot x_j^T}_\text{$\triangleq \beta^T$} x + b \] But there is nothing that prevents us from using other kernels. In any case, we have \[\begin{align*} \min_{\vec{\alpha}\in\mathbb{R}^n,b\in\mathbb{R}} &\sum_{i=1}^n \max\left(0, 1-y_i\left(\sum_{j=1}^n \alpha_j k(x_j, x_i) + b\right)\right) + \frac{\lambda}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j k(x_i,x_j)\\ =& \min_{\vec{\alpha}\in\mathbb{R}^n,b\in\mathbb{R}} \sum_{i=1}^n \varphi\left(y_i \left(\sum_{j=1}^n \alpha_j k(x_j, x_i) + b \right) \right) + \frac{\lambda}{2} \vec{\alpha}^T K \vec{\alpha} \end{align*}\] which has only \((n+1)\) unknowns.
  • Remarks:
    • For linear SVM, we may either solve the primal problem with \(d+1\) unknowns (\(\beta\) and \(b\)), or the dual problem with \(n\) unknowns.
    • For SVM with other kernels, we may either solve the dual problem with \(n\) unknowns (QP, SMO), or the primal problem after invoking the representer theorem and reducing it to an \((n+1)\)-unknown problem, as in the sketch below.
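A rough code sketch of the \((n+1)\)-unknown primal, minimized here by plain subgradient descent (an illustrative choice of solver, not how SVM packages actually work; the RBF kernel, step size, and all names are assumptions for the example):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def kernel_svm_primal(K, y, lam=0.1, lr=1e-3, n_iter=5000):
    """Subgradient descent on sum_i hinge(y_i((K alpha)_i + b)) + (lam/2) alpha^T K alpha."""
    n = K.shape[0]
    alpha, b = np.zeros(n), 0.0
    for _ in range(n_iter):
        margins = y * (K @ alpha + b)
        viol = margins < 1                                       # points with active hinge loss
        g_alpha = -(K[:, viol] @ y[viol]) + lam * (K @ alpha)    # subgradient w.r.t. alpha
        g_b = -np.sum(y[viol])                                   # subgradient w.r.t. b
        alpha -= lr * g_alpha
        b -= lr * g_b
    return alpha, b

# toy usage: a nonlinearly separable problem
rng = np.random.default_rng(3)
X = rng.standard_normal((80, 2))
y = np.sign(X[:, 0] * X[:, 1] + 0.1 * rng.standard_normal(80))
K = rbf_kernel(X, X)
alpha, b = kernel_svm_primal(K, y)
pred = np.sign(rbf_kernel(X, X) @ alpha + b)   # replace the first X with new points to predict
```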

15.3 Risk Bound of Kernel SVM

Let’s pick up where we left off in Chapter 13, where we analyzed the risk bound for boosting. Having introduced kernel methods, we now analyze SVM in the same way we analyzed boosting.

Theorem 15.1 (Excess Risk for SVM) Let \(\varphi\) be an \(L\)-Lipschitz convex surrogate loss with \(\varphi(0)=1\) (e.g. the hinge loss, with \(L=1\)). Assume that \(\mathcal{H}\) is an RKHS with PSD kernel \(k\) satisfying \[\sup_x |k(x,x)|=k_\max < \infty\] and \[|\eta - \frac{1}{2}| \leq c\cdot \left(1-H_\varphi(\eta)\right)^r, \] \(\forall \eta \in (0,1)\), for some \(c>0\) and \(r\in [0,1]\).

Let \(\hat{h}_\varphi = \text{sign} (\hat{f}_\varphi)\), where \[\hat{f}_\varphi = \underset{f\in\mathcal{F}}{\text{argmin}} \frac{1}{n} \sum_{i=1}^n \varphi(y_i f(x_i)),\] and \(\mathcal{F} \triangleq \left\{f\in\mathcal{H}: \|f\|_\mathcal{H} \leq \lambda \right\}\) (that is, ERM with a constrained kernel norm).

Then, w.p. at least \(1-\delta\), \[L(\hat{h}_\varphi) - L(h^*) \leq 2c \left(4L \lambda \sqrt{\frac{k_\max}{n}} \right)^r + 2c\left(2L\cdot \sqrt{\frac{2}{n}\log\left(\frac{2}{\delta}\right)} \right)^r + 2c\left(\text{a.e.}\right)^r\] where \(\text{a.e.}\) denotes the approximation error \(L_\varphi(\bar{f}) - L^*_\varphi\), with \(\bar{f}\) the best element of \(\mathcal{F}\), as in the boosting analysis.

Proof. The beginning of the proof follows that of boosting: \[\begin{align*} L(\hat{h}_\varphi) - L(h^*) &\leq 2c\left(L_\varphi(\hat{f}_\varphi)-L^*_\varphi\right)^r \text{ (Zhang's Lemma)}\\ &= 2c\left(L_\varphi(\hat{f}_\varphi) - L_\varphi(\bar{f}) + L_\varphi(\bar{f}) - L^*_\varphi\right)^r\\ &\leq 2c\left(4L\cdot\text{Rad}_n(\mathcal{F}) + 2L\cdot\sqrt{\frac{2}{n}\log\left(\frac{2}{\delta}\right)} + \text{a.e.}\right)^r\\ &\leq 2c\left(4L\cdot\text{Rad}_n(\mathcal{F})\right)^r + 2c\left(2L\cdot\sqrt{\frac{2}{n}\log\left(\frac{2}{\delta}\right)}\right)^r + 2c\left(\text{a.e.}\right)^r. \end{align*}\] We only need to compute \(\text{Rad}_n(\mathcal{F})\). \[\begin{align*} \text{Rad}_n(\mathcal{F}) &= E\left[\sup_{f\in\mathcal{F}}|\frac{1}{n} \sum_{i=1}^n \sigma_i f(x_i)| \right] \\ &=\frac{1}{n} E\left[\sup_{f\in\mathcal{F}}|\sum_{i=1}^n \sigma_i \left\langle k(x_i,\cdot),f\right\rangle_\mathcal{H} | \right]\\ &=\frac{1}{n} E\left[\sup_{f\in\mathcal{F}}| \langle \sum_{i=1}^n \sigma_i k(x_i,\cdot),f\rangle_\mathcal{H} | \right]\\ (\text{C-S})&\leq \frac{1}{n} E\left[\sup_{f\in\mathcal{F}}\sqrt{ \|\sum_{i=1}^n \sigma_i k(x_i,\cdot)\|_\mathcal{H}^2 \|f\|_\mathcal{H}^2 } \right]\\ &\leq \frac{\lambda}{n} E\left[\sqrt{ \|\sum_{i=1}^n \sigma_i k(x_i,\cdot)\|_\mathcal{H}^2} \right]\\ (\sqrt{\cdot} \text{ concave, Jensen})&\leq \frac{\lambda}{n} \sqrt{E\left[ \|\sum_{i=1}^n \sigma_i k(x_i,\cdot)\|_\mathcal{H}^2 \right]}\\ &=\frac{\lambda}{n} \sqrt{E\left(\sum_{i=1}^n\sum_{j=1}^n \sigma_i \sigma_j k(x_i,x_j) \right)}\\ &=\frac{\lambda}{n} \sqrt{\sum_{i=1}^n\sum_{j=1}^n E\left(\sigma_i \sigma_j\right) k(x_i,x_j) }\\ &=\frac{\lambda}{n} \sqrt{\sum_{i=1}^n E\left(\sigma_i^2\right) k(x_i,x_i) }\\ &\leq \frac{\lambda}{n} \sqrt{n\cdot k_\max}\\ &=\lambda \sqrt{\frac{k_\max}{n}} \end{align*}\]
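As a sanity check of the last computation, one can estimate the Rademacher complexity (conditionally on a fixed sample) by Monte Carlo and compare it to the bound \(\lambda\sqrt{k_\max/n}\). This is a sketch with an RBF kernel (so \(k_\max=1\)) and arbitrary synthetic data; all names and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
n, lam_ball, gamma = 200, 1.0, 0.5
X = rng.standard_normal((n, 3))
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K = np.exp(-gamma * sq)                     # RBF kernel, so k(x, x) = 1 and k_max = 1

# By Cauchy-Schwarz the supremum over {||f||_H <= lam_ball} is attained, and for each
# sign vector sigma it equals (lam_ball / n) * sqrt(sigma^T K sigma); average over draws.
sigmas = rng.choice([-1.0, 1.0], size=(5000, n))
rad_hat = lam_ball / n * np.mean(np.sqrt(np.einsum('si,ij,sj->s', sigmas, K, sigmas)))

bound = lam_ball * np.sqrt(1.0 / n)         # lambda * sqrt(k_max / n) with k_max = 1
print(rad_hat, bound)                       # the estimate should not exceed the bound
```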