Chapter 14 Kernel Methods

(This chapter was scribed by Zifan Huang.)

  • So far in this class, we have studied the excess risk (estimation error), \(L(\hat{h})-L(\bar{h})\), which measures how far our estimated predictor \(\hat{h}\) is from the best possible predictor \(\bar{h}\in\mathcal{H}\) in terms of expected risk. But this is only half of the story, since what we really care about is \[L(\hat{h})-L(h^*)=\underbrace{L(\hat{h})-L(\bar{h})}_{\mbox{est. error}}+\underbrace{L(\bar{h})-L(h^*)}_{\mbox{approx. error}}.\]

This approximation error has to do with how rich/expressive \(\mathcal{H}\) is.

Kernel methods allow us to tap into a very rich class of models.

  • We have mainly focused on linear models, where the prediction is the inner product \(\langle\beta,x\rangle\) between the weight/coefficient vector and the input \(x\). However, real data often exhibit highly non-linear relationships. Kernel methods allow us to consider non-linear models easily.

\(*\) Specifically, since all our algorithms so far only require that the loss function is convex in \(\beta\), not in \(x\), we can in fact sneakily replace \(x\) with an arbitrary feature vector \(\phi(x)\in\mathbb{R}^q\).

  • Example. \(\phi(x)=(1,x,x^2)\in\mathbb{R}^3\), for all \(x\in\mathbb{R}\).

  • Example. \(\phi(x)=(\)count of “a” in \(x\), count of “b” in \(x\), ...\()\), where \(x\) is a string.

  • A dilemma: inspired by the above example, we could represent very complicated non-linear functions by simply augmenting \(\phi(x)\). But the problem is that to achieve a desired level of approximation, \(\phi(x)\) would have to be very high-dimensional, resulting in large computational cost. Another challenge is that for some problems, it might be hard to design good features directly.

  • To address the above dilemma and challenge, kernels offer:

  1. a computationally efficient way of working with high- (or even infinite-) dimensional \(\phi(x)\) “implicitly.”

  2. a different perspective on designing features, which can be more natural for certain applications.

  • Before we formally define kernels, let’s provide some intuition via a few examples.

Example: (Recast algorithms)

Consider stochastic gradient descent on squared loss \(l((x,y),\beta)=\frac{1}{2}(y-\beta\cdot\phi(x))^2\).

The gradient descent update is (assuming \(\beta_1=0\)) \[\beta_{t+1}=\beta_t+\eta\cdot(y_t-\beta_t\cdot\phi(x_t))\cdot\phi(x_t),\] where \(\eta\) is a small constant (step size) and \((y_t-\beta_t\cdot\phi(x_t))\cdot\phi(x_t)\) is the negative gradient of the loss function w.r.t. \(\beta\) evaluated at \((x_t,y_t)\).

Let’s define \(\alpha_t\overset{\triangle}{=}\eta\cdot(y_t-\beta_t\cdot\phi(x_t))\).

We can see that after \(t\) steps of updates,

\[\beta_t=\sum_{i=1}^{t-1} \alpha_i\cdot \phi(x_i),\] which is a linear combination of the \(\phi(x_i)\)’s, and the prediction on a new data point \(x\) is \[\beta_t\cdot\phi(x)=\sum_{i=1}^{t-1} \alpha_i\cdot\langle\phi(x_i),\phi(x)\rangle,\]

which is a linear combination of the inner product between feature vectors.

This suggests that we can work with high (or infinite) dimensions as long as we can compute these inner products efficiently. When do we save in space in this example?

Here we store the \(N\) coefficients \(\alpha_i\) instead of the \(q\)-dimensional vector \(\beta\) (recall \(\phi(x)\in\mathbb{R}^q\)). If \(q\gg N\), then we win. Otherwise, when the sample size is large, we may be better off with the explicit feature approach from before.
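To make the recast concrete, here is a minimal sketch (mine, not from the notes) of this dual-form SGD, using the quadratic kernel \(k(x,x')=(x^\top x')^2\) from the next example as a stand-in for \(\langle\phi(x),\phi(x')\rangle\); only the coefficients \(\alpha_i\) and the visited points are stored.

```python
import numpy as np

def kernel_sgd(X, y, kernel, eta=0.1):
    """One pass of SGD on the squared loss, kept in dual form.

    Instead of updating a (possibly huge) weight vector beta, we store one
    scalar alpha_t per visited point, so that beta = sum_i alpha_i * phi(x_i)
    is represented implicitly.  A sketch, not the notes' official code.
    """
    alphas, support = [], []                      # alpha_i's and their points
    for x_t, y_t in zip(X, y):
        # prediction beta_t . phi(x_t) = sum_i alpha_i k(x_i, x_t)
        pred = sum(a * kernel(x_i, x_t) for a, x_i in zip(alphas, support))
        alphas.append(eta * (y_t - pred))         # alpha_t = eta * residual
        support.append(x_t)
    return alphas, support

def predict(x, alphas, support, kernel):
    """beta . phi(x) = sum_i alpha_i k(x_i, x)."""
    return sum(a * kernel(x_i, x) for a, x_i in zip(alphas, support))

# quadratic kernel: equals <phi(x), phi(x')> for the quadratic feature map below
quad_kernel = lambda u, v: float(u @ v) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X[:, 0] * X[:, 1]                             # a target quadratic in x
alphas, support = kernel_sgd(X, y, quad_kernel)
print(predict(X[0], alphas, support, quad_kernel), y[0])
```

Note that the algorithm never materializes \(\phi(x)\); it only ever calls the kernel function.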

Example: (Computing time with quadratic features).

Let the original input \(x\) be in \(\mathbb{R}^p\). Consider the feature map with all quadratic terms:

\[\phi(x)=(x^2_1,x_2^2,...,x^2_p,\sqrt{2}x_1x_2,...,\sqrt{2}x_1x_p,\sqrt{2}x_2x_3,...,\sqrt{2}x_2x_p,\sqrt{2}x_3x_4,...,\sqrt{2}x_{p-1}x_p).\] There are \(O(p^2)\) dimensions.

  • Explicit computation: forming \(\phi(x)\) and predicting with \(\langle \beta,\phi(x)\rangle\) takes \(O(p^2)\) time.

  • Implicit computation: for each \(x'\) in the training data, computing \(\langle\phi(x),\phi(x')\rangle=(x^Tx')^2\) takes \(O(p)\) time (see the check below).

Predicting for a new data point \(x\) requires doing the above for each of the \(N\) training points, which takes \(O(Np)\) time in total.
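As a quick sanity check (a sketch of my own, not part of the notes), we can verify numerically that the explicit \(O(p^2)\) feature map above and the implicit \(O(p)\) formula agree:

```python
import itertools
import numpy as np

def quad_features(x):
    """Explicit quadratic feature map: all x_i^2 plus sqrt(2)*x_i*x_j for i<j."""
    squares = [xi ** 2 for xi in x]
    crosses = [np.sqrt(2) * x[i] * x[j]
               for i, j in itertools.combinations(range(len(x)), 2)]
    return np.array(squares + crosses)           # O(p^2) entries

rng = np.random.default_rng(0)
x, xp = rng.normal(size=5), rng.normal(size=5)

explicit = quad_features(x) @ quad_features(xp)  # O(p^2) time
implicit = float(x @ xp) ** 2                    # O(p) time
print(np.isclose(explicit, implicit))            # True
```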

Note that mathematically speaking, we are still doing SGD. All we have done is a computational shortcut, called the kernel trick.

Setting of the kernel trick:

  • We start with a feature map \(\phi: X\mapsto \mathbb{R}^q\).

  • We can recast our algorithms (e.g. SGD) in a way so that they only depend on the inner products \(\langle\phi(x),\phi(x')\rangle\).

  • We can compute the inner products efficiently.

Main idea: Since these algorithms only depend on the inner products, we can just cut to the chase and directly write down a “kernel function” \(k(\cdot,\cdot)\) that corresponds to the inner product, \(k(x,x')=\langle\phi(x),\phi(x')\rangle\), bypassing the need to explicitly define the feature map \(\phi(x)\).

This is a key conceptual change: we shift our attention from \(\phi(x)\) to \(k(\cdot,\cdot)\), which can be viewed as a similarity measure and can be more natural from a modeling perspective.

14.1 Kernels: definitions and examples

We will first define kernels without reference to feature maps. Later on, we will establish the connections.

Define: (Kernel)

A function \(k: X\times X\mapsto\mathbb{R}\) is a positive semi-definite (PSD) kernel, or simply a kernel, iff for every finite set of points \(x_1,...,x_n\in X\), the kernel matrix \(K\in\mathbb{R}^{n\times n}\), defined by \(K_{ij}\overset{\triangle}{=}k(x_i,x_j)\), is PSD.

Let’s give some examples of kernels, and then verify that they are indeed kernels using the definition. Below we assume \(x\in\mathbb{R}^p\).

  • Linear kernel: \(k(x,x')=\langle x,x'\rangle\)

  • Polynomial kernel: \(k(x,x')=(1+\langle x,x'\rangle)^d\)

Note: we can check that the corresponding feature map has dimension \(O(p^d)\), which is exponential in \(d\).

  • Gaussian kernel: \(k(x,x')=\exp(\frac{-||x-x'||^2_2}{2\sigma^2})\)

\(*\) Fixing \(x'\), \(k(x,x')\) is proportional (as a function of \(x\)) to the \(MVN(\mu=x',\sigma^2\cdot\mathbb{I}_p)\) density.

\(*\) It’s a smooth function w.r.t. both \(x\) and \(x'\). The smoothness is controlled by \(\sigma^2\). \(\sigma^2\uparrow\), smoothness \(\uparrow\).

\(*\) The corresponding feature map \(\phi(x)\) is infinite-dimensional (as we will see later). So computationally, we have no choice but to use kernels.

\(*\) It is the “go-to” kernel in machine learning and defines a very expressive hypothesis class.

  • Mis-match kernel between sequences (applications in NLP and computational biology)

\(*\) We can define kernels over other data types. Suppose \(\Sigma^*\) is the set of all possible sequences (strings) over some alphabet \(\Sigma\). For any two sequences, such as “format” and “fmt,” we can define a kernel to measure their similarity.

\(*\) A string \(u\in\Sigma^*\) is a subsequence of \(v\in\Sigma^*\) if there exists a sequence of indices \(\vec{i}=(i_1,i_2,...,i_{|u|})\) such that \(1\le i_1< i_2<...< i_{|u|}\le |v|\) and \(u_j=v_{i_j}\). In this case, we write \(u=v[\vec{i}]\), e.g. \(v=\)“format,” \(u=\)“fmt,” \(\vec{i}=(1,4,6)\).

\(*\) The kernel between two sequences \(x\) and \(x'\) is defined as a sum over common subsequences, weighted exponentially by their lengths:
\[k(x,x')=\sum_{u\in\Sigma^*}\sum_{(\vec{i},\vec{j}):\,x[\vec{i}]=x'[\vec{j}]=u}\lambda^{|\vec{i}|+|\vec{j}|},~\mbox{for some}~\lambda\in[0,1].\]

  • Non-example: \(k(x,x')={\bf1}[||x-x'||_2\le 1]\)

\(*\) Exercise: show that \(k\) is not a kernel! (A numerical hint is sketched below.)
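The definition suggests a direct numerical check (a sketch of my own, not from the notes): pick some points, form the kernel matrix, and look at its smallest eigenvalue. The Gaussian kernel passes, while the indicator above fails on three collinear points, hinting at the exercise.

```python
import numpy as np

def min_eigenvalue(kernel, X):
    """Smallest eigenvalue of the kernel matrix K_ij = k(x_i, x_j)."""
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    return np.linalg.eigvalsh(K).min()

gaussian = lambda u, v, s=1.0: np.exp(-np.sum((u - v) ** 2) / (2 * s ** 2))
indicator = lambda u, v: float(np.linalg.norm(u - v) <= 1)

# three collinear points: each adjacent pair is within distance 1, the endpoints are not
X = [np.array([0.0]), np.array([0.9]), np.array([1.8])]

print(min_eigenvalue(gaussian, X))   # >= 0 (up to numerical error): PSD
print(min_eigenvalue(indicator, X))  # < 0: the indicator is not a kernel
```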

General principles for checking/building kernels

  • Base case: for any function \(f: X\mapsto \mathbb{R}\),

\(k(x,x')\overset{\triangle}{=} f(x)\cdot f(x')\) is a PSD kernel.

Proof. The kernel matrix can be written as \(K=u\cdot u^T\), where \(u=(f(x_1),f(x_2),...,f(x_n))^T\), which is PSD, denoted \(K\succeq 0\).


  • Strategy: given two kernels \(k_1\) and \(k_2\), we can create a new kernel \(k\) using basic operations. To show that \(k\) is a kernel, we only need to show that “\(K_1\succeq 0\), \(K_2\succeq 0\)\(\Longrightarrow\)\(K\succeq 0\),” where \(K_1\), \(K_2\) and \(K\) are the corresponding kernel matrices of \(k_1\), \(k_2\) and \(k\).

  • “Sum”: \(k(x,x')=k_1(x,x')+k_2(x,x')\).

Proof. Clear that \(K=K_1+K_2\succeq 0\) if \(K_1,K_2\succeq 0\).


  • “Product”: \(k(x,x')=k_1(x,x')\cdot k_2(x,x')\).

Then the kernel matrix \(K\) under \(k\) is \[K=K_1\circ K_2,\] where \(\circ\) is the Hadamard product (element-wise product).

Question: If \(K_1,K_2\succeq 0\), then is \(K_1\circ K_2\succeq 0\)?

Answer: Yes!

Let \(K_1=\sum_{i=1}^n \lambda_i u_iu_i^T\) and \(K_2=\sum_{j=1}^n \tau_j v_jv_j^T\) be eigendecompositions.

\[\begin{aligned} \Rightarrow K_1\circ K_2&=(\sum_{i=1}^n \lambda_i u_iu_i^T)\circ(\sum_{j=1}^n \tau_j v_jv_j^T)\\ &=\sum_{i=1}^n\sum_{j=1}^n(\lambda_i u_iu_i^T)\circ(\tau_j v_jv_j^T)\\ &=\sum_{i=1}^n\sum_{j=1}^n(\lambda_i \tau_j)(u_i\circ v_j)(u_i\circ v_j)^T \end{aligned}\]

Note that since \(K_1,K_2\succeq 0\), we have \(\lambda_i\ge 0\) and \(\tau_j\ge 0\), so \(\lambda_i\cdot\tau_j\ge 0\).

\(\Rightarrow\) each term is PSD.

\(\Rightarrow\) sum of terms is PSD.

(The notes from this point onward in this chapter was scribed by Jingze Liu.)

  • Linear kernel: let \(f(x)=x_i\); then \(f(x)\cdot f(x')=x_i \cdot x_i'\) is a kernel by the base case. By the sum rule, so is \(\sum_{i=1}^{p}x_i \cdot x_i'=\langle x,x'\rangle\).
  • Polynomial kernel: \(1+\langle x,x'\rangle\) (a constant kernel plus the linear kernel) is a valid kernel. By the product rule, \((1+\langle x,x'\rangle)^2\) is also a kernel. Repeating the product \((d-1)\) times shows that \((1+\langle x,x'\rangle)^d\) is also a kernel.
  • Gaussian kernel:

\[\begin{align*} k(x,x') &= \exp(-\frac{||x-x'||^2}{2\sigma^2})\\ &=\exp(-\frac{||x||^2+||x'||^2-2\langle x,x' \rangle}{2\sigma^2})\\ &=\exp(-\frac{||x||^2}{2\sigma^2})\cdot\exp(-\frac{||x'||^2}{2\sigma^2})\cdot\exp(\frac{\langle x,x' \rangle}{\sigma^2}) .\end{align*}\]

Each factor is a kernel: the first two are of the form \(f(x)\cdot f(x')\) (base case), and the third follows from the two facts below (see the expansion after this list).

  1. Sum of a finite number of kernels is a kernel.
  2. Kernels are closed under taking (pointwise) limits, since the set of PSD matrices is closed.
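Spelling out why the third factor is a kernel: expanding the exponential as a Taylor series,

\[\exp\Big(\frac{\langle x,x'\rangle}{\sigma^2}\Big)=\lim_{M\to\infty}\sum_{m=0}^{M}\frac{1}{m!\,\sigma^{2m}}\langle x,x'\rangle^m,\]

each partial sum is a finite sum of products of linear kernels with nonnegative coefficients, hence a kernel by the product rule and fact 1; the limit is then a kernel by fact 2. The Gaussian kernel is the product of this factor with the two base-case factors, hence a kernel by the product rule.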

14.2 Three views of kernel methods.

To lay down the mathematical foundation for kernel methods, we will develop three views of kernel methods.

  • Feature map \(\phi\): maps a data point \(x \in \mathscr{X}\) to an element of an inner-product space. This lets us consider properties of a single data point.

  • Kernel \(k(\cdot,\cdot)\): takes two data points \(x,x' \in \mathscr{X}\) and returns a real number. This lets us consider the similarity between two data points.

  • RKHS \(\mathscr{H}\): a class of functions \(f:\mathscr{X}\rightarrow \mathbb{R}\) equipped with a norm \(||\cdot||_\mathscr{H}\) for measuring the complexity of a function. This lets us consider the prediction function itself.

We will define the three views separately, and eventually show that they are equivalent.

Define: (Hilbert space) A Hilbert space \(\mathscr{H}\) is a complete vector space with an inner product \(\langle\cdot,\cdot\rangle: \mathscr{H}\times \mathscr{H} \mapsto \mathbb{R}\) that satisfies the following three properties:

  1. symmetry: \(\langle f,g\rangle=\langle g,f\rangle\)

  2. linearity: \(\langle \alpha_1f_1+\alpha_2f_2,g\rangle=\alpha_1\langle f_1,g\rangle+\alpha_2\langle f_2,g\rangle\)

  3. positive definiteness: \(\langle f,f\rangle \geq0\) with equality only if \(f=0\).

The inner product induces a norm: \(||f||_\mathscr{H}\triangleq \sqrt{\langle f,f\rangle}\).

Example:

  • Euclidean space: \(\langle u,v \rangle=\sum_{i=1}^{d}u_i v_i\)

  • square-summable sequences

\[ \ell^2 \triangleq \Big\{(u_i)_{i \geq 1}: \sum_{i=1}^{\infty} u_i^2 < \infty \Big\},\qquad \langle u,v\rangle = \sum_{i=1}^{\infty} u_i v_i \]

  • square-integrable functions on \([0,1]\)

\[ L^2([0,1])\triangleq\Big\{f: \int_{0}^{1} f^2(x)\,dx < \infty\Big\},\qquad \langle f,g\rangle=\int_{0}^{1} f(x)g(x)\,dx \] A Hilbert space generalizes \(\mathbb{R}^p\) and can be viewed as a space of (possibly infinite-dimensional) feature vectors.

Define: (Feature Map)

Given a Hilbert space \(\mathscr{H}\), a feature map \(\phi: \mathscr{X}\mapsto \mathscr{H}\) maps a data point \(x\in \mathscr{X}\) to a (possibly infinite-dimensional) feature vector \(\phi(x)\in \mathscr{H}\).

Theorem: (Feature map defines a kernel)

Let \(\phi:\mathscr{X}\mapsto \mathscr{H}\) be a feature map. Then \(k(x,x')\triangleq \langle \phi(x),\phi(x')\rangle\) is a valid kernel.

Proof. The key idea is that we only need to check the validity on any \(n\) data points, which reduces to a finite problem. Let \(x_1,x_2,...,x_n\) be a set of \(n\) points and let \(K\) be the kernel matrix with \(K_{ij}=\langle \phi(x_i),\phi(x_j)\rangle\). To show that \(K\) is positive semi-definite, consider any \(\vec{\alpha} \in \mathbb{R}^n\); we have \[\begin{align*} \vec{\alpha}^{\top}K\vec{\alpha}&= \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j \langle \phi(x_i),\phi(x_j) \rangle\\ &=\Big\langle \sum_{i=1}^{n} \alpha_{i}\phi(x_i), \sum_{j=1}^{n} \alpha_{j}\phi(x_j)\Big\rangle\\ &=\Big\|\sum_{i=1}^{n} \alpha_{i}\phi(x_i)\Big\|_{\mathscr{H}}^2\\ & \geq 0 .\end{align*}\]

Theorem: (Kernel defines a feature map) For every kernel \(k\), there exists a Hilbert space \(\mathscr{H}\) and a feature map \(\phi: \mathscr{X} \mapsto \mathscr{H}\) such that \(k(x,x')=\langle \phi(x),\phi(x') \rangle\).

We will prove this result fully later, since it requires a more sophisticated setup. But to get some intuition, we prove it for the special case where the input space is finite.

Proof. For the special case where the input space is finite, let \(\mathscr{X}=\{x_1,...,x_n\}\) and define \(K \in \mathbb{R}^{n \times n}\) accordingly by \(K_{ij}=k(x_i,x_j)\).

Since \(k\) is a kernel, \(K\) is positive semi-definite, so we can write \(K=\Phi \cdot \Phi^\top\) (e.g., take the eigendecomposition \(K=U\Lambda U^\top\) with \(\Lambda\succeq 0\) and set \(\Phi=U\Lambda^{1/2}\)).

Let the \(i\)-th row of \(\Phi\) be the feature vector, denoted \(\phi(x_i) \in \mathbb{R}^n\); then we can verify that \(k(x_i,x_j)=\langle \phi(x_i),\phi(x_j)\rangle\).

Note: the feature map is not unique, since \(\widetilde{\Phi}=\Phi \cdot R\) is also a valid matrix of feature vectors for any orthogonal matrix \(R\).

Remark: if \(|\mathscr{X}|=\infty\), then we need to generalize our notion of features from \(\mathbb{R}^n\) to an infinite-dimensional space.
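As a concrete illustration of the finite-case construction (a sketch of mine, with an arbitrary Gaussian kernel and points, not taken from the notes), we can factor the kernel matrix via its eigendecomposition \(K=U\Lambda U^\top\) and read off feature vectors as the rows of \(\Phi=U\Lambda^{1/2}\):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                      # a finite "input space" of 6 points

# Gaussian kernel matrix K_ij = k(x_i, x_j)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2.0)

# Factor K = Phi Phi^T via the eigendecomposition (eigenvalues are >= 0 since K is PSD)
eigvals, U = np.linalg.eigh(K)
Phi = U @ np.diag(np.sqrt(np.clip(eigvals, 0.0, None)))  # rows are feature vectors phi(x_i)

print(np.allclose(Phi @ Phi.T, K))               # True: k(x_i, x_j) = <phi(x_i), phi(x_j)>

# Non-uniqueness: rotating the features by an orthogonal matrix R gives another valid Phi
R, _ = np.linalg.qr(rng.normal(size=(6, 6)))
print(np.allclose((Phi @ R) @ (Phi @ R).T, K))   # also True
```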

14.3 Reproducing Kernel Hilbert Space (RKHS)

We now introduce the RKHS, a special type of Hilbert space, which is the third view on kernel methods. Initially, we will use it to show that “every kernel corresponds to a feature map.”

Note that the RKHS stands in its own right as an object worth studying, since it allows us to work with the prediction functions directly. For example, in linear ridge regression, we typically fit a coefficient vector \(\beta\), constraining or regularizing the norm \(||\beta||_2\), and use \(f(x)=\langle \beta,\phi(x)\rangle\) to make predictions. Can we get a handle on this prediction function directly, along with a norm \(||f||_\mathscr{H}\) measuring its complexity? What is the space of all prediction functions that we “implicitly” consider when using a kernel? How is it related to the kernel?

We already introduced Hilbert spaces of functions, but not all Hilbert spaces are suitable for machine learning; the RKHS specifies what we really need.

Define: (Bounded functional)

Given a Hilbert space \(\mathscr{H}\), a functional \(L: \mathscr{H} \mapsto \mathbb{R}\) is bounded iff \(\exists M>0\) s.t. \(|L(f)|\leq M\cdot ||f||_\mathscr{H}\) for all \(f\in\mathscr{H}\).

Example: let \(\mathscr{X}=\mathbb{R}^d\) and let \(\mathscr{H}=\{f_c(x)=c^\top x: c \in \mathbb{R}^d\}\) be the space of linear functions. Then the evaluation functional at \(x\) is

\[ L_x(f_c)=f_c(x)=c^{\top} x. \]

  • Intuitively, \(L_x\) maps \(f \in \mathscr{H}\) to a number in \(\mathbb{R}\), and it is linear in \(f\): for example, \(L_x(cf)=(cf)(x)= c \cdot f(x)= c \cdot L_x(f)\).

Define: (RKHS) A Reproducing Kernel Hilbert Space \(\mathscr{H}\) is a Hilbert space of functions \(f: \mathscr{X}\mapsto \mathbb{R}\) such that for each \(x \in \mathscr{X}\), the evaluation functional \(L_x\) is bounded.

Non-Example:

  • Let \(\mathscr{H}\) be the set of all square-integrable continuous functions from \([0,1]\) to \(\mathbb{R}\), with \(\langle f,g\rangle= \int_{0}^{1} f(x) g(x)\,dx\).

  • Consider \(f_\epsilon (x)=\max(0, 1- \frac{|x-1/2|}{\epsilon})\), which is 0 everywhere except for a small spike of height 1 at \(x=1/2\).

Consider the evaluation functional \(L_x\) with \(x=\frac{1}{2}\): \(L_{\frac{1}{2}}(f_\epsilon)=f_\epsilon (\frac{1}{2})=1\) for every \(\epsilon\). So there cannot exist an \(M\) s.t.

\[L_{\frac{1}{2}}(f_\epsilon)\leq M \cdot ||f_\epsilon||_{\mathscr{H}},\]

since \(||f_\epsilon||_{\mathscr{H}}\) converges to 0 as \(\epsilon \rightarrow 0\) while the left-hand side stays at 1. So this \(\mathscr{H}\) is not an RKHS.

Theorem: (RKHS defines a kernel)

Every RKHS \(\mathscr{H}\) of functions \(f:\mathscr{X} \mapsto \mathbb{R}\) defines a unique kernel \(k: \mathscr{X} \times \mathscr{X} \mapsto \mathbb{R}\).

To prove this, we need the Riesz representation theorem.

Theorem: (Riesz Representation theorem)

In a Hilbert space \(\mathscr{H}\), for every bounded linear functional \(L: \mathscr{H} \mapsto \mathbb{R}\), there exists a unique representer \(g \in \mathscr{H}\) such that \(L(f) = \langle f,g \rangle_\mathscr{H}\) for all \(f\in\mathscr{H}\).

Returning to the RKHS: each evaluation functional \(L_x\) is linear and bounded, so by the Riesz representation theorem, for each \(x \in \mathscr{X}\) there exists a unique \(g_x \in \mathscr{H}\) such that \(L_x(f) = \langle f,g_x \rangle\).

This leads to the reproducing property, namely

\[f(x)=\langle g_x,f \rangle ~~~~\forall f \in \mathscr{H}\]

That is, “functional evaluation” = “inner product.” Now, let’s define \(k(x,x') \triangleq g_x(x')\). Applying the reproducing property one more time, with \(f=g_x\), yields

\[ g_x(x')= \langle g_{x'},g_x\rangle. \]

If we define a feature map \(\phi(x) \triangleq g_x \in \mathscr{H}\), then we can use the previous theorem (“feature map defines a kernel”) to conclude that \(k(x,x')\) is a valid kernel. In summary, any RKHS \(\mathscr{H}\) gives rise to a kernel \(k\) with the reproducing property, which is hence called the reproducing kernel of \(\mathscr{H}\).

The key of the proof is the Riesz representation theorem, which turns function evaluation into an inner product.

To complete the connection between RKHSs and kernels, we need to show the converse, namely that a kernel defines a unique RKHS:

Theorem: (Moore-Aronszajn theorem)

For every kernel \(k\), there exists a unique RKHS \(\mathscr{H}\) with reproducing kernel \(k\).

Proof. Let \(k\) be a kernel. We will construct an RKHS \(\mathscr{H}\) from the functions \(\{k(x,\cdot): x\in \mathscr{X}\}\).

  • First define \(\mathscr{H}_0\) to contain all finite linear combinations of the form:

\[ f(x) = \sum_{i=1}^n \alpha_ik(x_i,x),\] for all \(n\), \(\alpha_{1:n}\), \(x_{1:n}\). By construction, \(\mathscr{H}_0\) is a vector space (not necessarily complete, though). This is a very natural class of functions: just linear combinations of basis functions centered around the points \(x_1,\dots,x_n\).

  • Second, define the inner product between \(f(x)= \sum_{i=1}^{n} \alpha_i k(x_i,x)\) and \(g(x)=\sum_{j=1}^{n'} \beta_j k(x_j',x)\) as follows (a numerical sketch of this construction appears after the proof):

\[\langle f,g\rangle = \sum_{i=1}^{n}\sum_{j=1}^{n'} \alpha_i \beta_j k(x_i,x_j').\]

Let’s check that our definition of \(\langle \cdot,\cdot\rangle\) is an actual inner product:

  • Symmetry \((\langle f,g \rangle =\langle g,f \rangle)\) : by symmetry of k

  • Linearity \((\langle \alpha_1 f_1 +\alpha_2 f_2,g\rangle= \alpha_1 \langle f_1,g \rangle +\alpha_2 \langle f_2,g \rangle)\): by the definition of \(f\) and \(g\) (they are just linear combinations of terms).

  • Positive definiteness (\(\langle f,f\rangle \ge 0 \), with equality only if \(f=0\)):

  • For any \(f \in \mathscr{H_0}\), we have \(\langle f,f\rangle = \alpha^{\top}K \alpha \ge 0\), where \(K_{ij}=k(x_i,x_j)\), by the positive semidefinite property of kernels.

  • Now we will show that \(\langle f,f\rangle = 0\) implies \(f=0\). Let \(f(x)= \sum_{i=1}^{n} \alpha_i k(x_i,x)\). Take any \(x \in \mathscr{X}\) and define \(c=[k(x_1,x), ...,k(x_n,x)]^{\top}\). Since

\[ \begin{pmatrix} K & c \\ c^{\top} & k(x,x) \end{pmatrix} \succeq 0 \quad \] We must have that

\[ \alpha^{\top} K \alpha +2bc^{\top}\alpha +b^2 k(x,x) \ge 0 \] for all b. Note that \(\langle f,f \rangle = \alpha^{\top} K \alpha =0\). We now argue that \(c^{\top}\alpha=0\).

If \(c^{\top}\alpha >0\), then as \(b\) approaches 0 from the negative side, the LHS is strictly negative (since the \(b\) term dominates the \(b^2\) term), which is a contradiction. If \(c^{\top}\alpha<0\), then as \(b\) approaches 0 from the positive side, we get a contradiction as well. Therefore, \(f(x) = c^{\top} \alpha=0\) for every \(x\), i.e., \(f=0\).

  • So far we have a valid Hilbert space, but we still need to check that all evaluation functionals \(L_x\) are bounded to get an RKHS. Also, we should check that \(k(x,\cdot)\) is indeed a representer of function evaluation. Take any \(f \in \mathscr{H_0}\). Then:

\[\begin{align*} f(x)&= \sum_{i=1}^{n} \alpha_i k(x_i,x)\\ &= \sum_{i=1}^{n} \alpha_i \langle k(x_i,\cdot),k(x,\cdot)\rangle\\ &= \langle f,k(x,\cdot)\rangle, \end{align*}\] so \(g_x=k(x,\cdot)\) is the representer of \(L_x\).

Boundedness of \(L_x\) follows from \(|L_x(f)|=|\langle f,k(x,\cdot)\rangle|\le ||f||_{\mathscr{H}}\,||k(x,\cdot)||_{\mathscr{H}}= ||f||_{\mathscr{H}} \sqrt{k(x,x)}\) by Cauchy-Schwarz.

  • Finally, let \(\mathscr{H}\) be the completion of \(\mathscr{H}_0\) (by including all limit points of sequences in \(\mathscr{H}_0\)). This is the only part of the proof that we’re punting on. For details, see the references at the end of this section.

  • This also proves “kernel defines a feature map,” because the RKHS \(\mathscr{H}\) is an inner product space by construction and we can take \(\phi(x)=k(x,\cdot)\).
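To make the construction of \(\mathscr{H}_0\) concrete, here is a small numerical sketch (mine, using an arbitrary Gaussian kernel and random points, not from the notes): a function \(f=\sum_i\alpha_i k(x_i,\cdot)\) is represented by its points and coefficients, the inner product is computed by the formula above, and the reproducing property \(f(x)=\langle f,k(x,\cdot)\rangle\) is checked.

```python
import numpy as np

gaussian = lambda u, v: np.exp(-np.sum((u - v) ** 2) / 2.0)

rng = np.random.default_rng(0)
Xf, alpha = rng.normal(size=(4, 2)), rng.normal(size=4)   # f = sum_i alpha_i k(x_i, .)
Xg, beta = rng.normal(size=(3, 2)), rng.normal(size=3)    # g = sum_j beta_j k(x_j', .)

def evaluate(points, coefs, x):
    """Evaluate a member of H_0 at x."""
    return sum(c * gaussian(p, x) for c, p in zip(coefs, points))

def inner(points_f, coefs_f, points_g, coefs_g):
    """<f, g> = sum_i sum_j alpha_i beta_j k(x_i, x_j')."""
    return sum(a * b * gaussian(p, q)
               for a, p in zip(coefs_f, points_f)
               for b, q in zip(coefs_g, points_g))

print(inner(Xf, alpha, Xg, beta))                          # a generic <f, g>

x = rng.normal(size=2)
# reproducing property: f(x) = <f, k(x, .)>, where k(x, .) has a single coefficient 1 at x
lhs = evaluate(Xf, alpha, x)
rhs = inner(Xf, alpha, [x], [1.0])
print(np.isclose(lhs, rhs))                                # True
print(inner(Xf, alpha, Xf, alpha) >= 0)                    # <f, f> >= 0
```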

Summary:

  • A feature map \(\phi : \mathscr{X} → \mathscr{H}\): maps points in \(\mathscr{X}\) to some inner product space \(\mathscr{H}\).

  • A (positive semidefinite) kernel \(k : \mathscr{X} \times \mathscr{X} → \mathbb{R}\) has the property that every derived kernel matrix \(K\) is positive semidefinite.

  • A reproducing kernel Hilbert space (RKHS) \(\mathscr{H}\) contains functions \(f : \mathscr{X} → \mathbb{R}\) such that function evaluations are bounded linear functionals.

  • The Moore-Aronszajn theorem establishes the connection between kernels and RKHSs.

  • Given an RKHS, we can set \(\phi(x)=g_x=k(x,\cdot)\) to map \(x\) to its representer in the RKHS, which exists due to the Riesz representation theorem.

  • \(k(x,x')=\langle \phi(x),\phi(x')\rangle_\mathscr{H}\): every kernel corresponds to an inner product of features, and vice versa.