How A Gaussian Process Arises as a Limit of Linear Regression

Gaussian process (GP) models are widely useful in machine learning. Not only do they naturally offer a probabilistic interpretation, they also shine in tasks where data is scarce. They are connected to many other machine learning methods, such as support vector machines and neural networks. In this post, we note that GP regression is linear regression with a Gaussian prior on the parameters, and sketch how a GP can arise as the limit of a linear model as the number of features goes to infinity.

A Gaussian process is an infinite collection of random variables, any finite subset of which follows a multivariate normal distribution. It is solely determined by its mean and covariance functions and is often denoted as $f(\cdot)\sim\mathcal{GP}(m(\cdot),k(\cdot,\cdot))$, where $m(x)$ is the mean function and $k(x,x')$ is the covariance function. By definition, for a finite collection $\{x_{1},\cdots,x_{n}\}$, the vector $[f(x_{1}),\cdots,f(x_{n})]^{T}$ follows a multivariate normal distribution with mean $[m(x_{1}),\cdots,m(x_{n})]^{T}$ and covariance $\operatorname{cov}(f(x_{i}),f(x_{j}))=k(x_{i},x_{j})$. This joint multivariate normal makes Gaussian processes desirable for regression, in which inference on new test points can be made simply by conditioning on the training data. Next we show how GP regression is connected to linear regression.
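To make the conditioning step concrete, here is a minimal NumPy sketch (my own illustration, not part of the post's derivation; the function name `gp_posterior` and its arguments are made up for this example) of zero-mean GP prediction using the standard multivariate-normal conditioning formulas. Any valid covariance function `k` can be plugged in.

```python
import numpy as np

def gp_posterior(k, X_train, y_train, X_test, noise_var=1e-8):
    """Condition a zero-mean GP on observed data.

    k(A, B) must return the matrix of covariances k(a_i, b_j)
    between two sets of inputs A and B.
    """
    # Joint covariance blocks of [f(X_train), f(X_test)]
    K = k(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_s = k(X_train, X_test)
    K_ss = k(X_test, X_test)

    # Standard Gaussian conditioning: p(f_test | y_train) is normal with
    # mean  K_s^T K^{-1} y  and covariance  K_ss - K_s^T K^{-1} K_s.
    mean = K_s.T @ np.linalg.solve(K, y_train)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, cov

# Example with the dot product covariance that appears later in the post,
# k(x, x') = x^T x', on a few hypothetical 1-D training points.
dot_kernel = lambda A, B: A @ B.T
X_train = np.array([[1.0], [2.0], [3.0]])
y_train = np.array([1.1, 1.9, 3.2])
X_test = np.array([[1.5], [2.5]])
mean, cov = gp_posterior(dot_kernel, X_train, y_train, X_test, noise_var=0.1)
print(mean, np.diag(cov))
```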

A linear regression model can be written as

\begin{equation}
y=f(\mathbf{x})+\epsilon,\qquad f(\mathbf{x})=\mathbf{x}^{T}\mathbf{w},
\label{eq:f=xw}
\end{equation}

where $\mathbf{x}$ is the input vector, $\mathbf{w}$ is the parameter vector, and $\epsilon\sim\mathcal{N}(0,\sigma_{n}^{2})$ is noise. In a Bayesian approach, we assume a prior $\mathbf{w}\sim\mathcal{N}(0,\sigma^{2}I)$. Thus, for a given $\mathbf{x}$, $f$ is a linear combination of normal random variables and hence is also normal. Moreover, for a finite collection of inputs $\mathbf{x}$, the corresponding values of $f$ are jointly multivariate Gaussian. Therefore, $f$ is a Gaussian process. Its mean function is

\begin{equation}
m(\mathbf{x})=\mathbb{E}[f(\mathbf{x})]=\mathbf{x}^{T}\mathbb{E}[\mathbf{w}]=0,
\end{equation}

and its covariance function is

\begin{equation}
k(\mathbf{x},\mathbf{x}')=\mathbb{E}[f(\mathbf{x})f(\mathbf{x}')]=\mathbf{x}^{T}\mathbb{E}[\mathbf{w}\mathbf{w}^{T}]\mathbf{x}'=\sigma^{2}\mathbf{x}^{T}\mathbf{x}',
\label{eq:xTx}
\end{equation}

which is called the dot product covariance function.
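As a quick sanity check (my own sketch, not from the post; the inputs and prior variance are arbitrary), one can sample many weight vectors from the prior and verify numerically that the induced covariance of $f$ at a handful of inputs matches $\sigma^{2}\mathbf{x}^{T}\mathbf{x}'$:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 2.0                       # assumed prior variance of the weights
X = rng.normal(size=(4, 3))        # 4 arbitrary inputs, each 3-dimensional

# Draw many weight vectors w ~ N(0, sigma2 * I); each column of F is one
# realisation of [f(x_1), ..., f(x_4)] with f(x) = x^T w.
W = rng.normal(scale=np.sqrt(sigma2), size=(3, 200_000))
F = X @ W

print(np.round(np.cov(F), 2))          # Monte Carlo estimate of cov(f(x_i), f(x_j))
print(np.round(sigma2 * X @ X.T, 2))   # dot product covariance sigma^2 * x_i^T x_j
```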

The simple linear regression model is restricted by the fact that it can only represent a linear relationship with $\mathbf{x}$. A common way to address this limitation is to expand the feature space using basis functions $\phi_{i}(x)$, so instead of \eqref{eq:f=xw} we have $f=\boldsymbol{\phi}(\mathbf{x})^{T}\mathbf{w}$. For illustration, we assume the input $\mathbf{x}$ is 1-dimensional. Some examples of basis functions are monomials $\boldsymbol{\phi}(x)=[1,x,x^{2},\cdots]$, Fourier features $\boldsymbol{\phi}(x)=[\sin(x),\cos(x),\sin(2x),\cos(2x),\cdots]$, and radial basis functions $\boldsymbol{\phi}(x)=[e^{-(x-1)^{2}},e^{-(x-2)^{2}},e^{-(x-3)^{2}},\cdots]$. Since the model remains linear in $\mathbf{w}$, the results from simple linear regression carry over directly. In particular, \eqref{eq:xTx} becomes $k(\mathbf{x},\mathbf{x}')=\sigma^{2}\boldsymbol{\phi}(\mathbf{x})^{T}\boldsymbol{\phi}(\mathbf{x}')$. Considering a set of $N$ radial basis functions $\phi_{i}(x)=e^{-(x-c_{i})^{2}}$ with centers $c_{1},\cdots,c_{N}$ placed equidistantly, it follows that

\begin{equation}
k(x,x')=\frac{\sigma^{2}}{N}\sum_{i=1}^{N}e^{-(x-c_{i})^{2}}e^{-(x'-c_{i})^{2}},
\end{equation}

where we have scaled the variance of $\mathbf{w}$ to be $\sigma^{2}/N$. Formally, we let $N\to\infty$ so that the sum becomes an integral over the centers; completing the square, $(x-c)^{2}+(x'-c)^{2}=2\left(c-\tfrac{x+x'}{2}\right)^{2}+\tfrac{(x-x')^{2}}{2}$, and evaluating the Gaussian integral, we obtain

\begin{equation}
k(x,x')\propto\sigma^{2}\int_{-\infty}^{\infty}e^{-(x-c)^{2}}e^{-(x'-c)^{2}}\,dc=\sqrt{\frac{\pi}{2}}\,\sigma^{2}e^{-\frac{(x-x')^{2}}{2}},
\end{equation}

which is the widely used squared exponential covariance function.
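To see the limit numerically (again my own sketch, with arbitrarily chosen inputs and an interval of centers wide enough to cover them), one can compare the finite-$N$ feature covariance, normalised by its value at $x=x'$ to remove the interval-dependent constant, against $e^{-(x-x')^{2}/2}$:

```python
import numpy as np

def rbf_feature_cov(x1, x2, centers, sigma2=1.0):
    """k_N(x1, x2) = sigma2/N * sum_i exp(-(x1 - c_i)^2) * exp(-(x2 - c_i)^2)."""
    phi1 = np.exp(-(x1 - centers) ** 2)
    phi2 = np.exp(-(x2 - centers) ** 2)
    return sigma2 / len(centers) * phi1 @ phi2

x1, x2 = 0.3, 1.1
target = np.exp(-(x1 - x2) ** 2 / 2)    # squared exponential (up to a constant)

for N in (10, 100, 10_000):
    centers = np.linspace(-10.0, 10.0, N)   # equidistant centers around the inputs
    ratio = rbf_feature_cov(x1, x2, centers) / rbf_feature_cov(x1, x1, centers)
    print(N, ratio, target)   # the ratio approaches exp(-(x - x')^2 / 2) as N grows
```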

Building the connection to linear models not only offers another way of viewing GPs, but also opens the door to designing and choosing appropriate covariance functions. A good resource to learn more is Rasmussen and Williams' book. I also found the videos by Hennig enlightening.

Written on August 23, 2019