Theorem (Loève):
$k$ corresponds to the covariance of a GP if and only if $k$ is a symmetric positive semi-definite function.
When $k$ is a function of $x - y$, the kernel is called stationary; $\sigma^2$ is called the variance and $\theta$ the lengthscale. It is important to look at the lengthscale after the optimization step: if its value is very small, the model gains almost no information from the surrounding observations, and the predictions at the test points will be poor.
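As an illustration, here is a minimal sketch (in Python with NumPy, our own choice rather than anything used in these notes) of a stationary squared exponential kernel with variance $\sigma^2$ and lengthscale $\theta$; the function name `sq_exp_kernel` is ours.

```python
import numpy as np

def sq_exp_kernel(x, y, sigma2=1.0, theta=0.2):
    """Stationary squared exponential kernel:
    k(x, y) = sigma2 * exp(-(x - y)^2 / (2 * theta^2)).
    It depends on x and y only through x - y."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y, dtype=float).reshape(1, -1)
    return sigma2 * np.exp(-0.5 * (x - y) ** 2 / theta ** 2)

# The Gram matrix of any set of points is symmetric positive semi-definite
# (the condition in the theorem above), which we can check numerically.
X = np.linspace(0, 1, 20)
K = sq_exp_kernel(X, X)
assert np.allclose(K, K.T)
assert np.all(np.linalg.eigvalsh(K) > -1e-10)
```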
In order to choose a kernel, one should gather all the available information about the function to approximate.
Kernels often include rescaling parameters: $\theta$ for the $x$-axis (the lengthscale) and $\sigma$ for the $y$-axis ($\sigma^2$ often corresponds to the GP variance). They can be tuned by maximizing the likelihood of the observations.
It is common to try various kernels and to assess the model accuracy. The idea is to compare some model predictions against actual values:
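One possible workflow, sketched below with scikit-learn (an assumed toolkit, not one prescribed by these notes), is to fit a GP for each candidate kernel on a training subset and compare its predictions on held-out points:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern

# Toy data standing in for the real observations (X, F).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(60, 1))
F = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(60)

# Hold out some points to assess accuracy.
X_train, F_train = X[:40], F[:40]
X_test, F_test = X[40:], F[40:]

for kernel in [RBF(length_scale=1.0), Matern(length_scale=1.0, nu=1.5)]:
    gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-2)
    gp.fit(X_train, F_train)       # hyperparameters tuned by maximum likelihood
    rmse = np.sqrt(np.mean((gp.predict(X_test) - F_test) ** 2))
    print(gp.kernel_, "test RMSE:", rmse)
```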
Furthermore, it is often interesting to try some input remapping, such as $x \rightarrow \log(x)$ or $x \rightarrow \exp(x)$, to make the data set stationary, and then to use a stationary kernel.
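A minimal sketch of such a remapping (our own illustration): transform the inputs and fit the GP with a stationary kernel on the transformed scale.

```python
import numpy as np

# If the function varies much faster for small x, a log remapping can make the
# behaviour roughly stationary in the transformed variable u = log(x).
x = np.linspace(0.1, 10.0, 50)
u = np.log(x)   # fit the GP on u with a stationary kernel
# Apply the same transform to any prediction points before predicting.
```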
A first choice is to focus on the high frequency that we observe in the data, which leads to a very small lengthscale value (left picture). The drawback is that with such a small lengthscale, when predicting far from the observations (say in 2020), the model gets no information from the data set and is not influenced by the past values.
A second choice is to focus on the trend, i.e. the low frequency, which leads to a very large lengthscale value (right picture). However, the resulting confidence intervals are overconfident.
One thing to notice: although the combined choice (summing both kernels) has more parameters, its optimization is actually easier than for the previous choices. For the first two models the likelihood stays very small, since both behaviours are present in the data and a single kernel alone cannot account for them.
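A sketch of this combination (again with scikit-learn kernels as an assumed toolkit): sum a long-lengthscale kernel for the trend and a short-lengthscale kernel for the high-frequency part, then let the optimizer tune all hyperparameters jointly.

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C

# One kernel per effect, then sum them: the prior is a slowly varying trend
# plus a rapidly varying component.
trend = C(1.0) * RBF(length_scale=50.0)      # large lengthscale: the trend
high_freq = C(0.1) * RBF(length_scale=0.5)   # small lengthscale: fast variations
gp = GaussianProcessRegressor(kernel=trend + high_freq)
# gp.fit(X, F) would tune all the hyperparameters jointly by maximum likelihood.
```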
Property:
$$k(\mathbf{x}, \mathbf{y}) = k_1(x_1, y_1) + k_2(x_2, y_2)$$
is a valid covariance structure.
Tensor additive kernels are very useful in practice, as illustrated by the following remark.
Remark:
$$
\begin{aligned}
m(\mathbf{x}) &= \left(k_1(x, X) + k_2(x, X)\right) k(X, X)^{-1} F \\
&= \underbrace{k_1(x_1, X_1)\, k(X, X)^{-1} F}_{m_1(x_1)} + \underbrace{k_2(x_2, X_2)\, k(X, X)^{-1} F}_{m_2(x_2)}
\end{aligned}
$$
The prediction variance has interesting features.
The right plot comes from an additive kernel: even in regions far from the observation points, the variance is not very high. The reason is the additive prior: with three observations forming three corners of a rectangle, the prediction at the fourth corner is essentially determined (for an additive function, $f(b, d) = f(a, d) + f(b, c) - f(a, c)$), so its variance is small. The structure assumed in the prior is retrieved in the posterior.
This property can be used to construct a design of experiments that covers the space, which is especially valuable for high-dimensional input spaces, with only $cst \times d$ points.
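Below is a small NumPy sketch (ours, with made-up toy data) of the additive construction and of the sub-models $m_1$ and $m_2$ from the remark above.

```python
import numpy as np

def k1(a, b, theta=0.3):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / theta ** 2)

def k2(a, b, theta=0.5):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / theta ** 2)

def k_add(Xa, Xb):
    """Additive kernel: k1 acts on the first coordinate, k2 on the second."""
    return k1(Xa[:, 0], Xb[:, 0]) + k2(Xa[:, 1], Xb[:, 1])

# Toy observations (X, F) in 2D.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(15, 2))
F = np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2

x = np.array([[0.4, 0.7]])                   # prediction point
K_inv_F = np.linalg.solve(k_add(X, X) + 1e-8 * np.eye(len(X)), F)
m1 = k1(x[:, 0], X[:, 0]) @ K_inv_F          # sub-model m1(x1)
m2 = k2(x[:, 1], X[:, 1]) @ K_inv_F          # sub-model m2(x2)
m = m1 + m2                                  # full prediction m(x) = m1(x1) + m2(x2)
```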
Property:
$$k(x, y) = k_1(x, y) \times k_2(x, y)$$
is a valid covariance structure. The same holds for the tensor product version:
$$k(\mathbf{x}, \mathbf{y}) = k_1(x_1, y_1) \times k_2(x_2, y_2)$$
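A short NumPy sketch of this tensor product version, with the same conventions as the additive sketch above (the helper names are ours):

```python
import numpy as np

def rbf(a, b, theta):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / theta ** 2)

def k_prod(Xa, Xb, theta1=0.3, theta2=0.5):
    """Tensor product kernel: k(x, y) = k1(x1, y1) * k2(x2, y2)."""
    return rbf(Xa[:, 0], Xb[:, 0], theta1) * rbf(Xa[:, 1], Xb[:, 1], theta2)
```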
Composing a kernel with a function $f$ also gives a valid covariance structure:
$$k(x, y) = k_1(f(x), f(y))$$
Proof:
$$\sum_i \sum_j a_i a_j\, k(x_i, x_j) = \sum_i \sum_j a_i a_j\, k_1\Big(\underbrace{f(x_i)}_{y_i}, \underbrace{f(x_j)}_{y_j}\Big) \geq 0$$
This can be seen as a nonlinear rescaling of the input space.
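A short sketch of this composition, with $f = \log$ as a hypothetical warping (it assumes positive inputs):

```python
import numpy as np

def k1(a, b, theta=1.0):
    # Base stationary kernel, applied to the warped inputs.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / theta ** 2)

def k_warped(x, y, f=np.log):
    """k(x, y) = k1(f(x), f(y)): a nonlinear rescaling of the input space."""
    return k1(f(np.asarray(x, dtype=float)), f(np.asarray(y, dtype=float)))
```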
Given a few observations, can we extract the periodic part of a signal?
As previously, we will build a decomposition of the process into two independent GPs:
$$Z = Z_p + Z_a$$
where $Z_p$ is a GP in the span of the Fourier basis
$$B(t) = (\sin(t), \cos(t), \ldots, \sin(nt), \cos(nt))^t$$
Note that "aperiodic" means that the projection of $Z_a$ onto the $\cos$-$\sin$ space is zero.
It can be proved that
$$
\begin{aligned}
k_p(x, y) &= B(x)^t G^{-1} B(y) \\
k_a(x, y) &= k(x, y) - k_p(x, y)
\end{aligned}
$$
where $G$ is the Gram matrix associated with $B$ in the RKHS.
As previously, a decomposition of the model comes with a decomposition of the kernel:
$$
\begin{aligned}
m(x) &= \left(k_p(x, X) + k_a(x, X)\right) k(X, X)^{-1} F \\
&= \underbrace{k_p(x, X)\, k(X, X)^{-1} F}_{\text{periodic sub-model } m_p} + \underbrace{k_a(x, X)\, k(X, X)^{-1} F}_{\text{aperiodic sub-model } m_a}
\end{aligned}
$$
and we can associate a prediction variance to the sub-models:
$$
\begin{aligned}
v_p(x) &= k_p(x, x) - k_p(x, X)\, k(X, X)^{-1}\, k_p(X, x) \\
v_a(x) &= k_a(x, x) - k_a(x, X)\, k(X, X)^{-1}\, k_a(X, x)
\end{aligned}
$$
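As a numerical illustration, here is a sketch of these sub-model formulas, assuming a decomposition $k = k_p + k_a$ is already available. For simplicity we use a standard periodic kernel as a stand-in for $k_p$ and an RBF for $k_a$; the RKHS construction of $k_p$ via $B$ and $G$ from the notes is not reproduced here.

```python
import numpy as np

def k_p(a, b, period=1.0, theta=0.5):
    # Stand-in periodic kernel (not the Fourier/RKHS projection from the notes).
    d = a[:, None] - b[None, :]
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / theta ** 2)

def k_a(a, b, theta=2.0):
    # Stand-in aperiodic (smooth) kernel.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * d ** 2 / theta ** 2)

def k(a, b):
    return k_p(a, b) + k_a(a, b)

# Toy observations (X, F) and prediction points x.
rng = np.random.default_rng(2)
X = rng.uniform(0.0, 10.0, 30)
F = np.sin(2 * np.pi * X) + 0.1 * X
x = np.linspace(0.0, 10.0, 200)

K_inv = np.linalg.inv(k(X, X) + 1e-8 * np.eye(len(X)))

m_p = k_p(x, X) @ K_inv @ F     # periodic sub-model
m_a = k_a(x, X) @ K_inv @ F     # aperiodic sub-model

# Sub-model prediction variances.
v_p = np.diag(k_p(x, x) - k_p(x, X) @ K_inv @ k_p(X, x))
v_a = np.diag(k_a(x, x) - k_a(x, X) @ K_inv @ k_a(X, x))
```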