
Gaussian Processes

A summary of some information about GPs, for my own research reference.

:todo: No comprehensive summaries or comparisons yet.

Packages

GPy vs. GPflow

What’s the difference between GPy and GPflow?
GPy does not use the GPU.
GPflow “extends” GPy on top of TensorFlow, but with many differences; it focuses on variational inference (VI) and MCMC.

GPflow focusses on variational inference and MCMC – there is no expectation propagation or Laplace approximation.
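
For a concrete feel of the two APIs, here is a minimal sketch (my own toy example, assuming GPy 1.x and GPflow 2.x; call signatures may differ between releases) fitting the same exact GPR model in both libraries:

<code python>
# Minimal sketch: the same exact GPR model in GPy and GPflow.
import numpy as np

X = np.random.rand(50, 1)
Y = np.sin(6 * X) + 0.1 * np.random.randn(50, 1)

# --- GPy: NumPy-based, CPU only ---
import GPy
gpy_model = GPy.models.GPRegression(X, Y, kernel=GPy.kern.RBF(input_dim=1))
gpy_model.optimize()                       # maximise the log marginal likelihood
mu_gpy, var_gpy = gpy_model.predict(X)

# --- GPflow: the same model on top of TensorFlow ---
import gpflow
gpf_model = gpflow.models.GPR((X, Y), kernel=gpflow.kernels.SquaredExponential())
gpflow.optimizers.Scipy().minimize(
    gpf_model.training_loss, gpf_model.trainable_variables
)
mu_gpf, var_gpf = gpf_model.predict_f(X)   # predictive mean and variance
</code>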

GPflux

GPflux: A Library for Deep Gaussian Processes, https://arxiv.org/abs/2104.05674

https://github.com/secondmind-labs/GPflux/

GPflux is a toolbox dedicated to Deep Gaussian processes (DGP), the hierarchical extension of Gaussian processes (GP). GPflux uses the mathematical building blocks from GPflow and marries these with the powerful layered deep learning API provided by Keras. This combination leads to a framework that can be used for:

  1. researching new (deep) Gaussian process models, and
  2. building, training, evaluating and deploying (deep) Gaussian processes in a modern way — making use of the tools developed by the deep learning community.
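
As a rough sketch of how GPflux combines GPflow building blocks with the Keras training loop (based on the library’s getting-started pattern; assuming GPflux 0.x and GPflow 2.x, so constructor arguments may need adjusting per release), a two-layer deep GP can be assembled like this:

<code python>
# Hedged sketch of a two-layer deep GP in GPflux; layer arguments are from
# memory of the GPflux getting-started example and may need adjusting.
import numpy as np
import tensorflow as tf
import gpflow
import gpflux

X = np.random.rand(100, 1)
Y = np.sin(10 * X) + 0.1 * np.random.randn(100, 1)
Z = X[:20].copy()                          # inducing inputs
num_data = X.shape[0]

gp_layer1 = gpflux.layers.GPLayer(
    gpflow.kernels.SquaredExponential(),
    gpflow.inducing_variables.InducingPoints(Z.copy()),
    num_data=num_data, num_latent_gps=1,
)
gp_layer2 = gpflux.layers.GPLayer(
    gpflow.kernels.SquaredExponential(),
    gpflow.inducing_variables.InducingPoints(Z.copy()),
    num_data=num_data, num_latent_gps=1,
    mean_function=gpflow.mean_functions.Zero(),
)
likelihood_layer = gpflux.layers.LikelihoodLayer(gpflow.likelihoods.Gaussian(0.1))

dgp = gpflux.models.DeepGP([gp_layer1, gp_layer2], likelihood_layer)
model = dgp.as_training_model()            # a Keras model, trained the Keras way
model.compile(tf.optimizers.Adam(0.01))
model.fit({"inputs": X, "targets": Y}, epochs=200, verbose=0)
</code>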

Survey Papers

[1] Haitao Liu, Yew-Soon Ong, Xiaobo Shen, and Jianfei Cai, “When Gaussian Process Meets Big Data: A Review of Scalable GPs”, IEEE Transactions on Neural Networks and Learning Systems, vol. 31, Nov. 2020, pp. 4405–4423.

Reading notes of When Gaussian Process Meets Big Data: A Review of Scalable GPs.


Categories: frameworks

Sparse GP

Deep GP

Multi-resolution GP

[2] addresses the problem that multiple observation processes and sensing modalities can be dependent, with different signal-to-noise ratios and varying sampling resolutions across space and time. (This does not seem to be my current focus.)

https://github.com/ohamelijnck/multi_res_gps


Multi-scale modelling

We note that the multiresolution GP work by Fox and Dunson [7] defines a DGP construction for non-stationary models that is more akin to multi-scale modelling [35]. (from [2])


[3] addresses long-range dependencies and potential discontinuities. (Not the problem I currently need to solve; worth rethinking, though — see the comments on fox_multiresolution_2012 in b_capability.)

multi-resolution multi-sensor problem

Various aspects of the mGP have similarities to other models proposed in the literature that primarily fall into two main categories: (i) GPs defined over a partitioned input space, and (ii) collections of GPs defined at tree nodes. The treed GP [8] captures non-stationarities by defining independent GPs at the leaves of a Bayesian CART-partitioned input space. The related approach of [12] assumes a Voronoi tessellation. For time series, [21] examines online inference of changepoints with GPs modeling the data within each segment. These methods capture abrupt changes, but do not allow for long-range dependencies spanning changepoints nor a functional data hierarchical structure, both inherent to our multiresolution perspective. A main motivation of the treed GP is the resulting computational speed-ups of an independently partitioned GP. A two-level hierarchical GP also aimed at computational efficiency is considered by [16], where the top-level GP is defined at a coarser scale and provides a piece-wise constant mean for lower-level GPs on a pre-partitioned input space. (see [3])

[10, 11] consider covariance functions defined on a phylogenetic tree such that the covariance between function-valued traits depends on both their spatial distance and evolutionary time spanned via a common ancestor. Here, the tree defines the strength and structure of sharing between a collection of functions rather than abrupt changes within the function. The Bayesian rose tree of [3] considers a mixture of GP experts, as in [14, 17], but using Bayesian hierarchical clustering with arbitrary branching structure in place of a Dirichlet process mixture. Such an approach is fundamentally different from the mGP: each GP is defined over the entire input space, data result from a GP mixture, and input points are not necessarily spatially clustered. Alternatively, multiscale processes have a long history (cf. [25]): the variables define a Markov process on a typically balanced, binary tree and higher-level nodes capture coarser level information about the process. In contrast, the higher level nodes in the mGP share the same temporal resolution and only vary in smoothness. (see [3])

At a high level, the mGP differs from previous GP-based tree models in that the nodes of our tree represent GPs over a contiguous subset of the input space X constrained in a hierarchical fashion. Thus, the mGP combines ideas of GP-based tree models and GP-based partition models. (see [3])

As presented in Sec. 3, one can formulate an mGP as an additive GP where each GP in the sum decomposes independently over the level-specific partition of the input space X. The additive GPs of [6] instead focus on coping with multivariate inputs, in a similar vein to hierarchical kernel learning [1], thus addressing an inherently different task. (see [3])

Treed GP

Mainly accelerates the prediction step; combining it with conjugate gradients (CG) can also accelerate training, but then the predictive variance can no longer be computed [4].


Multiresolution tree data structures have been used to speed up the computation of a wide variety of machine learning algorithms [9, 5, 7, 14]. (see [5])

Sparse approximations to GP inference provide a different way of overcoming the O(n³) scaling [18, 3, 8], by selecting a representative subset of D of size d ≪ n. Sparse methods can typically be trained in O(nd²) (including the active forward selection of the subset) and require O(d) prediction time only. In contrast, in our work here we make use of all of the data for prediction, achieving better scaling by exploiting cluster structure in the data through a kd-tree representation. (see [5])

More closely related to our work is [20], where the MVM primitive is also approximated using a special data structure for D. Their approach, called the improved fast Gauss transform (IFGT), partitions the space with a k-centers clustering of D and uses a Taylor expansion of the RBF kernel in order to cache repeated computations. The IFGT is limited to the RBF kernel, while our method can be used with all monotonic isotropic kernels. As a topic for future work, we believe it may be possible to apply IFGT's Taylor expansions at each node of the kd-tree's query-dependent multiresolution clustering, to obtain an algorithm that enjoys the best properties of both. (see [5])


A main motivation of the treed GP is the resulting computational speed-ups of an independently partitioned GP. [3]


Shen et al. (2006) used KD-trees to recursively partition the data space into a multi-resolution tree data structure, which scales GPs to O(10⁴) training points. However, no solutions for variance predictions are provided, and the approach is limited to stationary kernels. [6]


This tree recursion can be thought of as an approximate matrix-vector multiplication (MVM) operation; a related method, the Improved Fast Gauss Transform (Morariu et al., 2008), implements fast MVM for the special case of the SE kernel. It is possible to accelerate GP training by combining MVM methods with a conjugate gradient solver, but models thus trained do not allow for the computation of predictive variances. [4]
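
For intuition, the toy sketch below (my own example, not code from [4] or [5]) solves the GP system Kα = y with conjugate gradients; CG only touches K through matrix-vector products, which is exactly the step a kd-tree or product-tree method replaces with a fast approximate MVM:

<code python>
# Toy sketch: GP predictive mean via conjugate gradients (CG).
# CG only needs matrix-vector products with K, so a tree-based approximate
# MVM could be substituted for the exact `K @ v` below.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def rbf_kernel(X1, X2, lengthscale=0.2, variance=1.0):
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 1))
y = np.sin(6 * X[:, 0]) + 0.1 * rng.normal(size=500)

noise_var = 1e-2
K = rbf_kernel(X, X) + noise_var * np.eye(len(X))

# Solve (K + sigma^2 I) alpha = y using only MVMs.
K_op = LinearOperator(K.shape, matvec=lambda v: K @ v, dtype=K.dtype)
alpha, info = cg(K_op, y)

# Predictive mean at test inputs; the predictive variance is what
# CG/MVM-trained models cannot easily provide, per the note above.
X_test = np.linspace(0, 1, 100)[:, None]
mean = rbf_kernel(X_test, X) @ alpha
</code>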

Online Training GP

[7] is cited more than 800 times on Google Scholar.

Distributed GP

GP-MoE (mixture-of-experts)

Along the lines of exploiting locality, mixture-of-experts (MoE) models (Jacobs et al., 1991) have been applied to GP regression (Rasmussen & Ghahramani, 2002; Meeds & Osindero, 2006; Yuan & Neubauer, 2009). However, these models have not primarily been used to speed up GP regression, but rather to increase the expressiveness of the model, i.e., allowing for heteroscedasticity and nonstationarity. [6]

PoE-GP (Product-of-GP-experts)

Product-of-GP-experts models (PoEs) sidestep the weight assignment problem of mixture models: Since PoEs multiply predictions made by independent GP experts, the overall prediction naturally weights the contribution of each expert. However, the model tends to be overconfident (Ng & Deisenroth, 2014). [6]

generalised PoE-GP

Cao and Fleet (Cao & Fleet, 2014) recently proposed a generalised PoE-GP model in which the contribution of an expert in the overall prediction can be weighted individually. This model is often too conservative, i.e., it over-estimates variances. [6]

Bayesian Committee Machine (BCM)

Tresp’s Bayesian Committee Machine (BCM) (Tresp, 2000) can be considered a PoE-GP model, which provides a consistent framework for combining independent estimators within the Bayesian framework, but it suffers from weak experts. [6]

robust BCM (rBCM) or Distributed Product-of-GP-Experts Models

In this paper, we exploit the fact that the computations of PoE models can be distributed amongst individual computing units and propose the robust BCM (rBCM), a new family of hierarchical PoE-GP models that (i) includes the BCM (Tresp, 2000) and to some degree the generalised PoE-GP (Cao & Fleet, 2014) as special cases, (ii) provides consistent approximations of a full GP, (iii) scales to arbitrarily large data sets by parallelisation. [6]

Since we assume that a standard GP is sufficient to model the latent function, all GP experts at the leaves of the tree-structured model are trained jointly and share a single set of hyper-parameters. [6]

In this paper, all experts share the same hyper-parameters, which leads to automatic regularisation: The overall gradient is an average of the experts’ marginal likelihood gradients, i.e., overfitting of individual experts is not favoured. [6]

The rBCM addresses shortcomings of other distributed models by appropriately incorporating the GP prior when combining predictions. [6]
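
To make the combination rule concrete, here is a minimal sketch of how an rBCM fuses per-expert predictive means and variances at a single test input, following the equations in [6] (variable names are mine, not the paper’s code):

<code python>
# Hedged sketch of rBCM prediction fusion from [6]: each expert k supplies a
# predictive mean mu_k and variance s2_k at a test input x*; the prior
# variance of the full GP corrects for the repeated use of the prior.
import numpy as np

def rbcm_combine(mus, s2s, prior_var):
    """Fuse independent GP experts' predictions at one test input."""
    mus, s2s = np.asarray(mus, float), np.asarray(s2s, float)
    # beta_k: difference in differential entropy between prior and expert posterior.
    beta = 0.5 * (np.log(prior_var) - np.log(s2s))
    # Combined precision; the (1 - sum(beta)) prior term is what distinguishes
    # the rBCM from the generalised PoE-GP.
    precision = np.sum(beta / s2s) + (1.0 - np.sum(beta)) / prior_var
    var = 1.0 / precision
    mean = var * np.sum(beta * mus / s2s)
    return mean, var

# Example: three experts, one of them barely informed (variance near the prior),
# which therefore gets almost no weight in the fused prediction.
mean, var = rbcm_combine(mus=[0.9, 1.1, 0.0], s2s=[0.05, 0.08, 0.95], prior_var=1.0)
</code>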

Categories: kernels

Martinez-Cantin [9]

Categories (from [1], liu_when_2020):

  • Manifold GP
  • Deep GP
  • Multitask GP
  • Online GP (*)
  • Recurrent GP
  • GP Classification

Terms and Acronyms

Predictive vs. Predicted

Two terminology questions:

  1. Why do GP papers always say “predictive means and covariances” rather than “predicted means and covariances”?
  2. For example, if my GP models a physical quantity alpha, should I call the GP’s prediction the “predictive alpha” or the “predicted alpha”?
(Answer from Magica-Chen:)
When describing the model you use “predictive mean” and “predictive covariance”; the emphasis is on the model’s ability to predict. In general, “predictive” is followed by “distribution”; the mean and covariance merely stand in for the distribution, since for a GP (or a Gaussian) the mean and covariance are enough to determine it.
When first presenting the model, write “the predictive distribution of alpha is ...”; later, when reporting results, “the predicted distribution of alpha is ...” can be used.
Also, “predictive” is used more often than “predicted” because “predictive” usually precedes “distribution”, whereas “predicted” more often precedes a point estimate.

References


[1] Haitao Liu, Yew-Soon Ong, Xiaobo Shen, and Jianfei Cai, “When Gaussian Process Meets Big Data: A Review of Scalable GPs”, IEEE Transactions on Neural Networks and Learning Systems, vol. 31, Nov. 2020, pp. 4405–4423.
[2] Oliver Hamelijnck, Theodoros Damoulas, Kangrui Wang, and Mark Girolami, “Multi-resolution Multi-task Gaussian Processes”, arXiv:1906.08344 [cs, stat], Nov. 2019. Link.
[3] Emily Fox, and David Dunson, “Multiresolution Gaussian Processes”, Advances in Neural Information Processing Systems, vol. 25, 2012, pp. 737–745.
[4] David A. Moore, and Stuart Russell, “Fast Gaussian process posteriors with product trees”, Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, Arlington, Virginia, USA: AUAI Press, 2014, pp. 613–622.
[5] Yirong Shen, Andrew Y. Ng, and Matthias Seeger, “Fast Gaussian Process Regression using KD-Trees”, Proceedings of the 18th International Conference on Neural Information Processing Systems, Cambridge, MA, USA: MIT Press, 2005, pp. 1225–1232.
[6] Marc Peter Deisenroth, and Jun Wei Ng, “Distributed Gaussian Processes”, arXiv:1502.02843 [stat], May 2015. Link.
[7] Lehel Csató, and Manfred Opper, “Sparse On-Line Gaussian Processes”, Neural Computation, vol. 14, Mar. 2002, pp. 641–668.
[9] Ruben Martinez-Cantin, “Bayesian optimization with adaptive kernels for robot control”, 2017 IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 3350–3356.