Learn GPflow

(from here:)
GPflow only supports eager execution, which is the default in TensorFlow 2. It does not support graph mode, which was the default execution mode in TensorFlow 1.

Deterministic Results

See this GitHub repository, which covers almost every topic related to determinism when using TensorFlow.

The above link eventually leads to this design proposal: Enabling Determinism in TensorFlow. In short: to get reproducible results you need to seed every source of randomness and make the ops use deterministic implementations, at some cost in performance.
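A minimal sketch of what this boils down to in practice, assuming a recent TensorFlow version (2.9 or later) where these APIs are available:

import tensorflow as tf

# Seed the Python, NumPy and TensorFlow RNGs in one call.
tf.keras.utils.set_random_seed(42)
# Force ops to pick deterministic (possibly slower) implementations.
tf.config.experimental.enable_op_determinism()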

Save and Load

https://gpflow.readthedocs.io/en/master/notebooks/intro_to_gpflow2.html#Saving-and-loading-models

Checkpointing: using tf.train.CheckpointManager and tf.train.Checkpoint

Copying (hyper)parameter values between models: using params = gpflow.utilities.parameter_dict(model1) to get the parameters and gpflow.utilities.multiple_assign(model2, params) to assign them to another model. (Will use this one for now as it looks the simplest; see the sketch after this list.)
See this official demo.

TensorFlow saved_model: In order to save the model we need to explicitly store the tf.function-compiled functions that we wish to export. (I think this means we can only save those specific compiled functions but not the whole model?) (tf.saved_model seems to be more powerful… See https://www.tensorflow.org/guide/saved_model )
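A minimal sketch of the parameter-copying approach; the toy data and model choices here are mine, not from the demo:

import numpy as np
import gpflow

# Toy data; only the model structure has to match for the copy to work.
X = np.random.rand(20, 1)
Y = np.sin(6 * X) + 0.1 * np.random.randn(20, 1)

model1 = gpflow.models.GPR((X, Y), kernel=gpflow.kernels.SquaredExponential())
gpflow.optimizers.Scipy().minimize(model1.training_loss, model1.trainable_variables)

# A second model with the same structure; copy the trained values into it.
model2 = gpflow.models.GPR((X, Y), kernel=gpflow.kernels.SquaredExponential())
params = gpflow.utilities.parameter_dict(model1)
gpflow.utilities.multiple_assign(model2, params)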

Learn TensorFlow

The phrase “Saving a TensorFlow model” typically means one of two things: checkpoints, which capture the exact values of all parameters (tf.Variable objects) used by a model, or SavedModel, which additionally includes a serialized description of the computation, so the model can be reloaded and run without the original Python code.
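A minimal illustration of the two mechanisms with a toy tf.Module (the Scaler class is made up just for this sketch):

import tensorflow as tf

class Scaler(tf.Module):
    def __init__(self):
        super().__init__()
        self.scale = tf.Variable(2.0)

    @tf.function(input_signature=[tf.TensorSpec([None], tf.float32)])
    def __call__(self, x):
        return self.scale * x

module = Scaler()

# 1. Checkpoint: stores only the variable values; restoring needs the Python code.
tf.train.Checkpoint(module=module).save("/tmp/scaler_ckpt")

# 2. SavedModel: also stores the traced computation, so it can be loaded and
#    called without the original class definition.
tf.saved_model.save(module, "/tmp/scaler_savedmodel")
restored = tf.saved_model.load("/tmp/scaler_savedmodel")
print(restored(tf.constant([1.0, 2.0])))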

Using pickle? Maybe not.

It seems I did save and load a trained GP model with pickle, and it trained and predicted just fine.

However, this Stack Overflow post points out a critical issue tracing back to TensorFlow Probability.

To avoid unnecessary trouble, just use the simplest method: the second one discussed at the beginning (copying parameter values with parameter_dict / multiple_assign).

Optimizers

Natural Gradients

My thoughts after reading these learning materials:

GPflow's tutorial: Natural gradients

In this tutorial, natgrad_opt is always used to update only the variational parameters [(vgp.q_mu, vgp.q_sqrt)].

The GPflow tutorial repeatedly mentions that a single update step can turn the VGP into the GPR. This should mean that, only in the case of a Gaussian likelihood and after a deterministic initialization, updating the VGP's variational parameters for just one step makes $q(x)$ and $f(x)$ exactly the same. Is that because it essentially just amounts to changing the initial values? I don't quite understand why this property is emphasized so repeatedly; does it have any special significance?
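A minimal sketch of that claim, following the tutorial (the toy data is mine):

import numpy as np
import gpflow

# Toy data shared by both models.
X = np.random.rand(20, 1)
Y = np.sin(10 * X) + 0.1 * np.random.randn(20, 1)

vgp = gpflow.models.VGP((X, Y), kernel=gpflow.kernels.Matern52(),
                        likelihood=gpflow.likelihoods.Gaussian())
gpr = gpflow.models.GPR((X, Y), kernel=gpflow.kernels.Matern52())

# With a Gaussian likelihood, a single NaturalGradient step with gamma=1.0
# moves (q_mu, q_sqrt) to the exact posterior, so the VGP ELBO should match
# the GPR log marginal likelihood (for the same hyperparameters).
natgrad_opt = gpflow.optimizers.NaturalGradient(gamma=1.0)
natgrad_opt.minimize(vgp.training_loss, var_list=[(vgp.q_mu, vgp.q_sqrt)])

print(gpr.log_marginal_likelihood().numpy(), vgp.elbo().numpy())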


Excerpts from the post: It’s Only Natural: An Excessively Deep Dive Into Natural Gradient Optimization

(from here:)
In the context of Natural Gradient, KL divergence is deployed as a way of measuring the change in the output distribution our model is predicting.

The short answer is: practically speaking, it doesn’t provide compelling enough value to be in common use for most deep learning applications. There is evidence of natural gradient leading to convergence happening in fewer steps, but, as I’ll discuss later, that’s a bit of a complicated comparison. The idea of natural gradient is elegant and satisfying to people frustrated by the arbitrariness of scaling update steps in parameter space. But, other than being elegant, it’s not clear to me that it’s providing value that couldn’t be provided via more heuristic means.

Notice that I said that Natural Gradient is shown to speed up convergence in terms of gradient steps. That precision comes from the fact that each individual step of Natural Gradient takes longer, because it requires calculating a Fisher Information Matrix, which, remember, is a quantity that exists in n_parameters^2 space.

A lot of the reason that modern neural networks have been able to succeed where theory would predict that a first-order-only method would fail is that Deep Learning practitioners have found a bunch of clever tricks to essentially empirically approximate the information that would be contained in an analytic second-derivative matrix.

Excerpts from another good post: Natural Gradient Descent

(from here:)
As we know, the number of parameters in deep learning models is very large, often in the millions. The Fisher Information Matrix for these kinds of models is then infeasible to compute, store, or invert. This is the same reason why second-order optimization methods are not popular in deep learning.

A method like ADAM [4] computes the running average of the first and second moments of the gradient. The first moment can be seen as momentum, which is not our interest in this article. The second moment approximates the Fisher Information Matrix, but constrains it to be a diagonal matrix. Thus in ADAM we only need O(n) space to store (the approximation of) F instead of O(n^2), and the inversion can be done in O(n) instead of O(n^3). In practice ADAM works really well and is currently the de facto standard for optimizing deep neural networks.

Adam

tf.optimizers.Adam(adam_learning_rate)
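A minimal sketch of the usual pattern in GPflow (Adam for the hyperparameters, natural gradients for the variational parameters); the toy data, model sizes and learning rates here are arbitrary:

import numpy as np
import tensorflow as tf
import gpflow

# Toy data and a small SVGP model.
X = np.random.rand(100, 1)
Y = np.sin(6 * X) + 0.1 * np.random.randn(100, 1)
data = (X, Y)

svgp = gpflow.models.SVGP(
    kernel=gpflow.kernels.SquaredExponential(),
    likelihood=gpflow.likelihoods.Gaussian(),
    inducing_variable=X[:20].copy(),
)

# Natural gradients handle q(u), so hide q_mu/q_sqrt from Adam.
gpflow.set_trainable(svgp.q_mu, False)
gpflow.set_trainable(svgp.q_sqrt, False)

adam_opt = tf.optimizers.Adam(learning_rate=0.01)
natgrad_opt = gpflow.optimizers.NaturalGradient(gamma=0.1)

loss = svgp.training_loss_closure(data)
for _ in range(100):
    natgrad_opt.minimize(loss, var_list=[(svgp.q_mu, svgp.q_sqrt)])
    adam_opt.minimize(loss, var_list=svgp.trainable_variables)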

MCMC

https://gpflow.readthedocs.io/en/master/notebooks/advanced/mcmc.html

GPflow provides gpflow.optimizers.SamplingHelper, a helper that deals with parameter constraints; the actual MCMC computation is done by calling a concrete sampling method from tensorflow_probability's tfp.mcmc.*. Besides handling parameter constraints, this helper also has other functionality that I haven't figured out yet.

GPflow’s model for fully-Bayesian MCMC is called GPMC.
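A minimal sketch of how the pieces fit together, loosely following the tutorial (the toy data, kernel and sampler settings are arbitrary):

import numpy as np
import gpflow
import tensorflow_probability as tfp

# Toy classification data and a GPMC model.
X = np.random.rand(30, 1)
Y = (np.sin(6 * X) > 0).astype(float)
model = gpflow.models.GPMC(
    (X, Y), kernel=gpflow.kernels.Matern32(), likelihood=gpflow.likelihoods.Bernoulli()
)

# SamplingHelper bridges the constrained gpflow.Parameters and the
# unconstrained variables that the tfp.mcmc samplers actually operate on.
hmc_helper = gpflow.optimizers.SamplingHelper(
    model.log_posterior_density, model.trainable_parameters
)
hmc = tfp.mcmc.HamiltonianMonteCarlo(
    target_log_prob_fn=hmc_helper.target_log_prob_fn,
    num_leapfrog_steps=10,
    step_size=0.01,
)
samples, _ = tfp.mcmc.sample_chain(
    num_results=100,
    num_burnin_steps=100,
    current_state=hmc_helper.current_state,
    kernel=hmc,
)
# Map the unconstrained samples back to constrained parameter values.
constrained_samples = hmc_helper.convert_to_constrained_values(samples)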

Multi Latent Likelihood

https://gpflow.readthedocs.io/en/master/notebooks/advanced/heteroskedastic.html?highlight=adam

https://github.com/GPflow/GPflow/blob/master/gpflow/likelihoods/multilatent.py

A Likelihood which assumes that a single dimensional observation is driven by multiple latent GPs.

The heteroskedastic likelihood is implemented with this model. Internally it actually uses a multi-output GP with a 2-dimensional output: the first GP output serves as the mean of the likelihood, and the second output, after a transformation, serves as the std (or "scale", as it is called in the GPflow tutorials).
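A minimal sketch of the relevant pieces from the tutorial (the SVGP plumbing around them is omitted):

import gpflow
import tensorflow_probability as tfp

# Two latent GPs feed a Normal distribution: the first becomes its loc (mean),
# the second is pushed through an Exp bijector to give a positive scale (std).
likelihood = gpflow.likelihoods.HeteroskedasticTFPConditional(
    distribution_class=tfp.distributions.Normal,
    scale_transform=tfp.bijectors.Exp(),
)

# One kernel per latent GP; the count must match likelihood.latent_dim (2 here).
kernel = gpflow.kernels.SeparateIndependent(
    [gpflow.kernels.SquaredExponential(), gpflow.kernels.SquaredExponential()]
)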

Others

model.trainable_parameters and model.trainable_variables have different meanings. From reading the code: trainable_parameters yields the gpflow.Parameter objects (the constrained values, e.g. positive variances), while trainable_variables (inherited from tf.Module) yields the underlying unconstrained tf.Variable objects that the optimizers actually update.
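A tiny sketch to see the difference (the model choice is arbitrary):

import numpy as np
import gpflow

model = gpflow.models.GPR(
    (np.zeros((1, 1)), np.zeros((1, 1))), kernel=gpflow.kernels.SquaredExponential()
)

# gpflow.Parameter objects: the constrained values (e.g. positive variances).
print(model.trainable_parameters)
# The underlying unconstrained tf.Variables that optimizers actually update.
print(model.trainable_variables)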