Learn GPflow

(from here:)
GPflow only supports eager execution, which is the default in TensorFlow 2. It does not support graph mode, which was the default execution mode in TensorFlow 1.

Deterministic Results

See this GitHub package, which covers almost all topics about determinism in TensorFlow.

The above link eventually leads to this design proposal Enabling Determinism in TensorFlow. In short:

Enable deterministic operations according to this section: os.environ['TF_DETERMINISTIC_OPS'] = '1'
Set the random seed according to this section: tf.random.set_seed(42)
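
A minimal sketch of the two steps above. The helper name seed_everything is hypothetical, and seeding Python's random module and NumPy is my own addition (an assumption, not part of the design proposal). The environment variable is set before TensorFlow is imported so that it is already in place when TensorFlow looks for it.

```python
import os
import random

# Step 1: request deterministic ops. Set before importing TensorFlow.
os.environ["TF_DETERMINISTIC_OPS"] = "1"

import numpy as np
import tensorflow as tf


def seed_everything(seed: int = 42) -> None:
    """Hypothetical helper: seed every random number generator in use."""
    random.seed(seed)         # assumption: also seed Python's RNG
    np.random.seed(seed)      # assumption: also seed NumPy's global RNG
    tf.random.set_seed(seed)  # Step 2: the global TensorFlow seed


# Re-seeding restarts TensorFlow's random sequence, so both draws should match.
seed_everything(42)
first = tf.random.normal([3]).numpy()
seed_everything(42)
second = tf.random.normal([3]).numpy()
print("identical draws:", np.array_equal(first, second))
```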

Save and Load

https://gpflow.readthedocs.io/en/master/notebooks/intro_to_gpflow2.html#Saving-and-loading-models

Checkpointing: using tf.train.CheckpointManager and tf.train.Checkpoint

Copying (hyper)parameter values between models: using params = gpflow.utilities.parameter_dict(model1) to get parameters and gpflow.utilities.multiple_assign(model2, params) to assign to a model. (Will use this one for now as it looks the simplest.)
See this official demo.

TensorFlow saved_model: In order to save the model we need to explicitly store the tf.function-compiled functions that we wish to export. ~~(I think this means we can only save those specific compiled functions but not the whole model?)~~ (tf.saved_model seems to be more powerful… See https://www.tensorflow.org/guide/saved_model )

Learn TensorFlow

The phrase “Saving a TensorFlow model” typically means one of two things:

Checkpoints: capture all parameters (tf.Variable objects) of a model; do NOT contain the computation defined by the model; only useful when the source code is available. (For now, I only need this one for GPflow.) Functions:
- just save/load weights
- tf.keras.Model.save_weights;
- tf.train.Checkpoint + tf.train.CheckpointManager
- Calling restore on a tf.train.Checkpoint object queues the requested restorations, restoring variable values as soon as there's a matching path from the Checkpoint object.
- tf.train.load_checkpoint returns a CheckpointReader that gives lower level access to the checkpoint contents.
- See https://www.tensorflow.org/guide/checkpoint
SavedModel: Checkpoint + a serialized description of the computation defined by the model; independent of the source code; suitable for deployment.
- Low-level API: tf.saved_model.save(model, path_to_dir), tf.saved_model.load(path_to_dir)
  - tf.saved_model.save supports saving tf.Module objects and its subclasses, like tf.keras.Layer and tf.keras.Model.
  - any Python attributes, functions, and data are lost. This means that when a tf.function is saved, no Python code is saved. When saving a tf.function, you're really saving the tf.function's cache of ConcreteFunctions. (See more: https://www.tensorflow.org/guide/function)
  - See https://www.tensorflow.org/guide/saved_model
- High-level API: tf.keras.Model
  - SavedModel is the more comprehensive save format that saves the model architecture, weights, and the traced TensorFlow subgraphs of the call functions. This enables Keras to restore both built-in layers as well as custom objects. (the default at the time of writing)
  - See https://www.tensorflow.org/guide/keras/save_and_serialize
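
A minimal sketch of the checkpoint route described above, using tf.train.Checkpoint together with tf.train.CheckpointManager. The ToyModel class is hypothetical and stands in for any tf.Module; since a GPflow model is also a tf.Module, I assume it can be passed in the same way. Note that the source code (here, the class definition) is needed again at load time, because only variable values are stored.

```python
import tempfile

import tensorflow as tf


class ToyModel(tf.Module):
    """Hypothetical stand-in for a real model: one trainable variable."""

    def __init__(self):
        super().__init__()
        self.w = tf.Variable(1.0)


# Save: wrap the model in a Checkpoint and let the manager write it to disk.
source = ToyModel()
source.w.assign(3.5)  # pretend this value came from training
checkpoint = tf.train.Checkpoint(model=source)
manager = tf.train.CheckpointManager(
    checkpoint, directory=tempfile.mkdtemp(), max_to_keep=3
)
save_path = manager.save()

# Load: rebuild the model from source code, then restore the variable values.
target = ToyModel()
tf.train.Checkpoint(model=target).restore(manager.latest_checkpoint)
print(save_path, float(target.w.numpy()))
```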

Using pickle? Maybe not.

I was able to save and load a trained GP model with pickle, and the loaded model trained and predicted fine.

However, this Stack Overflow post points out a critical issue that traces back to TensorFlow Probability.

To avoid unnecessary trouble, just use the simplest method, the second one discussed at the beginning (copying parameter values between models).

Optimizers

The natural gradient optimizer in GPflow is GPflow's own implementation: from gpflow.optimizers import NaturalGradient

Natural Gradients

My thoughts after reading these learning materials:

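Natural gradients are defined with respect to the KL divergence, so they are effectively aimed at variational inference.
Because the Fisher Information Matrix is expensive to compute and also has to be inverted, the method seems impractical when there are many parameters.
The results do not seem to be better than Adam's.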

GPflow's tutorial: Natural gradients

In this tutorial, natgrad_opt is always used to update only the variational parameters [(vgp.q_mu, vgp.q_sqrt)].

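The GPflow tutorial repeatedly mentions that a single update step can turn the VGP into GPR. I take this to mean that, only in the case of a Gaussian likelihood and after a deterministic initialization, updating the VGP's variational parameters for a single step makes $q(x)$ and $f(x)$ exactly the same. Is this effectively just changing the initial value? I do not quite understand why the tutorial keeps stressing this property. Does it have any special significance?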
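
A NumPy sketch that may help with the question above. It is not GPflow's implementation: it parameterises q directly by a mean m and covariance S rather than by q_mu and q_sqrt, and the kernel, data, and the natgrad_step helper are made up for illustration. With a Gaussian likelihood the model is conjugate, and one natural gradient step with step size 1 (natural parameters updated with the gradient taken with respect to the expectation parameters) lands on the exact GPR posterior from an arbitrary starting point, so the result does not depend on the initial value.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
x = np.linspace(0.0, 1.0, n)
# Prior covariance of f (squared exponential kernel, small jitter for stability).
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.3**2) + 1e-6 * np.eye(n)
K_inv = np.linalg.inv(K)
noise_var = 0.1
y = rng.normal(size=n)


def natgrad_step(m, S, step=1.0):
    """One natural gradient step on the ELBO for q(f) = N(m, S)."""
    S_inv = np.linalg.inv(S)
    # Natural parameters of a Gaussian.
    theta1, theta2 = S_inv @ m, -0.5 * S_inv
    # ELBO gradients with respect to m and S (Gaussian likelihood, prior N(0, K)).
    dL_dm = (y - m) / noise_var - K_inv @ m
    dL_dS = 0.5 * (S_inv - K_inv - np.eye(n) / noise_var)
    # Chain rule to expectation parameters eta1 = m, eta2 = S + m m^T.
    grad_eta1 = dL_dm - 2.0 * dL_dS @ m
    grad_eta2 = dL_dS
    # Natural gradient update: move the natural parameters along grad_eta.
    theta1 = theta1 + step * grad_eta1
    theta2 = theta2 + step * grad_eta2
    S_new = np.linalg.inv(-2.0 * theta2)
    return S_new @ theta1, S_new


# Arbitrary (random) initial variational distribution.
A = rng.normal(size=(n, n))
m0, S0 = rng.normal(size=n), A @ A.T + np.eye(n)
m1, S1 = natgrad_step(m0, S0, step=1.0)

# Exact GPR posterior over f at the training inputs.
G = np.linalg.inv(K + noise_var * np.eye(n))
m_exact = K @ G @ y
S_exact = K - K @ G @ K
print(np.max(np.abs(m1 - m_exact)), np.max(np.abs(S1 - S_exact)))
```

With a non-Gaussian likelihood the same step is no longer exact, so several steps are needed.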

Excerpts from the post: It’s Only Natural: An Excessively Deep Dive Into Natural Gradient Optimization

(from here:)
In the context of Natural Gradient, KL divergence is deployed as a way of measuring the change in the output distribution our model is predicting.

The short answer is: practically speaking, it doesn’t provide compelling enough value to be in common use for most deep learning applications. There is evidence of natural gradient leading to convergence happening in fewer steps, but, as I’ll discuss later, that’s a bit of a complicated comparison. The idea of natural gradient is elegant and satisfying to people frustrated by the arbitrariness of scaling update steps in parameter space. But, other than being elegant, it’s not clear to me that it’s providing value that couldn’t be provided via more heuristic means.

Notice that I said that Natural Gradient is shown to speed up convergence in terms of gradient steps. That precision comes from the fact that each individual step of Natural Gradient takes longer, because it requires calculating a Fisher Information Matrix, which, remember, is a quantity that exists in n_parameters^2 space.

A lot of the reason that modern neural networks have been able to succeed where theory would predict that a first-order-only method would fail is that Deep Learning practitioners have found a bunch of clever tricks to essentially empirically approximate the information that would be contained in an analytic second-derivative matrix.

Excerpts from another good post: Natural Gradient Descent

(from here:)
As we know, the number of parameters in deep learning models is very large, within millions of parameters. The Fisher Information Matrix for these kind of models is then infeasible to compute, store, or invert. This is the same problem as why second order optimization methods are not popular in deep learning.

Method like ADAM [4] computes the running average of first and second moment of the gradient. First moment can be seen as momentum which is not our interest in this article. The second moment is approximating the Fisher Information Matrix, but constraining it to be diagonal matrix. Thus in ADAM, we only need O(n) space to store (the approximation of) F instead of O(n^2) and the inversion can be done in O(n) instead of O(n^3). In practice ADAM works really well and is currently the de facto standard for optimizing deep neural networks.
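
A small NumPy sketch of the storage argument in the excerpt above. The per-sample gradients are synthetic (random numbers with made-up scales), so this only illustrates the shapes and the diagonal approximation, not a real model. Note that Adam itself divides by the square root of the second-moment estimate rather than by the estimate, so it is not literally a natural gradient step with a diagonal Fisher matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n_params, n_samples = 4, 2000

# Synthetic per-sample gradients (hypothetical scales, for illustration only).
scales = np.array([0.1, 0.5, 1.0, 2.0])
grads = rng.normal(size=(n_samples, n_params)) * scales

# Full empirical Fisher matrix: O(n^2) storage, O(n^3) to invert.
fisher = grads.T @ grads / n_samples

# Adam-style second moment: a running average of squared gradients, O(n) storage.
beta2 = 0.999
v = np.zeros(n_params)
for t, g in enumerate(grads, start=1):
    v = beta2 * v + (1.0 - beta2) * g**2
v_hat = v / (1.0 - beta2**t)  # bias correction, as in Adam

mean_grad = grads.mean(axis=0)
natural_step = np.linalg.solve(fisher, mean_grad)     # needs the full matrix
adam_like_step = mean_grad / (np.sqrt(v_hat) + 1e-8)  # elementwise, O(n)

print("diag(F):", np.diag(fisher))
print("v_hat:  ", v_hat)
```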

Adam

tf.optimizers.Adam(adam_learning_rate)

MCMC

https://gpflow.readthedocs.io/en/master/notebooks/advanced/mcmc.html

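GPflow provides a helper, gpflow.optimizers.SamplingHelper, for handling parameter constraints; the actual MCMC computation is done by calling the specific sampling methods in tfp.mcmc.* from tensorflow_probability. Besides handling parameter constraints, the helper has other functionality that I have not figured out yet.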

GPflow’s model for fully-Bayesian MCMC is called GPMC.

Multi Latent Likelihood

https://gpflow.readthedocs.io/en/master/notebooks/advanced/heteroskedastic.html?highlight=adam

https://github.com/GPflow/GPflow/blob/master/gpflow/likelihoods/multilatent.py

A Likelihood which assumes that a single dimensional observation is driven by multiple latent GPs.

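The heteroskedastic likelihood is implemented with this model. Internally it uses a multi-output GP with two outputs: the first output serves as the mean of the likelihood, and the second output, after a transformation, serves as the standard deviation (called scale in the GPflow tutorials).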
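
A NumPy sketch of the idea described above, not GPflow's implementation. The function name is hypothetical, and the choice of exp as the transformation from the second latent output to a positive scale is an assumption made for illustration.

```python
import numpy as np


def heteroskedastic_gaussian_logpdf(y, F, scale_transform=np.exp):
    """Log density of y under Normal(loc=F[:, 0], scale=transform(F[:, 1])).

    y has shape (N,); F has shape (N, 2) and holds the two latent GP outputs.
    """
    loc = F[:, 0]                     # first latent output: mean
    scale = scale_transform(F[:, 1])  # second latent output: positive std
    return -0.5 * np.log(2.0 * np.pi) - np.log(scale) - 0.5 * ((y - loc) / scale) ** 2


# Three observations, each with its own mean and its own noise level.
y = np.array([0.0, 1.0, -1.0])
F = np.array([[0.0, 0.0], [1.0, -1.0], [0.0, 1.0]])
log_density = heteroskedastic_gaussian_logpdf(y, F)
print(log_density)
```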

Others

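model.trainable_parameters and trainable_variables have different meanings. From reading the code:

trainable_variables is defined in tf.Module (see here), so what this call returns are the variables that TensorFlow itself uses.
trainable_parameters is defined in class Module(tf.Module) (code here). It builds on tf.Module by adding the ability to set a prior distribution.
[Open question, needs testing] As I understand it, what trainable_parameters returns should be contained in what trainable_variables returns.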
