(from here:)
GPflow only supports eager execution, which is the default in TensorFlow 2. It does not support graph mode, which was the default execution mode in TensorFlow 1.
See this GitHub repository, which covers almost all topics related to determinism in TensorFlow.
The above link eventually leads to this design proposal Enabling Determinism in TensorFlow. In short:
import os
os.environ['TF_DETERMINISTIC_OPS'] = '1'
import tensorflow as tf
tf.random.set_seed(42)
https://gpflow.readthedocs.io/en/master/notebooks/intro_to_gpflow2.html#Saving-and-loading-models
Checkpointing: using tf.train.CheckpointManager and tf.train.Checkpoint.
Copying (hyper)parameter values between models: using params = gpflow.utilities.parameter_dict(model1) to get the parameters and gpflow.utilities.multiple_assign(model2, params) to assign them to another model. (Will use this one for now as it looks the simplest; see the sketch below.)
See this official demo.
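A minimal sketch of the parameter-copying approach, assuming two GPflow models built with the same structure; the toy data, kernel, and model choice here are just placeholders:

import numpy as np
import gpflow

# Toy data and a helper that builds identically structured models (placeholder setup)
X = np.random.rand(20, 1)
Y = np.sin(10 * X) + 0.1 * np.random.randn(20, 1)

def make_model():
    return gpflow.models.GPR((X, Y), kernel=gpflow.kernels.SquaredExponential())

model1 = make_model()
gpflow.optimizers.Scipy().minimize(model1.training_loss, model1.trainable_variables)

# Copy the trained (hyper)parameter values into a freshly built model
params = gpflow.utilities.parameter_dict(model1)   # e.g. {".kernel.lengthscales": ..., ...}
model2 = make_model()
gpflow.utilities.multiple_assign(model2, params)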
TensorFlow saved_model: In order to save the model we need to explicitly store the tf.function-compiled functions that we wish to export. (I think this means we can only save those specific compiled functions but not the whole model?) (tf.saved_model seems to be more powerful… See https://www.tensorflow.org/guide/saved_model)
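A sketch of what the GPflow notebook describes, assuming model is a trained GPflow model with 1-D inputs, and save_dir / Xnew are a hypothetical directory path and test-input array:

import tensorflow as tf

# Attach a tf.function-compiled prediction method to the model, then export it.
model.predict_f_compiled = tf.function(
    model.predict_f,
    input_signature=[tf.TensorSpec(shape=[None, 1], dtype=tf.float64)],
)
tf.saved_model.save(model, save_dir)

# The loaded object exposes only the exported compiled functions, not the full Python model.
loaded = tf.saved_model.load(save_dir)
mean, var = loaded.predict_f_compiled(Xnew)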
The phrase “Saving a TensorFlow model” typically means one of two things:
Checkpoints: capture all parameters (tf.Variable objects) of a model; do NOT contain the computation defined by the model; only useful when the source code is available. (For now, I only need this one for GPflow; see the checkpoint sketch below.) Functions: tf.keras.Model.save_weights; tf.train.Checkpoint + tf.train.CheckpointManager. Calling restore on a tf.train.Checkpoint object queues the requested restorations, restoring variable values as soon as there is a matching path from the Checkpoint object. tf.train.load_checkpoint returns a CheckpointReader that gives lower-level access to the checkpoint contents.
SavedModel: a Checkpoint + a serialized description of the computation defined by the model; independent of the source code; suitable for deployment. Functions: tf.saved_model.save(model, path_to_dir), tf.saved_model.load(path_to_dir). tf.saved_model.save supports saving tf.Module objects and their subclasses, like tf.keras.Layer and tf.keras.Model. Only the tf.function is saved, no Python code is saved; when saving a tf.function, you're really saving the tf.function's cache of ConcreteFunctions. (See more: https://www.tensorflow.org/guide/function) For tf.keras.Model, SavedModel is the more comprehensive save format that saves the model architecture, weights, and the traced TensorFlow subgraphs of the call functions. This enables Keras to restore both built-in layers as well as custom objects. (default one as of now)
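The checkpoint sketch referenced above, assuming model is a GPflow model being trained with a tf.optimizers optimizer and "./ckpts" is an arbitrary directory:

import tensorflow as tf

step = tf.Variable(0, dtype=tf.int64, trainable=False)
ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer, step=step)
manager = tf.train.CheckpointManager(ckpt, directory="./ckpts", max_to_keep=3)

# Restores the latest checkpoint if one exists; otherwise this is a no-op.
ckpt.restore(manager.latest_checkpoint)

# Inside the training loop, save periodically:
# step.assign_add(1)
# manager.save(checkpoint_number=step)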
It seems I saved and loaded a trained GP model with pickle, and it trains and predicts just fine.
However, this Stack Overflow post has pointed out a critical issue tracing back to TensorFlow Probability. To avoid unnecessary trouble, just use the simplest method, the second one discussed at the beginning (copying (hyper)parameter values between models).
from gpflow.optimizers import NaturalGradient
My thoughts after reading these learning materials:
GPflow's tutorial: Natural gradients
In this tutorial, natgrad_opt is always used to update only the variational parameters [(vgp.q_mu, vgp.q_sqrt)], as in the sketch below.
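A condensed sketch of that usage, assuming vgp is an already constructed gpflow.models.VGP with a Gaussian likelihood; gamma=1.0 follows the tutorial's single-step argument:

from gpflow.optimizers import NaturalGradient

# Natural-gradient steps act only on the variational parameters (q_mu, q_sqrt);
# with a Gaussian likelihood and gamma=1.0, one step recovers the exact GPR posterior.
natgrad_opt = NaturalGradient(gamma=1.0)
variational_params = [(vgp.q_mu, vgp.q_sqrt)]
natgrad_opt.minimize(vgp.training_loss, var_list=variational_params)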
The GPflow tutorial repeatedly mentions that a single update step can turn the VGP into a GPR. I think this means that, only in the Gaussian-likelihood case and after the deterministic initialisation, updating the VGP's variational parameters for just one step makes $q(x)$ and $f(x)$ exactly identical; is this because it is essentially just changing the initial values? I don't quite understand why this property is emphasised repeatedly; does it have any special significance?
Excerpts from the post: It’s Only Natural: An Excessively Deep Dive Into Natural Gradient Optimization
(from here:)
In the context of Natural Gradient, KL divergence is deployed as a way of measuring the change in the output distribution our model is predicting.
The short answer is: practically speaking, it doesn’t provide compelling enough value to be in common use for most deep learning applications. There is evidence of natural gradient leading to convergence happening in fewer steps, but, as I’ll discuss later, that’s a bit of a complicated comparison. The idea of natural gradient is elegant and satisfying to people frustrated by the arbitrariness of scaling update steps in parameter space. But, other than being elegant, it’s not clear to me that it’s providing value that couldn’t be provided via more heuristic means.
Notice that I said that Natural Gradient is shown to speed up convergence in terms of gradient steps. That precision comes from the fact that each individual step of Natural Gradient takes longer, because it requires calculating a Fisher Information Matrix, which, remember, is a quantity that exists in n_parameters^2 space.
A lot of the reason that modern neural networks have been able to succeed where theory would predict that a first-order-only method would fail is that Deep Learning practitioners have found a bunch of clever tricks to essentially empirically approximate the information that would be contained in an analytic second-derivative matrix.
Excerpts from another good post: Natural Gradient Descent
(from here:)
As we know, the number of parameters in deep learning models is very large, within millions of parameters. The Fisher Information Matrix for these kind of models is then infeasible to compute, store, or invert. This is the same problem as why second order optimization methods are not popular in deep learning.
Method like ADAM [4] computes the running average of first and second moment of the gradient. First moment can be seen as momentum which is not our interest in this article. The second moment is approximating the Fisher Information Matrix, but constraining it to be a diagonal matrix. Thus in ADAM, we only need O(n) space to store (the approximation of) F instead of O(n^2) and the inversion can be done in O(n) instead of O(n^3). In practice ADAM works really well and is currently the de facto standard for optimizing deep neural networks.
tf.optimizers.Adam(adam_learning_rate)
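In the GPflow natural-gradients tutorial, Adam (for the hyperparameters) is combined with NaturalGradient (for the variational parameters). A rough sketch of that split, assuming svgp is a gpflow.models.SVGP and data is an (X, Y) tuple; the learning rate, gamma, and iteration count are illustrative:

import tensorflow as tf
from gpflow.optimizers import NaturalGradient
from gpflow.utilities import set_trainable

adam_learning_rate = 0.01  # illustrative value

# Keep Adam away from the variational parameters; NaturalGradient handles those.
set_trainable(svgp.q_mu, False)
set_trainable(svgp.q_sqrt, False)

adam_opt = tf.optimizers.Adam(adam_learning_rate)
natgrad_opt = NaturalGradient(gamma=0.1)
variational_params = [(svgp.q_mu, svgp.q_sqrt)]

training_loss = svgp.training_loss_closure(data)
for _ in range(100):
    natgrad_opt.minimize(training_loss, var_list=variational_params)
    adam_opt.minimize(training_loss, var_list=svgp.trainable_variables)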
https://gpflow.readthedocs.io/en/master/notebooks/advanced/mcmc.html
GPflow provides a function, gpflow.optimizers.SamplingHelper, for handling the parameter-constraint problem; the actual MCMC computation is done by calling a concrete sampling method from tfp.mcmc.* in tensorflow_probability.
Besides handling the parameter constraints, this helper also has other functionality that I haven't figured out yet.
GPflow’s model for fully-Bayesian MCMC is called GPMC.
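A condensed sketch following the GPflow MCMC notebook, assuming model is a GPflow model (e.g. GPMC) whose parameters have priors set; the step size and sample counts are illustrative:

import tensorflow as tf
import tensorflow_probability as tfp
import gpflow

# SamplingHelper maps between the constrained parameters and the unconstrained
# tf.Variables that the TFP samplers actually operate on.
hmc_helper = gpflow.optimizers.SamplingHelper(
    model.log_posterior_density, model.trainable_parameters
)

hmc = tfp.mcmc.HamiltonianMonteCarlo(
    target_log_prob_fn=hmc_helper.target_log_prob_fn,
    num_leapfrog_steps=10,
    step_size=0.01,
)

@tf.function
def run_chain():
    return tfp.mcmc.sample_chain(
        num_results=500,
        num_burnin_steps=300,
        current_state=hmc_helper.current_state,
        kernel=hmc,
        trace_fn=lambda _, pkr: pkr.is_accepted,
    )

samples, is_accepted = run_chain()
# Convert the unconstrained samples back to constrained parameter values.
constrained_samples = hmc_helper.convert_to_constrained_values(samples)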
https://gpflow.readthedocs.io/en/master/notebooks/advanced/heteroskedastic.html?highlight=adam
https://github.com/GPflow/GPflow/blob/master/gpflow/likelihoods/multilatent.py
A Likelihood which assumes that a single dimensional observation is driven by multiple latent GPs.
The heteroskedastic likelihood is implemented with this model: internally it actually uses a multi-output GP with 2 outputs, where the GP's first output serves as the likelihood's mean and the second output, after a transform, serves as the std (or named scale in the GPflow tutorials).
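A condensed sketch of that construction, following the heteroskedastic notebook; the inducing-point locations and kernel choices are illustrative:

import numpy as np
import tensorflow_probability as tfp
import gpflow

# Likelihood driven by 2 latent GPs: f1 -> mean, Exp(f2) -> scale (std).
likelihood = gpflow.likelihoods.HeteroskedasticTFPConditional(
    distribution_class=tfp.distributions.Normal,
    scale_transform=tfp.bijectors.Exp(),
)

# One kernel per latent GP.
kernel = gpflow.kernels.SeparateIndependent(
    [gpflow.kernels.SquaredExponential(), gpflow.kernels.SquaredExponential()]
)

# One set of inducing points per latent GP (illustrative locations).
Z = np.linspace(0.0, 1.0, 20)[:, None]
inducing_variable = gpflow.inducing_variables.SeparateIndependentInducingVariables(
    [
        gpflow.inducing_variables.InducingPoints(Z),
        gpflow.inducing_variables.InducingPoints(Z.copy()),
    ]
)

model = gpflow.models.SVGP(
    kernel=kernel,
    likelihood=likelihood,
    inducing_variable=inducing_variable,
    num_latent_gps=likelihood.latent_dim,
)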
model.trainable_parameters and trainable_variables have different meanings. From reading the code: trainable_parameters is defined in class Module(tf.Module) (the code is here), which adds to tf.Module the ability to set a prior distribution. What trainable_parameters returns should be contained in what a call to trainable_variables returns.
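A quick check of the difference on a throwaway GPR model (the data is arbitrary):

import numpy as np
import gpflow

X = np.random.rand(10, 1)
Y = np.random.rand(10, 1)
model = gpflow.models.GPR((X, Y), kernel=gpflow.kernels.SquaredExponential())

# gpflow.Parameter objects: constrained values, transforms, optional priors
print(model.trainable_parameters)
# the underlying tf.Variables (unconstrained representation) seen by TensorFlow
print(model.trainable_variables)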