Learn GPflow
(from here:)
GPflow only supports eager execution, which is the default in TensorFlow 2. It does not support graph mode, which was the default execution mode in TensorFlow 1.
Deterministic Results
See this GitHub package, which covers almost every topic related to determinism in TensorFlow.
The above link eventually leads to this design proposal Enabling Determinism in TensorFlow. In short:
- Enable deterministic operations according to this section: os.environ['TF_DETERMINISTIC_OPS'] = '1'
- Set the random seed according to this section: tf.random.set_seed(42)
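A minimal sketch of the two settings together. The helper name draw is made up for illustration, and setting the environment variable before importing TensorFlow is my own cautious assumption rather than something the proposal requires.

```python
import os

# Set the flag before TensorFlow runs any op ('1' requests deterministic kernels).
os.environ["TF_DETERMINISTIC_OPS"] = "1"

import tensorflow as tf


def draw(seed: int = 42):
    """Reset the global seed, then draw a few random numbers (hypothetical helper)."""
    tf.random.set_seed(seed)
    return tf.random.normal([3]).numpy()


first = draw()
second = draw()
print(first, second)
```

Resetting the global seed also resets the internal op counters, so both draws should give the same numbers.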
Save and Load
https://gpflow.readthedocs.io/en/master/notebooks/intro_to_gpflow2.html#Saving-and-loading-models
- Checkpointing: using tf.train.CheckpointManager and tf.train.Checkpoint
- Copying (hyper)parameter values between models: using params = gpflow.utilities.parameter_dict(model1) to get parameters and gpflow.utilities.multiple_assign(model2, params) to assign to a model. (Will use this one for now as it looks the simplest.) See this official demo.
- TensorFlow saved_model: In order to save the model we need to explicitly store the tf.function-compiled functions that we wish to export. (I think this means we can only save those specific compiled functions but not the whole model?) (tf.saved_model seems to be more powerful… See https://www.tensorflow.org/guide/saved_model )
Learn TensorFlow
The phrase “Saving a TensorFlow model” typically means one of two things:
- Checkpoints: capture all parameters (tf.Variable objects) of a model; do NOT contain the computation defined by the model; only useful when the source code is available. (For now, I only need this one for GPflow.) Functions:
  - just save/load weights: tf.keras.Model.save_weights; tf.train.Checkpoint + tf.train.CheckpointManager
  - Calling restore on a tf.train.Checkpoint object queues the requested restorations, restoring variable values as soon as there's a matching path from the Checkpoint object.
  - tf.train.load_checkpoint returns a CheckpointReader that gives lower level access to the checkpoint contents.
- SavedModel: Checkpoint + a serialized description of the computation defined by the model; independent of the source code; suitable for deployment.
  - Low-level API: tf.saved_model.save(model, path_to_dir), tf.saved_model.load(path_to_dir)
    - tf.saved_model.save supports saving tf.Module objects and its subclasses, like tf.keras.Layer and tf.keras.Model.
    - any Python attributes, functions, and data are lost. This means that when a tf.function is saved, no Python code is saved. When saving a tf.function, you're really saving the tf.function's cache of ConcreteFunctions. (See more: https://www.tensorflow.org/guide/function)
  - High-level API: tf.keras.Model
    - SavedModel is the more comprehensive save format that saves the model architecture, weights, and the traced TensorFlow subgraphs of the call functions. This enables Keras to restore both built-in layers as well as custom objects. (default one as of now)
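To make the difference between the two concrete, here is a small sketch with a toy tf.Module. The class Scale and the directory names are hypothetical.

```python
import os
import tempfile

import tensorflow as tf


class Scale(tf.Module):
    """Toy module (hypothetical): multiplies its input by a stored weight."""

    def __init__(self):
        super().__init__()
        self.w = tf.Variable(2.0)

    @tf.function(input_signature=[tf.TensorSpec(shape=[None], dtype=tf.float32)])
    def __call__(self, x):
        return self.w * x


workdir = tempfile.mkdtemp()
model = Scale()

# 1) Checkpoint: stores variable values only; restoring needs the class above.
ckpt = tf.train.Checkpoint(model=model)
manager = tf.train.CheckpointManager(ckpt, os.path.join(workdir, "ckpt"), max_to_keep=3)
manager.save()

model.w.assign(0.0)                      # pretend the trained value was lost
ckpt.restore(manager.latest_checkpoint)  # puts the saved value back
restored_w = float(model.w.numpy())

# 2) SavedModel: stores the variables plus the traced computation.
export_dir = os.path.join(workdir, "saved_model")
tf.saved_model.save(model, export_dir)
loaded = tf.saved_model.load(export_dir)  # does not need the Scale class
loaded_out = loaded(tf.constant([1.0, 2.0])).numpy()
```

The checkpoint half only works because the Scale class is still in scope; the SavedModel half would also run in a process that never defined it.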
Using pickle? Maybe not.
I was able to save and load a trained GP model with pickle, and the loaded model trains and predicts fine.
However, this Stack Overflow post points out a critical issue that traces back to TensorFlow Probability.
To avoid unnecessary trouble, just use the simplest method, the second one discussed at the beginning (copying parameter values between models).
Optimizers
- GPflow ships its own implementation of the natural gradient optimizer: from gpflow.optimizers import NaturalGradient
Natural Gradients
My thoughts after reading these learning materials:
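- It targets the KL divergence, so it is effectively aimed at variational inference.
- Because the Fisher Information Matrix is expensive to compute and also has to be inverted, it does not seem practical when there are many parameters.
- It does not seem to perform better than Adam.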
GPflow's tutorial: Natural gradients
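In this tutorial, natgrad_opt is always used to update only the variational parameters [(vgp.q_mu, vgp.q_sqrt)].
The GPflow tutorial repeatedly mentions that a single update step can turn VGP into GPR. I take this to mean that, only with a Gaussian likelihood and after a deterministic initialization, updating the VGP variational parameters for just one step makes $q(x)$ and $f(x)$ exactly the same. Is this because the step effectively just changes the initial values? I don't quite understand why this property is emphasized so often; does it have any special significance?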
Excerpts from the post: It’s Only Natural: An Excessively Deep Dive Into Natural Gradient Optimization
(from here:)
In the context of Natural Gradient, KL divergence is deployed as a way of measuring the change in the output distribution our model is predicting.
The short answer is: practically speaking, it doesn’t provide compelling enough value to be in common use for most deep learning applications. There is evidence of natural gradient leading to convergence happening in fewer steps, but, as I’ll discuss later, that’s a bit of a complicated comparison. The idea of natural gradient is elegant and satisfying to people frustrated by the arbitrariness of scaling update steps in parameter space. But, other than being elegant, it’s not clear to me that it’s providing value that couldn’t be provided via more heuristic means.
Notice that I said that Natural Gradient is shown to speed up convergence in terms of gradient steps. That precision comes from the fact that each individual step of Natural Gradient takes longer, because it requires calculating a Fisher Information Matrix, which, remember, is a quantity that exists in n_parameters^2 space.
A lot of the reason that modern neural networks have been able to succeed where theory would predict that a first-order-only method would fail is that Deep Learning practitioners have found a bunch of clever tricks to essentially empirically approximate the information that would be contained in an analytic second-derivative matrix.
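To make the Fisher-matrix step in these excerpts concrete, here is a tiny NumPy sketch that fits a univariate Gaussian by natural gradient descent. The model, the data, and the function names are my own illustration and do not come from the post.

```python
import numpy as np


def nll_grad(theta, x):
    """Gradient of the average negative log-likelihood of N(mu, sigma^2)."""
    mu, sigma = theta
    d_mu = -(x.mean() - mu) / sigma**2
    d_sigma = 1.0 / sigma - np.mean((x - mu) ** 2) / sigma**3
    return np.array([d_mu, d_sigma])


def fisher(theta):
    """Fisher Information Matrix of N(mu, sigma^2) with respect to (mu, sigma)."""
    _, sigma = theta
    return np.diag([1.0 / sigma**2, 2.0 / sigma**2])


def natural_gradient_step(theta, x, lr=1.0):
    # Solve F d = g rather than forming the inverse; for a general n x n
    # Fisher matrix this solve is the expensive part.
    direction = np.linalg.solve(fisher(theta), nll_grad(theta, x))
    return theta - lr * direction


rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=0.5, size=1000)

theta = np.array([0.0, 2.0])  # initial (mu, sigma)
for _ in range(20):
    theta = natural_gradient_step(theta, x)
mu_hat, sigma_hat = theta
```

Here the Fisher matrix is 2 x 2 and known in closed form, which is exactly what a model with millions of parameters lacks.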
Excerpts from another good post: Natural Gradient Descent
(from here:)
As we know, the number of parameters in deep learning models is very large, within millions of parameters. The Fisher Information Matrix for these kind of models is then infeasible to compute, store, or invert. This is the same problem as why second order optimization methods are not popular in deep learning.
Method like ADAM [4] computes the running average of first and second moment of the gradient. First moment can be seen as momentum which is not our interest in this article. The second moment is approximating the Fisher Information Matrix, but constraining it to be diagonal matrix. Thus in ADAM, we only need O(n) space to store (the approximation of) F instead of O(n^2) and the inversion can be done in O(n) instead of O(n^3). In practice ADAM works really well and is currently the de facto standard for optimizing deep neural networks.
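A plain NumPy sketch of the Adam update shows the diagonal second moment from this excerpt. The function name, the toy quadratic, and the hyperparameter values are assumptions for illustration.

```python
import numpy as np


def adam_minimize(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Plain Adam; m and v each hold one number per parameter (O(n) memory)."""
    m = np.zeros_like(theta)  # running mean of gradients (momentum)
    v = np.zeros_like(theta)  # running mean of squared gradients (diagonal only)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)  # bias correction
        v_hat = v / (1 - beta2**t)
        # Element-wise division: "inverting" a diagonal matrix costs O(n).
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, v


# Badly scaled quadratic: f(theta) = 0.5 * sum(scales * theta**2)
scales = np.array([100.0, 1.0, 0.01])
theta0 = np.array([1.0, 1.0, 1.0])
theta_final, v = adam_minimize(lambda th: scales * th, theta0)
```

The state v has the same shape as theta, while a full Fisher matrix for the same problem would need n x n entries.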
Adam
tf.optimizers.Adam(adam_learning_rate)
MCMC
https://gpflow.readthedocs.io/en/master/notebooks/advanced/mcmc.html
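GPflow provides a helper for dealing with parameter constraints, gpflow.optimizers.SamplingHelper; the actual MCMC computation is carried out by calling a concrete sampling method from tfp.mcmc.* in tensorflow_probability.
Besides handling parameter constraints, this helper has other functionality that I have not figured out yet.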
GPflow’s model for fully-Bayesian MCMC is called GPMC.
Multi Latent Likelihood
https://gpflow.readthedocs.io/en/master/notebooks/advanced/heteroskedastic.html?highlight=adam
https://github.com/GPflow/GPflow/blob/master/gpflow/likelihoods/multilatent.py
A Likelihood which assumes that a single dimensional observation is driven by multiple latent GPs.
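The heteroskedastic likelihood is implemented with this model. Internally it uses a multi-output GP with a 2-dimensional output: the first GP output serves as the mean of the likelihood, and the second output, after a transformation, serves as the std (named scale in the GPflow tutorials).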
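The idea can be sketched in NumPy without any GP machinery: given two latent values per data point, evaluate a Normal log-density whose scale comes from the second one. The function name is hypothetical, and using the exponential as the transform is an assumption.

```python
import numpy as np


def heteroskedastic_log_density(F, Y):
    """Log-density of Y under Normal(loc=f1, scale=exp(f2)).

    F has shape [N, 2]: column 0 is the latent mean, column 1 the latent
    (unconstrained) log-scale. Y has shape [N, 1].
    """
    loc = F[:, :1]
    scale = np.exp(F[:, 1:])  # the transform keeps the scale positive
    z = (Y - loc) / scale
    return -0.5 * z**2 - np.log(scale) - 0.5 * np.log(2.0 * np.pi)


# Two data points, each with two latent values (mean, log-scale).
F = np.array([[0.0, 0.0], [1.0, np.log(2.0)]])
Y = np.array([[0.0], [1.0]])
logp = heteroskedastic_log_density(F, Y)
```

In the real model the two columns of F would be samples or moments of the two latent GPs rather than fixed numbers.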
Others
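model.trainable_parameters and trainable_variables have different meanings. From reading the code:
- trainable_parameters is defined in class Module(tf.Module) (code here), which extends tf.Module with the ability to set a prior distribution.
- [Unsure, needs testing] As I understand it, what trainable_parameters returns should be contained in what trainable_variables returns.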