KfacOptimizer
Inherits From: GradientDescentOptimizer
Defined in tensorflow/contrib/kfac/python/ops/optimizer.py.
The KFAC Optimizer (https://arxiv.org/abs/1503.05671).
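The page itself gives no usage example, so here is a minimal sketch (not from the original docs) of how the pieces fit together. It assumes a toy softmax model and the tf.contrib.kfac LayerCollection registration methods (register_fully_connected, register_categorical_predictive_distribution); the covariance and inverse updates are driven through the cov_update_op and inv_update_op properties listed below.

import numpy as np
import tensorflow as tf

kfac = tf.contrib.kfac

# Toy softmax model with a single fully connected layer.
x = tf.placeholder(tf.float32, shape=[None, 784])
labels = tf.placeholder(tf.int64, shape=[None])
weights = tf.get_variable('weights', shape=[784, 10])
bias = tf.get_variable('bias', shape=[10], initializer=tf.zeros_initializer())
logits = tf.matmul(x, weights) + bias
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

# Register the layer and the model's predictive distribution with a LayerCollection.
layer_collection = kfac.layer_collection.LayerCollection()
layer_collection.register_fully_connected((weights, bias), x, logits)
layer_collection.register_categorical_predictive_distribution(logits)

optimizer = kfac.optimizer.KfacOptimizer(
    learning_rate=0.01,
    cov_ema_decay=0.95,
    damping=0.001,
    layer_collection=layer_collection,
    momentum=0.9)

train_op = optimizer.minimize(loss)

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  for _ in range(100):
    batch_x = np.random.rand(32, 784).astype(np.float32)   # stand-in for real data
    batch_labels = np.random.randint(0, 10, size=32)
    feed = {x: batch_x, labels: batch_labels}
    sess.run(optimizer.cov_update_op, feed_dict=feed)  # update covariance estimates
    sess.run(optimizer.inv_update_op)                  # recompute the Fisher inverses
    sess.run(train_op, feed_dict=feed)                 # apply the preconditioned update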
Properties

cov_update_op
cov_update_ops
cov_update_thunks
damping
damping_adaptation_interval
inv_update_op
inv_update_ops
inv_update_thunks
variables
A list of variables which encode the current state of the Optimizer. Includes slot variables and additional global variables created by the optimizer in the current default graph.

Returns:

A list of variables.
Methods

__init__
__init__(
    learning_rate,
    cov_ema_decay,
    damping,
    layer_collection,
    var_list=None,
    momentum=0.9,
    momentum_type='regular',
    norm_constraint=None,
    name='KFAC',
    estimation_mode='gradients',
    colocate_gradients_with_ops=True,
    batch_size=None,
    placement_strategy=None,
    **kwargs
)
Initializes the KFAC optimizer with the given settings.
Args:

learning_rate: The base learning rate for the optimizer. Should probably be set to 1.0 when using momentum_type = 'qmodel', but can still be set lower if desired (effectively lowering the trust in the quadratic model).
cov_ema_decay: The decay factor used when calculating the covariance estimate moving averages.
damping: The damping factor used to stabilize training due to errors in the local approximation with the Fisher information matrix, and to regularize the update direction by making it closer to the gradient. If damping is adapted during training then this value is used for initializing the damping variable. (Higher damping means the update looks more like a standard gradient update - see Tikhonov regularization.)
layer_collection: The layer collection object, which holds the Fisher blocks, Kronecker factors, and losses associated with the graph. The layer_collection cannot be modified after KfacOptimizer's initialization.
var_list: Optional list or tuple of variables to train. Defaults to the list of variables collected in the graph under the key GraphKeys.TRAINABLE_VARIABLES.
momentum: The momentum decay constant to use. Only applies when momentum_type is 'regular' or 'adam'. (Default: 0.9)
momentum_type: The type of momentum to use in this optimizer, one of 'regular', 'adam', or 'qmodel'. (Default: 'regular')
norm_constraint: float or Tensor. If specified, the update is scaled down so that its approximate squared Fisher norm v^T F v is at most the specified value. May only be used with momentum type 'regular'. (Default: None)
name: The name for this optimizer. (Default: 'KFAC')
estimation_mode: The type of estimator to use for the Fishers. Can be 'gradients', 'empirical', 'curvature_propagation', or 'exact'. (Default: 'gradients'). See the docstring for FisherEstimator for a more detailed description of these options.
colocate_gradients_with_ops: Whether we should request gradients we compute in the estimator be colocated with their respective ops. (Default: True)
batch_size: The size of the mini-batch. Only needed when momentum_type == 'qmodel' or when automatic adjustment is used. (Default: None)
placement_strategy: string, Device placement strategy used when creating covariance variables, covariance ops, and inverse ops. (Default: None)
**kwargs: Arguments to be passed to the specific placement strategy mixin. Check placement.RoundRobinPlacementMixin for an example.

Raises:

ValueError: If the momentum type is unsupported.
ValueError: If clipping is used with momentum type other than 'regular'.
ValueError: If no losses have been registered with layer_collection.
ValueError: If momentum is non-zero and momentum_type is not 'regular' or 'adam'.

apply_gradients
apply_gradients( grads_and_vars, *args, **kwargs )
Applies gradients to variables.
Args:

grads_and_vars: List of (gradient, variable) pairs.
*args: Additional arguments for super.apply_gradients.
**kwargs: Additional keyword arguments for super.apply_gradients.

Returns:

An Operation that applies the specified gradients.
compute_gradients
compute_gradients( *args, **kwargs )
Compute gradients of loss for the variables in var_list.

This is the first part of minimize(). It returns a list of (gradient, variable) pairs where "gradient" is the gradient for "variable". Note that "gradient" can be a Tensor, an IndexedSlices, or None if there is no gradient for the given variable.
Args:

loss: A Tensor containing the value to minimize or a callable taking no arguments which returns the value to minimize. When eager execution is enabled it must be a callable.
var_list: Optional list or tuple of tf.Variable to update to minimize loss. Defaults to the list of variables collected in the graph under the key GraphKeys.TRAINABLE_VARIABLES.
gate_gradients: How to gate the computation of gradients. Can be GATE_NONE, GATE_OP, or GATE_GRAPH.
aggregation_method: Specifies the method used to combine gradient terms. Valid values are defined in the class AggregationMethod.
colocate_gradients_with_ops: If True, try colocating gradients with the corresponding op.
grad_loss: Optional. A Tensor holding the gradient computed for loss.

Returns:

A list of (gradient, variable) pairs. Variable is always present, but gradient can be None.

Raises:

TypeError: If var_list contains anything else than Variable objects.
ValueError: If some arguments are invalid.
RuntimeError: If called with eager execution enabled and loss is not callable.

Eager Compatibility

When eager execution is enabled, gate_gradients, aggregation_method, and colocate_gradients_with_ops are ignored.
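As an illustration (a sketch, not from the page; optimizer and loss are assumed to be defined as in the earlier example), the pairs returned by compute_gradients() can be processed before being handed to apply_gradients(), here with standard global-norm clipping:

grads_and_vars = optimizer.compute_gradients(loss)
grads, variables = zip(*grads_and_vars)

# Example processing step: clip the gradients before applying them.
clipped_grads, _ = tf.clip_by_global_norm(grads, clip_norm=5.0)

train_op = optimizer.apply_gradients(list(zip(clipped_grads, variables)))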
create_ops_and_vars_thunks
create_ops_and_vars_thunks()
Create thunks that make the ops and vars on demand.
This function returns 4 lists of thunks: cov_variable_thunks, cov_update_thunks, inv_variable_thunks, and inv_update_thunks.
The length of each list is the number of factors and the i-th element of each list corresponds to the i-th factor (given by the "factors" property).
Note that the execution of these thunks must happen in a certain partial order. The i-th element of cov_variable_thunks must execute before the i-th element of cov_update_thunks (and also the i-th element of inv_update_thunks). Similarly, the i-th element of inv_variable_thunks must execute before the i-th element of inv_update_thunks.
TL;DR (oversimplified): Execute the thunks according to the order that they are returned.
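For instance (a sketch not taken from the page, and assuming each update thunk returns a single op), in graph mode the thunks could be executed in exactly the returned order:

(cov_variable_thunks, cov_update_thunks,
 inv_variable_thunks, inv_update_thunks) = optimizer.create_ops_and_vars_thunks()

# Create the cov variables, then the ops that update them.
for thunk in cov_variable_thunks:
  thunk()
cov_update_ops = [thunk() for thunk in cov_update_thunks]

# Create the inv variables, then the ops that update them.
for thunk in inv_variable_thunks:
  thunk()
inv_update_ops = [thunk() for thunk in inv_update_thunks]

cov_update_op = tf.group(*cov_update_ops)
inv_update_op = tf.group(*inv_update_ops)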
Returns:

cov_variable_thunks: A list of thunks that make the cov variables.
cov_update_thunks: A list of thunks that make the cov update ops.
inv_variable_thunks: A list of thunks that make the inv variables.
inv_update_thunks: A list of thunks that make the inv update ops.

get_name
get_name()
get_slot
get_slot( var, name )
Return a slot named name created for var by the Optimizer.

Some Optimizer subclasses use additional variables. For example Momentum and Adagrad use variables to accumulate updates. This method gives access to these Variable objects if for some reason you need them.

Use get_slot_names() to get the list of slot names created by the Optimizer.

Args:

var: A variable passed to minimize() or apply_gradients().
name: A string.

Returns:

The Variable for the slot if it was created, None otherwise.
get_slot_names
get_slot_names()
Return a list of the names of slots created by the Optimizer.

See get_slot().

Returns:

A list of strings.
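For illustration (a sketch; slot variables are only created once apply_gradients() or minimize() has been called), the two methods can be combined to inspect whatever slots the optimizer created:

for slot_name in optimizer.get_slot_names():
  for var in tf.trainable_variables():
    slot = optimizer.get_slot(var, slot_name)
    if slot is not None:
      print(slot_name, var.op.name, slot.get_shape())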
make_ops_and_vars
make_ops_and_vars()
Make ops and vars with device placement self._placement_strategy.

See FisherEstimator.make_ops_and_vars for details.

Returns:

cov_update_ops: List of ops that compute the cov updates. Corresponds one-to-one with the list of factors given by the "factors" property.
cov_update_op: cov_update_ops grouped into a single op.
inv_update_ops: List of ops that compute the inv updates. Corresponds one-to-one with the list of factors given by the "factors" property.
inv_update_op: inv_update_ops grouped into a single op.

make_vars_and_create_op_thunks
make_vars_and_create_op_thunks()
Make vars and create op thunks.
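One possible way to consume the result (a sketch, assuming each thunk returns a single op and that optimizer is a KfacOptimizer instance); the two returned lists are described below:

cov_update_thunks, inv_update_thunks = optimizer.make_vars_and_create_op_thunks()

# Calling a thunk builds the update op for the corresponding factor.
cov_update_op = tf.group(*[thunk() for thunk in cov_update_thunks])
inv_update_op = tf.group(*[thunk() for thunk in inv_update_thunks])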
Returns:

cov_update_thunks: List of cov update thunks. Corresponds one-to-one with the list of factors given by the "factors" property.
inv_update_thunks: List of inv update thunks. Corresponds one-to-one with the list of factors given by the "factors" property.

minimize
minimize( *args, **kwargs )
Add operations to minimize loss by updating var_list.

This method simply combines calls to compute_gradients() and apply_gradients(). If you want to process the gradients before applying them, call compute_gradients() and apply_gradients() explicitly instead of using this function.
Args:

loss: A Tensor containing the value to minimize.
global_step: Optional Variable to increment by one after the variables have been updated.
var_list: Optional list or tuple of Variable objects to update to minimize loss. Defaults to the list of variables collected in the graph under the key GraphKeys.TRAINABLE_VARIABLES.
gate_gradients: How to gate the computation of gradients. Can be GATE_NONE, GATE_OP, or GATE_GRAPH.
aggregation_method: Specifies the method used to combine gradient terms. Valid values are defined in the class AggregationMethod.
colocate_gradients_with_ops: If True, try colocating gradients with the corresponding op.
name: Optional name for the returned operation.
grad_loss: Optional. A Tensor holding the gradient computed for loss.

Returns:

An Operation that updates the variables in var_list. If global_step was not None, that operation also increments global_step.

Raises:

ValueError: If some of the variables are not Variable objects.

Eager Compatibility

When eager execution is enabled, loss should be a Python function that takes elements of var_list as arguments and computes the value to be minimized. If var_list is None, loss should take no arguments. Minimization (and gradient computation) is done with respect to the elements of var_list if not None, else with respect to any trainable variables created during the execution of the loss function. gate_gradients, aggregation_method, colocate_gradients_with_ops and grad_loss are ignored when eager execution is enabled.
set_damping_adaptation_params
set_damping_adaptation_params( is_chief, prev_train_batch, loss_fn, min_damping=1e-05, damping_adaptation_decay=0.99, damping_adaptation_interval=5 )
Sets parameters required to adapt damping during training.
When called, enables damping adaptation according to the Levenberg-Marquardt style rule described in Section 6.5 of "Optimizing Neural Networks with Kronecker-factored Approximate Curvature".
Note that this function creates TensorFlow variables which store a few scalars and are accessed by the ops which update the damping (as part of the training op returned by the minimize() method).
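For example (a sketch; prev_batch and batch_loss_fn are hypothetical stand-ins for the user's previous-batch tensor and loss-construction function), adaptation is configured before building the training op so that the damping-update ops become part of it:

optimizer.set_damping_adaptation_params(
    is_chief=True,
    prev_train_batch=prev_batch,     # hypothetical: tensor holding the previous training batch
    loss_fn=batch_loss_fn,           # hypothetical: maps a batch to a scalar loss
    min_damping=1e-5,
    damping_adaptation_interval=5)

train_op = optimizer.minimize(loss)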
Args:

is_chief: Boolean, True if the worker is chief.
prev_train_batch: Training data used to minimize loss in the previous step. This will be used to evaluate loss by calling loss_fn(prev_train_batch).
loss_fn: function that takes as input training data tensor and returns a scalar loss.
min_damping: float (Optional), Minimum value the damping parameter can take. Default value 1e-5.
damping_adaptation_decay: float (Optional), The damping parameter is multiplied by the damping_adaptation_decay every damping_adaptation_interval number of iterations. Default value 0.99.
damping_adaptation_interval: int (Optional), Number of steps in between updating the damping parameter. Default value 5.

Raises:

ValueError: If set_damping_adaptation_params is already called and the adapt_damping is True.

Class Members

GATE_GRAPH
GATE_NONE
GATE_OP
© 2018 The TensorFlow Authors. All rights reserved.
Licensed under the Creative Commons Attribution License 3.0.
Code samples licensed under the Apache 2.0 License.
https://www.tensorflow.org/api_docs/python/tf/contrib/kfac/optimizer/KfacOptimizer