RunConfig
Defined in tensorflow/python/estimator/run_config.py.
This class specifies the configurations for an Estimator run.
cluster_spec
evaluation_master
global_id_in_cluster
The global id in the training cluster.
All global ids in the training cluster are assigned from an increasing sequence of consecutive integers. The first id is 0.
Note: Task id (the property field task_id) tracks the index of the node among all nodes with the SAME task type. For example, given the following cluster definition:
cluster = {'chief': ['host0:2222'],
           'ps': ['host1:2222', 'host2:2222'],
           'worker': ['host3:2222', 'host4:2222', 'host5:2222']}
Nodes with task type worker can have id 0, 1, 2. Nodes with task type ps can have id 0, 1. So, task_id is not unique, but the pair (task_type, task_id) uniquely determines a node in the cluster.
Global id, i.e., this field, is tracking the index of the node among ALL nodes in the cluster. It is uniquely assigned. For example, for the cluster spec given above, the global ids are assigned as:
task_type | task_id | global_id
--------------------------------
chief     | 0       | 0
worker    | 0       | 1
worker    | 1       | 2
worker    | 2       | 3
ps        | 0       | 4
ps        | 1       | 5
Returns: An integer id.
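The assignment in the table above can be reproduced with a few lines of plain Python. This is only a sketch: the function name assign_global_ids is hypothetical, and the ordering of task types (chief, then worker, then ps) is an assumption read off the table, not a documented guarantee.

```python
def assign_global_ids(cluster, type_order=("chief", "worker", "ps")):
    """Return (task_type, task_id, global_id) rows for every node in `cluster`.

    `type_order` is an assumption inferred from the documented table;
    the real implementation may order task types differently.
    """
    rows = []
    next_id = 0  # global ids are consecutive integers starting at 0
    for task_type in type_order:
        for task_id, _address in enumerate(cluster.get(task_type, [])):
            rows.append((task_type, task_id, next_id))
            next_id += 1
    return rows


cluster = {
    "chief": ["host0:2222"],
    "ps": ["host1:2222", "host2:2222"],
    "worker": ["host3:2222", "host4:2222", "host5:2222"],
}
rows = assign_global_ids(cluster)
for row in rows:
    print("%-6s | %d | %d" % row)
```

Note that task_id restarts at 0 for each task type, while global_id keeps counting across types.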
is_chief
keep_checkpoint_every_n_hours
keep_checkpoint_max
log_step_count_steps
master
model_dir
num_ps_replicas
num_worker_replicas
save_checkpoints_secs
save_checkpoints_steps
save_summary_steps
service
Returns the platform defined (in TF_CONFIG) service dict.
session_config
task_id
task_type
tf_random_seed
train_distribute
Returns the optional tf.contrib.distribute.DistributionStrategy object.
__init__
__init__(
    model_dir=None,
    tf_random_seed=None,
    save_summary_steps=100,
    save_checkpoints_steps=_USE_DEFAULT,
    save_checkpoints_secs=_USE_DEFAULT,
    session_config=None,
    keep_checkpoint_max=5,
    keep_checkpoint_every_n_hours=10000,
    log_step_count_steps=100,
    train_distribute=None
)
Constructs a RunConfig.
All distributed training related properties (cluster_spec, is_chief, master, num_worker_replicas, num_ps_replicas, task_id, and task_type) are set based on the TF_CONFIG environment variable, if the pertinent information is present. The TF_CONFIG environment variable is a JSON object with two attributes: cluster and task.
cluster is a JSON serialized version of ClusterSpec's Python dict from server_lib.py, mapping task types (usually one of the TaskType enums) to a list of task addresses.
task has two attributes: type and index, where type can be any of the task types in cluster. When TF_CONFIG contains said information, the following properties are set on this class:
cluster_spec is parsed from TF_CONFIG['cluster']. Defaults to {}. If present, must have one and only one node in the chief attribute of cluster_spec.
task_type is set to TF_CONFIG['task']['type']. Must be set if cluster_spec is present; must be worker (the default value) if cluster_spec is not set.
task_id is set to TF_CONFIG['task']['index']. Must be set if cluster_spec is present; must be 0 (the default value) if cluster_spec is not set.
master is determined by looking up task_type and task_id in the cluster_spec. Defaults to ''.
num_ps_replicas is set by counting the number of nodes listed in the ps attribute of cluster_spec. Defaults to 0.
num_worker_replicas is set by counting the number of nodes listed in the worker and chief attributes of cluster_spec. Defaults to 1.
is_chief is determined based on task_type and cluster.
There is a special node with task_type as evaluator, which is not part of the (training) cluster_spec. It handles the distributed evaluation job.
Example of non-chief node:
import json
import os
from tensorflow.python.estimator.run_config import RunConfig
from tensorflow.python.training import server_lib

cluster = {'chief': ['host0:2222'],
           'ps': ['host1:2222', 'host2:2222'],
           'worker': ['host3:2222', 'host4:2222', 'host5:2222']}
os.environ['TF_CONFIG'] = json.dumps(
    {'cluster': cluster,
     'task': {'type': 'worker', 'index': 1}})
config = RunConfig()
assert config.master == 'host4:2222'
assert config.task_id == 1
assert config.num_ps_replicas == 2
assert config.num_worker_replicas == 4
assert config.cluster_spec == server_lib.ClusterSpec(cluster)
assert config.task_type == 'worker'
assert not config.is_chief
Example of chief node:
cluster = {'chief': ['host0:2222'],
           'ps': ['host1:2222', 'host2:2222'],
           'worker': ['host3:2222', 'host4:2222', 'host5:2222']}
os.environ['TF_CONFIG'] = json.dumps(
    {'cluster': cluster,
     'task': {'type': 'chief', 'index': 0}})
config = RunConfig()
assert config.master == 'host0:2222'
assert config.task_id == 0
assert config.num_ps_replicas == 2
assert config.num_worker_replicas == 4
assert config.cluster_spec == server_lib.ClusterSpec(cluster)
assert config.task_type == 'chief'
assert config.is_chief
Example of evaluator node (evaluator is not part of training cluster):
cluster = {'chief': ['host0:2222'],
           'ps': ['host1:2222', 'host2:2222'],
           'worker': ['host3:2222', 'host4:2222', 'host5:2222']}
os.environ['TF_CONFIG'] = json.dumps(
    {'cluster': cluster,
     'task': {'type': 'evaluator', 'index': 0}})
config = RunConfig()
assert config.master == ''
# The property is named evaluation_master (see the property list).
assert config.evaluation_master == ''
assert config.task_id == 0
assert config.num_ps_replicas == 0
assert config.num_worker_replicas == 0
assert config.cluster_spec == {}
assert config.task_type == 'evaluator'
assert not config.is_chief
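The derivation rules listed above can also be followed without TensorFlow installed. The sketch below is a plain-Python approximation of those rules, not the library's implementation: derive_run_properties is a hypothetical helper, it returns an ordinary dict instead of a RunConfig, it treats a process with no cluster as its own chief (an assumption), and it reports master exactly as the examples in this section show it.

```python
import json


def derive_run_properties(tf_config_json):
    """Apply the documented TF_CONFIG rules and return a plain dict."""
    tf_config = json.loads(tf_config_json or "{}")
    cluster = tf_config.get("cluster", {})
    task = tf_config.get("task", {})
    task_type = task.get("type", "worker")  # documented default
    task_id = int(task.get("index", 0))     # documented default

    # Documented defaults when no cluster information is present.
    props = {
        "cluster_spec": {},
        "task_type": task_type,
        "task_id": task_id,
        "master": "",
        "num_ps_replicas": 0,
        "num_worker_replicas": 1,
        # Assumption: a lone local process acts as its own chief.
        "is_chief": True,
    }
    if task_type == "evaluator":
        # The evaluator is not part of the training cluster.
        props.update(num_worker_replicas=0, is_chief=False)
    elif cluster:
        props.update(
            cluster_spec=cluster,
            master=cluster[task_type][task_id],
            num_ps_replicas=len(cluster.get("ps", [])),
            # Workers and the chief both count as worker replicas.
            num_worker_replicas=(len(cluster.get("worker", []))
                                 + len(cluster.get("chief", []))),
            is_chief=(task_type == "chief"),
        )
    return props


cluster = {
    "chief": ["host0:2222"],
    "ps": ["host1:2222", "host2:2222"],
    "worker": ["host3:2222", "host4:2222", "host5:2222"],
}
worker_props = derive_run_properties(
    json.dumps({"cluster": cluster, "task": {"type": "worker", "index": 1}}))
print(worker_props["master"], worker_props["num_worker_replicas"])
```

Returning a dict keeps the sketch independent of any TensorFlow version; the real class exposes the same values as read-only properties.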
N.B.: If save_checkpoints_steps or save_checkpoints_secs is set, keep_checkpoint_max might need to be adjusted accordingly, especially in distributed training. For example, setting save_checkpoints_secs to 60 without adjusting keep_checkpoint_max (defaults to 5) means that a checkpoint is garbage collected 5 minutes after it is written. In distributed training, the evaluation job starts asynchronously and might fail to find or load the checkpoint because of this race condition.
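The arithmetic behind this note is simple enough to check by hand. The following sketch uses two hypothetical helpers (neither is part of TensorFlow) to estimate how long a checkpoint survives and how large keep_checkpoint_max would need to be; needed_secs, the time an evaluation job needs a checkpoint to remain on disk, is an assumed input that depends on your setup.

```python
import math


def checkpoint_lifetime_secs(save_checkpoints_secs, keep_checkpoint_max):
    """Approximate time a checkpoint exists before it is garbage collected.

    Returns None when keep_checkpoint_max is None or 0, because all
    checkpoint files are kept in that case.
    """
    if not keep_checkpoint_max:
        return None
    return save_checkpoints_secs * keep_checkpoint_max


def min_keep_checkpoint_max(save_checkpoints_secs, needed_secs):
    """Smallest keep_checkpoint_max that keeps a checkpoint for needed_secs."""
    return int(math.ceil(needed_secs / float(save_checkpoints_secs)))


lifetime = checkpoint_lifetime_secs(save_checkpoints_secs=60,
                                    keep_checkpoint_max=5)
print(lifetime)  # 300 seconds, i.e. the 5 minutes mentioned above
```

If an evaluation pass can take up to 30 minutes, min_keep_checkpoint_max(60, 1800) suggests keeping at least 30 checkpoints.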
Args:
model_dir: directory where model parameters, graph, etc. are saved. If PathLike object, the path will be resolved. If None, will use a default value set by the Estimator.
tf_random_seed: Random seed for TensorFlow initializers. Setting this value allows consistency between reruns.
save_summary_steps: Save summaries every this many steps.
save_checkpoints_steps: Save checkpoints every this many steps. Cannot be specified with save_checkpoints_secs.
save_checkpoints_secs: Save checkpoints every this many seconds. Cannot be specified with save_checkpoints_steps. Defaults to 600 seconds if neither save_checkpoints_steps nor save_checkpoints_secs is set in the constructor. If both save_checkpoints_steps and save_checkpoints_secs are None, then checkpoints are disabled.
session_config: a ConfigProto used to set session parameters, or None.
keep_checkpoint_max: The maximum number of recent checkpoint files to keep. As new files are created, older files are deleted. If None or 0, all checkpoint files are kept. Defaults to 5 (that is, the 5 most recent checkpoint files are kept).
keep_checkpoint_every_n_hours: Number of hours between each checkpoint to be saved. The default value of 10,000 hours effectively disables the feature.
log_step_count_steps: The frequency, in number of global steps, that the global step/sec and the loss will be logged during training.
train_distribute: an optional instance of tf.contrib.distribute.DistributionStrategy. If specified, then Estimator will distribute the user's model during training, according to the policy specified by that strategy.
Raises:
ValueError: If both save_checkpoints_steps and save_checkpoints_secs are set.
replace
replace(**kwargs)
Returns a new instance of RunConfig replacing specified properties.
Only the properties in the following list are allowed to be replaced:
model_dir, tf_random_seed, save_summary_steps, save_checkpoints_steps, save_checkpoints_secs, session_config, keep_checkpoint_max, keep_checkpoint_every_n_hours, log_step_count_steps, train_distribute.
In addition, either save_checkpoints_steps or save_checkpoints_secs can be set (should not be both).
Args:
**kwargs: keyword named properties with new values.
Raises:
ValueError: If any property name in kwargs does not exist or is not allowed to be replaced, or both save_checkpoints_steps and save_checkpoints_secs are set.
Returns:
a new instance of RunConfig.
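The contract of replace (an allow-list of property names, a copy instead of in-place mutation, and a ValueError when both checkpoint triggers are given) can be illustrated with a small stand-in class. SketchConfig below is hypothetical and is not the TensorFlow class. The defaults are taken from the signature and argument descriptions on this page. Clearing the other checkpoint trigger when one is replaced is an assumption of this sketch, not something the text above confirms.

```python
import copy

# Property names the documentation lists as replaceable.
_REPLACEABLE = (
    "model_dir", "tf_random_seed", "save_summary_steps",
    "save_checkpoints_steps", "save_checkpoints_secs", "session_config",
    "keep_checkpoint_max", "keep_checkpoint_every_n_hours",
    "log_step_count_steps", "train_distribute",
)


class SketchConfig(object):
    """Hypothetical stand-in for RunConfig, for illustration only."""

    def __init__(self):
        self.props = {
            "model_dir": None,
            "tf_random_seed": None,
            "save_summary_steps": 100,
            "save_checkpoints_steps": None,
            "save_checkpoints_secs": 600,  # documented fallback default
            "session_config": None,
            "keep_checkpoint_max": 5,
            "keep_checkpoint_every_n_hours": 10000,
            "log_step_count_steps": 100,
            "train_distribute": None,
        }

    def replace(self, **kwargs):
        for name in kwargs:
            if name not in _REPLACEABLE:
                raise ValueError("Replacing %r is not supported." % name)
        steps = kwargs.get("save_checkpoints_steps")
        secs = kwargs.get("save_checkpoints_secs")
        if steps is not None and secs is not None:
            raise ValueError("Set save_checkpoints_steps or "
                             "save_checkpoints_secs, not both.")
        new_config = copy.deepcopy(self)  # the original stays untouched
        new_config.props.update(kwargs)
        # Assumption of this sketch: picking one trigger clears the other.
        if steps is not None:
            new_config.props["save_checkpoints_secs"] = None
        elif secs is not None:
            new_config.props["save_checkpoints_steps"] = None
        return new_config


base = SketchConfig()
tuned = base.replace(save_checkpoints_steps=500, keep_checkpoint_max=10)
print(tuned.props["save_checkpoints_steps"], base.props["save_checkpoints_secs"])
```

Because replace returns a copy, a shared base configuration can be specialised per job without one job's changes leaking into another.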
© 2018 The TensorFlow Authors. All rights reserved.
Licensed under the Creative Commons Attribution License 3.0.
Code samples licensed under the Apache 2.0 License.
https://www.tensorflow.org/api_docs/python/tf/estimator/RunConfig