Callback to back up and restore the training state.
Inherits From: Callback
tf.keras.callbacks.experimental.BackupAndRestore( backup_dir )
BackupAndRestore
callback is intended to recover from interruptions that happened in the middle of a model.fit execution by backing up the training states in a temporary checkpoint file (based on TF CheckpointManager) at the end of each epoch. If training restarted before completion, the training state and model are restored to the most recently saved state at the beginning of a new model.fit() run. Note that user is responsible to bring jobs back up. This callback is important for the backup and restore mechanism for fault tolerance purpose. And the model to be restored from an previous checkpoint is expected to be the same as the one used to back up. If user changes arguments passed to compile or fit, the checkpoint saved for fault tolerance can become invalid.
class InterruptingCallback(tf.keras.callbacks.Callback): def on_epoch_begin(self, epoch, logs=None): if epoch == 4: raise RuntimeError('Interrupting!') callback = tf.keras.callbacks.experimental.BackupAndRestore( backup_dir="/tmp") model = tf.keras.models.Sequential([tf.keras.layers.Dense(10)]) model.compile(tf.keras.optimizers.SGD(), loss='mse') try: model.fit(np.arange(100).reshape(5, 20), np.zeros(5), epochs=10, batch_size=1, callbacks=[callback, InterruptingCallback()], verbose=0) except: pass history = model.fit(np.arange(100).reshape(5, 20), np.zeros(5), epochs=10, batch_size=1, callbacks=[callback], verbose=0) # Only 6 more epochs are run, since first trainning got interrupted at # zero-indexed epoch 4, second training will continue from 4 to 9. len(history.history['loss']) 6
Arguments | |
---|---|
backup_dir | String, path to save the model file. This is the directory in which the system stores temporary files to recover the model from jobs terminated unexpectedly. The directory cannot be reused elsewhere to store other checkpoints, e.g. by BackupAndRestore callback of another training, or by another callback (ModelCheckpoint) of the same training. |
set_model
set_model( model )
set_params
set_params( params )
© 2020 The TensorFlow Authors. All rights reserved.
Licensed under the Creative Commons Attribution License 3.0.
Code samples licensed under the Apache 2.0 License.
https://www.tensorflow.org/versions/r2.4/api_docs/python/tf/keras/callbacks/experimental/BackupAndRestore