
Layer normalization layer (Ba et al., 2016).

Inherits From: `Layer`

```python
tf.keras.layers.LayerNormalization(
    axis=-1, epsilon=0.001, center=True, scale=True,
    beta_initializer='zeros', gamma_initializer='ones',
    beta_regularizer=None, gamma_regularizer=None,
    beta_constraint=None, gamma_constraint=None,
    trainable=True, name=None, **kwargs
)
```

Normalizes the activations of the previous layer for each given example in a batch independently, rather than across a batch like Batch Normalization. That is, it applies a transformation that keeps the mean activation within each example close to 0 and the activation standard deviation close to 1.

Given a tensor `inputs`, moments are calculated and normalization is performed across the axes specified in `axis`.

```python
>>> import numpy as np
>>> import tensorflow as tf
>>> data = tf.constant(np.arange(10).reshape(5, 2) * 10, dtype=tf.float32)
>>> print(data)
tf.Tensor(
[[ 0. 10.]
 [20. 30.]
 [40. 50.]
 [60. 70.]
 [80. 90.]], shape=(5, 2), dtype=float32)
```

```python
>>> layer = tf.keras.layers.LayerNormalization(axis=1)
>>> output = layer(data)
>>> print(output)
tf.Tensor(
[[-1.  1.]
 [-1.  1.]
 [-1.  1.]
 [-1.  1.]
 [-1.  1.]], shape=(5, 2), dtype=float32)
```

Notice that with Layer Normalization the normalization happens across the axes *within* each example, rather than across different examples in the batch.
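As a quick sanity check (a minimal sketch, not part of the original example), we can confirm the per-example statistics directly:

```python
import numpy as np
import tensorflow as tf

data = tf.constant(np.arange(10).reshape(5, 2) * 10, dtype=tf.float32)
output = tf.keras.layers.LayerNormalization(axis=1)(data)

# Each row (example) is normalized independently, so the per-row mean is
# ~0 and the per-row standard deviation is ~1 (very slightly below 1
# because of the epsilon term added to the variance).
print(tf.reduce_mean(output, axis=1))      # ~[0. 0. 0. 0. 0.]
print(tf.math.reduce_std(output, axis=1))  # ~[1. 1. 1. 1. 1.]
```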

If `scale` or `center` are enabled, the layer will scale the normalized outputs by broadcasting them with a trainable variable `gamma`, and center the outputs by broadcasting with a trainable variable `beta`. `gamma` will default to a ones tensor and `beta` will default to a zeros tensor, so that centering and scaling are no-ops before training has begun.
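To see these defaults concretely (a minimal sketch; the feature size of 4 is arbitrary):

```python
import tensorflow as tf

layer = tf.keras.layers.LayerNormalization(axis=-1)
layer.build((None, 4))  # gamma and beta span the last (feature) axis

print(layer.gamma.numpy())  # [1. 1. 1. 1.] -> scaling starts as a no-op
print(layer.beta.numpy())   # [0. 0. 0. 0.] -> centering starts as a no-op
```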

So, with scaling and centering enabled, the normalization equations are as follows. Let the intermediate activations for a mini-batch be the `inputs`. For each sample `x_i` in `inputs` with `k` features, we compute the mean and variance of the sample:

```python
mean_i = sum(x_i[j] for j in range(k)) / k
var_i = sum((x_i[j] - mean_i) ** 2 for j in range(k)) / k
```

and then compute a normalized `x_i_normalized`, including a small factor `epsilon` for numerical stability:

```python
x_i_normalized = (x_i - mean_i) / sqrt(var_i + epsilon)
```

And finally, `x_i_normalized` is linearly transformed by `gamma` and `beta`, which are learned parameters:

```python
output_i = x_i_normalized * gamma + beta
```
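Putting the three equations together, a plain NumPy version of the computation (a minimal sketch assuming the default `gamma=1`, `beta=0`, and `epsilon=1e-3`) reproduces the layer's output:

```python
import numpy as np
import tensorflow as tf

x = np.arange(10, dtype=np.float32).reshape(5, 2) * 10
epsilon = 1e-3

# Per-sample moments over the feature axis.
mean = x.mean(axis=1, keepdims=True)
var = x.var(axis=1, keepdims=True)
x_normalized = (x - mean) / np.sqrt(var + epsilon)

# With the default gamma (ones) and beta (zeros), the affine step is a no-op.
output = x_normalized * 1.0 + 0.0

tf_output = tf.keras.layers.LayerNormalization(axis=1)(x)
print(np.allclose(output, tf_output.numpy(), atol=1e-5))  # True
```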

`gamma` and `beta` will span the axes of `inputs` specified in `axis`, and this part of the inputs' shape must be fully defined.

```python
>>> layer = tf.keras.layers.LayerNormalization(axis=[1, 2, 3])
>>> layer.build([5, 20, 30, 40])
>>> print(layer.beta.shape)
(20, 30, 40)
>>> print(layer.gamma.shape)
(20, 30, 40)
```

Note that other implementations of layer normalization may choose to define `gamma` and `beta` over a separate set of axes from the axes being normalized across. For example, Group Normalization (Wu et al. 2018) with group size of 1 corresponds to a Layer Normalization that normalizes across height, width, and channel and has `gamma` and `beta` span only the channel dimension. So, this Layer Normalization implementation will not match a Group Normalization layer with group size set to 1.
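To make the contrast concrete, here is a minimal sketch (shapes are illustrative) showing where the trainable parameters live in each case:

```python
import tensorflow as tf

# This LayerNormalization normalizes over (height, width, channels) and its
# gamma/beta span all three of those axes.
ln = tf.keras.layers.LayerNormalization(axis=[1, 2, 3])
ln.build([None, 20, 30, 40])
print(ln.gamma.shape)  # (20, 30, 40)

# A Group Normalization layer with group size 1 would normalize over the same
# (height, width, channels) axes, but its gamma/beta would span only the
# channel axis, i.e. have shape (40,). Once gamma and beta are trained, the
# two layers are therefore not interchangeable.
```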

Arguments | |
---|---|
`axis` | Integer or List/Tuple. The axis or axes to normalize across. Typically this is the features axis/axes. The left-out axes are typically the batch axis/axes. This argument defaults to `-1`, the last dimension in the input. |
`epsilon` | Small float added to the variance to avoid dividing by zero. Defaults to `1e-3`. |
`center` | If `True`, add offset of `beta` to the normalized tensor. If `False`, `beta` is ignored. Defaults to `True`. |
`scale` | If `True`, multiply by `gamma`. If `False`, `gamma` is not used. Defaults to `True`. When the next layer is linear (also e.g. `nn.relu`), this can be disabled since the scaling will be done by the next layer. |
`beta_initializer` | Initializer for the beta weight. Defaults to `'zeros'`. |
`gamma_initializer` | Initializer for the gamma weight. Defaults to `'ones'`. |
`beta_regularizer` | Optional regularizer for the beta weight. `None` by default. |
`gamma_regularizer` | Optional regularizer for the gamma weight. `None` by default. |
`beta_constraint` | Optional constraint for the beta weight. `None` by default. |
`gamma_constraint` | Optional constraint for the gamma weight. `None` by default. |
`trainable` | Boolean; if `True`, the variables will be marked as trainable. Defaults to `True`. |
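For illustration, a constructor call exercising several of these arguments might look like the following (a sketch; the specific regularizer and constraint choices are arbitrary):

```python
import tensorflow as tf

layer = tf.keras.layers.LayerNormalization(
    axis=-1,
    epsilon=1e-3,
    center=True,
    scale=True,
    # Penalize large offsets and bound the scale, purely as an example.
    beta_regularizer=tf.keras.regularizers.l2(1e-4),
    gamma_constraint=tf.keras.constraints.MaxNorm(2.0),
)
```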

Input shape: Arbitrary. Use the keyword argument `input_shape` (tuple of integers, does not include the samples axis) when using this layer as the first layer in a model, as in the sketch below.

Output shape: Same shape as input.

Reference: [Lei Ba et al., 2016](https://arxiv.org/abs/1607.06450).
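As a usage sketch (the surrounding model architecture is chosen only for illustration), the layer can be placed first in a `Sequential` model via `input_shape`:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # input_shape excludes the batch (samples) axis.
    tf.keras.layers.LayerNormalization(input_shape=(20,)),
    tf.keras.layers.Dense(10),
])
model.summary()  # the normalization layer's output shape matches its input
```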
