Francis Benistant
Mar 14, 2024


Deep Dive into Deep Learning: Layers, RMSNorm, and Batch Normalization

Introduction:

In the realm of deep learning, normalization techniques play a crucial role in stabilizing and accelerating the training process. These techniques help mitigate the issues of vanishing or exploding gradients, enabling neural networks to converge faster and produce more reliable results. Among the various normalization methods, three prominent approaches stand out: Batch Normalization (BN), Layer Normalization, and RMSNorm.

Layer Normalization:

Layer Normalization (LN) is a normalization technique proposed by Jimmy Lei Ba et al. in 2016, offering an alternative to batch normalization (BN). Unlike BN, which normalizes across the mini-batch dimension, LN normalizes the activations of each layer across the feature dimension. This means that LN computes the mean and variance over the features of each training example independently, making it suitable for scenarios where the mini-batch size is small or inconsistent. LN also eliminates the dependency on mini-batch statistics, making it well suited for recurrent neural networks (RNNs) and other architectures where batch sizes vary.

RMSNorm:

Root Mean Square Normalization (RMSNorm) is a relatively novel normalization technique introduced by Biao Zhang and Rico Sennrich in 2019. Unlike BN and LN, RMSNorm normalizes activations based on the root mean square of the activations themselves, rather than using mini-batch or per-layer mean and variance statistics. This approach ensures that the activations are consistently scaled regardless of the mini-batch size or the number of features. Additionally, RMSNorm introduces learnable scale parameters, offering similar adaptability to BN.

Batch Normalization:

Batch Normalization (BN) is perhaps one of the most widely used normalization techniques in deep learning. Introduced by Sergey Ioffe and Christian Szegedy in 2015, BN operates by normalizing the activations of each layer across the mini-batch during training. This ensures that the distribution of inputs remains stable throughout the network, leading to faster convergence and improved generalization. BN also introduces learnable parameters, allowing the model to adaptively scale and shift the normalized activations, further enhancing its flexibility and performance.

Layer Normalization

Layer normalization is a technique used in deep learning to stabilize the training of neural networks. It works by normalizing the inputs across the features for each training example. This contrasts with batch normalization, which normalizes across the batch dimension (i.e., different training examples). Layer normalization is particularly useful in recurrent neural networks (RNNs) and has also been successfully applied in transformers and other network architectures.

Here’s a more detailed explanation of how layer normalization works:

Computation:

For a given layer, layer normalization computes the mean and variance used for normalization from all of the summed inputs to the neurons in that layer for a single training example. It does this for each training example independently, rather than across the batch.

Normalization:

Once the mean and variance are computed, the layer normalization process normalizes each input for each neuron in the layer by subtracting the mean and dividing by the square root of the variance plus a small epsilon (to prevent division by zero). This results in a normalized output for each input.

Re-scaling and Re-centering:

After normalization, the outputs are scaled and shifted by two trainable parameters, γ (gamma) and β (beta), specific to each feature. This allows the network to undo the normalization if that is what the learned behavior requires. This step ensures that layer normalization can represent the identity transformation and can adjust the scale and location of the normalized values.

The key benefits of layer normalization include:

Stabilizes the training process:

By normalizing the inputs to each layer, it helps to reduce the internal covariate shift, which is the change in the distribution of network activations due to the update in network parameters during training.

Independence from batch size:

Since layer normalization does not rely on the batch dimension, it works well for models where batch size is a constraint or for architectures like RNNs where batch normalization is less effective.

Improves convergence:

Layer normalization can lead to faster convergence by smoothing the optimization landscape.

Versatility:

It can be applied to a wide range of network architectures, including those where batch normalization is less suitable.

Layer normalization has become a crucial component in many state-of-the-art deep learning models, especially in natural language processing (NLP) tasks and models like transformers, demonstrating its effectiveness in various contexts.

The normalization in layer normalization typically occurs before applying the activation function. Here’s the usual sequence of operations in a layer where layer normalization is applied:

  1. Linear Transformation: The input x is first processed through a linear transformation, for example Wx + b, where W is the weight matrix, x is the input vector, and b is the bias vector.
  2. Layer Normalization: After the linear transformation, layer normalization is applied to the output of this transformation. This step involves calculating the mean and variance of the transformed data across the features for each training example, normalizing these values, and then re-scaling and re-centering the normalized values using learned parameters γ (gamma) and β (beta).
  3. Activation Function: The output from the layer normalization step is then passed through an activation function (e.g., ReLU, sigmoid, tanh, etc.).

The rationale for placing layer normalization before the activation function is to stabilize the distribution of the inputs to the activation functions across different layers and training steps. This helps in controlling the exploding and vanishing gradients problem, making the training process more stable and efficient.

By normalizing the inputs to the activation functions, the network can also potentially learn faster and achieve better performance, as the inputs to the activation functions are more likely to fall within regions where gradients are neither too large nor too small.

For a given vector of inputs to a layer for a single training example, represented as x=[x_1, x_2, ..., x_N], the layer normalization of each element x_i is computed by the formula:

LN(x_i) = γ * ((x_i - μ) / sqrt(σ² + ε)) + β

Where:

  • LN(x_i) represents the layer-normalized output for the input element x_i.
  • μ is the average of the inputs x=[x_1, x_2, …, x_N].
  • σ² denotes the variance of the inputs.
  • ε is a small constant added to ensure numerical stability (preventing division by zero).
  • γ and β are parameters that are learned during training for each feature. They are used to scale and shift the normalized output, allowing the network to determine if it should utilize the normalized values or adjust them to achieve the best performance.

This process guarantees that, for each training example, the input values are standardized to have an average of 0 and a variance of 1 (via the ((x_i - μ) / sqrt(σ² + ε)) part), and subsequently re-scaled and offset using γ and β, which are learned independently for each feature.
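As a quick worked example (with made-up numbers, ignoring ε for readability), take x = [2, 4, 6]. Then μ = 4, σ² = ((2-4)² + (4-4)² + (6-4)²) / 3 = 8/3 ≈ 2.67, and sqrt(σ²) ≈ 1.63, so the normalized vector is approximately [-1.22, 0, 1.22]. With the default initial values γ = 1 and β = 0, the layer-normalized output is simply this normalized vector; during training, γ and β move away from these defaults if a different scale or offset helps.

In Keras, LayerNormalization is available as a built-in layer, so the full sequence (linear transformation, layer normalization, activation) can be written as: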

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LayerNormalization, Activation

# Define a simple Sequential model
model = Sequential()

# Add a Dense layer with LayerNormalization
model.add(Dense(64, input_shape=(784,))) # Assuming input shape of 784 (e.g., flattened MNIST images)
model.add(LayerNormalization()) # Apply layer normalization
model.add(Activation('relu')) # Then apply the activation function

# Add more layers as needed
model.add(Dense(32, activation='relu'))
model.add(Dense(10, activation='softmax')) # Output layer for 10 classes

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Model summary to check the architecture
model.summary()

In this example:

  • The model is a Sequential model, meaning that it has a linear stack of layers.
  • The first layer is a Dense layer with 64 units, which is the size of the output space. The input shape is set to (784,), which corresponds to flattened images (for example, from the MNIST dataset).
  • Right after the first Dense layer, LayerNormalization() is applied. This normalizes the inputs across the features for each data point in the batch before the activation function.
  • The Activation('relu') layer applies the ReLU activation function to the output of the layer normalization.
  • The model then adds another Dense layer with ReLU activation and finally an output Dense layer with 10 units and the softmax activation function, suitable for a classification task with 10 classes.
  • The model is compiled with the Adam optimizer, using categorical crossentropy as the loss function, which is typical for multi-class classification tasks.

This code demonstrates how to integrate layer normalization into a Keras model to potentially improve training stability and performance.

Here are some key points to consider when integrating layer normalization:

  1. After Linear Transformations: It’s common to apply layer normalization right after linear transformations (e.g., dense or convolutional layers) and before non-linear activations. This helps to ensure that the inputs to the activation functions have a stable distribution.
  2. Within Recurrent Networks: In recurrent neural networks (RNNs), including LSTMs (Long Short-Term Memory networks) and GRUs (Gated Recurrent Units), layer normalization can be applied to the inputs and recurrent connections to help mitigate issues related to training over long sequences.
  3. Before or After Activation: While it’s more typical to apply layer normalization before the activation function, some architectures might experiment with applying it after activation. The choice depends on the specific characteristics of the network and the problem it’s trying to solve.
  4. Custom Layers and Blocks: In more complex architectures or when designing custom layers, layer normalization can be a crucial component to ensure that the model trains effectively, especially in deep networks where gradient flow can be a challenge.
  5. Transformers: In transformer models, layer normalization is a critical component and is applied around both the multi-head attention and feed-forward sub-layers, either after the residual connection is added (post-LN, as in the original Transformer) or before each sub-layer (pre-LN, common in more recent models); a minimal pre-LN sketch follows below.

The flexibility of layer normalization allows it to be used in various network configurations, making it a powerful tool for improving model training and convergence across a wide range of deep learning tasks.
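To make the transformer case concrete, here is a minimal sketch of a pre-LN style self-attention sub-block in Keras, where layer normalization is applied before the attention operation and the residual connection is added afterwards. The class name and the hyperparameters (num_heads, key_dim) are illustrative, not taken from any particular model:

import tensorflow as tf
from tensorflow.keras.layers import Layer, LayerNormalization, MultiHeadAttention

# Illustrative pre-LN self-attention sub-block: normalize, attend, then add the residual
class PreLNSelfAttention(Layer):
    def __init__(self, num_heads=4, key_dim=32, **kwargs):
        super(PreLNSelfAttention, self).__init__(**kwargs)
        self.norm = LayerNormalization()
        self.attn = MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)

    def call(self, inputs):
        h = self.norm(inputs)   # layer normalization before the sub-layer (pre-LN)
        h = self.attn(h, h)     # self-attention on the normalized activations
        return inputs + h       # residual connection added after the sub-layer

# Applying the block to a dummy batch of 8 sequences of length 16 with 32 features
x = tf.random.normal((8, 16, 32))
y = PreLNSelfAttention()(x)

The same wrapper pattern is typically repeated around the feed-forward sub-layer.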

Layer normalization works by first normalizing the data to have a mean of 0 and a standard deviation of 1, and then it scales and shifts the normalized data using learned parameters γ (gamma) and β (beta).

Here’s a step-by-step breakdown of the process:

  1. Normalization: For a single input in the batch, calculate the mean (μ) and standard deviation (σ) of that input's features. Then, normalize each value x_i by subtracting the mean and dividing by the standard deviation, expressed as (x_i - μ)/σ. This step ensures that, for each input, the features have a mean of 0 and a standard deviation of 1.
  2. Scaling and Shifting: After normalization, each normalized value is scaled and shifted using two parameters specific to each feature: γ (gamma) for scaling and β (beta) for shifting. The formula for this step is γ * (x_i - μ)/σ + β. These parameters are learned during the training process along with the other parameters of the model.

This process is applied independently to each feature and each data point in the batch, ensuring that the normalization does not depend on the batch size. By normalizing the inputs to the activation functions in this way, layer normalization helps to stabilize and speed up the training process of deep neural networks, especially in cases where batch normalization might not be applicable or effective.

RMSNorm

RMSNorm (Root Mean Square Normalization) is another normalization technique that, like Layer Normalization, is designed to stabilize the training of deep neural networks. While both aim to normalize the activations within a layer, they differ in the specifics of their calculations and motivations. Here’s a breakdown of the differences between Layer Normalization (LayerNorm) and RMSNorm:

Calculation

  • Layer Normalization computes the mean and variance across all the features for a specific layer and normalizes the activations based on these statistics. After normalization, it scales and shifts the activations using learned parameters (γ and β).
  • RMSNorm, on the other hand, normalizes the activations by dividing them by the root mean square (RMS) of the activations for each layer. Unlike LayerNorm, RMSNorm typically does not center the activations (subtract the mean) before normalization. It does include scaling by a learnable parameter (similar to γ in LayerNorm), but there’s usually no shifting parameter (β). The RMS is calculated as the square root of the mean of the squares of the activations.

Motivation

  • Layer Normalization was designed to reduce the internal covariate shift by normalizing the inputs to each layer, which helps in stabilizing the gradients and improving the training of deep networks.
  • RMSNorm focuses on simplifying the normalization process and reducing the computational overhead associated with LayerNorm. By avoiding the calculation of the mean and variance, RMSNorm aims to provide a faster and potentially more scalable alternative for normalization. Additionally, RMSNorm was motivated by the observation that the scaling factor (i.e., the RMS value) is crucial for stabilizing the norm of the gradients, which can be beneficial for training deep networks.

Use Cases

  • Layer Normalization is widely used across various types of neural networks, including RNNs, LSTMs, GRUs, and Transformers. Its ability to normalize the activations independently of the batch size makes it particularly useful for tasks where batch sizes are small or vary.
  • RMSNorm may be preferred in scenarios where computational efficiency is crucial, and the overhead of computing the exact mean and variance for normalization could be prohibitive. It can be particularly useful in training deep learning models where reducing computational complexity without significantly sacrificing performance is desired.

In summary, while both LayerNorm and RMSNorm aim to stabilize the training of neural networks by normalizing activations, they differ in their approach to normalization, computational complexity, and specific use cases. RMSNorm offers a computationally simpler alternative to LayerNorm by focusing on the root mean square of activations without subtracting the mean.

The formula for RMSNorm (Root Mean Square Normalization) for a given layer’s activations is encapsulated as:

RMSNorm(x_i) = (x_i / sqrt((1/N) * Σ(j=1 to N) x_j^2 + ε)) * γ

Where:

  • x_i is the i-th element of the input vector x to the layer, with elements [x_1, x_2, ..., x_N].
  • N is the total number of features in the input vector x to the layer.
  • Σ(j=1 to N) x_j^2 is the sum of squares of the elements of x.
  • ε is a small constant added for numerical stability (to avoid division by zero).
  • γ is a trainable scale parameter specific to each feature, similar to the scaling parameter in layer normalization. Unlike layer normalization, RMSNorm typically does not include a bias term (β).

The denominator in the formula, sqrt((1/N) * Σ(j=1 to N) x_j^2 + ε), computes the root mean square of the input vector, providing a measure of its magnitude. Each element of the input vector is then normalized by this value and scaled by the trainable parameter γ. This normalization process helps stabilize the gradients and makes training deep neural networks more efficient by reducing the computational overhead associated with calculating mean and variance as in Layer Normalization.
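Before looking at where RMSNorm sits inside a layer, here is a rough NumPy sanity check of the formula (a minimal sketch with arbitrary values, not the paper's implementation), with γ initialized to ones:

import numpy as np

x = np.array([2.0, 4.0, 6.0])           # example activations for one training example
eps = 1e-7
gamma = np.ones_like(x)                  # learnable scale, initialized to 1

rms = np.sqrt(np.mean(x ** 2) + eps)     # root mean square of the activations
x_rmsnorm = gamma * x / rms              # RMSNorm: no mean subtraction, no beta term

# For comparison, layer normalization would first subtract the mean:
x_layernorm = (x - x.mean()) / np.sqrt(x.var() + eps)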

RMSNorm takes the input vector, calculates the mean of the squares of its elements, takes the square root of this mean (plus ε for numerical stability), and divides each element of the vector by this value. After this normalization, it scales the normalized values by γ. Within a layer, this is applied to the output of the linear transformation (Wx + b) and before the activation function in the network's architecture.

So, to clarify, the sequence within a network layer incorporating RMSNorm would be:

  1. Perform the linear transformation on the input vector (Wx + b).
  2. Apply RMSNorm to this result, normalizing based on the RMS value and scaling the normalized output by γ.
  3. Pass the RMSNorm-adjusted outputs through an activation function.

This means RMSNorm is applied directly to the outputs of a linear transformation but before the activation function, facilitating a stabilized input to the activation functions and aiding in more efficient neural network training.

To demonstrate the use of a normalization technique similar to RMSNorm in Keras, we'll need to implement a custom layer, because RMSNorm is not directly available in Keras as a built-in layer, unlike LayerNormalization. This example shows how you can define a custom RMSNorm layer and integrate it into a Keras model. Note that we'll focus on the normalization process itself, including the gamma (γ) scaling parameter but omitting a bias term for simplicity.

Here’s a simplified version to give the idea:

import tensorflow as tf
from tensorflow.keras.layers import Layer, Dense, Activation
from tensorflow.keras import initializers, regularizers, constraints
from tensorflow.keras.models import Sequential

class RMSNorm(Layer):
    def __init__(self, epsilon=1e-7, gamma_initializer='ones', gamma_regularizer=None, gamma_constraint=None, **kwargs):
        super(RMSNorm, self).__init__(**kwargs)
        self.epsilon = epsilon
        self.gamma_initializer = initializers.get(gamma_initializer)
        self.gamma_regularizer = regularizers.get(gamma_regularizer)
        self.gamma_constraint = constraints.get(gamma_constraint)

    def build(self, input_shape):
        shape = (input_shape[-1],)
        self.gamma = self.add_weight(name='gamma',
                                     shape=shape,
                                     initializer=self.gamma_initializer,
                                     regularizer=self.gamma_regularizer,
                                     constraint=self.gamma_constraint,
                                     trainable=True)
        super(RMSNorm, self).build(input_shape)

    def call(self, inputs):
        # Root mean square of the activations along the feature axis
        rms = tf.sqrt(tf.reduce_mean(tf.square(inputs), axis=-1, keepdims=True) + self.epsilon)
        return self.gamma * inputs / rms

    def compute_output_shape(self, input_shape):
        return input_shape

# Now, let's use this RMSNorm layer in a simple Sequential model
model = Sequential([
    Dense(64, input_shape=(784,)),
    RMSNorm(),                 # Custom RMSNorm layer after the dense layer
    Activation('relu'),
    Dense(10, activation='softmax')
])

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()

In this example:

  • A custom RMSNorm layer class is defined, where the normalization is based on the root mean square of the input activations. This custom layer also includes a trainable scaling parameter (γ), similar to the gamma in the RMSNorm discussion.
  • The RMSNorm layer is used in a simple Sequential model right after a Dense layer and before an activation function, demonstrating how it can be integrated into a model.

This code is a simplified illustration to show how you could implement a normalization technique inspired by RMSNorm principles. For actual use, you may need to adjust the implementation details based on your specific requirements and the characteristics of your dataset and model architecture.

Custom Layer with Layer Normalization

You can create a custom network layer that encapsulates a sequence of operations: a linear transformation (often just a dense layer without an activation function or with linear activation), followed by normalization (either Layer Normalization or a custom RMSNorm), and then an activation function. This approach allows you to modularize your architecture, making your code cleaner and more reusable. Below is an example of how you could implement such a layer in Keras for both Layer Normalization and a custom RMSNorm layer, including the activation function as part of the custom layer:

from tensorflow.keras.layers import Layer, Dense, LayerNormalization, Activation
from tensorflow.keras import activations
import tensorflow as tf

class DenseLayerNorm(Layer):
    def __init__(self, units, activation=None, **kwargs):
        super(DenseLayerNorm, self).__init__(**kwargs)
        self.units = units
        self.activation = activations.get(activation)  # Safely handle activation

    def build(self, input_shape):
        self.dense = Dense(units=self.units, activation='linear')
        self.layer_norm = LayerNormalization()
        super(DenseLayerNorm, self).build(input_shape)  # Good practice

    def call(self, inputs):
        x = self.dense(inputs)
        x = self.layer_norm(x)
        if self.activation is not None:
            x = self.activation(x)  # Apply activation function if not None
        return x

# Example of using the custom layer
model = tf.keras.Sequential([
    DenseLayerNorm(64, activation='relu', input_shape=(784,)),
    DenseLayerNorm(32, activation='relu'),
    Dense(10, activation='softmax')
])

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

Custom Layer with RMSNorm

Assuming we have the RMSNorm class defined as before, we can create a similar custom layer that incorporates RMSNorm:

import tensorflow as tf
from tensorflow.keras.layers import Layer, Dense, Activation
from tensorflow.keras import activations

class DenseRMSNorm(Layer):
    def __init__(self, units, activation=None, **kwargs):
        super(DenseRMSNorm, self).__init__(**kwargs)
        self.units = units
        self.activation = activations.get(activation)  # Get the activation function

    def build(self, input_shape):
        self.dense = Dense(units=self.units, activation='linear')
        # Ensure the RMSNorm class defined earlier is available here
        self.rms_norm = RMSNorm()
        super(DenseRMSNorm, self).build(input_shape)  # It's good practice to call super().build()

    def call(self, inputs):
        x = self.dense(inputs)
        x = self.rms_norm(x)
        if self.activation is not None:
            x = self.activation(x)  # Apply activation function if not None
        return x

# Example of using the custom layer
model = tf.keras.Sequential([
    DenseRMSNorm(64, activation='relu', input_shape=(784,)),
    DenseRMSNorm(32, activation='relu'),
    Dense(10, activation='softmax')
])

In both examples, the custom layer (DenseLayerNorm and DenseRMSNorm) encapsulates the process of applying a linear transformation, followed by normalization (either layer normalization or RMSNorm), and then applying an activation function. This design makes it easy to add these combined operations as a single layer in your neural network models.

The line self.activation = activations.get(activation) in the custom layer examples is a way to dynamically resolve the activation function for the layer based on the provided argument. Here's a breakdown of what's happening:

  • activations.get(): In TensorFlow/Keras, activations.get() accepts the name of an activation function (e.g., 'relu', 'sigmoid', 'tanh', etc.), an activation callable, or None, and returns the corresponding function (or None). An alternative is to wrap the name in an Activation layer, as in self.activation = Activation(activation), which applies the chosen function as a standalone layer.
  • activation argument: When you instantiate your custom layer (e.g., DenseLayerNorm or DenseRMSNorm), you pass an activation parameter, which is the name of the activation function you want to apply after the normalization step (LayerNorm or RMSNorm in these examples).
  • Dynamic assignment: By using self.activation = activations.get(activation), you dynamically resolve the requested activation function and store it as an instance attribute (self.activation) of your custom layer.
  • Usage: In the call method of your custom layer, after applying the dense operation and the normalization (LayerNorm or RMSNorm), you call self.activation(x) on the output if it is not None. This applies the chosen activation function to the data.

This approach allows your custom layer to be flexible regarding the activation function it uses, enabling you to specify different activations depending on your needs without changing the layer’s implementation. It makes your custom layer more modular and reusable for different scenarios.
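As a small illustration (assuming TensorFlow 2.x), activations.get() resolves strings, callables, and None in exactly the way the custom layers above expect:

import tensorflow as tf
from tensorflow.keras import activations

relu_fn = activations.get('relu')                  # the string is resolved to the ReLU function
print(relu_fn(tf.constant([-1.0, 0.0, 2.0])))      # -> [0., 0., 2.]

same_fn = activations.get(tf.nn.relu)              # a callable is returned as-is
none_fn = activations.get(None)                    # None stays None, so call() skips the activation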

Batch Normalization

Batch normalization is a technique to normalize the inputs of each layer within a neural network. It works by adjusting and scaling the activations of the previous layer, aiming to improve the stability, speed, and performance of the training process. Batch normalization can lead to faster convergence, and in some cases, it can also improve the model’s accuracy by reducing internal covariate shift.

In Keras, batch normalization is implemented via the BatchNormalization layer. You can easily add it to your model just like any other layer. Here’s how you can use it in a Keras model:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Activation

model = Sequential([
    Dense(64, input_shape=(784,)),
    BatchNormalization(),    # Batch normalization layer
    Activation('relu'),      # Activation layer comes after batch normalization

    Dense(32),
    BatchNormalization(),    # Another batch normalization layer
    Activation('relu'),

    Dense(10, activation='softmax')
])

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

In this example, BatchNormalization layers are added after Dense layers but before activation functions (ReLU in this case). This order—applying batch normalization before the activation—is commonly recommended, though some research and practical applications show benefits to applying it after the activation function, depending on the specific circumstances and the network architecture.

How Batch Normalization Works in Keras:

  1. Normalization: For each mini-batch during training, the BatchNormalization layer computes the mean and variance of its inputs. It then normalizes the inputs using these statistics.
  2. Scaling and Shifting: After normalization, the layer applies a scale factor (gamma) and an offset (beta), both of which are learnable parameters. This allows the layer to undo the normalization if that's what the learned behavior requires, effectively giving the model the ability to learn the optimal scale and mean of the activations for each layer.
  3. During Training: The mean and variance are computed for each batch.
  4. During Inference: The layer uses the moving average of the mean and variance it learned during training to normalize inputs. This ensures that the model’s behavior remains consistent outside of training, even when it receives a single example or a batch of different size from the training batches.

Batch normalization often leads to significant improvement in training speed, stability, and sometimes even model accuracy. It’s widely used in deep learning models, especially deep neural networks and convolutional neural networks.
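The training versus inference distinction described above can be observed directly on a BatchNormalization layer. The following sketch (assuming TensorFlow 2.x eager execution, with arbitrary random data) calls the layer with training=True and training=False and inspects the trainable and non-trainable weights:

import numpy as np
import tensorflow as tf

bn = tf.keras.layers.BatchNormalization()
x = tf.constant(np.random.rand(32, 64), dtype=tf.float32)   # a dummy mini-batch

y_train = bn(x, training=True)    # uses this batch's mean/variance and updates the moving averages
y_infer = bn(x, training=False)   # uses the moving averages accumulated so far

# gamma and beta are trainable; moving_mean and moving_variance are not
print([w.name for w in bn.trainable_weights])
print([w.name for w in bn.non_trainable_weights])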

So, in mathematical terms: a dense layer with linear activation computes Wx + b, batch normalization then calculates the mean and standard deviation of this output across the mini-batch, and the normalized result ((Wx + b) - mean) / sigma (after scaling and shifting) becomes the input to the activation function.

Here’s a more detailed breakdown in mathematical terms:

Linear Transformation: For each input x, the dense layer performs a linear transformation to compute its output z as:

z = Wx + b

where W is the weight matrix, and b is the bias vector.

Batch Normalization: The batch normalization layer takes z (the output from the previous layer) and normalizes it. The normalization process for a given feature i in z across a mini-batch is defined as:

z_hat_i = (z_i - μ_B) / sqrt(σ_B^2 + ε)

where:

  • μ_B is the mean of the feature across the mini-batch.
  • σ_B^2 is the variance of the feature across the mini-batch.
  • ε is a small constant added for numerical stability to avoid division by zero.

Scale and Shift: After normalization, the batch normalization layer applies a scale (γ) and shift (β), both of which are learnable parameters specific to each feature. This step is expressed as:

y_i = γ * z_hat_i + β

The resulting y_i is the output of the batch normalization layer, ready to be passed into the activation function. This scale and shift operation allows the network to undo the normalization if that is what the learned behavior requires, giving the model flexibility.

Activation Function: Finally, y_i is passed through a non-linear activation function, such as ReLU, sigmoid, or tanh, to introduce non-linearity into the model:

a_i = f(y_i)

where f(⋅) is the activation function, and a_i is the activated output that gets passed to the next layer in the network.

This sequence — linear transformation, batch normalization, and then activation — helps stabilize the distribution of inputs to deep layers in the network, which can improve convergence rates and overall model performance.
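A compact NumPy sketch of this sequence (with placeholder values for W, b, gamma, and beta, and ignoring the moving averages used at inference time) might look like the following:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 784)).astype(np.float32)            # mini-batch of 32 examples
W = rng.normal(scale=0.01, size=(784, 64)).astype(np.float32)
b = np.zeros(64, dtype=np.float32)
gamma, beta, eps = np.ones(64), np.zeros(64), 1e-5

z = X @ W + b                              # 1. linear transformation
mu = z.mean(axis=0)                        # 2. per-feature mean over the mini-batch
var = z.var(axis=0)                        #    per-feature variance over the mini-batch
z_hat = (z - mu) / np.sqrt(var + eps)      #    normalize
y = gamma * z_hat + beta                   # 3. scale and shift
a = np.maximum(y, 0.0)                     # 4. ReLU activation

Note the axis: the statistics are taken over the batch dimension (axis=0), unlike layer normalization and RMSNorm, which operate over the feature dimension of each example.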

Custom Layer with BatchNormalization

from tensorflow.keras.layers import Layer, Dense, BatchNormalization, Activation
from tensorflow.keras import activations
import tensorflow as tf

class DenseBatchNorm(Layer):
    def __init__(self, units, activation=None, **kwargs):
        super(DenseBatchNorm, self).__init__(**kwargs)
        self.units = units
        # Use activations.get() to safely handle the activation function, whether it's None, a string, or a callable
        self.activation = activations.get(activation)

    def build(self, input_shape):
        # Dense layer without activation; the activation will be applied after batch normalization
        self.dense = Dense(units=self.units, activation='linear')
        # BatchNormalization layer
        self.batch_norm = BatchNormalization()
        super(DenseBatchNorm, self).build(input_shape)

    def call(self, inputs):
        x = self.dense(inputs)       # Apply dense layer
        x = self.batch_norm(x)       # Apply batch normalization to the output of the dense layer
        if self.activation is not None:
            x = self.activation(x)   # Apply the activation function if it's specified
        return x

# Example of using the custom layer in a model
model = tf.keras.Sequential([
    DenseBatchNorm(64, activation='relu', input_shape=(784,)),  # First layer with input shape specified
    DenseBatchNorm(32, activation='relu'),                      # Second custom dense layer with batch normalization
    Dense(10, activation='softmax')                             # Output layer for classification into 10 classes
])

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
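As a quick sanity check (purely illustrative: random inputs and labels shaped like flattened 28x28 images), the compiled model above can be exercised end to end with a couple of epochs of dummy data:

import numpy as np
import tensorflow as tf

x_dummy = np.random.rand(256, 784).astype("float32")
y_dummy = tf.keras.utils.to_categorical(np.random.randint(0, 10, size=256), num_classes=10)

model.fit(x_dummy, y_dummy, batch_size=32, epochs=2, validation_split=0.1)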

Conclusion:

In conclusion, normalization techniques such as Batch Normalization, Layer Normalization, and RMSNorm play pivotal roles in the success of deep learning models. Each technique offers unique advantages and considerations, catering to different network architectures and training scenarios. By understanding the principles behind these normalization methods, practitioners can effectively leverage them to train more stable, efficient, and accurate deep learning models.

References

1- Layer Normalization: https://arxiv.org/abs/1607.06450

2- Root Mean Square Layer Normalization: https://arxiv.org/abs/1910.07467

3- Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift: https://arxiv.org/abs/1502.03167
