Architecture and Working of Transformers in Deep Learning

Transformers are a type of deep learning model that utilizes self-attention mechanisms to process and generate sequences of data efficiently, capturing long-range dependencies and contextual relationships.

This article discusses the architecture and working of the transformer model in deep learning.

Transformer Architecture: High-Level Overview

The transformer model is built on an encoder-decoder architecture, where both the encoder and decoder are composed of a series of layers that utilize self-attention mechanisms and feed-forward neural networks. This architecture enables the model to process input data in parallel, making it highly efficient and effective for tasks involving sequential data.

In a transformer model, the encoder processes the input sequence and generates a set of continuous representations. These representations are then fed into the decoder, which produces the output sequence. The encoder and decoder work together to transform the input into the desired output, such as translating a sentence from one language to another or generating a response to a query.

Figure: Transformer's Architecture

Components of the Transformer

1. Encoder

The encoder in a transformer consists of a stack of identical layers, each designed to capture various aspects of the input data. The primary function of the encoder is to create a high-dimensional representation of the input sequence that the decoder can use to generate the output.

Each encoder layer is composed of two main sub-layers:

  1. Self-Attention Mechanism: This sub-layer allows the encoder to weigh the importance of different parts of the input sequence differently, capturing dependencies regardless of their distance within the sequence.
  2. Feed-Forward Neural Network: This sub-layer consists of two linear transformations with a ReLU activation in between. It processes the output of the self-attention mechanism to generate a refined representation.

Layer normalization and residual connections are used around each of these sub-layers to ensure stability and improve convergence during training.

2. Decoder

The decoder in a transformer also consists of a stack of identical layers. Its primary function is to generate the output sequence based on the representations provided by the encoder and the previously generated tokens of the output.

Each decoder layer consists of three main sub-layers:

  1. Masked Self-Attention Mechanism: Similar to the encoder’s self-attention mechanism but with a mask to prevent the decoder from attending to future tokens in the output sequence.
  2. Encoder-Decoder Attention Mechanism: This sub-layer allows the decoder to focus on relevant parts of the encoder’s output representation, facilitating the generation of coherent and contextually appropriate output sequences.
  3. Feed-Forward Neural Network: This sub-layer processes the combined output of the masked self-attention and encoder-decoder attention mechanisms.

Detailed Components of Transformers Architecture

1. Multi-Head Self-Attention Mechanism

Self-attention is a mechanism that allows the model to weigh the importance of different parts of the input sequence differently, regardless of their positions. This mechanism enables the model to capture dependencies between distant tokens effectively.

Multi-head attention extends the self-attention mechanism by applying it multiple times in parallel, with each “head” learning different aspects of the input data. This allows the model to capture a richer set of relationships within the input sequence. The outputs of these heads are then concatenated and linearly transformed to produce the final output. The benefits include:

  • Improved ability to capture complex patterns in the data.
  • Enhanced model capacity without a significant increase in computational complexity.

Mathematical Formulation:

Given an input sequence X, the self-attention mechanism computes three matrices: queries Q, keys K, and values V, by multiplying X with learned weight matrices [Tex]W_Q[/Tex]​, [Tex]W_K[/Tex]​, and [Tex]W_V[/Tex].

[Tex]Q = XW_Q, \quad K = XW_K, \quad V = XW_V[/Tex]

The attention output is computed as:

[Tex]\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V[/Tex]

where [Tex]d_k[/Tex] is the dimensionality of the keys. Scaling by [Tex]\sqrt{d_k}[/Tex] keeps the dot products from growing too large and saturating the softmax.

For multi-head attention, these computations are performed h times (with different learned weight matrices for each head), and the results are concatenated and projected:

[Tex]\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W_O[/Tex]

where [Tex]\text{head}_i = \text{Attention}(QW_Q^i, KW_K^i, VW_V^i)[/Tex] and [Tex]W_O[/Tex] is the output projection matrix.
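
As a concrete illustration, here is a minimal NumPy sketch of scaled dot-product attention and multi-head attention following the equations above. The weight names and toy dimensions are assumptions chosen for the example, not part of any particular library:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    # W_Q, W_K, W_V are lists of per-head projection matrices; W_O projects
    # the concatenated heads back to the model dimension.
    heads = [attention(X @ wq, X @ wk, X @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O

# Toy example: 4 tokens, d_model = 8, h = 2 heads of size d_k = 4
rng = np.random.default_rng(0)
d_model, d_k, h, seq_len = 8, 4, 2, 4
X = rng.normal(size=(seq_len, d_model))
W_Q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O).shape)  # (4, 8)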

2. Position-wise Feed-Forward Networks

Position-wise feed-forward networks (FFNs) are applied to each position of the sequence independently and identically. They consist of two linear transformations with a ReLU activation in between:

[Tex]\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2[/Tex]

The FFN helps in transforming the input to a higher-level representation, allowing the model to learn complex functions. It is applied after the self-attention mechanism within each layer of the encoder and decoder.
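
A minimal NumPy sketch of this computation (the dimensions here are illustrative; the original paper used d_model = 512 with an inner dimension of 2048):

import numpy as np

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every position
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 4
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(ffn(x, W1, b1, W2, b2).shape)  # (4, 8) -- each position keeps its shape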

3. Positional Encoding

Transformers lack inherent information about the order of the input sequence due to their parallel processing nature. Positional encoding is introduced to provide the model with information about the position of each token in the sequence.

Positional encodings are added to the input embeddings to give the model a sense of token order. These encodings can be either learned or fixed.

A common approach is to use sinusoidal functions to generate positional encodings:

[Tex]PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)[/Tex]

[Tex]PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)[/Tex]

where pos is the position and i is the dimension. These encodings are then added to the input embeddings.
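
A sketch of these sinusoidal encodings in NumPy (assuming an even d_model for simplicity):

import numpy as np

def positional_encoding(max_len, d_model):
    # pe[pos, 2i]     = sin(pos / 10000^(2i/d_model))
    # pe[pos, 2i + 1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]  # (1, d_model/2) -- even dimensions
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

print(positional_encoding(max_len=50, d_model=8).shape)  # (50, 8)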

4. Layer Normalization and Residual Connections

Layer normalization is applied to the output of the self-attention and feed-forward sub-layers to stabilize and accelerate training by normalizing the inputs across the features. This helps in maintaining the scale of the inputs to the next layer.

Residual connections (skip connections) are used around each sub-layer to allow the gradients to flow through the network more effectively, preventing the vanishing gradient problem and enabling the training of deeper models. The output of each sub-layer is added to its input before applying layer normalization:

[Tex]\text{Output} = \text{LayerNorm}(x + \text{SubLayer}(x))[/Tex]

This addition helps in preserving the original input information, which is crucial for learning complex representations.
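
A sketch of this post-norm residual pattern in NumPy, where gamma and beta are layer normalization's learned scale and shift parameters:

import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    # Normalize each position's features to zero mean and unit variance,
    # then apply the learned scale (gamma) and shift (beta)
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gamma * (x - mean) / (std + eps) + beta

def sublayer_connection(x, sublayer, gamma, beta):
    # Output = LayerNorm(x + SubLayer(x))
    return layer_norm(x + sublayer(x), gamma, beta)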

Working of Transformers

1. Input Representation

The first step in processing input data is converting raw text into a format that the transformer model can understand. This is done through tokenization and embedding.

Tokenization and Embedding

  • Tokenization: The input text is split into smaller units called tokens, which can be words, subwords, or characters. Tokenization ensures that the text is broken down into manageable pieces.
  • Embedding: Each token is then converted into a fixed-size vector using an embedding layer, which maps each token to a dense vector representation that captures its semantic meaning. Positional encodings are added to these embeddings to provide information about token positions within the sequence (a toy sketch follows this list).
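
A toy sketch of this step, reusing the positional_encoding function from the earlier sketch. The whitespace tokenizer and four-word vocabulary are stand-ins; real systems use learned subword tokenizers such as BPE or WordPiece:

import numpy as np

vocab = {"<pad>": 0, "the": 1, "cat": 2, "sat": 3}  # hypothetical tiny vocabulary
d_model = 8
embedding_table = np.random.default_rng(0).normal(size=(len(vocab), d_model))

def embed(text):
    token_ids = [vocab[tok] for tok in text.lower().split()]  # tokenize
    x = embedding_table[token_ids]                            # (seq_len, d_model)
    return x + positional_encoding(len(token_ids), d_model)   # add position info

print(embed("the cat sat").shape)  # (3, 8)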

2. Encoder Process in Transformers

Step-by-Step Functioning of the Encoder

  1. Input Embedding: The input sequence is tokenized and converted into embeddings, with positional encodings added.
  2. Self-Attention Mechanism: Each token in the input sequence attends to every other token to capture dependencies and contextual information. The self-attention scores are calculated and used to weigh the importance of different tokens.
  3. Feed-Forward Network: The output from the self-attention mechanism is passed through a position-wise feed-forward network, which consists of two linear transformations with a ReLU activation in between.
  4. Layer Normalization and Residual Connections: Layer normalization and residual connections are applied around the self-attention and feed-forward sub-layers to stabilize training and improve gradient flow.

Self-Attention and Feed-Forward Network Operations Within the Encoder

  • Self-Attention: For each token, self-attention computes a weighted sum of the values (from the value matrix V), where the weights are determined by the dot product of the query (from the query matrix Q) and key (from the key matrix K) vectors.
  • Feed-Forward Network: The output from the self-attention layer is transformed through a feed-forward network, which helps in learning complex representations. A combined sketch of one encoder layer follows this list.
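
Putting the earlier sketches together, one encoder layer could look roughly like this. The params dictionary and its keys are purely illustrative; the helper functions are the ones sketched in the previous sections:

def encoder_layer(x, params):
    # Sub-layer 1: multi-head self-attention, then add & norm
    attn_out = multi_head_attention(x, params["W_Q"], params["W_K"],
                                    params["W_V"], params["W_O"])
    x = layer_norm(x + attn_out, params["gamma1"], params["beta1"])
    # Sub-layer 2: position-wise feed-forward network, then add & norm
    ffn_out = ffn(x, params["W1"], params["b1"], params["W2"], params["b2"])
    return layer_norm(x + ffn_out, params["gamma2"], params["beta2"])

# A full encoder stacks N such layers, feeding each layer's output into the next.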

3. Decoder Process

Step-by-Step Functioning of the Decoder

  1. Input Embedding and Positional Encoding: The partially generated output sequence is tokenized and embedded, with positional encodings added.
  2. Masked Self-Attention Mechanism: The decoder uses masked self-attention to prevent attending to future tokens, ensuring that the model generates the sequence step-by-step.
  3. Encoder-Decoder Attention Mechanism: The decoder attends to the encoder’s output, allowing it to focus on relevant parts of the input sequence.
  4. Feed-Forward Network: Similar to the encoder, the output from the attention mechanisms is passed through a position-wise feed-forward network.
  5. Layer Normalization and Residual Connections: Layer normalization and residual connections are applied to stabilize training and improve gradient flow.

Self-Attention, Encoder-Decoder Attention, and Feed-Forward Network Operations Within the Decoder

  • Masked Self-Attention: Ensures that the decoder only attends to previous tokens in the sequence.
  • Encoder-Decoder Attention: Allows the decoder to attend to the encoder’s output, integrating information from the input sequence.
  • Feed-Forward Network: Processes the combined output of the attention mechanisms to generate the final representation. A sketch of the decoder's two attention patterns follows this list.
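
A single-head sketch of the two decoder attention patterns, reusing the softmax helper from the self-attention sketch. The causal mask blocks positions above the diagonal, i.e., future tokens; the weight names are hypothetical:

import numpy as np

def causal_mask(n):
    # True above the diagonal = future positions that must not be attended to
    return np.triu(np.ones((n, n), dtype=bool), k=1)

def masked_attention(Q, K, V, mask=None):
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(K.shape[-1])
    if mask is not None:
        scores = np.where(mask, -1e9, scores)  # ~ -inf, zeroed out by softmax
    return softmax(scores) @ V

# Masked self-attention: Q, K, V all come from the decoder input Y
#   y_ctx = masked_attention(Y @ Wq1, Y @ Wk1, Y @ Wv1, causal_mask(len(Y)))
# Encoder-decoder attention: Q from the decoder, K and V from the encoder output
#   y_out = masked_attention(y_ctx @ Wq2, enc_out @ Wk2, enc_out @ Wv2)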

4. Training and Inference

Transformers are trained using supervised learning, where the model learns to predict the next token in a sequence given the previous tokens. The training involves the following steps:

1. Loss Function and Optimization

  • Loss Function: The most common loss function for training transformers is the cross-entropy loss, which measures the difference between the predicted and actual token distributions.
  • Optimization: The model parameters are optimized with gradient-based methods, typically the Adam optimizer. Gradients are computed via backpropagation, and the parameters are updated to minimize the loss, as in the training-step sketch below.
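
As an illustration, here is a minimal training-step sketch using PyTorch's built-in nn.Transformer module. The hyperparameters, random batches, and layer names are placeholders, not a recommended configuration:

import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64
model = nn.Transformer(d_model=d_model, nhead=4, num_encoder_layers=2,
                       num_decoder_layers=2, batch_first=True)
embed = nn.Embedding(vocab_size, d_model)
to_logits = nn.Linear(d_model, vocab_size)
params = (list(model.parameters()) + list(embed.parameters())
          + list(to_logits.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

src = torch.randint(0, vocab_size, (8, 10))  # fake source batch (batch, src_len)
tgt = torch.randint(0, vocab_size, (8, 12))  # fake target batch (batch, tgt_len)
tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]    # teacher forcing: predict next token

tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_in.size(1))
out = model(embed(src), embed(tgt_in), tgt_mask=tgt_mask)
loss = loss_fn(to_logits(out).reshape(-1, vocab_size), tgt_out.reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()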

2. Inference Process and Generating Output

  • Inference: During inference, the trained model generates the output sequence step-by-step. For each step, the model predicts the next token based on the previously generated tokens and the input sequence.
  • Generating Output: The decoder generates tokens until a special end-of-sequence token is produced or a predefined maximum length is reached. Beam search or other decoding strategies can be used to improve the quality of the generated sequence; a minimal greedy-decoding sketch follows this list.
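
A minimal greedy-decoding loop over the model from the training sketch above. For simplicity it re-encodes the source at every step; real implementations cache the encoder output, and beam search would track several candidate sequences instead of one:

@torch.no_grad()
def greedy_decode(model, embed, to_logits, src, bos_id, eos_id, max_len=50):
    ys = torch.full((src.size(0), 1), bos_id, dtype=torch.long)  # start with BOS
    for _ in range(max_len):
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(ys.size(1))
        out = model(embed(src), embed(ys), tgt_mask=tgt_mask)
        next_tok = to_logits(out[:, -1]).argmax(dim=-1, keepdim=True)  # best token
        ys = torch.cat([ys, next_tok], dim=1)
        if (next_tok == eos_id).all():  # every sequence produced end-of-sequence
            break
    return ys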

Conclusion

Transformers have transformed deep learning by using self-attention mechanisms to efficiently process and generate sequences, capturing long-range dependencies and contextual relationships. Their encoder-decoder architecture, combined with multi-head attention and feed-forward networks, enables highly effective handling of sequential data. Despite computational challenges, transformers continue to drive innovation in various fields, making them a cornerstone in the evolution of deep learning.



