Transformers are a deep learning architecture designed for sequence-to-sequence tasks. Unlike traditional sequence models such as recurrent neural networks (RNNs) and long short-term memory networks (LSTMs), transformers rely entirely on a mechanism known as self-attention to draw global dependencies between input and output. In this guide, we will walk through the implementation of a Transformer model from scratch using TensorFlow. The implementation includes key components such as positional encoding, multi-head attention, and transformer blocks.

Architecture Overview of Transformer Model

The Transformer model consists of an encoder and a decoder, both built from multiple layers of self-attention and feed-forward networks. The architecture is designed to handle sequential data with attention mechanisms and is capable of processing long-range dependencies efficiently.

1. Encoder

The encoder processes the input sequence. It consists of a series of identical layers, each of which contains two main subcomponents:

- A multi-head self-attention mechanism, which lets every position attend to every other position in the input.
- A position-wise feed-forward network, applied independently to each position.
2. Decoder

The decoder generates the output sequence and consists of layers similar to the encoder, but with additional mechanisms:

- A masked (look-ahead) self-attention layer that prevents each position from attending to future tokens.
- An encoder-decoder attention layer that attends to the encoder's output.
- A position-wise feed-forward network, as in the encoder.
Implementing Transformer Model from Scratch using TensorFlow

1. Importing Required Libraries

We'll start by importing TensorFlow and the necessary components from tensorflow.keras:

import tensorflow as tf
from tensorflow.keras.layers import Dense, Input, Embedding, Dropout, LayerNormalization
from tensorflow.keras.models import Model
import numpy as np

2. Positional Encoding

Positional encoding is added to the input embeddings to provide information about the position of tokens in the sequence. Unlike RNNs and LSTMs, Transformers do not inherently capture the sequential nature of data, so positional encodings are essential for injecting this information. Here's a function to compute the positional encodings:

def positional_encoding(position, d_model):
    angle_rads = np.arange(position)[:, np.newaxis] / np.power(10000, (2 * (np.arange(d_model) // 2)) / np.float32(d_model))
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    return tf.cast(angle_rads[np.newaxis, ...], dtype=tf.float32)
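As a quick sanity check (an illustrative addition, not part of the core implementation), we can confirm that the encoding has shape (1, position, d_model), ready to be broadcast over a batch of embeddings:

# Illustrative check: the encoding is broadcastable across the batch dimension
sample_pos_encoding = positional_encoding(position=50, d_model=128)
print(sample_pos_encoding.shape)  # (1, 50, 128)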
3. Multi-Head Attention

The multi-head attention mechanism allows the model to focus on different parts of the input sequence simultaneously. It uses multiple attention heads to compute different representations of the input. Scaled dot-product attention is the core mechanism used inside each head to compute attention scores. Here's a class for multi-head attention:

class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        assert d_model % num_heads == 0
        self.depth = d_model // num_heads
        self.wq = Dense(d_model)
        self.wk = Dense(d_model)
        self.wv = Dense(d_model)
        self.dense = Dense(d_model)

    def split_heads(self, x, batch_size):
        # reshape (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]
        q = self.wq(q)
        k = self.wk(k)
        v = self.wv(v)
        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)
        attention, attention_weights = self.scaled_dot_product_attention(q, k, v, mask)
        # merge the heads back into a single d_model-sized representation
        attention = tf.transpose(attention, perm=[0, 2, 1, 3])
        attention = tf.reshape(attention, (batch_size, -1, self.d_model))
        output = self.dense(attention)
        return output

    def scaled_dot_product_attention(self, q, k, v, mask):
        matmul_qk = tf.matmul(q, k, transpose_b=True)
        dk = tf.cast(tf.shape(k)[-1], tf.float32)
        scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)
        if mask is not None:
            # masked positions receive a large negative logit, so softmax assigns them ~0 weight
            scaled_attention_logits += (mask * -1e9)
        attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
        output = tf.matmul(attention_weights, v)
        return output, attention_weights
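To see the layer in isolation, the following example (an illustrative addition, not from the walkthrough itself) feeds a random batch through MultiHeadAttention with query, key and value all set to the same tensor, which is how it is used for self-attention; the output keeps the input's shape:

# Illustrative self-attention example: q, k and v are the same tensor
temp_mha = MultiHeadAttention(d_model=128, num_heads=8)
y = tf.random.uniform((1, 60, 128))   # (batch_size, seq_len, d_model)
out = temp_mha(y, k=y, q=y, mask=None)
print(out.shape)                      # (1, 60, 128)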
4. Feed Forward Network

The position-wise feed-forward network is used to process each position independently:

class PositionwiseFeedforward(tf.keras.layers.Layer):
    def __init__(self, d_model, dff):
        super(PositionwiseFeedforward, self).__init__()
        self.d_model = d_model
        self.dff = dff
        self.dense1 = Dense(dff, activation='relu')
        self.dense2 = Dense(d_model)

    def call(self, x):
        x = self.dense1(x)
        x = self.dense2(x)
        return x
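Because both Dense layers act on the last axis only, every position in the sequence is transformed independently. A small illustrative check (not part of the original code):

# Illustrative check: the feed-forward network preserves (batch, seq_len, d_model)
sample_ffn = PositionwiseFeedforward(d_model=128, dff=512)
print(sample_ffn(tf.random.uniform((64, 50, 128))).shape)  # (64, 50, 128)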
5. Transformer Block

A transformer block combines multi-head attention and feed-forward networks with layer normalization and dropout:

class TransformerBlock(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, dropout_rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = MultiHeadAttention(d_model, num_heads)
        self.ffn = PositionwiseFeedforward(d_model, dff)
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(dropout_rate)
        self.dropout2 = Dropout(dropout_rate)

    def call(self, x, training, mask):
        attn_output = self.att(x, x, x, mask)          # self-attention
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)        # residual connection + layer norm
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)      # residual connection + layer norm
        return out2
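A short usage sketch (added for illustration): running a random batch through a single block leaves the shape unchanged, since both sub-layers are wrapped in residual connections:

# Illustrative example: one block keeps the (batch, seq_len, d_model) shape
sample_block = TransformerBlock(d_model=128, num_heads=8, dff=512)
sample_input = tf.random.uniform((64, 50, 128))
print(sample_block(sample_input, False, None).shape)  # training=False, mask=None -> (64, 50, 128)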
6. Encoder

The encoder consists of a stack of encoder layers. It converts the input sequence into a set of embeddings enriched with positional information.

class Encoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, maximum_position_encoding, dropout_rate=0.1):
        super(Encoder, self).__init__()
        self.d_model = d_model
        self.num_layers = num_layers
        self.embedding = Embedding(input_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding, d_model)
        self.dropout = Dropout(dropout_rate)
        self.enc_layers = [TransformerBlock(d_model, num_heads, dff, dropout_rate) for _ in range(num_layers)]

    def call(self, x, training, mask):
        seq_len = tf.shape(x)[1]
        x = self.embedding(x)
        x += self.pos_encoding[:, :seq_len, :]   # add positional information
        x = self.dropout(x, training=training)
        for i in range(self.num_layers):
            x = self.enc_layers[i](x, training, mask)
        return x
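In practice the encoder is given a padding mask so that attention ignores padded positions. The helper below is a common pattern added here for completeness; it is not defined in the original code and assumes token ID 0 is the padding token:

def create_padding_mask(seq):
    # 1.0 where the token ID is 0 (padding), 0.0 elsewhere
    seq = tf.cast(tf.math.equal(seq, 0), tf.float32)
    # add extra dimensions so the mask broadcasts over heads and query positions:
    # (batch_size, 1, 1, seq_len)
    return seq[:, tf.newaxis, tf.newaxis, :]

print(create_padding_mask(tf.constant([[7, 6, 0, 0, 1]])))  # [[[[0. 0. 1. 1. 0.]]]]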
7. Decoder

The decoder generates the output sequence from the encoded representation. In this simplified implementation it stacks the same TransformerBlock layers and applies a look-ahead mask over previously generated tokens; a full Transformer decoder would also include an encoder-decoder (cross) attention sub-layer over the encoder output.

class Decoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, target_vocab_size, maximum_position_encoding, dropout_rate=0.1):
        super(Decoder, self).__init__()
        self.d_model = d_model
        self.num_layers = num_layers
        self.embedding = Embedding(target_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding, d_model)
        self.dropout = Dropout(dropout_rate)
        self.dec_layers = [TransformerBlock(d_model, num_heads, dff, dropout_rate) for _ in range(num_layers)]

    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        # enc_output and padding_mask are accepted for API symmetry but are not used
        # by this simplified decoder, which has no cross-attention sub-layer
        seq_len = tf.shape(x)[1]
        x = self.embedding(x)
        x += self.pos_encoding[:, :seq_len, :]   # add positional information
        x = self.dropout(x, training=training)
        for i in range(self.num_layers):
            x = self.dec_layers[i](x, training, look_ahead_mask)
        return x
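For the decoder's self-attention, a look-ahead mask is normally used so that position i cannot attend to positions after i. A common sketch, added here as an assumption rather than part of the original code:

def create_look_ahead_mask(size):
    # upper-triangular matrix of 1s above the diagonal: 1 means "do not attend"
    return 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)

print(create_look_ahead_mask(4))
# [[0. 1. 1. 1.]
#  [0. 0. 1. 1.]
#  [0. 0. 0. 1.]
#  [0. 0. 0. 0.]]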
8. Transformer Model

The final model combines the encoder and decoder and outputs the final predictions.

class Transformer(Model):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, target_vocab_size, pe_input, pe_target, rate=0.1):
        super(Transformer, self).__init__()
        self.encoder = Encoder(num_layers, d_model, num_heads, dff, input_vocab_size, pe_input, rate)
        self.decoder = Decoder(num_layers, d_model, num_heads, dff, target_vocab_size, pe_target, rate)
        self.final_layer = Dense(target_vocab_size)   # projects to logits over the target vocabulary

    def call(self, inputs, targets, training, look_ahead_mask, padding_mask):
        enc_output = self.encoder(inputs, training, padding_mask)
        dec_output = self.decoder(targets, enc_output, training, look_ahead_mask, padding_mask)
        final_output = self.final_layer(dec_output)
        return final_output
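When real masks are supplied, the decoder's self-attention mask is typically the element-wise maximum of the look-ahead mask and a padding mask on the target tokens. A sketch using the create_padding_mask and create_look_ahead_mask helpers defined above (illustrative assumptions, not part of the original article):

def create_masks(inputs, targets):
    # padding mask for the encoder's self-attention
    enc_padding_mask = create_padding_mask(inputs)
    # combined mask for the decoder's self-attention:
    # blocks future positions and padded target positions
    look_ahead_mask = create_look_ahead_mask(tf.shape(targets)[1])
    dec_target_padding_mask = create_padding_mask(targets)
    combined_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask)
    return enc_padding_mask, combined_mask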
9. Training and Testing

Let's define the model parameters and perform a forward pass with example inputs:

# Parameters
num_layers = 4
d_model = 128
num_heads = 8
dff = 512
input_vocab_size = 8500
target_vocab_size = 8000
pe_input = 1000
pe_target = 1000
dropout_rate = 0.1
# Create the Transformer model
transformer = Transformer(num_layers, d_model, num_heads, dff, input_vocab_size, target_vocab_size, pe_input, pe_target, dropout_rate)
# Example Input
inputs = tf.random.uniform((64, 50), dtype=tf.int64, minval=0, maxval=input_vocab_size)
targets = tf.random.uniform((64, 50), dtype=tf.int64, minval=0, maxval=target_vocab_size)
look_ahead_mask = None
padding_mask = None
# Forward Pass
output = transformer(inputs, targets, training=True, look_ahead_mask=look_ahead_mask, padding_mask=padding_mask)
print(output.shape)  # (64, 50, target_vocab_size)
Output:

(64, 50, 8000)

The output shape (64, 50, 8000) corresponds to (batch_size, sequence_length, target_vocab_size): for each of the 64 sequences, the model produces a vector of logits over the 8000-token target vocabulary at every one of the 50 positions. If you're working on a machine translation task, the inputs would be token IDs of the source-language sentences, the targets would be token IDs of the target-language sentences, and these logits would be converted into predicted tokens (for example with an argmax, or a softmax plus sampling) at each position.
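To go beyond a single forward pass, a minimal training step could look like the sketch below. This is an illustrative assumption, not the article's full training pipeline: it uses teacher forcing (the decoder input is the target sequence shifted right), an Adam optimizer, a masked sparse categorical cross-entropy loss that ignores padded positions (token ID 0), and the create_masks helper sketched earlier.

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')
optimizer = tf.keras.optimizers.Adam(1e-4)

def loss_function(real, pred):
    # ignore positions where the target token is padding (ID 0)
    mask = tf.cast(tf.math.not_equal(real, 0), pred.dtype)
    loss = loss_object(real, pred) * mask
    return tf.reduce_sum(loss) / tf.reduce_sum(mask)

def train_step(inputs, targets):
    tar_inp = targets[:, :-1]    # decoder input (shifted right)
    tar_real = targets[:, 1:]    # expected output tokens
    enc_padding_mask, combined_mask = create_masks(inputs, tar_inp)
    with tf.GradientTape() as tape:
        predictions = transformer(inputs, tar_inp, training=True,
                                  look_ahead_mask=combined_mask,
                                  padding_mask=enc_padding_mask)
        loss = loss_function(tar_real, predictions)
    gradients = tape.gradient(loss, transformer.trainable_variables)
    optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))
    return loss

print(train_step(inputs, targets))  # loss for one step on the random example batch

For speed on a real dataset, train_step would normally be wrapped in tf.function and called once per batch inside an epoch loop.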
Conclusion

You have now built a Transformer model from scratch using TensorFlow. This implementation covers the core components of a Transformer architecture, including positional encoding, multi-head attention, feed-forward networks, and both encoder and decoder layers. With this foundational model, you can explore various NLP tasks and further customize it based on specific requirements.
Referred: https://www.geeksforgeeks.org