Breaking down the Transformer

Transformer Models with PyTorch

James Chapman

Curriculum Manager, DataCamp

The paper that changed everything...

Attention Is All You Need by Ashish Vaswani et al. (arXiv:1706.03762)
- Attention mechanisms
- Optimized for text modeling
- Used in Large Language Models (LLMs)

The transformers architecture as shown in the academic paper, Attention Is All You Need.

The paper that changed everything...

Attention Is All You Need by Ashish Vaswani et al.
- Attention mechanisms
- Optimized for text modeling
- Used in Large Language Models (LLMs)

The transformers architecture as shown in the academic paper, Attention Is All You Need.

The paper that changed everything...

Attention Is All You Need by Ashish Vaswani et al.
- Attention mechanisms
- Optimized for text modeling
- Used in Large Language Models (LLMs)

The transformers architecture as shown in the academic paper, Attention Is All You Need.

Unpacking the Transformer...

Encoder block

Multiple identical layers
Reading and processing input sequence
Generate context-rich numerical representations
Uses self-attention and feed-forward networks

The transformers architecture as shown in the academic paper, Attention Is All You Need.

Unpacking the Transformer...

Decoder block

Encoded input sequence → output sequence

The transformers architecture as shown in the academic paper, Attention Is All You Need.

Unpacking the Transformer...

Positional encoding

Encode each token's position in the sequence
Order is crucial for modeling sequences

The transformers architecture as shown in the academic paper, Attention Is All You Need.

Unpacking the Transformer...

Attention mechanisms

Focus on important tokens and their relationships
Improve text generation

The transformers architecture as shown in the academic paper, Attention Is All You Need.

Unpacking the Transformer...

Attention mechanisms

Focus on important tokens and their relationships
Improve text generation

Self-attention

Weighting token importance
Captures long-range dependencies

The transformers architecture as shown in the academic paper, Attention Is All You Need.

Unpacking the Transformer...

Attention mechanisms

Focus on important tokens and their relationships
Improve text generation

Self-attention

Weighting token importance
Captures long-range dependencies

Multi-head attention

Splits input into multiple heads
Heads capture distinct patterns, leading to richer representations

The transformers architecture as shown in the academic paper, Attention Is All You Need.

Unpacking the Transformer...

Position-wise feed-forward networks

Simple NNs that apply transformations
Each token is transformed independently
Position-independent/"position-wise"

The transformers architecture as shown in the academic paper, Attention Is All You Need.

Transformers in PyTorch

d_model: Dimensionality of model inputs
nheads: Number of attention heads
num_encoder_layers: Number of encoder layers
num_decoder_layers: Number of decoder layers

import torch.nn as nn


model = nn.Transformer(

    d_model=512,

    nhead=8,

    num_encoder_layers=6,

    num_decoder_layers=6

)


print(model)

Transformer(
  (encoder): TransformerEncoder(
    (layers): ModuleList(
      (0-5): 6 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (linear1): Linear(in_features=512, out_features=2048, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=2048, out_features=512, bias=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
    (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
  ...

  (decoder): TransformerDecoder(
    (layers): ModuleList(
      (0-5): 6 x TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (multihead_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (linear1): Linear(in_features=512, out_features=2048, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=2048, out_features=512, bias=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm3): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
        (dropout3): Dropout(p=0.1, inplace=False)
      )
    )
    (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
)

Let's practice!

Transformer Models with PyTorch