Balanced training with AdamW

Efficient AI Model Training with PyTorch

Dennis Lee

Data Engineer

Efficient training

Diagram showing chapter topics in the course with a highlight on optimizers.

Optimizers for training efficiency

Diagram depicting three optimizers: AdamW, Adafactor, and 8-bit Adam.

Optimizer tradeoffs

Diagram showing the tradeoffs between number of parameters and precision for AdamW, Adafactor, and 8-bit Adam.
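As a rough illustration of that tradeoff, the sketch below shows how the three optimizers could be instantiated, assuming a model is already loaded as in the later slides. Adafactor and 8-bit Adam come from the transformers and bitsandbytes libraries and are covered in later lessons, so they appear here for comparison only.

from torch.optim import AdamW
from transformers import Adafactor
import bitsandbytes as bnb

# AdamW: two full-precision states per parameter (about 8 bytes per parameter)
adamw = AdamW(model.parameters())

# Adafactor: factors the squared-gradient state, so it stores far fewer values
adafactor = Adafactor(model.parameters(), scale_parameter=False,
                      relative_step=False, lr=1e-3)

# 8-bit Adam: keeps both states, but quantized to 8 bits (about 2 bytes per parameter)
adam_8bit = bnb.optim.Adam8bit(model.parameters())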

How does AdamW work?

Diagram illustrating how AdamW works.

  • Compute the exponential moving average (EMA) of the gradients
  • Compute EMA of squared gradients
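To make the two moving averages concrete, here is a minimal sketch of a single AdamW update for one parameter tensor, written for this lesson with default-style hyperparameter values; the real torch.optim.AdamW adds parameter groups, amsgrad, and other details on top of this logic.

import torch

def adamw_step(param, grad, exp_avg, exp_avg_sq, step,
               lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    step += 1
    # Decoupled weight decay: shrink the parameter directly (the "W" in AdamW)
    param.mul_(1 - lr * weight_decay)
    # State 1: EMA of the gradients
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    # State 2: EMA of the squared gradients
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    # Bias-corrected estimates scale the step size per parameter
    exp_avg_hat = exp_avg / (1 - beta1 ** step)
    exp_avg_sq_hat = exp_avg_sq / (1 - beta2 ** step)
    param.addcdiv_(exp_avg_hat, exp_avg_sq_hat.sqrt() + eps, value=-lr)
    return step

param, grad = torch.ones(4), torch.full((4,), 0.5)
exp_avg, exp_avg_sq = torch.zeros(4), torch.zeros(4)
step = adamw_step(param, grad, exp_avg, exp_avg_sq, step=0)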

Memory usage of AdamW

Diagram showing the quantities involved in calculations for AdamW: parameter gradients, EMA of gradients, and EMA of squared gradients.

  • Each square is a parameter, and each color is a state
  • Memory per parameter = 8 bytes = 4 bytes per state * 2 states
  • Total memory = Memory per parameter (8 bytes) * Number of parameters

Estimate memory usage of AdamW

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-cased", return_dict=True)

# Count the model parameters
num_parameters = sum(p.numel() for p in model.parameters())
print(f"Number of model parameters: {num_parameters:,}")
Number of model parameters: 65,783,042

# AdamW keeps two 4-byte states per parameter = 8 bytes per parameter
estimated_memory = num_parameters * 8 / (1024 ** 2)
print(f"Estimated memory usage of AdamW: {estimated_memory:.0f} MB")
Estimated memory usage of AdamW: 502 MB

Trainer and Accelerator

Diagram showing the tradeoff between ability to customize and ease of use for Accelerator and Trainer.


Implement AdamW with Trainer

from torch.optim import AdamW
from transformers import Trainer

optimizer = AdamW(params=model.parameters())

trainer = Trainer(model=model,
                  args=training_args,
                  train_dataset=train_dataset,
                  eval_dataset=validation_dataset,
                  compute_metrics=compute_metrics,
                  optimizers=(optimizer, lr_scheduler))
trainer.train()
{'epoch': 1.0, 'eval_accuracy': 0.7, 'eval_f1': 0.8}
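The Trainer snippet assumes training_args, the datasets, compute_metrics, and lr_scheduler are defined earlier in the pipeline. Below is a minimal sketch of the two optimizer-related pieces, with placeholder values (output directory, step counts) chosen purely for illustration.

from transformers import TrainingArguments, get_scheduler

training_args = TrainingArguments(output_dir="./results", num_train_epochs=1)
# Any learning-rate scheduler works; a linear schedule is shown as an example
lr_scheduler = get_scheduler("linear", optimizer=optimizer,
                             num_warmup_steps=0, num_training_steps=1000)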

Implement AdamW with Accelerator

from torch.optim import AdamW

optimizer = AdamW(params=model.parameters())

for batch in train_dataloader:
    inputs, targets = batch["input_ids"], batch["labels"]
    outputs = model(inputs, labels=targets)
    loss = outputs.loss
    accelerator.backward(loss)
    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()
    print(f"Loss = {loss}")
Loss = 0.7
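This training loop assumes an Accelerator has already been created and that the model, optimizer, dataloader, and scheduler were passed through accelerator.prepare(); a minimal sketch of that assumed setup:

from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, lr_scheduler)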

Inspecting the optimizer state

optimizer_state = optimizer.state.values()
print(optimizer_state)
dict_values([{'step': tensor(3.),
 'exp_avg': tensor([[0., 0., 0., ..., 0., 0., 0.], ...]),
 'exp_avg_sq': tensor([[0., 0., 0., ..., 0., 0., 0.], ...])}, ...])
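As a quick aside (not shown on the slide), the state of a single parameter tensor can be listed to confirm it holds a scalar step counter plus two float32 tensors, which is where the 4 bytes per state figure comes from; this assumes at least one training step has run.

# Inspect the state kept for the first parameter tensor
first_param_state = next(iter(optimizer.state.values()))
for name, value in first_param_state.items():
    print(name, value.dtype, tuple(value.shape))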

Computing the optimizer size

import torch

def compute_optimizer_size(optimizer_state):
    total_size_megabytes, total_num_elements = 0, 0
    # Loop over the state of each parameter tensor (step, exp_avg, exp_avg_sq)
    for params in optimizer_state:
        for name, tensor in params.items():
            # Ensure the value is a tensor so numel() and element_size() are available
            tensor = torch.tensor(tensor)
            num_elements = tensor.numel()
            element_size = tensor.element_size()
            total_num_elements += num_elements
            total_size_megabytes += num_elements * element_size / (1024 ** 2)
    return total_size_megabytes, total_num_elements

Computing the optimizer size

total_size_megabytes, total_num_elements = \
    compute_optimizer_size(trainer.optimizer.state.values())
print(f"Number of optimizer parameters: {total_num_elements:,}")
Number of optimizer parameters: 131,566,188
print(f"Optimizer size: {total_size_megabytes:.0f} MB")
Optimizer size: 502 MB
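  • The measured values line up with the earlier estimate: 2 states * 65,783,042 parameters = 131,566,084 elements, and 8 bytes * 65,783,042 parameters ≈ 502 MB
  • The small surplus of 104 elements most likely comes from the scalar step counters, one per parameter tensor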

Let's practice!
