Learning rate and momentum

Introduction to Deep Learning with PyTorch

Jasmin Ludolf

Senior Data Science Content Developer, DataCamp

Updating weights with SGD

  • Training a neural network = solving an optimization problem.

Stochastic Gradient Descent (SGD) optimizer

import torch.optim as optim
sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.95)
  • Two key arguments:
    • learning rate (lr): controls the size of each update step
    • momentum: adds inertia so the optimizer can escape local minima instead of getting stuck (see the training-step sketch below)
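
A minimal sketch of how this optimizer is typically used in one training step; the model, data, and loss function below are placeholder assumptions for illustration, not part of the course code:

import torch
import torch.nn as nn
import torch.optim as optim

# Placeholder model and data, assumed only for this sketch
model = nn.Linear(4, 1)
inputs = torch.randn(8, 4)
targets = torch.randn(8, 1)
criterion = nn.MSELoss()

sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.95)

# One training step: clear old gradients, compute the loss,
# backpropagate, then let SGD update the weights
sgd.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
sgd.step()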

Impact of the learning rate: optimal learning rate

an example of an optimal learning rate

  • Step sizes shrink toward zero as the gradient gets smaller near the minimum (see the sketch below)
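
A small sketch of this effect, assuming a simple quadratic loss rather than the course's example: with a reasonable learning rate, the updates shrink as the gradient approaches zero near the minimum.

import torch

x = torch.tensor([4.0], requires_grad=True)
opt = torch.optim.SGD([x], lr=0.1)

for step in range(5):
    opt.zero_grad()
    loss = (x ** 2).sum()   # convex function with its minimum at x = 0
    loss.backward()         # gradient is 2 * x, so it shrinks as x nears 0
    prev = x.item()
    opt.step()
    print(f"step {step}: x = {x.item():.3f}, step size = {abs(x.item() - prev):.3f}")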

Impact of the learning rate: small learning rate

an example of a small learning rate


Impact of the learning rate: high learning rate

an example of a high learning rate
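
A sketch contrasting the two failure modes on the same assumed quadratic: a very small learning rate barely moves toward the minimum, while a very large one overshoots and diverges.

import torch

def run_sgd(lr, steps=20):
    x = torch.tensor([4.0], requires_grad=True)
    opt = torch.optim.SGD([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = (x ** 2).sum()
        loss.backward()
        opt.step()
    return x.item()

print(run_sgd(lr=0.001))  # too low: after 20 steps x is still close to its start of 4.0
print(run_sgd(lr=1.5))    # too high: each step overshoots, so x grows instead of converging to 0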


Convex and non-convex functions

This is a convex function.

an example of a convex function

This is a non-convex function.

an example of a non-convex function

  • Loss functions in deep learning are typically non-convex, so the optimizer can end up in a local minimum (see the sketch below)
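
A short sketch of the difference, using illustrative functions that are assumptions rather than the course's examples: a convex function has a single minimum, while a non-convex one has several, so plain gradient descent can land in different minima depending on where it starts.

import torch

def convex(x):
    return x ** 2                    # single global minimum at x = 0

def non_convex(x):
    return x ** 4 - 4 * x ** 2 + x   # two minima: a deep one near x ≈ -1.47, a shallow one near x ≈ 1.35

# The convex case always reaches the same minimum; the non-convex case
# depends on the starting point
for fn in (convex, non_convex):
    for start in (-2.0, 2.0):
        x = torch.tensor([start], requires_grad=True)
        opt = torch.optim.SGD([x], lr=0.01)
        for _ in range(200):
            opt.zero_grad()
            loss = fn(x).sum()
            loss.backward()
            opt.step()
        print(f"{fn.__name__}, start {start:+.1f} -> ends near x = {x.item():.2f}")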

Without momentum

  • lr = 0.01, momentum = 0: after 100 steps, the minimum found is at x = -1.23 and y = -0.14

an example of the optimizer being stuck in a local minimum


With momentum

  • lr = 0.01, momentum = 0.9: after 100 steps, the minimum found is at x = 0.92 and y = -2.04 (see the sketch below)

an example of optimization with momentum
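
The course's two-dimensional function isn't reproduced here; as a sketch of the same idea on an assumed one-dimensional non-convex function, SGD without momentum settles in the nearest (shallow) minimum, while momentum builds up enough inertia to carry the parameter into the deeper one.

import torch

def non_convex(x):
    # Assumed illustrative function: shallow minimum near x ≈ 1.35, deeper one near x ≈ -1.47
    return x ** 4 - 4 * x ** 2 + x

def minimize(momentum, lr=0.01, steps=100, start=2.5):
    x = torch.tensor([start], requires_grad=True)
    opt = torch.optim.SGD([x], lr=lr, momentum=momentum)
    for _ in range(steps):
        opt.zero_grad()
        loss = non_convex(x).sum()
        loss.backward()
        opt.step()
    return x.item()

print(minimize(momentum=0.0))  # stays in the shallow local minimum near x ≈ 1.35
print(minimize(momentum=0.9))  # inertia carries it past the barrier into the deeper minimum near x ≈ -1.47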


Summary

  • Learning rate
    • Controls the step size
    • Too high → poor performance; too low → slow training
    • Typical range: 0.0001 ($10^{-4}$) to 0.01 ($10^{-2}$)
  • Momentum
    • Controls the inertia
    • Helps escape local minima; too small → the optimizer gets stuck
    • Typical range: 0.85 to 0.99
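
For reference, a simplified form of the update rule behind SGD with momentum (ignoring dampening and weight decay), where the running velocity $v_t$ provides the inertia:

$$
v_t = \mu \, v_{t-1} + g_t, \qquad \theta_t = \theta_{t-1} - \text{lr} \cdot v_t
$$

Here $g_t$ is the current gradient, $\mu$ the momentum, and lr the learning rate; with $\mu = 0$ this reduces to plain SGD.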

Let's practice!

