Learning rate and momentum

Introduction to Deep Learning with PyTorch

Jasmin Ludolf

Senior Data Science Content Developer, DataCamp

Updating weights with SGD

  • Training a neural network = solving an optimization problem.

Stochastic Gradient Descent (SGD) optimizer

import torch.optim as optim
sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.95)
  • Two key arguments:
    • learning rate (lr): controls the size of each update step
    • momentum: adds inertia so the optimizer can escape local minima instead of getting stuck (see the training-step sketch below)
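
A minimal sketch of how this optimizer is typically used in one training step; the model, data, and loss function below are placeholder assumptions for illustration, not part of the course code:

import torch
import torch.nn as nn
import torch.optim as optim

# Placeholder model and data, assumed only for this sketch
model = nn.Linear(4, 1)
inputs = torch.randn(8, 4)
targets = torch.randn(8, 1)
criterion = nn.MSELoss()

sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.95)

# One training step: clear old gradients, compute the loss,
# backpropagate, then let SGD update the weights
sgd.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
sgd.step()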

Impact of the learning rate: optimal learning rate

an example of an optimal learning rate

  • Step sizes shrink toward zero as the gradient gets smaller near the minimum (see the sketch below)
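
A small sketch of this effect, assuming a simple quadratic loss rather than the course's example: with a reasonable learning rate, the updates shrink as the gradient approaches zero near the minimum.

import torch

x = torch.tensor([4.0], requires_grad=True)
opt = torch.optim.SGD([x], lr=0.1)

for step in range(5):
    opt.zero_grad()
    loss = (x ** 2).sum()   # convex function with its minimum at x = 0
    loss.backward()         # gradient is 2 * x, so it shrinks as x nears 0
    prev = x.item()
    opt.step()
    print(f"step {step}: x = {x.item():.3f}, step size = {abs(x.item() - prev):.3f}")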

Impact of the learning rate: small learning rate

an example of a small learning rate


Impact of the learning rate: high learning rate

an example of a high learning rate
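
A sketch contrasting the two failure modes on the same assumed quadratic: a very small learning rate barely moves toward the minimum, while a very large one overshoots and diverges.

import torch

def run_sgd(lr, steps=20):
    x = torch.tensor([4.0], requires_grad=True)
    opt = torch.optim.SGD([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = (x ** 2).sum()
        loss.backward()
        opt.step()
    return x.item()

print(run_sgd(lr=0.001))  # too low: after 20 steps x is still close to its start of 4.0
print(run_sgd(lr=1.5))    # too high: each step overshoots, so x grows instead of converging to 0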


Convex and non-convex functions

This is a convex function.

an example of a convex function

This is a non-convex function.

an example of a non-convex function

  • Loss functions in deep learning are typically non-convex, so the optimizer can end up in a local minimum (see the sketch below)
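
A short sketch of the difference, using illustrative functions that are assumptions rather than the course's examples: a convex function has a single minimum, while a non-convex one has several, so plain gradient descent can land in different minima depending on where it starts.

import torch

def convex(x):
    return x ** 2                    # single global minimum at x = 0

def non_convex(x):
    return x ** 4 - 4 * x ** 2 + x   # two minima: a deep one near x ≈ -1.47, a shallow one near x ≈ 1.35

# The convex case always reaches the same minimum; the non-convex case
# depends on the starting point
for fn in (convex, non_convex):
    for start in (-2.0, 2.0):
        x = torch.tensor([start], requires_grad=True)
        opt = torch.optim.SGD([x], lr=0.01)
        for _ in range(200):
            opt.zero_grad()
            loss = fn(x).sum()
            loss.backward()
            opt.step()
        print(f"{fn.__name__}, start {start:+.1f} -> ends near x = {x.item():.2f}")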

Without momentum

  • lr = 0.01, momentum = 0: after 100 steps, the minimum found is at x = -1.23 and y = -0.14

an example of the optimizer being stuck in a local minimum


With momentum

  • lr = 0.01, momentum = 0.9: after 100 steps, the minimum found is at x = 0.92 and y = -2.04 (see the sketch below)

an example of optimization with momentum
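
The course's two-dimensional function isn't reproduced here; as a sketch of the same idea on an assumed one-dimensional non-convex function, SGD without momentum settles in the nearest (shallow) minimum, while momentum builds up enough inertia to carry the parameter into the deeper one.

import torch

def non_convex(x):
    # Assumed illustrative function: shallow minimum near x ≈ 1.35, deeper one near x ≈ -1.47
    return x ** 4 - 4 * x ** 2 + x

def minimize(momentum, lr=0.01, steps=100, start=2.5):
    x = torch.tensor([start], requires_grad=True)
    opt = torch.optim.SGD([x], lr=lr, momentum=momentum)
    for _ in range(steps):
        opt.zero_grad()
        loss = non_convex(x).sum()
        loss.backward()
        opt.step()
    return x.item()

print(minimize(momentum=0.0))  # stays in the shallow local minimum near x ≈ 1.35
print(minimize(momentum=0.9))  # inertia carries it past the barrier into the deeper minimum near x ≈ -1.47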


Summary

  • Learning rate
    • Controls the step size
    • Too high → poor performance; too low → slow training
    • Typical range: 0.0001 ($10^{-4}$) to 0.01 ($10^{-2}$)
  • Momentum
    • Controls the inertia
    • Helps escape local minima; too small → the optimizer gets stuck
    • Typical range: 0.85 to 0.99
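
For reference, a simplified form of the update rule behind SGD with momentum (ignoring dampening and weight decay), where the running velocity $v_t$ provides the inertia:

$$
v_t = \mu \, v_{t-1} + g_t, \qquad \theta_t = \theta_{t-1} - \text{lr} \cdot v_t
$$

Here $g_t$ is the current gradient, $\mu$ the momentum, and lr the learning rate; with $\mu = 0$ this reduces to plain SGD.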

Let's practice!

