Learning rate and momentum

Introduction to Deep Learning with PyTorch

Jasmin Ludolf

Senior Data Science Content Developer, DataCamp

Updating weights with SGD

  • Training a neural network = solving an optimization problem.

Stochastic Gradient Descent (SGD) optimizer

sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.95)
  • Two key arguments:
    • learning rate (lr): controls the step size
    • momentum: adds inertia to help the optimizer avoid getting stuck in a local minimum
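A minimal, runnable sketch of one SGD update with the learning rate and momentum from the slide. The model, data, and loss function here are illustrative stand-ins, not from the course:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# A tiny regression model, just so there are parameters to optimize
# (hypothetical example; any model with parameters would do).
model = nn.Linear(4, 1)
sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.95)

inputs = torch.randn(8, 4)
targets = torch.randn(8, 1)
criterion = nn.MSELoss()

# One optimization step: forward pass, backward pass, parameter update
before = model.weight.detach().clone()
loss = criterion(model(inputs), targets)
sgd.zero_grad()   # clear gradients accumulated from any previous step
loss.backward()   # compute gradients of the loss w.r.t. the parameters
sgd.step()        # update parameters using the learning rate and momentum
after = model.weight.detach().clone()
```

Calling `zero_grad()` before each backward pass matters because PyTorch accumulates gradients by default.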

Impact of the learning rate: optimal learning rate

an example of an optimal learning rate

  • Step size decreases near zero as the gradient gets smaller
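The bullet above can be checked numerically: for a hypothetical loss $f(x) = x^2$, the gradient is $2x$, so each SGD step ($\text{lr} \times \text{gradient}$) shrinks as $x$ approaches the minimum at zero:

```python
# For f(x) = x**2 the gradient is 2*x, so each SGD step is lr * 2 * x.
# As x approaches the minimum at 0, the gradient (and the step) shrinks.
lr = 0.1
x = 2.0
steps = []
for _ in range(5):
    grad = 2 * x          # gradient of f at the current x
    step = lr * grad      # step size = learning rate * gradient
    steps.append(step)
    x -= step             # gradient descent update
print(steps)              # each step is smaller than the previous one
```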

Impact of the learning rate: small learning rate

an example of a small learning rate


Impact of the learning rate: high learning rate

an example of a high learning rate


Convex and non-convex functions

This is a convex function.

an example of a convex function

This is a non-convex function.

an example of a non-convex function

  • Loss functions in deep learning are typically non-convex, so the optimizer can get trapped in a local minimum

Without momentum

  • lr = 0.01, momentum = 0: after 100 steps, the minimum found is at x = -1.23 and y = -0.14

an example of the optimizer being stuck in a local minimum


With momentum

  • lr = 0.01, momentum = 0.9: after 100 steps, the minimum found is at x = 0.92 and y = -2.04

an example of optimization with momentum
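The contrast on these two slides can be reproduced on a simpler, one-dimensional stand-in (this function and starting point are illustrative assumptions, not the course's 2-D example): $f(x) = x^4 - 3x^2 + x$ has a local minimum near $x \approx 1.13$ and a lower, global minimum near $x \approx -1.30$.

```python
import torch

def run_sgd(momentum, steps=100, lr=0.01):
    # f(x) = x**4 - 3*x**2 + x: a hypothetical non-convex function with
    # a local minimum near x ~ 1.13 and a global minimum near x ~ -1.30.
    x = torch.tensor([2.0], requires_grad=True)  # start near the local minimum
    opt = torch.optim.SGD([x], lr=lr, momentum=momentum)
    for _ in range(steps):
        opt.zero_grad()
        loss = x**4 - 3 * x**2 + x
        loss.backward()
        opt.step()
    return x.item()

x_plain = run_sgd(momentum=0.0)  # settles in the nearby local minimum
x_mom = run_sgd(momentum=0.9)    # inertia carries it past the barrier
```

Without momentum, each step follows only the current gradient, so the optimizer stops in the first valley it reaches; with momentum, accumulated velocity carries it over the barrier into the deeper minimum.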


Summary

| Learning Rate | Momentum |
| --- | --- |
| Controls the step size | Controls the inertia |
| Too high → poor performance | Helps escape local minima |
| Too low → slow training | Too small → optimizer gets stuck |
| Typical range: 0.01 ($10^{-2}$) to 0.0001 ($10^{-4}$) | Typical range: 0.85 to 0.99 |

Let's practice!

