Introduction to Deep Learning with PyTorch
Jasmin Ludolf
Senior Data Science Content Developer, DataCamp
Stochastic Gradient Descent (SGD) optimizer
import torch.optim as optim  # required import for optim.SGD
sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.95)
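For context, a minimal sketch of how such an optimizer fits into one training step; the model, loss, and data below are illustrative placeholders rather than the course's code:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Placeholder model and data (illustrative, not from the slides)
model = nn.Linear(4, 2)
features = torch.randn(8, 4)
targets = torch.randn(8, 2)

sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.95)
criterion = nn.MSELoss()

sgd.zero_grad()                              # clear gradients from the previous step
loss = criterion(model(features), targets)   # forward pass and loss
loss.backward()                              # backpropagate to compute gradients
sgd.step()                                   # update weights using lr and momentum
```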

This is a convex function: it has a single global minimum, so gradient descent can reach it even without momentum.

This is a non-convex function: it has multiple local minima, so an optimizer without momentum can get trapped in one of them before reaching the global minimum.

- With `lr = 0.01` and `momentum = 0`: after 100 steps, the minimum found is at x = -1.23, y = -0.14
- With `lr = 0.01` and `momentum = 0.9`: after 100 steps, the minimum found is at x = 0.92, y = -2.04
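A sketch of this kind of comparison is below. The slides do not show the actual function being minimized, so `f` here is a made-up non-convex surface and the coordinates it finds will not match the numbers above; the point is only that the two runs differ solely in momentum:

```python
import torch

# A made-up non-convex surface with several local minima (illustrative only)
def f(x, y):
    return torch.sin(3 * x) + 0.5 * x**2 + torch.sin(3 * y) + 0.5 * y**2

def run_sgd(momentum, lr=0.01, steps=100):
    # Start both runs from the same point so momentum is the only difference
    params = torch.tensor([1.0, 1.0], requires_grad=True)
    optimizer = torch.optim.SGD([params], lr=lr, momentum=momentum)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = f(params[0], params[1])
        loss.backward()
        optimizer.step()
    return params.detach()

print(run_sgd(momentum=0.0))  # tends to settle in the nearest local minimum
print(run_sgd(momentum=0.9))  # inertia can carry it past shallow local minima
```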
| Learning Rate | Momentum |
|---|---|
| Controls the step size | Controls the inertia |
| Too high → poor performance | Helps escape local minima |
| Too low → slow training | Too small → optimizer can get stuck in a local minimum |
| Typical range: 0.0001 ($10^{-4}$) to 0.01 ($10^{-2}$) | Typical range: 0.85 to 0.99 |
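As a sketch, choosing values inside these typical ranges could look like this (the specific numbers and the placeholder model are illustrative assumptions, not a recommendation from the course):

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 2)  # placeholder model

optimizer = optim.SGD(
    model.parameters(),
    lr=0.001,      # within the typical 1e-4 to 1e-2 range
    momentum=0.9,  # within the typical 0.85 to 0.99 range
)
```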