Working with a forecast object

Designing Forecasting Pipelines for Production

Rami Krispin

Senior Manager, Data Science and Engineering

Work with a forecast object

  • Data preparation
  • Train and test multiple forecasting models
  • Evaluate the models' performance
  • The statsforecast package
    • mlforecast in the exercises
Designing Forecasting Pipelines for Production

Training Partition

Time Series

Testing Partition

Testing Partition

Designing Forecasting Pipelines for Production

Required libraries

import pandas as pd 
import datetime 
Designing Forecasting Pipelines for Production

Required libraries

from statsforecast import StatsForecast

from statsforecast.models import (
    DynamicOptimizedTheta,
    SeasonalNaive,
    AutoARIMA,
    HoltWinters,
    MSTL
)

from utilsforecast.plotting import plot_series
Designing Forecasting Pipelines for Production

The statsforecast data format

Data input format:

  1. unique_id - the series ID
  2. ds - the series timestamp
  3. y - the series values
Designing Forecasting Pipelines for Production

Data preparation - data load

ts = pd.read_csv("data/data.csv")
ts["ds"] = pd.to_datetime(ts["ds"])
ts = ts.sort_values("ds")
ts = ts[["unique_id", "ds", "y"]]

os.environ['NIXTLA_ID_AS_COL'] = '1'
 unique_id        ds             y
0    1    2022-11-11 23:00:00    456403
1    1    2022-11-12 00:00:00    458842
2    1    2022-11-12 01:00:00    455111
3    1    2022-11-12 02:00:00    448035
4    1    2022-11-12 03:00:00    438165
Designing Forecasting Pipelines for Production

Data preparation - train/test split

  • Split the time series into training and testing partitions
    • using datetime.timedelta
    • leave last 72 hours as test data
test_length = 72

train_end = end  - datetime.timedelta(hours = test_length)

train = ts[ts["ds"] <= train_end]
test = ts[ts["ds"] > train_end]
Designing Forecasting Pipelines for Production

Data preparation

plot_series(train, engine = "plotly")

Time Series

plot_series(test, engine = "plotly")

Testing Partition

Designing Forecasting Pipelines for Production

Forecasting with StatsModels

# Instantiate models with hourly seasonality
auto_arima = AutoARIMA()
s_naive = SeasonalNaive(season_length=24)
theta =  DynamicOptimizedTheta(season_length=24)

# Instantiate models with hourly and weekly seasonality mstl1 = MSTL(season_length=[24, 24 * 7], trend_forecaster=AutoARIMA(), alias="MSTL_ARIMA_trend") mstl2 = MSTL(season_length=[24, 24 * 7], trend_forecaster=HoltWinters(), alias="MSTL_Holt_trend")
Designing Forecasting Pipelines for Production

Forecasting with StatsModels

# Combine models into a list
stats_models = [auto_arima, s_naive, theta, mstl1, mstl2]

# Instantiate the model object sf = StatsForecast( models=stats_models, freq="h", fallback_model = AutoARIMA(), n_jobs= -1)
Designing Forecasting Pipelines for Production

Forecasting with StatsModels

# Combine models into a list
stats_models = [auto_arima, s_naive, theta, mstl1, mstl2]

# Instantiate the model object sf = StatsForecast( models=stats_models, freq="h", fallback_model = AutoARIMA(), n_jobs= -1)
# Create the forecast forecast_stats = sf.forecast(df=train, h=72, level=[95])
Designing Forecasting Pipelines for Production

Forecast Forecasting with StatsModels StatsModels

p = sf.plot(test, forecast_stats, engine = "plotly", level=[95])
p.update_layout(height=400)

Forecast Plot

Designing Forecasting Pipelines for Production

Model evaluation

def mape(y, yhat):
    mape = mean(abs(y - yhat)/ y) 
    return mape

def rmse(y, yhat):
    rmse = (mean((y - yhat) ** 2 )) ** 0.5
    return rmse

def coverage(y, lower, upper):
    coverage = sum((y <= upper) & (y >= lower)) / len(y)
    return coverage
Designing Forecasting Pipelines for Production

Model evaluation

fc = forecast_stats.merge(test, how="left", on="ds")
fc_performance = None

for i in [str(m) for m in stats_models]:
    m = mape(y = fc["y"], yhat = fc[i])
    r = rmse(y = fc["y"], yhat = fc[i])
    c = coverage(y = fc["y"], lower = fc[i + "-lo-95"], upper = fc[i + "-hi-95"])

perf = {"model": i, "mape": m, "rmse": r, "coverage": c} if fc_performance is None: fc_performance = pd.DataFrame([perf]) else: fc_performance = pd.concat([fc_performance, pd.DataFrame([perf])]) fc_performance.reset_index(drop=True, inplace=True)
Designing Forecasting Pipelines for Production

Model evaluation

print(fc_performance.sort_values("rmse"))
     model                    mape        rmse            coverage
1    SeasonalNaive            0.041340    22410.483959    1.000000
4    MSTL_Holt_trend          0.053939    30201.299395    0.513889
3    MSTL_ARIMA_trend         0.053771    30428.240235    0.666667
2    DynamicOptimizedTheta    0.090566    45189.869437    0.916667
0    AutoARIMA                0.149790    70483.617198    1.000000
Designing Forecasting Pipelines for Production

Model evaluation

print(fc_performance.sort_values("rmse"))
     model                    mape        rmse            coverage
1    SeasonalNaive            0.041340    22410.483959    1.000000
4    MSTL_Holt_trend          0.053939    30201.299395    0.513889
3    MSTL_ARIMA_trend         0.053771    30428.240235    0.666667
2    DynamicOptimizedTheta    0.090566    45189.869437    0.916667
0    AutoARIMA                0.149790    70483.617198    1.000000

Can we improve the results?

  • Different settings for better accuracy
  • Experiments come in handy
Designing Forecasting Pipelines for Production

Let's practice!

Designing Forecasting Pipelines for Production

Preparing Video For Download...