Stationarity and stability

Machine Learning for Time Series Data in Python

Chris Holdgraf

Fellow, Berkeley Institute for Data Science

Stationarity

  • Stationary time series do not change their statistical properties over time
  • E.g., mean, standard deviation, trends
  • Most time series are non-stationary to some extent
Machine Learning for Time Series Data in Python

Machine Learning for Time Series Data in Python

Model stability

  • Non-stationary data results in variability in our model
  • The statistical properties the model finds may change with the data
  • In addition, we will be less certain about the correct values of model parameters
  • How can we quantify this?
Machine Learning for Time Series Data in Python

Cross validation to quantify parameter stability

  • One approach: use cross-validation
  • Calculate model parameters on each iteration
  • Assess parameter stability across all CV splits
Machine Learning for Time Series Data in Python

Bootstrapping the mean

  • Bootstrapping is a common way to assess variability
  • The bootstrap:
    1. Take a random sample of data with replacement
    2. Calculate the mean of the sample
    3. Repeat this process many times (1000s)
    4. Calculate the percentiles of the result (usually 2.5, 97.5)

The result is a 95% confidence interval of the mean of each coefficient.

Machine Learning for Time Series Data in Python

Bootstrapping the mean

from sklearn.utils import resample

# cv_coefficients has shape (n_cv_folds, n_coefficients)
n_boots = 100
bootstrap_means = np.zeros(n_boots, n_coefficients)
for ii in range(n_boots):
    # Generate random indices for our data with replacement, 
    # then take the sample mean
    random_sample = resample(cv_coefficients)
    bootstrap_means[ii] = random_sample.mean(axis=0)

# Compute the percentiles of choice for the bootstrapped means
percentiles = np.percentile(bootstrap_means, (2.5, 97.5), axis=0)
Machine Learning for Time Series Data in Python

Plotting the bootstrapped coefficients

fig, ax = plt.subplots()
ax.scatter(many_shifts.columns, percentiles[0], marker='_', s=200)
ax.scatter(many_shifts.columns, percentiles[1], marker='_', s=200)

Machine Learning for Time Series Data in Python

Assessing model performance stability

  • If using the TimeSeriesSplit, can plot the model's score over time.
  • This is useful in finding certain regions of time that hurt the score
  • Also useful to find non-stationary signals
Machine Learning for Time Series Data in Python

Model performance over time

def my_corrcoef(est, X, y):
    """Return the correlation coefficient 
    between model predictions and a validation set."""
    return np.corrcoef(y, est.predict(X))[1, 0]

# Grab the date of the first index of each validation set
first_indices = [data.index[tt[0]] for tr, tt in cv.split(X, y)]

# Calculate the CV scores and convert to a Pandas Series
cv_scores = cross_val_score(model, X, y, cv=cv, scoring=my_corrcoef)
cv_scores = pd.Series(cv_scores, index=first_indices)
Machine Learning for Time Series Data in Python

Visualizing model scores as a timeseries

fig, axs = plt.subplots(2, 1, figsize=(10, 5), sharex=True)

# Calculate a rolling mean of scores over time
cv_scores_mean = cv_scores.rolling(10, min_periods=1).mean()
cv_scores.plot(ax=axs[0])
axs[0].set(title='Validation scores (correlation)', ylim=[0, 1])

# Plot the raw data
data.plot(ax=axs[1])
axs[1].set(title='Validation data')
Machine Learning for Time Series Data in Python

Visualizing model scores

Machine Learning for Time Series Data in Python

Fixed windows with time series cross-validation

# Only keep the last 100 datapoints in the training data
window = 100

# Initialize the CV with this window size
cv = TimeSeriesSplit(n_splits=10, max_train_size=window)
Machine Learning for Time Series Data in Python

Non-stationary signals

Machine Learning for Time Series Data in Python

Let's practice!

Machine Learning for Time Series Data in Python

Preparing Video For Download...