Dataset shift

Designing Machine Learning Workflows in Python

Dr. Chris Anagnostopoulos

Honorary Associate Professor

What is dataset shift?

elec dataset:

  • 2 years' worth of data.
  • class=1 means the price went up relative to the last 24 hours; class=0 means it went down.
   day    period  nswprice  ...    vicdemand  transfer  class
0    2  0.000000  0.056443  ...     0.422915  0.414912      1
1    2  0.553191  0.042482  ...     0.422915  0.414912      0
2    2  0.574468  0.044374  ...     0.422915  0.414912      1

[3 rows x 8 columns]

What is shifting exactly?

Two classes in two dimensions separated by a curved decision boundary.

What is shifting exactly?

Two classes in two dimensions separated by a curved decision boundary, and alongside that another image with similarly positioned classes, but a differently shaped decision boundary.
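
One common way to formalise this picture, using notation that does not appear on the slides (an assumption for illustration): write the joint distribution of features x and label y at time t as a product, and take the decision boundary to be the set of points where both classes are equally likely.

```latex
P_t(x, y) = P_t(y \mid x)\, P_t(x),
\qquad
B_t = \{\, x : P_t(y = 1 \mid x) = \tfrac{1}{2} \,\}
```

Under this reading, dataset shift between two times s and t means P_s(x, y) differs from P_t(x, y). When the points sit in similar places but the boundary moves from B_s to B_t, as in the two images, the change is in the conditional P(y | x) rather than only in P(x).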

Windows

Sliding window

# Label-based .loc slicing includes both endpoints
sliding_window = elec.loc[(t_now - window_size + 1):t_now]

A dataset where only recent datapoints are highlighted.

Expanding window

# Everything from the first row up to and including t_now
expanding_window = elec.loc[0:t_now]

A dataset shown in a terminal.
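
The two window types can be sketched on a toy frame. This is a minimal illustration, not the course's code: the `elec` frame here is a hypothetical 10-row stand-in with a default integer index, and `t_now` and `window_size` are chosen only to keep the output small.

```python
import pandas as pd

# Hypothetical stand-in for the elec data: 10 rows on a default integer index.
elec = pd.DataFrame({
    "nswprice": [0.1 * i for i in range(10)],
    "class": [i % 2 for i in range(10)],
})

t_now = 7        # index of the most recent observation available
window_size = 3  # number of most recent rows to keep

# Sliding window: the last `window_size` rows up to and including t_now.
sliding_window = elec.loc[t_now - window_size + 1:t_now]

# Expanding window: everything from the start up to and including t_now.
expanding_window = elec.loc[0:t_now]

print(sliding_window.index.tolist())    # [5, 6, 7]
print(expanding_window.index.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Because `.loc` slices by label and includes both endpoints, the sliding window starts at `t_now - window_size + 1` so that it holds exactly `window_size` rows.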

Dataset shift detection

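from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# t_now = 40000, window_size = 20000
# X, y are assumed to hold all rows up to t_now; sliding_X, sliding_y only the window
clf_full = RandomForestClassifier().fit(X, y)
clf_sliding = RandomForestClassifier().fit(sliding_X, sliding_y)

# Use future data (strictly after t_now) as test
test = elec.loc[t_now + 1:]
test_X = test.drop(columns='class')
test_y = test['class']
print(roc_auc_score(test_y, clf_full.predict(test_X)))
print(roc_auc_score(test_y, clf_sliding.predict(test_X)))
0.775
0.780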
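
The comparison above can be reproduced end to end on synthetic data. The sketch below is an illustration under stated assumptions, not the elec experiment: the `data` frame, its abrupt shift at row 2000, the helper `fit_and_score`, and the values of `t_now` and `window_size` are all hypothetical, and the AUC is computed from predicted probabilities rather than hard labels.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 4000
data = pd.DataFrame({"x0": rng.normal(size=n), "x1": rng.normal(size=n)})

# Synthetic shift (an assumption for illustration): the label depends on x0
# for the first half of the stream and on x1 afterwards.
first_half = np.arange(n) < n // 2
data["class"] = np.where(first_half, data["x0"] > 0, data["x1"] > 0).astype(int)

t_now, window_size = 3000, 1000


def fit_and_score(train, test):
    """Fit a random forest on `train` and return its AUC on `test`."""
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(train.drop(columns="class"), train["class"])
    scores = clf.predict_proba(test.drop(columns="class"))[:, 1]
    return float(roc_auc_score(test["class"], scores))


# Future data, strictly after t_now, serves as the test set.
test = data.loc[t_now + 1:]
auc_full = fit_and_score(data.loc[0:t_now], test)
auc_sliding = fit_and_score(data.loc[t_now - window_size + 1:t_now], test)

print(f"full: {auc_full:.3f}  sliding: {auc_sliding:.3f}")
```

In this setup the sliding window contains only rows generated after the shift, so a gap between the two scores is a signal that older data no longer describes the current relationship between features and labels.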

Window size

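from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import GaussianNB

for w_size in range(10, 100, 10):
    # Rows in the sliding window ending at t_now (inclusive)
    sliding = arrh.loc[(t_now - w_size + 1):t_now]
    X = sliding.drop(columns='class')
    y = sliding['class']
    clf = GaussianNB()
    clf.fit(X, y)
    preds = clf.predict(test_X)
    # Report the AUC for this window size
    print(w_size, roc_auc_score(test_y, preds))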

AUC plotted against window size: it rises to a maximum of 0.83 at a window of 50, then decreases again.
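
A window-size search like this one can keep its scores and pick the best candidate programmatically. The following is a minimal sketch on synthetic data: the `stream` frame, its shift at row 2000, the candidate sizes, and the name `auc_by_window` are assumptions made for illustration, and probabilities rather than hard labels are passed to the AUC.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)
n = 4000
stream = pd.DataFrame({"x0": rng.normal(size=n), "x1": rng.normal(size=n)})

# Hypothetical drift: labels follow x0 before row 2000 and x1 from then on.
old_regime = np.arange(n) < 2000
stream["class"] = np.where(old_regime, stream["x0"] > 0, stream["x1"] > 0).astype(int)

t_now = 3000
test = stream.loc[t_now + 1:]
test_X, test_y = test.drop(columns="class"), test["class"]

auc_by_window = {}
for w_size in (250, 500, 1000, 2000, 3000):
    # Sliding window of w_size rows ending at t_now (inclusive).
    sliding = stream.loc[t_now - w_size + 1:t_now]
    clf = GaussianNB().fit(sliding.drop(columns="class"), sliding["class"])
    # Score with the predicted probability of class 1.
    proba = clf.predict_proba(test_X)[:, 1]
    auc_by_window[w_size] = float(roc_auc_score(test_y, proba))

# Candidate with the highest AUC on the future data.
best_w = max(auc_by_window, key=auc_by_window.get)
print(auc_by_window)
print("best window size:", best_w)
```

Storing the scores in a dictionary keyed by window size makes it straightforward to plot AUC against window size or to compare the best candidate against the largest window.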

Domain shift

arrhythmia dataset:

   age  sex  height  ...    chV6_TwaveAmp  chV6_QRSA  chV6_QRSTA  class
0   75    0     190  ...              2.9       23.3        49.4      0
1   56    1     165  ...              2.1       20.4        38.8      0
2   54    0     172  ...              3.4       12.3        49.0      0
3   55    0     175  ...              2.6       34.6        61.6      1
4   75    0     190  ...              3.9       25.4        62.8      0

[5 rows x 280 columns]

More data is not always better!
