Scaling and transforming new data

Feature Engineering for Machine Learning in Python

Robert O'Callaghan

Director of Data Science, Ordergroove

Reuse training scalers

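import pandas as pd
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only
scaler = StandardScaler()
scaler.fit(train[['col']])

train['scaled_col'] = scaler.transform(train[['col']])

# FIT SOME MODEL
# ....

test = pd.read_csv('test_csv')

# Reuse the scaler fitted on train; do not refit it on test
test['scaled_col'] = scaler.transform(test[['col']])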
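A minimal, self-contained sketch of the same pattern may help make it concrete. The two small DataFrames below are hypothetical stand-ins for the real train and test sets; the point is only that the mean and standard deviation come from the training data, so a test value far outside the training range gets a large scaled value rather than being pulled back toward zero.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the real train and test sets
train = pd.DataFrame({'col': [1.0, 2.0, 3.0, 4.0, 5.0]})
test = pd.DataFrame({'col': [3.0, 10.0]})

# Learn the mean and standard deviation from the training data only
scaler = StandardScaler()
scaler.fit(train[['col']])

# Apply the already-fitted scaler to both sets
train['scaled_col'] = scaler.transform(train[['col']]).ravel()
test['scaled_col'] = scaler.transform(test[['col']]).ravel()

# The test value equal to the training mean (3.0) scales to 0,
# while 10.0 lands well above every scaled training value
print(test)
```

Calling `scaler.fit` again on the test set would instead center the test data on its own mean, and the model would then see inputs on a different scale from the one it was trained on.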

Training transformations for reuse

# Compute the statistics on the training data only
train_mean = train['col'].mean()
train_std = train['col'].std()

cut_off = train_std * 3
train_lower = train_mean - cut_off
train_upper = train_mean + cut_off

# Subset train data
train = train[(train['col'] < train_upper) &
              (train['col'] > train_lower)]

test = pd.read_csv('test_csv')

# Subset test data using the training thresholds
test = test[(test['col'] < train_upper) &
            (test['col'] > train_lower)]
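As a rough illustration of the outlier filter above, here is a self-contained sketch on made-up numbers. The DataFrames and the `test_filtered` name are assumptions for the example only. It selects the column as a Series (`train['col']`) so that the mean, the standard deviation, and the comparisons are all scalar or row-wise, which is what boolean row filtering needs.

```python
import pandas as pd

# Hypothetical toy data standing in for the real train and test sets
train = pd.DataFrame(
    {'col': [10.0, 11.0, 9.0, 10.0, 12.0, 8.0, 10.0, 11.0, 9.0, 10.0]}
)
test = pd.DataFrame({'col': [10.0, 50.0, 9.5]})

# Thresholds are derived from the training data only
train_mean = train['col'].mean()
train_std = train['col'].std()

cut_off = train_std * 3
train_lower = train_mean - cut_off
train_upper = train_mean + cut_off

# Keep only the test rows that fall inside the training thresholds
test_filtered = test[(test['col'] < train_upper) &
                     (test['col'] > train_lower)]

print(test_filtered)
```

With these numbers the value 50.0 falls outside three training standard deviations and is dropped, while the other two test rows are kept.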

Why only use training data?

 

Data leakage: using information to build or tune your model that you won't have access to when it makes predictions on new data, so its assessed performance looks better than it will be in practice.
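One way to see the effect is to compare scaling statistics computed on the training data alone with statistics computed on train and test together. The sketch below uses made-up values and hypothetical variable names (`leaky_mean`, `test_scaled_leaky`, and so on); it is an illustration of the idea rather than a recipe.

```python
import pandas as pd

# Hypothetical toy data: the test set sits far from the training range
train = pd.DataFrame({'col': [1.0, 2.0, 3.0, 4.0, 5.0]})
test = pd.DataFrame({'col': [20.0, 30.0]})

# Correct: statistics come from the training data only
train_mean = train['col'].mean()
train_std = train['col'].std()

# Leaky: statistics are computed on train and test together,
# so information from the test set shapes the transformation
combined = pd.concat([train, test], ignore_index=True)
leaky_mean = combined['col'].mean()
leaky_std = combined['col'].std()

test_scaled_ok = (test['col'] - train_mean) / train_std
test_scaled_leaky = (test['col'] - leaky_mean) / leaky_std

print(pd.DataFrame({'train_only': test_scaled_ok,
                    'leaky': test_scaled_leaky}))
```

In the leaky version the test values look far less extreme, because the statistics have already absorbed them. That comfort is not available in production, where new data arrives after the transformation has been fixed.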

Avoid data leakage!
