Feature Engineering for Machine Learning in Python
Robert O'Callaghan
Director of Data Science, Ordergroove
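import pandas as pd
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only
scaler = StandardScaler()
scaler.fit(train[['col']])
train['scaled_col'] = scaler.transform(train[['col']])
# FIT SOME MODEL
# ....
test = pd.read_csv('test_csv')
# Reuse the scaler fitted on train; do not refit it on test
test['scaled_col'] = scaler.transform(test[['col']])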
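A minimal runnable sketch of the same pattern follows. The two small DataFrames are hypothetical stand-ins for the train and test sets, not data from the course. The point it illustrates is that the scaler's mean and standard deviation come from the training data alone and are then reused, unchanged, on the test data.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the train and test sets
train = pd.DataFrame({'col': [1.0, 2.0, 3.0, 4.0, 5.0]})
test = pd.DataFrame({'col': [3.0, 10.0]})

# Learn the mean and standard deviation from the training data only
scaler = StandardScaler()
scaler.fit(train[['col']])

# Apply the same fitted parameters to both sets
train['scaled_col'] = scaler.transform(train[['col']]).ravel()
test['scaled_col'] = scaler.transform(test[['col']]).ravel()

# The stored mean is the training mean (3.0), so a test value of 3.0
# scales to 0 regardless of what else is in the test set
print(scaler.mean_)
print(test)
```

Because the test set is never passed to fit, nothing about its distribution can influence the transformation the model is trained on.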
# Compute the outlier thresholds from the training data only
train_mean = train['col'].mean()
train_std = train['col'].std()
cut_off = train_std * 3
train_lower = train_mean - cut_off
train_upper = train_mean + cut_off
# Subset train data
train = train[(train['col'] < train_upper) &
              (train['col'] > train_lower)]
test = pd.read_csv('test_csv')
# Subset test data
test = test[(test['col'] < train_upper) &
            (test['col'] > train_lower)]
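The outlier-removal steps above can be sketched as a small self-contained example. The helper name outlier_bounds and the toy data are assumptions made for illustration and are not part of the original slides. The bounds are computed once from the training column and then applied to both sets.

```python
import pandas as pd

def outlier_bounds(series, n_std=3):
    """Return (lower, upper) bounds n_std standard deviations from the mean."""
    mean = series.mean()
    cut_off = series.std() * n_std
    return mean - cut_off, mean + cut_off

# Hypothetical toy data standing in for the train and test sets
train = pd.DataFrame(
    {'col': [10.0, 11.0, 9.0, 10.0, 12.0, 8.0, 10.0, 11.0, 9.0, 10.0]}
)
test = pd.DataFrame({'col': [10.0, 50.0, 9.5]})

# Thresholds come from the training data only
lower, upper = outlier_bounds(train['col'])

# Apply the same thresholds to both sets
train = train[(train['col'] > lower) & (train['col'] < upper)]
test = test[(test['col'] > lower) & (test['col'] < upper)]

print(lower, upper)
print(test)
```

Selecting the column as a Series (train['col']) rather than a one-column DataFrame (train[['col']]) keeps the mask one-dimensional, so the filter drops rows instead of replacing values with NaN.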
Data leakage: building your model with information you won't have access to when assessing its performance on new data, such as statistics computed from the test set