Standardizing features

Predicting CTR with Machine Learning in Python

Kevin Huo

Instructor

Why standardization is important

  • Standardization: ensuring your data fits assumptions that models have
  • Certain features may have too high variance, which might unfairly dominate models
  • Example: certain count have too large of a range of values due to one spam user
  • Does not apply to categorical variables such as site_id, app_id, device_id, etc.
Predicting CTR with Machine Learning in Python

Log normalization

df.var()
click                   1.294270e-01
hour                    1.123316e-01
df.var().median()
0.7108583771671939
print(df['click'].var())
df['device_id_count'] = df[
  'device_id_count'].apply(
  lambda x: np.log(x))
print(df['click'].var())
249362570.10134825
15.628476003312514
Predicting CTR with Machine Learning in Python

Scaling data

  • Standard scaling converts all features to have mean of 0 and standard deviation of 1

Example of standard scaling

  • Generally a good practice for machine learning models
Predicting CTR with Machine Learning in Python

How to standard scale data

  • Scaling can be done using StandardScaler() as follows:
scaler = StandardScaler()
X[numeric_cols] = scaler.fit_transform(X[numeric_cols])
dtype: float64
1    10.5 -> 0.85
2    32.3 -> 1.54
Predicting CTR with Machine Learning in Python

Let's practice!

Predicting CTR with Machine Learning in Python

Preparing Video For Download...