Preparation for modeling

Machine Learning for Marketing in Python

Karolis Urbonas

Head of Analytics & Science, Amazon

Data sample

telco_raw.head()

Telelcom data header

Data types

telco_raw.dtypes

customerID           object
gender               object
SeniorCitizen        object
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object

OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges        float64
Churn                object

Separate categorical and numerical columns

Separate the identifier and target variable names as lists

custid = ['customerID']
target = ['Churn']

Separate categorical and numeric column names as lists

categorical = telco_raw.nunique()[telcom.nunique()<10].keys().tolist()

categorical.remove(target[0])

numerical = [col for col in telco_raw.columns 
                     if col not in custid+target+categorical]

One-hot encoding

This is a typical categorical data type column

Color
Red
White
Blue
Red

One-hot encoding result

And this is how it looks when we transform it with one-hot encoding.

Color		Red	White	Blue
Red	---------->	1	0	0
White	---------->	0	1	0
Blue	---------->	0	0	1
Red	---------->	1	0	0

One-hot encoding categorical variables

telco_raw = pd.get_dummies(data=telco_raw, columns=categorical, drop_first=True)

Scaling numerical features

# Import StandardScaler library
from sklearn.preprocessing import StandardScaler

# Initialize StandardScaler instance
scaler = StandardScaler()

# Fit the scaler to numerical columns
scaled_numerical = scaler.fit_transform(telco_raw[numerical])

# Build a DataFrame
scaled_numerical = pd.DataFrame(scaled_numerical, columns=numerical)

Bringing it all together

# Drop non-scaled numerical columns 
telco_raw = telco_raw.drop(columns=numerical, axis=1)

# Merge the non-numerical with the scaled numerical data
telco = telco_raw.merge(right=scaled_numerical,
                        how='left',
                        left_index=True, 
                        right_index=True
                        )

Let's practice pre-processing data!

Machine Learning for Marketing in Python