Preparation for modeling

Machine Learning for Marketing in Python

Karolis Urbonas

Head of Analytics & Science, Amazon

Data sample

telco_raw.head()

Telecom data header

Data types

telco_raw.dtypes

customerID           object
gender               object
SeniorCitizen        object
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges        float64
Churn                object

Separate categorical and numerical columns

Store the identifier and target column names as lists

custid = ['customerID']
target = ['Churn']

Store the categorical and numerical column names as lists (columns with fewer than 10 unique values are treated as categorical)

categorical = telco_raw.nunique()[telco_raw.nunique()<10].keys().tolist()

categorical.remove(target[0])

numerical = [col for col in telco_raw.columns 
                     if col not in custid+target+categorical]
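
To make the split concrete, here is a minimal sketch on a made-up frame. The `toy` DataFrame and its values are hypothetical stand-ins for `telco_raw`, and the threshold is lowered from 10 to 3 only because the toy data has four rows.

```python
import pandas as pd

# Hypothetical toy data standing in for telco_raw (values are made up)
toy = pd.DataFrame({
    'customerID': ['A1', 'B2', 'C3', 'D4'],
    'gender': ['Female', 'Male', 'Male', 'Female'],
    'tenure': [1, 34, 2, 45],
    'MonthlyCharges': [29.85, 56.95, 53.85, 42.30],
    'Churn': ['No', 'No', 'Yes', 'No'],
})

custid = ['customerID']
target = ['Churn']

# Columns with few unique values are treated as categorical.
# Threshold is 3 here (not 10) because the toy frame is tiny.
unique_counts = toy.nunique()
categorical = unique_counts[unique_counts < 3].keys().tolist()

# The target also has few unique values, so take it out of the feature list
categorical.remove(target[0])

# Everything that is not an ID, target, or categorical column is numerical
numerical = [col for col in toy.columns
             if col not in custid + target + categorical]

print(categorical)  # ['gender']
print(numerical)    # ['tenure', 'MonthlyCharges']
```

Note that a unique-count threshold is a heuristic: a numerical column with very few distinct values would also be classified as categorical.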

One-hot encoding

This is a typical categorical field

Color
Red
White
Blue
Red

One-hot encoding result

This is how it looks after one-hot encoding.

Color		Red	White	Blue
Red	---------->	1	0	0
White	---------->	0	1	0
Blue	---------->	0	0	1
Red	---------->	1	0	0
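
The table above can be reproduced with a short sketch using `pd.get_dummies`. The `colors` frame is a hypothetical example built from the four values shown, and `dtype=int` is passed so the output contains 1/0 rather than booleans.

```python
import pandas as pd

# Hypothetical frame holding the categorical field from the slide
colors = pd.DataFrame({'Color': ['Red', 'White', 'Blue', 'Red']})

# One indicator column per category; dtype=int yields 1/0 values
encoded = pd.get_dummies(colors, columns=['Color'], dtype=int)

# Columns are ordered alphabetically by category:
# Color_Blue, Color_Red, Color_White
print(encoded)
```

Each row has exactly one 1, in the column matching its original category.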

One-hot encoding categorical variables

import pandas as pd
telco_raw = pd.get_dummies(data=telco_raw, columns=categorical, drop_first=True)
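
As a rough illustration of what `drop_first=True` does, the sketch below encodes a hypothetical three-category column. The first category in sorted order is dropped, and rows belonging to it are represented by zeros in all remaining columns.

```python
import pandas as pd

# Hypothetical three-category column (not part of the telecom data)
colors = pd.DataFrame({'Color': ['Red', 'White', 'Blue', 'Red']})

# drop_first=True removes the first category in sorted order (here: Blue)
reduced = pd.get_dummies(colors, columns=['Color'],
                         drop_first=True, dtype=int)

print(reduced.columns.tolist())  # ['Color_Red', 'Color_White']

# The Blue row (position 2) is now encoded as all zeros
print(reduced.iloc[2].tolist())  # [0, 0]
```

Dropping one level removes a redundant column, since its value is fully determined by the others.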

Scaling numerical features

# Import the StandardScaler class
from sklearn.preprocessing import StandardScaler

# Initialize StandardScaler instance
scaler = StandardScaler()

# Fit the scaler on the numerical columns and transform them
scaled_numerical = scaler.fit_transform(telco_raw[numerical])

# Build a DataFrame
scaled_numerical = pd.DataFrame(scaled_numerical, columns=numerical)
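
A quick way to sanity-check the scaling step is to confirm that each scaled column has mean close to 0 and standard deviation close to 1. The sketch below does this on made-up values; the `toy` frame is hypothetical. `StandardScaler` uses the population standard deviation, so the check uses `ddof=0`.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numerical columns (values are made up)
toy = pd.DataFrame({
    'tenure': [1, 34, 2, 45],
    'MonthlyCharges': [29.85, 56.95, 53.85, 42.30],
})
numerical = ['tenure', 'MonthlyCharges']

scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy[numerical]),
                      columns=numerical, index=toy.index)

# Each column should be centered at 0 with unit (population) std
means = scaled.mean()
stds = scaled.std(ddof=0)
print(means.round(6))
print(stds.round(6))
```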

Bringing it all together

# Drop the unscaled numerical columns
telco_raw = telco_raw.drop(columns=numerical)

# Merge the non-numerical with the scaled numerical data
telco = telco_raw.merge(right=scaled_numerical,
                        how='left',
                        left_index=True, 
                        right_index=True
                        )
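
The index-based merge relies on both frames sharing the same index labels. That holds when `telco_raw` has a default 0..n-1 index, which appears to be the case here. If rows had been filtered beforehand, the labels would no longer line up. The sketch below uses hypothetical frames and made-up values to show the difference, and one possible safeguard: passing the original index when building the scaled DataFrame.

```python
import pandas as pd

# Hypothetical feature frame whose index is no longer 0..n-1
# (for example, after some rows were dropped)
features = pd.DataFrame({'gender_Male': [0, 1, 1]}, index=[0, 2, 5])
scaled_values = [[-1.0], [0.0], [1.0]]  # made-up scaled values

# Built without an index: gets labels 0, 1, 2, which do not match 0, 2, 5
misaligned = pd.DataFrame(scaled_values, columns=['tenure'])
bad = features.merge(misaligned, how='left',
                     left_index=True, right_index=True)

# Built with the original index: labels match row for row
aligned = pd.DataFrame(scaled_values, columns=['tenure'],
                       index=features.index)
good = features.merge(aligned, how='left',
                      left_index=True, right_index=True)

print(bad['tenure'].isna().sum())   # 1 (label 5 has no match)
print(good['tenure'].tolist())      # [-1.0, 0.0, 1.0]
```

In the misaligned case, label 2 also silently picks up the value that belongs to a different row, so a missing-value check alone would not catch every problem.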

Let's practice pre-processing!
