Review of pipelines using sklearn

Extreme Gradient Boosting with XGBoost

Sergey Fogelson

Head of Data Science, TelevisaUnivision

Pipeline review

  • Takes a list of named 2-tuples (name, pipeline_step) as input
  • Tuples can contain any arbitrary scikit-learn compatible estimator or transformer object
  • Pipeline implements fit/predict methods
  • Can be used as the input estimator in grid/randomized search and in cross_val_score

Scikit-learn pipeline example

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

names = ["crime", "zone", "industry", "charles", "no", "rooms", "age",
         "distance", "radial", "tax", "pupil", "aam", "lower", "med_price"]
data = pd.read_csv("boston_housing.csv", names=names)
X, y = data.iloc[:, :-1], data.iloc[:, -1]

rf_pipeline = Pipeline([("st_scaler", StandardScaler()),
                        ("rf_model", RandomForestRegressor())])
scores = cross_val_score(rf_pipeline, X, y,
                         scoring="neg_mean_squared_error", cv=10)

Scikit-learn pipeline example

final_avg_rmse = np.mean(np.sqrt(np.abs(scores)))

print("Final RMSE:", final_avg_rmse)
Final RMSE: 4.54530686529

Preprocessing I: LabelEncoder and OneHotEncoder

  • LabelEncoder: Converts a categorical column of strings into integers
  • OneHotEncoder: Takes the column of integers and encodes them as dummy variables
  • These two encoding steps cannot be chained together within a pipeline
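The two steps above can be sketched as follows; the toy color values are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy categorical column (made-up values for illustration)
colors = np.array(["red", "green", "blue", "green"])

# Step 1: LabelEncoder maps strings to integers
# (classes are sorted alphabetically: blue=0, green=1, red=2)
le = LabelEncoder()
ints = le.fit_transform(colors)

# Step 2: OneHotEncoder turns the integer column into dummy variables
ohe = OneHotEncoder()
dummies = ohe.fit_transform(ints.reshape(-1, 1)).toarray()

print(le.classes_)   # ['blue' 'green' 'red']
print(dummies)
```

Because LabelEncoder expects a 1-D array of labels while pipeline steps operate on 2-D feature matrices, this two-step encoding has to happen outside the pipeline.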

Preprocessing II: DictVectorizer

  • Traditionally used in text processing
  • Converts lists of feature mappings into vectors
  • Need to convert DataFrame into a list of dictionary entries
  • Explore the scikit-learn documentation

Let's build pipelines!

