Review of pipelines using sklearn

Extreme Gradient Boosting with XGBoost

Sergey Fogelson

Head of Data Science, TelevisaUnivision

Pipeline review

  • Takes a list of named 2-tuples (name, pipeline_step) as input
  • Tuples can contain any arbitrary scikit-learn compatible estimator or transformer object
  • Pipeline implements fit/predict methods
  • Can be used as the input estimator in grid/randomized search and in cross_val_score

Scikit-learn pipeline example

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

names = ["crime", "zone", "industry", "charles", "no", "rooms", "age",
         "distance", "radial", "tax", "pupil", "aam", "lower", "med_price"]
data = pd.read_csv("boston_housing.csv", names=names)
X, y = data.iloc[:, :-1], data.iloc[:, -1]

rf_pipeline = Pipeline([("st_scaler", StandardScaler()),
                        ("rf_model", RandomForestRegressor())])
scores = cross_val_score(rf_pipeline, X, y,
                         scoring="neg_mean_squared_error", cv=10)

Scikit-learn pipeline example

final_avg_rmse = np.mean(np.sqrt(np.abs(scores)))

print("Final RMSE:", final_avg_rmse)
Final RMSE: 4.54530686529

Preprocessing I: LabelEncoder and OneHotEncoder

  • LabelEncoder: Converts a categorical column of strings into integers
  • OneHotEncoder: Takes the column of integers and encodes them as dummy variables
  • These two encoding steps cannot be chained together within a pipeline
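The two steps above can be sketched as follows; the toy color values are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy categorical column (made-up values for illustration)
colors = np.array(["red", "green", "blue", "green"])

# Step 1: LabelEncoder maps strings to integers
# (classes are sorted alphabetically: blue=0, green=1, red=2)
le = LabelEncoder()
ints = le.fit_transform(colors)

# Step 2: OneHotEncoder turns the integer column into dummy variables
ohe = OneHotEncoder()
dummies = ohe.fit_transform(ints.reshape(-1, 1)).toarray()

print(le.classes_)   # ['blue' 'green' 'red']
print(dummies)
```

Because LabelEncoder expects a 1-D array of labels while pipeline steps operate on 2-D feature matrices, this two-step encoding has to happen outside the pipeline.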

Preprocessing II: DictVectorizer

  • Traditionally used in text processing
  • Converts lists of feature mappings into vectors
  • Need to convert DataFrame into a list of dictionary entries
  • Explore the scikit-learn documentation

Let's build pipelines!

