Incorporating xgboost into pipelines

Extreme Gradient Boosting with XGBoost

Sergey Fogelson

Head of Data Science, TelevisaUnivision

Scikit-learn pipeline example with XGBoost

import pandas as pd
import xgboost as xgb
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

names = ["crime", "zone", "industry", "charles", "no", "rooms", "age",
         "distance", "radial", "tax", "pupil", "aam", "lower", "med_price"]
data = pd.read_csv("boston_housing.csv", names=names)
X, y = data.iloc[:, :-1], data.iloc[:, -1]

xgb_pipeline = Pipeline([("st_scaler", StandardScaler()),
                         ("xgb_model", xgb.XGBRegressor())])
scores = cross_val_score(xgb_pipeline, X, y,
                         scoring="neg_mean_squared_error", cv=10)

final_avg_rmse = np.mean(np.sqrt(np.abs(scores)))
print("Final XGB RMSE:", final_avg_rmse)
Final XGB RMSE: 4.02719593323

Additional components introduced for pipelines

  • sklearn_pandas:
    • DataFrameMapper - Interoperability between pandas and scikit-learn
  • sklearn.impute:
    • SimpleImputer - Native imputation of numerical and categorical columns in scikit-learn
  • sklearn.pipeline:
    • FeatureUnion - combine multiple pipelines of features into a single pipeline of features
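The components above can be combined in a single preprocessing step. Below is a minimal sketch using only scikit-learn's SimpleImputer and FeatureUnion (the column names and the tiny DataFrame are made up for illustration; the course exercises use DataFrameMapper from sklearn_pandas for the column-selection step instead of the FunctionTransformer shown here):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer

# Hypothetical frame with missing values in two numeric columns
df = pd.DataFrame({
    "rooms": [6.0, np.nan, 5.5, 7.1],
    "age":   [65.2, 78.9, np.nan, 45.8],
})

# Each branch selects one column and imputes it with a different strategy;
# FeatureUnion then concatenates the branches side by side.
rooms_branch = Pipeline([
    ("select", FunctionTransformer(lambda d: d[["rooms"]])),
    ("impute", SimpleImputer(strategy="median")),
])
age_branch = Pipeline([
    ("select", FunctionTransformer(lambda d: d[["age"]])),
    ("impute", SimpleImputer(strategy="mean")),
])
union = FeatureUnion([("rooms", rooms_branch), ("age", age_branch)])

features = union.fit_transform(df)
print(features.shape)  # one column per branch: (4, 2)
```

The resulting array has no missing values, so it can feed directly into an `xgb.XGBRegressor` step appended to the same pipeline.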

Let's practice!

