Choosing the Algorithm

Feature Engineering with PySpark

John Hogue

Lead Data Scientist, General Mills

Spark ML Landscape

ML Flowchart

PySpark Regression Methods

Methods in pyspark.ml.regression:

  • GeneralizedLinearRegression
  • IsotonicRegression
  • LinearRegression

  • DecisionTreeRegressor
  • GBTRegressor
  • RandomForestRegressor

RFR Diagram

Test and Train Splits for Time Series

from datetime import timedelta

# Create variables for max and min dates in our dataset
# (collect() returns Python date values, assuming OFFMKTDATE is a date column)
max_date = df.agg({'OFFMKTDATE': 'max'}).collect()[0][0]
min_date = df.agg({'OFFMKTDATE': 'min'}).collect()[0][0]
# Find how many days our data spans; datediff() operates on columns,
# so plain date arithmetic is used for these Python values
range_in_days = (max_date - min_date).days
# Find the date to split the dataset on, 80% of the way through the range
split_in_days = round(range_in_days * 0.8)
split_date = min_date + timedelta(days=split_in_days)
# Train on records that went off market before the split date
train_df = df.where(df['OFFMKTDATE'] < split_date)
# Test on records listed and taken off market on or after the split date
test_df = df.where(df['OFFMKTDATE'] >= split_date)\
  .where(df['LISTDATE'] >= split_date)

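The split-date arithmetic above can be checked without a Spark session. The sketch below is a minimal illustration, not part of the original slides: the helper name `time_split_date`, its `train_fraction` parameter, and the sample dates are all hypothetical. It assumes the min and max values have already been collected as Python `date` objects.

```python
from datetime import date, timedelta


def time_split_date(min_date: date, max_date: date, train_fraction: float = 0.8) -> date:
    """Return the date that lies train_fraction of the way from min_date to max_date.

    Hypothetical helper for illustration; it mirrors the slide's arithmetic.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    # Number of whole days covered by the data
    range_in_days = (max_date - min_date).days
    # Number of days assigned to the training window
    split_in_days = round(range_in_days * train_fraction)
    return min_date + timedelta(days=split_in_days)


# Made-up dates spanning 100 days
split_date = time_split_date(date(2017, 1, 1), date(2017, 4, 11))
print(split_date)
```

The returned value can then be compared against a date column directly, as in `df['OFFMKTDATE'] < split_date`. Note that the 80/20 ratio applies to the calendar range, not to the row count, so the two DataFrames may hold a different share of rows if records are unevenly spread over time.
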
Time to practice!
