Feature Engineering with PySpark
John Hogue
Lead Data Scientist, General Mills
Methods in ml.regression:
GeneralizedLinearRegression
IsotonicRegression
LinearRegression
DecisionTreeRegressor
GBTRegressor
RandomForestRegressor
# Create variables for max and min dates in our dataset
max_date = df.agg({'OFFMKTDATE': 'max'}).collect()[0][0]
min_date = df.agg({'OFFMKTDATE': 'min'}).collect()[0][0]
# Find how many days our data spans
# (assumes OFFMKTDATE is a date or timestamp column, so collect() returned
# Python date objects; datediff() and date_add() work on columns, not on these)
from datetime import timedelta
range_in_days = (max_date - min_date).days
# Find the date to split the dataset on
split_in_days = round(range_in_days * 0.8)
split_date = min_date + timedelta(days=split_in_days)
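# Split on the date: first 80% of the time span for train, the rest for test
train_df = df.where(df['OFFMKTDATE'] < split_date)
# Test rows must also have been listed on or after the split date
test_df = df.where(df['OFFMKTDATE'] >= split_date)\
            .where(df['LISTDATE'] >= split_date)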
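The split-date arithmetic can be checked without Spark. The following is a minimal sketch in plain Python; the helper name `compute_split_date` and the example dates are hypothetical, chosen only for illustration, and are not taken from the dataset.

```python
from datetime import date, timedelta

def compute_split_date(min_date, max_date, train_fraction=0.8):
    """Return the date that lies train_fraction of the way through the range."""
    # Total number of days covered by the data
    range_in_days = (max_date - min_date).days
    # Number of days assigned to the training period
    split_in_days = round(range_in_days * train_fraction)
    return min_date + timedelta(days=split_in_days)

# Hypothetical date range, for illustration only
example_min = date(2017, 1, 1)
example_max = date(2017, 4, 11)  # 100 days after example_min
example_split = compute_split_date(example_min, example_max)
print(example_split)
```

With a 100-day range, the split lands 80 days after the first date, so rows before that date would go to the training set and rows on or after it to the test set. Note that this divides the time span 80/20, which only matches an 80/20 split of the rows if records are spread evenly over time.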