Feature Engineering with PySpark
John Hogue
Lead Data Scientist, General Mills
Data Collection
Data Storage Rules
Joining Disparate Data
Intentionally Missing
Missing completely at random
Missing at random
Missing not at random
When to drop rows with missing data?
isNull()
df.where(df['ROOF'].isNull()).count()
765
# Import library
import seaborn as sns
# Subset the dataframe
sub_df = df.select(['ROOMAREA1'])
# Sample the dataframe
sample_df = sub_df.sample(False, .5, 4)
# Convert to Pandas DataFrame
pandas_df = sample_df.toPandas()
# Plot it
sns.heatmap(data=pandas_df.isnull())
Process of replacing missing values
Rule Based
Statistics Based
Model Based
fillna(value, subset=None)
value: the value to replace missings with
subset: the list of column names to replace missings
# Replacing missing values with zero
df.fillna(0, subset=['DAYSONMARKET'])
# Replacing with the mean value for that column
col_mean = df.agg({'DAYSONMARKET': 'mean'}).collect()[0][0]
df.fillna(col_mean, subset=['DAYSONMARKET'])