Working with Missing Data

Feature Engineering with PySpark

John Hogue

Lead Data Scientist, General Mills

How does data go missing in the digital age?

Data Collection

Broken Sensors

Data Storage Rules

2017-01-01 vs January 1st, 2017

Joining Disparate Data

Monthly to Weekly

Intentionally Missing

Privacy Concerns
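
The storage-rules case above (2017-01-01 vs January 1st, 2017) can be sketched in plain Python. This is a minimal illustration, not the course's code: `raw_dates` and `parse_iso` are hypothetical names, and the assumption is that a loader expects a single date format and turns anything else into a missing value.

```python
from datetime import datetime

# Hypothetical raw values: the same date as two source systems might store it
raw_dates = ["2017-01-01", "January 1st, 2017"]

def parse_iso(value):
    """Parse a YYYY-MM-DD string; return None when the format does not match."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        return None

# The second value does not follow the expected format, so it ends up missing
parsed = [parse_iso(v) for v in raw_dates]
print(parsed)
```

Nothing was wrong with the second record at the source; it went missing only because the storage rule rejected its format.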

Missing

Types of Missing

Missing completely at random

The missing data is just a completely random subset of the data

Missing at random

Missingness depends on another observed variable, and is random once that variable is accounted for

Missing not at random

Data is missing because of how it is collected, so missingness is related to the missing value itself
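
The three types above can be simulated to see why the distinction matters. The sketch below uses plain Python and made-up survey data; the column meanings (age, income), the probabilities, and the cutoff are all assumptions chosen for illustration.

```python
import random

random.seed(42)

# Hypothetical survey rows: age is always observed, income may go missing
rows = [(random.randint(20, 70), random.gauss(50_000, 10_000)) for _ in range(1_000)]

def mcar(rows, p=0.2):
    """MCAR: every income has the same chance p of being missing."""
    return [(age, None if random.random() < p else income) for age, income in rows]

def mar(rows):
    """MAR: the chance depends on age, which is observed (under 30: 50%, else 5%)."""
    return [
        (age, None if random.random() < (0.5 if age < 30 else 0.05) else income)
        for age, income in rows
    ]

def mnar(rows, cutoff=60_000):
    """MNAR: missingness depends on the value itself (high incomes are withheld)."""
    return [(age, None if income > cutoff else income) for age, income in rows]

def observed_mean(rows):
    """Mean of the incomes that are still present."""
    values = [income for _, income in rows if income is not None]
    return sum(values) / len(values)

true_mean = observed_mean(rows)
for name, make in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(name, "shift in mean:", round(observed_mean(make(rows)) - true_mean))
```

In this setup the MNAR mean is pulled downward because only high incomes are withheld, while the MCAR mean moves only by sampling noise. That is the reason dropping rows is usually considered safest under MCAR.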

Assessing Missing Values

When to drop rows with missing data?

Missing values are rare
Missing Completely at Random

isNull()

True if the current expression is null.

df.where(df['ROOF'].isNull()).count()

Plotting Missing Values

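# Import the plotting library
import seaborn as sns

# Subset the DataFrame to the column of interest
sub_df = df.select(['ROOMAREA1'])

# Sample 50% of the rows without replacement (seed 4) to limit what toPandas() collects
sample_df = sub_df.sample(withReplacement=False, fraction=0.5, seed=4)

# Convert to a Pandas DataFrame
pandas_df = sample_df.toPandas()

# Plot a heatmap of the null mask; missing values show up as a contrasting color
sns.heatmap(data=pandas_df.isnull())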

Missing Values Heatmap

[Figure: missing data heatmap]

Imputation of Missing Values

Process of replacing missing values

Rule Based

Value based on business logic

Statistics Based

Using mean, median, etc

Model Based

Use model to predict value
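
The three approaches above can be compared side by side on a few records. This is a plain-Python sketch with made-up listings; the field names, the business rule (a blank garage field means no garage), and the choice of a straight-line model are all assumptions made for illustration.

```python
from statistics import mean, median

# Hypothetical listing records; None marks a missing value
listings = [
    {"sqft": 1000, "price": 200_000, "garage_spaces": 1},
    {"sqft": 1500, "price": 300_000, "garage_spaces": None},
    {"sqft": 2000, "price": None, "garage_spaces": 2},
    {"sqft": 2500, "price": 500_000, "garage_spaces": None},
]

# Rule based: assumed business rule that a blank garage field means zero spaces
rule_based = [0 if r["garage_spaces"] is None else r["garage_spaces"] for r in listings]

# Statistics based: fill missing prices with the median of the observed prices
observed_prices = [r["price"] for r in listings if r["price"] is not None]
price_median = median(observed_prices)
stat_based = [price_median if r["price"] is None else r["price"] for r in listings]

# Model based: fit price ~ sqft by least squares on complete rows, then predict
xs = [r["sqft"] for r in listings if r["price"] is not None]
ys = observed_prices
x_bar, y_bar = mean(xs), mean(ys)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
intercept = y_bar - slope * x_bar
model_based = [
    slope * r["sqft"] + intercept if r["price"] is None else r["price"] for r in listings
]

print(rule_based)
print(stat_based)
print(model_based)
```

The median fill ignores the size of the home, while the model-based fill uses it. Which one is appropriate depends on how much the other columns explain the missing one.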

Imputation of Missing Values

fillna(value, subset=None)

value: the value to replace missing values with
subset: the list of column names in which to replace missing values

# fillna returns a new DataFrame, so assign the result

# Replacing missing values with zero
df_zero = df.fillna(0, subset=['DAYSONMARKET'])

# Replacing with the mean value for that column
col_mean = df.agg({'DAYSONMARKET': 'mean'}).collect()[0][0]
df_mean = df.fillna(col_mean, subset=['DAYSONMARKET'])

Let's practice!
