Working with Missing Data

Feature Engineering with PySpark

John Hogue

Lead Data Scientist, General Mills

How does data go missing in the digital age?

Data Collection

  • Broken Sensors

Data Storage Rules

  • 2017-01-01 vs January 1st, 2017

Joining Disparate Data

  • Monthly to Weekly

Intentionally Missing

  • Privacy Concerns

Missing

Types of Missing

Missing completely at random

  • Missing data is a completely random subset of the observations

Missing at random

  • Data is missing at random, conditional on another observed value

Missing not at random

  • Data is missing because of how it is collected
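
The three mechanisms can be illustrated with a small simulation. This is a minimal sketch in plain Python, not part of the course material: the `age` and `income` fields, the probabilities, and the helper names are all hypothetical. The "missing not at random" rule uses the common statistical reading, in which the chance of a value going missing depends on the value itself (for example, high earners declining to report income).

```python
import random

random.seed(0)

# Hypothetical records: age and income are independent toy values.
records = [
    {"age": random.randint(20, 70), "income": random.randint(20, 150)}
    for _ in range(10_000)
]

def mcar(row):
    # Missing completely at random: every income has the same 20% chance of being lost.
    return random.random() < 0.2

def mar(row):
    # Missing at random: the chance depends on another observed column (age).
    return random.random() < (0.4 if row["age"] < 30 else 0.1)

def mnar(row):
    # Missing not at random: the chance depends on the unobserved income itself.
    return random.random() < (0.5 if row["income"] > 100 else 0.05)

def observed_mean(is_missing):
    # Mean income over the rows that survive the given missingness rule.
    kept = [r["income"] for r in records if not is_missing(r)]
    return sum(kept) / len(kept)

true_mean = sum(r["income"] for r in records) / len(records)
mcar_mean = observed_mean(mcar)
mar_mean = observed_mean(mar)
mnar_mean = observed_mean(mnar)

print(f"true={true_mean:.1f} mcar={mcar_mean:.1f} mar={mar_mean:.1f} mnar={mnar_mean:.1f}")
```

In this toy setup the MCAR mean stays close to the true mean, while the MNAR mean is pulled down because high incomes are lost more often. Age and income are independent here, so the MAR rule does not bias the mean either; with correlated columns it could.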

Assessing Missing Values

When to drop rows with missing data?

  • Missing values are rare
  • Missing Completely at Random

isNull()

  • True if the current expression is null.
df.where(df['ROOF'].isNull()).count()
765

Plotting Missing Values

# Import library
import seaborn as sns

# Subset the dataframe
sub_df = df.select(['ROOMAREA1'])
# Sample the dataframe (withReplacement=False, fraction=0.5, seed=4)
sample_df = sub_df.sample(False, .5, 4)
# Convert to Pandas DataFrame
pandas_df = sample_df.toPandas()
# Plot it
sns.heatmap(data=pandas_df.isnull())

Missing Values Heatmap

[Figure: missing data heatmap]

Imputation of Missing Values

Process of replacing missing values

Rule Based

  • Value based on business logic

Statistics Based

  • Using mean, median, etc

Model Based

  • Use model to predict value

Imputation of Missing Values

fillna(value, subset=None)

  • value: the value to replace missing values with
  • subset: the list of column names in which to replace missing values
# Replacing missing values with zero
# (fillna returns a new DataFrame, so keep the result)
df_zero = df.fillna(0, subset=['DAYSONMARKET'])
# Replacing with the mean value for that column
# (the mean aggregate skips nulls)
col_mean = df.agg({'DAYSONMARKET': 'mean'}).collect()[0][0]
df_mean = df.fillna(col_mean, subset=['DAYSONMARKET'])
Let's practice!
