Missing data and imputation

Introduction to Python in Power BI

Jacob H. Marquez

Data Scientist

What is missing data?

Common values for "missing":

  • null
  • NA
  • 99
  • ""
entity     year  fished
Australia  1988  153148
Australia  1989  null
Australia  1990  567895
Australia  1991  632987
Australia  1992  643578
Australia  1993  null
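
A minimal sketch of how the placeholder values above could be turned into real missing values with pandas. It assumes the "fished" column arrived as text with a literal "null" string; the placeholder list is an assumption and should match your own data (99 is only safe to treat as missing when it cannot be a genuine reading).

```python
import pandas as pd

# The table above, assuming "fished" was read in as text with literal "null" strings
fish = pd.DataFrame({
    "entity": ["Australia"] * 6,
    "year": [1988, 1989, 1990, 1991, 1992, 1993],
    "fished": ["153148", "null", "567895", "632987", "643578", "null"],
})

# Placeholders to treat as missing (an assumption; adjust to your data)
placeholders = ["null", "NA", "99", ""]

# Blank out the placeholders, then convert the rest of the column to numbers
is_placeholder = fish["fished"].isin(placeholders)
fish["fished"] = pd.to_numeric(fish["fished"].mask(is_placeholder))

# Count how many values are now recognised as missing
n_missing = int(fish["fished"].isna().sum())
print(fish)
print("missing values:", n_missing)
```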

Why is data missing?

  • A participant forgot or refused to answer a question in a survey
  • A participant dropped out of the second part of a study
  • There was a glitch in the instrument used to obtain measurements
  • Privacy laws restrict the use of data

Is it missing at random?

Missing at random

Table of rainfall, in inches, across three cities - Seattle, New York City, and Paris.

Missing not at random

Table of rainfall, in inches, across three cities - Seattle, New York City, and Paris. One row from Seattle is missing.

  • Instrument can't detect low readings
  • Certain groups of individuals are unlikely to disclose information

How to address missing data?

Missing not at random

  • Pause analysis
  • Understand reasons for missing data
  • Gather more data
  • Clearly document limitations and assumptions made

Missing at random

  • Delete the observations
  • Add an indicator variable: 1 if the value is missing, 0 if not
  • Imputation
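
The first two options for data that is missing at random can be sketched with pandas as follows. The fished table from earlier is reused purely for illustration, and the column name fished_missing is a hypothetical choice.

```python
import numpy as np
import pandas as pd

# Illustrative data: the fished table with two missing values
fish = pd.DataFrame({
    "entity": ["Australia"] * 6,
    "year": [1988, 1989, 1990, 1991, 1992, 1993],
    "fished": [153148, np.nan, 567895, 632987, 643578, np.nan],
})

# Option 1: delete the observations where "fished" is missing
dropped = fish.dropna(subset=["fished"])

# Option 2: keep every row and add an indicator, 1 for missing and 0 for not
flagged = fish.assign(fished_missing=fish["fished"].isna().astype(int))

print(dropped)
print(flagged)
```

Deleting rows shrinks the dataset, while the indicator keeps all rows and records where the gaps were.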

Imputation

Definition: replacing a missing value with a substitute value.

Types of Imputation:

  • Mean
  • Median
  • Mode
  • Previous or Next values

Best when 5% or less of the column data is missing.

Remember to sort the values before filling with previous or next values!
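
A short sketch of the imputation types listed above, using pandas. The data is hypothetical and deliberately out of order to show why sorting matters for previous/next fills; with only five rows the share missing is well above the 5% guideline, so treat it as a toy example only.

```python
import numpy as np
import pandas as pd

# Hypothetical, unsorted readings with one missing value (year 2002)
df = pd.DataFrame({
    "year": [2003, 2001, 2005, 2002, 2004],
    "value": [4.0, 2.0, 4.0, np.nan, 10.0],
})

# Sort first so "previous" and "next" refer to neighbouring years
s = df.sort_values("year").reset_index(drop=True)
col = s["value"]

# Share of the column that is missing (compare against the 5% guideline)
share_missing = col.isna().mean()

mean_filled = col.fillna(col.mean())            # mean of the observed values
median_filled = col.fillna(col.median())        # median of the observed values
mode_filled = col.fillna(col.mode().iloc[0])    # most frequent observed value
prev_filled = col.ffill()                       # previous value (2001)
next_filled = col.bfill()                       # next value (2003)

print(s.assign(mean=mean_filled, median=median_filled, mode=mode_filled,
               previous=prev_filled, next=next_filled))
```

Without the sort, the previous-value fill would copy the reading from 2005, the row that happens to sit above 2002 in the raw data.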


Imputation - Example

Missing at random

Table of rainfall, in inches, across three cities - Seattle, New York City, and Paris - with observations missing.

Median imputation

Table of rainfall, in inches, across three cities - Seattle, New York City, and Paris. Missing observations filled in with median in city.
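
One way the per-city median imputation described above might look in pandas. The rainfall numbers and the column names city, rainfall_in, and rainfall_imputed are hypothetical, since the slide's table values are not reproduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical rainfall readings, in inches, with one gap per city
rain = pd.DataFrame({
    "city": ["Seattle", "Seattle", "Seattle",
             "New York City", "New York City", "New York City",
             "Paris", "Paris", "Paris"],
    "rainfall_in": [5.0, np.nan, 3.0, 4.0, 2.0, np.nan, 1.0, np.nan, 2.0],
})

# Median of the observed values within each city, aligned row by row
city_median = rain.groupby("city")["rainfall_in"].transform("median")

# Fill each gap with the median of its own city
rain["rainfall_imputed"] = rain["rainfall_in"].fillna(city_median)

print(rain)
```

Filling within each city, rather than with one overall median, keeps a rainy city's gaps from being pulled toward the drier cities' values.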


Dataset

Invoice  StockCode  Description                 Quantity  InvoiceDate            Price  Customer ID
506303   PADS       PADS TO MATCH ALL CUSHIONS  1         4/29/2010 10:43:00 AM  0.001  14249
496725   M          Manual                      1         2/3/2010 2:16:00 PM    1.5    13619
502660   M          Manual                      6         3/25/2010 5:18:00 PM   1.5    13187
509669   90214S     LETTER "S" BLING KEY RING   10        12/13/2009 3:54:00 PM  1.25   16725

Let's practice!
