What is exploratory data analysis?

Exploratory Data Analysis in Power BI

Jacob H. Marquez

Data Scientist at Microsoft

What is exploratory data analysis?

"An approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods."

1 https://en.wikipedia.org/wiki/Exploratory_data_analysis

Six steps to EDA

  1. Understanding the data structure

  2. Identifying missing data

  3. Describing the data with descriptive statistics & distributions

  4. Identifying outliers

  5. Examining and quantifying relationships between variables

  6. Forming hypotheses


1. Understanding the data structure

Continuous

Numerical variables that can often take an infinite set of values

  • Number of stars in space
  • Click-through rates
  • Distance between two cities

Categorical

Non-numerical variables, usually text, with two or more groups

  • House types
  • Country
  • Company
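
A first pass at understanding the data structure can be sketched in plain Python. This is only an illustration: the `classify_column` helper and the sample rows are hypothetical and do not come from the course, which works in Power BI.

```python
def classify_column(values):
    """Label a column 'continuous' if every non-missing value is a number,
    otherwise 'categorical'."""
    present = [v for v in values if v is not None]
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    return "continuous" if present and is_numeric else "categorical"


# Hypothetical sample rows, not taken from the course dataset.
listings = [
    {"city": "Seattle", "house_type": "Apartment", "distance_km": 3.2},
    {"city": "Paris", "house_type": "House", "distance_km": 11.5},
    {"city": "Paris", "house_type": "Apartment", "distance_km": None},
]

# Turn the rows into columns, then label each column.
columns = {name: [row[name] for row in listings] for name in listings[0]}
structure = {name: classify_column(vals) for name, vals in columns.items()}
print(structure)
```

A type check like this is a rough heuristic. Numeric identifiers, for example, would be labeled continuous even though they behave like categories, so the result still needs a manual review.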

2. Identifying missing data

 

Missing at random

A nine-by-four matrix with three rows for each of three cities: Seattle, New York City, and Paris. Thirty cells hold values representing inches of rainfall, and six cells are blank at random across all three city groupings.

 

Missing not at random

A nine-by-four matrix with three rows for each of three cities: Seattle, New York City, and Paris. The cells hold values representing inches of rainfall, except for four blank cells that all fall within the Seattle rows.
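
One way to tell the two patterns apart is to count blanks per group. The sketch below assumes hypothetical rainfall readings, with `None` standing in for a blank cell; it is not the course's Power BI workflow.

```python
from collections import Counter

# Hypothetical rainfall readings (inches); None marks a blank cell.
rainfall = [
    ("Seattle", None), ("Seattle", 5.1), ("Seattle", None),
    ("New York City", 3.4), ("New York City", 4.0), ("New York City", 3.9),
    ("Paris", 2.1), ("Paris", 2.6), ("Paris", 1.9),
]

# Count blank cells and total cells for each city.
missing_by_city = Counter(city for city, inches in rainfall if inches is None)
total_by_city = Counter(city for city, _ in rainfall)

for city in total_by_city:
    share = missing_by_city[city] / total_by_city[city]
    print(f"{city}: {missing_by_city[city]} of {total_by_city[city]} missing ({share:.0%})")
```

Blanks concentrated in a single group, as with Seattle here, suggest the data may be missing not at random. The count is a hint rather than proof, so the cause still needs investigating.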


2. Addressing missing data

 

A nine-by-four matrix with three rows for each of three cities: Seattle, New York City, and Paris. The cells hold values representing inches of rainfall, except for four blank cells that all fall within the Seattle rows.

The same nine-by-four matrix with the top row removed, representing removal of the observations that contain blank cells.

The same nine-by-four matrix with the top row now filled in, representing imputation of the blank cells with the median value.
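
The two options, removing and imputing, can be sketched as follows. The readings are hypothetical, and `None` stands in for a blank cell.

```python
from statistics import median

# Hypothetical rainfall readings (inches); None marks a blank cell.
readings = [4.8, None, 5.1, 3.4, 4.0, None, 2.1]

observed = [r for r in readings if r is not None]

# Option 1: remove the missing observations.
dropped = observed

# Option 2: impute each blank with the median of the observed values.
fill_value = median(observed)
imputed = [fill_value if r is None else r for r in readings]

print(dropped)
print(imputed)
```

Removing keeps only real measurements but shrinks the dataset. Imputing keeps every row but adds values that were never observed.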


3. Describing the data

  • Minimum
  • Maximum
  • Mean: sum of all values divided by the number of observations
  • Median: the middle value when all values are sorted
  • Standard deviation: a measure of how far values typically fall from the mean of a variable
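
As a rough illustration, the five statistics above can be computed with Python's standard library. The prices are hypothetical. The sketch also assumes the population standard deviation (`pstdev`); `statistics.stdev` would give the sample version, and the slides do not say which one is intended.

```python
from statistics import mean, median, pstdev

# Hypothetical nightly prices, not taken from the course dataset.
prices = [80, 95, 100, 120, 105]

summary = {
    "minimum": min(prices),
    "maximum": max(prices),
    "mean": mean(prices),        # sum of values / number of observations
    "median": median(prices),    # middle value of the sorted list
    "standard_deviation": pstdev(prices),  # population standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```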

3. Describing the data with distributions

A histogram of heights of people with the values of heights on the x-axis and the number of observations with those heights on the y-axis.

  • Median and the mean are the same value
  • A symmetrical curve

3. Describing the data with distributions

A histogram of household income with the values of income on the x-axis and the number of observations with those incomes on the y-axis. Most observations sit on the left side, and the histogram tapers off as it moves to the right.

  • Median < Mean
  • "Right-skewed": the tail is to the right

A histogram of time spent online with the amount of time on the x-axis and the number of observations with those values on the y-axis. The histogram is low on the left side, and most observations sit on the right.

  • Median > Mean
  • "Left-skewed": the tail is to the left
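
The mean-versus-median comparison above can be turned into a quick check. The `skew_direction` helper and the sample values are hypothetical. Treat the comparison as a rule of thumb, not a formal skewness test.

```python
from statistics import mean, median


def skew_direction(values):
    """Compare mean and median to guess the direction of the tail."""
    avg, mid = mean(values), median(values)
    if avg > mid:
        return "right-skewed"  # a few large values pull the mean up
    if avg < mid:
        return "left-skewed"   # a few small values pull the mean down
    return "symmetric"         # mean equals median


# Hypothetical samples echoing the three histograms.
heights_cm = [160, 165, 170, 175, 180]
incomes_k = [30, 35, 40, 45, 50, 60, 200]
hours_online = [1, 6, 7, 8, 8, 9, 9]

print(skew_direction(heights_cm))
print(skew_direction(incomes_k))
print(skew_direction(hours_online))
```

In real data the mean and median are rarely exactly equal, so a practical check would allow a small tolerance before labeling a distribution symmetric.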

The dataset: Airbnb listings

A picture of the Airbnb dataset with five columns: listing_id, host_id, host_since (a date column), city, and price.


Let's practice!
