Distributions and outliers

Exploratory Data Analysis in Power BI

Jacob H. Marquez

Data Scientist at Microsoft

What are distributions?

Definition: the set of all possible values of a variable and their associated frequencies.

What are distributions?

Continuous

Age Frequency
18 7
19 11
20 13
21 19
22 12

Categorical

Hair Color Frequency
Blonde 30
Brown 50
Black 40
Red 20
Grey 20
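
A frequency table like the ones above can be sketched outside Power BI in a few lines of Python. This is only an illustration: the `hair_colors` sample and the `frequency_table` helper are hypothetical and not part of the course material.

```python
from collections import Counter

# Hypothetical sample: one hair-color label per observed person.
hair_colors = ["Brown", "Black", "Brown", "Blonde", "Red", "Brown", "Black", "Grey"]


def frequency_table(values):
    """Return (value, count) pairs sorted from most to least frequent."""
    return Counter(values).most_common()


table = frequency_table(hair_colors)
for value, count in table:
    # Left-align the label so the counts line up in a column.
    print(f"{value:<8}{count}")
```

The same helper works for a numeric variable such as age, since a distribution is just each distinct value paired with how often it occurs.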

What are histograms?

A histogram of heights of people with values of heights on the x-axis and number of observations with those heights on the y-axis. The histogram has a narrow tail on either side of a large mass in the middle.

What are histograms? - bins

Histogram with 100 bins

A histogram of heights of people with values of heights on the x-axis and number of observations with those heights on the y-axis. The histogram is smoother and shows more detail with 100 bins.

Histogram with 20 bins

A histogram of heights of people with values of heights on the x-axis and number of observations with those heights on the y-axis. The histogram is coarser and more box-like with fewer bins.
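
To see how the bin count changes what a histogram shows, here is a minimal Python sketch of equal-width binning. The `heights` sample and the `histogram_counts` helper are assumptions made for illustration, and the sketch assumes the values are not all identical (otherwise the bin width would be zero).

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin instead of overflowing.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical heights in centimetres.
heights = [150, 155, 158, 160, 162, 165, 165, 168, 170, 172, 175, 180, 190]

coarse = histogram_counts(heights, 4)  # few wide bins
fine = histogram_counts(heights, 8)    # more, narrower bins
print(coarse)
print(fine)
```

Both results describe the same observations; only the level of detail differs.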

Reading histograms - centrality and skewness

A histogram of heights of people with values of heights on the x-axis and number of observations with those heights on the y-axis. There is a large mass of observations in the center and fewer toward the ends.

Normal distribution

A histogram of heights of people with values of heights on the x-axis and number of observations with those heights on the y-axis. There is a large mass of observations on the left and it becomes narrower towards the right side of the chart.

Right-skewed distribution
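
One rough numeric check for skewness is to compare the mean with the median: in a right-skewed sample the long right tail typically pulls the mean above the median. The two samples below are hypothetical and chosen only to illustrate the idea.

```python
import statistics

# Hypothetical samples of heights in centimetres.
symmetric = [160, 162, 164, 165, 166, 168, 170]
right_skewed = [160, 161, 162, 163, 165, 175, 200]

for name, sample in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    mean = statistics.mean(sample)
    median = statistics.median(sample)
    # A mean well above the median hints at a long right tail.
    print(f"{name}: mean={mean:.1f}, median={median:.1f}")
```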

Reading histograms - spread

Larger standard deviation

A histogram of heights of people with values of heights on the x-axis and number of observations with those heights on the y-axis. There is a large mass of observations in the center and fewer toward the ends.

Smaller standard deviation

A histogram of heights of people with values of heights on the x-axis and number of observations with those heights on the y-axis. It is narrower, almost spire-like, because the standard deviation is small.
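
The difference in spread can also be seen numerically. The sketch below compares two hypothetical samples with the same mean; it uses the sample standard deviation from Python's `statistics` module, which is an assumption, since the slides do not say whether sample or population SD is meant.

```python
import statistics

# Two hypothetical samples centred on the same value.
wide = [150, 158, 165, 172, 180]
narrow = [163, 164, 165, 166, 167]

sd_wide = statistics.stdev(wide)      # sample standard deviation
sd_narrow = statistics.stdev(narrow)

print(f"wide:   mean={statistics.mean(wide)}, SD={sd_wide:.2f}")
print(f"narrow: mean={statistics.mean(narrow)}, SD={sd_narrow:.2f}")
```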

Reading histograms - percentiles

A histogram of heights of people with values of heights on the x-axis and number of observations with those heights on the y-axis. A green-shaded area extends from the center to the left side, covering the observations at or below the 50th percentile (the median).

Reading histograms - 25th & 75th percentiles

A histogram of heights of people with values of heights on the x-axis and number of observations with those heights on the y-axis. There are two green-shaded areas: one from the 25th percentile to the left side and one from the 75th percentile to the right side.

Reading histograms - interquartile range

A histogram of heights of people with values of heights on the x-axis and number of observations with those heights on the y-axis. There is a green-shaded area from the 25th percentile to the 75th percentile representing the interquartile range.
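
As a small worked example, the percentiles and the interquartile range can be computed with Python's standard library. The `heights` sample is hypothetical, and the `"inclusive"` quantile method is one assumed convention among several; other tools may interpolate slightly differently.

```python
import statistics

# Hypothetical heights in centimetres.
heights = [150, 155, 158, 160, 162, 165, 165, 168, 170, 172, 175, 180, 190]

# quantiles(n=4) returns the three cut points: 25th, 50th and 75th percentiles.
q1, median, q3 = statistics.quantiles(heights, n=4, method="inclusive")

# The interquartile range spans the middle half of the observations.
iqr = q3 - q1

print(f"25th percentile: {q1}")
print(f"50th percentile: {median}")
print(f"75th percentile: {q3}")
print(f"IQR: {iqr}")
```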

What is an outlier?

A histogram of heights of people with values of heights on the x-axis and number of observations with those heights on the y-axis. Green-shaded areas at both ends of the chart highlight possible outliers.

Finding outliers

Using standard deviation

$lower = mean - (3 * SD)$

$upper = mean + (3 * SD)$

Outlier when

$value < lower$ OR $value > upper$
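
The standard deviation rule can be sketched in Python as follows. This is a minimal illustration, assuming the bounds are centred on the mean and that the sample standard deviation is used; the `heights` data and the `sd_outliers` helper are hypothetical.

```python
import statistics


def sd_outliers(values, k=3):
    """Return the values lying more than k standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (an assumption of this sketch)
    lower = mean - k * sd
    upper = mean + k * sd
    return [v for v in values if v < lower or v > upper]


# Hypothetical heights in centimetres, with one implausibly large value.
heights = [158, 160, 161, 162, 163, 164, 165, 165, 166, 167, 168,
           169, 170, 171, 172, 160, 166, 164, 167, 163, 230]

print(sd_outliers(heights))
```

Note that an extreme value inflates both the mean and the standard deviation, so in very small samples this rule may not flag anything.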

Interquartile Range (IQR)

$lower = \text{25th percentile} - (1.5 * IQR)$

$upper = \text{75th percentile} + (1.5 * IQR)$

Outlier when

$value < lower$ OR $value > upper$
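
The IQR rule can be sketched the same way. The `heights` data and the `iqr_outliers` helper are hypothetical, and the `"inclusive"` quantile method is an assumed convention; other percentile definitions can shift the bounds slightly.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical heights in centimetres, with one implausibly large value.
heights = [150, 155, 158, 160, 162, 165, 165, 168, 170, 172, 175, 180, 230]

print(iqr_outliers(heights))
```

Because quartiles are barely affected by a single extreme value, this rule tends to be more robust than the standard deviation rule on small samples.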

Addressing outliers

  1. Remove observations
  2. Imputation

Winsorizing

IF value < 5th percentile THEN value = 5th percentile

IF value > 95th percentile THEN value = 95th percentile
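
The two winsorizing rules above amount to clamping each value between the 5th and 95th percentiles. Here is a minimal Python sketch; the `winsorize` helper and the sample data are hypothetical, and the `"inclusive"` quantile method is an assumed convention.

```python
import statistics


def winsorize(values, lower_pct=5, upper_pct=95):
    """Clamp values to the [lower_pct, upper_pct] percentile range."""
    # quantiles(n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]


# Hypothetical data: the integers 0 through 100.
values = list(range(101))
clipped = winsorize(values)

print(min(clipped), max(clipped))
```

Unlike removing observations, this keeps every row in the dataset and only changes the extreme values.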

Let's practice!
