Exploratory Data Analysis in Power BI
Jacob H. Marquez
Data Scientist at Microsoft
Definition of a distribution: the set of all possible values of a variable and their associated frequencies.
Continuous
| Age | Frequency |
|---|---|
| 18 | 7 |
| 19 | 11 |
| 20 | 13 |
| 21 | 19 |
| 22 | 12 |
Categorical
| Hair Color | Frequency |
|---|---|
| Blonde | 30 |
| Brown | 50 |
| Black | 40 |
| Red | 20 |
| Grey | 20 |
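
Frequency tables like the ones above can be rebuilt from raw observations by counting how often each value occurs. The following is a minimal Python sketch, not the Power BI workflow itself; the function name `frequency_table` and the sample list `ages` are hypothetical, with `ages` constructed to match the counts in the age table.

```python
from collections import Counter


def frequency_table(values):
    """Return (value, frequency) pairs sorted by value."""
    counts = Counter(values)
    return sorted(counts.items())


# Hypothetical raw data matching the age table: 7 people aged 18, 11 aged 19, ...
ages = [18] * 7 + [19] * 11 + [20] * 13 + [21] * 19 + [22] * 12

age_distribution = frequency_table(ages)
for age, frequency in age_distribution:
    print(age, frequency)
```

The same function works for a categorical variable such as hair color, because it only relies on counting equal values.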

Histogram with 100 bins

Histogram with 20 bins
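
To illustrate what the bin count changes, here is a rough sketch of equal-width binning in Python. The helper `histogram_counts` and the simulated `values` are assumptions made for illustration; Power BI performs its own binning. With more bins each bar covers a narrower range and holds fewer observations, so the shape looks noisier; with fewer bins the shape is smoother but coarser.

```python
import random


def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    if width == 0:
        # All values are identical, so everything lands in the first bin.
        counts[0] = len(values)
        return counts
    for v in values:
        # The maximum value is placed in the last bin instead of a new one.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical data: 1,000 draws from a bell-shaped distribution.
rng = random.Random(42)
values = [rng.gauss(50, 10) for _ in range(1000)]

fine = histogram_counts(values, 100)   # many narrow bars
coarse = histogram_counts(values, 20)  # fewer, wider bars
print(max(fine), max(coarse))
```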


Normal distribution

Right-skewed distribution
Larger standard deviation

Smaller standard deviation





Using standard deviation
$lower = mean - (3 * SD)$
$upper = mean + (3 * SD)$
Outlier when
$value < lower$ OR $value > upper$
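
A minimal Python sketch of the standard deviation rule, assuming the bounds are centered on the mean (mean minus and plus 3 standard deviations) and that the sample standard deviation is used. The function name `sd_outliers`, the parameter `k`, and the list `readings` are hypothetical.

```python
import statistics


def sd_outliers(values, k=3.0):
    """Return the values lying more than k standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation
    lower = mean - k * sd
    upper = mean + k * sd
    return [v for v in values if v < lower or v > upper]


# Hypothetical data: 40 typical readings and one extreme value.
readings = [10.0] * 20 + [11.0] * 20 + [100.0]
flagged = sd_outliers(readings)
print(flagged)
```

Because extreme values inflate both the mean and the standard deviation, this rule can miss outliers in small samples, which is one reason the IQR rule is often used as an alternative.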
Interquartile Range (IQR)
$IQR = \text{75th percentile} - \text{25th percentile}$
$lower = \text{25th percentile} - (1.5 * IQR)$
$upper = \text{75th percentile} + (1.5 * IQR)$
Outlier when
$value < lower$ OR $value > upper$
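
The IQR rule can be sketched in Python as follows, taking the IQR as the 75th percentile minus the 25th percentile. The function name `iqr_outliers` and the list `data` are hypothetical, and the "inclusive" quantile method is an assumption: other percentile definitions give slightly different bounds.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # quantiles(..., n=4) returns the 25th, 50th and 75th percentiles.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
flagged = iqr_outliers(data)
print(flagged)
```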
Winsorizing
IF value < 5th percentile THEN value = 5th percentile
IF value > 95th percentile THEN value = 95th percentile
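
The two winsorizing rules amount to clamping every value into the range between the 5th and 95th percentiles. Below is a short Python sketch under that reading; the function name `winsorize`, its parameters, and the sample `values` are hypothetical, and the "inclusive" percentile method is an assumption.

```python
import statistics


def winsorize(values, lower_pct=5, upper_pct=95):
    """Clamp values below/above the given percentiles to those percentiles."""
    # quantiles(..., n=100) returns 99 cut points; index i - 1 is the i-th percentile.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low = cuts[lower_pct - 1]
    high = cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]


# Hypothetical data: the integers 1 through 101.
values = list(range(1, 102))
capped = winsorize(values)
print(min(capped), max(capped))
```

Unlike dropping outliers, winsorizing keeps every row, so the number of observations is unchanged and only the extreme values are pulled in.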