Exploratory Data Analysis in Power BI
Jacob H. Marquez
Data Scientist at Microsoft
Distribution: the set of all possible values of a variable and their associated frequencies.
Continuous
Age | Frequency |
---|---|
18 | 7 |
19 | 11 |
20 | 13 |
21 | 19 |
22 | 12 |
Categorical
Hair Color | Frequency |
---|---|
Blonde | 30 |
Brown | 50 |
Black | 40 |
Red | 20 |
Grey | 20 |
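As a rough illustration of how frequency tables like the ones above are derived, the sketch below counts occurrences in a list of raw observations. The `observations` list and the `frequency_table` helper are hypothetical names introduced for this example; they are not part of the course material or of Power BI.

```python
from collections import Counter

# Hypothetical raw observations, one hair color per person (illustrative only).
observations = ["Brown", "Black", "Brown", "Blonde", "Red", "Brown", "Black", "Grey"]


def frequency_table(values):
    """Return (value, frequency) pairs, most frequent first."""
    return Counter(values).most_common()


for value, frequency in frequency_table(observations):
    print(f"{value} | {frequency} |")
```

The same counting works for a numeric variable such as age, although a variable with many distinct values is usually grouped into bins first.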
Histogram with 100 bins
Histogram with 20 bins
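To make the effect of the bin count concrete, here is a minimal sketch that bins the same data twice with equal-width intervals. The `histogram` function and the randomly generated `values` are assumptions made for illustration; Power BI's own binning may differ in its details.

```python
import random


def histogram(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    if width == 0:
        # All values are identical, so they share the first bin.
        counts[0] = len(values)
        return counts
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts


random.seed(0)  # fixed seed so the illustration is repeatable
values = [random.gauss(50, 10) for _ in range(1000)]  # hypothetical data

fine = histogram(values, 100)   # many narrow bins: detailed but noisy
coarse = histogram(values, 20)  # fewer wide bins: smoother overall shape
print(max(fine), max(coarse))
```

Both histograms account for every observation; only the resolution changes, which is why the tallest bar is higher when there are fewer bins.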
Normal distribution
Right-skewed distribution
Larger standard deviation
Smaller standard deviation
Using standard deviation
$lower = mean - (3 * SD)$
$upper = mean + (3 * SD)$
Outlier when
$value < lower$ OR $value > upper$
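The standard deviation rule can be sketched as follows, assuming the bounds are centered on the mean (three standard deviations on either side). The function name `sd_outliers`, the parameter `k`, and the sample data are hypothetical, and the choice of population versus sample standard deviation is also an assumption.

```python
from statistics import mean, pstdev


def sd_outliers(values, k=3):
    """Return the values lying more than k standard deviations from the mean."""
    center = mean(values)
    sd = pstdev(values)  # population SD (assumption); statistics.stdev gives the sample SD
    lower = center - k * sd
    upper = center + k * sd
    return [v for v in values if v < lower or v > upper]


# Hypothetical data: nineteen typical values and one extreme value.
data = [10] * 19 + [100]
print(sd_outliers(data))
```

Note that an extreme value inflates both the mean and the standard deviation, so with very small samples this rule may never flag anything.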
Interquartile Range (IQR)
$IQR = \text{75th percentile} - \text{25th percentile}$
$lower = \text{25th percentile} - (1.5 * IQR)$
$upper = \text{75th percentile} + (1.5 * IQR)$
Outlier when
$value < lower$ OR $value > upper$
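A minimal sketch of the IQR rule, taking the IQR as the 75th percentile minus the 25th percentile. The function name `iqr_outliers` and the sample data are hypothetical. Percentile conventions differ between tools, so the `"inclusive"` interpolation method used here is an assumption and may not match Power BI's result exactly.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]  # hypothetical data
print(iqr_outliers(data))
```

Because quartiles are barely affected by a few extreme values, this rule tends to be more robust than the standard deviation rule on skewed data.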
Winsorizing
IF value < 5th percentile THEN value = 5th percentile
IF value > 95th percentile THEN value = 95th percentile
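The two winsorizing rules above amount to clamping each value between the 5th and 95th percentiles. The sketch below shows one way to do that; the function name `winsorize`, its parameters, and the `"inclusive"` percentile method are assumptions made for illustration rather than the course's own implementation.

```python
from statistics import quantiles


def winsorize(values, lower_pct=5, upper_pct=95):
    """Clamp values to the [lower_pct, upper_pct] percentile range."""
    # quantiles(..., n=100) returns 99 cut points; cuts[i] is the (i + 1)th percentile.
    cuts = quantiles(values, n=100, method="inclusive")
    low = cuts[lower_pct - 1]
    high = cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]


data = list(range(1, 102))  # hypothetical data: the integers 1 through 101
clipped = winsorize(data)
print(min(clipped), max(clipped))
```

Unlike removing outliers, winsorizing keeps every row, so the number of observations is unchanged.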