Handling outliers

Exploratory Data Analysis in Python

George Boorman

Curriculum Manager, DataCamp

What is an outlier?

An observation far away from other data points
- Median house price: $400,000
- Outlier house price: $5,000,000

Consider why the value is different:
- Location, number of bedrooms, overall size, etc.

Large house with a swimming pool

¹ Image credit: https://unsplash.com/@ralphkayden

Using descriptive statistics

print(salaries["Salary_USD"].describe())

count       518.000
mean     104905.826
std       62660.107
min        3819.000
25%       61191.000
50%       95483.000
75%      137496.000
max      429675.000
Name: Salary_USD, dtype: float64

Using the interquartile range

Interquartile range (IQR)

IQR = 75th - 25th percentile

IQR in box plots

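import matplotlib.pyplot as plt
import seaborn as sns

# Box plot of salaries; points beyond the whiskers are drawn as outliers
sns.boxplot(data=salaries,
            y="Salary_USD")
plt.show()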

Box plot of salaries for data professionals, showing the 25th percentile at the bottom of the box, the 50th percentile as the middle line, the 75th percentile at the top of the box, and outliers as diamonds outside of the box

Using the interquartile range

Upper Outliers > 75th percentile + (1.5 * IQR)
Lower Outliers < 25th percentile - (1.5 * IQR)
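The two threshold rules above can be wrapped in a small helper. This is a minimal sketch, assuming a numeric pandas Series; the function name `iqr_bounds`, the parameter `k`, and the toy data are hypothetical and not part of the course material.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) outlier thresholds using the IQR rule."""
    q1 = values.quantile(0.25)   # 25th percentile
    q3 = values.quantile(0.75)   # 75th percentile
    iqr = q3 - q1                # interquartile range
    return q1 - k * iqr, q3 + k * iqr


# Toy data (made up for illustration): one value sits far from the rest
toy = pd.Series([10, 12, 13, 15, 16, 18, 100])
lower, upper = iqr_bounds(toy)

# Keep only the values outside the thresholds
outliers = toy[(toy < lower) | (toy > upper)]
print(lower, upper)
print(outliers.tolist())
```

With this toy data only the value 100 lies outside the thresholds. The multiplier 1.5 is the conventional choice used in the rule above; exposing it as `k` is just a convenience of this sketch.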

Identifying thresholds

# 75th percentile
seventy_fifth = salaries["Salary_USD"].quantile(0.75)

# 25th percentile
twenty_fifth = salaries["Salary_USD"].quantile(0.25)

# Interquartile range
salaries_iqr = seventy_fifth - twenty_fifth

print(salaries_iqr)

76305.0

Identifying outliers

# Upper threshold
upper = seventy_fifth + (1.5 * salaries_iqr)

# Lower threshold
lower = twenty_fifth - (1.5 * salaries_iqr)

print(upper, lower)

251953.5 -53266.5

Subsetting our data

salaries[(salaries["Salary_USD"] < lower) | (salaries["Salary_USD"] > upper)] \
        [["Experience", "Employee_Location", "Salary_USD"]]

        Experience    Employee_Location    Salary_USD
29      Mid           US                   429675.0
67      Mid           US                   257805.0
80      Senior        US                   263534.0
83      Mid           US                   429675.0
133     Mid           US                   403895.0
410     Executive     US                   309366.0
441     Senior        US                   362837.0
445     Senior        US                   386708.0
454     Senior        US                   254368.0
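Storing the filter condition as a named boolean mask makes it easy to count the flagged rows as well as display them. The sketch below is illustrative only: the small DataFrame is a made-up stand-in for `salaries`, and the names `is_outlier`, `n_outliers`, `share_outliers`, and `by_experience` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the salaries DataFrame (values are made up)
salaries = pd.DataFrame({
    "Experience": ["Mid", "Senior", "Entry", "Senior", "Executive", "Mid"],
    "Salary_USD": [90000, 120000, 60000, 110000, 400000, 95000],
})

# IQR thresholds, computed the same way as in the slides
q1 = salaries["Salary_USD"].quantile(0.25)
q3 = salaries["Salary_USD"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Boolean mask: True where a salary falls outside the thresholds
is_outlier = (salaries["Salary_USD"] < lower) | (salaries["Salary_USD"] > upper)

n_outliers = int(is_outlier.sum())    # number of flagged rows
share_outliers = is_outlier.mean()    # fraction of rows flagged
by_experience = salaries.loc[is_outlier, "Experience"].value_counts()

print(n_outliers, share_outliers)
print(by_experience)
```

Breaking the flagged rows down by a column such as `Experience` is one way to start answering why the outliers exist.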

Why look for outliers?

Outliers are extreme values
- may not accurately represent our data
Can change the mean and standard deviation
Many statistical tests and some machine learning models assume normally distributed data
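To see how much a single extreme value can move the mean and standard deviation, here is a small sketch on made-up numbers (the values and variable names are hypothetical). The median is included for comparison because it is far less sensitive to extremes.

```python
import pandas as pd

# Made-up values: five typical observations, then the same data plus one extreme value
typical = pd.Series([50, 52, 55, 58, 60])
with_outlier = pd.concat([typical, pd.Series([500])], ignore_index=True)

# Side-by-side summary statistics for both versions of the data
summary = pd.DataFrame({
    "typical": typical.agg(["mean", "median", "std"]),
    "with_outlier": with_outlier.agg(["mean", "median", "std"]),
})
print(summary)
```

In this toy case the mean more than doubles and the standard deviation grows many times over, while the median moves only from 55 to 56.5.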

What to do about outliers?

Questions to ask:

Why do these outliers exist?
- More senior roles / different countries pay more
- Consider leaving them in the dataset

Is the data accurate?
- Could there have been an error in data collection?
  - If so, remove them

Dropping outliers

# Keep every row that is not an outlier (values equal to a threshold are kept)
no_outliers = salaries[(salaries["Salary_USD"] >= lower) & (salaries["Salary_USD"] <= upper)]

print(no_outliers["Salary_USD"].describe())

count       509.000000
mean     100674.567780
std       53643.050057
min        3819.000000
25%       60928.000000
50%       95483.000000
75%      134059.000000
max      248257.000000
Name: Salary_USD, dtype: float64

Distribution of salaries

Histogram of salaries before dropping outliers, with extreme values from around 250000 to 450000 dollars

Histogram of salaries after dropping outliers, which almost resembles a normal distribution

Let's practice!
