Handling outliers

Exploratory Data Analysis in Python

George Boorman

Curriculum Manager, DataCamp

What is an outlier?

  • An observation far away from other data points
    • Median house price: $400,000
    • Outlier house price: $5,000,000

 

  • Should consider why the value is different:
    • Location, number of bedrooms, overall size, etc.

Large house with a swimming pool

1 Image credit: https://unsplash.com/@ralphkayden

Using descriptive statistics

print(salaries["Salary_USD"].describe())
count       518.000
mean     104905.826
std       62660.107
min        3819.000
25%       61191.000
50%       95483.000
75%      137496.000
max      429675.000
Name: Salary_USD, dtype: float64

Using the interquartile range

Interquartile range (IQR)

  • IQR = 75th - 25th percentile

IQR in box plots

import matplotlib.pyplot as plt
import seaborn as sns

sns.boxplot(data=salaries,
            y="Salary_USD")
plt.show()

Box plot of salaries for data professionals, showing the 25th percentile at the bottom of the box, the 50th percentile as the middle line, the 75th percentile at the top of the box, and outliers as diamonds outside of the box


Using the interquartile range

Interquartile range (IQR)

  • IQR = 75th - 25th percentile
  • Upper Outliers > 75th percentile + (1.5 * IQR)
  • Lower Outliers < 25th percentile - (1.5 * IQR)

Identifying thresholds

# 75th percentile
seventy_fifth = salaries["Salary_USD"].quantile(0.75)

# 25th percentile
twenty_fifth = salaries["Salary_USD"].quantile(0.25)

# Interquartile range
salaries_iqr = seventy_fifth - twenty_fifth
print(salaries_iqr)
76305.0

Identifying outliers

# Upper threshold
upper = seventy_fifth + (1.5 * salaries_iqr)

# Lower threshold
lower = twenty_fifth - (1.5 * salaries_iqr)
print(upper, lower)
251953.5 -53266.5
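
The threshold calculation above could be wrapped in a small reusable helper. This is a minimal sketch, not part of the course code: the function name `iqr_bounds`, its `k` parameter, and the toy `demo` Series are assumptions made for illustration.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return (lower, upper) outlier thresholds using the IQR rule.

    Hypothetical helper for illustration; k=1.5 matches the multiplier above.
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy data (made up for this sketch), with one obviously extreme value
demo = pd.Series([40_000, 60_000, 80_000, 100_000, 120_000, 500_000])
demo_lower, demo_upper = iqr_bounds(demo)
print(demo_lower, demo_upper)
```

With the real data, the equivalent call would be `iqr_bounds(salaries["Salary_USD"])`.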

Subsetting our data

salaries[(salaries["Salary_USD"] < lower) | (salaries["Salary_USD"] > upper)] \
    [["Experience", "Employee_Location", "Salary_USD"]]
        Experience    Employee_Location    Salary_USD
29      Mid           US                   429675.0
67      Mid           US                   257805.0
80      Senior        US                   263534.0
83      Mid           US                   429675.0
133     Mid           US                   403895.0
410     Executive     US                   309366.0
441     Senior        US                   362837.0
445     Senior        US                   386708.0
454     Senior        US                   254368.0

Why look for outliers?

  • Outliers are extreme values

    • may not accurately represent our data
  • Can change the mean and standard deviation

  • Many statistical tests and some machine learning models assume normally distributed data
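
To illustrate how a single extreme value can shift the mean and standard deviation while barely moving the median, here is a small sketch. The numbers are made up for illustration and are not taken from the salaries dataset.

```python
import pandas as pd

# Toy values (made up for this sketch)
typical = pd.Series([50, 55, 60, 65, 70])
# Same values plus one extreme observation
with_outlier = pd.concat([typical, pd.Series([500])], ignore_index=True)

for label, s in [("without outlier", typical), ("with outlier", with_outlier)]:
    print(f"{label}: mean={s.mean():.1f}, median={s.median():.1f}, std={s.std():.1f}")
```

The mean and standard deviation both jump once the extreme value is included, while the median changes only slightly.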


What to do about outliers?

Questions to ask:

  • Why do these outliers exist?
    • More senior roles / different countries pay more
    • Consider leaving them in the dataset

 

  • Is the data accurate?
    • Could there have been an error in data collection?
      • If so, remove them
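
One way to start answering the first question is to count the outlier rows by category. The sketch below assumes a toy DataFrame, `demo_salaries`, with made-up rows; the column names mirror those used earlier, and the thresholds are the values printed earlier.

```python
import pandas as pd

# Toy rows (made up for this sketch); column names follow the salaries data
demo_salaries = pd.DataFrame({
    "Experience": ["Mid", "Senior", "Senior", "Executive", "Entry"],
    "Employee_Location": ["US", "US", "DE", "US", "IN"],
    "Salary_USD": [429_675.0, 263_534.0, 95_000.0, 309_366.0, 20_000.0],
})

# Thresholds printed earlier for the full dataset
lower, upper = -53_266.5, 251_953.5

is_outlier = (demo_salaries["Salary_USD"] < lower) | (demo_salaries["Salary_USD"] > upper)
outliers = demo_salaries[is_outlier]

# How many outliers fall into each location / experience combination?
outlier_counts = outliers.groupby(["Employee_Location", "Experience"]).size()
print(outlier_counts)
```

If the outliers cluster in particular locations or experience levels, that suggests they reflect real pay differences rather than data-collection errors.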

Dropping outliers

no_outliers = salaries[(salaries["Salary_USD"] > lower) & (salaries["Salary_USD"] < upper)]
print(no_outliers["Salary_USD"].describe())
count       509.000000
mean     100674.567780
std       53643.050057
min        3819.000000
25%       60928.000000
50%       95483.000000
75%      134059.000000
max      248257.000000
Name: Salary_USD, dtype: float64
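
The same filter can also be written with `Series.between`. This is a sketch on made-up data, not the course's method; it assumes pandas 1.3 or later, where `inclusive="neither"` excludes both endpoints to match the strict `>` and `<` comparisons used above.

```python
import pandas as pd

# Toy data (made up for this sketch)
demo = pd.DataFrame({"Salary_USD": [40_000, 60_000, 80_000, 100_000, 120_000, 500_000]})

q1 = demo["Salary_USD"].quantile(0.25)
q3 = demo["Salary_USD"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep rows strictly between the thresholds
mask_between = demo["Salary_USD"].between(lower, upper, inclusive="neither")
# Equivalent explicit comparison, as in the filter above
mask_manual = (demo["Salary_USD"] > lower) & (demo["Salary_USD"] < upper)

demo_no_outliers = demo[mask_between]
print(len(demo), len(demo_no_outliers))
```

Both masks select the same rows; which form to use is mostly a matter of readability.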

Distribution of salaries

Histogram of salaries before dropping outliers, with extreme values from around 250000 to 450000 dollars

Histogram of salaries after dropping outliers, which almost resembles a normal distribution


Let's practice!
