Handling outliers

Exploratory Data Analysis in Python

George Boorman

Curriculum Manager, DataCamp

What is an outlier?

  • An observation far away from other data points
    • Median house price: $400,000
    • Outlier house price: $5,000,000

 

  • Should consider why the value is different:
    • Location, number of bedrooms, overall size, etc.

Large house with a swimming pool

1 Image credit: https://unsplash.com/@ralphkayden

Using descriptive statistics

print(salaries["Salary_USD"].describe())
count       518.000
mean     104905.826
std       62660.107
min        3819.000
25%       61191.000
50%       95483.000
75%      137496.000
max      429675.000
Name: Salary_USD, dtype: float64

Using the interquartile range

Interquartile range (IQR)

  • IQR = 75th - 25th percentile

IQR in box plots

sns.boxplot(data=salaries,
            y="Salary_USD")
plt.show()

Box plot of salaries for data professionals, showing the 25th percentile at the bottom of the box, the 50th percentile as the middle line, the 75th percentile at the top of the box, and outliers as diamonds outside of the box


Using the interquartile range

Interquartile range (IQR)

  • IQR = 75th - 25th percentile
  • Upper Outliers > 75th percentile + (1.5 * IQR)
  • Lower Outliers < 25th percentile - (1.5 * IQR)
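
As a worked example, the two thresholds can be computed by hand from the quartiles printed by describe() earlier (25% = 61191, 75% = 137496). This is only an arithmetic sketch; the variable names are illustrative.

```python
# Quartiles taken from the describe() output for Salary_USD shown earlier
twenty_fifth = 61191.0
seventy_fifth = 137496.0

# Interquartile range: distance between the 75th and 25th percentiles
iqr = seventy_fifth - twenty_fifth

# Thresholds beyond which a value is treated as an outlier
upper = seventy_fifth + 1.5 * iqr
lower = twenty_fifth - 1.5 * iqr

print(iqr, upper, lower)  # 76305.0 251953.5 -53266.5
```

The lower threshold is negative, so no salary in this dataset (minimum 3819) can be a lower outlier; only the upper threshold matters here.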

Identifying thresholds

# 75th percentile
seventy_fifth = salaries["Salary_USD"].quantile(0.75)

# 25th percentile
twenty_fifth = salaries["Salary_USD"].quantile(0.25)

# Interquartile range
salaries_iqr = seventy_fifth - twenty_fifth
print(salaries_iqr)
76305.0

Identifying outliers

# Upper threshold
upper = seventy_fifth + (1.5 * salaries_iqr)

# Lower threshold
lower = twenty_fifth - (1.5 * salaries_iqr)
print(upper, lower)
251953.5 -53266.5

Subsetting our data

salaries[(salaries["Salary_USD"] < lower) | (salaries["Salary_USD"] > upper)] \
        [["Experience", "Employee_Location", "Salary_USD"]]
        Experience    Employee_Location    Salary_USD
29      Mid           US                   429675.0
67      Mid           US                   257805.0
80      Senior        US                   263534.0
83      Mid           US                   429675.0
133     Mid           US                   403895.0
410     Executive     US                   309366.0
441     Senior        US                   362837.0
445     Senior        US                   386708.0
454     Senior        US                   254368.0
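
The threshold and subsetting steps can be wrapped into one reusable function. The sketch below is self-contained and runs on a small made-up DataFrame rather than the salaries dataset. The helper name iqr_bounds and its k parameter are hypothetical, not part of pandas.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return (lower, upper) outlier thresholds for a numeric Series.

    Hypothetical helper for illustration; k is the IQR multiplier.
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up toy data, not the salaries dataset
toy = pd.DataFrame({
    "Experience": ["Entry", "Mid", "Mid", "Senior", "Senior", "Executive", "Mid"],
    "Salary_USD": [40000, 50000, 60000, 70000, 80000, 90000, 400000],
})

lower, upper = iqr_bounds(toy["Salary_USD"])

# Keep rows below the lower threshold or above the upper threshold
outliers = toy[(toy["Salary_USD"] < lower) | (toy["Salary_USD"] > upper)]
print(lower, upper)
print(outliers)
```

On this toy data the quartiles are 55000 and 85000, so the thresholds are 10000 and 130000, and only the 400000 row is flagged.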

Why look for outliers?

  • Outliers are extreme values

    • may not accurately represent our data
  • Can change the mean and standard deviation

  • Many statistical tests and some machine learning models assume normally distributed data
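
To illustrate how much a single extreme value can move the mean and standard deviation, here is a minimal sketch using Python's built-in statistics module on made-up numbers (not the salaries dataset). The median is included for comparison because it is much less sensitive to extremes.

```python
import statistics

# Made-up values for illustration only
typical = [40000, 50000, 60000, 70000, 80000, 90000]
with_outlier = typical + [400000]

for label, values in [("without outlier", typical), ("with outlier", with_outlier)]:
    print(label,
          "| mean:", round(statistics.mean(values)),
          "| median:", statistics.median(values),
          "| std:", round(statistics.stdev(values)))
```

Adding one value of 400000 raises the mean from 65000 to roughly 112857, while the median only moves from 65000 to 70000.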


What to do about outliers?

Questions to ask:

  • Why do these outliers exist?
    • More senior roles / different countries pay more
    • Consider leaving them in the dataset

 

  • Is the data accurate?
    • Could there have been an error in data collection?
      • If so, remove them

Dropping outliers

no_outliers = salaries[(salaries["Salary_USD"] > lower) & (salaries["Salary_USD"] < upper)]
print(no_outliers["Salary_USD"].describe())
count       509.000000
mean     100674.567780
std       53643.050057
min        3819.000000
25%       60928.000000
50%       95483.000000
75%      134059.000000
max      248257.000000
Name: Salary_USD, dtype: float64
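
One detail worth checking when dropping outliers: the filter above uses strict inequalities, so a value exactly equal to a threshold is neither flagged as an outlier nor kept. In the salaries data this does not matter (509 kept + 9 outliers = 518 rows), but the sketch below shows the difference on made-up data, with thresholds assumed to be already computed. Negating the outlier mask with ~ is one way to keep the two subsets complementary.

```python
import pandas as pd

# Made-up values; 130000 sits exactly on the assumed upper threshold
s = pd.Series([40000, 50000, 60000, 70000, 80000, 90000, 130000, 400000],
              name="Salary_USD")
lower, upper = 10000, 130000  # assumed precomputed thresholds

# Outliers: strictly below lower or strictly above upper
outlier_mask = (s < lower) | (s > upper)

# Complement of the outlier mask keeps the boundary value
no_outliers = s[~outlier_mask]

# Strict filter on both sides drops the boundary value as well
strict = s[(s > lower) & (s < upper)]

print(len(s), outlier_mask.sum(), len(no_outliers), len(strict))
```

Here the complement keeps 7 of the 8 values, while the strict filter keeps only 6 because it also drops 130000.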

Distribution of salaries

Histogram of salaries including outliers, with extreme values from around 250,000 to 450,000 dollars

Histogram of salaries after dropping outliers, which more closely resembles a normal distribution


Let's practice!

