Congratulations

Exploratory Data Analysis in Python

George Boorman

Curriculum Manager, DataCamp

Inspection and validation

A histogram of book ratings

books["year"] = books["year"].astype(int)
books.dtypes
name       object
author     object
rating    float64
year        int64
genre      object
dtype: object
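
The type conversion above is one part of validation. A minimal sketch of two further checks (category membership and numeric range) follows; the sample rows and the `expected_genres` list are assumptions made for illustration, not data from the course.

```python
import pandas as pd

# Hypothetical sample rows; the real books dataset is not reproduced here.
books = pd.DataFrame({
    "name": ["Book A", "Book B", "Book C"],
    "genre": ["Fiction", "Non Fiction", "Childrens"],
    "rating": [4.7, 4.1, 4.9],
    "year": [2010.0, 2015.0, 2019.0],
})

# Convert year from float to integer, as on the slide.
books["year"] = books["year"].astype(int)

# Validate categorical values against an expected set (assumed list).
expected_genres = ["Fiction", "Non Fiction", "Childrens"]
genre_ok = books["genre"].isin(expected_genres)

# Validate the numeric range of the ratings.
rating_min = books["rating"].min()
rating_max = books["rating"].max()

print(books.dtypes)
print(genre_ok.all(), rating_min, rating_max)
```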

Aggregation

books.groupby("genre").agg(
    mean_rating=("rating", "mean"),
    std_rating=("rating", "std"),
    median_year=("year", "median")
)
|  genre      | mean_rating | std_rating | median_year |
|-------------|-------------|------------|-------------|
|   Childrens |    4.780000 |   0.122370 |      2015.0 |
|     Fiction |    4.570229 |   0.281123 |      2013.0 |
| Non Fiction |    4.598324 |   0.179411 |      2013.0 |

Address missing data

print(salaries.isna().sum())
Working_Year            12
Designation             27
Experience              33
Employment_Status       31
Employee_Location       28
Company_Size            40
Remote_Working_Ratio    24
Salary_USD              60
dtype: int64

  • Drop missing values
  • Impute mean, median, mode
  • Impute by sub-group

salaries_dict = salaries.groupby("Experience")["Salary_USD"].median().to_dict()
salaries["Salary_USD"] = salaries["Salary_USD"].fillna(salaries["Experience"].map(salaries_dict))
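
The three strategies listed above could be sketched as follows. The toy data is hypothetical, and which strategy suits which column is a judgment call that depends on how much data is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real salaries dataset is not reproduced here.
salaries = pd.DataFrame({
    "Experience": ["Entry", "Entry", "Senior", "Senior", "Senior", None],
    "Salary_USD": [50000.0, np.nan, 120000.0, 100000.0, np.nan, 70000.0],
})

# Strategy 1: drop rows where a chosen column is missing.
dropped = salaries.dropna(subset=["Experience"])

# Strategy 2: impute a categorical column with its mode.
mode_experience = salaries["Experience"].mode()[0]
experience_filled = salaries["Experience"].fillna(mode_experience)

# Strategy 3: impute a numeric column with the median of its sub-group.
salaries_dict = dropped.groupby("Experience")["Salary_USD"].median().to_dict()
imputed = dropped.assign(
    Salary_USD=dropped["Salary_USD"].fillna(dropped["Experience"].map(salaries_dict))
)

print(imputed)
```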

Analyze categorical data

salaries["Job_Category"] = np.select(conditions, 
                                     job_categories, 
                                     default="Other")
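
The call above relies on `conditions` and `job_categories` being defined beforehand. One way they might be built is sketched below; the search patterns, category names, and sample rows are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of job titles.
salaries = pd.DataFrame({
    "Designation": [
        "Data Scientist",
        "Machine Learning Engineer",
        "Data Analyst",
        "Product Manager",
    ]
})

# Category names, in the same order as the conditions below.
job_categories = ["Data Science", "Machine Learning", "Data Analytics"]

# One boolean Series per category; "|" means "or" in the regex pattern.
conditions = [
    salaries["Designation"].str.contains("Data Scientist|NLP", na=False),
    salaries["Designation"].str.contains("Machine Learning|ML", na=False),
    salaries["Designation"].str.contains("Analyst|Analytics", na=False),
]

# The first matching condition wins; rows with no match get the default.
salaries["Job_Category"] = np.select(conditions,
                                     job_categories,
                                     default="Other")
print(salaries)
```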

Bar plot displaying the count of jobs by category

Apply lambda functions

Apply a lambda function

salaries["std_dev"] = salaries.groupby("Experience")["Salary_USD"].transform(lambda x: x.std())

Handle outliers

sns.boxplot(data=salaries,
            y="Salary_USD")
plt.show()

Box plot of salaries for data professionals, showing the 25th percentile at the bottom of the box, the 50th percentile as the middle line, and the 75th percentile at the top of the box
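
The percentiles shown in the box plot can also be used to flag outliers numerically. A minimal sketch using the common 1.5 × IQR rule follows; the sample values are hypothetical, and the 1.5 multiplier is a convention rather than a fixed requirement.

```python
import pandas as pd

# Hypothetical salaries with one extreme value.
salaries = pd.DataFrame({
    "Salary_USD": [40000, 50000, 60000, 70000, 80000, 500000]
})

# 25th and 75th percentiles and the interquartile range.
q25 = salaries["Salary_USD"].quantile(0.25)
q75 = salaries["Salary_USD"].quantile(0.75)
iqr = q75 - q25

# Thresholds beyond which a value is treated as an outlier.
lower = q25 - 1.5 * iqr
upper = q75 + 1.5 * iqr

# Keep only the rows inside the thresholds.
no_outliers = salaries[(salaries["Salary_USD"] > lower) &
                       (salaries["Salary_USD"] < upper)]
print(no_outliers)
```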

Patterns over time

sns.lineplot(data=divorce, x="marriage_month", y="marriage_duration")
plt.show()

A line plot showing the relationship between month of marriage and marriage duration
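
The line plot uses a `marriage_month` column. One way such a column could be derived is sketched below, assuming a hypothetical `marriage_date` column that is stored as text; the sample rows are made up.

```python
import pandas as pd

# Hypothetical rows with dates stored as strings.
divorce = pd.DataFrame({
    "marriage_date": ["2000-06-26", "2001-09-02", "2000-02-02"],
    "marriage_duration": [5.0, 8.0, 2.0],
})

# Parse the strings into datetime values.
divorce["marriage_date"] = pd.to_datetime(divorce["marriage_date"])

# Extract the month number (1 to 12) from each date.
divorce["marriage_month"] = divorce["marriage_date"].dt.month
print(divorce)
```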

Correlation

# numeric_only=True restricts the correlation to numeric columns
sns.heatmap(divorce.corr(numeric_only=True), annot=True)
plt.show()

A heat map of divorce correlations

Distributions

sns.kdeplot(data=divorce, x="marriage_duration", hue="education_man", cut=0)
plt.show()

KDE plot of marriage duration with hue set to education_man and cut equal to zero

Cross-tabulation

pd.crosstab(planes["Source"], planes["Destination"],
            values=planes["Price"], aggfunc="median")
Destination  Banglore   Cochin   Delhi  Hyderabad  Kolkata  New Delhi
Source                                                               
Banglore          NaN      NaN  4823.0        NaN      NaN    10976.5
Chennai           NaN      NaN     NaN        NaN   3850.0        NaN
Delhi             NaN  10262.0     NaN        NaN      NaN        NaN
Kolkata        9345.0      NaN     NaN        NaN      NaN        NaN
Mumbai            NaN      NaN     NaN     3342.0      NaN        NaN

pd.cut()

Provide the bins

planes["Price_Category"] = pd.cut(planes["Price"],
                                  labels=labels,
                                  bins=bins)
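
The call above expects `labels` and `bins` to exist already. A sketch of one possible setup, using quartiles of the price as bin edges, is shown here; the label names and sample prices are assumptions for illustration.

```python
import pandas as pd

# Hypothetical ticket prices.
planes = pd.DataFrame({"Price": [3000.0, 6000.0, 11000.0, 20000.0, 54000.0]})

# Bin edges taken from descriptive statistics of the column.
twenty_fifth = planes["Price"].quantile(0.25)
median = planes["Price"].median()
seventy_fifth = planes["Price"].quantile(0.75)
maximum = planes["Price"].max()

# One label per interval, so there is one fewer label than bin edges.
labels = ["Economy", "Premium Economy", "Business Class", "First Class"]
bins = [0, twenty_fifth, median, seventy_fifth, maximum]

# Each interval includes its right edge by default.
planes["Price_Category"] = pd.cut(planes["Price"],
                                  labels=labels,
                                  bins=bins)
print(planes)
```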

Data snooping

Heatmap with correlation coefficient scores for each number of stops

Generating hypotheses

sns.barplot(data=planes, x="Airline", y="Duration")
plt.show()

Bar plot of duration versus airline

Next steps

Congratulations!
