Congratulations

Exploratory Data Analysis in Python

George Boorman

Curriculum Manager, DataCamp

Inspection and validation

A histogram of book ratings

books["year"] = books["year"].astype(int)
books.dtypes

name       object
author     object
rating    float64
year        int64
genre      object
dtype: object
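
As a rough illustration of the validation step, the sketch below converts a column's type and then checks categories and value ranges. The toy `books` rows and the `valid_genres` list are invented for this example and are not the course data.

```python
import pandas as pd

# Hypothetical toy rows standing in for the books DataFrame
books = pd.DataFrame({
    "name": ["A", "B", "C"],
    "rating": [4.7, 4.5, 4.8],
    "year": ["2016", "2011", "2018"],  # stored as strings on purpose
    "genre": ["Fiction", "Non Fiction", "Childrens"],
})

# Convert the year column to integers, as on the slide
books["year"] = books["year"].astype(int)

# Validate categorical values against an assumed list of allowed genres
valid_genres = ["Fiction", "Non Fiction", "Childrens"]
genres_ok = books["genre"].isin(valid_genres).all()

# Validate numerical values by inspecting their range
year_range = (books["year"].min(), books["year"].max())
print(genres_ok, year_range)
```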

Aggregation

books.groupby("genre").agg(
    mean_rating=("rating", "mean"),
    std_rating=("rating", "std"),
    median_year=("year", "median")
)

|  genre      | mean_rating | std_rating | median_year |
|-------------|-------------|------------|-------------|
|   Childrens |    4.780000 |   0.122370 |      2015.0 |
|     Fiction |    4.570229 |   0.281123 |      2013.0 |
| Non Fiction |    4.598324 |   0.179411 |      2013.0 |

Address missing data

print(salaries.isna().sum())

Working_Year            12
Designation             27
Experience              33
Employment_Status       31
Employee_Location       28
Company_Size            40
Remote_Working_Ratio    24
Salary_USD              60
dtype: int64

Drop missing values

Impute mean, median, mode

Impute by sub-group

salaries_dict = salaries.groupby("Experience")["Salary_USD"].median().to_dict()
salaries["Salary_USD"] = salaries["Salary_USD"].fillna(salaries["Experience"].map(salaries_dict))
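
One way the three strategies might fit together is sketched below. The toy `salaries` rows are invented for illustration, and the choice of which column to drop on and which to impute with the mode is an assumption, not a rule from the slides.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the salaries DataFrame
salaries = pd.DataFrame({
    "Experience": ["Entry", "Entry", "Senior", "Senior", "Senior", None],
    "Employment_Status": ["FT", "FT", None, "FT", "PT", "FT"],
    "Salary_USD": [50000.0, np.nan, 120000.0, np.nan, 100000.0, 70000.0],
})

# 1. Drop rows where the grouping column itself is missing
salaries = salaries.dropna(subset=["Experience"]).copy()

# 2. Impute a categorical column with its mode (most frequent value)
mode_status = salaries["Employment_Status"].mode()[0]
salaries["Employment_Status"] = salaries["Employment_Status"].fillna(mode_status)

# 3. Impute salary by sub-group: median salary per experience level
salaries_dict = salaries.groupby("Experience")["Salary_USD"].median().to_dict()
salaries["Salary_USD"] = salaries["Salary_USD"].fillna(
    salaries["Experience"].map(salaries_dict)
)

# Confirm that no missing values remain
remaining_missing = int(salaries.isna().sum().sum())
print(remaining_missing)
```

Imputing by sub-group keeps the filled-in salaries closer to what is typical for each experience level than a single overall median would.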

Analyze categorical data

salaries["Job_Category"] = np.select(conditions, 
                                     job_categories, 
                                     default="Other")
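
The snippet above relies on `conditions` and `job_categories` being defined beforehand. A minimal sketch of how they could be built with `str.contains()` follows; the job titles, search patterns, and category names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical job titles standing in for the Designation column
salaries = pd.DataFrame({
    "Designation": ["Data Scientist", "Machine Learning Engineer",
                    "Data Analyst", "Head of Data", "Research Intern"],
})

# Category names, in the same order as the conditions below
job_categories = ["Data Science", "Machine Learning",
                  "Data Analytics", "Managerial"]

# One Boolean Series per category; "|" means "or" within a pattern
conditions = [
    salaries["Designation"].str.contains("Data Scientist|NLP"),
    salaries["Designation"].str.contains("Machine Learning|ML|AI"),
    salaries["Designation"].str.contains("Analyst|Analytics"),
    salaries["Designation"].str.contains("Manager|Head"),
]

# The first matching condition wins; rows with no match get "Other"
salaries["Job_Category"] = np.select(conditions, job_categories, default="Other")
print(salaries)
```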

Bar plot displaying the count of jobs by category

Apply lambda functions

salaries["std_dev"] = salaries.groupby("Experience")["Salary_USD"].transform(lambda x: x.std())

Handle outliers

sns.boxplot(data=salaries,
            y="Salary_USD")
plt.show()

Box plot of salaries for data professionals, showing the 25th percentile at the bottom of the box, the 50th percentile as the middle line, and the 75th percentile at the top of the box
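
The percentiles shown in the box plot can also be used to flag outliers numerically. The sketch below applies the common 1.5 × IQR rule to invented salary values; the threshold multiplier is a convention, not a fixed requirement.

```python
import pandas as pd

# Hypothetical salaries with one extreme value
salaries = pd.DataFrame(
    {"Salary_USD": [40000, 50000, 60000, 70000, 80000, 90000, 400000]}
)

# 25th and 75th percentiles, and the interquartile range (IQR)
q1 = salaries["Salary_USD"].quantile(0.25)
q3 = salaries["Salary_USD"].quantile(0.75)
iqr = q3 - q1

# Thresholds: values beyond 1.5 * IQR from the box are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Split the data into outliers and non-outliers
is_outlier = (salaries["Salary_USD"] < lower) | (salaries["Salary_USD"] > upper)
outliers = salaries[is_outlier]
no_outliers = salaries[~is_outlier]
print(outliers)
```

Whether to remove, keep, or cap the flagged rows depends on whether they are errors or genuine extreme values.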

Patterns over time

sns.lineplot(data=divorce, x="marriage_month", y="marriage_duration")
plt.show()

A line plot showing the relationship between month of marriage and marriage duration
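
A column such as `marriage_month` is typically derived from a date column. The sketch below assumes a hypothetical `marriage_date` column stored as strings and uses invented rows.

```python
import pandas as pd

# Hypothetical rows; marriage_date is an assumed column name
divorce = pd.DataFrame({
    "marriage_date": ["2000-06-26", "2001-09-02", "2000-02-02"],
    "marriage_duration": [5.0, 12.0, 8.0],
})

# Convert the strings to datetime values
divorce["marriage_date"] = pd.to_datetime(divorce["marriage_date"])

# Extract the month number (1-12) for use on the x-axis of a line plot
divorce["marriage_month"] = divorce["marriage_date"].dt.month
print(divorce.dtypes)
```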

Correlation

sns.heatmap(divorce.corr(numeric_only=True), annot=True)
plt.show()

A heat map of divorce correlations

Distributions

sns.kdeplot(data=divorce, x="marriage_duration", hue="education_man", cut=0)
plt.show()

KDE plot of marriage duration with hue set to education_man and cut equal to zero

Cross-tabulation

pd.crosstab(planes["Source"], planes["Destination"],
            values=planes["Price"], aggfunc="median")

Destination  Banglore   Cochin   Delhi  Hyderabad  Kolkata  New Delhi
Source                                                               
Banglore          NaN      NaN  4823.0        NaN      NaN    10976.5
Chennai           NaN      NaN     NaN        NaN   3850.0        NaN
Delhi             NaN  10262.0     NaN        NaN      NaN        NaN
Kolkata        9345.0      NaN     NaN        NaN      NaN        NaN
Mumbai            NaN      NaN     NaN     3342.0      NaN        NaN

pd.cut()

Provide the bins

planes["Price_Category"] = pd.cut(planes["Price"],
                                  labels=labels,
                                  bins=bins)
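
The call above needs `bins` and `labels` to exist already. One possible way to build them from descriptive statistics is sketched below; the prices and the label names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical ticket prices
planes = pd.DataFrame(
    {"Price": [2000, 4000, 6000, 8000, 10000, 12000, 14000, 16000, 54000]}
)

# Use quartiles and the maximum as bin edges
twenty_fifth = planes["Price"].quantile(0.25)
median = planes["Price"].median()
seventy_fifth = planes["Price"].quantile(0.75)
maximum = planes["Price"].max()

# There must be exactly one fewer label than bin edges
labels = ["Economy", "Premium Economy", "Business Class", "First Class"]
bins = [0, twenty_fifth, median, seventy_fifth, maximum]

# Each interval includes its right edge by default
planes["Price_Category"] = pd.cut(planes["Price"],
                                  labels=labels,
                                  bins=bins)
print(planes)
```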

Data snooping

Heatmap with correlation coefficient scores for each number of stops

Generating hypotheses

sns.barplot(data=planes, x="Airline", y="Duration")
plt.show()

Bar plot of duration versus airline

Next steps

Sampling in Python

Hypothesis Testing in Python

Supervised Learning with scikit-learn
