Congratulations

Exploratory Data Analysis in Python

George Boorman

Curriculum Manager, DataCamp

Inspection and validation

A histogram of book ratings

books["year"] = books["year"].astype(int)
books.dtypes
name       object
author     object
rating    float64
year        int64
genre      object
dtype: object

Aggregation

books.groupby("genre").agg(
    mean_rating=("rating", "mean"),
    std_rating=("rating", "std"),
    median_year=("year", "median")
)
|  genre      | mean_rating | std_rating | median_year |
|-------------|-------------|------------|-------------|
|   Childrens |    4.780000 |   0.122370 |      2015.0 |
|     Fiction |    4.570229 |   0.281123 |      2013.0 |
| Non Fiction |    4.598324 |   0.179411 |      2013.0 |

Address missing data

print(salaries.isna().sum())
Working_Year            12
Designation             27
Experience              33
Employment_Status       31
Employee_Location       28
Company_Size            40
Remote_Working_Ratio    24
Salary_USD              60
dtype: int64

Address missing data

  • Drop missing values
  • Impute mean, median, mode
  • Impute by sub-group

salaries_dict = salaries.groupby("Experience")["Salary_USD"].median().to_dict()
salaries["Salary_USD"] = salaries["Salary_USD"].fillna(salaries["Experience"].map(salaries_dict))
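The three strategies listed above can be compared side by side. The following is a minimal sketch on a small made-up DataFrame (the values are hypothetical, not the course data); it assumes that rows with a missing Experience are dropped before the sub-group medians are computed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the salaries DataFrame
salaries = pd.DataFrame({
    "Experience": ["Entry", "Entry", "Senior", "Senior", "Senior", None],
    "Salary_USD": [50000.0, np.nan, 120000.0, 100000.0, np.nan, 80000.0],
})

# 1. Drop rows where the grouping column itself is missing
dropped = salaries.dropna(subset=["Experience"])

# 2. Impute with one overall summary statistic (here the median)
overall = salaries["Salary_USD"].fillna(salaries["Salary_USD"].median())

# 3. Impute by sub-group: median salary per experience level
group_medians = dropped.groupby("Experience")["Salary_USD"].median().to_dict()
by_group = dropped["Salary_USD"].fillna(dropped["Experience"].map(group_medians))

print(overall.tolist())
print(by_group.tolist())
```

Imputing by sub-group keeps the filled values closer to what similar rows look like, whereas a single overall median pulls every missing value toward the center of the whole column.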

Analyze categorical data

salaries["Job_Category"] = np.select(conditions, 
                                     job_categories, 
                                     default="Other")

Bar plot displaying the count of jobs by category
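The np.select() call above relies on conditions and job_categories being defined beforehand. One way to build them, sketched here under assumptions, is to match substrings in a job-title column with str.contains(); the Designation values, search patterns, and category names below are illustrative, not taken from the slides.

```python
import numpy as np
import pandas as pd

# Hypothetical job titles standing in for the salaries DataFrame
salaries = pd.DataFrame({
    "Designation": ["Data Scientist", "Machine Learning Engineer",
                    "Data Analyst", "Head of Data", "Research Scientist"]
})

# Assumed category names, in the same order as the conditions below
job_categories = ["Data Science", "Data Analytics", "Machine Learning"]

# Assumed search patterns; "|" means "or" in a regular expression
data_science = "Data Scientist|NLP"
data_analyst = "Analyst|Analytics"
machine_learning = "Machine Learning|ML"

# One boolean Series per category
conditions = [
    salaries["Designation"].str.contains(data_science),
    salaries["Designation"].str.contains(data_analyst),
    salaries["Designation"].str.contains(machine_learning),
]

# The first matching condition wins; rows with no match get the default
salaries["Job_Category"] = np.select(conditions,
                                     job_categories,
                                     default="Other")
print(salaries)
```

Because np.select() takes the first condition that is true, the order of the lists matters when a title could match more than one pattern.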


Apply lambda functions

Apply a lambda function

salaries["std_dev"] = salaries.groupby("Experience")["Salary_USD"].transform(lambda x: x.std())

Handle outliers

sns.boxplot(data=salaries,
            y="Salary_USD")
plt.show()

Box plot of salaries for data professionals, showing the 25th percentile at the bottom of the box, the 50th percentile as the middle line, and the 75th percentile at the top of the box
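The percentiles shown in the box plot can also be used to flag outliers numerically. Below is a minimal sketch using the common 1.5 × IQR convention on made-up salary values; the multiplier and the data are assumptions, and whether to remove, keep, or cap such values depends on the analysis.

```python
import pandas as pd

# Hypothetical salaries with one extreme value
salaries = pd.DataFrame(
    {"Salary_USD": [40000, 50000, 60000, 70000, 80000, 90000, 400000]}
)

# 25th and 75th percentiles: the bottom and top of the box
q1 = salaries["Salary_USD"].quantile(0.25)
q3 = salaries["Salary_USD"].quantile(0.75)
iqr = q3 - q1

# Conventional thresholds: 1.5 * IQR beyond the box edges
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the rows that fall strictly inside the thresholds
no_outliers = salaries[(salaries["Salary_USD"] > lower) &
                       (salaries["Salary_USD"] < upper)]
print(lower, upper, len(no_outliers))
```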


Patterns over time

sns.lineplot(data=divorce, x="marriage_month", y="marriage_duration")
plt.show()

A line plot showing the relationship between month of marriage and marriage duration


Correlation

sns.heatmap(divorce.corr(numeric_only=True), annot=True)
plt.show()

A heat map of divorce correlations


Distributions

sns.kdeplot(data=divorce, x="marriage_duration", hue="education_man", cut=0)
plt.show()

KDE plot of marriage duration with hue set to education_man and cut set to zero


Cross-tabulation

pd.crosstab(planes["Source"], planes["Destination"],
            values=planes["Price"], aggfunc="median")
Destination  Banglore   Cochin   Delhi  Hyderabad  Kolkata  New Delhi
Source                                                               
Banglore          NaN      NaN  4823.0        NaN      NaN    10976.5
Chennai           NaN      NaN     NaN        NaN   3850.0        NaN
Delhi             NaN  10262.0     NaN        NaN      NaN        NaN
Kolkata        9345.0      NaN     NaN        NaN      NaN        NaN
Mumbai            NaN      NaN     NaN     3342.0      NaN        NaN

pd.cut()

Provide the bins

planes["Price_Category"] = pd.cut(planes["Price"],
                                  labels=labels,
                                  bins=bins)
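The pd.cut() call above needs labels and bins to exist already. One possible way to derive them, sketched here as an assumption, is to use the quartiles of the price column as bin edges; the prices and the label names are hypothetical.

```python
import pandas as pd

# Hypothetical ticket prices standing in for the planes DataFrame
planes = pd.DataFrame({"Price": [1800, 4000, 5500, 9000, 12000, 30000]})

# Summary statistics used as bin edges
twenty_fifth = planes["Price"].quantile(0.25)
median = planes["Price"].median()
seventy_fifth = planes["Price"].quantile(0.75)
maximum = planes["Price"].max()

# Assumed label names; there must be one fewer label than bin edges
labels = ["Economy", "Premium Economy", "Business Class", "First Class"]
bins = [0, twenty_fifth, median, seventy_fifth, maximum]

# Each bin includes its right edge by default, e.g. (0, twenty_fifth]
planes["Price_Category"] = pd.cut(planes["Price"],
                                  labels=labels,
                                  bins=bins)
print(planes)
```

Starting the first bin at 0 ensures the lowest price is included, since pd.cut() excludes the left edge of each interval by default.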

Data snooping

Heatmap with correlation coefficient scores for each number of stops


Generating hypotheses

sns.barplot(data=planes, x="Airline", y="Duration")
plt.show()

Bar plot of duration versus airline


Next steps


Congratulations!
