Exploratory Data Analysis in Python
George Boorman
Curriculum Manager, DataCamp
books["year"] = books["year"].astype(int)
books.dtypes
name       object
author     object
rating    float64
year        int64
genre      object
dtype: object
books.groupby("genre").agg(
    mean_rating=("rating", "mean"),
    std_rating=("rating", "std"),
    median_year=("year", "median")
)
| genre | mean_rating | std_rating | median_year |
|-------------|-------------|------------|-------------|
| Childrens | 4.780000 | 0.122370 | 2015.0 |
| Fiction | 4.570229 | 0.281123 | 2013.0 |
| Non Fiction | 4.598324 | 0.179411 | 2013.0 |
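The named-aggregation pattern above can be tried on a small frame. This is a minimal sketch: the toy `books` rows below are hypothetical and are not taken from the course dataset.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only
books = pd.DataFrame({
    "genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction"],
    "rating": [4.0, 5.0, 4.5, 4.5],
    "year": [2010, 2012, 2015, 2017],
})

# Each keyword names an output column; each tuple is (input column, aggregation)
summary = books.groupby("genre").agg(
    mean_rating=("rating", "mean"),
    std_rating=("rating", "std"),
    median_year=("year", "median"),
)
print(summary)
```

Each keyword argument becomes a column in the result, so the output needs no renaming afterwards.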
print(salaries.isna().sum())
Working_Year            12
Designation             27
Experience              33
Employment_Status       31
Employee_Location       28
Company_Size            40
Remote_Working_Ratio    24
Salary_USD              60
dtype: int64
salaries_dict = salaries.groupby("Experience")["Salary_USD"].median().to_dict()
salaries["Salary_USD"] = salaries["Salary_USD"].fillna(salaries["Experience"].map(salaries_dict))
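A minimal sketch of the same group-median imputation on a toy frame. The rows are hypothetical and only serve to show which value each missing salary receives.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing salary per experience level
salaries = pd.DataFrame({
    "Experience": ["Entry", "Entry", "Entry", "Senior", "Senior", "Senior"],
    "Salary_USD": [50000.0, 60000.0, np.nan, 120000.0, np.nan, 140000.0],
})

# Median salary per experience level, as a plain dict
salaries_dict = salaries.groupby("Experience")["Salary_USD"].median().to_dict()

# Map each row's experience level to its group median and use it to fill gaps
salaries["Salary_USD"] = salaries["Salary_USD"].fillna(
    salaries["Experience"].map(salaries_dict)
)
print(salaries)
```

Rows where `Experience` is itself missing map to `NaN`, so their salaries stay unfilled and need separate handling.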
salaries["Job_Category"] = np.select(conditions,
                                     job_categories,
                                     default="Other")
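The slide does not show how `conditions` and `job_categories` are built. The sketch below is one plausible setup, assuming keyword matching on `Designation`; the patterns, category names, and rows are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
salaries = pd.DataFrame({
    "Designation": [
        "Data Scientist",
        "Machine Learning Engineer",
        "Data Analyst",
        "Head of Data",
        "Research Scientist",
    ]
})

# Assumed keyword patterns; "|" means "or" in a regular expression
data_science = "Data Scientist|NLP"
data_analyst = "Analyst|Analytics"
machine_learning = "Machine Learning"

# One boolean Series per category; na=False treats missing titles as no match
conditions = [
    salaries["Designation"].str.contains(data_science, na=False),
    salaries["Designation"].str.contains(data_analyst, na=False),
    salaries["Designation"].str.contains(machine_learning, na=False),
]
job_categories = ["Data Science", "Data Analytics", "Machine Learning"]

# The first matching condition wins; rows matching none get the default
salaries["Job_Category"] = np.select(conditions, job_categories, default="Other")
print(salaries)
```

The order of `conditions` matters: a title that matches several patterns is assigned the category of the first match.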
salaries["std_dev"] = salaries.groupby("Experience")["Salary_USD"].transform(lambda x: x.std())
sns.boxplot(data=salaries,
            y="Salary_USD")
plt.show()
sns.lineplot(data=divorce, x="marriage_month", y="marriage_duration")
plt.show()
sns.heatmap(divorce.corr(numeric_only=True), annot=True)
plt.show()
sns.kdeplot(data=divorce, x="marriage_duration", hue="education_man", cut=0)
plt.show()
pd.crosstab(planes["Source"], planes["Destination"],
            values=planes["Price"], aggfunc="median")
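| Source   | Banglore | Cochin  | Delhi  | Hyderabad | Kolkata | New Delhi |
|----------|----------|---------|--------|-----------|---------|-----------|
| Banglore | NaN      | NaN     | 4823.0 | NaN       | NaN     | 10976.5   |
| Chennai  | NaN      | NaN     | NaN    | NaN       | 3850.0  | NaN       |
| Delhi    | NaN      | 10262.0 | NaN    | NaN       | NaN     | NaN       |
| Kolkata  | 9345.0   | NaN     | NaN    | NaN       | NaN     | NaN       |
| Mumbai   | NaN      | NaN     | NaN    | 3342.0    | NaN     | NaN       |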
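A minimal sketch showing where the `NaN` cells come from: any Source/Destination pair with no rows in the data has nothing to aggregate. The toy routes and prices below are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: only two routes appear
planes = pd.DataFrame({
    "Source": ["Delhi", "Delhi", "Kolkata", "Kolkata"],
    "Destination": ["Cochin", "Cochin", "Banglore", "Banglore"],
    "Price": [10000, 12000, 9000, 9500],
})

# Median price for each Source/Destination pair
table = pd.crosstab(planes["Source"], planes["Destination"],
                    values=planes["Price"], aggfunc="median")
print(table)
```

Here the Delhi to Banglore and Kolkata to Cochin cells are `NaN` because no such flights exist in the toy data.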
planes["Price_Category"] = pd.cut(planes["Price"],
                                  labels=labels,
                                  bins=bins)
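The slide does not define `labels` and `bins`. One plausible setup, sketched below, uses quartiles of `Price` as bin edges; the label names and toy prices are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data
planes = pd.DataFrame(
    {"Price": [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000]}
)

# Quartile-based bin edges (an assumed choice, not necessarily the author's)
twenty_fifth = planes["Price"].quantile(0.25)
median = planes["Price"].median()
seventy_fifth = planes["Price"].quantile(0.75)
maximum = planes["Price"].max()

# Assumed label names; there must be one fewer label than bin edges
labels = ["Economy", "Premium Economy", "Business Class", "First Class"]
bins = [0, twenty_fifth, median, seventy_fifth, maximum]

# Bins are right-inclusive by default: (0, q1], (q1, median], ...
planes["Price_Category"] = pd.cut(planes["Price"], labels=labels, bins=bins)
print(planes)
```

Because the intervals exclude their left edge, a price of exactly 0 would fall outside the first bin and become `NaN`.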
sns.barplot(data=planes, x="Airline", y="Duration")
plt.show()