Exploratory Data Analysis in Python
George Boorman
Curriculum Manager, DataCamp
print(salaries.select_dtypes("object").head())
                  Designation  Experience  Employment_Status  Employee_Location  Company_Size
0              Data Scientist         Mid                 FT                 DE             L
1  Machine Learning Scientist      Senior                 FT                 JP             S
2           Big Data Engineer      Senior                 FT                 GB             M
3        Product Data Analyst         Mid                 FT                 HN             S
4   Machine Learning Engineer      Senior                 FT                 US             L
print(salaries["Designation"].value_counts())
Data Scientist 143
Data Engineer 132
Data Analyst 97
Machine Learning Engineer 41
Research Scientist 16
Data Science Manager 12
Data Architect 11
Big Data Engineer 8
Machine Learning Scientist 8
...
print(salaries["Designation"].nunique())
50
The current format limits our ability to generate insights
pandas.Series.str.contains()
salaries["Designation"].str.contains("Scientist")
0 True
1 True
2 False
3 False
...
604 False
605 False
606 True
Name: Designation, Length: 607, dtype: bool
salaries["Designation"].str.contains("Machine Learning|AI")
0 False
1 True
2 False
3 False
...
604 False
605 False
606 True
Name: Designation, Length: 607, dtype: bool
salaries["Designation"].str.contains("^Data")
0 True
1 False
2 False
3 False
...
604 True
605 True
606 False
Name: Designation, Length: 607, dtype: bool
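The three patterns above can be compared side by side with a minimal sketch. The titles here are made up for illustration and are not rows from the salaries dataset. The na=False argument is an addition not shown in the slides; it makes missing titles evaluate to False instead of propagating as missing values. Matching is case-sensitive by default.

```python
import pandas as pd

# Hypothetical job titles, including one missing value
titles = pd.Series(
    ["Data Scientist", "Machine Learning Scientist",
     "Big Data Engineer", "AI Scientist", None],
    name="Designation",
)

# Plain substring: matches anywhere in the title
has_scientist = titles.str.contains("Scientist", na=False)

# "|" means OR: matches if either alternative appears
ml_or_ai = titles.str.contains("Machine Learning|AI", na=False)

# "^" anchors the pattern to the start of the title
starts_with_data = titles.str.contains("^Data", na=False)

print(pd.DataFrame({
    "title": titles,
    "Scientist": has_scientist,
    "Machine Learning|AI": ml_or_ai,
    "^Data": starts_with_data,
}))
```

Note that "Big Data Engineer" contains "Data" but does not start with it, so "^Data" is False for that row.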
job_categories = ["Data Science", "Data Analytics",
                  "Data Engineering", "Machine Learning",
                  "Managerial", "Consultant"]
data_science = "Data Scientist|NLP"
data_analyst = "Analyst|Analytics"
data_engineer = "Data Engineer|ETL|Architect|Infrastructure"
ml_engineer = "Machine Learning|ML|Big Data|AI"
manager = "Manager|Head|Director|Lead|Principal|Staff"
consultant = "Consultant|Freelance"
conditions = [
    (salaries["Designation"].str.contains(data_science)),
    (salaries["Designation"].str.contains(data_analyst)),
    (salaries["Designation"].str.contains(data_engineer)),
    (salaries["Designation"].str.contains(ml_engineer)),
    (salaries["Designation"].str.contains(manager)),
    (salaries["Designation"].str.contains(consultant))
]
salaries["Job_Category"] = np.select(conditions,
                                     job_categories,
                                     default="Other")
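np.select checks the conditions in order and assigns the choice paired with the first condition that is True, so the order of the lists matters. A minimal self-contained sketch of that behavior follows, using a small made-up DataFrame (df and its titles are illustrative) and a subset of the patterns from the slides.

```python
import numpy as np
import pandas as pd

# Hypothetical titles chosen to exercise each branch
df = pd.DataFrame({"Designation": [
    "Data Scientist", "Big Data Engineer",
    "Machine Learning Engineer", "Head of Data", "Statistician",
]})

categories = ["Data Science", "Data Engineering",
              "Machine Learning", "Managerial"]
patterns = ["Data Scientist|NLP",
            "Data Engineer|ETL|Architect|Infrastructure",
            "Machine Learning|ML|Big Data|AI",
            "Manager|Head|Director|Lead|Principal|Staff"]

# One boolean Series per category, in the same order as categories
conditions = [df["Designation"].str.contains(p) for p in patterns]

# First True condition wins; rows matching nothing get the default
df["Job_Category"] = np.select(conditions, categories, default="Other")
print(df)
```

"Big Data Engineer" matches both the data engineering pattern ("Data Engineer") and the machine learning pattern ("Big Data"). It is labeled Data Engineering because that condition comes first in the list.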
print(salaries[["Designation", "Job_Category"]].head())
                  Designation      Job_Category
0              Data Scientist      Data Science
1  Machine Learning Scientist  Machine Learning
2           Big Data Engineer  Data Engineering
3        Product Data Analyst    Data Analytics
4   Machine Learning Engineer  Machine Learning
sns.countplot(data=salaries, x="Job_Category")
plt.show()
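The bar heights drawn by countplot are the number of rows in each category. As a rough sketch, the same numbers can be checked with pandas before plotting. The labels below are made up and stand in for salaries["Job_Category"]. The resulting bar_order list could be passed to countplot through its order parameter to sort the bars from most to least frequent, which is a suggestion and not part of the slides.

```python
import pandas as pd

# Hypothetical labels standing in for salaries["Job_Category"]
job_category = pd.Series(
    ["Data Science", "Data Engineering", "Data Science",
     "Other", "Data Science", "Data Engineering"],
    name="Job_Category",
)

# Rows per category, most frequent first
counts = job_category.value_counts()
print(counts)

# Category labels ordered by frequency
bar_order = counts.index.tolist()
print(bar_order)
```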