Exploratory Data Analysis in Python
George Boorman
Curriculum Manager, DataCamp
Column | Description | Data type |
---|---|---|
Working_Year | Year the data was obtained | Float |
Designation | Job title | String |
Experience | Experience level, e.g., "Mid", "Senior" | String |
Employment_Status | Type of employment contract, e.g., "FT", "PT" | String |
Employee_Location | Country of employment | String |
Company_Size | Labels for company size, e.g., "S", "M", "L" | String |
Remote_Working_Ratio | Percentage of time working remotely | Integer |
Salary_USD | Salary in US dollars | Float |
print(salaries.isna().sum())
Working_Year 12
Designation 27
Experience 33
Employment_Status 31
Employee_Location 28
Company_Size 40
Remote_Working_Ratio 24
Salary_USD 60
dtype: int64
threshold = len(salaries) * 0.05
print(threshold)
30
cols_to_drop = salaries.columns[salaries.isna().sum() <= threshold]
print(cols_to_drop)
Index(['Working_Year', 'Designation', 'Employee_Location',
'Remote_Working_Ratio'],
dtype='object')
salaries.dropna(subset=cols_to_drop, inplace=True)
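The threshold rule above can be tried on a small frame. This is a minimal sketch, not the course dataset: the `toy` DataFrame and its values are made up for illustration, and only the column names are borrowed from the table.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 40 rows, so 5% of the rows is 2.
toy = pd.DataFrame({
    "Designation": ["Data Scientist"] * 39 + [None],   # 1 missing value
    "Experience": ["Mid"] * 30 + [None] * 10,          # 10 missing values
    "Salary_USD": [50000.0] * 35 + [np.nan] * 5,       # 5 missing values
})

# Columns whose missing count is at or below 5% of the rows
threshold = len(toy) * 0.05
cols_to_drop = toy.columns[toy.isna().sum() <= threshold]

# Drop the ROWS that have a missing value in any of those columns;
# the columns themselves are kept.
toy_clean = toy.dropna(subset=cols_to_drop)

print(list(cols_to_drop))
print(len(toy), "->", len(toy_clean))
```

Despite its name, `cols_to_drop` selects the columns whose missing rows are cheap to discard. Columns with more missing values than the threshold (here `Experience` and `Salary_USD`) are left for imputation.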
cols_with_missing_values = salaries.columns[salaries.isna().sum() > 0]
print(cols_with_missing_values)
Index(['Experience', 'Employment_Status', 'Company_Size', 'Salary_USD'],
dtype='object')
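# Impute every column except the last one (Salary_USD) with its mode
for col in cols_with_missing_values[:-1]:
    # fillna returns a new Series, so assign the result back
    salaries[col] = salaries[col].fillna(salaries[col].mode()[0])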
print(salaries.isna().sum())
Working_Year 0
Designation 0
Experience 0
Employment_Status 0
Employee_Location 0
Company_Size 0
Remote_Working_Ratio 0
Salary_USD 41
dtype: int64
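Mode imputation for categorical columns can be checked on a few rows. The sketch below uses a made-up `toy` frame (hypothetical values, column names borrowed from the table). It assigns the result of `fillna` back to the column, because `fillna` returns a new Series rather than changing the original.

```python
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "Experience": ["Mid", "Senior", "Senior", None, "Entry"],
    "Company_Size": ["M", None, "M", "L", "M"],
})

for col in ["Experience", "Company_Size"]:
    # mode() returns a Series because ties are possible; [0] takes the first
    most_common = toy[col].mode()[0]
    toy[col] = toy[col].fillna(most_common)

print(toy)
print(toy.isna().sum())
```

Here the missing `Experience` becomes "Senior" and the missing `Company_Size` becomes "M", the most frequent value in each column.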
salaries_dict = salaries.groupby("Experience")["Salary_USD"].median().to_dict()
print(salaries_dict)
{'Entry': 55380.0, 'Executive': 135439.0, 'Mid': 74173.5, 'Senior': 128903.0}
salaries["Salary_USD"] = salaries["Salary_USD"].fillna(salaries["Experience"].map(salaries_dict))
print(salaries.isna().sum())
Working_Year 0
Designation 0
Experience 0
Employment_Status 0
Employee_Location 0
Company_Size 0
Remote_Working_Ratio 0
Salary_USD 0
dtype: int64
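Imputing by sub-group, as done above with the median salary per experience level, can be illustrated on a small made-up frame. The values below are hypothetical. The `transform("median")` variant is an alternative shown for comparison, not something the slides use.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing salary per experience level
toy = pd.DataFrame({
    "Experience": ["Entry", "Entry", "Entry", "Senior", "Senior", "Senior"],
    "Salary_USD": [50000.0, 60000.0, np.nan, 120000.0, np.nan, 130000.0],
})

# Approach from the slides: dictionary of group medians, mapped onto each row
medians = toy.groupby("Experience")["Salary_USD"].median().to_dict()
filled_map = toy["Salary_USD"].fillna(toy["Experience"].map(medians))

# Alternative: transform broadcasts each group's median back to its rows
group_median = toy.groupby("Experience")["Salary_USD"].transform("median")
filled_transform = toy["Salary_USD"].fillna(group_median)

print(medians)
print(filled_map.tolist())
```

Both approaches fill the missing "Entry" salary with 55000.0 and the missing "Senior" salary with 125000.0. The median ignores missing values by default, so each is computed from the two observed salaries in its group.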