Analyzing Survey Data in Python
EbunOluwa Andrew
Data Scientist
The entire population is divided into several subgroups (clusters)
Population -> Clusters
Individuals are not sampled directly; whole subgroups are selected at random, and every member of a selected subgroup is included
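The two steps above can be sketched on a toy population. This is a minimal illustration, not the course's code: the `population` frame, its `person` and `cluster` columns, and the seed are all made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy population: six individuals spread over three clusters
population = pd.DataFrame({
    "person": ["a", "b", "c", "d", "e", "f"],
    "cluster": ["north", "north", "south", "south", "east", "east"],
})

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

# Step 1: randomly pick whole clusters, not individuals
clusters = population["cluster"].unique()
chosen = rng.choice(clusters, size=2, replace=False)

# Step 2: keep every individual who belongs to a chosen cluster
toy_sample = population[population["cluster"].isin(chosen)]
print(toy_sample)
```

Because whole clusters are kept, the number of sampled individuals depends on the sizes of the clusters that happen to be drawn.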
print(mh_survey)
| gender | sought_treatment | country_work |
|--------|------------------|--------------------------|
| Male | 0 | United Kingdom |
| Male | 1 | United States of America |
| Male | 1 | United Kingdom |
| Male | 1 | United Kingdom |
| Female | 1 | United States of America |
| Male | 1 | United Kingdom |
| Male | 0 | United States of America |
...
mh_survey.groupby('country_work')['gender'].count()
groups = mh_survey.groupby('country_work')['gender'].count().reset_index()
groups.columns = ['country_work', 'count']
groups.plot.bar(x='country_work', y='count')
import numpy as np
# Each distinct country is one cluster; draw 10 of them without replacement
unique_countries = list(set(mh_survey.country_work))
random_clusters = np.random.choice(unique_countries, size=10, replace=False)
print(random_clusters)
array(['Finland', 'Australia', 'Sweden', 'South Africa', 'Pakistan',
'France', 'Ecuador', 'United Arab Emirates', 'United Kingdom',
'Bangladesh'], dtype='<U24')
cluster_sample = mh_survey[mh_survey.country_work.isin(random_clusters)]
print(cluster_sample.head())
| gender | sought_treatment | country_work |
|--------|------------------|----------------------|
| Male | 1 | Pakistan |
| Male | 1 | Pakistan |
| Male | 1 | United Arab Emirates |
| Male | 1 | Pakistan |
| Female | 0 | Bangladesh |
treatment_pie = cluster_sample.sought_treatment.value_counts(normalize = True)
treatment_pie.plot.pie()
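To see what `value_counts(normalize = True)` feeds into the pie chart, here is a small sketch using only the five `sought_treatment` values visible in the `cluster_sample.head()` preview. The `preview` Series is a stand-in for illustration, so the shares below describe those five rows, not the full cluster sample.

```python
import pandas as pd

# The five sought_treatment values from the cluster_sample.head() preview
preview = pd.Series([1, 1, 1, 1, 0], name="sought_treatment")

# normalize=True returns each value's share of the rows instead of raw counts
shares = preview.value_counts(normalize=True)
print(shares)  # share 0.8 for value 1, share 0.2 for value 0
```

The shares always sum to 1, which is what makes them suitable as pie slices.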
array(['Bangladesh', 'South Africa', 'Other', 'Norway', 'Poland',
'Romania', 'New Zealand', 'France', 'United States of America',
'Bulgaria'], dtype='<U24')
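The two draws shown above contain different countries because `np.random.choice` is unseeded. If a repeatable draw is wanted, one option is a seeded NumPy `Generator`. The sketch below is an assumption-laden illustration: `draw_clusters` is a hypothetical helper, the seed is arbitrary, and `countries` is a short stand-in list taken from the first draw. The labels are sorted first because the iteration order of a `set` of strings can vary between Python sessions, which would otherwise change the result even with a fixed seed.

```python
import numpy as np

# Stand-in list of cluster labels (a few of the countries from the first draw)
countries = ["Finland", "Australia", "Sweden", "South Africa", "Pakistan", "France"]

def draw_clusters(labels, size, seed):
    """Pick `size` distinct labels using a seeded generator (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Sort so the input order is stable before the seeded draw
    return rng.choice(sorted(labels), size=size, replace=False)

first = draw_clusters(countries, size=3, seed=42)
second = draw_clusters(countries, size=3, seed=42)
print(first)
print(second)  # same clusters as the first draw, because the seed is the same
```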