Cluster Sampling

Analyzing Survey Data in Python

EbunOluwa Andrew

Data Scientist

What is cluster sampling?

Entire population divided into several subgroups
- Subgroups have characteristics similar to the population
Population -> Clusters
Does not sample individuals; instead, whole subgroups are selected at random

pie chart composed of people

Why cluster sampling is important

We cannot always gather data from the entire population
Helps minimize the error that comes from trying to survey a very large population

overpopulation

Steps in cluster sampling analysis

First, divide the population into clusters
Second, perform a random selection of these clusters
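The two steps above can be sketched on a toy table. This is only an illustration: the `population` DataFrame, its rows, and the seed are made up here, and the column names simply mirror the survey shown later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy population (made-up rows, for illustration only)
population = pd.DataFrame({
    "country_work": ["A", "A", "B", "B", "B", "C", "D", "D"],
    "sought_treatment": [1, 0, 1, 1, 0, 0, 1, 1],
})

# Step 1: the clusters are the distinct values of the grouping column
clusters = population["country_work"].unique()

# Step 2: randomly select whole clusters (seeded so the draw is repeatable)
rng = np.random.default_rng(seed=0)
chosen = rng.choice(clusters, size=2, replace=False)

# Keep every row that belongs to a chosen cluster
toy_sample = population[population["country_work"].isin(chosen)]
print(toy_sample)
```

Note that every respondent in a chosen cluster is kept; the randomness is in which clusters are picked, not which individuals.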

cluster of people

Sample dataset

print(mh_survey)

| gender | sought_treatment | country_work             |
|--------|------------------|--------------------------|
| Male   |                0 | United Kingdom           |
| Male   |                1 | United States of America |
| Male   |                1 | United Kingdom           |
| Male   |                1 | United Kingdom           |
| Female |                1 | United States of America |
| Male   |                1 | United Kingdom           |
| Male   |                0 | United States of America |
...

Sample dataset and plot

mh_survey.groupby('country_work')[
  'gender'].count()

groups = mh_survey.groupby(
  'country_work')['gender'].count(
).reset_index()

groups.columns=['country_work','count']

groups.plot.bar(x='country_work',
                y='count')

bar plot of where tech workers live

¹ _partial data plotted due to space_

Choose clusters

import numpy as np

unique_countries = list(set(mh_survey.country_work))

random_clusters = np.random.choice(unique_countries, size = 10, replace = False)

print(random_clusters)

array(['Finland', 'Australia', 'Sweden', 'South Africa', 'Pakistan',
       'France', 'Ecuador', 'United Arab Emirates', 'United Kingdom',
       'Bangladesh'], dtype='<U24')
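Because the selection is random, rerunning the code will usually return different clusters. One possible way to make the draw repeatable is to seed a NumPy generator. The sketch below assumes a hypothetical helper, `choose_clusters`, and a small made-up stand-in for `mh_survey`; neither comes from the course material.

```python
import numpy as np
import pandas as pd

def choose_clusters(df, column, n_clusters, seed=None):
    """Randomly pick n_clusters distinct values of df[column] (hypothetical helper)."""
    # sorted() gives a stable starting order, unlike list(set(...))
    labels = sorted(df[column].dropna().unique())
    rng = np.random.default_rng(seed)
    return rng.choice(labels, size=n_clusters, replace=False)

# Small stand-in for mh_survey (made-up rows, for illustration only)
demo = pd.DataFrame({
    "country_work": ["Finland", "France", "Sweden", "France", "Pakistan"],
})

first = choose_clusters(demo, "country_work", n_clusters=2, seed=42)
second = choose_clusters(demo, "country_work", n_clusters=2, seed=42)
print(first, second)  # the same seed gives the same clusters
```

Leaving `seed=None` keeps the draw unseeded, so repeated calls will generally return different clusters.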

Create cluster sample

cluster_sample = mh_survey[mh_survey.country_work.isin(random_clusters)]
print(cluster_sample.head())

| gender | sought_treatment | country_work         |
|--------|------------------|----------------------|
| Male   |                1 |             Pakistan |
| Male   |                1 |             Pakistan |
| Male   |                1 | United Arab Emirates |
| Male   |                1 |             Pakistan |
| Female |                0 |           Bangladesh |

Plot cluster sample

treatment_pie = cluster_sample.sought_treatment.value_counts(normalize = True)
treatment_pie.plot.pie()

pie plot of sought_treatment

Plot cluster sample (another random draw)

array(['Bangladesh', 'South Africa', 'Other', 'Norway', 'Poland',
       'Romania', 'New Zealand', 'France', 'United States of America',
       'Bulgaria'], dtype='<U24')

pie plot of sought_treatment
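A different random draw of clusters can give a different share of respondents who sought treatment. The sketch below illustrates that variability on made-up data; the `demo` table, the number of draws, and the seed are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Made-up population with four equally sized clusters (illustration only)
demo = pd.DataFrame({
    "country_work": ["A"] * 4 + ["B"] * 4 + ["C"] * 4 + ["D"] * 4,
    "sought_treatment": [1, 1, 1, 0,  0, 0, 0, 1,  1, 1, 0, 0,  1, 1, 1, 1],
})

labels = sorted(demo["country_work"].unique())
rng = np.random.default_rng(seed=1)

# Repeat the cluster draw several times and record the sample share each time
estimates = []
for _ in range(5):
    chosen = rng.choice(labels, size=2, replace=False)
    sample = demo[demo["country_work"].isin(chosen)]
    estimates.append(sample["sought_treatment"].mean())

print("population share:", demo["sought_treatment"].mean())
print("cluster-sample shares:", estimates)
```

Comparing the printed shares with the population share gives a rough feel for how much a single cluster sample can differ from the population it came from.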

Let's practice!
