Cluster Sampling

Analyzing Survey Data in Python

EbunOluwa Andrew

Data Scientist

What is cluster sampling?

  • Entire population divided into several subgroups

    • Subgroups have characteristics similar to the population
  • Population -> Clusters

  • Does not sample individuals; instead, whole subgroups are selected at random

pie chart composed of people


Why cluster sampling is important

  • We cannot always gather data from the entire population
  • Minimizes the error that comes with surveying a very large population

overpopulation


Steps in cluster sampling analysis

  • First, divide the population into clusters
  • Second, perform a random selection of these clusters
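
The two steps above can be sketched on a small made-up table. This is a minimal illustration, not the course's own code: `toy_survey`, its values, and the seed are assumptions chosen only to keep the example self-contained.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a survey; not the course dataset.
toy_survey = pd.DataFrame({
    "country_work": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "sought_treatment": [1, 0, 1, 1, 0, 0, 1, 0],
})

# Step 1: the clusters are the distinct values of the grouping column.
clusters = sorted(toy_survey["country_work"].unique())

# Step 2: randomly select whole clusters (seeded so the draw is repeatable).
rng = np.random.default_rng(seed=0)
chosen = rng.choice(clusters, size=2, replace=False)

# Keep every respondent who belongs to a selected cluster.
toy_sample = toy_survey[toy_survey["country_work"].isin(chosen)]
print(sorted(chosen), len(toy_sample))
```

Every row of a chosen cluster ends up in the sample; no individual-level sampling happens inside a cluster.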

cluster of people


Sample dataset

print(mh_survey)
| gender | sought_treatment | country_work             |
|--------|------------------|--------------------------|
| Male   |                0 | United Kingdom           |
| Male   |                1 | United States of America |
| Male   |                1 | United Kingdom           |
| Male   |                1 | United Kingdom           |
| Female |                1 | United States of America |
| Male   |                1 | United Kingdom           |
| Male   |                0 | United States of America |
...

Sample dataset and plot

mh_survey.groupby('country_work')[
  'gender'].count()
groups = mh_survey.groupby(
  'country_work')['gender'].count(
).reset_index()
groups.columns=['country_work','count']

groups.plot.bar(x='country_work',
                y='count')

bar plot of where tech workers live

Note: only part of the data is plotted due to space.

Choose clusters

import numpy as np
unique_countries = list(set(mh_survey.country_work))

# Randomly select 10 whole clusters (countries), without replacement
random_clusters = np.random.choice(unique_countries, size=10, replace=False)

print(random_clusters)
array(['Finland', 'Australia', 'Sweden', 'South Africa', 'Pakistan',
       'France', 'Ecuador', 'United Arab Emirates', 'United Kingdom',
       'Bangladesh'], dtype='<U24')
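
Because the selection is random, the printed clusters change from run to run. One way to make a draw repeatable is a seeded generator, sketched below. The country list, the helper name `draw_clusters`, and the seed are illustrative assumptions, not part of the original slides.

```python
import numpy as np

# Hypothetical stand-in for the unique countries in mh_survey.
# Sorting fixes the order, since iterating over a set of strings can vary between runs.
unique_countries = sorted(
    {"Finland", "Australia", "Sweden", "France", "Pakistan", "Ecuador"}
)

def draw_clusters(names, size, seed):
    """Return `size` distinct cluster names, repeatable for a given seed."""
    rng = np.random.default_rng(seed)
    return rng.choice(names, size=size, replace=False)

first = draw_clusters(unique_countries, size=3, seed=42)
second = draw_clusters(unique_countries, size=3, seed=42)

# Same seed and same input order give the same clusters.
print(list(first) == list(second))
```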

Create cluster sample

cluster_sample = mh_survey[mh_survey.country_work.isin(random_clusters)]
print(cluster_sample.head())
| gender | sought_treatment | country_work         |
|--------|------------------|----------------------|
| Male   |                1 |             Pakistan |
| Male   |                1 |             Pakistan |
| Male   |                1 | United Arab Emirates |
| Male   |                1 |             Pakistan |
| Female |                0 |           Bangladesh |

Plot cluster sample

treatment_pie = cluster_sample.sought_treatment.value_counts(normalize = True)
treatment_pie.plot.pie()

pie plot of sought_treatment
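
The proportions behind the pie chart can also be read as numbers and set next to the full-data proportions. The sketch below uses made-up data and an assumed cluster draw (`toy_survey`, `toy_clusters`), so the figures say nothing about the real survey.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror mh_survey.
toy_survey = pd.DataFrame({
    "country_work": ["A", "A", "A", "B", "B", "C", "C", "C"],
    "sought_treatment": [1, 1, 0, 1, 0, 0, 0, 1],
})
toy_clusters = ["A", "B"]  # assumed result of a random cluster draw

toy_sample = toy_survey[toy_survey["country_work"].isin(toy_clusters)]

# Share of respondents per answer, in the sample and in the full data.
sample_share = toy_sample["sought_treatment"].value_counts(normalize=True)
population_share = toy_survey["sought_treatment"].value_counts(normalize=True)

comparison = pd.DataFrame({"sample": sample_share, "population": population_share})
print(comparison)
```

A gap between the two columns shows how much a particular choice of clusters can shift the estimate.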


Plot cluster sample: a different random draw

array(['Bangladesh', 'South Africa', 'Other', 'Norway', 'Poland',
       'Romania', 'New Zealand', 'France', 'United States of America',
       'Bulgaria'], dtype='<U24')

pie plot of sought_treatment


Let's practice!
