Considerations for categorical data

Exploratory Data Analysis in Python

George Boorman

Curriculum Manager, DataCamp

Why perform EDA?

Detecting patterns and relationships

Generating questions, or hypotheses

Preparing data for machine learning

Question mark in a red neon light

¹ Image credit: https://unsplash.com/@simonesecci

Representative data

Sample represents the population

For example:

Education versus income in USA
- Can't use data from France

USA flag

France flag

¹ Image credits: https://unsplash.com/@cristina_glebova; https://unsplash.com/@nimbus_vulpis

Categorical classes

Classes = labels

Survey people's attitudes towards marriage
- Marital status
  - Single
  - Married
  - Divorced

Class imbalance

bar plot showing counts of marital statuses in a sample - 700 divorced, 250 single, and 50 married
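
A minimal sketch of how the imbalance in the bar plot could be quantified. The Series below is hypothetical: it is constructed only to reproduce the counts shown in the plot (700 divorced, 250 single, 50 married), and the name `marital_status` is an assumption, not a column from the course data.

```python
import pandas as pd

# Hypothetical sample built to match the bar plot counts
marital_status = pd.Series(
    ["Divorced"] * 700 + ["Single"] * 250 + ["Married"] * 50,
    name="marital_status",
)

# Absolute and relative class frequencies
counts = marital_status.value_counts()
shares = marital_status.value_counts(normalize=True)

# Ratio of the largest class to the smallest: a rough imbalance indicator
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares)
print(imbalance_ratio)
```

A ratio far above 1 suggests that conclusions about the smallest class rest on comparatively few observations.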

Class frequency

print(planes["Destination"].value_counts())

Cochin       4391
Banglore     2773
Delhi        1219
New Delhi     888
Hyderabad     673
Kolkata       369
Name: Destination, dtype: int64

Relative class frequency

40% of internal Indian flights have a destination of Delhi

planes["Destination"].value_counts(normalize=True)

Cochin       0.425773
Banglore     0.268884
Delhi        0.118200
New Delhi    0.086105
Hyderabad    0.065257
Kolkata      0.035780
Name: Destination, dtype: float64

Is our sample representative of the population (Indian internal flights)?
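
One way to approach this question is to compare a class's share in the sample with its known share in the population. The sketch below rebuilds the class counts from the value_counts() output above, because the planes DataFrame itself is not available here, and uses the 40% figure for Delhi quoted earlier. The variable names are illustrative.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
destination_counts = pd.Series(
    {
        "Cochin": 4391,
        "Banglore": 2773,
        "Delhi": 1219,
        "New Delhi": 888,
        "Hyderabad": 673,
        "Kolkata": 369,
    },
    name="Destination",
)

# Relative class frequency, equivalent to value_counts(normalize=True)
sample_share = destination_counts / destination_counts.sum()

# Population share for Delhi as stated on the slide
population_delhi_share = 0.40

# Negative gap: Delhi is under-represented in the sample
gap = sample_share["Delhi"] - population_delhi_share

print(round(sample_share["Delhi"], 4))
print(round(gap, 4))
```

A large gap between the two shares is a signal that the sample may not represent the population well for that class.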

Cross-tabulation

Call pd.crosstab()

pd.crosstab(

Select index

Select column to use as the index

pd.crosstab(planes["Source"],

Select columns

Select the column whose values become the table's columns

pd.crosstab(planes["Source"], planes["Destination"])

Cross-tabulation

Destination  Banglore  Cochin  Delhi  Hyderabad  Kolkata  New Delhi
Source                                                             
Banglore            0       0   1199          0        0        868
Chennai             0       0      0          0      364          0
Delhi               0    4318      0          0        0          0
Kolkata          2720       0      0          0        0          0
Mumbai              0       0      0        662        0          0
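
To experiment with cross-tabulation without the full dataset, a toy example can help. The DataFrame `planes_toy` below is hypothetical, with only six made-up rows; it borrows the column names from the slides. The `margins=True` argument, which is not used in the slides, adds row and column totals.

```python
import pandas as pd

# Hypothetical six-row stand-in for the planes DataFrame
planes_toy = pd.DataFrame(
    {
        "Source": ["Banglore", "Banglore", "Banglore", "Delhi", "Delhi", "Mumbai"],
        "Destination": ["Delhi", "Delhi", "New Delhi", "Cochin", "Cochin", "Hyderabad"],
    }
)

# Count how often each Source/Destination combination occurs
route_counts = pd.crosstab(planes_toy["Source"], planes_toy["Destination"])

# Same table with an extra "All" row and column holding the totals
route_totals = pd.crosstab(
    planes_toy["Source"], planes_toy["Destination"], margins=True
)

print(route_counts)
print(route_totals)
```

Combinations that never occur in the data appear as 0, as in the output above.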

Extending cross-tabulation

Source    Destination  Median Price (IDR)
Banglore  Delhi                   4232.21
Banglore  New Delhi              12114.56
Chennai   Kolkata                 3859.76
Delhi     Cochin                  9987.63
Kolkata   Banglore                9654.21
Mumbai    Hyderabad               3431.97

Aggregated values with pd.crosstab()

pd.crosstab(planes["Source"], planes["Destination"],
            values=planes["Price"], aggfunc="median")

Destination  Banglore   Cochin   Delhi  Hyderabad  Kolkata  New Delhi
Source                                                               
Banglore          NaN      NaN  4823.0        NaN      NaN    10976.5
Chennai           NaN      NaN     NaN        NaN   3850.0        NaN
Delhi             NaN  10262.0     NaN        NaN      NaN        NaN
Kolkata        9345.0      NaN     NaN        NaN      NaN        NaN
Mumbai            NaN      NaN     NaN     3342.0      NaN        NaN
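
The values and aggfunc arguments can be tried out on a small example. The sketch below uses a hypothetical DataFrame, `planes_toy`, with made-up prices, and also shows that a groupby on the same two columns yields the same medians in long form. The groupby comparison is an addition for illustration, not part of the slides.

```python
import pandas as pd

# Hypothetical stand-in for the planes DataFrame with made-up prices
planes_toy = pd.DataFrame(
    {
        "Source": ["Banglore", "Banglore", "Banglore", "Delhi", "Delhi"],
        "Destination": ["Delhi", "Delhi", "New Delhi", "Cochin", "Cochin"],
        "Price": [4000.0, 5000.0, 11000.0, 9000.0, 11000.0],
    }
)

# Median price for each Source/Destination combination (wide form)
median_prices = pd.crosstab(
    planes_toy["Source"],
    planes_toy["Destination"],
    values=planes_toy["Price"],
    aggfunc="median",
)

# The same medians in long form, one row per observed route
median_long = planes_toy.groupby(["Source", "Destination"])["Price"].median()

print(median_prices)
print(median_long)
```

Routes with no observations show NaN in the wide table, since there is nothing to aggregate, while the long form simply omits them.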

Comparing sample to population

Source    Destination  Median Price (IDR)  Median Price (dataset)
Banglore  Delhi                   4232.21                  4823.0
Banglore  New Delhi              12114.56                 10976.5
Chennai   Kolkata                 3859.76                  3850.0
Delhi     Cochin                  9987.63                 10262.0
Kolkata   Banglore                9654.21                  9345.0
Mumbai    Hyderabad               3431.97                  3342.0
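
A sketch of how the comparison could be made quantitative by computing the percentage difference between the two medians for each route. The population medians are copied from the table above and the dataset medians from the pd.crosstab() output. The column names `population_median`, `dataset_median`, and `pct_diff` are illustrative choices.

```python
import pandas as pd

# Population medians from the table; dataset medians from the pd.crosstab() output
comparison = pd.DataFrame(
    {
        "Source": ["Banglore", "Banglore", "Chennai", "Delhi", "Kolkata", "Mumbai"],
        "Destination": ["Delhi", "New Delhi", "Kolkata", "Cochin", "Banglore", "Hyderabad"],
        "population_median": [4232.21, 12114.56, 3859.76, 9987.63, 9654.21, 3431.97],
        "dataset_median": [4823.0, 10976.5, 3850.0, 10262.0, 9345.0, 3342.0],
    }
)

# Percentage by which the dataset median differs from the population median
comparison["pct_diff"] = (
    (comparison["dataset_median"] - comparison["population_median"])
    / comparison["population_median"]
    * 100
)

# Route with the largest absolute deviation
largest_gap = comparison.loc[comparison["pct_diff"].abs().idxmax()]

print(comparison.round(2))
print(largest_gap["Source"], largest_gap["Destination"])
```

Routes with large deviations are the ones where the sample is least likely to reflect the population.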

Let's practice!
