Considerations for categorical data

Exploratory Data Analysis in Python

George Boorman

Curriculum Manager, DataCamp

Why perform EDA?

  • Detecting patterns and relationships

  • Generating questions or hypotheses

  • Preparing data for machine learning

Question mark in a red neon light

1 Image credit: https://unsplash.com/@simonesecci

Representative data

  • The sample must represent the population we want to draw conclusions about

For example:

  • Analyzing education versus income in the USA
    • Can't use data collected in France

USA flag

France flag

1 Image credits: https://unsplash.com/@cristina_glebova; https://unsplash.com/@nimbus_vulpis

Categorical classes

  • Classes = labels (the possible values of a categorical variable)

  • Example: a survey of people's attitudes towards marriage
    • Marital status
      • Single
      • Married
      • Divorced

Class imbalance

bar plot showing counts of marital statuses in a sample - 700 divorced, 250 single, and 50 married
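
A minimal sketch of how the imbalance in the bar plot could be quantified. The counts come from the plot description above; the Series and variable names are illustrative assumptions, not part of the course code.

```python
import pandas as pd

# Counts read off the bar plot described above (names are illustrative)
marital_status = pd.Series(
    {"Divorced": 700, "Single": 250, "Married": 50}, name="count"
)

# Share of each class in the sample
shares = marital_status / marital_status.sum()
print(shares)

# Ratio of the largest class to the smallest, a rough imbalance indicator
imbalance_ratio = marital_status.max() / marital_status.min()
print(imbalance_ratio)
```

With these counts, divorced respondents make up 70% of the sample, so any conclusion about attitudes towards marriage would be dominated by that one class.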

Class frequency

print(planes["Destination"].value_counts())
Cochin       4391
Banglore     2773
Delhi        1219
New Delhi     888
Hyderabad     673
Kolkata       369
Name: Destination, dtype: int64

Relative class frequency

  • Known about the population: 40% of internal Indian flights have a destination of Delhi
planes["Destination"].value_counts(normalize=True)
Cochin       0.425773
Banglore     0.268884
Delhi        0.118200
New Delhi    0.086105
Hyderabad    0.065257
Kolkata      0.035780
Name: Destination, dtype: float64
  • Delhi makes up only about 11.8% of destinations in our data, so is our sample representative of the population (Indian internal flights)?
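
One way to make the comparison explicit is sketched below. It rebuilds the destination counts from the value_counts() output shown earlier rather than loading the planes DataFrame, and treats the 40% figure quoted on this slide as the population share; the variable names are assumptions for illustration.

```python
import pandas as pd

# Destination counts copied from the value_counts() output above
counts = pd.Series(
    {
        "Cochin": 4391,
        "Banglore": 2773,
        "Delhi": 1219,
        "New Delhi": 888,
        "Hyderabad": 673,
        "Kolkata": 369,
    },
    name="Destination",
)

# Equivalent to value_counts(normalize=True) on the original column
sample_share = counts / counts.sum()

# Population share quoted on the slide (40% of flights go to Delhi)
population_delhi_share = 0.40

gap = sample_share["Delhi"] - population_delhi_share
print(f"Delhi share in sample: {sample_share['Delhi']:.3f}")
print(f"Gap vs population:     {gap:+.3f}")
```

Even if Delhi and New Delhi were counted together, the combined share would be roughly 20%, still well below the population figure.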

Cross-tabulation

Call pd.crosstab()

pd.crosstab(

Select index

Select the column to use as the index (its values become the row labels)

pd.crosstab(planes["Source"],

Select columns

Select the column whose values become the column labels

pd.crosstab(planes["Source"], planes["Destination"])

Cross-tabulation

Destination  Banglore  Cochin  Delhi  Hyderabad  Kolkata  New Delhi
Source                                                             
Banglore            0       0   1199          0        0        868
Chennai             0       0      0          0      364          0
Delhi               0    4318      0          0        0          0
Kolkata          2720       0      0          0        0          0
Mumbai              0       0      0        662        0          0
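
To try the same call without the full dataset, here is a small sketch on a hypothetical five-row stand-in for planes. The planes_demo name and its rows are invented for illustration; only the pd.crosstab() call mirrors the slides.

```python
import pandas as pd

# Hypothetical miniature stand-in for the planes DataFrame
planes_demo = pd.DataFrame(
    {
        "Source": ["Banglore", "Banglore", "Delhi", "Kolkata", "Banglore"],
        "Destination": ["Delhi", "New Delhi", "Cochin", "Banglore", "Delhi"],
    }
)

# Rows come from the first argument, columns from the second;
# each cell counts how many rows share that combination
route_counts = pd.crosstab(planes_demo["Source"], planes_demo["Destination"])
print(route_counts)
```

Combinations that never occur are filled with 0, which is why the table above is mostly zeros.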

Extending cross-tabulation

Source    Destination  Median Price (INR)
Banglore  Delhi                   4232.21
Banglore  New Delhi              12114.56
Chennai   Kolkata                 3859.76
Delhi     Cochin                  9987.63
Kolkata   Banglore                9654.21
Mumbai    Hyderabad               3431.97

Aggregated values with pd.crosstab()

pd.crosstab(planes["Source"], planes["Destination"],
            values=planes["Price"], aggfunc="median")
Destination  Banglore   Cochin   Delhi  Hyderabad  Kolkata  New Delhi
Source                                                               
Banglore          NaN      NaN  4823.0        NaN      NaN    10976.5
Chennai           NaN      NaN     NaN        NaN   3850.0        NaN
Delhi             NaN  10262.0     NaN        NaN      NaN        NaN
Kolkata        9345.0      NaN     NaN        NaN      NaN        NaN
Mumbai            NaN      NaN     NaN     3342.0      NaN        NaN
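
The values and aggfunc arguments can be explored on a small hypothetical frame, sketched below. The planes_demo rows and prices are invented for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical rows; prices are made up for this example
planes_demo = pd.DataFrame(
    {
        "Source": ["Banglore", "Banglore", "Banglore", "Delhi", "Delhi"],
        "Destination": ["Delhi", "Delhi", "New Delhi", "Cochin", "Cochin"],
        "Price": [4000.0, 5000.0, 11000.0, 9000.0, 11000.0],
    }
)

# values supplies the numbers to aggregate; aggfunc says how to combine them
median_prices = pd.crosstab(
    planes_demo["Source"],
    planes_demo["Destination"],
    values=planes_demo["Price"],
    aggfunc="median",
)
print(median_prices)
```

Routes with no flights have nothing to aggregate, so they appear as NaN rather than 0.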

Comparing sample to population

Source    Destination  Median Price (INR)  Median Price (dataset)
Banglore  Delhi                   4232.21                  4823.0
Banglore  New Delhi              12114.56                 10976.5
Chennai   Kolkata                 3859.76                  3850.0
Delhi     Cochin                  9987.63                 10262.0
Kolkata   Banglore                9654.21                  9345.0
Mumbai    Hyderabad               3431.97                  3342.0
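
A rough way to see which routes deviate most is to compute the percentage difference between the two columns. In the sketch below, the population medians are copied from the table above and the dataset medians from the pd.crosstab() output; the column names and the pct_diff measure are illustrative choices, not part of the course code.

```python
import pandas as pd

comparison = pd.DataFrame(
    {
        "Source": ["Banglore", "Banglore", "Chennai", "Delhi", "Kolkata", "Mumbai"],
        "Destination": ["Delhi", "New Delhi", "Kolkata", "Cochin", "Banglore", "Hyderabad"],
        # Population medians from the table above
        "population_median": [4232.21, 12114.56, 3859.76, 9987.63, 9654.21, 3431.97],
        # Dataset medians from the pd.crosstab() output
        "dataset_median": [4823.0, 10976.5, 3850.0, 10262.0, 9345.0, 3342.0],
    }
)

# Percentage by which the dataset median differs from the population median
comparison["pct_diff"] = (
    (comparison["dataset_median"] - comparison["population_median"])
    / comparison["population_median"]
    * 100
)
print(comparison.round(2))
```

On these figures the two Banglore routes differ most from the population, while Chennai to Kolkata is nearly identical.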

Let's practice!
