Understanding your qualitative variables

Categorical Data in the Tidyverse

Emily Robinson

Data Scientist

Introduction to the dataset

  • Dataset: Kaggle 2017 Data Science survey
# A tibble: 16,716 x 228
   GenderSelect        Country    Age EmploymentStatus     
   <chr>               <chr>    <int> <chr>                
 1 Non-binary, gender... NA          NA Employed full-time   
 2 Female              United ...    30 Not employed, but lo...
 3 Male                Canada      28 Not employed, but lo...
 4 Male                United ...    56 Independent contract...
 5 Male                Taiwan      38 Employed full-time   
 6 Male                Brazil      46 Employed full-time   
 7 Male                United ...    35 Employed full-time   
 8 Female              India       22 Employed full-time   
 9 Female              Austral...    43 Employed full-time   
10 Male                Russia      33 Employed full-time   
# ... with 16,706 more rows, and 224 more variables:
#   StudentStatus <chr>, LearningDataScience <chr>,
#   CodeWriter <chr>, CareerSwitcher <chr>, ...

Converting characters to factors

is.character(multipleChoiceResponses$LearningDataScienceTime)
[1] TRUE
multipleChoiceResponses %>%
    mutate(across(where(is.character), as.factor))
# A tibble: 16,716 x 228
   GenderSelect        Country    Age EmploymentStatus     
   <fct>               <fct>    <int> <fct>                
 1 Non-binary, gender... NA          NA Employed full-time   
 2 Female              United ...    30 Not employed, but lo...
 3 Male                Canada      28 Not employed, but lo...
 4 Male                United ...    56 Independent contract...
# ... with 16,710 more rows, and 224 more variables:
#   StudentStatus <fct>, LearningDataScience <fct>,
#   CodeWriter <fct>, CareerSwitcher <fct>, ...
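
The conversion above can be tried on a small made-up tibble. This is only a sketch: `toy` and its values are hypothetical stand-ins for `multipleChoiceResponses`, and it assumes dplyr 1.0 or later, where `across()` is available.

```r
library(dplyr)

# Hypothetical toy data standing in for multipleChoiceResponses
toy <- tibble(
  GenderSelect     = c("Female", "Male", "Male"),
  Age              = c(30L, 28L, 56L),
  EmploymentStatus = c("Employed full-time", "Employed full-time",
                       "Independent contractor")
)

# Convert every character column to a factor; other columns are left alone
toy_factors <- toy %>%
  mutate(across(where(is.character), as.factor))

# Inspect the class of each column after the conversion
sapply(toy_factors, class)
```

Only the columns selected by `where(is.character)` are touched, so `Age` stays an integer column.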

Summarizing factors

  • Get the number of categories (nlevels())
nlevels(multipleChoiceResponses$LearningDataScienceTime)
[1] 6
  • Get the list of categories (levels())
levels(multipleChoiceResponses$LearningDataScienceTime)
[1] "< 1 year"    "1-2 years"   "10-15 years" "15+ years"  
[5] "3-5 years"   "5-10 years"
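
The level order in the output above is a sorted order of the labels as strings, not an order by duration, which is why "10-15 years" comes before "3-5 years". A minimal sketch with a hypothetical vector (`time_chr` is made up and reuses some of the labels from the slide):

```r
# Hypothetical responses using labels like those on the slide
time_chr <- c("3-5 years", "< 1 year", "10-15 years", "1-2 years", "3-5 years")

# as.factor() builds the levels from the sorted unique values
time_fct <- as.factor(time_chr)

nlevels(time_fct)  # number of distinct categories
levels(time_fct)   # the categories, sorted as strings

# Positions of two levels: "10-15 years" sorts ahead of "3-5 years"
match(c("10-15 years", "3-5 years"), levels(time_fct))
```

The exact position of labels that start with punctuation, such as "< 1 year", can depend on the locale used for sorting.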

Summarizing factors

  • Get number of levels for every factor variable
multipleChoiceResponses %>%
  summarize(across(where(is.factor), nlevels))
# A tibble: 1 x 215
  GenderSelect Country EmploymentStatus StudentStatus
         <int>   <int>            <int>         <int>
1            4      52                7             2
# ... with 211 more variables: LearningDataScience <int>,
#   CodeWriter <int>, CareerSwitcher <int>,
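
The same pattern can be checked on a small example. This is a sketch under assumptions: `toy` is a hypothetical tibble, not the survey data, and dplyr 1.0 or later is assumed.

```r
library(dplyr)

# Hypothetical toy data with two factor columns and one integer column
toy <- tibble(
  GenderSelect  = factor(c("Female", "Male", "Male")),
  Age           = c(30L, 28L, 56L),
  StudentStatus = factor(c("Yes", "No", "Yes"))
)

# One row, one column per factor variable, each holding its number of levels
level_counts <- toy %>%
  summarize(across(where(is.factor), nlevels))

level_counts
```

Because `where(is.factor)` selects only factor columns, `Age` does not appear in the result.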

everything()

multipleChoiceResponses %>%
  select(everything())
multipleChoiceResponses %>%
  pivot_longer(everything())
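
One place `everything()` is handy is reshaping a one-row summary into one row per variable. The sketch below is illustrative only: `level_counts` is built by hand from the first three counts shown on the earlier slide, and the column names `variable` and `num_levels` are arbitrary choices, not names from the course. `pivot_longer()` stacks the selected columns into a single value column, so it works most smoothly when those columns share a type.

```r
library(dplyr)
library(tidyr)

# Hand-built one-row summary (first three counts from the earlier slide)
level_counts <- tibble(
  GenderSelect     = 4L,
  Country          = 52L,
  EmploymentStatus = 7L
)

# Select all columns and stack them into name/value pairs
long_counts <- level_counts %>%
  pivot_longer(everything(),
               names_to  = "variable",
               values_to = "num_levels")

long_counts
```

The result has one row per original column, which is easier to sort or filter than a single wide row.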

Let's practice!
