Be fruitful and dplyr

Programming with dplyr

Dr. Chester Ismay

Educator, Data Scientist, and R/Python Consultant

Course prerequisites

  • Joining Data with dplyr

  • Introduction to Writing Functions in R

Course outline

Chapter 1

  • Refresh dplyr pipelines
  • Choose columns based on patterns

Chapter 2

  • Move columns around in your data
  • Transform across multiple columns of data

Chapter 3

  • Strengthen dplyr join knowledge
  • Use set theory clauses to improve programming skills with multiple data sources

Chapter 4

  • Create functions to wrap dplyr and ggplot2 code
  • Use the rlang package to decipher tidy evaluation

The world_bank_data tibble

country      region          year infant_mortality_rate fertility_rate perc_rural_pop
Saudi Arabia Western Asia    2013                  13.3           2.64         17.260
Greece       Southern Europe 2014                   3.7           1.54         22.298
Latvia       Northern Europe 2014                   7.2           1.62         32.048
Romania      Eastern Europe  2014                  10.1           1.43         46.100
Netherlands  Western Europe  2015                   3.2           1.78          9.827

world_bank_data columns

names(world_bank_data)
 [1] "iso"                   "country"               "continent"            
 [4] "region"                "year"                  "infant_mortality_rate"
 [7] "fertility_rate"        "perc_electric_access"  "perc_college_complete"
[10] "perc_cvd_crd_70"       "unemployment_rate"     "perc_rural_pop" 

Select some columns from world_bank_data

world_bank_data %>%
    select(country, continent, region, year, perc_rural_pop, perc_college_complete)
# A tibble: 300 x 6
   country      continent region           year perc_rural_pop perc_college_complete
   <chr>        <fct>     <fct>           <dbl>          <dbl>                 <dbl>
 1 Portugal     Europe    Southern Europe  2000          45.6                   7.26
 2 Armenia      Asia      Western Asia     2001          35.6                  20.4 
 3 Bulgaria     Europe    Eastern Europe   2001          30.8                  18.0 
 4 Portugal     Europe    Southern Europe  2001          45.0                   7.57
 5 Qatar        Asia      Western Asia     2004           2.91                 20.9 
 6 Saudi Arabia Asia      Western Asia     2004          19.2                  14.9 
 7 Pakistan     Asia      Southern Asia    2005          66.0                   3.92
# ... with 293 more rows

Filter rows to match continent values

continents_vector <- c("Africa", "Asia")
asia_africa_results <- world_bank_data %>%
    select(country, continent, region, year, perc_rural_pop, perc_college_complete) %>%
    filter(continent %in% continents_vector)
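
The row filter above relies on %in%, which tests each row's continent for membership in continents_vector. A minimal sketch of that behavior, assuming a small hypothetical tibble named toy_data in place of world_bank_data:

```r
library(dplyr)

# Hypothetical toy data standing in for world_bank_data (not the course data)
toy_data <- tibble(
  country   = c("Armenia", "Portugal", "Nigeria", "Latvia"),
  continent = c("Asia", "Europe", "Africa", "Europe")
)

continents_vector <- c("Africa", "Asia")

# %in% yields one TRUE/FALSE per row: is this row's continent in the vector?
in_selection <- toy_data$continent %in% continents_vector

# filter() keeps only the rows where the test is TRUE
toy_result <- toy_data %>%
  filter(continent %in% continents_vector)
```

Using == instead of %in% here would compare the two vectors element by element with recycling, which is a different test from membership and can silently drop matching rows.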

Results of row filter

asia_africa_results
# A tibble: 111 x 6
   country      continent region              year perc_rural_pop perc_college_complete
   <chr>        <fct>     <fct>              <dbl>          <dbl>                 <dbl>
 1 Armenia      Asia      Western Asia        2001          35.6                  20.4 
 2 Qatar        Asia      Western Asia        2004           2.91                 20.9 
 3 Saudi Arabia Asia      Western Asia        2004          19.2                  14.9 
 4 Pakistan     Asia      Southern Asia       2005          66.0                   3.92
 5 Nigeria      Africa    Western Africa      2006          60.1                   9.04
 6 Pakistan     Asia      Southern Asia       2006          65.8                   6.30
 7 Singapore    Asia      South-Eastern Asia  2006           0                    19.6 
 8 Azerbaijan   Asia      Western Asia        2007          47.2                  14.9 
 9 Qatar        Asia      Western Asia        2007           2.08                 25.1 
10 Singapore    Asia      South-Eastern Asia  2007           0                    20.1 
# ... with 101 more rows

Mutate a new column

asia_africa_results <- asia_africa_results %>%
    mutate(perc_urban_pop = 100 - perc_rural_pop)
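
The mutate() call above computes the new column row by row from an existing one. A minimal sketch on a hypothetical three-row tibble (toy_data is an assumption; its values are the rounded figures shown in the printed output, not the full-precision data):

```r
library(dplyr)

# Hypothetical toy data; values are rounded as they appear in the printed tibble
toy_data <- tibble(
  country        = c("Armenia", "Qatar", "Singapore"),
  perc_rural_pop = c(35.6, 2.91, 0)
)

# The expression is vectorized: each row's urban share is 100 minus its rural share
toy_result <- toy_data %>%
  mutate(perc_urban_pop = 100 - perc_rural_pop)

# Sanity check: rural and urban shares should add up to 100 in every row
shares_sum <- toy_result$perc_rural_pop + toy_result$perc_urban_pop
```

The original columns are kept and the new column is appended at the end, which matches the move from 6 to 7 columns in the results.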

Results of mutate

# A tibble: 111 x 7
   country      continent region              year perc_rural_pop perc_college_complete perc_urban_pop
   <chr>        <fct>     <fct>              <dbl>          <dbl>                 <dbl>          <dbl>
 1 Armenia      Asia      Western Asia        2001          35.6                  20.4            64.4
 2 Qatar        Asia      Western Asia        2004           2.91                 20.9            97.1
 3 Saudi Arabia Asia      Western Asia        2004          19.2                  14.9            80.8
 4 Pakistan     Asia      Southern Asia       2005          66.0                   3.92           34.0
 5 Nigeria      Africa    Western Africa      2006          60.1                   9.04           39.9
 6 Pakistan     Asia      Southern Asia       2006          65.8                   6.30           34.2
 7 Singapore    Asia      South-Eastern Asia  2006           0                    19.6           100  
 8 Azerbaijan   Asia      Western Asia        2007          47.2                  14.9            52.8
 9 Qatar        Asia      Western Asia        2007           2.08                 25.1            97.9
10 Singapore    Asia      South-Eastern Asia  2007           0                    20.1           100  
# ... with 101 more rows

Analyze urban percentage across regions

asia_africa_results %>%
    group_by(region) %>%
    summarize(mean_urban = mean(perc_urban_pop))
# A tibble: 9 x 2
  region             mean_urban
  <fct>                   <dbl>
1 Central Asia             49.2
2 Eastern Africa           19.5
3 Eastern Asia             74.2
4 Middle Africa            42.4
5 South-Eastern Asia       79.8
6 Southern Africa          64.8
7 Southern Asia            40.0
8 Western Africa           39.6
9 Western Asia             78.9
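
The grouped summary above collapses each region to a single row. A minimal sketch of the same pattern, assuming a hypothetical tibble toy_data built from a few rounded rows of the printed output:

```r
library(dplyr)

# Hypothetical toy data; a handful of rounded rows, not the full 111-row tibble
toy_data <- tibble(
  country        = c("Pakistan", "Pakistan", "Singapore", "Singapore",
                     "Armenia", "Qatar"),
  region         = c("Southern Asia", "Southern Asia",
                     "South-Eastern Asia", "South-Eastern Asia",
                     "Western Asia", "Western Asia"),
  perc_urban_pop = c(34.0, 34.2, 100, 100, 64.4, 97.1)
)

# group_by() marks the groups; summarize() then returns one row per region
toy_summary <- toy_data %>%
  group_by(region) %>%
  summarize(mean_urban = mean(perc_urban_pop))

# Look up a single region's mean by name rather than by row position
region_means <- setNames(toy_summary$mean_urban, toy_summary$region)
```

If perc_urban_pop contained missing values, mean() would return NA for the affected regions; passing na.rm = TRUE to mean() is one way to handle that, though the slides do not say whether the course data needs it.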

Let's practice!
