Cleaning Data in R
Maggie Matsui
Content Developer @ DataCamp
| First name | Last name | Address | Credit score | |
|---|---|---|---|---|
| 1 | Miriam | Day | 6042 Sollicitudin Avenue | 313 | 
| 2 | Miriam | Day | 6042 Sollicitudin Avenue | 313 | 
| First name | Last name | Address | Credit score | |
|---|---|---|---|---|
| 1 | Tamekah | Forbes | P.O. Box 147, 511 Velit St | 356 | 
| 2 | Tamekah | Forbes | P.O. Box 147, 511 Velit St | 342 | 



| First name | Last name | Address | Credit score | |
|---|---|---|---|---|
| 1 | Harper | Taylor | P.O. Box 212, 6557 Nunc Road | 655 | 
| 2 | Miriam | Day | 6042 Sollicitudin Avenue | 313 | 
| 3 | Eagan | Schmidt | 507-6740 Cursus Avenue | 728 | 
| 4 | Miriam | Day | 6042 Sollicitudin Avenue | 313 | 
| 5 | Katell | Roy | Ap #434-4081 Mi Av. | 455 | 
| 6 | Katell | Roy | Ap #434-4081 Mi Av. | 455 | 
| ... | ... | ... | ... | ... | 
duplicated(credit_scores)
FALSE FALSE FALSE TRUE FALSE ...
sum(duplicated(credit_scores))
2
filter(credit_scores, duplicated(credit_scores))
  first_name last_name                  address credit_score
1     Miriam       Day 6042 Sollicitudin Avenue          313
2     Katell       Roy      Ap #434-4081 Mi Av.          455
credit_scores_unique <- distinct(credit_scores)sum(duplicated(credit_scores_unique))
0
| First name | Last name | Address | Credit score | |
|---|---|---|---|---|
| 1 | Harper | Taylor | P.O. Box 212, 6557 Nunc Road | 655 | 
| 2 | Eagan | Schmidt | 507-6740 Cursus Avenue | 728 | 
| 3 | Tamekah | Forbes | P.O. Box 147, 511 Velit Street | 356 | 
| 4 | Tamekah | Forbes | P.O. Box 147, 511 Velit Street | 342 | 
| 5 | Xandra | Barrett | P.O. Box 309, 2462 Pharetra Rd. | 620 | 
| 6 | Xandra | Barrett | P.O. Box 309, 2462 Pharetra Rd. | 636 | 
| ... | ... | ... | ... | ... | 
credit_scores %>%count(first_name, last_name) %>%filter(n > 1)
  first_name last_name     n
  <fct>      <fct>     <int>
1 Katell     Roy           2
2 Miriam     Day           2
3 Tamekah    Forbes        2
4 Xandra     Barrett       2
dup_ids <- credit_scores %>% count(first_name, last_name) %>% filter(n > 1)credit_scores %>% filter(first_name %in% dup_ids$first_name, last_name %in% dup_ids$last_name)
  first_name last_name                           address credit_score
1     Xandra   Barrett  P.O. Box 309, 2462 Pharetra, Rd.          620
2    Tamekah    Forbes    P.O. Box 147, 511 Velit Street          356
3     Miriam       Day          6042 Sollicitudin Avenue          313
4     Xandra   Barrett  P.O. Box 309, 2462 Pharetra, Rd.          636
5    Tamekah    Forbes    P.O. Box 147, 511 Velit Street          342
...
Drop all duplicates except one
| First name | Last name | Address | Credit score | |
|---|---|---|---|---|
| 1 | Tamekah | Forbes | P.O. Box 147, 511 Velit Street | 356 | 
| 2 | Tamekah | Forbes | P.O. Box 147, 511 Velit Street | 342 | 
| 3 | Xandra | Barrett | P.O. Box 309, 2462 Pharetra Rd. | 620 | 
| 4 | Xandra | Barrett | P.O. Box 309, 2462 Pharetra Rd. | 636 | 
Drop all duplicates except one
| First name | Last name | Address | Credit score | |
|---|---|---|---|---|
| 1 | Tamekah | Forbes | P.O. Box 147, 511 Velit Street | 356 | 
| 2 | ||||
| 3 | Xandra | Barrett | P.O. Box 309, 2462 Pharetra Rd. | 620 | 
| 4 | 
Drop all duplicates except one
| First name | Last name | Address | Credit score | |
|---|---|---|---|---|
| 1 | Tamekah | Forbes | P.O. Box 147, 511 Velit Street | 356 | 
| 3 | Xandra | Barrett | P.O. Box 309, 2462 Pharetra Rd. | 620 | 
credit_scores %>%
  distinct(first_name, last_name, .keep_all = TRUE)
    first_name   last_name                                address credit_score
1       Harlan      Hebert             P.O. Box 356, 3869 Non Av.          305
2        Drake        Soto                     643-1409 Ac Avenue          642
3        Felix     Morales                     741-1497 Velit Ave          780
4       Brynne     Charles                  313-3757 Ultrices St.          513
5       Aquila      Dillon      P.O. Box 945, 5550 Aliquam Street          748
...
Summarize differing values using statistical summary functions (mean(), max(), etc.)
| First name | Last name | Address | Credit score | |
|---|---|---|---|---|
| 1 | Tamekah | Forbes | P.O. Box 147, 511 Velit Street | 356 | 
| 2 | Tamekah | Forbes | P.O. Box 147, 511 Velit Street | 342 | 
| 3 | Xandra | Barrett | P.O. Box 309, 2462 Pharetra Rd. | 620 | 
| 4 | Xandra | Barrett | P.O. Box 309, 2462 Pharetra Rd. | 636 | 
Summarize differing values using statistical summary functions (mean(), max(), etc.)
| First name | Last name | Address | Credit score | Mean credit score | |
|---|---|---|---|---|---|
| 1 | Tamekah | Forbes | P.O. Box 147, 511 Velit Street | 356 | 349 | 
| 2 | Tamekah | Forbes | P.O. Box 147, 511 Velit Street | 342 | |
| 3 | Xandra | Barrett | P.O. Box 309, 2462 Pharetra Rd. | 620 | 628 | 
| 4 | Xandra | Barrett | P.O. Box 309, 2462 Pharetra Rd. | 636 | 
Summarize differing values using statistical summary functions (mean(), max(), etc.)
| First name | Last name | Address | Credit score | Mean credit score | |
|---|---|---|---|---|---|
| 1 | Tamekah | Forbes | P.O. Box 147, 511 Velit Street | 356 | 349 | 
| 2 | |||||
| 3 | Xandra | Barrett | P.O. Box 309, 2462 Pharetra Rd. | 620 | 628 | 
| 4 | 
Summarize differing values using statistical summary functions (mean(), max(), etc.)
| First name | Last name | Address | Credit score | |
|---|---|---|---|---|
| 1 | Tamekah | Forbes | P.O. Box 147, 511 Velit Street | 349 | 
| 2 | ||||
| 3 | Xandra | Barrett | P.O. Box 309, 2462 Pharetra Rd. | 628 | 
| 4 | 
Summarize differing values using statistical summary functions (mean(), max(), etc.)
| First name | Last name | Address | Credit score | |
|---|---|---|---|---|
| 1 | Tamekah | Forbes | P.O. Box 147, 511 Velit Street | 349 | 
| 3 | Xandra | Barrett | P.O. Box 309, 2462 Pharetra Rd. | 628 | 
credit_scores %>%group_by(first_name, last_name) %>%mutate(mean_credit_score = mean(credit_score))
  first_name last_name address                          credit_score mean_score
1 Tamekah    Forbes    P.O. Box 147, 511 Velit Street            356        349
2 Tamekah    Forbes    P.O. Box 147, 511 Velit Street            342        349
3 Xandra     Barrett   P.O. Box 309, 2462 Pharetra, Rd.          636        628
4 Xandra     Barrett   P.O. Box 309, 2462 Pharetra, Rd.          620        628
5 Katell     Roy       Ap #434-4081 Mi Av.                       455        455
...
credit_scores %>% group_by(first_name, last_name) %>% mutate(mean_credit_score = mean(credit_score)) %>%distinct(first_name, last_name, .keep_all = TRUE) %>%select(-credit_score)
  first_name last_name address                          mean_score
  <fct>      <fct>     <fct>                                 <dbl>
1 Tamekah    Forbes    P.O. Box 147, 511 Velit Street          349
2 Xandra     Barrett   P.O. Box 309, 2462 Pharetra, Rd.        628
3 Katell     Roy       Ap #434-4081 Mi Av.                     455
4 Miriam     Day       6042 Sollicitudin Avenue                313
...
Cleaning Data in R