Cleaning Data in R
Maggie Matsui
Content Developer @ DataCamp
First name | Last name | Address | Credit score | |
---|---|---|---|---|
1 | Miriam | Day | 6042 Sollicitudin Avenue | 313 |
2 | Miriam | Day | 6042 Sollicitudin Avenue | 313 |
First name | Last name | Address | Credit score | |
---|---|---|---|---|
1 | Tamekah | Forbes | P.O. Box 147, 511 Velit St | 356 |
2 | Tamekah | Forbes | P.O. Box 147, 511 Velit St | 342 |
First name | Last name | Address | Credit score | |
---|---|---|---|---|
1 | Harper | Taylor | P.O. Box 212, 6557 Nunc Road | 655 |
2 | Miriam | Day | 6042 Sollicitudin Avenue | 313 |
3 | Eagan | Schmidt | 507-6740 Cursus Avenue | 728 |
4 | Miriam | Day | 6042 Sollicitudin Avenue | 313 |
5 | Katell | Roy | Ap #434-4081 Mi Av. | 455 |
6 | Katell | Roy | Ap #434-4081 Mi Av. | 455 |
... | ... | ... | ... | ... |
duplicated(credit_scores)
FALSE FALSE FALSE TRUE FALSE ...
sum(duplicated(credit_scores))
2
filter(credit_scores, duplicated(credit_scores))
first_name last_name address credit_score
1 Miriam Day 6042 Sollicitudin Avenue 313
2 Katell Roy Ap #434-4081 Mi Av. 455
credit_scores_unique <- distinct(credit_scores)
sum(duplicated(credit_scores_unique))
0
First name | Last name | Address | Credit score | |
---|---|---|---|---|
1 | Harper | Taylor | P.O. Box 212, 6557 Nunc Road | 655 |
2 | Eagan | Schmidt | 507-6740 Cursus Avenue | 728 |
3 | Tamekah | Forbes | P.O. Box 147, 511 Velit Street | 356 |
4 | Tamekah | Forbes | P.O. Box 147, 511 Velit Street | 342 |
5 | Xandra | Barrett | P.O. Box 309, 2462 Pharetra Rd. | 620 |
6 | Xandra | Barrett | P.O. Box 309, 2462 Pharetra Rd. | 636 |
... | ... | ... | ... | ... |
credit_scores %>%
count(first_name, last_name) %>%
filter(n > 1)
first_name last_name n
<fct> <fct> <int>
1 Katell Roy 2
2 Miriam Day 2
3 Tamekah Forbes 2
4 Xandra Barrett 2
dup_ids <- credit_scores %>% count(first_name, last_name) %>% filter(n > 1)
credit_scores %>% filter(first_name %in% dup_ids$first_name, last_name %in% dup_ids$last_name)
first_name last_name address credit_score
1 Xandra Barrett P.O. Box 309, 2462 Pharetra, Rd. 620
2 Tamekah Forbes P.O. Box 147, 511 Velit Street 356
3 Miriam Day 6042 Sollicitudin Avenue 313
4 Xandra Barrett P.O. Box 309, 2462 Pharetra, Rd. 636
5 Tamekah Forbes P.O. Box 147, 511 Velit Street 342
...
Drop all duplicates except one
First name | Last name | Address | Credit score | |
---|---|---|---|---|
1 | Tamekah | Forbes | P.O. Box 147, 511 Velit Street | 356 |
2 | Tamekah | Forbes | P.O. Box 147, 511 Velit Street | 342 |
3 | Xandra | Barrett | P.O. Box 309, 2462 Pharetra Rd. | 620 |
4 | Xandra | Barrett | P.O. Box 309, 2462 Pharetra Rd. | 636 |
Drop all duplicates except one
First name | Last name | Address | Credit score | |
---|---|---|---|---|
1 | Tamekah | Forbes | P.O. Box 147, 511 Velit Street | 356 |
2 | ||||
3 | Xandra | Barrett | P.O. Box 309, 2462 Pharetra Rd. | 620 |
4 |
Drop all duplicates except one
First name | Last name | Address | Credit score | |
---|---|---|---|---|
1 | Tamekah | Forbes | P.O. Box 147, 511 Velit Street | 356 |
3 | Xandra | Barrett | P.O. Box 309, 2462 Pharetra Rd. | 620 |
credit_scores %>%
distinct(first_name, last_name, .keep_all = TRUE)
first_name last_name address credit_score
1 Harlan Hebert P.O. Box 356, 3869 Non Av. 305
2 Drake Soto 643-1409 Ac Avenue 642
3 Felix Morales 741-1497 Velit Ave 780
4 Brynne Charles 313-3757 Ultrices St. 513
5 Aquila Dillon P.O. Box 945, 5550 Aliquam Street 748
...
Summarize differing values using statistical summary functions (mean()
, max()
, etc.)
First name | Last name | Address | Credit score | |
---|---|---|---|---|
1 | Tamekah | Forbes | P.O. Box 147, 511 Velit Street | 356 |
2 | Tamekah | Forbes | P.O. Box 147, 511 Velit Street | 342 |
3 | Xandra | Barrett | P.O. Box 309, 2462 Pharetra Rd. | 620 |
4 | Xandra | Barrett | P.O. Box 309, 2462 Pharetra Rd. | 636 |
Summarize differing values using statistical summary functions (mean()
, max()
, etc.)
First name | Last name | Address | Credit score | Mean credit score | |
---|---|---|---|---|---|
1 | Tamekah | Forbes | P.O. Box 147, 511 Velit Street | 356 | 349 |
2 | Tamekah | Forbes | P.O. Box 147, 511 Velit Street | 342 | |
3 | Xandra | Barrett | P.O. Box 309, 2462 Pharetra Rd. | 620 | 628 |
4 | Xandra | Barrett | P.O. Box 309, 2462 Pharetra Rd. | 636 |
Summarize differing values using statistical summary functions (mean()
, max()
, etc.)
First name | Last name | Address | Credit score | Mean credit score | |
---|---|---|---|---|---|
1 | Tamekah | Forbes | P.O. Box 147, 511 Velit Street | 356 | 349 |
2 | |||||
3 | Xandra | Barrett | P.O. Box 309, 2462 Pharetra Rd. | 620 | 628 |
4 |
Summarize differing values using statistical summary functions (mean()
, max()
, etc.)
First name | Last name | Address | Credit score | |
---|---|---|---|---|
1 | Tamekah | Forbes | P.O. Box 147, 511 Velit Street | 349 |
2 | ||||
3 | Xandra | Barrett | P.O. Box 309, 2462 Pharetra Rd. | 628 |
4 |
Summarize differing values using statistical summary functions (mean()
, max()
, etc.)
First name | Last name | Address | Credit score | |
---|---|---|---|---|
1 | Tamekah | Forbes | P.O. Box 147, 511 Velit Street | 349 |
3 | Xandra | Barrett | P.O. Box 309, 2462 Pharetra Rd. | 628 |
credit_scores %>%
group_by(first_name, last_name) %>%
mutate(mean_credit_score = mean(credit_score))
first_name last_name address credit_score mean_score
1 Tamekah Forbes P.O. Box 147, 511 Velit Street 356 349
2 Tamekah Forbes P.O. Box 147, 511 Velit Street 342 349
3 Xandra Barrett P.O. Box 309, 2462 Pharetra, Rd. 636 628
4 Xandra Barrett P.O. Box 309, 2462 Pharetra, Rd. 620 628
5 Katell Roy Ap #434-4081 Mi Av. 455 455
...
credit_scores %>% group_by(first_name, last_name) %>% mutate(mean_credit_score = mean(credit_score)) %>%
distinct(first_name, last_name, .keep_all = TRUE) %>%
select(-credit_score)
first_name last_name address mean_score
<fct> <fct> <fct> <dbl>
1 Tamekah Forbes P.O. Box 147, 511 Velit Street 349
2 Xandra Barrett P.O. Box 309, 2462 Pharetra, Rd. 628
3 Katell Roy Ap #434-4081 Mi Av. 455
4 Miriam Day 6042 Sollicitudin Avenue 313
...
Cleaning Data in R