Uniqueness constraints

Cleaning Data in R

Maggie Matsui

Content Developer @ DataCamp

What's a duplicate?

First name Last name Address Credit score
1 Miriam Day 6042 Sollicitudin Avenue 313
2 Miriam Day 6042 Sollicitudin Avenue 313

 

First name Last name Address Credit score
1 Tamekah Forbes P.O. Box 147, 511 Velit St 356
2 Tamekah Forbes P.O. Box 147, 511 Velit St 342
Cleaning Data in R

Why do duplicates occur?

Computer screen with data table on screen. Data entry and human error

Cleaning Data in R

Why do duplicates occur?

Computer with bug on it. Bugs and design errors.

Cleaning Data in R

Why do duplicates occur?

Circle of 5 databases, each with an arrow pointing inward at one larger database. Join or merge errors.

Cleaning Data in R

Full duplicates

First name Last name Address Credit score
1 Harper Taylor P.O. Box 212, 6557 Nunc Road 655
2 Miriam Day 6042 Sollicitudin Avenue 313
3 Eagan Schmidt 507-6740 Cursus Avenue 728
4 Miriam Day 6042 Sollicitudin Avenue 313
5 Katell Roy Ap #434-4081 Mi Av. 455
6 Katell Roy Ap #434-4081 Mi Av. 455
... ... ... ... ...
Cleaning Data in R

Finding full duplicates

duplicated(credit_scores)
FALSE FALSE FALSE TRUE FALSE ...
sum(duplicated(credit_scores))
2
Cleaning Data in R

Finding full duplicates

filter(credit_scores, duplicated(credit_scores))
  first_name last_name                  address credit_score
1     Miriam       Day 6042 Sollicitudin Avenue          313
2     Katell       Roy      Ap #434-4081 Mi Av.          455
Cleaning Data in R

Dropping full duplicates

credit_scores_unique <- distinct(credit_scores)

sum(duplicated(credit_scores_unique))
0
Cleaning Data in R

Partial duplicates

First name Last name Address Credit score
1 Harper Taylor P.O. Box 212, 6557 Nunc Road 655
2 Eagan Schmidt 507-6740 Cursus Avenue 728
3 Tamekah Forbes P.O. Box 147, 511 Velit Street 356
4 Tamekah Forbes P.O. Box 147, 511 Velit Street 342
5 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 620
6 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 636
... ... ... ... ...
Cleaning Data in R

Finding partial duplicates

credit_scores %>%

count(first_name, last_name) %>%
filter(n > 1)
  first_name last_name     n
  <fct>      <fct>     <int>
1 Katell     Roy           2
2 Miriam     Day           2
3 Tamekah    Forbes        2
4 Xandra     Barrett       2
Cleaning Data in R

Finding partial duplicates

dup_ids <- credit_scores %>%
  count(first_name, last_name) %>%
  filter(n > 1)

credit_scores %>% filter(first_name %in% dup_ids$first_name, last_name %in% dup_ids$last_name)
  first_name last_name                           address credit_score
1     Xandra   Barrett  P.O. Box 309, 2462 Pharetra, Rd.          620
2    Tamekah    Forbes    P.O. Box 147, 511 Velit Street          356
3     Miriam       Day          6042 Sollicitudin Avenue          313
4     Xandra   Barrett  P.O. Box 309, 2462 Pharetra, Rd.          636
5    Tamekah    Forbes    P.O. Box 147, 511 Velit Street          342
...
Cleaning Data in R

Handling partial duplicates: dropping

Drop all duplicates except one

First name Last name Address Credit score
1 Tamekah Forbes P.O. Box 147, 511 Velit Street 356
2 Tamekah Forbes P.O. Box 147, 511 Velit Street 342
3 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 620
4 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 636
Cleaning Data in R

Handling partial duplicates: dropping

Drop all duplicates except one

First name Last name Address Credit score
1 Tamekah Forbes P.O. Box 147, 511 Velit Street 356
2
3 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 620
4
Cleaning Data in R

Handling partial duplicates: dropping

Drop all duplicates except one

First name Last name Address Credit score
1 Tamekah Forbes P.O. Box 147, 511 Velit Street 356
3 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 620
Cleaning Data in R

Dropping partial duplicates

credit_scores %>%
  distinct(first_name, last_name, .keep_all = TRUE)
    first_name   last_name                                address credit_score
1       Harlan      Hebert             P.O. Box 356, 3869 Non Av.          305
2        Drake        Soto                     643-1409 Ac Avenue          642
3        Felix     Morales                     741-1497 Velit Ave          780
4       Brynne     Charles                  313-3757 Ultrices St.          513
5       Aquila      Dillon      P.O. Box 945, 5550 Aliquam Street          748
...
Cleaning Data in R

Handling partial duplicates: summarizing

Summarize differing values using statistical summary functions (mean(), max(), etc.)

First name Last name Address Credit score
1 Tamekah Forbes P.O. Box 147, 511 Velit Street 356
2 Tamekah Forbes P.O. Box 147, 511 Velit Street 342
3 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 620
4 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 636
Cleaning Data in R

Handling partial duplicates: summarizing

Summarize differing values using statistical summary functions (mean(), max(), etc.)

First name Last name Address Credit score Mean credit score
1 Tamekah Forbes P.O. Box 147, 511 Velit Street 356 349
2 Tamekah Forbes P.O. Box 147, 511 Velit Street 342
3 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 620 628
4 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 636
Cleaning Data in R

Handling partial duplicates: summarizing

Summarize differing values using statistical summary functions (mean(), max(), etc.)

First name Last name Address Credit score Mean credit score
1 Tamekah Forbes P.O. Box 147, 511 Velit Street 356 349
2
3 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 620 628
4
Cleaning Data in R

Handling partial duplicates: summarizing

Summarize differing values using statistical summary functions (mean(), max(), etc.)

First name Last name Address Credit score
1 Tamekah Forbes P.O. Box 147, 511 Velit Street 349
2
3 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 628
4
Cleaning Data in R

Handling partial duplicates: summarizing

Summarize differing values using statistical summary functions (mean(), max(), etc.)

First name Last name Address Credit score
1 Tamekah Forbes P.O. Box 147, 511 Velit Street 349
3 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 628
Cleaning Data in R

Summarizing partial duplicates

credit_scores %>%

group_by(first_name, last_name) %>%
mutate(mean_credit_score = mean(credit_score))
  first_name last_name address                          credit_score mean_score
1 Tamekah    Forbes    P.O. Box 147, 511 Velit Street            356        349
2 Tamekah    Forbes    P.O. Box 147, 511 Velit Street            342        349
3 Xandra     Barrett   P.O. Box 309, 2462 Pharetra, Rd.          636        628
4 Xandra     Barrett   P.O. Box 309, 2462 Pharetra, Rd.          620        628
5 Katell     Roy       Ap #434-4081 Mi Av.                       455        455
...
Cleaning Data in R

Summarizing partial duplicates

credit_scores %>%
  group_by(first_name, last_name) %>%
  mutate(mean_credit_score = mean(credit_score)) %>%

distinct(first_name, last_name, .keep_all = TRUE) %>%
select(-credit_score)
  first_name last_name address                          mean_score
  <fct>      <fct>     <fct>                                 <dbl>
1 Tamekah    Forbes    P.O. Box 147, 511 Velit Street          349
2 Xandra     Barrett   P.O. Box 309, 2462 Pharetra, Rd.        628
3 Katell     Roy       Ap #434-4081 Mi Av.                     455
4 Miriam     Day       6042 Sollicitudin Avenue                313
...
Cleaning Data in R

Let's practice!

Cleaning Data in R

Preparing Video For Download...