Uniqueness constraints

Cleaning Data in Python

Adel Nehme

Content Developer @ DataCamp

What are duplicate values?

All columns have the same values

first_name last_name address height weight
Justin Saddlemyer Boulevard du Jardin Botanique 3, Bruxelles 193 cm 87 kg
Justin Saddlemyer Boulevard du Jardin Botanique 3, Bruxelles 193 cm 87 kg

What are duplicate values?

Most columns have the same values

first_name last_name address height weight
Justin Saddlemyer Boulevard du Jardin Botanique 3, Bruxelles 193 cm 87 kg
Justin Saddlemyer Boulevard du Jardin Botanique 3, Bruxelles 194 cm 87 kg
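The two cases above can be told apart in code. The following is a minimal sketch, assuming a small hand-built frame named people (a hypothetical name) modeled on the rows shown on the slides, with heights and weights stored as plain integers:

```python
import pandas as pd

# Hypothetical three-row frame modeled on the slide's example rows
people = pd.DataFrame({
    'first_name': ['Justin', 'Justin', 'Justin'],
    'last_name': ['Saddlemyer', 'Saddlemyer', 'Saddlemyer'],
    'address': ['Boulevard du Jardin Botanique 3, Bruxelles'] * 3,
    'height': [193, 193, 194],
    'weight': [87, 87, 87],
})

# Complete duplicates: every column matches an earlier row
complete = people.duplicated()

# Partial duplicates: the identifying columns match an earlier row,
# but at least one other column differs
id_columns = ['first_name', 'last_name', 'address']
partial = people.duplicated(subset=id_columns) & ~complete

print(complete.tolist())  # [False, True, False]
print(partial.tolist())   # [False, False, True]
```

Which columns count as identifying is a judgment call about the data; the choice of first_name, last_name and address here simply mirrors the columns that agree in the example.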

Why do they happen?

[Figures: duplicate_1, duplicate_2, duplicate_3]


How to find duplicate values?

# Print the first five rows
height_weight.head()
  first_name last_name                       address  height  weight
0       Lane     Reese              534-1559 Nam St.     181      64
1       Ivor    Pierce             102-3364 Non Road     168      66
2      Roary    Gibson   P.O. Box 344, 7785 Nisi Ave     191      99
3    Shannon    Little  691-2550 Consectetuer Street     185      65
4      Abdul       Fry                4565 Risus St.     169      65

How to find duplicate values?

# Get duplicates across all columns
duplicates = height_weight.duplicated()
print(duplicates)
1       False
...     ....
22      True
23      False
...     ...
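Because .duplicated() returns a boolean Series, it can be summarized before any rows are inspected. A minimal sketch, assuming a tiny stand-in frame named toy (hypothetical, not the course's height_weight data):

```python
import pandas as pd

# Hypothetical stand-in for height_weight
toy = pd.DataFrame({
    'first_name': ['Mary', 'Ivor', 'Mary'],
    'height': [179, 168, 179],
})

# One True/False flag per row; the first occurrence is not flagged by default
flags = toy.duplicated()

print(flags.tolist())  # [False, False, True]
print(flags.sum())     # 1  (True counts as 1, so this is the number of flagged rows)
print(flags.any())     # True if at least one row is flagged
```

Checking the count first is a cheap way to decide whether a closer look at the flagged rows is needed at all.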

How to find duplicate values?

# Get duplicate rows
duplicates = height_weight.duplicated()
height_weight[duplicates]
    first_name last_name                               address  height  weight
100       Mary     Colon                           4674 Ut Rd.     179      75
101       Ivor    Pierce                     102-3364 Non Road     168      88
102       Cole    Palmer                       8366 At, Street     178      91
103    Desirae   Shannon  P.O. Box 643, 5251 Consectetuer, Rd.     196      83

How to find duplicate rows?

The .duplicated() method

subset: List of column names to check for duplication.

keep: Which occurrence to leave unflagged: the first ('first'), the last ('last'), or none (False), which flags every row that has a duplicate.

# Column names to check for duplication
column_names = ['first_name','last_name','address']
duplicates = height_weight.duplicated(subset = column_names, keep = False)
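The effect of the keep argument is easiest to see side by side. A minimal sketch, assuming a three-row stand-in frame named toy (hypothetical) in which the first and last rows share a name:

```python
import pandas as pd

# Hypothetical stand-in: rows 0 and 2 share first_name and last_name
toy = pd.DataFrame({
    'first_name': ['Ivor', 'Cole', 'Ivor'],
    'last_name': ['Pierce', 'Palmer', 'Pierce'],
    'weight': [66, 91, 88],
})
names = ['first_name', 'last_name']

# keep='first' (default): the first occurrence is left unflagged
first = toy.duplicated(subset=names, keep='first').tolist()

# keep='last': the last occurrence is left unflagged
last = toy.duplicated(subset=names, keep='last').tolist()

# keep=False: every row that has a duplicate is flagged
every = toy.duplicated(subset=names, keep=False).tolist()

print(first)  # [False, False, True]
print(last)   # [True, False, False]
print(every)  # [True, False, True]
```

Passing keep=False is the convenient choice when the goal is to inspect all copies together, as on the slides that follow.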

How to find duplicate rows?

# Output duplicate values
height_weight[duplicates]
    first_name last_name                               address  height  weight
1         Ivor    Pierce                     102-3364 Non Road     168      66
22        Cole    Palmer                       8366 At, Street     178      91
28     Desirae   Shannon  P.O. Box 643, 5251 Consectetuer, Rd.     195      83
37        Mary     Colon                           4674 Ut Rd.     179      75
100       Mary     Colon                           4674 Ut Rd.     179      75
101       Ivor    Pierce                     102-3364 Non Road     168      88
102       Cole    Palmer                       8366 At, Street     178      91
103    Desirae   Shannon  P.O. Box 643, 5251 Consectetuer, Rd.     196      83

How to find duplicate rows?

# Output duplicate values
height_weight[duplicates].sort_values(by = 'first_name')
    first_name last_name                               address  height  weight
22        Cole    Palmer                       8366 At, Street     178      91
102       Cole    Palmer                       8366 At, Street     178      91
28     Desirae   Shannon  P.O. Box 643, 5251 Consectetuer, Rd.     195      83
103    Desirae   Shannon  P.O. Box 643, 5251 Consectetuer, Rd.     196      83
1         Ivor    Pierce                     102-3364 Non Road     168      66
101       Ivor    Pierce                     102-3364 Non Road     168      88
37        Mary     Colon                           4674 Ut Rd.     179      75
100       Mary     Colon                           4674 Ut Rd.     179      75
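Sorting by first_name alone places copies next to each other only when no two different people share a first name. A hedged variation, assuming a stand-in frame named toy (hypothetical, including an invented 'Mary Smith' row), sorts by all identifying columns instead:

```python
import pandas as pd

# Hypothetical stand-in; 'Mary Smith' is an invented non-duplicate row
toy = pd.DataFrame({
    'first_name': ['Mary', 'Cole', 'Mary', 'Cole', 'Mary'],
    'last_name': ['Colon', 'Palmer', 'Smith', 'Palmer', 'Colon'],
}, index=[37, 22, 50, 102, 100])
names = ['first_name', 'last_name']

# Flag every row that shares its name columns with another row
dups = toy.duplicated(subset=names, keep=False)

# Sort the flagged rows by all identifying columns so copies sit together
ordered = toy[dups].sort_values(by=names)

print(ordered['first_name'].tolist())  # ['Cole', 'Cole', 'Mary', 'Mary']
print(ordered['last_name'].tolist())   # ['Palmer', 'Palmer', 'Colon', 'Colon']
```

The unique 'Mary Smith' row is not flagged, so it never appears in the sorted view.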
[Figure: 1_3_full_duplicates.png]

[Figure: 1_3_partial_duplicates.png]


How to treat duplicate values?

# Output duplicate values
height_weight[duplicates].sort_values(by = 'first_name')

1_3_full_duplicates.png


How to treat duplicate values?

The .drop_duplicates() method

subset: List of column names to check for duplication.

keep: Which occurrence to keep: the first ('first'), the last ('last'), or none (False), which drops every row that has a duplicate.

inplace: If True, drop duplicated rows directly in the DataFrame instead of returning a new object.

# Drop duplicates
height_weight.drop_duplicates(inplace = True)
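As an alternative to inplace = True, the result can be assigned to a new name, which leaves the original frame available for comparison. A minimal sketch, assuming a stand-in frame named toy (hypothetical) that reuses a few index labels from the slides:

```python
import pandas as pd

# Hypothetical stand-in: rows 37 and 100 are complete duplicates
toy = pd.DataFrame({
    'first_name': ['Mary', 'Cole', 'Mary'],
    'height': [179, 178, 179],
}, index=[37, 22, 100])

# Without inplace, drop_duplicates returns a new DataFrame
deduped = toy.drop_duplicates()

print(len(toy))                # 3  (the original is unchanged)
print(deduped.index.tolist())  # [37, 22]  (the first occurrence is kept)

# Optionally renumber the rows after dropping
deduped = deduped.reset_index(drop=True)
print(deduped.index.tolist())  # [0, 1]
```

Only complete duplicates are removed here; rows that agree on some columns but not others need the treatment described next.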

How to treat duplicate values?

# Output duplicate values
column_names = ['first_name','last_name','address']
duplicates = height_weight.duplicated(subset = column_names, keep = False)
height_weight[duplicates].sort_values(by = 'first_name')
    first_name last_name                               address  height  weight
28     Desirae   Shannon  P.O. Box 643, 5251 Consectetuer, Rd.     195      83
103    Desirae   Shannon  P.O. Box 643, 5251 Consectetuer, Rd.     196      83
1         Ivor    Pierce                     102-3364 Non Road     168      66
101       Ivor    Pierce                     102-3364 Non Road     168      88

How to treat duplicate values?

# Output duplicate values
column_names = ['first_name','last_name','address']
duplicates = height_weight.duplicated(subset = column_names, keep = False)
height_weight[duplicates].sort_values(by = 'first_name')

1_3_duplicate_aggregation.png


How to treat duplicate values?

The .groupby() and .agg() methods

# Group by column names and produce statistical summaries
column_names = ['first_name','last_name','address']
summaries = {'height': 'max', 'weight': 'mean'}
height_weight = height_weight.groupby(by = column_names).agg(summaries).reset_index()

# Make sure aggregation is done
duplicates = height_weight.duplicated(subset = column_names, keep = False)
height_weight[duplicates].sort_values(by = 'first_name')
first_name    last_name    address    height    weight
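The aggregation step can be tried end to end on a small frame. A minimal sketch, assuming a stand-in named toy (hypothetical) built from the four partially duplicated rows shown earlier, with the result stored under the assumed name merged:

```python
import pandas as pd

# Hypothetical stand-in built from the partially duplicated rows on the slides
toy = pd.DataFrame({
    'first_name': ['Ivor', 'Ivor', 'Desirae', 'Desirae'],
    'last_name': ['Pierce', 'Pierce', 'Shannon', 'Shannon'],
    'address': ['102-3364 Non Road', '102-3364 Non Road',
                'P.O. Box 643, 5251 Consectetuer, Rd.',
                'P.O. Box 643, 5251 Consectetuer, Rd.'],
    'height': [168, 168, 195, 196],
    'weight': [66, 88, 83, 83],
})

column_names = ['first_name', 'last_name', 'address']
summaries = {'height': 'max', 'weight': 'mean'}

# Collapse each group of copies into one row: max height, mean weight
merged = toy.groupby(by=column_names).agg(summaries).reset_index()

print(merged['first_name'].tolist())  # ['Desirae', 'Ivor']  (groups are sorted by key)
print(merged['height'].tolist())      # [196, 168]
print(merged['weight'].tolist())      # [83.0, 77.0]

# No identifying combination appears more than once afterwards
print(merged.duplicated(subset=column_names, keep=False).any())  # False
```

Taking the maximum height and the mean weight follows the slide's summaries dictionary; which statistic suits each column depends on the data and is a decision for the analyst.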


Let's practice!
