Comparing strings

Cleaning Data in Python

Adel Nehme

Content Developer @ DataCamp

In this chapter

Chapter 4 - Record linkage

Minimum edit distance

The least possible number of steps (edit operations) needed to transform one string into another

[Worked example, figure not recovered: running count "Minimum edit distance so far: 2", final result "Minimum edit distance: 5"]
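
To make the definition of minimum edit distance concrete, here is a minimal sketch of the Levenshtein variant (insertion, deletion, substitution, each costing one step) using the usual dynamic-programming table. The function name levenshtein is chosen for illustration and does not come from any package used in this course.

```python
def levenshtein(source: str, target: str) -> int:
    """Return the minimum number of insertions, deletions and substitutions
    needed to turn source into target."""
    # previous[j] is the distance between source[:i-1] and target[:j];
    # before the loop that is the distance from "" to target[:j], i.e. j.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        # Turning source[:i] into the empty string takes i deletions.
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution_cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("Reeding", "Reading"))  # 1
print(levenshtein("kitten", "sitting"))   # 3
```

Only two rows of the table are kept at a time, which is enough when just the distance (not the sequence of edits) is needed.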

Minimum edit distance algorithms

Algorithm             Operations
Damerau-Levenshtein   insertion, substitution, deletion, transposition
Levenshtein           insertion, substitution, deletion
Hamming               substitution only
Jaro distance         transposition only
...                   ...

Possible packages: nltk, thefuzz, textdistance, ...

Package used here: thefuzz
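
As a contrast with the other algorithms in the table, the Hamming distance allows substitutions only, so it is defined just for strings of equal length. A minimal sketch, with the function name hamming chosen for illustration:

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        # With no insertions or deletions, unequal lengths cannot be compared.
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


print(hamming("Reeding", "Reading"))  # 1
```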

Simple string comparison

# Lets us compare between two strings
from thefuzz import fuzz

# Compare reeding vs reading
fuzz.WRatio('Reeding', 'Reading')
86

Partial strings and different orderings

# Partial string comparison
fuzz.WRatio('Houston Rockets', 'Rockets')
90
# Partial string comparison with different order
fuzz.WRatio('Houston Rockets vs Los Angeles Lakers', 'Lakers vs Rockets')
86
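
One way to see why word order need not hurt the score is to sort the words before comparing. The sketch below illustrates that idea with the standard library's difflib. It is a stand-in for intuition only, not thefuzz's WRatio implementation, and the helper names similarity and token_sort_similarity are made up for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Case-insensitive similarity on a 0-100 scale (difflib ratio)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def token_sort_similarity(a: str, b: str) -> int:
    """Compare two strings after sorting their words alphabetically."""
    def sort_tokens(text: str) -> str:
        return " ".join(sorted(text.lower().split()))
    return similarity(sort_tokens(a), sort_tokens(b))


# Same words, different order: the plain score is below 100,
# the token-sorted score is exactly 100.
print(similarity("Lakers vs Rockets", "Rockets vs Lakers"))
print(token_sort_similarity("Lakers vs Rockets", "Rockets vs Lakers"))  # 100
```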

Comparison with arrays

# Import process and pandas (pd is needed for the Series below)
from thefuzz import process
import pandas as pd

# Define string and array of possible matches
string = "Houston Rockets vs Los Angeles Lakers"
choices = pd.Series(['Rockets vs Lakers', 'Lakers vs Rockets', 
                     'Houson vs Los Angeles', 'Heat vs Bulls'])

process.extract(string, choices, limit = 2)
[('Rockets vs Lakers', 86, 0), ('Lakers vs Rockets', 86, 1)]
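
Conceptually, ranking an array of choices amounts to scoring every choice against the query and keeping the best few. The sketch below mimics the (match, score, index) tuple shape shown above using only the standard library. The helper extract_top is hypothetical and scores with difflib rather than thefuzz's scorer, so its numbers will generally differ from those of process.extract.

```python
from difflib import SequenceMatcher


def extract_top(query: str, choices: list[str], limit: int = 2):
    """Return the `limit` best (choice, score, index) tuples, best first."""
    scored = [
        (
            choice,
            round(100 * SequenceMatcher(None, query.lower(), choice.lower()).ratio()),
            index,
        )
        for index, choice in enumerate(choices)
    ]
    # sorted() is stable, so ties keep their original order.
    return sorted(scored, key=lambda item: item[1], reverse=True)[:limit]


choices = ['Rockets vs Lakers', 'Lakers vs Rockets',
           'Houson vs Los Angeles', 'Heat vs Bulls']
print(extract_top("Houston Rockets vs Los Angeles Lakers", choices, limit=2))
```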

Collapsing categories with string similarity

Chapter 2

Use .replace() to collapse "eur" into "Europe"

What if there are too many variations?

"EU", "eur", "Europ", "Europa", "Erope", "Evropa"...

String similarity!

Collapsing categories with string matching

print(survey['state'].unique())
id          state
0      California
1            Cali
2      Calefornia
3      Calefornie
4      Californie
5       Calfornia
6      Calefernia
7        New York
8   New York City
...
categories
  state
0 California
1 New York

Collapsing all of the states

# For each correct category
for state in categories['state']:
    # Find potential matches in states with typos
    matches = process.extract(state, survey['state'], limit = survey.shape[0])
    # For each potential match
    for potential_match in matches:
        # If high similarity score
        if potential_match[1] >= 80:
            # Replace typo with correct category
            survey.loc[survey['state'] == potential_match[0], 'state'] = state
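
The same collapse-by-similarity pattern can be tried without pandas or thefuzz. The sketch below uses difflib.get_close_matches from the standard library on a plain list. The function collapse_categories and the sample data are made up for illustration, and the 0.8 cutoff is an assumption that plays the role of the score threshold of 80 above (the two scorers are not equivalent).

```python
from difflib import get_close_matches


def collapse_categories(values, categories, cutoff=0.8):
    """Replace each value that closely matches a category with that category."""
    unique_values = sorted(set(values))
    mapping = {}
    for category in categories:
        # All distinct values whose similarity to the category is >= cutoff.
        variants = get_close_matches(
            category, unique_values, n=max(1, len(unique_values)), cutoff=cutoff
        )
        for variant in variants:
            mapping[variant] = category
    # Values with no close category are left unchanged.
    return [mapping.get(value, value) for value in values]


states = ["California", "Calefornia", "Calfornia", "New York", "Texas"]
print(collapse_categories(states, ["California", "New York"]))
# ['California', 'California', 'California', 'New York', 'Texas']
```

Building the mapping first makes it easy to inspect which variants would be rewritten before changing any data. Note that get_close_matches is case-sensitive, so values may need lowercasing beforehand.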

Record linkage

Let's practice!
