Comparing strings

Cleaning Data in Python

Adel Nehme

VP of AI Curriculum, DataCamp

In this chapter

Chapter 4 - Record linkage

Minimum edit distance

The smallest number of edit steps (such as insertion, deletion, or substitution) needed to turn one string into another

Minimum edit distance so far: 2

Minimum edit distance: 5
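
As a minimal sketch of how such a distance can be computed, the function below uses the classic dynamic-programming approach for Levenshtein distance, where insertion, deletion, and substitution each cost one step. The function name `levenshtein` and the example word pair are illustrative assumptions, not taken from the slides.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn `source` into `target` (each step costs 1)."""
    # Row 0: turning the empty string into target[:j] takes j insertions
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        # Turning source[:i] into the empty string takes i deletions
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# Hypothetical example pair, chosen only for illustration
print(levenshtein("intention", "execution"))  # 5
```

Only two rows of the table are kept in memory at a time, which is enough because each cell depends only on the current and previous rows.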

Minimum edit distance algorithms

Algorithm             Operations
Damerau-Levenshtein   insertion, substitution, deletion, transposition
Levenshtein           insertion, substitution, deletion
Hamming               substitution only
Jaro distance         transposition only
...                   ...

 

Possible packages: nltk, thefuzz, textdistance, ...
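
To see how the allowed operations change the result, here is a small pure-Python sketch comparing three of the algorithms in the table on the same pair of strings. The function names are hypothetical, and the transposition variant shown is the restricted "optimal string alignment" form of Damerau-Levenshtein, chosen here for brevity.

```python
def hamming(a: str, b: str) -> int:
    """Substitution only, so both strings must have the same length."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str, transpositions: bool = False) -> int:
    """Levenshtein distance. With transpositions=True, swapping two
    adjacent characters also counts as a single step."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # i deletions
    for j in range(cols):
        d[0][j] = j  # j insertions
    for i in range(1, rows):
        for j in range(1, cols):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


print(hamming("form", "from"))                             # 2
print(edit_distance("form", "from"))                       # 2
print(edit_distance("form", "from", transpositions=True))  # 1
```

Swapping "or" to "ro" costs two substitutions without transposition, but only one step once transposition is allowed.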

Possible packages: thefuzz

Simple string comparison

# Lets us compare two strings
from thefuzz import fuzz

# Compare reeding vs reading
fuzz.WRatio('Reeding', 'Reading')
86
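
To make the 0 to 100 scale concrete, here is a rough stand-in built on the standard library's `difflib.SequenceMatcher` instead of thefuzz. `fuzz.WRatio` combines several comparison strategies, so its scores will not always agree with this simple ratio. The helper name `simple_ratio` is hypothetical.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Case-insensitive similarity on a 0-100 scale:
    100 * 2 * (matching characters) / (total characters in both strings)."""
    matcher = SequenceMatcher(None, a.lower(), b.lower())
    return round(100 * matcher.ratio())


print(simple_ratio("Reeding", "Reading"))  # 86
print(simple_ratio("Reading", "Reading"))  # 100
```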

Partial strings and different orderings

# Partial string comparison
fuzz.WRatio('Houston Rockets', 'Rockets')
90
# Partial string comparison with different order
fuzz.WRatio('Houston Rockets vs Los Angeles Lakers', 'Lakers vs Rockets')
86
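
One way to picture why partial matches and reordered words can still score highly is sketched below with the standard library. It is an assumption-laden approximation, not thefuzz's actual implementation: `partial_ratio` and `token_sort_ratio` here are hypothetical helpers that slide the shorter string along the longer one and sort the words before comparing, respectively.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Plain similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio(a: str, b: str) -> int:
    """Best score of the shorter string against every
    same-length window of the longer string."""
    shorter, longer = sorted((a.lower(), b.lower()), key=len)
    width = len(shorter)
    return max(ratio(shorter, longer[i:i + width])
               for i in range(len(longer) - width + 1))


def token_sort_ratio(a: str, b: str) -> int:
    """Compare the strings after sorting their words alphabetically,
    so word order no longer matters."""
    def normalize(text: str) -> str:
        return " ".join(sorted(text.lower().split()))
    return ratio(normalize(a), normalize(b))


print(partial_ratio("Houston Rockets", "Rockets"))                # 100
print(token_sort_ratio("Rockets vs Lakers", "Lakers vs Rockets"))  # 100
```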

Comparison with arrays

# Import process and pandas
import pandas as pd
from thefuzz import process

# Define string and array of possible matches
string = "Houston Rockets vs Los Angeles Lakers"
choices = pd.Series(['Rockets vs Lakers', 'Lakers vs Rockets',
                     'Houson vs Los Angeles', 'Heat vs Bulls'])

# Return the two best matches as (match, score, index) tuples
process.extract(string, choices, limit=2)
[('Rockets vs Lakers', 86, 0), ('Lakers vs Rockets', 86, 1)]
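
The ranking idea behind this call can be sketched without thefuzz: score every choice against the query, sort by score, and keep the top few. The helper `extract_top` below is hypothetical and uses `difflib.SequenceMatcher` as the scorer, so its scores will differ from thefuzz's. It mirrors the (match, score, index) tuple shape shown above.

```python
from difflib import SequenceMatcher


def extract_top(query, choices, limit=2):
    """Return the `limit` best (choice, score, index) tuples,
    highest score first."""
    def score(a, b):
        return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

    scored = [(choice, score(query, choice), index)
              for index, choice in enumerate(choices)]
    # sorted() is stable, so ties keep their original order
    return sorted(scored, key=lambda item: item[1], reverse=True)[:limit]


choices = ["Heat vs Bulls", "Rockets vs Lakers", "Lakers vs Rockets"]
print(extract_top("Rockets vs Lakers", choices, limit=2))
```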

Collapsing categories with string similarity

Chapter 2

Use .replace() to collapse "eur" into "Europe"

 

What if there are too many variations?

"EU", "eur", "Europ", "Europa", "Erope", "Evropa"...

 

String similarity!

Collapsing categories with string matching

print(survey['state'].unique())
id          state
0      California
1            Cali
2      Calefornia
3      Calefornie
4      Californie
5       Calfornia
6      Calefernia
7        New York
8   New York City
...
categories
        state
0  California
1    New York

Collapsing all of the states

# For each correct category
for state in categories['state']:
    # Find potential matches in states with typos
    matches = process.extract(state, survey['state'], limit=survey.shape[0])
    # For each potential match
    for potential_match in matches:
        # If high similarity score
        if potential_match[1] >= 80:
            # Replace typo with correct category
            survey.loc[survey['state'] == potential_match[0], 'state'] = state
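
The same threshold idea can be tried end to end on a toy list without pandas or thefuzz. The sketch below is a simplified variant, not the loop above: it looks up the best category for each value instead of iterating over categories, and it scores with `difflib.SequenceMatcher`, so borderline cases may land differently than with `process.extract`. The names `score`, `collapse`, and the sample data are hypothetical.

```python
from difflib import SequenceMatcher


def score(a: str, b: str) -> int:
    """Case-insensitive similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def collapse(values, categories, threshold=80):
    """Replace each value with its best-matching category when the
    similarity score reaches `threshold`; otherwise keep it unchanged."""
    cleaned = []
    for value in values:
        best = max(categories, key=lambda category: score(value, category))
        cleaned.append(best if score(value, best) >= threshold else value)
    return cleaned


states = ["California", "Calefornia", "Calfornia", "New York", "Texas"]
print(collapse(states, ["California", "New York"]))
# ['California', 'California', 'California', 'New York', 'Texas']
```

Values that do not resemble any category closely enough, such as "Texas" here, are left untouched, which is the role the cutoff of 80 plays.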

Record linkage


Let's practice!
