Cleaning Data in Python
Adel Nehme
Content Developer @ DataCamp
Minimum edit distance: the least possible number of steps needed to transition from one string to another
Minimum edit distance: 5
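To make the step counting concrete, here is a minimal sketch of the Levenshtein distance in plain Python. The function name and the example strings are assumptions for illustration. "intention" and "execution" are used only because they are a well-known pair whose distance is 5 when each insertion, deletion, and substitution costs one step.

```python
def levenshtein(source: str, target: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn source into target."""
    # Row 0: turning "" into the first j characters of target takes j insertions.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i characters into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution_cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("intention", "execution"))  # 5
print(levenshtein("reeding", "reading"))      # 1
```

Only two rows of the table are kept in memory at a time, which is enough because each cell depends only on the row above and the cell to its left.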
Algorithm | Operations |
---|---|
Damerau-Levenshtein | insertion, substitution, deletion, transposition |
Levenshtein | insertion, substitution, deletion |
Hamming | substitution only |
Jaro distance | transposition only |
... | ... |
Possible packages: nltk, thefuzz, textdistance, ...
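The allowed operations change the distance you get for the same pair of strings. The sketch below compares a substitution-only count (Hamming) with a count that also allows swapping two adjacent characters. The second function implements the restricted "optimal string alignment" variant of Damerau-Levenshtein, which is an assumption made to keep the example short. Both function names are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Count positions where the characters differ (substitution only)."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def osa_distance(a: str, b: str) -> int:
    """Insertion, deletion, substitution, and adjacent transposition
    (optimal string alignment, a restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Two adjacent characters swapped: count it as one step.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(hamming("form", "from"))       # 2
print(osa_distance("form", "from"))  # 1
```

The swapped "or" / "ro" costs two substitutions under Hamming but a single transposition once that operation is allowed.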
Package used here: thefuzz
# Lets us compare two strings
from thefuzz import fuzz
# Compare reeding vs reading
fuzz.WRatio('Reeding', 'Reading')
86
# Partial string comparison
fuzz.WRatio('Houston Rockets', 'Rockets')
90
# Partial string comparison with different order
fuzz.WRatio('Houston Rockets vs Los Angeles Lakers', 'Lakers vs Rockets')
86
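One reason a reordered string can still score high is that the tokens can be sorted before comparing. The sketch below illustrates that idea with the standard library only. It is a rough stand-in, not how thefuzz computes WRatio, and the helper names simple_score and token_sort_score are hypothetical. Its numbers will generally differ from the scores shown above.

```python
from difflib import SequenceMatcher


def simple_score(a: str, b: str) -> int:
    """Case-insensitive similarity on a 0-100 scale (stdlib stand-in)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def sort_tokens(text: str) -> str:
    """Lowercase the text and put its words in alphabetical order."""
    return " ".join(sorted(text.lower().split()))


def token_sort_score(a: str, b: str) -> int:
    """Compare two strings after sorting their words, so order is ignored."""
    return simple_score(sort_tokens(a), sort_tokens(b))


# Same words, different order
print(simple_score("Lakers vs Rockets", "Rockets vs Lakers"))
print(token_sort_score("Lakers vs Rockets", "Rockets vs Lakers"))  # 100
```

After sorting, both strings become "lakers rockets vs", so the token-sorted comparison sees them as identical even though the plain comparison does not.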
# Import process and pandas
import pandas as pd
from thefuzz import process
# Define string and array of possible matches
string = "Houston Rockets vs Los Angeles Lakers"
choices = pd.Series(['Rockets vs Lakers', 'Lakers vs Rockets',
                     'Houson vs Los Angeles', 'Heat vs Bulls'])
# Return the two closest matches as (match, score, index) tuples
process.extract(string, choices, limit=2)
[('Rockets vs Lakers', 86, 0), ('Lakers vs Rockets', 86, 1)]
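Each result is a tuple of the matched string, its score, and its index in the Series. A small sketch of how that output could be read, using the list shown above as a literal so it runs without thefuzz. The variable names are assumptions, and the 80 cutoff is only an illustrative threshold.

```python
# Output copied from the process.extract() call above
matches = [('Rockets vs Lakers', 86, 0), ('Lakers vs Rockets', 86, 1)]

# Unpack the best match: (matched string, score, index in the Series)
best_match, best_score, best_index = matches[0]
print(best_match, best_score, best_index)

# Keep only the strings whose score clears a chosen threshold
strong_matches = [match for match, score, _ in matches if score >= 80]
print(strong_matches)
```

When the choices are a plain list rather than a Series, the tuples should hold only the string and the score, so the unpacking would need two names instead of three.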
Chapter 2
Use .replace() to collapse "eur" into "Europe"
What if there are too many variations?
"EU", "eur", "Europ", "Europa", "Erope", "Evropa", ...
String similarity!
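As a rough illustration of the idea, the sketch below maps each variation to the closest known category using difflib.get_close_matches from the standard library instead of a hand-written replacement for every spelling. The collapse helper and the 0.6 cutoff are assumptions for this example, not part of the course code.

```python
from difflib import get_close_matches


def collapse(value: str, categories: list[str], cutoff: float = 0.6) -> str:
    """Return the closest category, or the value unchanged if none is close."""
    # Compare in lowercase, but return the category with its original casing.
    lookup = {category.lower(): category for category in categories}
    hits = get_close_matches(value.lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else value


variations = ["EU", "eur", "Europ", "Europa", "Erope", "Evropa"]
for variation in variations:
    print(variation, "->", collapse(variation, ["Europe"]))
```

Very short forms such as "EU" may fall below the cutoff and stay unchanged, so the threshold has to be tuned against the data.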
print(survey['state'].unique())
id state
0 California
1 Cali
2 Calefornia
3 Calefornie
4 Californie
5 Calfornia
6 Calefernia
7 New York
8 New York City
...
categories
state
0 California
1 New York
# For each correct category
for state in categories['state']:
    # Find potential matches in states with typos
    matches = process.extract(state, survey['state'], limit=survey.shape[0])
    # For each potential match
    for potential_match in matches:
        # If high similarity score
        if potential_match[1] >= 80:
            # Replace typo with correct category
            survey.loc[survey['state'] == potential_match[0], 'state'] = state
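The same collapsing logic can be sketched as a reusable function that works on plain lists and accepts any 0-100 scorer. This is a variation on the loop above, not the course's code: it picks the single best category for each value, so a value can never be overwritten by a later category. The names collapse_categories and stand_in_score are hypothetical, and the default scorer is a standard-library stand-in rather than WRatio.

```python
from difflib import SequenceMatcher


def stand_in_score(a: str, b: str) -> int:
    """Stdlib stand-in for a 0-100 similarity scorer (not thefuzz's WRatio)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def collapse_categories(values, categories, scorer=stand_in_score, threshold=80):
    """Replace each value with its best-scoring category if the score
    reaches the threshold; otherwise keep the value as it is."""
    cleaned = []
    for value in values:
        best = max(categories, key=lambda category: scorer(value, category))
        cleaned.append(best if scorer(value, best) >= threshold else value)
    return cleaned


states = ["California", "Cali", "Calefornia", "Calefornie", "Californie",
          "Calfornia", "Calefernia", "New York", "New York City"]
print(collapse_categories(states, ["California", "New York"]))
```

Assuming thefuzz is installed, passing scorer=fuzz.WRatio should bring the behaviour closer to the loop above. With the stand-in scorer, short forms such as "Cali" may stay unchanged.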