Cleaning Data in R
Maggie Matsui
Content Developer @ DataCamp
What's the distance between typhoon and baboon?
How many typos are needed to get from one string to another?
Total: 4
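The count of 4 can be checked with a small dynamic-programming sketch. `levenshtein()` is a hypothetical helper written for illustration, not part of stringdist. It assumes only insertions, deletions, and substitutions count as typos (no transpositions), which is enough for this pair.

```r
# Plain Levenshtein distance via dynamic programming (illustrative helper).
levenshtein <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  # d[i + 1, j + 1] = edits needed to turn the first i characters of a
  # into the first j characters of b
  d <- matrix(0L, nrow = length(x) + 1, ncol = length(y) + 1)
  d[, 1] <- 0:length(x)  # delete everything
  d[1, ] <- 0:length(y)  # insert everything
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      cost <- if (x[i] == y[j]) 0L else 1L
      d[i + 1, j + 1] <- min(
        d[i, j + 1] + 1L,  # deletion
        d[i + 1, j] + 1L,  # insertion
        d[i, j] + cost     # substitution (or match when cost is 0)
      )
    }
  }
  d[length(x) + 1, length(y) + 1]
}

typos <- levenshtein("baboon", "typhoon")
typos
```

Base R's `utils::adist()` computes the same kind of distance and can serve as a cross-check.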
Which is best?
library(stringdist)
# Damerau-Levenshtein
stringdist("baboon", "typhoon",
method = "dl")
4
# LCS
stringdist("baboon", "typhoon",
method = "lcs")
7
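Where the 7 comes from: the LCS distance counts the characters left over once the longest common subsequence is removed from both strings. A sketch in base R, where `lcs_length()` is a hypothetical helper and not a stringdist function:

```r
# Length of the longest common subsequence (illustrative helper).
lcs_length <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  m <- matrix(0L, nrow = length(x) + 1, ncol = length(y) + 1)
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      m[i + 1, j + 1] <- if (x[i] == y[j]) {
        m[i, j] + 1L                    # extend the common subsequence
      } else {
        max(m[i, j + 1], m[i + 1, j])   # skip a character from one string
      }
    }
  }
  m[length(x) + 1, length(y) + 1]
}

shared_len <- lcs_length("baboon", "typhoon")  # "oon"
lcs_dist <- nchar("baboon") + nchar("typhoon") - 2 * shared_len
lcs_dist
```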
# Jaccard
stringdist("baboon", "typhoon",
method = "jaccard")
0.75
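Where the 0.75 comes from: with single characters as the unit of comparison (assumed here to be the stringdist default, `q = 1`), the Jaccard distance is 1 minus the share of distinct characters the two strings have in common. A base R sketch:

```r
# Distinct characters in each string
chars_a <- unique(strsplit("baboon", "")[[1]])   # b a o n
chars_b <- unique(strsplit("typhoon", "")[[1]])  # t y p h o n

in_both   <- intersect(chars_a, chars_b)  # o n
in_either <- union(chars_a, chars_b)      # 8 distinct characters

jaccard_dist <- 1 - length(in_both) / length(in_either)
jaccard_dist
```

Unlike the edit-based methods, this value is a proportion between 0 and 1, so it is not directly comparable to the counts returned by `"dl"` or `"lcs"`.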
"EU"
, "eur"
, "Europ"
$\rightarrow$ "Europe"
"EU"
, "eur"
, "Europ"
, "Europa"
, "Erope"
, "Evropa"
, ... $\rightarrow$ "Europe"
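With many variants, a distance cutoff can do most of the mapping. A minimal sketch using base R's `utils::adist()` (Levenshtein distance) as a stand-in for stringdist. The cutoff of 2 is an assumption chosen for illustration.

```r
variants <- c("EU", "eur", "Europ", "Europa", "Erope", "Evropa")

# Edit distance from each variant to the target spelling
d <- as.vector(utils::adist(variants, "Europe"))

# Replace anything within the (assumed) cutoff; leave the rest untouched
cleaned <- ifelse(d <= 2, "Europe", variants)
data.frame(variants, distance = d, cleaned)
```

Short abbreviations such as "EU" and "eur" sit too far from "Europe" to pass the cutoff, so they would still need a manual mapping.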
survey
        city move_score
1     chicgo          4
2 los angles          4
3    chicogo          5
4    new yrk          5
5  new yoork          2
6   seatttle          3
7 losangeles          4
8    seeatle          2
...
cities
         city
1    new york
2     chicago
3 los angeles
4     seattle
library(fuzzyjoin)
stringdist_left_join(survey, cities, by = "city", method = "dl")
      city.x move_score      city.y
1     chicgo          4     chicago
2 los angles          4 los angeles
3    chicogo          5     chicago
4    new yrk          5    new york
5  new yoork          2    new york
6   seatttle          3     seattle
7 losangeles          4 los angeles
8    seeatle          2     seattle
9    siattle          1     seattle
...
stringdist_left_join(survey, cities, by = "city", method = "dl", max_dist = 1)
      city.x move_score      city.y
1     chicgo          4     chicago
2 los angles          4 los angeles
3    chicogo          5     chicago
4    new yrk          5    new york
5  new yoork          2    new york
6   seatttle          3     seattle
7 losangeles          4 los angeles
8    seeatle          2        <NA>
9    siattle          1     seattle
...
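To see why "seeatle" drops out at `max_dist = 1`, the matching step can be sketched in base R. This is an illustration, not how fuzzyjoin is implemented: `closest_match()` is a hypothetical helper, `utils::adist()` (Levenshtein) stands in for `method = "dl"`, and only the nearest candidate is kept, whereas a fuzzy join keeps every candidate within the cutoff.

```r
# Rows copied from the join output above
survey <- data.frame(
  city = c("chicgo", "los angles", "chicogo", "new yrk", "new yoork",
           "seatttle", "losangeles", "seeatle", "siattle"),
  move_score = c(4, 4, 5, 5, 2, 3, 4, 2, 1),
  stringsAsFactors = FALSE
)
cities <- data.frame(
  city = c("new york", "chicago", "los angeles", "seattle"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: nearest entry in `table`, or NA if it is too far away
closest_match <- function(x, table, max_dist) {
  d <- utils::adist(x, table)                 # rows: x, columns: table
  best <- apply(d, 1, which.min)              # nearest candidate per row
  best_dist <- d[cbind(seq_along(x), best)]   # distance to that candidate
  ifelse(best_dist <= max_dist, table[best], NA_character_)
}

survey$city_clean <- closest_match(survey$city, cities$city, max_dist = 1)
survey
```

"seeatle" needs two edits to become "seattle" (delete an "e", insert a "t"), so it is matched at a cutoff of 2 but left as `NA` at a cutoff of 1.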