Cleaning Data in R
Maggie Matsui
Content Developer @ DataCamp



What's the distance between typhoon and baboon?
How many typos are needed to get from one string to another?










Total: 4 edits (for example, typhoon to baboon: substitute t→b, y→a, p→b, and delete h)
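The count above can be reproduced with a small dynamic-programming sketch in base R. It computes a plain Levenshtein distance (insertions, deletions, and substitutions only, with no transpositions), which happens to give the same answer for this pair. `edit_distance` is a hypothetical helper name used for illustration, not a function from any package.

```r
# Hypothetical helper: plain Levenshtein distance between two strings.
edit_distance <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  # d[i + 1, j + 1] is the distance between the first i characters of a
  # and the first j characters of b.
  d <- matrix(0, nrow = length(x) + 1, ncol = length(y) + 1)
  d[, 1] <- 0:length(x)  # delete every character of a
  d[1, ] <- 0:length(y)  # insert every character of b
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      cost <- if (x[i] == y[j]) 0 else 1
      d[i + 1, j + 1] <- min(
        d[i, j + 1] + 1,  # deletion
        d[i + 1, j] + 1,  # insertion
        d[i, j] + cost    # substitution (or match when cost is 0)
      )
    }
  }
  d[length(x) + 1, length(y) + 1]
}

edit_distance("baboon", "typhoon")  # 4
```

Base R's `utils::adist()` computes the same kind of distance without a hand-written loop.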

Which is best?
library(stringdist)
# Damerau-Levenshtein
stringdist("baboon", "typhoon",
           method = "dl")
4

# LCS
stringdist("baboon", "typhoon",
           method = "lcs")
7
# Jaccard
stringdist("baboon", "typhoon",
           method = "jaccard")
0.75
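As a rough check on the LCS and Jaccard values above, both can be worked out by hand in base R. This sketch assumes Jaccard is computed on sets of single characters and that the LCS distance is the number of characters left unmatched by the longest common subsequence. `lcs_length` is a hypothetical helper name, not part of `stringdist`.

```r
s1 <- "baboon"
s2 <- "typhoon"

# Jaccard distance on single characters: 1 - |intersection| / |union|
a <- unique(strsplit(s1, "")[[1]])  # b, a, o, n
b <- unique(strsplit(s2, "")[[1]])  # t, y, p, h, o, n
jaccard <- 1 - length(intersect(a, b)) / length(union(a, b))
jaccard  # 1 - 2/8 = 0.75

# Hypothetical helper: length of the longest common subsequence.
lcs_length <- function(u, v) {
  x <- strsplit(u, "")[[1]]
  y <- strsplit(v, "")[[1]]
  m <- matrix(0, nrow = length(x) + 1, ncol = length(y) + 1)
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      m[i + 1, j + 1] <- if (x[i] == y[j]) {
        m[i, j] + 1                    # extend the common subsequence
      } else {
        max(m[i, j + 1], m[i + 1, j])  # skip a character on one side
      }
    }
  }
  m[length(x) + 1, length(y) + 1]
}

# LCS distance: characters in either string that are not part of the LCS
lcs_dist <- nchar(s1) + nchar(s2) - 2 * lcs_length(s1, s2)
lcs_dist  # 6 + 7 - 2 * 3 = 7
```

The three methods measure different things, so their values are not comparable with each other: a `max_dist` that suits one method will generally not suit another.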
"EU", "eur", "Europ" → "Europe"
"EU", "eur", "Europ", "Europa", "Erope", "Evropa", ... → "Europe"?

survey
          city move_score
1       chicgo          4
2   los angles          4
3      chicogo          5
4      new yrk          5
5    new yoork          2
6     seatttle          3
7   losangeles          4
8      seeatle          2
...
cities
         city
1    new york
2     chicago
3 los angeles
4     seattle
library(fuzzyjoin)
stringdist_left_join(survey, cities, by = "city", method = "dl")
        city.x move_score      city.y
1       chicgo          4     chicago
2   los angles          4 los angeles
3      chicogo          5     chicago
4      new yrk          5    new york
5    new yoork          2    new york
6     seatttle          3     seattle
7   losangeles          4 los angeles
8      seeatle          2     seattle
9      siattle          1     seattle
...
stringdist_left_join(survey, cities, by = "city", method = "dl", max_dist = 1)
        city.x move_score      city.y
1       chicgo          4     chicago
2   los angles          4 los angeles
3      chicogo          5     chicago
4      new yrk          5    new york
5    new yoork          2    new york
6     seatttle          3     seattle
7   losangeles          4 los angeles
8      seeatle          2        <NA>
9      siattle          1     seattle
...
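To make the effect of `max_dist` more concrete, here is a minimal base R sketch of the idea behind a string-distance left join. It is not how `fuzzyjoin` is implemented. It uses `utils::adist()` (plain Levenshtein rather than Damerau-Levenshtein) and keeps only the single nearest city per row, whereas `stringdist_left_join()` returns every match within `max_dist`. `nearest_city_join` is a hypothetical helper name, and the data frames simply retype the rows shown above.

```r
survey <- data.frame(
  city = c("chicgo", "los angles", "chicogo", "new yrk", "new yoork",
           "seatttle", "losangeles", "seeatle", "siattle"),
  move_score = c(4, 4, 5, 5, 2, 3, 4, 2, 1),
  stringsAsFactors = FALSE
)
cities <- data.frame(
  city = c("new york", "chicago", "los angeles", "seattle"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: attach the nearest city if it is close enough.
nearest_city_join <- function(x, y, max_dist = 2) {
  d <- utils::adist(x$city, y$city)   # rows: survey, columns: cities
  best <- apply(d, 1, which.min)      # column index of the nearest city
  best_dist <- d[cbind(seq_len(nrow(d)), best)]
  data.frame(
    city.x = x$city,
    move_score = x$move_score,
    # Rows with no city within max_dist get NA, as in a left join
    city.y = ifelse(best_dist <= max_dist, y$city[best], NA_character_),
    stringsAsFactors = FALSE
  )
}

loose <- nearest_city_join(survey, cities)                # max_dist = 2
strict <- nearest_city_join(survey, cities, max_dist = 1)
strict
```

With `max_dist = 1`, "seeatle" is two edits away from "seattle", so it is left unmatched, while "siattle" (one substitution) still matches.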