Comparing strings

Cleaning Data in R

Maggie Matsui

Content Developer @ DataCamp

Measuring distance between values

A number line ranging from 0 to 15 with a dot on 3 and a dot on 10.

The same number line with a red bracket highlighting the length between the two dots on 3 and 10.

Below the red line showing distance, 10 minus 3 equals 7.

What's the distance between typhoon and baboon?

Minimum edit distance

How many typos are needed to get from one string to another?

Green plus sign for insertion.

Red minus sign for deletion.

Purple cycle sign for substitution.

Blue double arrow for transposition.

Edit distance = 1

Showing the difference between dog and dogs with a green plus sign.

Showing the difference between bath and bat with a red minus sign.

Showing the difference between cats and rats with a purple cycle sign.

Showing the difference between sing and sign using a blue double arrow.
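
The four one-edit examples can be checked with a short sketch. It assumes the `stringdist` package (used later in this lesson) is installed, and it assumes the transposition pair is "sing" and "sign".

```r
library(stringdist)

# One pair per edit operation; "sing"/"sign" is an assumed transposition pair
from <- c("dog", "bath", "cats", "sing")
to   <- c("dogs", "bat", "rats", "sign")

# Damerau-Levenshtein counts insertion, deletion, substitution,
# and transposition of adjacent characters as one edit each
dl <- stringdist(from, to, method = "dl")
names(dl) <- c("insertion", "deletion", "substitution", "transposition")
dl
```

Each pair should come out at a distance of 1, one for each kind of edit.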

A more complex example

baboon $\rightarrow$ typhoon
  • Insert h: baboon $\rightarrow$ babhoon
  • Substitute b $\rightarrow$ t: babhoon $\rightarrow$ tabhoon
  • Substitute a $\rightarrow$ y: tabhoon $\rightarrow$ tybhoon
  • Substitute b $\rightarrow$ p: tybhoon $\rightarrow$ typhoon

Total: 4
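
The walk-through above can be checked step by step. This sketch uses only base R: `adist()` from the `utils` package computes Levenshtein distance, which is enough here because none of the four edits is a transposition. The `steps` vector simply lists the intermediate words from the slides.

```r
# Intermediate words from the walk-through, in order
steps <- c("baboon", "babhoon", "tabhoon", "tybhoon", "typhoon")

# Distance between each word and the next one
step_cost <- mapply(function(a, b) adist(a, b)[1, 1],
                    head(steps, -1), tail(steps, -1))
step_cost       # every step should cost exactly one edit
sum(step_cost)  # total cost of this particular path

# Direct distance between the endpoints; the path is minimal if it matches
direct <- adist("baboon", "typhoon")[1, 1]
direct
```

If the summed step costs equal the direct distance, no shorter sequence of edits exists for this pair.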

Types of edit distance

  • Damerau-Levenshtein
    • What you just learned
  • Levenshtein
    • Considers only substitution, insertion, and deletion
  • LCS (Longest Common Subsequence)
    • Considers only insertion and deletion
  • Others
    • Jaro-Winkler
    • Jaccard

Which is best?
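
One way to get a feel for the differences is to score the same pairs under several methods. The sketch below assumes the `stringdist` package is installed; `method_names` and `compare()` are hypothetical helpers introduced only for this illustration, and "sing"/"sign" is an assumed transposition pair.

```r
library(stringdist)

method_names <- c("dl", "lv", "lcs", "jw", "jaccard")

# Score one pair of strings under every method in method_names
compare <- function(a, b) {
  sapply(method_names, function(m) stringdist(a, b, method = m))
}

swap   <- compare("sing", "sign")       # differs by one transposition
animal <- compare("baboon", "typhoon")  # running example from the slides
rbind(swap, animal)
```

The edit-based methods return counts of edits, while Jaro-Winkler and Jaccard return values between 0 and 1, so the numbers are not directly comparable across methods. Which one fits best likely depends on the kinds of errors in the data.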

String distance in R

library(stringdist)
# Damerau-Levenshtein distance between the two strings
stringdist("baboon", "typhoon", method = "dl")
[1] 4

Picture from previous slides showing how to get from baboon to typhoon.

Other methods

# LCS
stringdist("baboon", "typhoon", method = "lcs")
[1] 7

# Jaccard
stringdist("baboon", "typhoon", method = "jaccard")
[1] 0.75
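
To see where the Jaccard value comes from, the calculation can be reproduced by hand in base R. This sketch assumes the comparison is made on single characters, which is what `stringdist` does with its default `q = 1`.

```r
# Unique characters in each string
a <- unique(strsplit("baboon", "")[[1]])   # b, a, o, n
b <- unique(strsplit("typhoon", "")[[1]])  # t, y, p, h, o, n

shared <- intersect(a, b)  # characters found in both strings
either <- union(a, b)      # characters found in at least one string

# Jaccard distance: 1 minus the share of characters the strings have in common
jaccard_dist <- 1 - length(shared) / length(either)
jaccard_dist
```

Two of the eight distinct characters are shared, so the distance is 1 - 2/8 = 0.75.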

Comparing strings to clean data

  • In Chapter 2:
    • "EU", "eur", "Europ" $\rightarrow$ "Europe"
  • What if there are too many variations?
    • "EU", "eur", "Europ", "Europa", "Erope", "Evropa", ... $\rightarrow$ "Europe"?
    • Use string distance!
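
As a rough illustration of the idea, the sketch below scores each variant against "europe" and remaps the close ones. It assumes the `stringdist` package is installed; the `cutoff` of 2 is a hypothetical threshold chosen only for this example.

```r
library(stringdist)

variants <- c("EU", "eur", "Europ", "Europa", "Erope", "Evropa")

# Lowercase both sides so case differences do not count as edits
d <- stringdist(tolower(variants), "europe", method = "dl")
data.frame(variants, d)

# Hypothetical cutoff: accept anything within 2 edits of "europe"
cutoff <- 2
remapped <- ifelse(d <= cutoff, "Europe", variants)
remapped
```

Misspellings fall within the cutoff, but abbreviations such as "EU" and "eur" are several edits away, so they would still need an explicit mapping.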

Comparing strings to clean data

survey
          city move_score
1       chicgo          4
2   los angles          4
3      chicogo          5
4      new yrk          5
5    new yoork          2
6     seatttle          3
7   losangeles          4
8      seeatle          2
...
cities
         city
1    new york
2     chicago
3 los angeles
4     seattle

Remapping using string distance

library(fuzzyjoin)
stringdist_left_join(survey, cities, by = "city", method = "dl")
        city.x move_score      city.y
1       chicgo          4     chicago
2   los angles          4 los angeles
3      chicogo          5     chicago
4      new yrk          5    new york
5    new yoork          2    new york
6     seatttle          3     seattle
7   losangeles          4 los angeles
8      seeatle          2     seattle
9      siattle          1     seattle
...

Remapping using string distance

stringdist_left_join(survey, cities, by = "city", method = "dl", max_dist = 1)
        city.x move_score      city.y
1       chicgo          4     chicago
2   los angles          4 los angeles
3      chicogo          5     chicago
4      new yrk          5    new york
5    new yoork          2    new york
6     seatttle          3     seattle
7   losangeles          4 los angeles
8      seeatle          2        <NA>
9      siattle          1     seattle
...
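
A self-contained version of this join might look like the sketch below. It assumes the `fuzzyjoin` package is installed and rebuilds `survey` from only the nine rows visible on the slides (the remaining rows are elided there). The `unmatched` and `city_clean` names are introduced here for illustration.

```r
library(fuzzyjoin)

# Only the rows shown on the slides; the full survey has more
survey <- data.frame(
  city = c("chicgo", "los angles", "chicogo", "new yrk", "new yoork",
           "seatttle", "losangeles", "seeatle", "siattle"),
  move_score = c(4, 4, 5, 5, 2, 3, 4, 2, 1),
  stringsAsFactors = FALSE
)
cities <- data.frame(
  city = c("new york", "chicago", "los angeles", "seattle"),
  stringsAsFactors = FALSE
)

# Match each survey city to any reference city within one edit
joined <- stringdist_left_join(survey, cities, by = "city",
                               method = "dl", max_dist = 1)

# Rows with no reference city within max_dist: candidates for manual review
unmatched <- joined[is.na(joined$city.y), ]
unmatched

# Keep the corrected spelling where one was found
joined$city_clean <- joined$city.y
joined[, c("city.x", "city_clean", "move_score")]
```

A smaller `max_dist` lowers the risk of wrong matches but leaves more rows unmatched, as with "seeatle" here, which is two edits away from "seattle".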

Let's practice!
