Methods of string distances

Intermediate Regular Expressions in R

Angelo Zehr

Data Journalist

Damerau-Levenshtein

[Figure: "Rick Caplan" typo example]

Method abbreviations

Regular Levenshtein distance:

stringdist(a, b, method = "lv")

Damerau-Levenshtein distance:

stringdist(a, b, method = "dl")

Optimal String Alignment distance:

stringdist(a, b, method = "osa")
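
To make the difference between these methods concrete, here is a minimal base-R sketch of the Optimal String Alignment recurrence, compared against the plain Levenshtein distance from `adist()` (part of base R's utils package). The function name `osa_distance` is hypothetical and this is not the stringdist package's implementation; it only illustrates how counting an adjacent swap as a single edit changes the result.

```r
# Hypothetical helper: Optimal String Alignment distance via dynamic programming.
osa_distance <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)

  # d[i + 1, j + 1] holds the distance between the first i chars of a
  # and the first j chars of b.
  d <- matrix(0L, n + 1, m + 1)
  d[, 1] <- 0:n
  d[1, ] <- 0:m

  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- as.integer(x[i] != y[j])
      d[i + 1, j + 1] <- min(
        d[i, j + 1] + 1L,  # deletion
        d[i + 1, j] + 1L,  # insertion
        d[i, j] + cost     # substitution (or match when cost is 0)
      )
      # Swap of two adjacent characters counts as one edit.
      if (i > 1 && j > 1 && x[i] == y[j - 1] && x[i - 1] == y[j]) {
        d[i + 1, j + 1] <- min(d[i + 1, j + 1], d[i - 1, j - 1] + 1L)
      }
    }
  }
  d[n + 1, m + 1]
}

lv  <- adist("abc", "acb")[1, 1]   # Levenshtein: two substitutions
osa <- osa_distance("abc", "acb")  # OSA: one transposition
print(c(lv = lv, osa = osa))
```

For the swapped pair "abc" / "acb", Levenshtein needs two edits while OSA needs one. OSA and full Damerau-Levenshtein agree on simple typos like this; they differ only when a transposed pair is edited again, as in "ca" versus "abc", where OSA gives 3 and Damerau-Levenshtein gives 2.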

Q-Grams (or n-grams)

[Figure: q-grams of "Honolulu"]

Q-Grams (or n-grams)

[Figure: q-grams of "Honolulu" compared with "Hanolulu"]

Inspecting q-grams

qgrams("Honolulu", "Hanolulu", q = 2)

Returns:

   Ho on no ol lu ul Ha an
V1  1  1  1  1  2  1  0  0
V2  0  0  1  1  2  1  1  1
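
As a rough cross-check, the q-gram counts for a single string can be reproduced in base R by sliding a window of width q over the string and tabulating the pieces. The helper name `qgram_counts` is hypothetical and is not part of the stringdist package.

```r
# Hypothetical helper: count the overlapping q-grams of one string.
qgram_counts <- function(s, q = 2) {
  if (nchar(s) < q) return(table(character(0)))
  starts <- seq_len(nchar(s) - q + 1)          # start position of each window
  table(substring(s, starts, starts + q - 1))  # tabulate the q-grams
}

counts <- qgram_counts("Honolulu", q = 2)
print(counts)
```

"Honolulu" has seven overlapping bigrams but only six distinct ones, because "lu" occurs twice.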

Method abbreviations

Sum of the count differences for q-grams that are not shared

stringdist(a, b, method = "qgram", q = 2) # equals 4 for "Honolulu" vs. "Hanolulu"

Distinct q-grams that are not shared, divided by the total number of distinct q-grams

stringdist(a, b, method = "jaccard", q = 2) # equals 0.5

Cosine distance: one minus the cosine of the angle between the two q-gram count vectors

stringdist(a, b, method = "cosine", q = 2) # equals 0.22
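
The three values above can be checked by hand for "Honolulu" and "Hanolulu" with bigrams (q = 2). The following base-R sketch is only an illustration of the definitions, not the stringdist package's own code; `qgram_list` and `count_in` are hypothetical helper names.

```r
# Hypothetical helper: all overlapping q-grams of a string, in order.
qgram_list <- function(s, q = 2) {
  if (nchar(s) < q) return(character(0))
  starts <- seq_len(nchar(s) - q + 1)
  substring(s, starts, starts + q - 1)
}

grams_a <- qgram_list("Honolulu")
grams_b <- qgram_list("Hanolulu")

# Every distinct q-gram seen in either string.
vocab <- union(grams_a, grams_b)

# Hypothetical helper: count vector over the shared vocabulary.
count_in <- function(grams) {
  vapply(vocab, function(g) sum(grams == g), numeric(1))
}
va <- count_in(grams_a)
vb <- count_in(grams_b)

# q-gram distance: sum of absolute count differences.
qgram_dist <- sum(abs(va - vb))

# Jaccard distance: 1 - shared distinct q-grams / all distinct q-grams.
jaccard_dist <- 1 - sum(va > 0 & vb > 0) / length(vocab)

# Cosine distance: 1 - cosine of the angle between the count vectors.
cosine_dist <- 1 - sum(va * vb) / (sqrt(sum(va^2)) * sqrt(sum(vb^2)))

print(c(qgram = qgram_dist, jaccard = jaccard_dist, cosine = round(cosine_dist, 2)))
```

The two strings share four of their eight distinct bigrams ("no", "ol", "lu", "ul"), which is where the Jaccard value of 0.5 comes from. The four unshared bigrams ("Ho", "on", "Ha", "an") each occur once, which gives the q-gram distance of 4.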

Let's practice!
