Comparing the similarity between strings

Cleaning Data in SQL Server Databases

Miriam Antona

Software Engineer

Describing the problem

  • Messy strings
| airport_state |
|---------------|
| Caalifornia   |
| California    |
| Californiaa   |
| Illinois      |
| Ilynois       |
| Tejas         |
| Texas         |
SOUNDEX
DIFFERENCE
Cleaning Data in SQL Server Databases

SOUNDEX

SOUNDEX ( character_expression ) 
  • Phonetic algorithm
  • Returns four-character code
  • Based on English language, but also works with many words in other languages
SELECT SOUNDEX('Illinois') AS soundex_code1;
SELECT SOUNDEX('Ilynois') AS soundex_code2;
SELECT SOUNDEX('California') AS soundex_code3;
| soundex_code1 |   | soundex_code2 |   | soundex_code3 |
|---------------|   |---------------|   |---------------|
| I452          |   | I452          |   | I416          |
Cleaning Data in SQL Server Databases

SOUNDEX - how it works

Example: "Illinois"
  • Writes the first letter of the word
  • Replaces to zero(0) vowels and letters "h", "w", "y" to zero(0), after the first letter

 

"Illinois" -> I

"Illinois" -> Ill0n00s

Cleaning Data in SQL Server Databases

SOUNDEX - how it works

Example: "Illinois"
  • Replaces consonants after the first letter
Cleaning Data in SQL Server Databases

SOUNDEX - how it works

Example: "Illinois"
  • Replaces consonants after the first letter
Letters Represented by
b, f, p, v 1
c, g, j, k, q, s, x, z 2
d, t 3
l 4
m, n 5
r 6

 

"Ill0n00s" -> I4405002

Cleaning Data in SQL Server Databases

SOUNDEX - how it works

  • Replaces same adjacent digits with one
  • Removes all the zeros (0)
  • If the letter's digit is the same as the first digit, it removes the first digit.
  • Appends zeros if code contains less than 3 digits.
  • Removes final digits if code has more than 3 digits.

"I4405002" -> I40502

"I40502" -> I452

 

"I452" (don't apply)

Cleaning Data in SQL Server Databases

SOUNDEX - Exceptions

SELECT SOUNDEX('Arizona') AS soundex_code1;
SELECT SOUNDEX('Arkansas') AS soundex_code2;
| soundex_code1 |   | soundex_code2 |
|---------------|   |---------------|
| A625          |   | A625          |
Cleaning Data in SQL Server Databases

SOUNDEX - checking similarities

SELECT DISTINCT A1.airport_state
FROM airports A1 
INNER JOIN airports A2 
    ON SOUNDEX(A1.airport_state) = SOUNDEX(A2.airport_state)
    AND A1.airport_state <> A2.airport_state
| airport_state |
|---------------|
| Caalifornia   |
| California    |
| Californiaa   |
| Illinois      |
| Ilynois       |
| New Jersey    |
| New York      |
| Tejas         |
| Texas         |
Cleaning Data in SQL Server Databases

SOUNDEX - checking similarities

SELECT DISTINCT A1.airport_state
FROM airports A1 
INNER JOIN airports A2 
    ON SOUNDEX(REPLACE(A1.airport_state, ' ', '')) = SOUNDEX(REPLACE(A2.airport_state, ' ', ''))
    AND A1.airport_state <> A2.airport_state

"New York" -> "NewYork"

| airport_state |
|---------------|
| Caalifornia   |
| California    |
| Californiaa   |
| Illinois      |
| Ilynois       |
| Tejas         |
| Texas         |
Cleaning Data in SQL Server Databases

DIFFERENCE

DIFFERENCE ( character_expression , character_expression )  
  • Compares two SOUNDEX values
  • Returns a value from 0 to 4
    • 0 -> little or no similarity
    • 4 -> very similar or identically matching
Cleaning Data in SQL Server Databases

DIFFERENCE

SELECT DIFFERENCE('Illinois', 'Ilynois') AS dif_1;
| dif1 |
|------|
|  4   |
SELECT DIFFERENCE('Illinois', 'California') AS dif_2;
| dif2 |
|------|
| 1    |
Cleaning Data in SQL Server Databases

DIFFERENCE - checking similarities

SELECT DISTINCT A1.airport_state, A2.airport_state
FROM airports A1 
INNER JOIN airports A2 
    ON DIFFERENCE(REPLACE(A1.airport_state, ' ', ''), REPLACE(A2.airport_state, ' ', '')) = 4
    AND A1.airport_state <> A2.airport_state
| airport_state | airport_state |
|---------------|---------------|
| Caalifornia   | California    |
| Caalifornia   | Californiaa   |
| California    | Caalifornia   |
| California    | Californiaa   |
| Californiaa   | Caalifornia   |
| Californiaa   | California    |
| Illinois      | Ilynois       |
| Ilynois       | Illinois      |
| Massachusetts | Michigan      |
| Tejas         | Texas         |
| Texas         | Tejas         |
Cleaning Data in SQL Server Databases

Let's practice!

Cleaning Data in SQL Server Databases

Preparing Video For Download...