Measuring semantic similarity with spaCy

Natural Language Processing with spaCy

Azadeh Mobasher

Principal Data Scientist

The semantic similarity method

Process of analyzing texts to identify similarities
Categorizes texts into predefined categories or detects relevant texts
Similarity score measures how similar two pieces of text are

What is the cheapest flight from Boston to Seattle?
Which airline serves Denver, Pittsburgh and Atlanta?
What kinds of planes are used by American Airlines?

Similarity score

A metric defined over texts
To measure similarity, use cosine similarity over word vectors
Cosine similarity ranges from -1 to 1; for word vectors it usually falls between 0 and 1

Cosine similarity and vectors
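
A minimal sketch of how cosine similarity can be computed for two vectors, in plain Python. The function name `cosine_similarity` and the 2-d toy vectors are illustrative assumptions, not spaCy's internals, and returning 0.0 for a zero vector is a convention chosen only for this sketch.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for this sketch: no direction, no similarity
    return dot / (norm_u * norm_v)

# Toy 2-d vectors, made up for illustration (not real word vectors)
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(round(same_direction, 3), round(orthogonal, 3), round(opposite, 3))
```

Vectors pointing the same way score close to 1, unrelated (orthogonal) vectors score 0, and opposite vectors score close to -1. Only the direction matters, not the length.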

Token similarity

spaCy calculates similarity scores between Token objects

import spacy

# The medium English model ships with word vectors
nlp = spacy.load("en_core_web_md")
doc1 = nlp("We eat pizza")
doc2 = nlp("We like to eat pasta")

token1 = doc1[2]
token2 = doc2[4]
print(f"Similarity between {token1} and {token2} = ", round(token1.similarity(token2), 3))

>>> Similarity between pizza and pasta =  0.685

Span similarity

spaCy calculates semantic similarity of two given Span objects

doc1 = nlp("We eat pizza")
doc2 = nlp("We like to eat pasta")

span1 = doc1[1:]
span2 = doc2[1:]

print(f"Similarity between \"{span1}\" and \"{span2}\" = ",
        round(span1.similarity(span2), 3))

>>> Similarity between "eat pizza" and "like to eat pasta" =  0.588

print(f"Similarity between \"{doc1[1:]}\" and \"{doc2[3:]}\" = ",
        round(doc1[1:].similarity(doc2[3:]), 3))

>>> Similarity between "eat pizza" and "eat pasta" =  0.936

Doc similarity

spaCy calculates the similarity scores between two documents

nlp = spacy.load("en_core_web_md")

doc1 = nlp("I like to play basketball")
doc2 = nlp("I love to play basketball")
print("Similarity score :", round(doc1.similarity(doc2), 3))

>>> Similarity score : 0.975

A high cosine similarity indicates semantically similar content
By default, a Doc vector is the average of its token vectors
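
To make the averaging step concrete, here is a small sketch in plain Python. The 3-d vectors in `toy_vectors` are invented for illustration (real model vectors have far more dimensions), and `average_vector` and `cosine_similarity` are hypothetical helpers, not spaCy functions.

```python
import math

# Hypothetical 3-d word vectors, invented for this example
toy_vectors = {
    "eat": [0.9, 0.1, 0.0],
    "pizza": [0.2, 0.8, 0.1],
    "pasta": [0.3, 0.7, 0.2],
}

def average_vector(words, vectors):
    """Element-wise mean of the vectors for the given words."""
    dims = len(next(iter(vectors.values())))
    total = [0.0] * dims
    for word in words:
        for i, value in enumerate(vectors[word]):
            total[i] += value
    return [value / len(words) for value in total]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Each "document" vector is the mean of its word vectors
doc_a = average_vector(["eat", "pizza"], toy_vectors)
doc_b = average_vector(["eat", "pasta"], toy_vectors)

print(round(cosine_similarity(doc_a, doc_b), 3))
```

Because both toy documents share the word "eat" and the remaining words have nearby vectors, the averaged vectors point in almost the same direction and the score is close to 1.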

Sentence similarity

spaCy can find content relevant to a given keyword
Finding customer questions similar to the word "price":

# Adjacent string literals are joined, so the text stays one document
sentences = nlp("What is the cheapest flight from Boston to Seattle? "
                "Which airline serves Denver, Pittsburgh and Atlanta? "
                "What kinds of planes are used by American Airlines?")

keyword = nlp("price")

for i, sentence in enumerate(sentences.sents):
    print(f"Similarity score with sentence {i+1}: ", round(sentence.similarity(keyword), 5))

>>> Similarity score with sentence 1:  0.26136
Similarity score with sentence 2:  0.14021
Similarity score with sentence 3:  0.13885
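
One way to act on these scores is to rank the sentences and keep the best match. The sketch below hard-codes the three scores printed above instead of calling spaCy, so it runs on its own; the variable names are illustrative assumptions.

```python
# Sentences and scores copied from the example above
questions = [
    "What is the cheapest flight from Boston to Seattle?",
    "Which airline serves Denver, Pittsburgh and Atlanta?",
    "What kinds of planes are used by American Airlines?",
]
scores = [0.26136, 0.14021, 0.13885]

# Pair each score with its sentence and sort from most to least similar
ranked = sorted(zip(scores, questions), reverse=True)
best_score, best_question = ranked[0]

print(f"Most relevant to 'price': {best_question} ({best_score})")
```

The first question ranks highest, which fits intuition: "cheapest" is closer in meaning to "price" than anything in the other two questions. All three scores are low in absolute terms, so any cutoff for "relevant" would need tuning on real data.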

Let's practice!
