Measuring semantic similarity with spaCy

Natural Language Processing with spaCy

Azadeh Mobasher

Principal Data Scientist

The semantic similarity method

 

  • The process of analyzing texts to identify similarities
  • Used to categorize texts into predefined categories or to detect relevant texts
  • A similarity score measures how similar two pieces of text are

 

What is the cheapest flight from Boston to Seattle?
Which airline serves Denver, Pittsburgh and Atlanta?
What kinds of planes are used by American Airlines?

Similarity score

  • A metric defined over texts
  • Computed from word vectors, typically with cosine similarity
  • Cosine similarity ranges from -1 to 1; scores for word vectors mostly fall between 0 and 1

Cosine similarity and vectors
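
To make the metric concrete, here is a minimal sketch of cosine similarity in plain Python. The helper name `cosine_similarity` and the three-dimensional toy vectors are assumptions for illustration only; real word vectors have hundreds of dimensions, and this is not spaCy's internal implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumed convention for this sketch: a zero vector has no direction.
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, not taken from any real model.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])

print(round(same_direction, 3), round(orthogonal, 3), round(opposite, 3))
```

Vectors pointing the same way score close to 1, unrelated (orthogonal) vectors score 0, and vectors pointing in opposite directions score close to -1. Vector length does not affect the score.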


Token similarity

  • spaCy calculates similarity scores between Token objects
import spacy

nlp = spacy.load("en_core_web_md")
doc1 = nlp("We eat pizza")
doc2 = nlp("We like to eat pasta")

token1 = doc1[2]  # "pizza"
token2 = doc2[4]  # "pasta"
print(f"Similarity between {token1} and {token2} = ", round(token1.similarity(token2), 3))
>>> Similarity between pizza and pasta =  0.685

Span similarity

  • spaCy calculates semantic similarity of two given Span objects
doc1 = nlp("We eat pizza")
doc2 = nlp("We like to eat pasta")

span1 = doc1[1:]
span2 = doc2[1:]

print(f"Similarity between \"{span1}\" and \"{span2}\" = ", round(span1.similarity(span2), 3))
>>> Similarity between "eat pizza" and "like to eat pasta" =  0.588
print(f"Similarity between \"{doc1[1:]}\" and \"{doc2[3:]}\" = ",
        round(doc1[1:].similarity(doc2[3:]), 3))
>>> Similarity between "eat pizza" and "eat pasta" =  0.936

Doc similarity

  • spaCy calculates the similarity scores between two documents
nlp = spacy.load("en_core_web_md")

doc1 = nlp("I like to play basketball")
doc2 = nlp("I love to play basketball")
print("Similarity score :", round(doc1.similarity(doc2), 3))
>>> Similarity score : 0.975
  • A high cosine similarity indicates that the two documents are semantically very similar
  • By default, a Doc vector is the average of its token vectors
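
The averaging step can be sketched without loading a model. The two-dimensional vectors in `toy_vectors` below are made up for illustration (assumptions, not values from `en_core_web_md`), and the helpers `average_vector` and `cosine_similarity` are hypothetical names rather than spaCy functions.

```python
import math

def average_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical 2-d word vectors; "like" and "love" are deliberately close.
toy_vectors = {
    "I": [1.0, 0.0],
    "like": [0.6, 0.8],
    "love": [0.8, 0.6],
    "basketball": [0.0, 1.0],
}

# Each document vector is the mean of its word vectors.
doc1_vec = average_vector([toy_vectors[w] for w in ["I", "like", "basketball"]])
doc2_vec = average_vector([toy_vectors[w] for w in ["I", "love", "basketball"]])

score = cosine_similarity(doc1_vec, doc2_vec)
print("Similarity score :", round(score, 3))
```

Because the two documents differ in only one word, and that word's vectors are close, the averaged vectors point in almost the same direction and the score is near 1.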

Sentence similarity

  • Similarity scores can be used to find content relevant to a given keyword
  • Finding customer questions similar to the word price:
sentences = nlp("What is the cheapest flight from Boston to Seattle? "
                "Which airline serves Denver, Pittsburgh and Atlanta? "
                "What kinds of planes are used by American Airlines?")

keyword = nlp("price")

for i, sentence in enumerate(sentences.sents):
    print(f"Similarity score with sentence {i+1}: ", round(sentence.similarity(keyword), 5))
>>> Similarity score with sentence 1:  0.26136
Similarity score with sentence 2:  0.14021
Similarity score with sentence 3:  0.13885
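
Once the scores are available, picking the relevant sentences is a matter of sorting or thresholding. The sketch below reuses the three scores printed above; the `threshold` of 0.2 is a hypothetical cutoff chosen for this example, not a recommended value.

```python
# (sentence, similarity to "price") pairs, copied from the output above.
scores = [
    ("What is the cheapest flight from Boston to Seattle?", 0.26136),
    ("Which airline serves Denver, Pittsburgh and Atlanta?", 0.14021),
    ("What kinds of planes are used by American Airlines?", 0.13885),
]

# Rank sentences from most to least similar to the keyword.
ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)
best_sentence, best_score = ranked[0]

# Hypothetical cutoff: keep only sentences scoring at or above it.
threshold = 0.2
relevant = [sentence for sentence, score in scores if score >= threshold]

print("Most relevant:", best_sentence)
print("Above threshold:", relevant)
```

The first sentence ranks highest because "cheapest" is semantically close to "price", even though the word "price" never appears in it.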

Let's practice!
