Measuring semantic similarity with spaCy

Natural Language Processing with spaCy

Azadeh Mobasher

Principal Data Scientist

Semantic similarity methods

 

  • Semantic similarity is the process of analyzing texts to identify how alike they are
  • It is used to sort texts into predefined categories or to detect relevant texts
  • A similarity score measures how similar two pieces of text are

 

What is the cheapest flight from Boston to Seattle?
Which airline serves Denver, Pittsburgh and Atlanta?
What kinds of planes are used by American Airlines?

Similarity score

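  • A metric defined over texts
  • To measure similarity, use cosine similarity together with word vectors
  • Cosine similarity ranges from -1 to 1; a higher value means more similar vectors, and scores between word vectors are mostly positive in practice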

Cosine similarity and vectors
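
To make the metric concrete, here is a minimal sketch of cosine similarity in plain Python. The vectors are toy values invented for illustration, not real word vectors, and returning 0.0 for a zero vector is a convention assumed here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors, invented for illustration only
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])
print(round(same_direction, 3), round(orthogonal, 3), round(opposite, 3))
```

Vectors pointing the same way score close to 1, unrelated (orthogonal) vectors score 0, and vectors pointing in opposite directions score close to -1. The score depends only on direction, not on vector length.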


Token similarity

  • spaCy computes a similarity score between Token objects
import spacy

nlp = spacy.load("en_core_web_md")
doc1 = nlp("We eat pizza")
doc2 = nlp("We like to eat pasta")

token1 = doc1[2]
token2 = doc2[4]
print(f"Similarity between {token1} and {token2} = ", round(token1.similarity(token2), 3))
>>> Similarity between pizza and pasta =  0.685

Span similarity

  • spaCy computes the semantic similarity of two Span objects
doc1 = nlp("We eat pizza")
doc2 = nlp("We like to eat pasta")

span1 = doc1[1:]
span2 = doc2[1:]

print(f"Similarity between \"{span1}\" and \"{span2}\" = ", round(span1.similarity(span2), 3))
>>> Similarity between "eat pizza" and "like to eat pasta" =  0.588
print(f"Similarity between \"{doc1[1:]}\" and \"{doc2[3:]}\" = ",
        round(doc1[1:].similarity(doc2[3:]), 3))
>>> Similarity between "eat pizza" and "eat pasta" =  0.936

Document similarity

  • spaCy computes a similarity score between two documents
nlp = spacy.load("en_core_web_md")

doc1 = nlp("I like to play basketball")
doc2 = nlp("I love to play basketball")
print("Similarity score :", round(doc1.similarity(doc2), 3))
>>> Similarity score : 0.975
  • A high cosine similarity indicates that the two texts are semantically very similar
  • By default, a Doc's vector is the average of its word vectors
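
The averaging step can be sketched without loading a model. The 2-dimensional vectors below are hypothetical values made up for this illustration (real `en_core_web_md` vectors have many more dimensions), and `average_vector` and `cosine_similarity` are helper names introduced here, not spaCy functions.

```python
import math

def average_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical word vectors, invented for illustration only
toy_vectors = {
    "I": [0.1, 0.9],
    "like": [0.8, 0.3],
    "love": [0.9, 0.4],
    "basketball": [0.2, 0.7],
}

# A document vector is approximated as the mean of its word vectors
doc1_vector = average_vector([toy_vectors[w] for w in ["I", "like", "basketball"]])
doc2_vector = average_vector([toy_vectors[w] for w in ["I", "love", "basketball"]])
doc_similarity = cosine_similarity(doc1_vector, doc2_vector)
print(round(doc_similarity, 3))
```

Because the two toy documents differ in a single word whose vectors are close, their averaged vectors point in nearly the same direction and the score is close to 1. Averaging also means word order does not affect the result.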

Sentence similarity

  • spaCy can find content that is relevant to a keyword
  • Finding customer questions similar to the word price:
# Adjacent string literals are concatenated, so the text stays a single string
sentences = nlp("What is the cheapest flight from Boston to Seattle? "
                "Which airline serves Denver, Pittsburgh and Atlanta? "
                "What kinds of planes are used by American Airlines?")

keyword = nlp("price")

for i, sentence in enumerate(sentences.sents):
    print(f"Similarity score with sentence {i+1}: ", round(sentence.similarity(keyword), 5))
>>> Similarity score with sentence 1:  0.26136
Similarity score with sentence 2:  0.14021
Similarity score with sentence 3:  0.13885
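
Once the scores are available, picking the relevant questions is a sort and a filter. The sketch below reuses the three scores printed above; the `THRESHOLD` cutoff is a hypothetical value chosen only for this illustration and would need tuning on real data.

```python
# Sentences and scores copied from the output above
scored_sentences = [
    ("What is the cheapest flight from Boston to Seattle?", 0.26136),
    ("Which airline serves Denver, Pittsburgh and Atlanta?", 0.14021),
    ("What kinds of planes are used by American Airlines?", 0.13885),
]

# Hypothetical cutoff, chosen only for this illustration
THRESHOLD = 0.2

# Highest score first, then keep only the sentences at or above the cutoff
ranked = sorted(scored_sentences, key=lambda pair: pair[1], reverse=True)
relevant = [text for text, score in ranked if score >= THRESHOLD]

best_text, best_score = ranked[0]
print(f"Most similar to 'price': {best_text} ({best_score})")
```

Only the first question, the one about the cheapest flight, clears this cutoff. The absolute scores are low, so the ranking is more informative than the raw values.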

Let's practice!
