Combining models and rules

Advanced NLP with spaCy

Ines Montani

spaCy core developer

Statistical predictions vs. rules

	Statistical models	Rule-based systems
Use cases	application needs to generalize from examples	dictionary with a finite number of examples
Real-world examples	product names, person names, subject/object relationships	countries of the world, cities, drug names, dog breeds
spaCy features	entity recognizer, dependency parser, part-of-speech tagger	tokenizer, `Matcher`, `PhraseMatcher`

Recap: Rule-based matching

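# Assumes `nlp` is an already loaded pipeline, e.g. from spacy.load()
# Initialize with the shared vocab
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

# Patterns are lists of dictionaries describing the tokens
pattern = [{'LEMMA': 'love', 'POS': 'VERB'}, {'LOWER': 'cats'}]
# spaCy v2 signature; in spaCy v3 this is matcher.add('LOVE_CATS', [pattern])
matcher.add('LOVE_CATS', None, pattern)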

# Operators can specify how often a token should be matched
pattern = [{'TEXT': 'very', 'OP': '+'}, {'TEXT': 'happy'}]

# Calling matcher on doc returns list of (match_id, start, end) tuples
doc = nlp("I love cats and I'm very very happy")
matches = matcher(doc)
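
The operator pattern in the recap is defined but never added to the matcher. A minimal sketch of how it might behave, assuming a blank English pipeline (`spacy.blank("en")`, tokenizer only) and the list-based `add` signature used by spaCy v3; the key `VERY_HAPPY` is a made-up name for illustration:

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

# Assumption: a tokenizer-only pipeline is enough here, because the
# pattern uses only the TEXT attribute (no tagger or parser needed)
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One or more "very" tokens followed by "happy"
pattern = [{"TEXT": "very", "OP": "+"}, {"TEXT": "happy"}]
matcher.add("VERY_HAPPY", [pattern])  # hypothetical key name

doc = nlp("I love cats and I'm very very happy")
spans = [doc[start:end] for match_id, start, end in matcher(doc)]

# The "+" operator can yield overlapping candidates;
# filter_spans keeps the longest non-overlapping ones
longest = filter_spans(spans)
print([span.text for span in spans])
print([span.text for span in longest])
```

Filtering is optional: keep all candidates if shorter overlapping matches are also of interest.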

Adding statistical predictions

matcher = Matcher(nlp.vocab)
matcher.add('DOG', None, [{'LOWER': 'golden'}, {'LOWER': 'retriever'}])
doc = nlp("I have a Golden Retriever")

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print('Matched span:', span.text)

    # Get the span's root token and root head token
    print('Root token:', span.root.text)
    print('Root head token:', span.root.head.text)

    # Get the previous token and its POS tag
    print('Previous token:', doc[start - 1].text, doc[start - 1].pos_)

Matched span: Golden Retriever

Root token: Retriever
Root head token: have

Previous token: a DET
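
`doc[start - 1]` assumes the match does not begin at the first token. When `start` is 0, the index becomes `-1` and the last token of the doc is returned instead. A small guard, sketched with a blank English pipeline (so no POS tags or dependencies are predicted here) and a hypothetical helper named `previous_token`:

```python
import spacy
from spacy.matcher import Matcher

# Assumption: tokenizer only, no statistical components
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# List-based add signature, as used by spaCy v3
matcher.add("DOG", [[{"LOWER": "golden"}, {"LOWER": "retriever"}]])


def previous_token(doc, start):
    """Return the token before index `start`, or None at the start of the doc."""
    return doc[start - 1] if start > 0 else None


for text in ["I have a Golden Retriever", "Golden Retriever puppies are cute"]:
    doc = nlp(text)
    for match_id, start, end in matcher(doc):
        prev = previous_token(doc, start)
        print(doc[start:end].text, "| previous token:", prev.text if prev else None)
```

With a trained pipeline in place of the blank one, `prev.pos_` could be read in the same way once `prev` is known not to be `None`.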

Efficient phrase matching (1)

`PhraseMatcher` works like regular expressions or keyword search, but with access to the tokens!
Takes `Doc` objects as patterns
More efficient and faster than the `Matcher`
Great for matching large word lists

Efficient phrase matching (2)

from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

pattern = nlp("Golden Retriever")
matcher.add('DOG', None, pattern)

doc = nlp("I have a Golden Retriever")


# iterate over the matches
for match_id, start, end in matcher(doc):
    # get the matched span
    span = doc[start:end]
    print('Matched span:', span.text)

Matched span: Golden Retriever
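
For the large word list use case, the patterns can be built from a plain Python list. A minimal sketch, assuming a blank English pipeline, a made-up `COUNTRIES` list, and the list-based `add` signature from spaCy v3; `attr="LOWER"` is used here to make the match case-insensitive:

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: tokenizer-only pipeline, no trained model required
nlp = spacy.blank("en")

# Hypothetical word list; in practice this could hold thousands of entries
COUNTRIES = ["Czech Republic", "Slovakia", "New Zealand"]

# attr="LOWER" compares lowercase token text instead of the exact text
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# make_doc only runs the tokenizer, which is all these patterns need
patterns = [nlp.make_doc(name) for name in COUNTRIES]
matcher.add("COUNTRY", patterns)

doc = nlp("I moved from new zealand to the Czech Republic.")
# Sort by start position so the spans appear in document order
spans = sorted(
    (doc[start:end] for match_id, start, end in matcher(doc)),
    key=lambda span: span.start,
)
print([span.text for span in spans])
```

Using `nlp.make_doc` rather than `nlp` avoids running pipeline components that the patterns do not need, which matters more as the list grows.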

Let's practice!
