Advanced NLP with spaCy
Ines Montani
spaCy core developer
| | Statistical models | Rule-based systems |
|---|---|---|
| Use cases | application needs to generalize based on examples | dictionary with finite number of examples |
| Real-world examples | product names, person names, subject/object relationships | countries of the world, cities, drug names, dog breeds |
| spaCy features | entity recognizer, dependency parser, part-of-speech tagger | tokenizer, `Matcher`, `PhraseMatcher` |
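The statistical side of this table is covered by a trained pipeline. As a minimal sketch (the `en_core_web_sm` model and the example sentence are assumptions, not part of the original slides), loading a model gives you the entity recognizer, part-of-speech tagger and dependency parser, and creates the `nlp` object used in the examples below:

```python
import spacy

# Assumption: the small English model is installed via
#   python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

# Illustrative example sentence
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Entity recognizer: predicts spans like product and person names
for ent in doc.ents:
    print(ent.text, ent.label_)

# Part-of-speech tagger and dependency parser: per-token predictions
for token in doc:
    print(token.text, token.pos_, token.dep_)
```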
```python
# Import the Matcher and initialize it with the shared vocab
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

# Patterns are lists of dictionaries describing the tokens
pattern = [{'LEMMA': 'love', 'POS': 'VERB'}, {'LOWER': 'cats'}]
matcher.add('LOVE_CATS', None, pattern)

# Operators can specify how often a token should be matched
pattern = [{'TEXT': 'very', 'OP': '+'}, {'TEXT': 'happy'}]

# Calling the matcher on a doc returns a list of (match_id, start, end) tuples
doc = nlp("I love cats and I'm very very happy")
matches = matcher(doc)
```
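To inspect those `(match_id, start, end)` tuples, a short sketch could also register the operator pattern and slice the `doc` for each match (the `VERY_HAPPY` label is an illustrative name, not from the original slides):

```python
# Register the operator pattern under an illustrative label
matcher.add('VERY_HAPPY', None, pattern)

# Each match is a (match_id, start, end) tuple; slicing the doc gives the matched span
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], '->', doc[start:end].text)
```

Because of the `'+'` operator, both "very happy" and "very very happy" should show up as matched spans.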
```python
matcher = Matcher(nlp.vocab)
matcher.add('DOG', None, [{'LOWER': 'golden'}, {'LOWER': 'retriever'}])
doc = nlp("I have a Golden Retriever")

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print('Matched span:', span.text)

    # Get the span's root token and root head token
    print('Root token:', span.root.text)
    print('Root head token:', span.root.head.text)

    # Get the previous token and its POS tag
    print('Previous token:', doc[start - 1].text, doc[start - 1].pos_)
```
```
Matched span: Golden Retriever
Root token: Retriever
Root head token: have
Previous token: a DET
```
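The root, head and part-of-speech attributes used above are statistical predictions, so they can be combined with the rule-based match. As a hedged sketch (the determiner filter is an illustrative assumption, not part of the original example), a match could be kept only if the preceding token is a determiner:

```python
# Keep only matches preceded by a determiner such as "a" or "the"
# (this filtering rule is illustrative, not from the course slides)
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    if start > 0 and doc[start - 1].pos_ == 'DET':
        print('Span preceded by a determiner:', span.text)
```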
PhraseMatcher

- Like regular expressions or keyword search – but with access to the tokens!
- Takes Doc objects as patterns
- More efficient and faster than the Matcher
```python
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

pattern = nlp("Golden Retriever")
matcher.add('DOG', None, pattern)
doc = nlp("I have a Golden Retriever")

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Get the matched span
    span = doc[start:end]
    print('Matched span:', span.text)
```
```
Matched span: Golden Retriever
```
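Because the patterns are `Doc` objects, large word lists can be turned into patterns efficiently with `nlp.pipe`. A minimal sketch, where the `COUNTRIES` list and the example sentence are illustrative assumptions:

```python
# COUNTRIES is a hypothetical list of country name strings
COUNTRIES = ['Czech Republic', 'New Zealand', 'South Africa']

matcher = PhraseMatcher(nlp.vocab)

# nlp.pipe processes the strings as a stream, which is faster than calling nlp on each one
patterns = list(nlp.pipe(COUNTRIES))
matcher.add('COUNTRY', None, *patterns)

doc = nlp("The Czech Republic may help Slovakia protect its airspace")
print([doc[start:end].text for match_id, start, end in matcher(doc)])
```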