Combining models and rules

Advanced NLP with spaCy

Ines Montani

spaCy core developer

Statistical predictions vs. rules

Statistical models
  • Use cases: application needs to generalize based on examples
  • Real-world examples: product names, person names, subject/object relationships
  • spaCy features: entity recognizer, dependency parser, part-of-speech tagger

Rule-based systems
  • Use cases: dictionary with finite number of examples
  • Real-world examples: countries of the world, cities, drug names, dog breeds
  • spaCy features: tokenizer, Matcher, PhraseMatcher
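To make the contrast concrete, here is a minimal sketch that runs both approaches on the same text. It assumes the small English pipeline en_core_web_sm is installed; the example sentence and the 'DOG_BREED' dictionary are made up for illustration.

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')
# Example text and breed dictionary are made up for illustration
doc = nlp("I have a Labrador Retriever and I live in Berlin")

# Statistical: the entity recognizer generalizes from its training examples
print([(ent.text, ent.label_) for ent in doc.ents])

# Rule-based: a finite dictionary of dog breeds is matched exactly
matcher = PhraseMatcher(nlp.vocab)
matcher.add('DOG_BREED', None, nlp("Labrador Retriever"))
print([doc[start:end].text for match_id, start, end in matcher(doc)])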

Recap: Rule-based Matching

# Initialize with the shared vocab
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

# Patterns are lists of dictionaries describing the tokens
pattern = [{'LEMMA': 'love', 'POS': 'VERB'}, {'LOWER': 'cats'}]
matcher.add('LOVE_CATS', None, pattern)

# Operators can specify how often a token should be matched
pattern = [{'TEXT': 'very', 'OP': '+'}, {'TEXT': 'happy'}]

# Calling matcher on doc returns list of (match_id, start, end) tuples
doc = nlp("I love cats and I'm very very happy")
matches = matcher(doc)
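For instance, the returned indices can be resolved back into text like this (a small follow-up sketch; nlp.vocab.strings maps the match_id hash back to the pattern name):

# Resolve each match to its pattern name and the matched text
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # e.g. 'LOVE_CATS'
    span = doc[start:end]
    print(string_id, span.text)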

Adding statistical predictions

matcher = Matcher(nlp.vocab)
matcher.add('DOG', None, [{'LOWER': 'golden'}, {'LOWER': 'retriever'}])
doc = nlp("I have a Golden Retriever")

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print('Matched span:', span.text)
    # Get the span's root token and root head token
    print('Root token:', span.root.text)
    print('Root head token:', span.root.head.text)
    # Get the previous token and its POS tag
    print('Previous token:', doc[start - 1].text, doc[start - 1].pos_)
Matched span: Golden Retriever

Root token: Retriever
Root head token: have
Previous token: a DET
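Because the matched span keeps its links into the parse, the model's predictions can also be used to filter rule-based matches. A hedged sketch, reusing the doc and matcher from above:

# Only report dogs whose span attaches to the verb 'have' in the parse
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    if span.root.head.lemma_ == 'have':
        print('Dog someone has:', span.text)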

Efficient phrase matching (1)

  • PhraseMatcher is like regular expressions or keyword search – but with access to the tokens!
  • Takes Doc objects as patterns
  • More efficient and faster than the Matcher
  • Great for matching large word lists (see the sketch after this list)
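A minimal sketch of matching a large word list, assuming COUNTRIES is a placeholder for a list loaded from a file; nlp.pipe processes the texts as a stream, so the Doc patterns are created much faster than by calling nlp on each string:

from spacy.matcher import PhraseMatcher

# Placeholder word list - in practice this might hold hundreds of entries
COUNTRIES = ['Czech Republic', 'Slovakia', 'New Zealand']

matcher = PhraseMatcher(nlp.vocab)
# Create one Doc pattern per phrase using nlp.pipe for efficiency
patterns = list(nlp.pipe(COUNTRIES))
matcher.add('COUNTRY', None, *patterns)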

Efficient phrase matching (2)

from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

pattern = nlp("Golden Retriever") matcher.add('DOG', None, pattern) doc = nlp("I have a Golden Retriever")
# iterate over the matches for match_id, start, end in matcher(doc): # get the matched span span = doc[start:end] print('Matched span:', span.text)
Matched span: Golden Retriever
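As a follow-up, the rule-based matches can be turned into entity spans so they live on doc.ents just like the model's predictions. A sketch, assuming no overlapping entities already exist on the doc:

from spacy.tokens import Span

# Create a labelled Span for each match and write them to doc.ents
spans = [Span(doc, start, end, label='DOG')
         for match_id, start, end in matcher(doc)]
doc.ents = spans
print([(ent.text, ent.label_) for ent in doc.ents])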

Let's practice!
