Rule-based Matching

Advanced NLP with spaCy

Ines Montani

spaCy core developer

Why not just regular expressions?

  • Match on Doc objects, not just strings
  • Match on tokens and token attributes
  • Use the model's predictions
  • Example: "duck" (verb) vs. "duck" (noun)
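
A minimal sketch of the "duck" example, assuming spaCy v3 (where `matcher.add` takes a list of patterns). The part-of-speech tags are hand-annotated stand-ins for a trained model's predictions, so no model download is needed; the sentence and the `DUCK_VERB` name are illustrative.

```python
import re

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")  # tokenizer-only pipeline, no trained components

# Hand-annotated POS tags stand in for a model's predictions (assumption).
words = ["I", "saw", "a", "duck", "and", "had", "to", "duck"]
pos = ["PRON", "VERB", "DET", "NOUN", "CCONJ", "VERB", "PART", "VERB"]
doc = Doc(nlp.vocab, words=words, pos=pos)

# A regular expression sees only characters, so both "duck"s look the same.
regex_hits = re.findall(r"\bduck\b", doc.text)

# A token pattern can also require the part of speech.
matcher = Matcher(nlp.vocab)
matcher.add("DUCK_VERB", [[{"LOWER": "duck", "POS": "VERB"}]])
verb_hits = [(start, doc[start:end].text) for _, start, end in matcher(doc)]

print(len(regex_hits), verb_hits)
```

The regular expression finds both occurrences, while the token pattern keeps only the one tagged as a verb.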

Match patterns

  • Lists of dictionaries, one per token

  • Match exact token texts

    [{'ORTH': 'iPhone'}, {'ORTH': 'X'}]
    
  • Match lexical attributes

    [{'LOWER': 'iphone'}, {'LOWER': 'x'}]
    
  • Match any token attributes

    [{'LEMMA': 'buy'}, {'POS': 'NOUN'}]
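
To illustrate the difference between exact and lexical matching, here is a small sketch, assuming spaCy v3. The pattern names `EXACT` and `LOWERCASED` and the five-token document are made up for the example.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# ORTH compares the exact token text; LOWER compares the lowercased text.
matcher.add("EXACT", [[{"ORTH": "iPhone"}, {"ORTH": "X"}]])
matcher.add("LOWERCASED", [[{"LOWER": "iphone"}, {"LOWER": "x"}]])

# Build the Doc from explicit tokens so tokenization plays no role here.
doc = Doc(nlp.vocab, words=["iPhone", "X", "or", "IPHONE", "x"])

found = sorted(
    (nlp.vocab.strings[match_id], doc[start:end].text)
    for match_id, start, end in matcher(doc)
)
print(found)
```

Only the `LOWER` pattern picks up the differently cased "IPHONE x"; the `ORTH` pattern matches the exact spelling alone.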
    

Using the Matcher (1)

import spacy

# Import the Matcher
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load('en_core_web_sm')

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
# (spaCy v2 signature; in spaCy v3 this call is matcher.add('IPHONE_PATTERN', [pattern]))
pattern = [{'ORTH': 'iPhone'}, {'ORTH': 'X'}]
matcher.add('IPHONE_PATTERN', None, pattern)

# Process some text
doc = nlp("New iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)

Using the Matcher (2)

# Call the matcher on the doc
doc = nlp("New iPhone X release date leaked")
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)
iPhone X
  • match_id: hash value of the pattern name
  • start: start index of matched span
  • end: end index of matched span
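
Because match_id is a hash, it can be turned back into the pattern name through the vocab's string store. A short sketch, assuming spaCy v3 and a blank English pipeline (the exact-text pattern here needs no trained model):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("IPHONE_PATTERN", [[{"ORTH": "iPhone"}, {"ORTH": "X"}]])

doc = nlp("New iPhone X release date leaked")

results = []
for match_id, start, end in matcher(doc):
    # Look the hash up in the string store to recover the pattern name.
    name = nlp.vocab.strings[match_id]
    results.append((name, start, end, doc[start:end].text))

print(results)
```

Note that end is exclusive, so doc[start:end] covers exactly the matched tokens.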

Matching lexical attributes

pattern = [
    {'IS_DIGIT': True},
    {'LOWER': 'fifa'},
    {'LOWER': 'world'},
    {'LOWER': 'cup'},
    {'IS_PUNCT': True}
]
doc = nlp("2018 FIFA World Cup: France won!")
2018 FIFA World Cup:
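
The slide shows only the pattern and the text. A complete version might look like the sketch below, assuming spaCy v3; the pattern name `FIFA_PATTERN` is hypothetical. Lexical attributes need no trained model, so a blank English pipeline is enough.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

pattern = [
    {"IS_DIGIT": True},   # a token made up of digits, e.g. "2018"
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True},   # any punctuation token, e.g. ":"
]
matcher.add("FIFA_PATTERN", [pattern])

doc = nlp("2018 FIFA World Cup: France won!")
texts = [doc[start:end].text for _, start, end in matcher(doc)]
print(texts)
```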

Matching other token attributes

pattern = [
    {'LEMMA': 'love', 'POS': 'VERB'},
    {'POS': 'NOUN'}
]
doc = nlp("I loved dogs but now I love cats more.")
loved dogs
love cats
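
LEMMA and POS come from a model's predictions. The sketch below, assuming spaCy v3, supplies them by hand so it runs without a trained model; the annotations and the `LOVE_NOUN` name are illustrative, not real model output.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-written lemmas and POS tags stand in for model predictions (assumption).
words = ["I", "loved", "dogs", "but", "now", "I", "love", "cats", "more", "."]
lemmas = ["I", "love", "dog", "but", "now", "I", "love", "cat", "more", "."]
pos = ["PRON", "VERB", "NOUN", "CCONJ", "ADV", "PRON", "VERB", "NOUN", "ADV", "PUNCT"]
doc = Doc(nlp.vocab, words=words, lemmas=lemmas, pos=pos)

matcher = Matcher(nlp.vocab)
# One token must satisfy both conditions: lemma "love" and tagged as a verb.
matcher.add("LOVE_NOUN", [[{"LEMMA": "love", "POS": "VERB"}, {"POS": "NOUN"}]])

matches = sorted(matcher(doc), key=lambda m: m[1])  # order by start index
texts = [doc[start:end].text for _, start, end in matches]
print(texts)
```

Matching on the lemma is what lets one pattern cover both "loved" and "love".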

Using operators and quantifiers (1)

pattern = [
    {'LEMMA': 'buy'},
    {'POS': 'DET', 'OP': '?'},  # optional: match 0 or 1 times
    {'POS': 'NOUN'}
]
doc = nlp("I bought a smartphone. Now I'm buying apps.")
bought a smartphone
buying apps
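
A runnable sketch of the optional-determiner pattern, assuming spaCy v3. Lemmas and POS tags are hand-annotated in place of model predictions, and `BUY_PATTERN` is a made-up name.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-written annotations stand in for model predictions (assumption).
words = ["I", "bought", "a", "smartphone", ".", "Now", "I", "'m", "buying", "apps", "."]
lemmas = ["I", "buy", "a", "smartphone", ".", "now", "I", "be", "buy", "app", "."]
pos = ["PRON", "VERB", "DET", "NOUN", "PUNCT", "ADV", "PRON", "AUX", "VERB", "NOUN", "PUNCT"]
doc = Doc(nlp.vocab, words=words, lemmas=lemmas, pos=pos)

pattern = [
    {"LEMMA": "buy"},
    {"POS": "DET", "OP": "?"},  # optional: match 0 or 1 times
    {"POS": "NOUN"},
]
matcher = Matcher(nlp.vocab)
matcher.add("BUY_PATTERN", [pattern])

matches = sorted(matcher(doc), key=lambda m: m[1])  # order by start index
texts = [doc[start:end].text for _, start, end in matches]
print(texts)
```

The first match uses the determiner "a"; the second skips it because "apps" follows "buying" directly.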

Using operators and quantifiers (2)

Example        Description
{'OP': '!'}    Negation: match 0 times
{'OP': '?'}    Optional: match 0 or 1 times
{'OP': '+'}    Match 1 or more times
{'OP': '*'}    Match 0 or more times
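
To compare the quantifiers side by side, the sketch below runs three variants of one pattern over the same sentence, assuming spaCy v3. The helper `matched_texts` and the sentence are hypothetical. With quantifiers the matcher reports every possible match, so overlapping spans appear in the results.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp("a very very good day")


def matched_texts(op):
    """Return the distinct spans matched by 'very' (with operator op) + 'good'."""
    matcher = Matcher(nlp.vocab)
    pattern = [{"LOWER": "very", "OP": op}, {"LOWER": "good"}]
    matcher.add("DEMO", [pattern])
    return sorted({doc[start:end].text for _, start, end in matcher(doc)})


optional = matched_texts("?")   # 0 or 1 "very"
one_plus = matched_texts("+")   # 1 or more "very"
zero_plus = matched_texts("*")  # 0 or more "very"

print(optional)
print(one_plus)
print(zero_plus)
```

Only '+' requires at least one "very", which is why the bare "good" is missing from its results.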

Let's practice!
