Rule-based Matching

Advanced NLP with spaCy

Ines Montani

spaCy core developer

Why not just regular expressions?

Match on Doc objects, not just strings
Match on tokens and token attributes
Use the model's predictions
Example: "duck" (verb) vs. "duck" (noun)
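
The "duck" example can be sketched as follows. This is a minimal illustration, assuming spaCy v3 (where `Doc` accepts `pos=` and `Matcher.add` takes a list of patterns). The part-of-speech tags are assigned by hand as stand-ins for a model's predictions, so no trained model is needed; the sentences and the `DUCK_VERB` name are made up for illustration.

```python
import re

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Blank pipeline: tokenizer and vocab only, no trained model
nlp = spacy.blank('en')

# Hand-annotated docs (assumption: these tags stand in for model predictions)
verb_doc = Doc(nlp.vocab,
               words=['I', 'duck', 'under', 'the', 'branch'],
               pos=['PRON', 'VERB', 'ADP', 'DET', 'NOUN'])
noun_doc = Doc(nlp.vocab,
               words=['The', 'duck', 'swims'],
               pos=['DET', 'NOUN', 'VERB'])

# A regular expression only sees characters, so it hits both sentences
regex_hits = [bool(re.search(r'\bduck\b', d.text)) for d in (verb_doc, noun_doc)]

# A token pattern can also require a part-of-speech tag
matcher = Matcher(nlp.vocab)
matcher.add('DUCK_VERB', [[{'LOWER': 'duck', 'POS': 'VERB'}]])

verb_matches = matcher(verb_doc)
noun_matches = matcher(noun_doc)

print(regex_hits)         # the string "duck" occurs in both sentences
print(len(verb_matches))  # "duck" tagged as a verb is matched
print(len(noun_matches))  # "duck" tagged as a noun is not
```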

Match patterns

Lists of dictionaries, one per token
Match exact token texts
```
[{'ORTH': 'iPhone'}, {'ORTH': 'X'}]
```
Match lexical attributes
```
[{'LOWER': 'iphone'}, {'LOWER': 'x'}]
```
Match any token attributes
```
[{'LEMMA': 'buy'}, {'POS': 'NOUN'}]
```
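
To see how `ORTH` and `LOWER` differ, here is a small sketch, assuming spaCy v3 and a blank English pipeline (lexical attributes need no trained model). The helper `matched_texts`, the `DEMO` pattern name, and the example sentence are hypothetical.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank('en')
doc = nlp("She sold her iphone x and bought an iPhone X")

def matched_texts(pattern):
    """Run a single pattern over the doc and return the matched texts."""
    matcher = Matcher(nlp.vocab)
    matcher.add('DEMO', [pattern])  # spaCy v3: patterns are passed as a list
    return [doc[start:end].text for _, start, end in matcher(doc)]

# ORTH compares the exact token text, so only the capitalized form matches
exact = matched_texts([{'ORTH': 'iPhone'}, {'ORTH': 'X'}])

# LOWER compares the lowercased token text, so both spellings match
lower = matched_texts([{'LOWER': 'iphone'}, {'LOWER': 'x'}])

print(exact)
print(lower)
```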

Using the Matcher (1)

import spacy

# Import the Matcher
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load('en_core_web_sm')

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

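# Add the pattern to the matcher
# (spaCy v2 signature; spaCy v3 expects a list of patterns instead:
#  matcher.add('IPHONE_PATTERN', [pattern]))
pattern = [{'ORTH': 'iPhone'}, {'ORTH': 'X'}]
matcher.add('IPHONE_PATTERN', None, pattern)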

# Process some text
doc = nlp("New iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)

Using the Matcher (2)

# Process the text and call the matcher on the doc
doc = nlp("New iPhone X release date leaked")
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:

    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

iPhone X

match_id: hash value of the pattern name
start: token index where the matched span starts
end: token index where the matched span ends (exclusive)
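
Because `match_id` is a hash, it can be turned back into the pattern name through the vocab's string store. A minimal sketch, assuming spaCy v3 and a blank English pipeline in place of the trained model:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank('en')
matcher = Matcher(nlp.vocab)
matcher.add('IPHONE_PATTERN', [[{'ORTH': 'iPhone'}, {'ORTH': 'X'}]])

doc = nlp("New iPhone X release date leaked")

results = []
for match_id, start, end in matcher(doc):
    name = nlp.vocab.strings[match_id]  # look up the pattern name by its hash
    span = doc[start:end]               # end is exclusive, like a Python slice
    results.append((name, start, end, span.text))
    print(name, start, end, span.text)  # IPHONE_PATTERN 1 3 iPhone X
```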

Matching lexical attributes

pattern = [
    {'IS_DIGIT': True},
    {'LOWER': 'fifa'},
    {'LOWER': 'world'},
    {'LOWER': 'cup'},
    {'IS_PUNCT': True}
]

doc = nlp("2018 FIFA World Cup: France won!")

2018 FIFA World Cup:
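
The slide omits the matcher calls. A complete version might look like this, assuming spaCy v3 and a blank English pipeline (lexical attributes such as `IS_DIGIT` and `IS_PUNCT` need no trained model); the name `FIFA_PATTERN` is made up.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank('en')
matcher = Matcher(nlp.vocab)

pattern = [
    {'IS_DIGIT': True},   # a token made up of digits, e.g. "2018"
    {'LOWER': 'fifa'},
    {'LOWER': 'world'},
    {'LOWER': 'cup'},
    {'IS_PUNCT': True}    # any punctuation token, here ":"
]
matcher.add('FIFA_PATTERN', [pattern])

doc = nlp("2018 FIFA World Cup: France won!")
fifa_matches = [doc[start:end].text for _, start, end in matcher(doc)]
print(fifa_matches)  # ['2018 FIFA World Cup:']
```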

Matching other token attributes

pattern = [
    {'LEMMA': 'love', 'POS': 'VERB'},
    {'POS': 'NOUN'}
]

doc = nlp("I loved dogs but now I love cats more.")

loved dogs
love cats
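
`LEMMA` and `POS` normally come from the trained pipeline. The sketch below assumes spaCy v3 and assigns lemmas and tags by hand, as stand-ins for what the model would predict, so that it runs without a model download; the name `LOVE_PATTERN` is made up.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank('en')

# Hand-assigned annotations (assumption: approximating the model's output)
doc = Doc(
    nlp.vocab,
    words=['I', 'loved', 'dogs', 'but', 'now', 'I', 'love', 'cats', 'more', '.'],
    lemmas=['I', 'love', 'dog', 'but', 'now', 'I', 'love', 'cat', 'more', '.'],
    pos=['PRON', 'VERB', 'NOUN', 'CCONJ', 'ADV', 'PRON', 'VERB', 'NOUN', 'ADV', 'PUNCT'],
)

matcher = Matcher(nlp.vocab)
pattern = [
    {'LEMMA': 'love', 'POS': 'VERB'},  # any verb form of "love"
    {'POS': 'NOUN'}                    # followed by a noun
]
matcher.add('LOVE_PATTERN', [pattern])

love_matches = sorted(doc[start:end].text for _, start, end in matcher(doc))
print(love_matches)
```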

Using operators and quantifiers (1)

pattern = [
    {'LEMMA': 'buy'},
    {'POS': 'DET', 'OP': '?'},  # optional: match 0 or 1 times
    {'POS': 'NOUN'}
]

doc = nlp("I bought a smartphone. Now I'm buying apps.")

bought a smartphone
buying apps

Using operators and quantifiers (2)

| Operator | Description |
| --- | --- |
| `{'OP': '!'}` | Negation: match 0 times |
| `{'OP': '?'}` | Optional: match 0 or 1 times |
| `{'OP': '+'}` | Match 1 or more times |
| `{'OP': '*'}` | Match 0 or more times |
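
The `?`, `+` and `*` operators can be tried out as below (`!` is not covered). This is a sketch, assuming spaCy v3 and a blank English pipeline. With `+` and `*` the matcher can report several overlapping spans, so the hypothetical helper `longest_matches` keeps only the longest non-overlapping ones using `spacy.util.filter_spans`; the example sentences are made up.

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.blank('en')

def longest_matches(pattern, text):
    """Return the texts of the longest non-overlapping matches, in document order."""
    matcher = Matcher(nlp.vocab)
    matcher.add('DEMO', [pattern])
    doc = nlp(text)
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    return [span.text for span in filter_spans(spans)]

# '?': the article "a" may appear once or not at all
optional = longest_matches(
    [{'LOWER': 'buy'}, {'LOWER': 'a', 'OP': '?'}, {'LOWER': 'phone'}],
    "buy a phone and buy phone")

# '+': at least one "very" is required before "good"
one_or_more = longest_matches(
    [{'LOWER': 'very', 'OP': '+'}, {'LOWER': 'good'}],
    "very very good but good")

# '*': any number of "very", including none
zero_or_more = longest_matches(
    [{'LOWER': 'very', 'OP': '*'}, {'LOWER': 'good'}],
    "very very good but good")

print(optional)      # ['buy a phone', 'buy phone']
print(one_or_more)   # ['very very good']
print(zero_or_more)  # ['very very good', 'good']
```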

Let's practice!
