RegEx with spaCy

Natural Language Processing with spaCy

Azadeh Mobasher

Principal Data Scientist

What is RegEx?

 

  • Rule-based information extraction (IE) is useful for many NLP tasks
  • A regular expression (RegEx) describes a complex string-matching pattern
  • RegEx can find and retrieve matching text, or replace it

RegEx: link and phone information extraction
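
As a rough illustration of this kind of extraction, the sketch below pulls a link and a phone number out of a sentence with Python's built-in re module. The sample sentence and both patterns are assumptions made up for this example; they are deliberately simple and would miss many real-world URL and phone formats.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "Visit https://example.com/contact or call 832-123-5555 for details."

# Assumed, deliberately simple patterns (not production-grade).
link_pattern = r"https?://\S+"        # "http://" or "https://" followed by non-space characters
phone_pattern = r"\d{3}-\d{3}-\d{4}"  # three digits, three digits, four digits, dash-separated

links = re.findall(link_pattern, text)
phones = re.findall(phone_pattern, text)

print(links)
print(phones)
```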


RegEx strengths and weaknesses

Pros:

  • Enables writing robust rules to retrieve information
  • Lets us match many variations of a string
  • Runs fast
  • Supported by most programming languages

Cons:

  • Syntax is challenging for beginners
  • Requires knowing in advance every form a pattern may take in text
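
The last point can be seen in a small sketch: a pattern written for one phone format silently misses the others. The sample strings and the "flexible" pattern are assumptions chosen for illustration, and even the flexible version covers only the variants listed here.

```python
import re

# Hypothetical ways the same phone number might be written.
samples = ["832-123-5555", "(832) 123-5555", "832.123.5555", "832 123 5555"]

# Strict pattern: only the dash-separated form.
strict = re.compile(r"\d{3}-\d{3}-\d{4}")

# Assumed broader pattern: optional parentheses, and a dash, dot, or space as separator.
flexible = re.compile(r"\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}")

strict_hits = [s for s in samples if strict.fullmatch(s)]
flexible_hits = [s for s in samples if flexible.fullmatch(s)]

print("strict:  ", strict_hits)
print("flexible:", flexible_hits)
```

Each new variant has to be anticipated and written into the pattern, which is why the rule author needs to know the formats up front.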

RegEx in Python

 

  • Python comes prepackaged with a RegEx library, re.
  • The first step in using the re package is to define a pattern.
  • The resulting pattern is used to find matching content.

 

import re

pattern = r"((\d){3}-(\d){3}-(\d){4})"
text = "Our phone number is 832-123-5555 and their phone number is 425-123-4567."

RegEx in Python

 

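  • We use the finditer() function from the re package
# finditer() returns an iterator of match objects, one per non-overlapping match
iter_matches = re.finditer(pattern, text)

for match in iter_matches:
    # start() and end() give the character offsets of the match in text
    start_char = match.start()
    end_char = match.end()
    print("Start character: ", start_char, "| End character: ", end_char, "| Matching text: ", text[start_char:end_char])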
>>> Start character:  20 | End character:  32 | Matching text:  832-123-5555
Start character:  59 | End character:  71 | Matching text:  425-123-4567
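
RegEx can also replace matches rather than just locate them. A minimal sketch, assuming we want to mask phone numbers with a placeholder: re.sub() substitutes every match of the pattern. The "[PHONE]" placeholder and the simplified pattern without capture groups are choices made for this example.

```python
import re

# Same phone format as above, written without capture groups.
pattern = r"\d{3}-\d{3}-\d{4}"
text = "Our phone number is 832-123-5555 and their phone number is 425-123-4567."

# Replace every match with a placeholder (the placeholder text is an arbitrary choice).
masked = re.sub(pattern, "[PHONE]", text)
print(masked)
```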

RegEx in spaCy

  • spaCy offers rule-based matching, including RegEx-style token patterns, through Matcher, PhraseMatcher and EntityRuler.
text = "Our phone number is 832-123-5555 and their phone number is 425-123-4567."

import spacy

# Blank English pipeline: tokenizer only, no trained components
nlp = spacy.blank("en")
# One token-level pattern: 3 digits, "-", 3 digits, "-", 4 digits
patterns = [{"label": "PHONE_NUMBER", "pattern": [{"SHAPE": "ddd"}, {"ORTH": "-"}, {"SHAPE": "ddd"}, {"ORTH": "-"}, {"SHAPE": "dddd"}]}]
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)
doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])
>>> [('832-123-5555', 'PHONE_NUMBER'), ('425-123-4567', 'PHONE_NUMBER')]
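
The EntityRuler patterns above rely on token shapes rather than a regular expression. As a hedged sketch of the same idea with explicit RegEx, spaCy's Matcher accepts a REGEX operator on token attributes such as TEXT. The pattern below is an assumption written for this example; it mirrors the shape-based rule and expects the tokenizer to split the number on the dashes, as it does in the output above.

```python
import spacy
from spacy.matcher import Matcher

text = "Our phone number is 832-123-5555 and their phone number is 425-123-4567."

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Each dict describes one token; REGEX is applied to that token's text.
# The anchors ^ and $ make the expression cover the whole token.
pattern = [
    {"TEXT": {"REGEX": r"^\d{3}$"}},
    {"ORTH": "-"},
    {"TEXT": {"REGEX": r"^\d{3}$"}},
    {"ORTH": "-"},
    {"TEXT": {"REGEX": r"^\d{4}$"}},
]
matcher.add("PHONE_NUMBER", [pattern])

doc = nlp(text)
# Each match is (match_id, start_token, end_token); slice the doc to get the span text.
matches = [doc[start:end].text for _, start, end in matcher(doc)]
print(matches)
```

Note that the expressions here run per token, not over the raw string, so a pattern that spans several tokens has to be split into one entry per token.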

Let's practice!
