Advanced tokenization with regex

Introduction to Natural Language Processing in Python

Katharine Jarmul

Founder, kjamistan

Regex groups using or "|"

  • OR is represented using |
  • You can define a group using ()
  • You can define explicit character ranges using []
import re
match_digits_and_words = r'(\d+|\w+)'

re.findall(match_digits_and_words, 'He has 11 cats.')
['He', 'has', '11', 'cats']
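
The order of the alternatives inside a group can change the tokens you get back, because the regex engine tries them left to right at each position. The sketch below illustrates this with a made-up input string ('He has 11cats.', not from the slides) in which digits and letters run together; the variable names are illustrative only.

```python
import re

# Hypothetical input: digits and letters are not separated by a space.
text = 'He has 11cats.'

# Digits first: \d+ is tried before \w+ at each position,
# so the leading digits are split off as their own token.
digits_first = re.findall(r'(\d+|\w+)', text)

# Words first: \w also matches digits, so \w+ consumes the whole run
# and the \d+ alternative is never needed.
words_first = re.findall(r'(\w+|\d+)', text)

print(digits_first)  # ['He', 'has', '11', 'cats']
print(words_first)   # ['He', 'has', '11cats']
```

When alternatives overlap, putting the more specific one first is usually what you want.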

Regex ranges and groups

pattern          matches                                          example
[A-Za-z]+        upper and lowercase English alphabet             'ABCDEFghijk'
[0-9]            a single digit from 0 to 9                       9
[A-Za-z\-\.]+    upper and lowercase English alphabet, - and .    'My-Website.com'
(a-z)            the literal sequence a, - and z                  'a-z'
(\s+|,)          spaces or a comma                                ', '
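
One way to check the rows above is to run each pattern against its example with re.findall(). This is a small sketch, not part of the original slides; it assumes the last pattern is the alternation (\s+|,), and the names examples and results are illustrative.

```python
import re

# (pattern, example string) pairs taken from the table.
examples = [
    (r'[A-Za-z]+', 'ABCDEFghijk'),
    (r'[0-9]', '9'),
    (r'[A-Za-z\-\.]+', 'My-Website.com'),
    (r'(a-z)', 'a-z'),    # parentheses form a group, not a range
    (r'(\s+|,)', ', '),   # whitespace run OR a single comma
]

# Collect every match of each pattern in its example string.
results = {pattern: re.findall(pattern, text) for pattern, text in examples}

for pattern, found in results.items():
    print(pattern, found)
```

Note that the last pattern returns two matches for ', ' (the comma, then the space), because each alternative matches separately rather than as one token.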

Character range with `re.match()`

import re 
my_str = 'match lowercase spaces nums like 12, but no commas'
re.match('[a-z0-9 ]+', my_str)
<_sre.SRE_Match object; 
          span=(0, 35), match='match lowercase spaces nums like 12'>
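
re.match() only tries to match at the beginning of the string, and it stops at the first character outside the range (here, the comma). The sketch below, which reuses my_str from above, contrasts it with re.search(), which scans forward for the first match; the variable names m, no_match and found are illustrative.

```python
import re

my_str = 'match lowercase spaces nums like 12, but no commas'

# The character range matches from position 0 up to (not including) the comma.
m = re.match(r'[a-z0-9 ]+', my_str)
print(m.group())  # match lowercase spaces nums like 12
print(m.span())   # (0, 35)

# re.match() is anchored at the start: the string does not begin with a digit,
# so this returns None.
no_match = re.match(r'[0-9]+', my_str)
print(no_match)   # None

# re.search() looks for the first match anywhere in the string.
found = re.search(r'[0-9]+', my_str)
print(found.group())  # 12
```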

Let's practice!
