Advanced tokenization with regex

Introduction to Natural Language Processing in Python

Katharine Jarmul

Founder, kjamistan

Regex groups using or "|"

import re
match_digits_and_words = r'(\d+|\w+)'

re.findall(match_digits_and_words, 'He has 11 cats.')

['He', 'has', '11', 'cats']
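A minimal sketch of how the alternation above behaves: at each position, `re.findall` tries the alternatives left to right and returns the text captured by the group. The second pattern, `match_with_punctuation`, is a hypothetical variant not shown on the slide; it adds a third alternative so that single punctuation characters are kept as tokens.

```python
import re

# Same pattern as on the slide, written as a raw string.
match_digits_and_words = r'(\d+|\w+)'

# Hypothetical variant (not from the slide): a third alternative that
# also keeps single punctuation characters as tokens.
match_with_punctuation = r'(\d+|\w+|[^\w\s])'

sentence = 'He has 11 cats.'
tokens = re.findall(match_digits_and_words, sentence)
tokens_with_punct = re.findall(match_with_punctuation, sentence)

print(tokens)             # ['He', 'has', '11', 'cats']
print(tokens_with_punct)  # ['He', 'has', '11', 'cats', '.']
```

Note that `\w` also matches digits, so `\w+` alone would return the same tokens for this sentence; listing `\d+` first mainly makes the intent explicit.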

pattern	matches	example
[A-Za-z]+	upper and lowercase English alphabet	'ABCDEFghijk'
[0-9]	a single digit from 0 to 9	'9'
[A-Za-z\-\.]+	upper and lowercase English alphabet, - and .	'My-Website.com'
(a-z)	a, - and z	'a-z'
(\s+|,)	spaces or a comma	', '
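The rows of the table can be checked with a short sketch like the one below. It assumes `re.fullmatch` is an acceptable way to test whether an example matches its pattern in full; the variable names (`rows`, `results`, `separators`) and the test string `'a, b'` are illustrative only. The last row, "spaces or a comma", is written here as a group with `|`.

```python
import re

# (pattern, example) pairs copied from the table.
rows = [
    (r'[A-Za-z]+', 'ABCDEFghijk'),
    (r'[0-9]', '9'),
    (r'[A-Za-z\-\.]+', 'My-Website.com'),
    (r'(a-z)', 'a-z'),
]

# re.fullmatch succeeds only if the whole example matches the pattern.
results = [re.fullmatch(pattern, example) is not None
           for pattern, example in rows]
print(results)  # [True, True, True, True]

# Parentheses form a group, not a range: (a-z) does not match the letter 'b'.
group_matches_b = re.fullmatch(r'(a-z)', 'b') is not None
print(group_matches_b)  # False

# The group with | matches either a run of whitespace or a single comma.
separators = re.findall(r'(\s+|,)', 'a, b')
print(separators)  # [',', ' ']
```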

import re 
my_str = 'match lowercase spaces nums like 12, but no commas'
re.match('[a-z0-9 ]+', my_str)

<_sre.SRE_Match object; span=(0, 35), match='match lowercase spaces nums like 12'>
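To make the behavior of `re.match` with a character range more concrete, here is a small sketch. The second pattern, which adds a comma to the class, and the `'Match'` test string are hypothetical variations for illustration and do not come from the slides.

```python
import re

my_str = 'match lowercase spaces nums like 12, but no commas'

# re.match anchors at the start of the string and stops at the first
# character outside the class, which here is the comma.
m = re.match(r'[a-z0-9 ]+', my_str)
print(m.span())   # (0, 35)
print(m.group())  # match lowercase spaces nums like 12

# Hypothetical variant: adding the comma to the class lets the match
# run to the end of the string.
m_all = re.match(r'[a-z0-9, ]+', my_str)
print(m_all.group() == my_str)  # True

# re.match returns None when the first character is not in the class.
print(re.match(r'[a-z0-9 ]+', 'Match'))  # None
```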
