Introduction to Natural Language Processing in Python
Katharine Jarmul
Founder, kjamistan
nltk: natural language toolkit
from nltk.tokenize import word_tokenize
word_tokenize("Hi there!")
['Hi', 'there', '!']
sent_tokenize: tokenize a document into sentences
regexp_tokenize: tokenize a string or document based on a regular expression pattern
TweetTokenizer: special class just for tweet tokenization, allowing you to separate hashtags, mentions and lots of exclamation points!!!
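To make pattern-based tokenization concrete, here is a minimal sketch that uses only the standard library's re.findall rather than nltk itself. The helper name simple_tweet_tokens and the pattern are assumptions made up for this illustration; nltk's regexp_tokenize and TweetTokenizer handle many more cases than this does.

```python
import re

# Hypothetical pattern for illustration only: a hashtag or mention,
# or a run of word characters, or a single punctuation mark.
TWEET_PATTERN = r"[#@]\w+|\w+|[^\w\s]"


def simple_tweet_tokens(text):
    """Return every non-overlapping match of TWEET_PATTERN, left to right."""
    return re.findall(TWEET_PATTERN, text)


tokens = simple_tweet_tokens("Thanks @nltk, loving #nlp!!!")
print(tokens)
# ['Thanks', '@nltk', ',', 'loving', '#nlp', '!', '!', '!']
```

Putting the hashtag/mention alternative first keeps '#nlp' and '@nltk' together as single tokens; otherwise the '#' and '@' would be split off as punctuation.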
re.search() and re.match()
import re
re.match('abc', 'abcde')
<_sre.SRE_Match object; span=(0, 3), match='abc'>
re.search('abc', 'abcde')
<_sre.SRE_Match object; span=(0, 3), match='abc'>
re.match('cd', 'abcde')  # no output: match() only tries the start of the string, so this returns None
re.search('cd', 'abcde')
<_sre.SRE_Match object; span=(2, 4), match='cd'>
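The difference between the two calls can be checked directly. The sketch below uses only the standard library; the variable names at_start and anywhere are chosen for this example.

```python
import re

text = 'abcde'

# match() succeeds only if the pattern matches starting at position 0.
at_start = re.match('cd', text)

# search() scans forward and returns the first match anywhere in the string.
anywhere = re.search('cd', text)

print(at_start)          # None
print(anywhere.span())   # (2, 4)
print(anywhere.group())  # cd

# Both functions return None on failure, so test the result before using it.
if at_start is None:
    print("'cd' does not occur at the start of the string")
```

Because a failed match returns None rather than raising an error, calling .group() or .span() on an unchecked result is a common source of AttributeError.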