Capturing a token pattern

Sentiment Analysis in Python

Violeta Misheva

Data Scientist

String operators and comparisons

# Checks if a string is composed only of letters  
my_string.isalpha()
# Checks if a string is composed only of digits 
my_string.isdigit()
# Checks if a string is composed only of alphanumeric characters
my_string.isalnum()
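
As a quick illustration of how the three methods above differ, the sketch below applies them to a few made-up strings. The sample values and the names `samples` and `results` are assumptions for demonstration only, not part of the course data.

```python
# Hypothetical sample strings, chosen to show where the methods disagree
samples = ["hello", "2023", "abc123", "good-day", ""]

# For each string, record (isalpha, isdigit, isalnum)
results = {s: (s.isalpha(), s.isdigit(), s.isalnum()) for s in samples}

for s, flags in results.items():
    print(repr(s), flags)
```

Note that a string containing punctuation (such as a hyphen) fails all three checks, and so does the empty string.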

String operators with list comprehension

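from nltk import word_tokenize
# Original word tokenization: one list of tokens per review
word_tokens = [word_tokenize(review) for review in reviews.review]
# Keep only tokens composed entirely of letters
cleaned_tokens = [[word for word in item if word.isalpha()] for item in word_tokens]
# Number of tokens in the first review before cleaning
len(word_tokens[0])
87
# Number of tokens in the first review after cleaning
len(cleaned_tokens[0])
78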
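
The same filtering idea can be tried on a tiny self-contained example. This is only a sketch: the two reviews are invented, and `str.split()` stands in for `word_tokenize` so that the snippet runs without NLTK (an assumption; the real tokenizer also separates punctuation attached to words).

```python
# Hypothetical mini-corpus; whitespace splitting replaces word_tokenize here
toy_reviews = ["I loved it 100 %", "Bad , 2 stars"]

# One list of tokens per review
toy_tokens = [review.split() for review in toy_reviews]

# Keep only tokens composed entirely of letters
toy_cleaned = [[word for word in item if word.isalpha()] for item in toy_tokens]

print(toy_tokens)
print(toy_cleaned)
```

Digits and punctuation tokens are dropped, which is why the cleaned lists are shorter than the originals.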

Regular expressions

import re
my_string = '#Wonderfulday'
# Match '#' followed by a single letter, lowercase or uppercase
x = re.search(r'#[A-Za-z]', my_string)
x
<re.Match object; span=(0, 2), match='#W'>
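
The pattern above matches only one letter after the `#`. A possible extension, shown here as a sketch, adds the `+` quantifier to capture the whole hashtag and uses `re.findall` to collect every match in a longer string. The second input string is made up for illustration.

```python
import re

my_string = '#Wonderfulday'

# Without a quantifier, only '#' plus one letter is matched
single = re.search(r'#[A-Za-z]', my_string)

# With '+', one or more letters are matched after '#'
whole = re.search(r'#[A-Za-z]+', my_string)

# findall returns every non-overlapping match; the text is a hypothetical example
hashtags = re.findall(r'#[A-Za-z]+', 'Such a #Wonderfulday with #friends2go')

print(single.group(), whole.group(), hashtags)
```

Because the character class contains only letters, the match for `#friends2go` stops at the first digit.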

Token pattern with a BOW

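from sklearn.feature_extraction.text import CountVectorizer
# Default token pattern in CountVectorizer: tokens of two or more word characters
r'(?u)\b\w\w+\b'
# Specify a particular token pattern: two or more word characters, none of them digits
CountVectorizer(token_pattern=r'\b[^\d\W][^\d\W]+\b')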
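
To see what the two patterns above actually keep, they can be applied directly with `re.findall`, which approximates how a token pattern is used to split text. This is a sketch using only the standard library: the sentence is invented, and the comparison ignores other CountVectorizer steps such as lowercasing and stop-word removal.

```python
import re

# Hypothetical review text, already lowercase apart from "I"
text = "I gave it 10 out of 10 in 2019, wifi5 included"

# Default pattern: any run of two or more word characters (digits allowed)
default_pattern = r'(?u)\b\w\w+\b'
# Custom pattern: two or more word characters, none of which is a digit
letters_pattern = r'\b[^\d\W][^\d\W]+\b'

default_tokens = re.findall(default_pattern, text)
letters_tokens = re.findall(letters_pattern, text)

print(default_tokens)
print(letters_tokens)
```

Both patterns drop the one-character token "I". The custom pattern additionally drops pure numbers and mixed tokens such as "wifi5", because no word boundary falls between a letter and a digit.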

Let's practice!
