Build new features from text

Sentiment Analysis in Python

Violeta Misheva

Data Scientist

Goal of the video

 

Goal: Enrich the existing dataset with new features derived from the text column that help capture the sentiment

Product reviews data

reviews.head()

top 5 rows of the Amazon product reviews

Features from the review column

 

  • How long is each review?
  • How many sentences does it contain?
  • What parts of speech are involved?
  • How many punctuation marks?
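
The first two questions above can be sketched with the standard library alone. This is a rough illustration, not the course's method: the helper name `basic_text_features`, the sample review, and the regex rules for words and sentences are all assumptions, and a real tokenizer such as NLTK's would handle edge cases (abbreviations, contractions) more carefully.

```python
import re

def basic_text_features(text):
    """Return simple length-based features for one review (hypothetical helper)."""
    # Words: runs of letters, digits, or apostrophes (a crude stand-in for a tokenizer)
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Sentences: split on whitespace that follows ., !, or ? (an approximation)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return {
        "n_chars": len(text),
        "n_words": len(words),
        "n_sentences": len(sentences),
    }

# Made-up review text, used only for illustration
sample = "Great product. Works as described! Would I buy it again? Yes."
features = basic_text_features(sample)
print(features)
```

Each key of the returned dictionary could become one new column of the dataset.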

Tokenizing a string

from nltk import word_tokenize
anna_k = 'Happy families are all alike, every unhappy family is unhappy in its own way.'
word_tokenize(anna_k)

['Happy', 'families', 'are', 'all', 'alike', ',',
 'every', 'unhappy', 'family', 'is', 'unhappy', 'in',
 'its', 'own', 'way', '.']

Tokens from a column

# General form of a list comprehension:
#   [expression for item in iterable]

# Tokenize every review; the result is a list of token lists
word_tokens = [word_tokenize(review) for review in reviews.review]

type(word_tokens)
list

type(word_tokens[0])
list
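
To make the list-of-lists structure concrete, here is a minimal sketch that compares the comprehension with an explicit loop. The two sample reviews are made up, and `str.split` is used only as a crude substitute for `word_tokenize` so that the snippet runs without NLTK.

```python
# Hypothetical stand-in for the reviews.review column
reviews_text = ["I love it", "Not what I expected at all"]

# List comprehension: one list of tokens per review
# (str.split is a simplification; the slides use word_tokenize)
word_tokens = [review.split() for review in reviews_text]

# Equivalent explicit loop
word_tokens_loop = []
for review in reviews_text:
    word_tokens_loop.append(review.split())

print(word_tokens)
print(word_tokens == word_tokens_loop)
```

The outer list has one entry per review, and each entry is itself a list of tokens, which is why both `type` calls return `list`.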

Tokens from a column

len_tokens = []

# Iterate over the word_tokens list
for tokens in word_tokens:
    len_tokens.append(len(tokens))

# Create a new feature for the length of each review
reviews['n_tokens'] = len_tokens
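
The same lengths can also be built in one line with a list comprehension. The sketch below uses two made-up token lists in place of the real `word_tokens`, and adds a sanity check that there is exactly one length per review before the list is assigned as a column.

```python
# Hypothetical tokenized reviews, standing in for word_tokens
word_tokens = [
    ["Great", "value", "!"],
    ["Stopped", "working", "after", "a", "week", "."],
]

# One length per review, in the same order as the reviews
len_tokens = [len(tokens) for tokens in word_tokens]

# A column assignment needs exactly one value per row
assert len(len_tokens) == len(word_tokens)

print(len_tokens)  # [3, 6]
```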

Dealing with punctuation

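  • We did not remove punctuation: word_tokenize returns each punctuation mark as its own token, so n_tokens counts them too. You can exclude them if you want a word-only count
  • You can also create a feature that counts the punctuation marks in each review
    • A review with many punctuation marks could signal an emotionally charged opinion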
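
Both ideas above, excluding punctuation and counting it, can be sketched on an already tokenized review. The helper names `is_punctuation` and `split_punctuation` are hypothetical, and treating a token as punctuation when all of its characters are in `string.punctuation` is an assumption that only covers ASCII punctuation.

```python
import string

def is_punctuation(token):
    """True if the token is non-empty and made up only of ASCII punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)

def split_punctuation(tokens):
    """Return (word tokens, number of punctuation tokens) for one review."""
    words = [t for t in tokens if not is_punctuation(t)]
    return words, len(tokens) - len(words)

# Token list taken from the tokenizing example earlier in the slides
anna_k_tokens = ['Happy', 'families', 'are', 'all', 'alike', ',',
                 'every', 'unhappy', 'family', 'is', 'unhappy', 'in',
                 'its', 'own', 'way', '.']

words, n_punct = split_punctuation(anna_k_tokens)
print(len(words), n_punct)  # 14 2
```

Applied to every review, `n_punct` would give the punctuation-count feature, and `len(words)` a token count that ignores punctuation.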

Reviews with a feature for the length

reviews.head()

top 5 rows of Amazon product reviews, including the added column for length of a review

Let's practice!
