Getting granular with n-grams

Sentiment Analysis in Python

Violeta Misheva

Data Scientist

Context matters

I am happy, not sad.

I am sad, not happy.

  • Putting 'not' in front of a word (negation) is one example of how context matters.
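A quick way to see this is to compare unigram and bigram features for the two sentences. A minimal sketch, assuming scikit-learn's CountVectorizer (introduced later in this lesson) and treating the two sentences as the whole corpus:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I am happy, not sad.", "I am sad, not happy."]

# With unigrams only, both sentences produce identical count vectors,
# so the negation is lost
unigrams = CountVectorizer(ngram_range=(1, 1))
print(unigrams.fit_transform(docs).toarray())

# With bigrams, features such as 'not sad' and 'not happy'
# keep the two sentences distinct
bigrams = CountVectorizer(ngram_range=(2, 2))
print(bigrams.fit_transform(docs).toarray())
print(bigrams.get_feature_names_out())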

Capturing context with a BOW

  • Unigrams: single tokens

  • Bigrams: pairs of tokens

  • Trigrams: triples of tokens

  • n-grams: sequences of n tokens


Capturing context with a BOW

The weather today is wonderful.

  • Unigrams: {The, weather, today, is, wonderful}

  • Bigrams: {The weather, weather today, today is, is wonderful}

  • Trigrams: {The weather today, weather today is, today is wonderful}
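This breakdown can be reproduced with the CountVectorizer shown on the next slide. A minimal sketch, assuming ngram_range=(1, 3); note that the vectorizer lowercases tokens by default:

from sklearn.feature_extraction.text import CountVectorizer

sentence = ["The weather today is wonderful."]

# Extract unigrams, bigrams and trigrams in one pass
vect = CountVectorizer(ngram_range=(1, 3))
vect.fit(sentence)

# Prints the lowercased unigrams, bigrams and trigrams of the sentence
print(vect.get_feature_names_out())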


n-grams with the CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(min_n, max_n) sets the sizes of n-grams to extract

# Only unigrams
vect = CountVectorizer(ngram_range=(1, 1))

# Uni- and bigrams
vect = CountVectorizer(ngram_range=(1, 2))
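A short usage example, assuming a small made-up corpus; fit_transform learns the n-gram vocabulary and returns the document-term matrix:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["The weather today is wonderful.", "The weather yesterday was awful."]

vect = CountVectorizer(ngram_range=(1, 2))
X = vect.fit_transform(corpus)

# One row per document, one column per unigram or bigram
print(X.shape)
print(vect.get_feature_names_out())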

What is the best n?

Longer sequences of tokens:
  • Result in more features
  • Can increase the precision of machine learning models
  • Carry a higher risk of overfitting
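The trade-off is easy to see by counting features for different n-gram ranges. A minimal sketch on a small, made-up corpus; the vocabulary grows quickly as max_n increases:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The weather today is wonderful.",
    "I am happy, not sad.",
    "I am sad, not happy.",
]

# Vocabulary size for unigrams, uni+bigrams, and uni+bi+trigrams
for max_n in (1, 2, 3):
    vect = CountVectorizer(ngram_range=(1, max_n)).fit(corpus)
    print(max_n, len(vect.get_feature_names_out()))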

Specifying vocabulary size

CountVectorizer(max_features, max_df, min_df)
  • max_features: if specified, only the top max_features most frequent terms are included in the vocabulary
    • If max_features=None (the default), all terms are included
  • max_df: ignore terms with a document frequency higher than the given threshold
    • An integer is treated as an absolute document count; a float as a proportion of documents
    • Default is 1.0, which means no terms are ignored
  • min_df: ignore terms with a document frequency lower than the given threshold
    • An integer is treated as an absolute document count; a float as a proportion of documents
    • Default is 1, which means no terms are ignored
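These parameters can be combined. A minimal sketch on a made-up corpus, with illustrative (not prescriptive) values:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The movie was great, really great.",
    "The movie was terrible.",
    "A great film with a terrible ending.",
    "Not a great movie at all.",
]

# Keep at most 1000 terms, drop terms appearing in more than 90% of
# documents or in fewer than 2 documents
vect = CountVectorizer(max_features=1000, max_df=0.9, min_df=2)
X = vect.fit_transform(corpus)

# One row per document, one column per term that survived the filters
print(X.shape)
print(vect.get_feature_names_out())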

Let's practice!
