Getting granular with n-grams

Sentiment Analysis in Python

Violeta Misheva

Data Scientist

Context matters

I am happy, not sad.

I am sad, not happy.

  • Putting 'not' in front of a word (negation) is one example of how context matters.
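A quick way to see this is to compare unigram and bigram features for the two sentences. A minimal sketch, assuming scikit-learn's CountVectorizer (introduced later in this lesson) and treating the two sentences as the whole corpus:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I am happy, not sad.", "I am sad, not happy."]

# With unigrams only, both sentences produce identical count vectors,
# so the negation is lost
unigrams = CountVectorizer(ngram_range=(1, 1))
print(unigrams.fit_transform(docs).toarray())

# With bigrams, features such as 'not sad' and 'not happy'
# keep the two sentences distinct
bigrams = CountVectorizer(ngram_range=(2, 2))
print(bigrams.fit_transform(docs).toarray())
print(bigrams.get_feature_names_out())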

Capturing context with a BOW

  • Unigrams: single tokens

  • Bigrams: pairs of tokens

  • Trigrams: triples of tokens

  • n-grams: sequences of n tokens


Capturing context with a BOW

The weather today is wonderful.

  • Unigrams: {The, weather, today, is, wonderful}

  • Bigrams: {The weather, weather today, today is, is wonderful}

  • Trigrams: {The weather today, weather today is, today is wonderful}
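This breakdown can be reproduced with the CountVectorizer shown on the next slide. A minimal sketch, assuming ngram_range=(1, 3); note that the vectorizer lowercases tokens by default:

from sklearn.feature_extraction.text import CountVectorizer

sentence = ["The weather today is wonderful."]

# Extract unigrams, bigrams and trigrams in one pass
vect = CountVectorizer(ngram_range=(1, 3))
vect.fit(sentence)

# Prints the lowercased unigrams, bigrams and trigrams of the sentence
print(vect.get_feature_names_out())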


n-grams with the CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(min_n, max_n) sets the sizes of n-grams to extract

# Only unigrams
vect = CountVectorizer(ngram_range=(1, 1))

# Uni- and bigrams
vect = CountVectorizer(ngram_range=(1, 2))
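A short usage example, assuming a small made-up corpus; fit_transform learns the n-gram vocabulary and returns the document-term matrix:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["The weather today is wonderful.", "The weather yesterday was awful."]

vect = CountVectorizer(ngram_range=(1, 2))
X = vect.fit_transform(corpus)

# One row per document, one column per unigram or bigram
print(X.shape)
print(vect.get_feature_names_out())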

What is the best n?

Longer sequences of tokens:
  • Result in more features
  • Can increase the precision of machine learning models
  • Carry a higher risk of overfitting
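The trade-off is easy to see by counting features for different n-gram ranges. A minimal sketch on a small, made-up corpus; the vocabulary grows quickly as max_n increases:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The weather today is wonderful.",
    "I am happy, not sad.",
    "I am sad, not happy.",
]

# Vocabulary size for unigrams, uni+bigrams, and uni+bi+trigrams
for max_n in (1, 2, 3):
    vect = CountVectorizer(ngram_range=(1, max_n)).fit(corpus)
    print(max_n, len(vect.get_feature_names_out()))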

Specifying vocabulary size

CountVectorizer(max_features, max_df, min_df)
  • max_features: if specified, only the top max_features most frequent terms are included in the vocabulary
    • If max_features=None (the default), all terms are included
  • max_df: ignore terms with a document frequency higher than the given threshold
    • An integer is treated as an absolute document count; a float as a proportion of documents
    • Default is 1.0, which means no terms are ignored
  • min_df: ignore terms with a document frequency lower than the given threshold
    • An integer is treated as an absolute document count; a float as a proportion of documents
    • Default is 1, which means no terms are ignored
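These parameters can be combined. A minimal sketch on a made-up corpus, with illustrative (not prescriptive) values:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The movie was great, really great.",
    "The movie was terrible.",
    "A great film with a terrible ending.",
    "Not a great movie at all.",
]

# Keep at most 1000 terms, drop terms appearing in more than 90% of
# documents or in fewer than 2 documents
vect = CountVectorizer(max_features=1000, max_df=0.9, min_df=2)
X = vect.fit_transform(corpus)

# One row per document, one column per term that survived the filters
print(X.shape)
print(vect.get_feature_names_out())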

Let's practice!
