Building n-gram models

Feature Engineering for NLP in Python

Rounak Banik

Data Scientist

BoW shortcomings

review	label
`'The movie was good and not boring'`	positive
`'The movie was not good and boring'`	negative
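The two reviews above carry opposite labels but contain exactly the same words, so a bag-of-words representation cannot tell them apart. A minimal sketch of this, assuming simple lowercase whitespace tokenization (an illustrative choice, not necessarily the course's preprocessing):

```python
from collections import Counter

reviews = [
    'The movie was good and not boring',   # positive
    'The movie was not good and boring',   # negative
]

# Simplified bag-of-words: lowercase, split on whitespace, count each word.
bags = [Counter(review.lower().split()) for review in reviews]

# Word order is discarded, so both reviews produce the same bag.
identical = bags[0] == bags[1]
print(identical)  # True
```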

Example sentence: 'for you a thousand times over'

n = 2, n-grams:

[
'for you',
'you a',
'a thousand',
'thousand times',
'times over'
]

n = 3, n-grams:

[
'for you a',
'you a thousand',
'a thousand times',
'thousand times over'
]
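The bigram and trigram lists above can be reproduced with a sliding window over the words of the sentence. This is a rough sketch; `word_ngrams` is a hypothetical helper written for illustration, and it assumes whitespace tokenization:

```python
def word_ngrams(text, n):
    """Return the contiguous word n-grams of `text` (hypothetical helper)."""
    tokens = text.split()
    # Slide a window of n tokens across the sentence, one position at a time.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = 'for you a thousand times over'
bigram_list = word_ngrams(sentence, 2)
trigram_list = word_ngrams(sentence, 3)

print(bigram_list)
print(trigram_list)
```

A sentence of 6 words yields 5 bigrams and 4 trigrams, since each window of size n can start at 6 - n + 1 positions.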

from sklearn.feature_extraction.text import CountVectorizer

# Generates only bigrams.
bigrams = CountVectorizer(ngram_range=(2,2))

# Generates unigrams, bigrams and trigrams.
ngrams = CountVectorizer(ngram_range=(1,3))
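As a possible usage example, the two vectorizers can be fitted on the pair of movie reviews from earlier to check that n-grams separate them. The `reviews` list and the variable names for the resulting matrices are assumptions made for this sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    'The movie was good and not boring',
    'The movie was not good and boring',
]

# Bigrams only: word pairs such as 'not good' and 'not boring' become features.
bigrams = CountVectorizer(ngram_range=(2, 2))
bigram_matrix = bigrams.fit_transform(reviews).toarray()

# Unigrams, bigrams and trigrams together give a larger feature space.
ngrams = CountVectorizer(ngram_range=(1, 3))
ngram_matrix = ngrams.fit_transform(reviews).toarray()

# Unlike plain bag-of-words, the bigram vectors of the two reviews differ.
rows_differ = (bigram_matrix[0] != bigram_matrix[1]).any()

print(sorted(bigrams.vocabulary_))
print(bigram_matrix.shape, ngram_matrix.shape, rows_differ)
```

Note that the default `CountVectorizer` tokenizer lowercases text and keeps only tokens of two or more characters, so a one-letter word such as 'a' in 'for you a thousand times over' would be dropped before n-grams are formed. Wider n-gram ranges also grow the number of features quickly, which is worth keeping in mind for larger corpora.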
