Building n-gram models

Feature Engineering for NLP in Python

Rounak Banik

Data Scientist

BoW shortcomings

review                                 label
'The movie was good and not boring'    positive
'The movie was not good and boring'    negative

  • Both reviews have exactly the same BoW representation!
  • Context of the words is lost.
  • Sentiment depends on the position of 'not'.
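
A minimal sketch of the problem above, assuming a simplified bag-of-words that only lowercases and splits on whitespace (the helper name `bag_of_words` is hypothetical, not part of the course material). It shows that the two reviews produce identical word counts even though their sentiment is opposite.

```python
from collections import Counter

reviews = [
    "The movie was good and not boring",   # positive
    "The movie was not good and boring",   # negative
]


def bag_of_words(text):
    """Count each lowercased, whitespace-separated token, ignoring word order."""
    return Counter(text.lower().split())


bow_positive = bag_of_words(reviews[0])
bow_negative = bag_of_words(reviews[1])

# Both reviews contain the same seven words once each, so the counts match.
print(bow_positive == bow_negative)  # True
```

Because word order is discarded, a model trained on these counts alone has no way to tell the two reviews apart.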

n-grams

  • Contiguous sequence of n elements (or words) in a given document.
  • n = 1 → bag-of-words
  • Example sentence: 'for you a thousand times over'
    
  • n = 2, n-grams:
    [
    'for you',
    'you a',
    'a thousand',
    'thousand times',
    'times over'
    ]
    

n-grams

'for you a thousand times over'
  • n = 3, n-grams:
    [
    'for you a',
    'you a thousand',
    'a thousand times',
    'thousand times over'
    ]
    
  • Captures more context.
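
The bigram and trigram lists on these slides can be reproduced with a short sliding-window function. This is only a sketch: `word_ngrams` is a hypothetical helper, and it assumes tokens are separated by whitespace.

```python
def word_ngrams(text, n):
    """Return every contiguous run of n whitespace-separated words in text."""
    tokens = text.split()
    # Slide a window of width n across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "for you a thousand times over"

bigrams = word_ngrams(sentence, 2)
trigrams = word_ngrams(sentence, 3)

print(bigrams)   # ['for you', 'you a', 'a thousand', 'thousand times', 'times over']
print(trigrams)  # ['for you a', 'you a thousand', 'a thousand times', 'thousand times over']
```

A sentence of six words yields five bigrams and four trigrams, since each window must fit entirely inside the sentence.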

Applications

  • Sentence completion
  • Spelling correction
  • Machine translation correction

Building n-gram models using scikit-learn

from sklearn.feature_extraction.text import CountVectorizer

Generates only bigrams.

bigrams = CountVectorizer(ngram_range=(2, 2))

Generates unigrams, bigrams and trigrams.

ngrams = CountVectorizer(ngram_range=(1,3))

Shortcomings

  • Curse of dimensionality: the number of features grows quickly with n
  • Higher-order n-grams are rare
  • Keep n small
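
To get a feel for how the feature space grows, the sketch below fits `CountVectorizer` on a tiny made-up corpus (the three sentences used on earlier slides) with an increasing upper bound on n and records the vocabulary size. The corpus and variable names are assumptions for illustration; real corpora grow far faster.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The movie was good and not boring",
    "The movie was not good and boring",
    "for you a thousand times over",
]

sizes = {}
for max_n in (1, 2, 3):
    # Include every n-gram from unigrams up to max_n-grams.
    vectorizer = CountVectorizer(ngram_range=(1, max_n))
    vectorizer.fit(corpus)
    sizes[max_n] = len(vectorizer.vocabulary_)

for max_n, size in sizes.items():
    print(f"ngram_range=(1, {max_n}): {size} features")
```

Each increase in the upper bound adds a new set of features, most of which occur in only one document, which is why n is usually kept small.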

Let's practice!
