Building tf-idf document vectors

Feature Engineering for NLP in Python

Rounak Banik

Data Scientist

n-gram modeling

  • The weight of a dimension depends on the frequency of the word corresponding to that dimension.
    • If a document contains the word human in five places,
    • the dimension corresponding to human has weight 5.
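This count-based weighting can be sketched with scikit-learn's CountVectorizer (the toy one-document corpus below is assumed for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: "human" appears in five places in the single document
corpus = ["human human human human human evolved from apes"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus).toarray()

# vocabulary_ maps each word to its column (dimension) index;
# the dimension corresponding to "human" gets weight 5
vocab = vectorizer.vocabulary_
print(counts[0][vocab["human"]])  # 5
```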

Motivation

  • Some words occur very commonly across all documents
  • Corpus of documents on the universe
    • One document has jupiter and universe occurring 20 times each.
    • jupiter rarely occurs in the other documents. universe is common.
    • Give more weight to jupiter on account of exclusivity.
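The exclusivity argument above can be made concrete with the inverse-document-frequency factor. The corpus size and document counts below are assumed numbers chosen to match the scenario:

```python
import math

N = 100           # assumed number of documents in the corpus
df_jupiter = 2    # "jupiter" rarely occurs in other documents
df_universe = 90  # "universe" is common across documents

# The rare, exclusive term receives a much larger idf weight
idf_jupiter = math.log10(N / df_jupiter)    # log10(50)  ≈ 1.70
idf_universe = math.log10(N / df_universe)  # log10(1.1) ≈ 0.05
```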

Applications

  • Automatically detect stopwords
  • Search
  • Recommender systems
  • Better performance in predictive modeling for some cases

Term frequency-inverse document frequency

  • Weight is proportional to the term frequency
  • Weight is an inverse function of the number of documents in which the term occurs

Mathematical formula

$$w_{i,j} = tf_{i,j} \cdot \log_{10}\left(\frac{N}{df_{i}}\right)$$

$$w_{i,j} \rightarrow \text{weight of term } i \text{ in document } j$$

$$tf_{i,j} \rightarrow \text{term frequency of term } i \text{ in document } j$$

$$N \rightarrow \text{number of documents in the corpus}$$

$$df_{i} \rightarrow \text{number of documents containing term } i$$

Example:

$$w_{library,\,document} = 5 \cdot \log_{10}\left(\frac{20}{8}\right) \approx 2$$
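As a quick sanity check, the example weight can be computed directly (the base-10 logarithm is assumed, since that is what makes the result come out to approximately 2):

```python
import math

tf = 5   # term frequency of "library" in the document
N = 20   # number of documents in the corpus
df = 8   # number of documents containing "library"

# w = tf * log10(N / df)
w = tf * math.log10(N / df)
print(round(w, 2))  # 1.99
```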


tf-idf using scikit-learn

# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Generate matrix of word vectors
tfidf_matrix = vectorizer.fit_transform(corpus)
print(tfidf_matrix.toarray())
[[0.         0.         0.         0.         0.25434658 0.33443519
  0.33443519 0.         0.25434658 0.         0.25434658 0.
  0.76303975]
 [0.         0.46735098 0.         0.46735098 0.         0.
  0.         0.46735098 0.         0.46735098 0.35543247 0.
  0.        ]
...

Let's practice!

