Building tf-idf document vectors

Feature Engineering for NLP in Python

Rounak Banik

Data Scientist

n-gram modeling

  • The weight of a dimension depends on the frequency of the word corresponding to that dimension.
    • If a document contains the word human in five places,
    • the dimension corresponding to human has weight 5.
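This count-based weighting can be sketched with scikit-learn's CountVectorizer (the toy one-document corpus below is assumed for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: "human" appears in five places in the single document
corpus = ["human human human human human evolved from apes"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus).toarray()

# vocabulary_ maps each word to its column (dimension) index;
# the dimension corresponding to "human" gets weight 5
vocab = vectorizer.vocabulary_
print(counts[0][vocab["human"]])  # 5
```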

Motivation

  • Some words occur very commonly across all documents
  • Corpus of documents on the universe
    • One document has jupiter and universe occurring 20 times each.
    • jupiter rarely occurs in the other documents. universe is common.
    • Give more weight to jupiter on account of exclusivity.
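The exclusivity argument above can be made concrete with the inverse-document-frequency factor. The corpus size and document counts below are assumed numbers chosen to match the scenario:

```python
import math

N = 100           # assumed number of documents in the corpus
df_jupiter = 2    # "jupiter" rarely occurs in other documents
df_universe = 90  # "universe" is common across documents

# The rare, exclusive term receives a much larger idf weight
idf_jupiter = math.log10(N / df_jupiter)    # log10(50)  ≈ 1.70
idf_universe = math.log10(N / df_universe)  # log10(1.1) ≈ 0.05
```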

Applications

  • Automatically detect stopwords
  • Search
  • Recommender systems
  • Better performance in predictive modeling for some cases

Term frequency-inverse document frequency

  • Weight is proportional to the term frequency
  • Weight is an inverse function of the number of documents in which the term occurs

Mathematical formula

$$w_{i,j} = tf_{i,j} \cdot \log_{10}\left(\frac{N}{df_{i}}\right)$$

$$w_{i,j} \rightarrow \text{weight of term } i \text{ in document } j$$

$$tf_{i,j} \rightarrow \text{term frequency of term } i \text{ in document } j$$

$$N \rightarrow \text{number of documents in the corpus}$$

$$df_{i} \rightarrow \text{number of documents containing term } i$$

Example:

$$w_{library,\,document} = 5 \cdot \log_{10}\left(\frac{20}{8}\right) \approx 2$$
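As a quick sanity check, the example weight can be computed directly (the base-10 logarithm is assumed, since that is what makes the result come out to approximately 2):

```python
import math

tf = 5   # term frequency of "library" in the document
N = 20   # number of documents in the corpus
df = 8   # number of documents containing "library"

# w = tf * log10(N / df)
w = tf * math.log10(N / df)
print(round(w, 2))  # 1.99
```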


tf-idf using scikit-learn

# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Generate matrix of word vectors
tfidf_matrix = vectorizer.fit_transform(corpus)
print(tfidf_matrix.toarray())
[[0.         0.         0.         0.         0.25434658 0.33443519
  0.33443519 0.         0.25434658 0.         0.25434658 0.
  0.76303975]
 [0.         0.46735098 0.         0.46735098 0.         0.
  0.         0.46735098 0.         0.46735098 0.35543247 0.
  0.        ]
...

Let's practice!

