Feature Engineering for NLP in Python
Rounak Banik
Data Scientist
Consider a document that contains the word human in five places; a simple count-based representation gives human a weight of 5 in that document. Now consider two words, jupiter and universe, each occurring 20 times in a document. jupiter rarely occurs in the other documents of the corpus, whereas universe is common. Intuitively, jupiter should receive a larger weight on account of its exclusivity. TF-IDF captures this idea:
$$w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_{i}}\right)$$
$$w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_{i}}\right)$$
$$w_{i,j} \rightarrow \text{weight of term } i \text{ in document } j$$
$$tf_{i,j} \rightarrow \text{term frequency of term } i \text{ in document } j$$
$$N \rightarrow \text{number of documents in the corpus}$$
$$df_{i} \rightarrow \text{number of documents containing term } i$$
Example:
$$w_{\text{library},\,\text{document}} = 5 \cdot \log_{10}\left(\frac{20}{8}\right) \approx 2$$
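The worked example above can be checked directly. A minimal sketch using Python's standard library, where the counts tf = 5, N = 20, and df = 8 come from the example; the base-10 logarithm is assumed here, since it is what reproduces the ≈ 2 result:

```python
import math

# Counts from the worked example above
tf = 5    # term frequency of "library" in the document
N = 20    # number of documents in the corpus
df = 8    # number of documents containing "library"

# TF-IDF weight per the formula, assuming a base-10 logarithm
weight = tf * math.log10(N / df)
print(round(weight, 2))  # prints 1.99
```

Note that scikit-learn's TfidfVectorizer computes a variant of this formula (natural logarithm, smoothed document frequencies, and L2-normalized rows by default), so its values will not match this hand calculation exactly.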
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Generate matrix of word vectors
tfidf_matrix = vectorizer.fit_transform(corpus)
print(tfidf_matrix.toarray())
[[0. 0. 0. 0. 0.25434658 0.33443519
0.33443519 0. 0.25434658 0. 0.25434658 0.
0.76303975]
[0. 0.46735098 0. 0.46735098 0. 0.
0. 0.46735098 0. 0.46735098 0.35543247 0.
0. ]
...