Feature Engineering for NLP in Python
Rounak Banik
Data Scientist
The intuition behind TF-IDF: in the first document, human occurs in five places, so under plain term frequency human gets a weight of 5. Now suppose jupiter and universe each occur 20 times in a document, but jupiter rarely occurs in the other documents while universe is common across the corpus. TF-IDF assigns jupiter a higher weight on account of its exclusivity.

$$w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_{i}}\right)$$

$$w_{i,j} \rightarrow \text{weight of term } i \text{ in document } j$$
$$tf_{i,j} \rightarrow \text{term frequency of term } i \text{ in document } j$$
$$N \rightarrow \text{number of documents in the corpus}$$
$$df_{i} \rightarrow \text{number of documents containing term } i$$
Example:
$w_{library, document} = 5 \cdot \log_{10}\left(\frac{20}{8}\right) \approx 2$
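The worked example can be checked directly in Python. Note that the slide's result of approximately 2 only comes out if the logarithm is taken in base 10 (scikit-learn's implementation, by contrast, uses the natural logarithm plus some smoothing):

import math

# Term "library" appears 5 times in the document (tf = 5);
# the corpus has N = 20 documents, 8 of which contain "library" (df = 8).
tf = 5
N = 20
df = 8

# Base-10 logarithm reproduces the slide's result of roughly 2
w = tf * math.log10(N / df)
print(round(w, 2))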
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Generate matrix of word vectors
tfidf_matrix = vectorizer.fit_transform(corpus)
print(tfidf_matrix.toarray())
[[0. 0. 0. 0. 0.25434658 0.33443519
0.33443519 0. 0.25434658 0. 0.25434658 0.
0.76303975]
[0. 0.46735098 0. 0.46735098 0. 0.
0. 0.46735098 0. 0.46735098 0.35543247 0.
0. ]
...
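The snippet above assumes a `corpus` variable has already been defined earlier in the lesson. A minimal self-contained run, using a hypothetical three-document corpus (not the one from the slides), might look like:

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus for illustration
corpus = [
    "the lion is the king of the jungle",
    "lions have lifespans of a decade",
    "the lion is an endangered species",
]

# Fit the vectorizer and transform the corpus in one step
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

# Each row is a document, each column a vocabulary term
print(tfidf_matrix.shape)
print(vectorizer.get_feature_names_out())

By default, `TfidfVectorizer` drops single-character tokens (such as "a"), lowercases the text, and L2-normalizes each document vector, which is why its weights differ from the hand-computed formula above.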