Deep Learning for Text with PyTorch
Shubham Jain
Data Scientist
import torch

vocab = ['cat', 'dog', 'rabbit']
vocab_size = len(vocab)
one_hot_vectors = torch.eye(vocab_size)
one_hot_dict = {word: one_hot_vectors[i] for i, word in enumerate(vocab)}
print(one_hot_dict)
{'cat': tensor([1., 0., 0.]),
'dog': tensor([0., 1., 0.]),
'rabbit': tensor([0., 0., 1.])}
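The same encoding can also be produced with PyTorch's built-in `torch.nn.functional.one_hot`, which maps integer indices to one-hot rows. A minimal sketch (the word-to-index mapping is illustrative, not from the slides):

```python
import torch
import torch.nn.functional as F

vocab = ['cat', 'dog', 'rabbit']
# Map each word to an integer index, then one-hot encode the indices.
word_to_idx = {word: i for i, word in enumerate(vocab)}
indices = torch.tensor([word_to_idx[w] for w in ['dog', 'cat']])
encoded = F.one_hot(indices, num_classes=len(vocab))
print(encoded)
# tensor([[0, 1, 0],
#         [1, 0, 0]])
```

Note that `F.one_hot` returns integer tensors, whereas `torch.eye` returns floats.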
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = ['This is the first document.', 'This document is the second document.', 'And this is the third one.', 'Is this the first document?']
X = vectorizer.fit_transform(corpus)
print(X.toarray())
print(vectorizer.get_feature_names_out())
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
corpus = ['This is the first document.','This document is the second document.', 'And this is the third one.','Is this the first document?']
X = vectorizer.fit_transform(corpus)
print(X.toarray())
print(vectorizer.get_feature_names_out())
[[0.         0.         0.68091856 0.51785612 0.51785612 0.        ]
 [0.         0.         0.         0.51785612 0.51785612 0.68091856]
 [0.85151335 0.42575668 0.         0.32274454 0.32274454 0.        ]
 [0.         0.         0.68091856 0.51785612 0.51785612 0.        ]]
['and' 'document' 'first' 'is' 'one' 'second']
Techniques: One-hot encoding, bag-of-words, and TF-IDF