Encoding text data

Deep Learning for Text with PyTorch

Shubham Jain

Data Scientist

Text encoding

PyTorch processing pipeline

  • Convert text into machine-readable numbers
  • Enable analysis and modeling

[Image: pipeline from raw sequential text data to insights]


Encoding techniques

  • One-hot encoding: transforms words into unique numerical representations
  • Bag-of-Words (BoW): captures word frequency, disregarding order
  • TF-IDF: weights words by their frequency within a document and their rarity across documents
  • Embedding: converts words into vectors, capturing semantic meaning (Chapter 2)

One-hot encoding

  • Mapping each word to a distinct vector
  • Binary vector:
    • 1 for the presence of a word
    • 0 for the absence of a word
  • ['cat', 'dog', 'rabbit']
    • 'cat' [1, 0, 0]
    • 'dog' [0, 1, 0]
    • 'rabbit' [0, 0, 1]

One-hot encoding with PyTorch

import torch

vocab = ['cat', 'dog', 'rabbit']
vocab_size = len(vocab)

# Identity matrix: row i is the one-hot vector for word i
one_hot_vectors = torch.eye(vocab_size)

# Map each word in the vocabulary to its one-hot vector
one_hot_dict = {word: one_hot_vectors[i] for i, word in enumerate(vocab)}
print(one_hot_dict)
{'cat': tensor([1., 0., 0.]),
  'dog': tensor([0., 1., 0.]),
  'rabbit': tensor([0., 0., 1.])}
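
PyTorch also provides torch.nn.functional.one_hot, which builds the same vectors from integer indices. A minimal sketch, assuming the same vocabulary (the word_to_idx mapping here is illustrative):

import torch
import torch.nn.functional as F

vocab = ['cat', 'dog', 'rabbit']
# Map each word to an integer index
word_to_idx = {word: i for i, word in enumerate(vocab)}
# Encode a sequence of words as one-hot rows
indices = torch.tensor([word_to_idx[w] for w in ['cat', 'rabbit', 'dog']])
print(F.one_hot(indices, num_classes=len(vocab)))

tensor([[1, 0, 0],
        [0, 0, 1],
        [0, 1, 0]])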

Bag-of-words

  • Treats each document as an unordered collection of words
  • Focuses on word frequency, not word order
  • Example: "The cat sat on the mat"
    • {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}
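
The same counts can be reproduced by hand; a minimal sketch using Python's collections.Counter with a simple lowercase-and-split tokenization:

from collections import Counter

text = "The cat sat on the mat"
# Lowercase, split on whitespace, and count each token
bow = Counter(text.lower().split())
print(bow)

Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})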

CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

# Learn the vocabulary and count each word's occurrences per document
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Rows are documents; columns follow the alphabetical feature names below
print(X.toarray())
print(vectorizer.get_feature_names_out())
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]

['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
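
Once fitted, the vectorizer can be reused on unseen text with .transform(); continuing from the code above, words outside the learned vocabulary ('new' here) are simply ignored:

# Apply the learned vocabulary to a new document
print(vectorizer.transform(['This is a new document.']).toarray())

[[0 1 0 1 0 0 0 0 1]]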

TF-IDF

  • Term Frequency-Inverse Document Frequency
    • Scores the importance of words in a document
    • Rare words have a higher score
    • Common ones have a lower score
    • Emphasizes informative words
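
By default, scikit-learn's TfidfVectorizer (shown next) uses a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplies it by the raw term count, and L2-normalizes each document's vector. A minimal sketch of the idf step for a word that appears in 1 of 4 documents:

import math

n_docs = 4  # documents in the corpus
df = 1      # documents containing the word
# Smoothed idf, matching TfidfVectorizer's default smooth_idf=True
idf = math.log((1 + n_docs) / (1 + df)) + 1
print(f"{idf:.4f}")

1.9163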

TfidfVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

# Compute L2-normalized TF-IDF scores per document
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray())
print(vectorizer.get_feature_names_out())
[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]

['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
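
To interpret a row, pair the scores with the feature names; continuing from the code above, for the first document:

# Show the non-zero TF-IDF scores of the first document
for word, score in zip(vectorizer.get_feature_names_out(), X.toarray()[0]):
    if score > 0:
        print(f"{word}: {score:.2f}")

document: 0.47
first: 0.58
is: 0.38
the: 0.38
this: 0.38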

Encoding techniques

Techniques: one-hot encoding, bag-of-words, and TF-IDF

  • Allow models to understand and process text
  • Choose one technique per task to avoid redundancy
  • More techniques exist, such as embeddings (Chapter 2)

Let's practice!
