Introduction to preprocessing for text

Deep Learning for Text with PyTorch

Shubham Jain

Data Scientist

What we will learn

Text classification
Text generation
Encoding
Deep learning models for text
Transformer architecture
Protecting models

Use cases:

Sentiment analysis
Text summarization
Machine translation

Sentiment Analysis

What you should know

Prerequisite course: Intermediate Deep Learning with PyTorch

Deep learning models with PyTorch
Training and evaluation loops
Convolutional neural networks (CNNs) and recurrent neural networks (RNNs)

Text processing pipeline

PyTorch Processing Pipeline

Clean and prepare text

PyTorch and NLTK

Natural Language Toolkit (NLTK)
- Transforms raw text into processed text

Preprocessing techniques

Tokenization
Stop word removal
Stemming
Rare word removal

Tokenization

Tokens or words are extracted from text
Tokenization using torchtext

from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer("basic_english")

tokens = tokenizer("I am reading a book now. I love to read books!")
print(tokens)

["i", "am", "reading", "a", "book", "now", ".", "i", "love", "to", "read",
"books", "!"]

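As a rough illustration of what a tokenizer does, here is a minimal sketch that uses only the standard library. It is not torchtext's implementation: `simple_tokenize` is a hypothetical helper that lowercases the text and splits punctuation into separate tokens, similar in spirit to the "basic_english" tokenizer.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = simple_tokenize("I am reading a book now. I love to read books!")
print(tokens)
```

Real tokenizers handle more cases (contractions, URLs, Unicode), so treat this only as a way to see the idea.
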
Stop word removal

Eliminate common words that do not contribute to the meaning
Stop words: "a", "the", "and", "or", and more

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

tokens = ["I", "am", "reading", "a", "book", "now", ".", "I", "love", "to", "read",
"books", "!"]
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]

print(filtered_tokens)

["reading", "book", ".", "love", "read", "books", "!"]

Stemming

Reducing words to their base form
For example: "running" and "runs" become "run" (an irregular form such as "ran" is left unchanged by the Porter stemmer)

import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

filtered_tokens = ["reading", "book", ".", "love", "read", "books", "!"]

stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]

print(stemmed_tokens)

["read", "book", ".", "love", "read", "book", "!"]

Rare word removal

Removing infrequent words that don't add value

from nltk.probability import FreqDist
stemmed_tokens = ["read", "book", ".", "love", "read", "book", "!"]
freq_dist = FreqDist(stemmed_tokens)

# Keep tokens that occur at least `threshold` times
threshold = 2

common_tokens = [token for token in stemmed_tokens if freq_dist[token] >= threshold]
print(common_tokens)

["read", "book", "read", "book"]
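
The same kind of filter can be sketched with `collections.Counter` from the standard library instead of `FreqDist`. The name `min_count` is an assumption chosen for illustration; a token is kept when it occurs at least `min_count` times.

```python
from collections import Counter

stemmed_tokens = ["read", "book", ".", "love", "read", "book", "!"]
min_count = 2  # hypothetical cutoff: minimum number of occurrences to keep a token

# Count how often each token appears
counts = Counter(stemmed_tokens)

# Keep only tokens that meet the cutoff, preserving their original order
common_tokens = [token for token in stemmed_tokens if counts[token] >= min_count]
print(common_tokens)  # ['read', 'book', 'read', 'book']
```

In practice the counts would usually be computed over a whole corpus rather than a single sentence, so that a word is judged rare relative to the full dataset.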

Preprocessing techniques

Tokenization, stop word removal, stemming, and rare word removal

Reduce features
Cleaner, more representative datasets
More techniques exist
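
To show how the four steps chain together, here is a minimal standard-library sketch. Everything in it is an assumption made for illustration: `STOP_WORDS` is a tiny hand-picked list (NLTK's English list is much longer), `toy_stem` is a crude suffix stripper standing in for `PorterStemmer` and is not equivalent to it, and `preprocess` and `min_count` are hypothetical names.

```python
import re
from collections import Counter

# Assumption: a small hand-picked stop-word list for this example only
STOP_WORDS = {"i", "am", "a", "the", "and", "or", "to", "now"}

def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def toy_stem(token):
    """Strip a trailing 'ing' or 's'; a stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        # Only strip when at least three characters would remain
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def remove_rare(tokens, min_count=2):
    """Keep tokens that occur at least `min_count` times."""
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= min_count]

def preprocess(text, min_count=2):
    """Tokenize, remove stop words, stem, then remove rare tokens."""
    tokens = tokenize(text)
    tokens = remove_stop_words(tokens)
    tokens = [toy_stem(t) for t in tokens]
    return remove_rare(tokens, min_count)

result = preprocess("I am reading a book now. I love to read books!")
print(result)  # ['read', 'book', 'read', 'book']
```

Keeping each step as its own function makes it easy to swap in the torchtext tokenizer, the NLTK stop-word list, or `PorterStemmer` without changing the rest of the pipeline.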

Let's practice!