Introduction to preprocessing for text

Deep Learning for Text with PyTorch

Shubham Jain

Data Scientist

What we will learn

  • Text classification
  • Text generation
  • Encoding
  • Deep learning models for text
  • Transformer architecture
  • Protecting models

Use cases:

  • Sentiment analysis
  • Text summarization
  • Machine translation

What you should know

Prerequisite course: Intermediate Deep Learning with PyTorch

  • Deep learning models with PyTorch
  • Training and evaluation loops
  • Convolutional neural networks (CNNs) and recurrent neural networks (RNNs)

Text processing pipeline

  • Clean and prepare text

PyTorch and NLTK

  • NLTK: Natural Language Toolkit
    • Transform raw text to processed text

Preprocessing techniques

  • Tokenization
  • Stop word removal
  • Stemming
  • Rare word removal

Tokenization

  • Tokens or words are extracted from text
  • Tokenization using torchtext
from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer("basic_english")
# "basic_english" lowercases the text and separates punctuation from words
tokens = tokenizer("I am reading a book now. I love to read books!")
print(tokens)
["i", "am", "reading", "a", "book", "now", ".", "i", "love", "to", "read", "books", "!"]
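
To make the tokenization step concrete, here is a minimal standard-library sketch of the same idea. The helper name `simple_tokenize` is hypothetical and is not part of torchtext; it only lowercases the text and splits punctuation from words, so it approximates the behaviour above rather than reproducing `basic_english` exactly.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and return words and punctuation marks as separate tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = simple_tokenize("I am reading a book now. I love to read books!")
print(tokens)
```

Lowercasing at this stage means "I" and "i" map to the same token, which keeps the vocabulary smaller for the later steps.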

Stop word removal

  • Eliminate common words that do not contribute to the meaning
  • Stop words: "a", "the", "and", "or", and more
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
tokens = ["I", "am", "reading", "a", "book", "now", ".", "I", "love", "to", "read", "books", "!"]
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
print(filtered_tokens)
["reading", "book", ".", "love", "read", "books", "!"]

Stemming

  • Reducing words to their base form
  • For example: "running" and "runs" become "run" (a suffix-stripping stemmer leaves irregular forms such as "ran" unchanged)
import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
filtered_tokens = ["reading", "book", ".", "love", "read", "books", "!"]
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
print(stemmed_tokens)
["read", "book", ".", "love", "read", "book", "!"]

Rare word removal

  • Removing infrequent words that don't add value
from nltk.probability import FreqDist
stemmed_tokens = ["read", "book", ".", "love", "read", "book", "!"]
freq_dist = FreqDist(stemmed_tokens)

# Keep tokens that occur at least `threshold` times
threshold = 2
common_tokens = [token for token in stemmed_tokens if freq_dist[token] >= threshold]
print(common_tokens)
["read", "book", "read", "book"]
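
In practice, token frequencies are usually counted over a whole corpus rather than a single sentence. The sketch below shows that variant with `collections.Counter` from the standard library. The two-document corpus and the `min_count` cutoff are made-up values for illustration, not taken from the course.

```python
from collections import Counter

# Hypothetical toy corpus: each document is already tokenized and stemmed
corpus = [
    ["read", "book", ".", "love", "read", "book", "!"],
    ["love", "book", "."],
]

# Count every token across all documents
counts = Counter(token for doc in corpus for token in doc)

# Assumed cutoff: keep tokens seen at least twice in the whole corpus
min_count = 2
filtered_corpus = [
    [token for token in doc if counts[token] >= min_count]
    for doc in corpus
]
print(filtered_corpus)
```

Here "love" and "." survive because they appear in both documents, while "!" occurs only once and is dropped. The right cutoff depends on the size of the dataset.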

Preprocessing techniques

Tokenization, stop word removal, stemming, and rare word removal

  • Reduce features
  • Cleaner, more representative datasets
  • More techniques exist
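
The four steps can be chained into one function. The following is a rough standard-library sketch, not the course's implementation: `STOP_WORDS` is a small hand-picked set standing in for NLTK's English list, `crude_stem` is a toy suffix stripper standing in for `PorterStemmer`, and all function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop word set; NLTK's English list is much longer
STOP_WORDS = {"i", "am", "a", "now", "to", "the", "and", "or"}

def tokenize(text):
    # Lowercase, then split into words and single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    # Toy stand-in for a real stemmer: strips a trailing "ing" or "s"
    # only when at least three characters would remain
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def remove_rare(tokens, min_count=2):
    # Keep tokens that occur at least min_count times
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= min_count]

def preprocess(text, min_count=2):
    tokens = tokenize(text)
    tokens = remove_stop_words(tokens)
    tokens = [crude_stem(t) for t in tokens]
    return remove_rare(tokens, min_count)

processed = preprocess("I am reading a book now. I love to read books!")
print(processed)
```

The order matters: stemming before rare word removal lets "reading" and "read" be counted as the same token, so fewer useful words fall below the cutoff.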

Let's practice!
