Preprocess text for training

Efficient AI Model Training with PyTorch

Dennis Lee

Data Engineer

Text transformation: preparing data for model mastery

  • Summarize text in documents
  • Paraphrase identification
  • MRPC (Microsoft Research Paraphrase Corpus) dataset: sentence pairs labeled as paraphrases or not

[Image: a tall stack of documents awaiting review]


Dataset structure

from datasets import load_dataset
dataset = load_dataset("glue", "mrpc")
print(dataset)
DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
    })
})
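
As a quick check, indexing into a split returns a single example as a dictionary. A minimal sketch (1 marks a paraphrase pair under the GLUE labeling convention):

example = dataset["train"][0]
print(example["sentence1"])   # first sentence of the pair
print(example["sentence2"])   # candidate paraphrase
print(example["label"])       # 1 = paraphrase, 0 = not a paraphrase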

Manipulating the text dataset

  • Nested dictionary of train/validation/test splits
  • Example of accessing the train split:
dataset["train"]
  • Access features within a split, e.g., by column name (see the sketch below)
  • Key MRPC features: sentence1, sentence2, label (plus an idx index column)
dataset["train"]["sentence1"]
  • Load a pre-trained tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
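
Two quick sketches tying these together, reusing dataset and tokenizer from above (the output fields are the standard transformers tokenizer outputs for DistilBERT):

print(dataset["train"]["sentence1"][:3])        # column access returns a plain list

pair = dataset["train"][0]
encoded = tokenizer(pair["sentence1"], pair["sentence2"])
print(list(encoded.keys()))                     # ['input_ids', 'attention_mask']
print(tokenizer.decode(encoded["input_ids"]))   # [CLS] ... [SEP] ... [SEP]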

Define an encoding function

  • Define a function to encode examples from our dataset
  • Call the tokenizer; extract sentence1 and sentence2 from the training example
  • truncation=True: truncate inputs longer than the model's maximum length (512 tokens for DistilBERT)
  • padding="max_length": pad shorter sequences with the padding token so all inputs share the same length
def encode(example):
    return tokenizer(
        example["sentence1"],
        example["sentence2"],
        truncation=True,
        padding="max_length",
    )
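
Calling encode on one training example shows the effect of truncation and padding. A quick sketch (512 is DistilBERT's maximum sequence length):

encoded = encode(dataset["train"][0])
print(len(encoded["input_ids"]))    # 512: padded out to max_length
print(encoded["input_ids"][-5:])    # trailing padding token ids (0 for DistilBERT)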

Format column names

  • Apply encode to each example in the train split using map
train_dataset = dataset["train"].map(encode, batched=True)
  • Rename label to labels, the column name Hugging Face models expect
train_dataset = train_dataset.map(
    lambda examples: {"labels": examples["label"]}, batched=True
)
  • Look up model requirements for columns in the Hugging Face documentation
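
After these steps, it helps to verify the columns and convert the model inputs to tensors. A minimal sketch; set_format and the exact column list are a common convention rather than something shown on the slide:

print(train_dataset.column_names)
# ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'attention_mask', 'labels']

# Keep only the model's input columns, returned as PyTorch tensors
train_dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "labels"],
)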

Saving and loading checkpoints

  • Wrap the dataset in a DataLoader and let Accelerator place batches on the available GPUs
from torch.utils.data import DataLoader

dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)
dataloader = accelerator.prepare(dataloader)
  • Works with any PyTorch dataset (torch.utils.data.Dataset) in a DataLoader
  • Save the state of the preprocessed data, called a checkpoint
from pathlib import Path

checkpoint_dir = Path("preprocess_checkpoint")
accelerator.save_state(checkpoint_dir)
  • Load the checkpoint when we want to resume training
accelerator.load_state(checkpoint_dir)
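
The snippets above assume an existing Accelerator instance; a minimal end-to-end sketch of the save/resume workflow might look like this:

from pathlib import Path

from accelerate import Accelerator
from torch.utils.data import DataLoader

accelerator = Accelerator()

# Prepare the dataloader so its state is tracked by the accelerator
dataloader = accelerator.prepare(
    DataLoader(train_dataset, batch_size=32, shuffle=True)
)

# Snapshot the current state, then restore it later to resume
checkpoint_dir = Path("preprocess_checkpoint")
accelerator.save_state(checkpoint_dir)
accelerator.load_state(checkpoint_dir)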

Let's practice!
