Preprocess text for training

Efficient AI Model Training with PyTorch

Dennis Lee

Data Engineer

Text transformation: preparing data for model mastery

  • Summarize text in documents
  • Paraphrase identification
  • MRPC (Microsoft Research Paraphrase Corpus) dataset: sentence pairs labeled as paraphrases or not

[Image: a tall stack of documents awaiting review]


Dataset structure

from datasets import load_dataset
dataset = load_dataset("glue", "mrpc")
print(dataset)
DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
    })
})
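
As a quick check, indexing into a split returns a single example as a dictionary. A minimal sketch (1 marks a paraphrase pair under the GLUE labeling convention):

example = dataset["train"][0]
print(example["sentence1"])   # first sentence of the pair
print(example["sentence2"])   # candidate paraphrase
print(example["label"])       # 1 = paraphrase, 0 = not a paraphrase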

Manipulating the text dataset

  • Nested dictionary of train/validation/test splits
  • Example of accessing the train split:
dataset["train"]
  • Access features within a split, e.g., by column name (see the sketch below)
  • Key MRPC features: sentence1, sentence2, label (plus an idx index column)
dataset["train"]["sentence1"]
  • Load a pre-trained tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
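
Two quick sketches tying these together, reusing dataset and tokenizer from above (the output fields are the standard transformers tokenizer outputs for DistilBERT):

print(dataset["train"]["sentence1"][:3])        # column access returns a plain list

pair = dataset["train"][0]
encoded = tokenizer(pair["sentence1"], pair["sentence2"])
print(list(encoded.keys()))                     # ['input_ids', 'attention_mask']
print(tokenizer.decode(encoded["input_ids"]))   # [CLS] ... [SEP] ... [SEP]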

Define an encoding function

  • Define a function to encode examples from our dataset
  • Call the tokenizer; extract sentence1 and sentence2 from the training example
  • truncation=True: truncate inputs longer than the model's maximum length (512 tokens for DistilBERT)
  • padding="max_length": pad shorter sequences with the padding token so all inputs share the same length
def encode(example):
    return tokenizer(
        example["sentence1"],
        example["sentence2"],
        truncation=True,
        padding="max_length",
    )
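
Calling encode on one training example shows the effect of truncation and padding. A quick sketch (512 is DistilBERT's maximum sequence length):

encoded = encode(dataset["train"][0])
print(len(encoded["input_ids"]))    # 512: padded out to max_length
print(encoded["input_ids"][-5:])    # trailing padding token ids (0 for DistilBERT)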

Format column names

  • Apply encode to each example in the train split using map
train_dataset = dataset["train"].map(encode, batched=True)
  • Rename label to labels, the column name Hugging Face models expect
train_dataset = train_dataset.map(
    lambda examples: {"labels": examples["label"]}, batched=True
)
  • Look up model requirements for columns in the Hugging Face documentation
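
After these steps, it helps to verify the columns and convert the model inputs to tensors. A minimal sketch; set_format and the exact column list are a common convention rather than something shown on the slide:

print(train_dataset.column_names)
# ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'attention_mask', 'labels']

# Keep only the model's input columns, returned as PyTorch tensors
train_dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "labels"],
)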

Saving and loading checkpoints

  • Wrap the dataset in a DataLoader and let Accelerator place batches on the available GPUs
from torch.utils.data import DataLoader

dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)
dataloader = accelerator.prepare(dataloader)
  • Works with any PyTorch dataset (torch.utils.data.Dataset) in a DataLoader
  • Save the state of the preprocessed data, called a checkpoint
from pathlib import Path

checkpoint_dir = Path("preprocess_checkpoint")
accelerator.save_state(checkpoint_dir)
  • Load the checkpoint when we want to resume training
accelerator.load_state(checkpoint_dir)
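
The snippets above assume an existing Accelerator instance; a minimal end-to-end sketch of the save/resume workflow might look like this:

from pathlib import Path

from accelerate import Accelerator
from torch.utils.data import DataLoader

accelerator = Accelerator()

# Prepare the dataloader so its state is tracked by the accelerator
dataloader = accelerator.prepare(
    DataLoader(train_dataset, batch_size=32, shuffle=True)
)

# Snapshot the current state, then restore it later to resume
checkpoint_dir = Path("preprocess_checkpoint")
accelerator.save_state(checkpoint_dir)
accelerator.load_state(checkpoint_dir)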

Let's practice!
