Efficient AI Model Training with PyTorch
Dennis Lee
Data Engineer

We begin by loading the MRPC (Microsoft Research Paraphrase Corpus) subset of the GLUE benchmark with the Hugging Face datasets library:

from datasets import load_dataset

dataset = load_dataset("glue", "mrpc")
print(dataset)
DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
    })
})
dataset["train"]
sentence1, sentence2, labeldataset["train"]["sentence1"]
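Individual rows can also be indexed directly. A quick sketch (exact values depend on the dataset version):

# Each row comes back as a plain Python dict keyed by feature name
example = dataset["train"][0]
print(example.keys())    # dict_keys(['sentence1', 'sentence2', 'label', 'idx'])
print(example["label"])  # 1 if the two sentences are paraphrases, 0 otherwise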
To convert text into model inputs, we load the tokenizer that matches our model checkpoint:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
The encode function tokenizes sentence1 and sentence2 from the training example. Two arguments control sequence length:

truncation: truncate inputs if longer than the maximum length (512 tokens)
padding: pad short sequences with zeros so all inputs have the same length

def encode(example):
    return tokenizer(
        example["sentence1"],
        example["sentence2"],
        truncation=True,
        padding="max_length",
    )
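As a quick sanity check (a sketch, not part of the original material), encode can be called on a single example to inspect what the tokenizer returns:

# DistilBERT's tokenizer produces input IDs plus an attention mask
encoded = encode(dataset["train"][0])
print(encoded.keys())             # dict_keys(['input_ids', 'attention_mask'])
print(len(encoded["input_ids"]))  # 512, since we pad to the model's max length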
We apply encode to each example in the train split using map, with batched=True so examples are processed in batches rather than one at a time:

train_dataset = dataset["train"].map(encode, batched=True)
Models in the Transformers library expect the target column to be named labels, so we copy label into a labels column:

train_dataset = train_dataset.map(
    lambda examples: {"labels": examples["label"]}, batched=True
)
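Hugging Face Datasets stores columns as Python lists. Before building a DataLoader, the columns the model consumes should be exposed as PyTorch tensors; one way to do this (an assumption here, since this step is not shown above) is set_format:

# Keep only the model inputs, returned as torch tensors instead of lists
train_dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "labels"],
)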
We wrap the processed dataset (which behaves like a torch.utils.data.Dataset) in a DataLoader, then pass it through accelerator.prepare so Accelerate handles device placement for us:

from torch.utils.data import DataLoader
from accelerate import Accelerator

accelerator = Accelerator()

dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)
dataloader = accelerator.prepare(dataloader)
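To show where the prepared dataloader fits, here is a minimal single-step training sketch; the model and optimizer choices below are illustrative assumptions, not prescribed by the course material:

import torch
from transformers import AutoModelForSequenceClassification

# A two-class classifier on the same checkpoint we tokenized for (assumed setup)
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-cased", num_labels=2
)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model, optimizer = accelerator.prepare(model, optimizer)

for batch in dataloader:
    outputs = model(**batch)            # tensors are already on the right device
    accelerator.backward(outputs.loss)  # Accelerate's replacement for loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    break  # one step is enough for illustration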
To avoid redoing this work between sessions, Accelerate can checkpoint state to a directory and restore it later:

from pathlib import Path

checkpoint_dir = Path("preprocess_checkpoint")
accelerator.save_state(checkpoint_dir)
accelerator.load_state(checkpoint_dir)
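Putting the pieces together, a common pattern (a sketch that reuses the objects defined above; the epoch count is an assumed value) is to checkpoint at epoch boundaries and restore after an interruption:

num_epochs = 3  # assumed value for illustration

for epoch in range(num_epochs):
    for batch in dataloader:
        outputs = model(**batch)
        accelerator.backward(outputs.loss)
        optimizer.step()
        optimizer.zero_grad()
    # Checkpoint model, optimizer, and RNG state at each epoch boundary
    accelerator.save_state(checkpoint_dir)

# After an interruption, restore the most recent state and continue training
accelerator.load_state(checkpoint_dir)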