Efficient AI Model Training with PyTorch
Dennis Lee
Data Engineer
Load the MRPC (Microsoft Research Paraphrase Corpus) task from the GLUE benchmark with the Hugging Face datasets library:

from datasets import load_dataset
dataset = load_dataset("glue", "mrpc")
print(dataset)
DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
    })
})
dataset["train"]
sentence1
, sentence2
, label
dataset["train"]["sentence1"]
Next, load a pretrained tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
The encode function tokenizes sentence1 and sentence2 from each training example, with two keyword arguments:

- truncation: truncate inputs longer than the maximum length (512 tokens)
- padding: pad short sequences with zeros so all inputs have the same length

def encode(example):
    return tokenizer(
        example["sentence1"],
        example["sentence2"],
        truncation=True,
        padding="max_length",
    )
Apply encode to every example in the train split using map:
train_dataset = dataset["train"].map(encode, batched=True)
The model expects the target column to be named labels, so create it from label:
train_dataset = train_dataset.map(
    lambda examples: {"labels": examples["label"]}, batched=True
)
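One detail worth noting before building the DataLoader: for PyTorch's default collate function to stack batches into tensors, the dataset should expose only numeric columns in torch format. A common pattern (an assumption here, not shown in the original) is:

# Keep only the columns the model consumes, returned as torch tensors
train_dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "labels"],
)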
from torch.utils.data import DataLoader
from accelerate import Accelerator

accelerator = Accelerator()
dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)
dataloader = accelerator.prepare(dataloader)
A Hugging Face Dataset implements the torch.utils.data.Dataset interface, so it can be wrapped directly in a DataLoader. Passing the loader through accelerator.prepare() readies it for multi-device training.
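In a full training setup, prepare() typically wraps the model and optimizer alongside the dataloader so everything lands on the right devices. A minimal sketch, assuming a DistilBERT classification head and AdamW (neither appears in the original excerpt):

from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification

# Hypothetical model and optimizer for illustration
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-cased", num_labels=2
)
optimizer = AdamW(model.parameters(), lr=5e-5)

# prepare() places each object on the correct device(s) and wraps it
# for distributed or mixed-precision execution as configured
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)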
To checkpoint progress, save the accelerator state to a directory and reload it later:

from pathlib import Path

checkpoint_dir = Path("preprocess_checkpoint")
accelerator.save_state(checkpoint_dir)
accelerator.load_state(checkpoint_dir)
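Putting the pieces together, here is a sketch of how save_state might sit inside a training loop; the loop itself is an illustration, not part of the original:

model.train()
for epoch in range(3):
    for batch in dataloader:
        outputs = model(**batch)  # labels are in the batch, so loss is computed
        accelerator.backward(outputs.loss)
        optimizer.step()
        optimizer.zero_grad()
    # Checkpoint model, optimizer, and RNG state after each epoch
    accelerator.save_state(checkpoint_dir)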