Loading data in PyTorch

Introduction to Deep Learning with PyTorch

Maham Faisal Khan

Senior Data Scientist

Introducing the PyTorch Dataset class

We need a way to load our dataset
The PyTorch Dataset class provides an interface to store and manipulate data
A Dataset lets us decouple data preprocessing from training, which improves readability and modularity

The TensorDataset class

import numpy as np
import torch
from torch.utils.data import TensorDataset

data = np.array([[1, 2, 3], [4, 5, 6]])
tensor = torch.tensor(data)

dataset = TensorDataset(tensor)
dataset[0]

(tensor([1, 2, 3]),)

TensorDataset:

takes any number of PyTorch tensors as inputs
outputs a tuple that contains the indexed element of each input tensor
useful when the data is loaded as a NumPy array

df = pd.read_csv(...)
df_numpy = df.to_numpy()
# TensorDataset expects tensors, so convert the NumPy array first
dataset = TensorDataset(torch.tensor(df_numpy))
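
To illustrate the "any number of tensors" point, here is a minimal sketch that wraps a feature tensor and a label tensor together. The toy values and the names `features`, `labels`, `sample`, and `n_samples` are made up for illustration and are not part of the course data.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset

# Hypothetical toy data: 4 samples with 3 features each, plus one label per sample
features = torch.tensor(
    np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6],
              [0.7, 0.8, 0.9],
              [1.0, 1.1, 1.2]]),
    dtype=torch.float32,
)
labels = torch.tensor([0.0, 1.0, 0.0, 1.0])

# All input tensors must share the same first dimension (the number of samples)
dataset = TensorDataset(features, labels)

# Indexing returns a tuple with one entry per input tensor
sample = dataset[1]
n_samples = len(dataset)
```

Here `sample` is the pair `(features[1], labels[1])`, which is the shape of data a training loop typically unpacks.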

Creating a custom dataset

Nine features and one target (potability)

a sample of the water potability dataset

The potability (suitability for drinking) is either zero (non-potable) or one (potable)
The dataset has been normalized: each feature is bounded between zero and one
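
The slides do not say how the features were normalized. One common scheme that bounds every feature between zero and one is min-max scaling, sketched below under that assumption. The `raw` values are invented and do not come from the water potability dataset.

```python
import numpy as np

# Hypothetical raw feature matrix: rows are samples, columns are features
raw = np.array([[7.0, 200.0],
                [6.5, 150.0],
                [8.5, 250.0]])

# Min-max scaling, applied per column (assumed scheme, not confirmed by the source)
col_min = raw.min(axis=0)
col_max = raw.max(axis=0)
scaled = (raw - col_min) / (col_max - col_min)
```

After scaling, the smallest value in each column maps to zero and the largest to one.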

Create a custom dataset

import pandas as pd
from torch.utils.data import Dataset

class WaterDataset(Dataset):
    def __init__(self, csv_path):
        super().__init__()
        # Load the CSV, using its first column as the row index
        df = pd.read_csv(csv_path, index_col=0)
        self.data = df.to_numpy()

    def __len__(self):
        # Return the number of samples in our dataset
        return self.data.shape[0]

    def __getitem__(self, index):
        # Return features (array of 9 values) and label (single float) at the given index;
        # this assumes the target is the last column
        features = self.data[index, :-1]
        label = self.data[index, -1]
        return features, label
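
As a rough check of how such a class behaves, the sketch below builds a tiny CSV in a temporary directory and reads it back. The column names, the values, and the file name `toy_water.csv` are hypothetical stand-ins for the real file, and the code assumes the target sits in the last column.

```python
import os
import tempfile

import pandas as pd
from torch.utils.data import Dataset


class WaterDataset(Dataset):
    def __init__(self, csv_path):
        super().__init__()
        df = pd.read_csv(csv_path, index_col=0)
        self.data = df.to_numpy()

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, index):
        # Assumption: every column except the last is a feature, the last is the label
        features = self.data[index, :-1]
        label = self.data[index, -1]
        return features, label


# Hypothetical stand-in data: 3 rows, 2 features, 1 target column
toy = pd.DataFrame({
    "ph": [0.5, 0.25, 0.75],
    "hardness": [0.125, 0.5, 1.0],
    "potability": [0, 1, 0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy_water.csv")
    toy.to_csv(csv_path)  # the row index is written as the first column
    dataset = WaterDataset(csv_path)

features, label = dataset[1]
```

Because the data is held as a NumPy array, `dataset[1]` returns NumPy values. A DataLoader's default collation converts them to tensors when it builds a batch.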

PyTorch dataloader

from torch.utils.data import DataLoader

dataset = WaterDataset(csv_path)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

PyTorch DataLoader:
- takes a PyTorch Dataset as input
- returns an iterable
- allows batching (returns multiple samples at once)
- allows shuffling (returns the data in random order)

The outputs of the DataLoader are the inputs to the neural network
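
The batching behavior can be seen in a small sketch that uses a toy TensorDataset, so that it runs without the CSV file. The data and the name `batch_shapes` are illustrative only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy dataset: 5 samples with 3 features and one label each
features = torch.arange(15, dtype=torch.float32).reshape(5, 3)
labels = torch.tensor([0.0, 1.0, 0.0, 1.0, 1.0])
dataset = TensorDataset(features, labels)

torch.manual_seed(0)  # only to make the shuffled order repeatable
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

# Record the shape of every batch the loader yields in one pass
batch_shapes = []
for batch_features, batch_labels in dataloader:
    batch_shapes.append((tuple(batch_features.shape), tuple(batch_labels.shape)))
```

With 5 samples and `batch_size=2`, the loader yields two full batches and one final batch of a single sample. Passing `drop_last=True` would discard that incomplete batch.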

PyTorch dataloader

# Create an iterator (use a new name so the DataLoader itself stays reusable)
data_iter = iter(dataloader)
# Get the next batch of samples
features, labels = next(data_iter)
# Run a forward pass
predictions = model(features)

# Loop through the dataloader directly
for data in dataloader:
    # Extract features and labels
    features, labels = data
    # Run a forward pass
    predictions = model(features)
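
The snippets above rely on a `model` that the slides do not define. The sketch below fills it in with a hypothetical one-layer network and random stand-in data, purely to show the loop running end to end. The `.float()` cast matters when batches arrive as float64, which is common for NumPy-backed datasets.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in data: 6 samples, 9 features, binary labels
features = torch.rand(6, 9)
labels = torch.randint(0, 2, (6,)).float()
dataloader = DataLoader(TensorDataset(features, labels), batch_size=2, shuffle=True)

# Hypothetical model: 9 inputs -> 1 probability (the source does not specify one)
model = nn.Sequential(nn.Linear(9, 1), nn.Sigmoid())

outputs = []
with torch.no_grad():  # forward pass only, so no gradients are tracked
    for batch_features, batch_labels in dataloader:
        # nn.Linear weights are float32 by default, so cast the inputs to match
        predictions = model(batch_features.float())
        outputs.append(predictions)
```

Each entry of `outputs` has shape `(2, 1)`: one predicted probability per sample in the batch.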

Let's practice!
