Hugging Face Datasets

Working with Hugging Face

Jacob H. Marquez

Lead Data Engineer

Datasets in Hugging Face

Hugging Face Hub datasets

1 https://huggingface.co/datasets

Searching for datasets

Searching for a dataset

1 https://huggingface.co/datasets

Dataset cards

Dataset card for imdb dataset

1 https://huggingface.co/datasets/imdb

Dataset cards

  • Path
  • Description
  • Dataset structure
  • An example
  • Field metadata

Dataset card for imdb dataset

1 https://huggingface.co/datasets/imdb

Dataset cards

Previewing a dataset

1 https://huggingface.co/datasets/imdb

Installing Datasets Package

 

pip install datasets

  • 🌐 Access
  • 📥 Download
  • ✂ Mutate
  • 🔧 Use
  • 🤝 Share

1 https://huggingface.co/docs/datasets/index

Inspecting a dataset

from datasets import load_dataset_builder

# Load dataset metadata
data_builder = load_dataset_builder("imdb")

# Access dataset size
dataset_size_mb = data_builder.info.dataset_size / (1024 ** 2)
print(f"Dataset size: {round(dataset_size_mb, 2)} MB")
Dataset size: 127.02 MB
1 https://huggingface.co/docs/datasets/load_hub
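
The size calculation above is a plain byte-to-megabyte conversion, sketched here as a small helper. The name `bytes_to_mb` and the example byte count are hypothetical. The `None` check assumes the size field may be unset for some datasets, which is worth verifying for the dataset you inspect.

```python
from typing import Optional


def bytes_to_mb(num_bytes: Optional[int]) -> Optional[float]:
    """Convert a byte count to megabytes, using 1 MB = 1024 ** 2 bytes."""
    if num_bytes is None:
        # Assumption: treat a missing size as unknown rather than as zero.
        return None
    return round(num_bytes / (1024 ** 2), 2)


# Hypothetical byte count: 5 MB plus 512 KB.
example_bytes = 5 * 1024 ** 2 + 512 * 1024
print(f"Dataset size: {bytes_to_mb(example_bytes)} MB")
```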

Downloading a dataset

from datasets import load_dataset

data = load_dataset("imdb")

Split parameter

data = load_dataset("imdb", split="train")
1 https://huggingface.co/docs/datasets/v2.15.0/loading

Apache Arrow dataset formats

Apache Arrow dataset

1 https://arrow.apache.org/overview/

Data manipulation

imdb = load_dataset("imdb", split="train")


# Filter imdb
filtered = imdb.filter(lambda row: row['label']==0)
print(filtered)
Dataset({
    features: ['text', 'label'],
    num_rows: 12500
})
1 https://huggingface.co/docs/datasets/process#select-and-filter

Data manipulation

# Slicing
sliced = filtered.select(range(2))


print(sliced)
Dataset({features: ['text', 'label'], num_rows: 2})
print(sliced[0]['text'])
I rented I AM CURIOUS-YELLOW...
1 https://huggingface.co/docs/datasets/process#select-and-filter

Benefits of datasets

An image of dataset

  • 🌐 Accessible and shareable
  • 💻 Relevant to common ML tasks
  • 📈 Efficient processing on large data

Let's practice!
