Hugging Face Datasets

Working with Hugging Face

Jacob H. Marquez

Lead Data Engineer

HF Datasets 1

1 https://huggingface.co/datasets

HF Datasets 2

1 https://huggingface.co/datasets

Installing the Datasets package

 

pip install datasets

  • 🌐 Access
  • 📥 Download
  • 🔧 Use
  • 🤝 Share

HF Datasets

1 https://huggingface.co/docs/datasets/loading

Downloading a dataset

from datasets import load_dataset

data = load_dataset("IVN-RIN/BioBERT_Italian")


The split parameter

data = load_dataset("IVN-RIN/BioBERT_Italian", split="train")
1 https://huggingface.co/docs/datasets/v2.15.0/loading

Apache Arrow dataset formats

 

Apache Arrow dataset

1 https://arrow.apache.org/overview/

Manipulating data

data = load_dataset("IVN-RIN/BioBERT_Italian", split="train")


# Filter on the pattern " bella "
filtered = data.filter(lambda row: " bella " in row['text'])
print(filtered)
Dataset({
    features: ['text'],
    num_rows: 1122
})
1 https://huggingface.co/docs/datasets/process#select-and-filter

Manipulating data

# Select the first two rows
sliced = filtered.select(range(2))


print(sliced)
Dataset({features: ['text'], num_rows: 2})
# Extract the 'text' of the first row
print(sliced[0]['text'])
Concentrazioni atmosferiche di PCDD/PCDF...
1 https://huggingface.co/docs/datasets/process#select-and-filter

Let's practice!
