Preprocessing data for fine-tuning

Fine-Tuning with Llama 3

Francesca Donadoni

Curriculum Manager, DataCamp

Using datasets for fine-tuning

Data quality is critical
Training set:
- Used to train the model
- Makes up the majority of the data
Validation set:
- Used to select the best model version
Test set:
- Used to evaluate model performance

Diagram of a dataset split into training, validation, and test sets.

Preparing data with the Datasets library

Datasets library
Preprocessing
Splitting
Loading
Memory management

Diagram of the data flow: data flows into the Datasets library, where three green boxes show preprocessing, loading/management, and integrations; an arrow points to the output (a box labeled as ready data).

Loading the customer service dataset

from datasets import load_dataset

# Download the training split from the Hugging Face Hub
ds = load_dataset(
    'bitext/Bitext-customer-support-llm-chatbot-training-dataset',
    split="train"
)

print(ds.column_names)

['flags', 'instruction', 'category', 'intent', 'response']

A quick look at the data

import pprint
pprint.pprint(ds[0])

{'category': 'ORDER',
 'flags': 'B',
 'instruction': 'question about cancelling order {{Order Number}}',
 'intent': 'cancel_order',
 'response': "I've understood you have a question regarding canceling order "
             "{{Order Number}}, and I'm here to provide you with the "
             'information you need. Please go ahead and ask your question, and '
             "I'll do my best to assist you."}

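Filtering the dataset

from datasets import load_dataset, Dataset

ds = load_dataset(
    'bitext/Bitext-customer-support-llm-chatbot-training-dataset',
    split="train")

# (number of rows, number of columns)
print(ds.shape)

(26872, 5)

# Slicing returns a plain dict of columns, not a Dataset
first_thousand_points = ds[:1000]

# Rebuild a Dataset from the first 1,000 rows
ds = Dataset.from_dict(first_thousand_points)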

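Preprocessing the dataset

def merge_example(row):
    # Combine the query and the response into a single training string
    row['conversation'] = f"Query: {row['instruction']}\nResponse: {row['response']}"
    return row

# Apply the function to every row, adding a 'conversation' column
ds = ds.map(merge_example)

print(ds[0]['conversation'])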

Query: question about cancelling order {{Order Number}}
Response: I've understood you have a question regarding canceling order {{Order Number}}, 
and I'm here to provide you with the information you need. Please go ahead and ask your 
question, and I'll do my best to assist you.

Saving the preprocessed dataset

ds.save_to_disk("preprocessed_dataset")

Saving the dataset (1/1 shards): 100%
26872/26872 [00:00<00:00, 383823.33 examples/s]

from datasets import load_from_disk
ds_preprocessed = load_from_disk("preprocessed_dataset")

Using Hugging Face datasets with TorchTune

You can use a Hugging Face dataset with TorchTune
Set a dataset path and configuration options

tune run full_finetune_single_device --config llama3/8B_full_single_device \
dataset=preprocessed_dataset dataset.split=train

Let's practice!
