Fine-tuning text-to-speech models

Multi-Modal Models with Hugging Face

James Chapman

Curriculum Manager, DataCamp

Purpose of fine-tuning text-to-speech

 

  • Learn sounds in new languages and dialects
  • Apply to new contexts

E.g., a large model pretrained on general English speech ⇏ realistic Italian speech

[Image: cartoon English and Italian speech bubbles]



[Diagram: how a speaker embedding fits into a speech generation pipeline]

  • Speaker embedding + text-to-speech model features → generative model
  • New speaker embeddings are insufficient on their own without fine-tuning (see the sketch below)
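A minimal sketch of how these pieces fit together at inference time, using the SpeechT5 checkpoints that appear later in this lesson; the zero vector stands in for a real speaker embedding:

import torch
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="Buongiorno!", return_tensors="pt")
speaker_embedding = torch.zeros(1, 512)  # placeholder: real x-vectors are computed later

# Speaker embedding + text features → generative model → spectrogram
spectrogram = model.generate_speech(inputs["input_ids"], speaker_embedding)
# The vocoder converts the spectrogram into an audible waveform
waveform = vocoder(spectrogram)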

Preparing an audio dataset

VoxPopuli dataset: transcribed speech data in 18 languages from European Parliament recordings

from datasets import load_dataset
dataset = load_dataset("facebook/voxpopuli", "it", split="train", 
                       trust_remote_code=True)
print(dataset.column_names)
['audio', 'raw_text', 'normalized_text', 'gender', 'speaker_id', ... ]
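SpeechT5 operates on 16 kHz audio. VoxPopuli is already sampled at 16 kHz, but casting the audio column makes this explicit (a defensive step, not in the original slides):

from datasets import Audio

# Ensure the audio column is decoded at 16 kHz, the rate SpeechT5 expects
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))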
  • Need to preprocess the audio and add speaker embeddings:
from speechbrain.pretrained import EncoderClassifier

speaker_model = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb",
    savedir="pretrained_models/spkrec-xvect-voxceleb",
)

Audio preprocessing

from transformers import SpeechT5Processor
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")


import torch

def prepare_dataset(example):
    audio = example["audio"]
    # Convert the text to token IDs and the audio to target spectrogram features
    example = processor(
        text=example["normalized_text"],
        audio_target=audio["array"],
        sampling_rate=audio["sampling_rate"],
        return_attention_mask=False,
    )
    example["labels"] = example["labels"][0]
    # Compute a normalized x-vector speaker embedding for this clip
    with torch.no_grad():
        speaker_embeddings = speaker_model.encode_batch(torch.tensor(audio["array"]))
        speaker_embeddings = torch.nn.functional.normalize(speaker_embeddings, dim=2)
    example["speaker_embeddings"] = speaker_embeddings.squeeze().cpu().numpy()
    return example

dataset = dataset.map(prepare_dataset)
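One optional extra, borrowed from the standard SpeechT5 fine-tuning recipe rather than these slides: filtering out very long examples keeps padded batches small.

# Keep only examples with fewer than 200 text tokens (a common threshold)
def is_not_too_long(input_ids):
    return len(input_ids) < 200

dataset = dataset.filter(is_not_too_long, input_columns=["input_ids"])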

Training arguments

from transformers import Seq2SeqTrainingArguments


training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_finetuned_voxpopuli_it",  # any output path works
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    warmup_steps=500,
    label_names=["labels"],
)
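The trainer below also needs a `data_collator`, which the slides reference but never define. A minimal sketch, modeled on the standard SpeechT5 fine-tuning recipe: pad the text tokens and spectrogram labels per batch, mask padded label frames out of the loss, and stack the speaker embeddings.

import torch
from dataclasses import dataclass

@dataclass
class TTSDataCollatorWithPadding:
    processor: SpeechT5Processor

    def __call__(self, features):
        input_ids = [{"input_ids": f["input_ids"]} for f in features]
        label_features = [{"input_values": f["labels"]} for f in features]
        speaker_features = [f["speaker_embeddings"] for f in features]

        # Pad text tokens and spectrogram targets to the longest in the batch
        batch = self.processor.pad(
            input_ids=input_ids, labels=label_features, return_tensors="pt"
        )
        # Replace padded frames with -100 so the loss ignores them
        batch["labels"] = batch["labels"].masked_fill(
            batch.decoder_attention_mask.unsqueeze(-1).ne(1), -100
        )
        del batch["decoder_attention_mask"]
        batch["speaker_embeddings"] = torch.tensor(speaker_features)
        return batch

data_collator = TTSDataCollatorWithPadding(processor=processor)

The full recipe also rounds target lengths down to a multiple of the model's reduction factor; that detail is omitted here for brevity.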

Putting it all together

from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan

model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

Trainer:

from transformers import Seq2SeqTrainer

# The dataset was loaded as a single "train" split; create train/test subsets
# (a 90/10 split here, as an example)
dataset = dataset.train_test_split(test_size=0.1)

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=data_collator,  # the collator is a trainer argument, not a training argument
    tokenizer=processor,
)

Run the training:

trainer.train()
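Optionally, and not shown in the slides, save the fine-tuned model and processor for later reuse (the directory name is just an example):

trainer.save_model("speecht5_finetuned_voxpopuli_it")
processor.save_pretrained("speecht5_finetuned_voxpopuli_it")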


Using the new model

text = "se sono italiano posso cantare l'opera lirica"


# Take a speaker embedding from the processed training set (any example works)
speaker_embedding = torch.tensor(dataset["train"][5]["speaker_embeddings"]).unsqueeze(0)
inputs = processor(text=text, return_tensors="pt")
speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)
make_spectrogram(speech)
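`make_spectrogram` is a course helper whose definition isn't shown. A minimal stand-in using matplotlib, plus saving the waveform with soundfile (SpeechT5 outputs 16 kHz audio):

import matplotlib.pyplot as plt
import soundfile as sf

def make_spectrogram(waveform, sampling_rate=16000):
    # Plot a spectrogram of the generated waveform
    plt.specgram(waveform.numpy(), Fs=sampling_rate)
    plt.xlabel("Time (s)")
    plt.ylabel("Frequency (Hz)")
    plt.show()

# Save the generated audio to listen to it
sf.write("speech_it.wav", speech.numpy(), samplerate=16000)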


[Image: spectrogram of the generated Italian speech]


Let's practice!

