Multi-Modal Models with Hugging Face
James Chapman
Curriculum Manager, DataCamp
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert/distilbert-base-uncased')
text = "Do you need more éclairs?"
print(tokenizer.backend_tokenizer.normalizer.normalize_str(text))
do you need more eclairs?
tokenizer(text, return_tensors='pt', padding=True)
{'input_ids': tensor([[ 101, ..., 102]]), ...}
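The same normalization runs automatically inside the tokenizer before the text is split into wordpieces. As a quick check (a minimal sketch reusing tokenizer and text from above; the exact ids depend on the checkpoint's vocabulary), the input ids decode back to the normalized string:

# Decode the input ids back to text to see the normalization that was applied
encoded = tokenizer(text, return_tensors='pt')
print(tokenizer.decode(encoded['input_ids'][0]))
# e.g. [CLS] do you need more eclairs? [SEP]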
Multimodal tasks require consistent preprocessing:
from transformers import BlipProcessor, BlipForConditionalGeneration
checkpoint = "Salesforce/blip-image-captioning-base"
model = BlipForConditionalGeneration.from_pretrained(checkpoint)
processor = BlipProcessor.from_pretrained(checkpoint)
Encode the image → generate text token IDs → decode them to a caption
from datasets import load_dataset

image = load_dataset("nlphuji/flickr30k")['test'][11]["image"]
inputs = processor(images=image, return_tensors="pt")
output = model.generate(**inputs)
print(processor.decode(output[0], skip_special_tokens=True))
a group of people jumping
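BLIP can also generate captions conditioned on a text prompt: the prompt is passed to the processor along with the image, and the model completes it. A minimal sketch reusing the image, processor, and model above (the prompt string is illustrative):

# Prompt-guided (conditional) captioning: the generated caption continues the prompt
prompt = "a photograph of"
inputs = processor(images=image, text=prompt, return_tensors="pt")
output = model.generate(**inputs)
print(processor.decode(output[0], skip_special_tokens=True))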
Feature extraction as model input (spectrogram)
from datasets import load_dataset, Audio
dataset = load_dataset("CSTR-Edinburgh/vctk")["train"]
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained("openai/whisper-small")
audio_pp = processor(dataset[0]["audio"]["array"],
                     sampling_rate=16_000, return_tensors="pt")
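The processor returns the log-mel spectrogram under input_features, which can be fed directly to a speech-to-text model. A minimal sketch, assuming the matching openai/whisper-small checkpoint:

from transformers import WhisperForConditionalGeneration

# Load the speech-to-text model that matches the processor
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Generate token ids from the spectrogram features, then decode them to text
predicted_ids = model.generate(audio_pp.input_features)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True))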