Preprocessing different modalities

Multi-Modal Models with Hugging Face

James Chapman

Curriculum Manager, DataCamp

Preprocessing text

[Image: an example text string]

[Flow diagram: normalization → (pre-)tokenization → ID conversion → padding]

  • Tokenizer: maps text → model input (see the sketch below)
    • Normalization: lowercasing, removing special characters, removing extra whitespace
    • (Pre-)tokenization: splitting text into words/subwords
    • ID conversion: mapping tokens to integers using a vocabulary
    • Padding: adding extra tokens for a consistent sequence length
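Each stage can be inspected on its own. A minimal sketch, assuming a BERT-style fast tokenizer (the DistilBERT checkpoint is just an example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert/distilbert-base-uncased')

# (Pre-)tokenization: split the normalized string into words/subwords
tokens = tokenizer.tokenize("Do you need more éclairs?")
print(tokens)  # subword pieces, e.g. ['do', 'you', 'need', 'more', ...]

# ID conversion: look each token up in the vocabulary
print(tokenizer.convert_tokens_to_ids(tokens))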

Preprocessing text

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('distilbert/distilbert-base-uncased')

text = "Do you need more éclairs?"
print(tokenizer.backend_tokenizer.normalizer.normalize_str(text))
do you need more eclairs?
tokenizer(text, return_tensors='pt', padding=True)
{'input_ids': tensor([[  101,  ..., 102]]), ...}
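padding=True only has an effect on a batch: shorter sequences are padded to the longest one, and the attention mask flags real tokens versus padding. A minimal sketch reusing the tokenizer above:

batch = tokenizer(["Do you need more éclairs?", "Yes!"],
                  return_tensors='pt', padding=True)
print(batch['input_ids'].shape)    # both rows padded to the same length
print(batch['attention_mask'])     # 1 = real token, 0 = padding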

Preprocessing images

  • Normalization: rescaling pixel intensities
  • Resize: match the model's expected input size
  • General rule → reuse the preprocessing of the original model (see the sketch below)

[Image: a group of people jumping, before and after preprocessing]

1 https://huggingface.co/datasets/nlphuji/flickr30k
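One way to follow that rule is to load the original model's image processor and inspect its resize and normalization settings. A minimal sketch (the BLIP checkpoint is just an example):

from transformers import AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

# Target input resolution and the mean/std used to normalize pixel intensities
print(image_processor.size)  # e.g. {'height': 384, 'width': 384}
print(image_processor.image_mean, image_processor.image_std)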

Preprocessing images

Multimodal tasks require consistent preprocessing:

from datasets import load_dataset
from transformers import BlipProcessor, BlipForConditionalGeneration

checkpoint = "Salesforce/blip-image-captioning-base"
model = BlipForConditionalGeneration.from_pretrained(checkpoint)
processor = BlipProcessor.from_pretrained(checkpoint)

Encode image → transform to text encoding → decode text

image = load_dataset("nlphuji/flickr30k")['test'][11]["image"]
inputs = processor(images=image, return_tensors="pt")

output = model.generate(**inputs)
print(processor.decode(output[0], skip_special_tokens=True))
a group of people jumping
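BLIP also supports conditional captioning: pass an optional text prompt and the generated caption continues it. A short sketch continuing from the code above (the prompt string is arbitrary):

prompt = "a photography of"
inputs = processor(images=image, text=prompt, return_tensors="pt")
output = model.generate(**inputs)
print(processor.decode(output[0], skip_special_tokens=True))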

Preprocessing audio

[Image: audio spectrum]

  • Audio preprocessing:
    • Sequential array → filtering/padding
    • Sampling rate → resampling

Feature extraction as model input (spectrogram):

[Image: spectrogram]

Multi-Modal Models with Hugging Face

Preprocessing audio

from datasets import load_dataset, Audio

dataset = load_dataset("CSTR-Edinburgh/vctk")["train"]
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

  • Model-specific full preprocessors should be available:
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained("openai/whisper-small")
audio_pp = processor(dataset[0]["audio"]["array"], 
                     sampling_rate=16_000, return_tensors="pt")
  • Sampling rate must match model input requirements
1 https://huggingface.co/datasets/CSTR-Edinburgh/vctk
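To check what the processor produced, inspect the returned tensor (continuing from the code above; Whisper's feature extractor emits log-mel spectrogram features padded to a fixed 30-second window):

print(audio_pp["input_features"].shape)  # e.g. torch.Size([1, 80, 3000])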

Let's practice!
