Multimodal QA tasks

Multi-Modal Models with Hugging Face

James Chapman

Curriculum Manager, DataCamp

Multimodal QA tasks

Diagram showing the processing of image and text in a VQA model

  1. Separate encoding of question text and other modality
  2. Combination of encoded features
  3. Additional model layers to predict answer tokens, as sketched below
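
A minimal, hypothetical sketch of these three stages in PyTorch (the ToyVQA class, dimensions, and fusion-by-concatenation are invented for illustration, not a real pretrained architecture):

import torch
import torch.nn as nn

class ToyVQA(nn.Module):
    def __init__(self, text_dim=128, image_dim=256, hidden=64, num_answers=10):
        super().__init__()
        # 1. Separate encoders for the question text and the other modality
        self.text_encoder = nn.Linear(text_dim, hidden)
        self.image_encoder = nn.Linear(image_dim, hidden)
        # 3. Additional layers that map fused features to answer logits
        self.classifier = nn.Linear(hidden * 2, num_answers)

    def forward(self, text_feats, image_feats):
        t = self.text_encoder(text_feats)
        i = self.image_encoder(image_feats)
        # 2. Combine the encoded features (here: simple concatenation)
        fused = torch.cat([t, i], dim=-1)
        return self.classifier(fused)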

VQA

import requests
from PIL import Image

# URL of the image to query
url = (
    "https://www.worldanimalprotection.org/cdn-cgi/image/"
    "width=1920,format=auto/globalassets/images/elephants/"
    "1033551-elephant.jpg"
)

# Download the image and open it with PIL
image = Image.open(requests.get(url, stream=True).raw)

# The question to ask about the image
text = "What animal is in this photo?"

Picture of an elephant in the wild

VQA

  • Model knows image and text features of many objects
  • Reusable models with no extra fine-tuning

Diagram showing a model focusing in on an animal and identifying it

VQA

from transformers import ViltProcessor, ViltForQuestionAnswering

# Load the ViLT processor and VQA-fine-tuned model
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

# Encode the image-question pair, run the model, and decode the top answer
encoding = processor(image, text, return_tensors="pt")
outputs = model(**encoding)
idx = outputs.logits.argmax(-1).item()
print("Predicted answer:", model.config.id2label[idx])
Predicted answer: elephant
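
The argmax keeps only the single best answer. To gauge the model's confidence, you can also rank the top few candidates; a small usage sketch with standard PyTorch calls (not shown in the original):

import torch

# Convert logits to probabilities and list the five most likely answers
probs = torch.softmax(outputs.logits, dim=-1)
top_probs, top_idxs = probs[0].topk(5)
for p, i in zip(top_probs, top_idxs):
    print(f"{model.config.id2label[i.item()]}: {p.item():.3f}")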

Document-text to text

  • Extension of VQA to detect graphs, tables, and text (OCR) in images

from datasets import load_dataset
from transformers import pipeline
import matplotlib.pyplot as plt

# Load the DocVQA document question-answering dataset
dataset = load_dataset("lmms-lab/DocVQA")

# Display one of the test document images
plt.imshow(dataset["test"][2]["image"])
plt.show()

Image of document with graphs and bar charts
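
Each example pairs a document image with question text; a quick inspection sketch (field names such as "question" are assumed from the DocVQA schema and may vary by dataset version):

# Peek at one example's fields
example = dataset["test"][2]
print(example.keys())
print(example.get("question", "no 'question' field in this version"))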

Document-text to text

Google Tesseract logo

  • Extra dependencies are needed to run OCR
  • pytesseract, installed via pip
  • Tesseract OCR, installed via a system package manager (e.g. apt-get, a Windows installer, or Homebrew/MacPorts); typical commands are shown below
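
For example, on Ubuntu or macOS the installs might look like this (exact commands and package names depend on your platform):

pip install pytesseract
sudo apt-get install tesseract-ocr    # Debian/Ubuntu
brew install tesseract                # macOS with Homebrew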

Picture of a coffee sign with its OCR output

Document-text to text

from transformers import pipeline

# Load a document question-answering pipeline backed by LayoutLM
pipe = pipeline("document-question-answering", "impira/layoutlm-document-qa")

# Ask a question about the document image
result = pipe(
    dataset["test"][2]["image"],
    "What was the gross income in 2011-2012?",
)

print(result)
[{'score': 0.05149758607149124,
  'answer': '3 36073 Crores', ...}]
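
The pipeline returns a list of candidate answers ranked by score, so the top answer string can be pulled out directly (a small usage sketch, not from the original):

# Take the highest-scoring candidate's answer text
print(result[0]["answer"])  # '3 36073 Crores'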

Let's practice!
