Speech-to-text

Multi-Modal Systems with the OpenAI API

James Chapman

Curriculum Manager, DataCamp

Coming up...

Course goals
  • OpenAI's audio models
  • Text moderation
  • Case study: Customer support chatbot

An image showing audio models, text moderation and a case study

Recap...

from openai import OpenAI

# Create the OpenAI client
client = OpenAI(api_key="<OPENAI_API_TOKEN>")

# Create a request to the Chat Completions endpoint
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is the OpenAI API?"}]
)
  • No API key required—it’s already set up for you 🎉

Recap...

# Extract the content from the response
print(response.choices[0].message.content)
The OpenAI API is a cloud-based service provided by OpenAI that allows developers
to integrate advanced AI models into their applications.

  • OpenAI API goes beyond text 🚀

OpenAI's audio models

Speech-to-text capabilities:

  • Transcribe audio
  • Translate non-English audio
  • Supports mp3, mp4, mpeg, mpga, m4a, wav, and webm (25 MB limit)
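
The format list and size limit above can be checked locally before any request is sent. The following is a minimal sketch, not part of the OpenAI library: `check_audio_file`, `SUPPORTED_EXTENSIONS`, and `MAX_BYTES` are hypothetical names, and reading "25 MB" as 25 * 1024 * 1024 bytes is an assumption.

```python
from pathlib import Path

# Formats and size limit as listed on the slide.
SUPPORTED_EXTENSIONS = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}
MAX_BYTES = 25 * 1024 * 1024  # assumption: "25 MB" read as 25 * 1024 * 1024 bytes


def check_audio_file(path):
    """Return a list of problems found; an empty list means the file looks OK."""
    path = Path(path)
    problems = []
    # Compare the extension case-insensitively against the supported formats
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {path.suffix or '(none)'}")
    # Only check the size if the file actually exists
    if not path.is_file():
        problems.append("file not found")
    elif path.stat().st_size > MAX_BYTES:
        problems.append("file is larger than the 25 MB limit")
    return problems
```

Calling `check_audio_file("meeting_recording.mp3")` returns an empty list when the file exists, has a supported extension, and fits within the limit.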

Use cases:

  • Meeting transcripts
  • Video captions
  • Processing customer calls

An icon showing an audio recording and a text block.

Loading audio files

Example: transcribe meeting_recording.mp3

audio_file = open("meeting_recording.mp3", "rb")

If the file is located in a different directory

audio_file = open("path/to/file/meeting_recording.mp3", "rb")
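
The `open()` calls above leave the file handle open until it is closed explicitly. A common alternative, sketched here, is a `with` block, which closes the file automatically. The helper name `read_audio_header` is hypothetical and only illustrates binary-mode reading; the directory and file names in the usage line are placeholders.

```python
from pathlib import Path


def read_audio_header(directory, filename, n_bytes=4):
    """Open an audio file in binary mode and return its first few bytes."""
    # Path joins directory and filename with the right separator for the OS
    path = Path(directory) / filename
    # "rb" = read + binary: audio data must not be decoded as text.
    # The with-block closes the file as soon as the block ends.
    with open(path, "rb") as audio_file:
        return audio_file.read(n_bytes)
```

For example, `read_audio_header("path/to/file", "meeting_recording.mp3")` returns a `bytes` object holding the first four bytes of the file.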

Creating the transcription

  • Audio endpoint
audio_file = open("meeting_recording.mp3", "rb")

response = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file
)
print(response)
Transcription(text="Welcome everyone to the June product monthly. We'll get started in...)
1 https://platform.openai.com/docs/guides/speech-to-text

The transcript

print(response.text)
Welcome everyone to the June product monthly. We'll get started in just a minute.
Alright, let's get started. Today's agenda will start with a spotlight from Chris
on the new mobile user onboarding flow, then we'll review how we're tracking on
our quarterly targets, and finally, we'll finish with another spotlight from Katie
who will discuss the upcoming branding updates...

Transcribing non-English audio

An icon showing an audio recording and a text block.

Transcribing workflow:

  1. open() audio file
  2. Send a transcription request
  3. Extract the text
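
The three steps above can be wrapped in one function. This is a minimal sketch: `transcribe_file` is a hypothetical helper name, and `client` is assumed to be an `OpenAI` client created as in the recap.

```python
def transcribe_file(client, path, model="whisper-1"):
    """Run the three workflow steps and return the transcript text."""
    # 1. open() the audio file in binary mode
    with open(path, "rb") as audio_file:
        # 2. Send a transcription request
        response = client.audio.transcriptions.create(model=model, file=audio_file)
    # 3. Extract the text
    return response.text
```

With the client from the recap, `transcribe_file(client, "meeting_recording.mp3")` would return the same string as `response.text`.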

Creating translations

audio_file = open("non_english_audio.m4a", "rb")


response = client.audio.translations.create(
    model="whisper-1",
    file=audio_file
)
print(response.text)
The search volume for keywords like A I has increased rapidly since the launch of
Cha GTP.

Transcription performance

  • Performance can vary wildly, depending on:
    • Audio quality
    • Audio language
    • Model's knowledge of the subject matter
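
Two of these factors can sometimes be nudged from the request itself. The sketch below assumes the optional `language` and `prompt` parameters of the transcription endpoint, which are not shown on these slides; `transcribe_with_hints` is a hypothetical helper, and whether the hints help on a given recording is not guaranteed.

```python
def transcribe_with_hints(client, path, language=None, prompt=None, model="whisper-1"):
    """Send a transcription request, passing optional hints only when given."""
    options = {}
    if language is not None:
        # Assumed parameter: language code of the audio, e.g. "pt"
        options["language"] = language
    if prompt is not None:
        # Assumed parameter: short text with names or terms expected in the audio
        options["prompt"] = prompt
    with open(path, "rb") as audio_file:
        response = client.audio.transcriptions.create(
            model=model, file=audio_file, **options
        )
    return response.text
```

For a recording that mentions product names, a call such as `transcribe_with_hints(client, "meeting_recording.mp3", prompt="AI, ChatGPT")` passes those terms as context for the model.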

Different languages around the world.

Let's practice!
