Speech-to-text

Multi-Modal Systems with the OpenAI API

James Chapman

Curriculum Manager, DataCamp

Coming up...

Course goals

OpenAI's audio models
Text moderation
Case study: Customer support chatbot

An image showing audio models, text moderation and a case study

Recap...

from openai import OpenAI

# Create the OpenAI client
client = OpenAI(api_key="<OPENAI_API_TOKEN>")

# Create a request to the Chat Completions endpoint
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": "What is the OpenAI API?"}]
)

No API key required: it’s already set up for you 🎉

Recap...

# Extract the content from the response
print(response.choices[0].message.content)

The OpenAI API is a cloud-based service provided by OpenAI that allows developers
to integrate advanced AI models into their applications.

OpenAI API goes beyond text 🚀

OpenAI's audio models

Speech-to-text capabilities:

Transcribe audio
Translate non-English audio
Supports mp3, mp4, mpeg, mpga, m4a, wav, and webm (25 MB limit)
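
Because of the format and size limits above, it can help to check a file before sending it. The following is a minimal sketch, not part of the OpenAI library: the helper name is_uploadable is hypothetical, and it assumes "25 MB" means 25 * 1024 * 1024 bytes.

```python
from pathlib import Path

# File extensions taken from the list of supported formats above
SUPPORTED_EXTENSIONS = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}

# Assumption: the 25 MB limit is interpreted as 25 * 1024 * 1024 bytes
MAX_BYTES = 25 * 1024 * 1024


def is_uploadable(path, size_bytes=None):
    """Return True if the extension is supported and the size is within the limit."""
    path = Path(path)
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        return False
    if size_bytes is None:
        # Read the size from disk when the caller did not supply one
        size_bytes = path.stat().st_size
    return size_bytes <= MAX_BYTES
```

For example, is_uploadable("meeting_recording.mp3") reads the file size from disk, while passing size_bytes lets you check a size you already know.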

Use cases:

Meeting transcripts
Video captions
Processing customer calls

An icon showing an audio recording and a text block.

Loading audio files

Example: transcribe meeting_recording.mp3

audio_file = open("meeting_recording.mp3", "rb")

If the file is located in a different directory

audio_file = open("path/to/file/meeting_recording.mp3", "rb")

Creating the transcription

Audio endpoint

audio_file = open("meeting_recording.mp3", "rb")

response = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file
)

print(response)

Transcription(text="Welcome everyone to the June product monthly. We'll get started in...)

¹ https://platform.openai.com/docs/guides/speech-to-text

The transcript

print(response.text)

Welcome everyone to the June product monthly. We'll get started in just a minute.
Alright, let's get started. Today's agenda will start with a spotlight from Chris
on the new mobile user onboarding flow, then we'll review how we're tracking on
our quarterly targets, and finally, we'll finish with another spotlight from Katie
who will discuss the upcoming branding updates...

Transcribing non-English audio

An icon showing an audio recording and a text block.

Transcribing workflow:

open() audio file
Send a transcription request
Extract the text
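
The three steps above can be wrapped in one small function. This is a sketch under assumptions: transcribe_file is a hypothetical helper name, and client is expected to be an OpenAI client like the one created in the recap. The with block is a design choice that closes the file once the request has returned.

```python
def transcribe_file(client, path, model="whisper-1"):
    """Open the audio file, send a transcription request, and return the text."""
    # Step 1: open() the audio file in binary mode; "with" closes it afterwards
    with open(path, "rb") as audio_file:
        # Step 2: send a transcription request to the Audio endpoint
        response = client.audio.transcriptions.create(
            model=model,
            file=audio_file
        )
    # Step 3: extract the text from the response
    return response.text
```

Usage would look like print(transcribe_file(client, "meeting_recording.mp3")).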

Creating translations

audio_file = open("non_english_audio.m4a", "rb")

response = client.audio.translations.create(
    model="whisper-1",
    file=audio_file
)

print(response.text)

The search volume for keywords like A I has increased rapidly since the launch of
Cha GTP.

Transcription performance

Performance can vary widely depending on:
- Audio quality
- Audio language
- Model's knowledge of the subject matter

Different languages around the world.
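
One possible way to address the language and subject-matter factors is to pass the optional language and prompt parameters of the transcriptions endpoint. The sketch below only builds the keyword arguments. The helper name build_transcription_options is hypothetical, and how much these hints help will depend on the audio.

```python
def build_transcription_options(audio_file, language=None, prompt=None,
                                model="whisper-1"):
    """Collect keyword arguments for client.audio.transcriptions.create()."""
    options = {"model": model, "file": audio_file}
    if language is not None:
        # ISO-639-1 code for the language spoken in the audio, e.g. "pt"
        options["language"] = language
    if prompt is not None:
        # Short text giving context, such as names or terms likely to appear
        options["prompt"] = prompt
    return options
```

The result could then be passed on as client.audio.transcriptions.create(**options), for instance with a prompt that mentions product names or acronyms from the recording.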

Let's practice!
