Automatic speech recognition

Working with Hugging Face

Jacob H. Marquez

Lead Data Engineer

What is automatic speech recognition?

Speech soundwave turned into a text file.

Use cases of ASR

A person chatting with a digital assistant.

A digital assistant responding with the weather in Seattle.

A customer service agent can use ASR to:

Create transcripts
Find relevant documentation and solutions

Transcription using ASR for an online webinar

Models for ASR

Meta Wav2Vec logo

OpenAI Whisper

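Whisper outputs text with punctuation and casing, whereas English Wav2Vec2 checkpoints such as facebook/wav2vec2-base-960h output uppercase words without punctuation.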

Instantiating a pipeline for ASR

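from transformers import pipeline

# Load a Wav2Vec2 checkpoint fine-tuned on 960 hours of LibriSpeech
transcriber = pipeline(task="automatic-speech-recognition",
                       model="facebook/wav2vec2-base-960h")


# Path to audio file
transcriber("my_audio.wav")


# NumPy array of raw samples (assumed to already be at the model's 16 kHz rate)
transcriber(numpy_audio_array)


# Dictionary pairing the raw samples with their sampling rate
transcriber({"sampling_rate": 16_000, "raw": numpy_audio_array})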
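The dictionary form pairs a 1-D array of samples with the rate they were recorded at. A minimal sketch of building such an input, assuming NumPy and using a synthetic 440 Hz tone in place of real speech; `make_pipeline_input` is a hypothetical helper, not part of transformers:

```python
import numpy as np

def make_pipeline_input(duration_s: float = 1.0, sampling_rate: int = 16_000) -> dict:
    """Build a hypothetical ASR pipeline input: a mono float32 waveform plus its rate."""
    # Time axis: one sample every 1 / sampling_rate seconds
    t = np.arange(int(duration_s * sampling_rate)) / sampling_rate
    # Synthetic 440 Hz tone with amplitude 0.5; real audio would come from a file or dataset
    waveform = (0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)
    return {"sampling_rate": sampling_rate, "raw": waveform}

audio_input = make_pipeline_input()
print(audio_input["raw"].shape, audio_input["raw"].dtype)  # (16000,) float32
```

The result could be passed as `transcriber(audio_input)`; a pure tone will not produce meaningful text, so this only illustrates the expected shape of the input.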

Results from a pipeline

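from datasets import Audio

# Resample the audio column to 16 kHz, the rate the model expects
sampling_rate = 16_000
dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))


# Take the decoded waveform of the first example
audio_array = dataset[0]["audio"]["array"]


# The pipeline returns a dictionary; the transcript is stored under "text"
prediction = transcriber(audio_array)

print(prediction["text"])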

"what game do you want to play"
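Casting the column matters because a model trained on 16 kHz audio will misread clips recorded at other rates. To illustrate what resampling does, here is a naive sketch using NumPy linear interpolation; `resample_linear` is a hypothetical teaching helper, and the `datasets` library relies on a proper audio resampler rather than this:

```python
import numpy as np

def resample_linear(waveform: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Naively resample a 1-D waveform by linear interpolation (no anti-aliasing filter)."""
    duration_s = len(waveform) / orig_sr
    n_target = int(round(duration_s * target_sr))
    # Timestamps, in seconds, of the original and the target samples
    t_orig = np.arange(len(waveform)) / orig_sr
    t_target = np.arange(n_target) / target_sr
    # Read the original signal off at the new timestamps
    return np.interp(t_target, t_orig, waveform).astype(waveform.dtype)

# One second of audio at 48 kHz becomes 16,000 samples at 16 kHz
audio_48k = np.random.default_rng(0).standard_normal(48_000).astype(np.float32)
audio_16k = resample_linear(audio_48k, orig_sr=48_000, target_sr=16_000)
print(audio_16k.shape)  # (16000,)
```

Real resamplers low-pass filter before downsampling; skipping that step, as this sketch does, can fold high frequencies back into the signal as aliasing.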

Predicting over a dataset

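def data():
    # Yield (audio array, lowercased reference sentence) pairs, one example at a time
    for i in range(len(dataset)):
        yield dataset[i]["audio"]["array"], dataset[i]["sentence"].lower()


output = []

for audio, sentence in data():
    # Keep only the transcript text, lowercased to match the reference
    prediction = transcriber(audio)["text"].lower()
    output.append((prediction, sentence))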

[("what a nice black shirt", "what a nice blue shirt"), ...]
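Differences in case or punctuation count as word errors, which is why the reference sentences are lowercased before comparison; this matters even more when a model such as Whisper emits punctuation. A minimal normalization sketch, assuming that lowercasing, dropping punctuation other than apostrophes, and collapsing whitespace suits your data; `normalize_text` is a hypothetical helper:

```python
import re

def normalize_text(text: str) -> str:
    """Lowercase, strip punctuation (keeping apostrophes), and collapse whitespace."""
    text = text.lower()
    # Replace anything that is not a word character, apostrophe, or whitespace with a space
    text = re.sub(r"[^\w\s']", " ", text)
    # Collapse runs of whitespace and trim both ends
    return " ".join(text.split())

pairs = [("WHAT GAME DO YOU WANT TO PLAY", "What game do you want to play?")]
normalized = [(normalize_text(p), normalize_text(r)) for p, r in pairs]
print(normalized)  # [('what game do you want to play', 'what game do you want to play')]
```

Applying the same normalization to both predictions and references keeps the comparison focused on the words themselves.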

Evaluating ASR systems

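Word Error Rate (WER)
Based on the Levenshtein distance¹
A metric for the difference between two sequences

WER = (S + D + I) / N, where S, D, and I are the numbers of substituted, deleted, and inserted words, and N is the number of words in the reference

Usually ranges from 0 to 1 (it can exceed 1 when the prediction contains many inserted words)
A smaller value indicates closer similarity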

¹ https://en.wikipedia.org/wiki/Levenshtein_distance
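The formula can be computed directly as a word-level Levenshtein distance divided by the reference length. A minimal pure-Python sketch, assuming whitespace tokenization and no text normalization; `word_error_rate` is a hypothetical helper, not the `evaluate` implementation:

```python
def word_error_rate(reference: str, prediction: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = prediction.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference prefix so far and the first j predicted words
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from prediction
                curr[j - 1] + 1,     # insertion: extra word in prediction
                prev[j - 1] + cost,  # substitution (or match when cost == 0)
            )
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("what a nice blue shirt", "what a nice black shirt"))  # 0.2
```

As a usage check, `word_error_rate("a", "x y z")` returns 3.0: one substitution plus two insertions against a one-word reference, which is how WER can rise above 1.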

Word Error Rate

An example of using word error rate on a text string.

2 substitutions are required to match the reference

WER = 2 / 6 ≈ 0.33
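Counts such as "2 substitutions" come from tracing back through the edit-distance table to classify each error. A sketch of that backtrace, using a hypothetical six-word sentence pair (not the slide's example); `count_edits` is an illustrative helper, and when several minimal alignments exist it prefers matches and substitutions:

```python
def count_edits(reference: str, prediction: str) -> dict:
    """Count substitutions (S), deletions (D), and insertions (I) in a minimal word alignment."""
    ref, hyp = reference.split(), prediction.split()
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1, dist[i][j - 1] + 1, dist[i - 1][j - 1] + cost)
    # Walk back from the bottom-right corner to recover which edits were made
    counts = {"S": 0, "D": 0, "I": 0}
    i, j = n, m
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and ref[i - 1] != hyp[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + int(mismatch):
            counts["S"] += int(mismatch)  # diagonal move: substitution or match
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["D"] += 1  # reference word dropped
            i -= 1
        else:
            counts["I"] += 1  # extra predicted word
            j -= 1
    return counts

reference = "the cat sat on the mat"
edits = count_edits(reference, "the cat sat in the hat")
print(edits)  # {'S': 2, 'D': 0, 'I': 0}
print(round(sum(edits.values()) / len(reference.split()), 2))  # 0.33
```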

Computing WER using Hugging Face

from evaluate import load


# Instantiate word error rate metric (requires the jiwer package)
wer = load("wer")


# Save true sentence as reference
reference = dataset[0]["sentence"]

prediction = "I love DataCamp portraits on hay"

¹ https://huggingface.co/spaces/evaluate-metric/wer

Computing WER using Hugging Face

# Compute the WER between predictions and reference
wer_score = wer.compute(
    predictions=[prediction],
    references=[reference]
)


# Round for display; 2 / 6 would otherwise print as 0.3333333333333333
print(round(wer_score, 2))

0.33
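With several sentence pairs, our understanding is that the `evaluate` WER metric pools errors across all pairs (total errors divided by total reference words) rather than averaging per-sentence scores; check the metric card for your installed version. The two aggregations can differ noticeably, as this pure-Python sketch with made-up sentence pairs shows; `word_errors` is a hypothetical helper:

```python
def word_errors(reference: str, prediction: str) -> int:
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    ref, hyp = reference.split(), prediction.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]

references = ["what a nice blue shirt", "hi"]
predictions = ["what a nice black shirt", "bye"]

errors = [word_errors(r, p) for r, p in zip(references, predictions)]
lengths = [len(r.split()) for r in references]

pooled = sum(errors) / sum(lengths)  # (1 + 1) / (5 + 1)
averaged = sum(e / n for e, n in zip(errors, lengths)) / len(errors)  # (0.2 + 1.0) / 2
print(round(pooled, 2), round(averaged, 2))  # 0.33 0.6
```

Pooling weights each word equally, so a single short sentence with one error does not dominate the score the way it does in the per-sentence average.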

Let's practice!
