
I am trying to create a speech recognition dataset, especially for Indian accents. I am taking help from colleagues to build it. Every day I send them an article link and ask them to record themselves reading it and upload the recording to Google Drive.

I have a problem with this approach: all the audio recordings are 5–7 minutes long, but I am using the DeepSpeech model, which requires audio clips of about 10 seconds.

Please suggest an approach to segment the audio files into corresponding sentence-level phrases, or a better way to work with the 5-minute audio files. Suggestions on a better way to create a speech-to-text dataset are more than welcome.


2 Answers


What makes this tricky, I assume, is that you don't know exactly what text corresponds with each audio clip. Assuming the modeling objective is something like:

x = audio snippet
y = textual transcription
target = desired text

then you need to know what text corresponds with which piece of audio.

You could use a slightly laxer modeling objective: instead of comparing the output sequence to some desired sequence, treat the output as a bag of words and evaluate the model on whether all of those words exist in the target text. If the model is not already sophisticated enough to transcribe some of the audio, this won't help much; but if you have an existing model that's 80% of the way there and you want to fine-tune it, this could be an effective approach.

If you took this approach, then Jon's answer works nicely: grab a random window, transcribe it, then check whether the resulting words exist in the desired output. If they do, reward the model; if they don't, penalize it. You might do this through a custom loss function, probably something similar to reinforcement learning, or perhaps group relative policy optimization.
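The reward signal described above can be sketched in a few lines. This is a minimal illustration, not a full training loop: `bag_of_words_reward` is a hypothetical helper name, and it only checks word membership, ignoring word order entirely.

```python
def bag_of_words_reward(predicted_text: str, target_text: str) -> float:
    """Fraction of predicted words that appear anywhere in the target
    transcript. Returns 1.0 when every predicted word occurs in the
    target (order ignored), proportionally less otherwise.
    """
    predicted = predicted_text.lower().split()
    target = set(target_text.lower().split())
    if not predicted:
        return 0.0
    hits = sum(1 for word in predicted if word in target)
    return hits / len(predicted)

# Reward a window whose transcription is a subset of the article text:
reward = bag_of_words_reward("the quick fox", "the quick brown fox jumps")
# reward == 1.0; a transcription with out-of-article words scores lower.
```

In a fine-tuning loop, this score (or its negation) would feed the custom loss; a real setup would also want normalization of punctuation and casing beyond the simple `lower().split()` used here.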


The typical approach is to cut the clips into consecutive sections and run the model on each section. Sometimes a bit of overlap is used, say 10%; then you have to decide how to resolve potential conflicts in the overlapping regions. A good model is usually robust to silence; otherwise, you can trim silence from the start and end of each 10-second window.

`librosa.util.frame` is a practical way of doing this in Python.
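As a dependency-free sketch of the windowing that `librosa.util.frame` performs, the following slices a long signal into fixed-length frames with a chosen hop. The 16 kHz sample rate and 10% overlap are assumptions for illustration; note that librosa's own function handles axis layout and strided (non-copying) views, which this simple version does not.

```python
import numpy as np

def frame_signal(signal: np.ndarray, frame_length: int, hop_length: int) -> np.ndarray:
    """Slice `signal` into consecutive windows of `frame_length` samples,
    starting every `hop_length` samples. Trailing samples that don't fill
    a complete frame are dropped. Returns shape (n_frames, frame_length).
    """
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    return np.stack([signal[i * hop_length : i * hop_length + frame_length]
                     for i in range(n_frames)])

# Example: 5 minutes of audio at an assumed 16 kHz sample rate,
# cut into 10-second windows with 10% overlap between neighbors.
sr = 16000
audio = np.zeros(5 * 60 * sr)      # stand-in for a loaded waveform
frame_len = 10 * sr                # 10-second windows
hop = int(frame_len * 0.9)         # 10% overlap
frames = frame_signal(audio, frame_len, hop)
# frames.shape == (33, 160000): 33 ten-second clips from one 5-minute file
```

Each row of `frames` can then be fed to the model independently; the overlap means words cut at one frame boundary are usually intact in the neighboring frame.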

