The Python package vosk-api ticked my boxes: open source, respects privacy (works 'offline'), and supports the languages I'm interested in: English, French, Spanish. The list of supported languages is currently limited but growing, and I was lucky that it covered my needs. Getting started took me a little while, so in this answer I'd like to detail a few steps.
The audio must first be converted to the correct wav format.
Long texts should be read and transcribed in chunks.
STEP 1: Convert to WAV
#!/usr/bin/env python3 # -*- coding: utf-8 -*- """ Convert common audio file formats to wav Also installed PyAudio, ffmpeg: conda install PyAudio conda config --add channels conda-forge conda install ffmpeg See which formats are supported by ffmpeg: ffmpeg -formats """ import os import subprocess def convert_to_wav(source:str): """ Convert common audio file formats like mp3 to the wav format Args: source: path to source file with extension '.mp3', '.ogg', etc. Return: output: path to output file with extension '.wav' Help: option -y to overwrite existing file. """ outdir, ext = os.path.splitext(source) output = outdir+'.wav' try: # basic conversion: # process = subprocess.run(['ffmpeg', '-y', '-i', source, output]) # conversion to format expected by vosk: process = subprocess.run(['ffmpeg', '-y', '-i', source, '-ar', '16000', '-ac', '1', output]) except Exception as e: print(str(e)) return output # make path to the audio file: several input formats are supported filesdir = '/path/to/audio-files' filename = 'nixon-resignation-cleaned-1974-08-08.ogg' #filename = 'churchill-finest-hour-160k-1940-06-18.mp3' filepath = os.path.join(filesdir, filename) # convert audio file to wav: convert_to_wav(filepath)
I arrived at the ffmpeg options (-ar 16000 for a 16 kHz sample rate, -ac 1 for mono) by trial and error after finding that vosk-api complained about the format of my WAV audio files.
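If in doubt, the converted file can be checked with the standard wave module before feeding it to vosk. This is just a quick sanity check (it reuses filesdir from the script above); the expected values in the comments simply mirror the ffmpeg options I ended up with:

import wave

def check_wav(path: str):
    """Print the WAV properties vosk cares about."""
    with wave.open(path, 'rb') as w:
        print('sample rate :', w.getframerate())   # expect 16000 with the ffmpeg options above
        print('channels    :', w.getnchannels())   # expect 1 (mono)
        print('sample width:', w.getsampwidth())   # expect 2 bytes, i.e. 16-bit PCM

check_wav(os.path.join(filesdir, 'nixon-resignation-cleaned-1974-08-08.wav'))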
STEP 2: Convert WAV to TEXT
#!/usr/bin/env python3 # -*- coding: utf-8 -*- """ Speech Recognition with Python and Vosk Install vosk on linux: pip install https://github.com/alphacep/vosk-api/releases/download/0.3.7/vosk-0.3.7-cp37-cp37m-linux_aarch64.whl Install vosk on MacOS: pip install -U https://github.com/alphacep/vosk-api/releases/download/0.3.7/vosk-0.3.7-cp38-cp38-macosx_10_12_x86_64.whl Download the language model from https://github.com/alphacep/vosk-android-demo/releases, unpack it in the current directory, and renamed it as 'model-en'. KaldiRecognizer usage: model = Model(path/to/model) KaldiRecognizer(model, freq): second argument freq is the source sample frequency Progress bar: pip install progressbar2 """ import os import sys import wave from vosk import Model, KaldiRecognizer import json import progressbar # !! progressbar2 under the hood def convert_wav_to_txt(source:str, language='English'): """ Interprets a wav file with the Vosk Speech Recognition API and saves the transcription to a text file. source: wav file format mono PCM """ # set up the destination file: filename = os.path.splitext(os.path.basename(source))[0] outdir = os.path.abspath(os.path.join(os.path.splitext(source)[0], os.pardir, os.pardir, 'output', filename)) outfile = outdir+'.txt' # set up the model: d = {'English': 'vosk-model-small-en-us-0.3', 'French': 'vosk-model-small-fr-pguyot-0.3', 'Spanish': 'vosk-model-small-es-0.3'} modeldir = d[language] modelpath = os.path.abspath(os.path.join(outdir, os.pardir, os.pardir, 'models', modeldir)) model = Model(modelpath) # set up recognizer: with wave.open(source, 'rb') as audio: freq = audio.getframerate() recognizer = KaldiRecognizer(model, freq) total = audio.getnframes() # initialize a list to hold chunks chunks = [] # set bytes size to be processed at each iteration: chunk_size = 2000 # initialize counter and progress bar count = 0 widgets = [progressbar.Percentage(), progressbar.Bar(marker='■')] # widgets = [progressbar.Percentage(), progressbar.Bar()] # process audio file: with open(source, 'rb') as audio: audio.read(44) #skip header # set up a progress bar for long jobs with progressbar.ProgressBar(widgets=widgets, max_value=10) as bar: while True: # read chunk by chunk data = audio.read(chunk_size) if len(data) == 0: break # append text if recognizer.AcceptWaveform(data): words = json.loads(recognizer.Result()) chunks.append(words) count += chunk_size bar.update(count/total) words = json.loads(recognizer.FinalResult()) chunks.append(words) chunks = [t for t in chunks if 'result' in t] transcript = [t for t in chunks if len(t['result']) != 0] phrases = [t['text'] for t in transcript] text = ' '.join(phrases) # write text to file: with open(outfile, 'w') as output: print(text, file=output) # print full path to output file: return print('\nOutput saved in:\n', outfile) # make path to wav audio file: filesdir = '/path/to/audio-files' filename = 'de-gaulle-appel-18-juin-160k-1940-06-18.wav' # convert French audio: filepath = os.path.join(filesdir, filename) convert_wav_to_txt(filepath, language='French')
REMARKS:
- pip3 install vosk didn't work for me; see the instructions in the docstring above for installing vosk from a wheel instead.
- I added a progress bar because some files take a while to transcribe and I couldn't tell whether the system was hanging or working in the background.
- I put the code together from bits and pieces found on GitHub, so, for instance, I'm not sure what a good byte size is for each chunk.
- I'm not entirely sure why recognizer.FinalResult() is needed in addition to recognizer.Result().
- I struggled a bit with the differences between open() and wave.open(). In particular, I couldn't call audio.read() on the object returned by wave.open() (apparently a known limitation), but I wanted the number of frames before processing, so I ended up opening the file once with wave.open() to count frames and again with open() to process them, which feels dodgy (see the sketch below).
- I used the json package because I saw that approach used by others, but I don't think it's strictly necessary.
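On the open() vs wave.open() point: the object returned by wave.open() has no generic read() method, only readframes(), which is why audio.read() fails on it. Below is a minimal sketch of the same reading loop using readframes() alone, so the file is opened once and the 44-byte header is handled by the wave module (4000 frames per read is an arbitrary choice, not a recommendation). FinalResult() is still called at the end to flush whatever the recognizer has buffered after the last chunk, which is the reason it appears alongside Result() in the code above:

import json
import wave
from vosk import Model, KaldiRecognizer

def transcribe(source: str, modelpath: str) -> str:
    """Transcribe a mono PCM wav file, reading it only once with the wave module."""
    model = Model(modelpath)
    results = []
    with wave.open(source, 'rb') as audio:
        recognizer = KaldiRecognizer(model, audio.getframerate())
        total = audio.getnframes()          # frame count available on the same handle
        while True:
            data = audio.readframes(4000)   # readframes(), not read(); count is arbitrary
            if len(data) == 0:
                break
            if recognizer.AcceptWaveform(data):
                results.append(json.loads(recognizer.Result()))
        results.append(json.loads(recognizer.FinalResult()))
    return ' '.join(r['text'] for r in results if r.get('text'))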
I got reasonably good transcriptions from a famous Nixon speech and a famous de Gaulle speech in French, but not such a good one for Churchill's "Finest Hour" speech: Churchill's pronunciation is horrible! Eventually I want to add a grammar/spelling check to the final text to improve legibility (see the sketch below).
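As a very rough first pass at that clean-up, spelling correction with TextBlob might do. This is only a sketch: TextBlob.correct() does naive word-by-word spelling correction, works for English only, and is slow on long texts, and the output path shown is just illustrative of where convert_wav_to_txt writes its transcript:

from textblob import TextBlob   # pip install textblob

# illustrative path: the transcript written by convert_wav_to_txt in STEP 2
with open('/path/to/output/nixon-resignation-cleaned-1974-08-08.txt') as f:
    raw = f.read()

cleaned = str(TextBlob(raw).correct())   # naive spelling correction, no grammar or punctuation
print(cleaned)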
This is a first foray, still much to learn...