The Python package vosk-api ticked my boxes: open source, respects privacy (works 'offline'), and supports the languages I'm interested in: English, French, Spanish. The list of supported languages is currently limited but growing, and I was lucky that it covered my needs. Getting started took me a little while, so in this answer I'd like to detail a few steps.
The audio must first be converted to the correct wav format.
Long texts should be read and transcribed in chunks.
STEP 1: Convert to WAV
#!/usr/bin/env python3 # -*- coding: utf-8 -*- """ Convert common audio file formats to wav Also installed PyAudio, ffmpeg: conda install PyAudio conda config --add channels conda-forge conda install ffmpeg See which formats are supported by ffmpeg: ffmpeg -formats """ import os import subprocess def convert_to_wav(source:str): """ Convert common audio file formats like mp3 to the wav format Args: source: path to source file with extension '.mp3', '.ogg', etc. Return: output: path to output file with extension '.wav' Help: option -y to overwrite existing file. """ outdir, ext = os.path.splitext(source) output = outdir+'.wav' try: # basic conversion: # process = subprocess.run(['ffmpeg', '-y', '-i', source, output]) # conversion to format expected by vosk: process = subprocess.run(['ffmpeg', '-y', '-i', source, '-ar', '16000', '-ac', '1', output]) except Exception as e: print(str(e)) return output # make path to the audio file: several input formats are supported filesdir = '/path/to/audio-files' filename = 'nixon-resignation-cleaned-1974-08-08.ogg' #filename = 'churchill-finest-hour-160k-1940-06-18.mp3' filepath = os.path.join(filesdir, filename) # convert audio file to wav: convert_to_wav(filepath)
I arrived at the ffmpeg options (-ar 16000 for a 16 kHz sample rate, -ac 1 for mono) by trial and error after finding that vosk-api complained about the format of my WAV audio files.
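If in doubt, the converted file can be checked with the standard wave module before feeding it to vosk. This is just a quick sanity check (it reuses filesdir from the script above); the expected values in the comments simply mirror the ffmpeg options I ended up with:

import wave

def check_wav(path: str):
    """Print the WAV properties vosk cares about."""
    with wave.open(path, 'rb') as w:
        print('sample rate :', w.getframerate())   # expect 16000 with the ffmpeg options above
        print('channels    :', w.getnchannels())   # expect 1 (mono)
        print('sample width:', w.getsampwidth())   # expect 2 bytes, i.e. 16-bit PCM

check_wav(os.path.join(filesdir, 'nixon-resignation-cleaned-1974-08-08.wav'))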
STEP 2: Convert WAV to TEXT
#!/usr/bin/env python3 # -*- coding: utf-8 -*- """ Speech Recognition with Python and Vosk Install vosk on linux: pip install https://github.com/alphacep/vosk-api/releases/download/0.3.7/vosk-0.3.7-cp37-cp37m-linux_aarch64.whl Install vosk on MacOS: pip install -U https://github.com/alphacep/vosk-api/releases/download/0.3.7/vosk-0.3.7-cp38-cp38-macosx_10_12_x86_64.whl Download the language model from https://github.com/alphacep/vosk-android-demo/releases, unpack it in the current directory, and renamed it as 'model-en'. KaldiRecognizer usage: model = Model(path/to/model) KaldiRecognizer(model, freq): second argument freq is the source sample frequency Progress bar: pip install progressbar2 """ import os import sys import wave from vosk import Model, KaldiRecognizer import json import progressbar # !! progressbar2 under the hood def convert_wav_to_txt(source:str, language='English'): """ Interprets a wav file with the Vosk Speech Recognition API and saves the transcription to a text file. source: wav file format mono PCM """ # set up the destination file: filename = os.path.splitext(os.path.basename(source))[0] outdir = os.path.abspath(os.path.join(os.path.splitext(source)[0], os.pardir, os.pardir, 'output', filename)) outfile = outdir+'.txt' # set up the model: d = {'English': 'vosk-model-small-en-us-0.3', 'French': 'vosk-model-small-fr-pguyot-0.3', 'Spanish': 'vosk-model-small-es-0.3'} modeldir = d[language] modelpath = os.path.abspath(os.path.join(outdir, os.pardir, os.pardir, 'models', modeldir)) model = Model(modelpath) # set up recognizer: with wave.open(source, 'rb') as audio: freq = audio.getframerate() recognizer = KaldiRecognizer(model, freq) total = audio.getnframes() # initialize a list to hold chunks chunks = [] # set bytes size to be processed at each iteration: chunk_size = 2000 # initialize counter and progress bar count = 0 widgets = [progressbar.Percentage(), progressbar.Bar(marker='■')] # widgets = [progressbar.Percentage(), progressbar.Bar()] # process audio file: with open(source, 'rb') as audio: audio.read(44) #skip header # set up a progress bar for long jobs with progressbar.ProgressBar(widgets=widgets, max_value=10) as bar: while True: # read chunk by chunk data = audio.read(chunk_size) if len(data) == 0: break # append text if recognizer.AcceptWaveform(data): words = json.loads(recognizer.Result()) chunks.append(words) count += chunk_size bar.update(count/total) words = json.loads(recognizer.FinalResult()) chunks.append(words) chunks = [t for t in chunks if 'result' in t] transcript = [t for t in chunks if len(t['result']) != 0] phrases = [t['text'] for t in transcript] text = ' '.join(phrases) # write text to file: with open(outfile, 'w') as output: print(text, file=output) # print full path to output file: return print('\nOutput saved in:\n', outfile) # make path to wav audio file: filesdir = '/path/to/audio-files' filename = 'de-gaulle-appel-18-juin-160k-1940-06-18.wav' # convert French audio: filepath = os.path.join(filesdir, filename) convert_wav_to_txt(filepath, language='French')
REMARKS:
- pip3 install vosk didn't work for me; see the instructions in the docstring above for installing vosk from a wheel instead.
- I added a progress bar because some files take a while to transcribe and I couldn't tell whether the system was hanging or working in the background.
- I put the code together from bits and pieces found on GitHub, so, for instance, I'm not sure what a good byte size is for each chunk.
- I'm not entirely sure why recognizer.FinalResult() is needed in addition to recognizer.Result().
- I struggled a bit with the differences between open() and wave.open(). In particular, I couldn't call audio.read() on the object returned by wave.open() (apparently a known limitation), but I wanted the number of frames before processing, so I ended up opening the file once with wave.open() to count frames and again with open() to process them, which feels dodgy (see the sketch below).
- I used the json package because I saw that approach used by others, but I don't think it's strictly necessary.
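On the open() vs wave.open() point: the object returned by wave.open() has no generic read() method, only readframes(), which is why audio.read() fails on it. Below is a minimal sketch of the same reading loop using readframes() alone, so the file is opened once and the 44-byte header is handled by the wave module (4000 frames per read is an arbitrary choice, not a recommendation). FinalResult() is still called at the end to flush whatever the recognizer has buffered after the last chunk, which is the reason it appears alongside Result() in the code above:

import json
import wave
from vosk import Model, KaldiRecognizer

def transcribe(source: str, modelpath: str) -> str:
    """Transcribe a mono PCM wav file, reading it only once with the wave module."""
    model = Model(modelpath)
    results = []
    with wave.open(source, 'rb') as audio:
        recognizer = KaldiRecognizer(model, audio.getframerate())
        total = audio.getnframes()          # frame count available on the same handle
        while True:
            data = audio.readframes(4000)   # readframes(), not read(); count is arbitrary
            if len(data) == 0:
                break
            if recognizer.AcceptWaveform(data):
                results.append(json.loads(recognizer.Result()))
        results.append(json.loads(recognizer.FinalResult()))
    return ' '.join(r['text'] for r in results if r.get('text'))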
I got reasonably good transcriptions from a famous Nixon speech and a famous de Gaulle speech in French, but not such a good one for Churchill's "Finest Hour" speech: Churchill's pronunciation is horrible! Eventually I want to add a grammar/spelling check to the final text to improve legibility (see the sketch below).
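As a very rough first pass at that clean-up, spelling correction with TextBlob might do. This is only a sketch: TextBlob.correct() does naive word-by-word spelling correction, works for English only, and is slow on long texts, and the output path shown is just illustrative of where convert_wav_to_txt writes its transcript:

from textblob import TextBlob   # pip install textblob

# illustrative path: the transcript written by convert_wav_to_txt in STEP 2
with open('/path/to/output/nixon-resignation-cleaned-1974-08-08.txt') as f:
    raw = f.read()

cleaned = str(TextBlob(raw).correct())   # naive spelling correction, no grammar or punctuation
print(cleaned)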
This is a first foray, still much to learn...