Fast speech recognition with NVIDIA's Parakeet models via ONNX Runtime.
Note: CoreML is unstable with this model. On Apple hardware, use the WebGPU EP (it uses Metal under the hood; despite the name, WebGPU is a native GPU standard, not web-only) or CPU. Even CPU alone is significantly faster on my Mac M3 (16GB) than Whisper with Metal! :-)
CTC (English-only):

```rust
use parakeet_rs::{Parakeet, Transcriber, TimestampMode};

let mut parakeet = Parakeet::from_pretrained(".", None)?;

// Load and transcribe audio (see examples/raw.rs for full example)
let result = parakeet.transcribe_samples(audio, 16000, 1, Some(TimestampMode::Words))?;
println!("{}", result.text);

// Token-level timestamps
for token in result.tokens {
    println!("[{:.3}s - {:.3}s] {}", token.start, token.end, token.text);
}
```

TDT (Multilingual): 25 languages with auto-detection
```rust
use parakeet_rs::{ParakeetTDT, Transcriber, TimestampMode};

let mut parakeet = ParakeetTDT::from_pretrained("./tdt", None)?;
let result = parakeet.transcribe_samples(audio, 16000, 1, Some(TimestampMode::Sentences))?;
println!("{}", result.text);

// Token-level timestamps
for token in result.tokens {
    println!("[{:.3}s - {:.3}s] {}", token.start, token.end, token.text);
}
```

EOU (Streaming): Real-time ASR with end-of-utterance detection
```rust
use parakeet_rs::ParakeetEOU;

let mut parakeet = ParakeetEOU::from_pretrained("./eou", None)?;

// Prepare your audio (Vec<f32>, 16kHz mono, normalized)
let audio: Vec<f32> = /* your audio samples */;

// Process in 160ms chunks for streaming
const CHUNK_SIZE: usize = 2560; // 160ms at 16kHz
for chunk in audio.chunks(CHUNK_SIZE) {
    let text = parakeet.transcribe(chunk, false)?;
    print!("{}", text);
}
```

Nemotron (Streaming): Cache-aware streaming ASR with punctuation
```rust
use parakeet_rs::Nemotron;

let mut model = Nemotron::from_pretrained("./nemotron", None)?;

// Process in 560ms chunks for streaming
const CHUNK_SIZE: usize = 8960; // 560ms at 16kHz
for chunk in audio.chunks(CHUNK_SIZE) {
    let text = model.transcribe_chunk(chunk)?;
    print!("{}", text);
}
```

Multitalker (Streaming Multi-Speaker ASR): Speaker-attributed transcription
```toml
parakeet-rs = { version = "0.3", features = ["multitalker"] }
```

```rust
use parakeet_rs::MultitalkerASR;

let mut model = MultitalkerASR::from_pretrained(
    "./multitalker",   // encoder, decoder, tokenizer
    "sortformer.onnx", // Sortformer v2 for diarization
    None,
)?;

for chunk in audio.chunks(17920) { // ~1.12s at 16kHz
    let results = model.transcribe_chunk(chunk)?;
    for r in &results {
        println!("[Speaker {}] {}", r.speaker_id, r.text);
    }
}
```

See examples/multitalker.rs for full usage with latency modes.
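The chunk sizes used by the streaming examples above are simply duration × sample rate. A tiny helper (illustrative only, not part of the crate) makes the relationship explicit:

```rust
/// Number of samples in a chunk of `ms` milliseconds at `sample_rate` Hz.
fn chunk_samples(ms: u32, sample_rate: u32) -> usize {
    (ms as u64 * sample_rate as u64 / 1000) as usize
}

fn main() {
    assert_eq!(chunk_samples(160, 16_000), 2_560);    // EOU: 160ms
    assert_eq!(chunk_samples(560, 16_000), 8_960);    // Nemotron: 560ms
    assert_eq!(chunk_samples(1_120, 16_000), 17_920); // Multitalker: ~1.12s
    println!("chunk sizes check out");
}
```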
Sortformer v2 & v2.1 (Speaker Diarization): Streaming 4-speaker diarization
```toml
parakeet-rs = { version = "0.3", features = ["sortformer"] }
```

```rust
use parakeet_rs::sortformer::{Sortformer, DiarizationConfig};

let mut sortformer = Sortformer::with_config(
    "diar_streaming_sortformer_4spk-v2.onnx", // or v2.1.onnx
    None,
    DiarizationConfig::callhome(), // or dihard3(), custom()
)?;

let segments = sortformer.diarize(audio, 16000, 1)?;
for seg in segments {
    println!("Speaker {} [{:.2}s - {:.2}s]",
        seg.speaker_id,
        seg.start as f64 / 16_000.0,
        seg.end as f64 / 16_000.0);
}

// For streaming/real-time use, diarize_chunk() preserves state across calls:
let segments = sortformer.diarize_chunk(&audio_chunk_16k_mono)?;
```

See examples/diarization.rs for combining with TDT transcription.
See examples/streaming_diarization.rs for diarize_chunk usage example.
See scripts/export_diar_sortformer.py for exporting the model with custom streaming parameters.
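One way to combine diarization segments with token-level timestamps from transcription is to assign each token to the speaker whose segment overlaps it the most. The sketch below uses simplified stand-in structs, not the crate's actual types; see examples/diarization.rs for the real thing:

```rust
// Simplified stand-ins for illustration; not parakeet_rs types.
struct Token { start: f64, end: f64, text: String }
struct Segment { speaker_id: usize, start: f64, end: f64 }

/// Assign a token to the speaker whose segment overlaps it the most (if any).
fn speaker_for(token: &Token, segments: &[Segment]) -> Option<usize> {
    segments
        .iter()
        .map(|s| {
            // Overlap between [token.start, token.end] and [s.start, s.end]
            let overlap = (token.end.min(s.end) - token.start.max(s.start)).max(0.0);
            (s.speaker_id, overlap)
        })
        .filter(|&(_, overlap)| overlap > 0.0)
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(id, _)| id)
}

fn main() {
    let segments = vec![
        Segment { speaker_id: 0, start: 0.0, end: 2.0 },
        Segment { speaker_id: 1, start: 2.0, end: 4.0 },
    ];
    let token = Token { start: 2.5, end: 3.0, text: "hello".into() };
    // Token falls entirely inside speaker 1's segment
    println!("[Speaker {:?}] {}", speaker_for(&token, &segments), token.text);
}
```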
CTC: Download from HuggingFace: model.onnx, model.onnx_data, tokenizer.json
TDT: Download from HuggingFace: encoder-model.onnx, encoder-model.onnx.data, decoder_joint-model.onnx, vocab.txt
EOU: Download from HuggingFace: encoder.onnx, decoder_joint.onnx, tokenizer.json
Nemotron: Download from HuggingFace: encoder.onnx, encoder.onnx.data, decoder_joint.onnx, tokenizer.model (int8 / int4)
Multitalker: Download from HuggingFace: encoder.int8.onnx, decoder_joint.int8.onnx, tokenizer.model (also needs a Sortformer model for diarization)
Diarization (Sortformer v2 & v2.1): Download from HuggingFace: diar_streaming_sortformer_4spk-v2.onnx or v2.1.onnx.
Quantized versions available (int8). All files must be in the same directory.
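All of these files can be fetched with the `huggingface-cli download` command into a single directory (the repo id below is a placeholder; substitute the actual HuggingFace repo for your chosen model, and the file list shown matches the CTC variant):

```shell
# Placeholder repo id: replace with the real HuggingFace repo for your model.
huggingface-cli download <your-hf-repo-id> \
  model.onnx model.onnx_data tokenizer.json \
  --local-dir .
```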
GPU support (automatically falls back to CPU on failure):
```toml
# or tensorrt, webgpu, directml, migraphx, or other ort-supported EPs (check cargo features)
parakeet-rs = { version = "0.3", features = ["cuda"] }
```

```rust
use parakeet_rs::{Parakeet, ExecutionConfig, ExecutionProvider};

let config = ExecutionConfig::new().with_execution_provider(ExecutionProvider::Cuda);
let mut parakeet = Parakeet::from_pretrained(".", Some(config))?;
```

Advanced session configuration via ort SessionBuilder:
```rust
let config = ExecutionConfig::new()
    .with_custom_configure(|builder| builder.with_memory_pattern(false));
```

- CTC: English with punctuation & capitalization
- TDT: Multilingual (auto lang detection)
- EOU: Streaming ASR with end-of-utterance detection
- Nemotron: Cache-aware streaming ASR (600M params, EN only)
- Multitalker: Streaming multi-speaker ASR with speaker-kernel injection (ONNX int8)
- Sortformer v2 & v2.1: Streaming speaker diarization (up to 4 speakers). Note: the v2.1 model is downloaded the same way.
- Token-level timestamps (CTC, TDT)
- Audio: 16kHz mono WAV (16-bit PCM or 32-bit float)
- CTC/TDT models have a ~4-5 minute audio length limit. For longer files, use the streaming models or split the audio into chunks.
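As a sketch of the expected input format, 16-bit PCM samples can be normalized to `f32` and split into fixed-size chunks like this (pure-std illustration; a crate such as `hound` is a common choice for actually reading WAV files, but is not shown here):

```rust
/// Normalize 16-bit PCM samples to f32 in [-1.0, 1.0).
fn pcm16_to_f32(samples: &[i16]) -> Vec<f32> {
    samples.iter().map(|&s| s as f32 / 32768.0).collect()
}

fn main() {
    // Stand-in for samples read from a 16kHz mono, 16-bit PCM WAV file.
    let pcm: Vec<i16> = vec![0, 16384, -16384, 32767, -32768];
    let audio = pcm16_to_f32(&pcm);

    // 160ms at 16kHz = 2560 samples per chunk (as in the EOU example above).
    for chunk in audio.chunks(2560) {
        println!("chunk of {} samples", chunk.len());
    }
}
```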
Code: MIT OR Apache-2.0
FYI: The Parakeet ONNX models (downloaded separately from HuggingFace) are developed by NVIDIA. This library does not distribute the models.