Summary
Keet should treat browser-side resampling as a likely transcription-quality boundary and make it easier to validate or swap.
This is not a confirmed Keet-specific regression report yet. It is a follow-up from the recent Parakeet / NeMo TDT investigation, where we found that transcription output can change materially when the same model is fed audio prepared with different resampling paths.
Why this matters
In the earlier investigation:
- the `parakeet-tdt-0.6b-v2` `.org` split was sensitive to the audio frontend
- when Node, NeMo, and `onnx-asr` all consumed the exact same pre-resampled `16 kHz` WAV, they aligned
- when different decode/resample paths were used, token paths diverged
That means audio preparation is not a neutral implementation detail for these models.
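One way to quantify how far two preparation paths diverge before the audio reaches the model is a sample-level comparison. The sketch below is illustrative only: `compareSignals` and `SignalDiff` are hypothetical names, and it assumes both buffers are already at the same sample rate and time-aligned. It complements a transcript comparison rather than replacing it.

```typescript
// Hypothetical helper; not part of Keet or any library named in this issue.
// Assumes both buffers share one sample rate and are already time-aligned.
interface SignalDiff {
  maxAbsDiff: number; // largest per-sample deviation
  snrDb: number;      // reference energy relative to difference energy, in dB
}

function compareSignals(reference: Float32Array, candidate: Float32Array): SignalDiff {
  // Compare only the overlapping region if the lengths differ.
  const n = Math.min(reference.length, candidate.length);
  let maxAbsDiff = 0;
  let signalEnergy = 0;
  let noiseEnergy = 0;
  for (let i = 0; i < n; i++) {
    const diff = reference[i] - candidate[i];
    maxAbsDiff = Math.max(maxAbsDiff, Math.abs(diff));
    signalEnergy += reference[i] * reference[i];
    noiseEnergy += diff * diff;
  }
  // Identical buffers have zero difference energy, reported as infinite SNR.
  const snrDb = noiseEnergy === 0 ? Infinity : 10 * Math.log10(signalEnergy / noiseEnergy);
  return { maxAbsDiff, snrDb };
}

// Usage: a 440 Hz tone at 16 kHz versus a copy with a small constant offset.
const referenceTone = new Float32Array(1600);
for (let n = 0; n < referenceTone.length; n++) {
  referenceTone[n] = Math.sin((2 * Math.PI * 440 * n) / 16000);
}
const offsetTone = referenceTone.map((v) => v + 0.01);
const toneDiff = compareSignals(referenceTone, offsetTone);
console.log(toneDiff.maxAbsDiff, toneDiff.snrDb);
```

A low SNR between the live path and a known-good reference would not prove a transcription problem, but it would indicate that the two frontends are feeding the model measurably different audio.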
Keet currently captures at device rate and resamples to 16 kHz in-app using linear interpolation:
- `src/lib/audio/utils.ts`: comment says it is "Good enough for speech recognition where we're going 48kHz -> 16kHz."
- `src/lib/audio/utils.ts`: `resampleLinear(...)`
- `src/lib/audio/AudioEngine.ts`: tracks device rate vs target `16000`
- `src/lib/audio/AudioEngine.ts`: worklet path can emit `targetSampleRate` chunks directly
- `src/lib/audio/AudioEngine.ts`: `handleAudioChunk(...)` still applies `resampleLinear(...)` when needed
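To illustrate why the interpolation method can matter, here is a minimal sketch of linear-interpolation resampling. It is not Keet's actual `resampleLinear(...)`; the names `resampleLinearSketch` and `rms` and the test tone are assumptions made for illustration. At an exact 48 kHz -> 16 kHz ratio, linear interpolation reduces to keeping every third sample with no low-pass filter, so content above the new 8 kHz Nyquist limit folds back into the band instead of being removed.

```typescript
// Hypothetical sketch of linear-interpolation resampling.
// This is NOT Keet's resampleLinear(...); it only illustrates the technique.
function resampleLinearSketch(
  input: Float32Array,
  fromRate: number,
  toRate: number,
): Float32Array {
  if (fromRate === toRate) return input.slice();
  const ratio = fromRate / toRate; // input samples per output sample
  const outLength = Math.floor(input.length / ratio);
  const output = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;           // fractional position in the input
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    // Weighted average of the two neighbouring input samples.
    output[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return output;
}

// Root-mean-square level of a buffer.
function rms(x: Float32Array): number {
  let sum = 0;
  for (let i = 0; i < x.length; i++) sum += x[i] * x[i];
  return Math.sqrt(sum / x.length);
}

// A 10 kHz tone sampled at 48 kHz. It lies above the 8 kHz Nyquist limit
// of a 16 kHz signal, so a band-limited resampler would attenuate it.
const highTone = new Float32Array(4800);
for (let n = 0; n < highTone.length; n++) {
  highTone[n] = Math.sin((2 * Math.PI * 10000 * n) / 48000);
}

const downsampledTone = resampleLinearSketch(highTone, 48000, 16000);

// With this sketch the tone is not attenuated: it survives at close to its
// original level, aliased down to 6 kHz (16 kHz - 10 kHz).
console.log(rms(highTone), rms(downsampledTone));
```

Whether this kind of aliasing is what changes token output for Parakeet-style models is not established here; the sketch only shows that the linear path and a band-limited path can produce different signals from the same input.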
Important nuance
Keet does not use exactly the same browser audio path as the demo app.
Keet does live microphone capture via `AudioContext` / `AudioWorklet`, not browser file decode via `decodeAudioData()`. So this is not "the same bug" by default.
But it is the same class of risk:
- capture at browser/device sample rate
- convert to mono / `16 kHz` in JS/worklet code
- feed ASR/VAD/transcription logic with the converted signal
For Parakeet / NeMo-style models, that boundary has already proven sensitive enough to alter tokenization.
Suggested actions
- Add lightweight resampling diagnostics to Keet
  - log input sample rate, target sample rate, whether resampling happened, and per-chunk resample time
  - surface this in the debug panel so microphone/device differences are visible
- Add a deterministic parity path for testing
  - allow feeding a canonical pre-resampled `16 kHz` fixture into the same downstream transcription path
  - use that to compare microphone/live path vs known-good `16 kHz` PCM
- Make the resampler swappable or configurable
  - keep current linear path as the fast default if needed
  - but make it possible to test an alternative resampler behind a flag
- Add a regression harness around transcription-sensitive audio
  - even one small fixture can help catch resampling-induced transcript drift
  - examples from the earlier investigation included punctuation / token-boundary changes like `LibriVox. org.` vs `LibriVox.org.`
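The diagnostics and swappable-resampler suggestions could be combined roughly as follows. This is a sketch under stated assumptions, not a proposal for the exact API: every name here (`Resampler`, `ResamplerName`, `ResampleDiagnostics`, `resampleWithDiagnostics`, `boxAverage`) is hypothetical and does not exist in Keet today. The `boxAverage` option is a crude averaging decimator used only as a stand-in for a real band-limited resampler, and the clock is injectable because `Date.now()` is coarse; in the browser `performance.now()` would give finer per-chunk timings.

```typescript
// All names below are hypothetical; none of them exist in Keet today.
type Resampler = (input: Float32Array, fromRate: number, toRate: number) => Float32Array;
type ResamplerName = "linear" | "boxAverage";

interface ResampleDiagnostics {
  inputSampleRate: number;
  targetSampleRate: number;
  resampled: boolean;
  resamplerName: ResamplerName;
  inputLength: number;
  outputLength: number;
  elapsedMs: number;
}

// Linear interpolation between neighbouring input samples.
const linearResampler: Resampler = (input, fromRate, toRate) => {
  const ratio = fromRate / toRate;
  const output = new Float32Array(Math.floor(input.length / ratio));
  for (let i = 0; i < output.length; i++) {
    const pos = i * ratio;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    output[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return output;
};

// Crude stand-in for a band-limited resampler: average each block of input
// samples that maps onto one output sample. Falls back to linear when upsampling.
const boxAverageResampler: Resampler = (input, fromRate, toRate) => {
  if (fromRate <= toRate) return linearResampler(input, fromRate, toRate);
  const ratio = fromRate / toRate;
  const output = new Float32Array(Math.floor(input.length / ratio));
  for (let i = 0; i < output.length; i++) {
    const start = Math.floor(i * ratio);
    const end = Math.min(input.length, Math.max(start + 1, Math.floor((i + 1) * ratio)));
    let sum = 0;
    for (let j = start; j < end; j++) sum += input[j];
    output[i] = sum / (end - start);
  }
  return output;
};

const resamplers: Record<ResamplerName, Resampler> = {
  linear: linearResampler,
  boxAverage: boxAverageResampler,
};

// Resample one chunk with the selected resampler and report what happened.
function resampleWithDiagnostics(
  chunk: Float32Array,
  inputSampleRate: number,
  targetSampleRate: number,
  resamplerName: ResamplerName = "linear",
  now: () => number = () => Date.now(), // swap in performance.now() in the browser
): { output: Float32Array; diagnostics: ResampleDiagnostics } {
  const resampled = inputSampleRate !== targetSampleRate;
  const start = now();
  const output = resampled
    ? resamplers[resamplerName](chunk, inputSampleRate, targetSampleRate)
    : chunk; // already at the target rate: pass through untouched
  const elapsedMs = now() - start;
  return {
    output,
    diagnostics: {
      inputSampleRate,
      targetSampleRate,
      resampled,
      resamplerName,
      inputLength: chunk.length,
      outputLength: output.length,
      elapsedMs,
    },
  };
}

// Usage: one 10 ms chunk captured at 48 kHz, converted with the alternative path.
const chunk48k = new Float32Array(480).fill(0.5);
const resampleResult = resampleWithDiagnostics(chunk48k, 48000, 16000, "boxAverage");
console.log(resampleResult.diagnostics);
```

Keeping the resampler behind a single function type means the parity and regression fixtures can run the same downstream path with each option and compare the resulting transcripts.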
References
- `src/lib/audio/utils.ts`
- `src/lib/audio/AudioEngine.ts`
Context
This issue comes from the recent local investigation across:
- `transformers.js`
- `parakeet.js`
- NeMo
- `onnx-asr`
Main takeaway: for these ASR models, resampling and audio preparation can affect token output enough to matter.