Investigate resampling path as a transcription-quality boundary #205

Summary

Keet should treat browser-side resampling as a likely transcription-quality boundary and make the resampler easier to validate or swap.

This is not a confirmed Keet-specific regression report yet. It is a follow-up from the recent Parakeet / NeMo TDT investigation, where we found that transcription output can change materially when the same model is fed audio prepared with different resampling paths.

Why this matters

In the earlier investigation:

the parakeet-tdt-0.6b-v2 ".org" split ("LibriVox. org." vs "LibriVox.org.") was sensitive to the audio frontend
when Node, NeMo, and onnx-asr all consumed the exact same pre-resampled 16 kHz WAV, they aligned
when different decode/resample paths were used, token paths diverged

That means audio preparation is not a neutral implementation detail for these models.

Keet currently captures at device rate and resamples to 16 kHz in-app using linear interpolation:

src/lib/audio/utils.ts: comment says it is "Good enough for speech recognition where we're going 48kHz -> 16kHz."
src/lib/audio/utils.ts: resampleLinear(...)
src/lib/audio/AudioEngine.ts: tracks device rate vs target 16000
src/lib/audio/AudioEngine.ts: worklet path can emit targetSampleRate chunks directly
src/lib/audio/AudioEngine.ts: handleAudioChunk(...) still applies resampleLinear(...) when needed
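
For readers who have not looked at the code, here is a minimal sketch of what a linear-interpolation resampler typically looks like. This is an illustration only, not a copy of Keet's resampleLinear(...); the name resampleLinearSketch and its signature are assumptions.

```typescript
// Hypothetical sketch of linear-interpolation resampling.
// NOT the actual Keet implementation; name and signature are assumed.
function resampleLinearSketch(
  input: Float32Array,
  fromRate: number,
  toRate: number,
): Float32Array {
  if (fromRate === toRate) return input;

  // Number of input samples consumed per output sample.
  const ratio = fromRate / toRate;
  const outLength = Math.floor(input.length / ratio);
  const output = new Float32Array(outLength);

  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;                           // fractional input position
    const i0 = Math.floor(pos);                      // left neighbour
    const i1 = Math.min(i0 + 1, input.length - 1);   // right neighbour, clamped
    const frac = pos - i0;
    output[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return output;
}
```

Note that at an exact 3:1 ratio such as 48 kHz -> 16 kHz, every output position lands on an input sample, so a resampler of this shape reduces to keeping every third sample with no low-pass filter in front of it. Content above 8 kHz can then alias into the speech band, which is one plausible way the signal could differ from an offline, filtered resample.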

Important nuance

Keet does not use exactly the same browser file-decode path as the demo app.

Keet is doing live microphone capture via AudioContext / AudioWorklet, not browser file decode via decodeAudioData(). So this is not "the same bug" by default.

But it is the same class of risk:

capture at browser/device sample rate
convert to mono / 16 kHz in JS/worklet code
feed ASR/VAD/transcription logic with the converted signal

For Parakeet / NeMo-style models, that boundary has already proven sensitive enough to change the decoded token sequence.

Suggested actions

Add lightweight resampling diagnostics to Keet

log input sample rate, target sample rate, whether resampling happened, and per-chunk resample time
surface this in the debug panel so microphone/device differences are visible
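
One possible shape for the diagnostics, as a sketch: wrap whatever resampler is in use and report one stats record per chunk. Everything here (Resampler, ResampleStats, withResampleDiagnostics) is a hypothetical name, not an existing Keet API. The clock is injectable; in the browser you would probably pass () => performance.now() for sub-millisecond resolution.

```typescript
// Hypothetical diagnostics wrapper; none of these names exist in Keet today.
type Resampler = (input: Float32Array, fromRate: number, toRate: number) => Float32Array;

interface ResampleStats {
  inputSampleRate: number;
  targetSampleRate: number;
  resampled: boolean;   // false when the rates already match
  inputLength: number;
  outputLength: number;
  elapsedMs: number;    // per-chunk resample time
}

function withResampleDiagnostics(
  resample: Resampler,
  onStats: (stats: ResampleStats) => void,
  now: () => number = () => Date.now(), // assumed default; coarse resolution
): Resampler {
  return (input, fromRate, toRate) => {
    const resampled = fromRate !== toRate;
    const start = now();
    // Skip the resampler entirely when no conversion is needed.
    const output = resampled ? resample(input, fromRate, toRate) : input;
    const elapsedMs = now() - start;
    onStats({
      inputSampleRate: fromRate,
      targetSampleRate: toRate,
      resampled,
      inputLength: input.length,
      outputLength: output.length,
      elapsedMs,
    });
    return output;
  };
}
```

The onStats callback could feed both a console logger and the debug panel, so the audio path itself does not need to know where the numbers end up.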

Add a deterministic parity path for testing

allow feeding a canonical pre-resampled 16 kHz fixture into the same downstream transcription path
use that to compare microphone/live path vs known-good 16 kHz PCM
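
Before comparing transcripts, it may help to compare the signals themselves. A minimal sketch, assuming both paths can be captured as Float32Array PCM at 16 kHz; compareSignals and ParityReport are hypothetical names, and the sketch does not attempt time alignment, so the two buffers are assumed to start at the same sample.

```typescript
// Hypothetical parity check between a reference 16 kHz signal and the
// signal produced by the in-app path. No time alignment is attempted.
interface ParityReport {
  comparedSamples: number; // overlap length actually compared
  lengthDelta: number;     // candidate.length - reference.length
  maxAbsDiff: number;
  rmsDiff: number;
}

function compareSignals(reference: Float32Array, candidate: Float32Array): ParityReport {
  const n = Math.min(reference.length, candidate.length);
  let maxAbsDiff = 0;
  let sumSq = 0;
  for (let i = 0; i < n; i++) {
    const d = candidate[i] - reference[i];
    const abs = Math.abs(d);
    if (abs > maxAbsDiff) maxAbsDiff = abs;
    sumSq += d * d;
  }
  return {
    comparedSamples: n,
    lengthDelta: candidate.length - reference.length,
    maxAbsDiff,
    rmsDiff: n > 0 ? Math.sqrt(sumSq / n) : 0,
  };
}
```

A non-zero lengthDelta or a large rmsDiff would point at the resampling step before any model is involved, which narrows down where a transcript difference comes from.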

Make the resampler swappable or configurable

keep current linear path as the fast default if needed
but make it possible to test an alternative resampler behind a flag
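
A sketch of how the resampler could be made swappable: a small registry keyed by name, with the current linear path as the default and a flag selecting an alternative. ResamplerRegistry and the flag values are hypothetical; how the flag is read (URL parameter, settings store, build flag) is left open.

```typescript
// Hypothetical registry for swapping resamplers behind a flag.
type Resampler = (input: Float32Array, fromRate: number, toRate: number) => Float32Array;

class ResamplerRegistry {
  private readonly impls = new Map<string, Resampler>();
  private readonly defaultName: string;
  private readonly defaultImpl: Resampler;

  constructor(defaultName: string, defaultImpl: Resampler) {
    this.defaultName = defaultName;
    this.defaultImpl = defaultImpl;
    this.impls.set(defaultName, defaultImpl);
  }

  register(name: string, impl: Resampler): void {
    this.impls.set(name, impl);
  }

  // Returns the implementation named by the flag, or the default when the
  // flag is missing or unknown, so a typo cannot break audio capture.
  select(flag?: string): { name: string; resample: Resampler } {
    if (flag !== undefined) {
      const impl = this.impls.get(flag);
      if (impl !== undefined) return { name: flag, resample: impl };
    }
    return { name: this.defaultName, resample: this.defaultImpl };
  }
}
```

Returning the resolved name alongside the function makes it easy to log which resampler actually ran.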

Add a regression harness around transcription-sensitive audio

even one small fixture can help catch resampling-induced transcript drift
examples from the earlier investigation included punctuation / token-boundary changes like "LibriVox. org." vs "LibriVox.org."
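
The comparison step of such a harness can be very small. A sketch, assuming the harness already has the expected transcript for a fixture and the transcript produced by the path under test; firstTranscriptDivergence is a hypothetical helper that compares whitespace-separated tokens and reports the first mismatch.

```typescript
// Hypothetical transcript drift check for a regression harness.
interface TranscriptDivergence {
  index: number;                 // position of the first differing token
  expected: string | undefined;  // undefined if expected ran out of tokens
  actual: string | undefined;    // undefined if actual ran out of tokens
}

function tokenize(text: string): string[] {
  return text.trim().split(/\s+/).filter((token) => token.length > 0);
}

function firstTranscriptDivergence(
  expected: string,
  actual: string,
): TranscriptDivergence | null {
  const e = tokenize(expected);
  const a = tokenize(actual);
  const n = Math.max(e.length, a.length);
  for (let i = 0; i < n; i++) {
    if (e[i] !== a[i]) {
      return { index: i, expected: e[i], actual: a[i] };
    }
  }
  return null; // token sequences are identical
}
```

Because punctuation is kept attached to its token, a split such as "LibriVox. org." vs "LibriVox.org." shows up as a divergence, while differences in spacing alone do not.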

References

src/lib/audio/utils.ts
src/lib/audio/AudioEngine.ts

Context

This issue comes from the recent local investigation across:

transformers.js
parakeet.js
NeMo
onnx-asr

Main takeaway: for these ASR models, resampling and audio preparation can affect token output enough to matter.
