Investigate resampling path as a transcription-quality boundary #205

@ysdede

Summary

Keet should treat browser-side resampling as a likely transcription-quality boundary and make the resampling path easier to validate or swap.

This is not a confirmed Keet-specific regression report yet. It is a follow-up from the recent Parakeet / NeMo TDT investigation, where we found that transcription output can change materially when the same model is fed audio prepared with different resampling paths.

Why this matters

In the earlier investigation:

  • the parakeet-tdt-0.6b-v2 ".org" split (LibriVox. org. vs LibriVox.org.) was sensitive to the audio frontend
  • when Node, NeMo, and onnx-asr all consumed the exact same pre-resampled 16 kHz WAV, they aligned
  • when different decode/resample paths were used, token paths diverged

That means audio preparation is not a neutral implementation detail for these models.

Keet currently captures at the device sample rate and resamples to 16 kHz in-app using linear interpolation (see the files under References).
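
For readers unfamiliar with the technique, here is a minimal sketch of linear-interpolation resampling on mono PCM. It is not Keet's actual implementation; the function name `resampleLinear` and its signature are assumptions made for illustration only.

```typescript
// Hypothetical helper, NOT copied from Keet's source.
// Resamples mono PCM by linearly interpolating between neighbouring input samples.
function resampleLinear(
  input: Float32Array,
  inputRate: number,
  targetRate: number,
): Float32Array {
  if (inputRate <= 0 || targetRate <= 0) {
    throw new RangeError("sample rates must be positive");
  }
  // Nothing to do when the rates already match; return a copy.
  if (inputRate === targetRate) return input.slice();

  const outLength = Math.floor((input.length * targetRate) / inputRate);
  const output = new Float32Array(outLength);
  const step = inputRate / targetRate; // input samples advanced per output sample

  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    // Clamp both neighbours so rounding at the tail never reads past the buffer.
    const left = Math.min(Math.floor(pos), input.length - 1);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    output[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return output;
}
```

Linear interpolation applies no anti-aliasing low-pass filter, so when downsampling from 44.1 or 48 kHz, content above 8 kHz can fold back into the 16 kHz signal. That is one plausible way the resampler choice could change what the model sees.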

Important nuance

Keet does not use the exact same browser file-decode path as the demo app.

Keet is doing live microphone capture via AudioContext / AudioWorklet, not browser file decode via decodeAudioData(). So this is not "the same bug" by default.

But it is the same class of risk:

  • capture at browser/device sample rate
  • convert to mono / 16 kHz in JS/worklet code
  • feed ASR/VAD/transcription logic with the converted signal

For Parakeet / NeMo-style models, that boundary has already proven sensitive enough to alter tokenization.
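
As an illustration of the "convert to mono" step in that list, a channel-averaging downmix might look like the sketch below. The name `downmixToMono` is hypothetical, and the sketch assumes the worklet receives one `Float32Array` per channel; Keet's real capture code may downmix differently or take a single channel.

```typescript
// Hypothetical helper, NOT taken from Keet: average all channels into one mono buffer.
function downmixToMono(channels: Float32Array[]): Float32Array {
  if (channels.length === 0) return new Float32Array(0);

  // Use the shortest channel so mismatched lengths never read out of bounds.
  const frames = Math.min(...channels.map((c) => c.length));
  const mono = new Float32Array(frames);

  for (const channel of channels) {
    for (let i = 0; i < frames; i++) {
      mono[i] += channel[i] / channels.length;
    }
  }
  return mono;
}
```

Even this step involves a choice (average, sum, or first channel only), which is part of why the conversion boundary is worth making observable.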

Suggested actions

  1. Add lightweight resampling diagnostics to Keet
     • log input sample rate, target sample rate, whether resampling happened, and per-chunk resample time
     • surface this in the debug panel so microphone/device differences are visible
  2. Add a deterministic parity path for testing
     • allow feeding a canonical pre-resampled 16 kHz fixture into the same downstream transcription path
     • use that to compare microphone/live path vs known-good 16 kHz PCM
  3. Make the resampler swappable or configurable
     • keep current linear path as the fast default if needed
     • but make it possible to test an alternative resampler behind a flag
  4. Add a regression harness around transcription-sensitive audio
     • even one small fixture can help catch resampling-induced transcript drift
     • examples from the earlier investigation included punctuation / token-boundary changes like LibriVox. org. vs LibriVox.org.
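
The diagnostics and swappable-resampler suggestions above could be combined roughly as follows. This is only a sketch: every name here (`Resampler`, `ResampleDiagnostics`, `resamplers`, `resampleWithDiagnostics`) is hypothetical and none is taken from Keet. The `nearest` entry is a deliberately crude stand-in for "an alternative resampler", not a recommendation.

```typescript
// All names in this sketch are hypothetical; none come from Keet's source.
type Resampler = (input: Float32Array, inputRate: number, targetRate: number) => Float32Array;

interface ResampleDiagnostics {
  resampler: string;
  inputRate: number;
  targetRate: number;
  resampled: boolean; // false when the rates already match
  inputFrames: number;
  outputFrames: number;
  elapsedMs: number; // per-chunk resample time
}

const linear: Resampler = (input, inputRate, targetRate) => {
  const out = new Float32Array(Math.floor((input.length * targetRate) / inputRate));
  const step = inputRate / targetRate;
  for (let i = 0; i < out.length; i++) {
    const pos = i * step;
    const left = Math.min(Math.floor(pos), input.length - 1);
    const right = Math.min(left + 1, input.length - 1);
    out[i] = input[left] + (input[right] - input[left]) * (pos - left);
  }
  return out;
};

// Stand-in alternative: picks the nearest input sample. Present only to show swapping.
const nearest: Resampler = (input, inputRate, targetRate) => {
  const out = new Float32Array(Math.floor((input.length * targetRate) / inputRate));
  const step = inputRate / targetRate;
  for (let i = 0; i < out.length; i++) {
    out[i] = input[Math.min(Math.round(i * step), input.length - 1)];
  }
  return out;
};

// Registry keyed by a flag value, so a debug setting can select the resampler.
const resamplers = new Map<string, Resampler>([
  ["linear", linear],
  ["nearest", nearest],
]);

function resampleWithDiagnostics(
  name: string,
  input: Float32Array,
  inputRate: number,
  targetRate: number,
  report: (d: ResampleDiagnostics) => void,
  // Injectable clock; in a browser one would likely pass () => performance.now().
  now: () => number = () => Date.now(),
): Float32Array {
  const resample = resamplers.get(name);
  if (resample === undefined) throw new Error(`unknown resampler: ${name}`);

  const resampled = inputRate !== targetRate;
  const start = now();
  const output = resampled ? resample(input, inputRate, targetRate) : input;
  report({
    resampler: name,
    inputRate,
    targetRate,
    resampled,
    inputFrames: input.length,
    outputFrames: output.length,
    elapsedMs: now() - start,
  });
  return output;
}
```

A debug panel could subscribe through the `report` callback, and the parity path could bypass the resampler entirely by feeding a 16 kHz fixture, in which case `resampled` would be reported as `false`.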

References

  • src/lib/audio/utils.ts
  • src/lib/audio/AudioEngine.ts

Context

This issue comes from the recent local investigation across:

  • transformers.js
  • parakeet.js
  • NeMo
  • onnx-asr

Main takeaway: for these ASR models, resampling and audio preparation can affect token output enough to matter.
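
A regression check for this kind of drift can be very small. The sketch below compares two transcripts word by word and reports the first mismatch. It splits on whitespace rather than on model tokens, and the names `splitWords` and `firstTranscriptDrift` are hypothetical.

```typescript
// Hypothetical helpers for a transcript regression check; not part of Keet.
interface TranscriptDrift {
  index: number; // position of the first differing word
  expected: string | undefined; // undefined when the expected transcript is shorter
  actual: string | undefined; // undefined when the actual transcript is shorter
}

// Whitespace split only: punctuation stays attached, so "LibriVox. org." is two words.
function splitWords(transcript: string): string[] {
  return transcript.trim().split(/\s+/).filter((w) => w.length > 0);
}

// Returns the first position where the transcripts disagree, or null if they match.
function firstTranscriptDrift(expected: string, actual: string): TranscriptDrift | null {
  const a = splitWords(expected);
  const b = splitWords(actual);
  const n = Math.max(a.length, b.length);
  for (let i = 0; i < n; i++) {
    if (a[i] !== b[i]) return { index: i, expected: a[i], actual: b[i] };
  }
  return null;
}
```

Called with a reference transcript containing `LibriVox.org.` and a live-path transcript containing `LibriVox. org.`, this reports the mismatch at the word where the split occurs.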

Labels: enhancement, help wanted
