This repository contains a complete implementation of the Conformer: Convolution-Augmented Transformer for Speech Recognition. It includes:
- 🚀 Efficient training pipeline (with Lightning)
- 🌐 Web inference demo (Gradio + Flask)
- 🔤 Beam search decoding with KenLM
References:
- Attention Is All You Need
- Conformer: Convolution-augmented Transformer
- Transformer-XL
- KenLM
- Beam Search + LM Decoding
```
git clone https://github.com/LuluW8071/Conformer.git
cd Conformer
```

First, install system-level requirements:
```
sudo apt update
sudo apt install sox libsox-fmt-all build-essential zlib1g-dev libbz2-dev liblzma-dev
```

Then Python packages:
```
pip install -r requirements.txt
```

Ensure CUDA and PyTorch are correctly set up for training.
```
python3 common_voice.py \
    --file_path /path/to/validated.tsv \
    --save_json_path converted_clips \
    -w 4 \
    --percent 10
```

The LibriSpeech ASR corpus is downloaded automatically.
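As a reference for what this preprocessing step produces, here is a minimal sketch of a TSV-to-manifest conversion. The input columns (`path`, `sentence`) follow the Common Voice `validated.tsv` schema, but the output keys (`key`, `text`), the split names, and the split logic are illustrative assumptions, not necessarily what `common_voice.py` actually emits:

```python
import csv
import json
import random

def tsv_to_manifest(tsv_path, out_dir, percent=10, seed=0):
    """Convert a Common Voice validated.tsv into JSON-lines manifests.

    `percent` controls how many rows go to the validation split,
    mirroring the --percent flag. Output keys are illustrative.
    """
    with open(tsv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))

    random.Random(seed).shuffle(rows)
    split = len(rows) * percent // 100
    val, train = rows[:split], rows[split:]

    for name, subset in (("train", train), ("test", val)):
        with open(f"{out_dir}/{name}.json", "w", encoding="utf-8") as f:
            for row in subset:
                # One JSON object per line: audio path + transcript
                f.write(json.dumps({"key": row["path"], "text": row["sentence"]}) + "\n")
    return len(train), len(val)
```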
🎙️ 2. Personal Recordings via Mimic Record Studio
Important
Recommended: at least 1–2 hours of personal recordings. The original audio is augmented (e.g., noise, pitch, speed) to increase variation in the recorded corpus and improve model robustness on your voice.
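One of the simplest augmentations mentioned above is mixing noise into the signal at a controlled signal-to-noise ratio. The NumPy sketch below is illustrative only, not the repository's actual augmentation code:

```python
import numpy as np

def add_noise(audio: np.ndarray, snr_db: float, rng=None) -> np.ndarray:
    """Mix white Gaussian noise into `audio` at the given SNR (in dB)."""
    rng = rng or np.random.default_rng(0)
    signal_power = np.mean(audio ** 2)
    # Scale the noise so signal_power / noise_power matches the target SNR
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise
```

Pitch and speed perturbation are typically done with SoX-backed effects rather than raw array math, which is why `sox` is in the system requirements.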
```
python3 mimic_record.py \
    --input_file /path/to/transcript.txt \
    --output_dir /path/to/save
```

Merge the personal and Common Voice manifests:

```
python3 merge_jsons.py personal/train.json converted_clips/train.json \
    --output merged_train.json
```

Repeat for the validation files.
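Merging manifests is essentially just concatenating their entries. A minimal sketch, assuming JSON-lines manifests (the script's actual on-disk format may differ):

```python
import json

def merge_manifests(paths, output):
    """Concatenate several JSON-lines manifests into one file."""
    count = 0
    with open(output, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if line:
                        json.loads(line)  # validate each entry before writing
                        out.write(line + "\n")
                        count += 1
    return count
```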
Add your Comet ML API key and project name to `.env`.
```
python3 train.py \
    -g 4 \
    -w 8 \
    --epochs 100 \
    --batch_size 32 \
    -lr 4e-5 \
    --precision 16-mixed \
    --checkpoint_path /path/to/checkpoint.ckpt
```

- `-g`: number of GPUs
- `-w`: CPU workers
- `--epochs`: training epochs
- `--batch_size`: batch size
- `-lr`: learning rate
- `--precision 16-mixed`: enable mixed precision
- `--checkpoint_path`: (optional) resume training from a checkpoint

Export to TorchScript for optimized inference:
```
python3 torchscript.py \
    --model_checkpoint /path/to/checkpoint.ckpt \
    --save_path model
```

Launch the Gradio demo:

```
python3 gradio_demo.py \
    --model_path /path/to/optimized_model.pt \
    --share
```

Or serve it via Flask:

```
python3 app.py \
    --model_path /path/to/optimized_model.pt
```

🔎 See the Jupyter notebook for interactive inference examples.
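Conceptually, the TorchScript export boils down to tracing the model with an example input and saving the result. The helper below is a hedged sketch; the real script's model class, checkpoint handling, and flags may differ:

```python
import torch
import torch.nn as nn

def export_torchscript(model: nn.Module, example: torch.Tensor, save_path: str):
    """Trace a model with an example input and save the TorchScript artifact."""
    model.eval()  # disable dropout/batch-norm updates before tracing
    with torch.no_grad():
        scripted = torch.jit.trace(model, example)
    torch.jit.save(scripted, save_path)
    return scripted
```

The saved `.pt` file can then be reloaded with `torch.jit.load` for inference without the original Python class definitions.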
| Dataset | Use | Hours | Notes |
|---|---|---|---|
| Mozilla Common Voice 7.0 + Personal | Training | 1855 + 20 | Large crowd-sourced + personal |
| | Validation | 161 + 2 | 8% validation split |
| LibriSpeech | Training | 960 | Full training set |
| | Validation | 10.5 | test-clean, test-other |
| Parameter | Value |
|---|---|
| `batch_size` | 64 |
| `precision` | mixed precision |
| `grad_clip` | 0.6 |
| `learning_rate` (AdamW) | 1e-4 |
| `scheduler` | ReduceLROnPlateau |
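These hyperparameters map directly onto PyTorch's `AdamW`, `ReduceLROnPlateau`, and gradient clipping. A sketch, where the `factor` and `patience` values are placeholders since the table does not specify them:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import ReduceLROnPlateau

def configure_optimizer(model: torch.nn.Module):
    """AdamW at lr=1e-4 with ReduceLROnPlateau, matching the table above.

    factor/patience are assumptions, not the repository's actual values.
    """
    optimizer = AdamW(model.parameters(), lr=1e-4)
    scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=2)
    return optimizer, scheduler
```

During training, `torch.nn.utils.clip_grad_norm_(model.parameters(), 0.6)` before each optimizer step would apply the `grad_clip` value from the table.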
*(Figures: LibriSpeech · Mozilla Corpus + Personal · Combined)*
| Exp. | Dataset | Eval WER | Test-Clean WER | Test-Other WER | Model |
|---|---|---|---|---|---|
| 1 | LibriSpeech | 22.94 | 15.94 | 31.71 | 🔗 |
| 2 | Mozilla + Personal | 25.29 | 26.22 | 42.59 | 🔗 |
| 3 | Combined (Mozilla + Libri + Personal) | 28.47 | 19.33 | 33.50 | 🔗 |
⚠️ Mozilla eval set is ~15× larger than LibriSpeech eval set.
Note
- Decoding results below use the Experiment 3 Conformer model with the LibriSpeech 4-gram KenLM.
- The Parameter column is formatted as `beam_size`-`beam_threshold`-`beam_size_token`-`lm_weight`.
- `word_score` = -0.26
- *Greedy* = no CTC beam search and no LM.
- `lm_weight` = *None* means CTC beam search but no LM.
| Parameter | Test-Clean WER | Test-Clean CER | Test-Other WER | Test-Other CER |
|---|---|---|---|---|
| | 19.33 | 5.75 | 33.50 | 12.44 |
| | 16.04 | 5.82 | 30.86 | 13.80 |
| | 15.83 | 5.50 | 30.48 | 13.11 |
| | 11.01 | 5.43 | 23.93 | 13.90 |
| | 9.46 | 3.93 | 20.89 | 10.40 |
| | 9.21 | 3.76 | 20.32 | 10.03 |
| | 9.00 | 3.66 | 19.97 | 9.74 |
| | 8.94 | 3.61 | 19.74 | 9.62 |
| | 8.88 | 3.61 | 19.59 | 9.53 |
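For contrast with the beam-search rows, the *Greedy* baseline simply takes the per-frame argmax, collapses repeats, and drops blanks. A toy sketch (the token IDs and frame scores are made up for illustration):

```python
def ctc_greedy_decode(frame_scores, blank=0):
    """Greedy CTC decoding: per-frame argmax, collapse repeats, drop blanks.

    `frame_scores` is a list of per-frame score lists (frames x vocab).
    """
    # Per-frame argmax gives the best alignment path
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    decoded, prev = [], None
    for token in best_path:
        # CTC collapse rule: skip repeats and blank tokens
        if token != prev and token != blank:
            decoded.append(token)
        prev = token
    return decoded
```

Beam search with an LM instead keeps multiple candidate prefixes per frame and rescores them with KenLM, which is where the WER gains in the table come from.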
Increasing the beam search decoder's parameters, such as beam width or the number of hypotheses, also increases inference time, since more decoding paths must be explored.
```bibtex
@misc{gulati2020conformer,
  title  = {Conformer: Convolution-augmented Transformer for Speech Recognition},
  author = {Anmol Gulati and James Qin and Chung-Cheng Chiu and others},
  year   = {2020},
  url    = {https://arxiv.org/abs/2005.08100}
}
```


