This project documents a showcase fine-tuning experiment of the GPT-OSS-20B model with LoRA on a custom Duʿāʾ dataset (inspired by Ḥiṣn al-Muslim).
👉 Focus: Arabic language, authentic Islamic supplications
👉 Uniqueness: Built entirely from scratch – dataset prep, debugging, training, inference, and visualization
👉 Goal: Provide a transparent research-style workflow that others can replicate or extend
This repo serves as technical documentation & a showcase.
- Total time spent: ~12–14h
  - Debugging: ~4h (dataset fixes, rsync sync issues, initial CPU-only runs 😅)
  - Training + inference: ~6–8h
  - Misc (setup, cleanup, monitoring): ~2h
- Hardware environment:
  - RunPod B200 instance
  - 28 vCPU, 180 GB RAM, 50 GB disk, 150 GB pod volume
  - NVIDIA GPU (CUDA capability `sm_100`) – shown as B200
  - PyTorch CUDA 12.1 (`torch.cuda.is_available() == True`)
  - Container: `runpod/pytorch:2.8.0-py3.11-cuda12.8.1-cudnn-devel`
- Frameworks: HuggingFace Transformers, PEFT, PyTorch (CUDA), custom Python scripts
- Specialty: OSS-20B with LoRA → rarely documented on B200 hardware
```
gpt-oss-20b-lora-dua/
├── datasets/          # Training data (JSONL, CSV, tokenizer)
├── results/           # Inference results & comparisons
├── images/            # Screenshots & debug visuals
├── videos/            # Training & inference demos (via Git LFS)
├── scripts/           # Organized experiment scripts
│   ├── training/      # Training pipelines
│   ├── inference/     # Inference tests
│   ├── dataset_tools/ # Dataset checks & fixes
│   ├── compare/       # Compare runs & Gradio UI
│   └── tools/         # Utilities & helpers
└── utils/             # Environment configs & scanners
```

- Base dataset curated from Ḥiṣn al-Muslim Duʿāʾ
- Fixes applied using `fix_training_entry.py`, `check_dataset.py`, and `convert_json_to_jsonl.py`
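A minimal version of such a dataset check could look like the sketch below. The required field names (`prompt`, `response`) are an assumption for illustration; the repo's actual scripts may use different keys.

```python
import json

def check_dataset(path, required_keys=("prompt", "response")):
    """Validate that every JSONL line parses and carries the expected
    fields. Returns a list of (line_number, problem) tuples.
    NOTE: field names are illustrative, not the repo's exact schema."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                entry = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append((i, f"invalid JSON: {e}"))
                continue
            missing = [k for k in required_keys if k not in entry]
            if missing:
                errors.append((i, f"missing keys: {missing}"))
    return errors
```

Running such a check before every training run is cheap insurance against a single malformed line aborting a multi-hour job.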
- Issue: GPU not used (ran on CPU by mistake 😅)
- Fix: verified the CUDA setup and ensured `torch.cuda.is_available() == True`
- Extra ~1h wasted on rsync retries – included here to show real-world overhead
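A cheap guard against this kind of silent CPU fallback is to print the active device at the top of every run, for example:

```python
import torch

def describe_device():
    """Report the device a run will actually use – a cheap guard
    against silently training on CPU."""
    if torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        return f"cuda: {torch.cuda.get_device_name(0)} (sm_{cap[0]}{cap[1]})"
    return "cpu"

print(describe_device())  # on the B200 pod this should report sm_100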
- Ran LoRA on ~100 samples
- Verified that adapters trained & merged properly
- ✅ Confirmed inference pipeline working
- Trained on the full Duʿāʾ dataset (`datasets/knigge_dua_dataset.jsonl`)
- Saved LoRA adapters & merged them back into the base model
- Used `merge_lora.py` to combine base + adapters
- Exported in multiple quantized formats (Q4, Q5, Q8) locally
- Files intentionally not pushed to GitHub (too large)
- Tested with authentic Duʿāʾ prompts
- Model produced Arabic text, transliteration, and partial English gloss
- Outputs documented in `results/` and via screenshots in `images/`
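An inference test against the merged checkpoint can be sketched as follows; the path, dtype, and sampling settings are assumptions, not the repo's exact inference script:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def generate_dua(model_dir, prompt, max_new_tokens=128):
    """Run a quick generation against a merged checkpoint.
    model_dir is a hypothetical local path to the merged model."""
    tok = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(
        model_dir, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs, max_new_tokens=max_new_tokens,
            do_sample=True, temperature=0.7,
        )
    # Return only the newly generated tokens, not the echoed prompt
    return tok.decode(out[0][inputs["input_ids"].shape[1]:],
                      skip_special_tokens=True)
```

Slicing off the prompt tokens before decoding keeps the returned string limited to the model's actual answer.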
- First documented LoRA fine-tune of OSS-20B on RunPod B200 (CUDA 12)
- Dataset correction pipeline works robustly
- Training reproducible (mini + full runs)
- Model improved on Arabic + Islamic contexts
- Dataset small (~100–200 examples)
- Religious accuracy still requires scholar review
- Cloud quirks → some wasted time (initial CPU-only runs, rsync overhead)
- Training runs (mini + full)
- Debugging sessions
- Inference showcases
- RunPod B200 (CUDA 12) works reliably once set up correctly
- LoRA is efficient even on 20B parameter models
- Debugging + real-world overhead (CPU fallback, rsync) matter just as much as training itself
- Transparency (keeping even “mistakes”) helps others learn
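The efficiency claim can be made concrete with a back-of-the-envelope calculation. LoRA adds only `r * (d_in + d_out)` trainable parameters per adapted weight matrix; the hidden size and rank below are illustrative assumptions, not the model's confirmed dimensions:

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one weight matrix:
    A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)

# Illustrative numbers: a square projection with hidden size 6144
# and LoRA rank 16 (assumed values, not the repo's exact config).
full = 6144 * 6144                       # dense weight parameters
lora = lora_param_count(6144, 6144, 16)  # LoRA adapter parameters
print(f"LoRA adds {lora:,} params vs {full:,} "
      f"({100 * lora / full:.2f}% of the dense matrix)")
```

Per adapted matrix that is roughly half a percent of the dense weights, which is why adapters for a 20B model fit comfortably in a few hundred megabytes.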
This repo demonstrates:
- How to structure a real LoRA fine-tune project end-to-end
- How to handle dataset debugging, training, merging, inference
- How to use cloud GPU instances (RunPod B200) for large-scale experiments
👉 A hands-on showcase, not a polished product – built for education, research, and reproducibility.



