🎙 LEHJA - build status

Updated 2026-06-13 17:08:18 UTC · auto-refresh 60s · cron every 5 min

OVERALL (Arabic track + MT): 100%
H100 NVL: 0% GPU · active jobs 1 · VRAM 0.0 GB used / 93.1 GB free of 93.6 GB
⚙️ DATA-PREP (CPU): speaker embeddings / speech-token extraction. GPU at 0% is normal here (ONNX runs on CPU); training starts next.

📋 Tasks: main & sub

✅ ✦. Urdu voice model (previous milestone) 100%
✅ CosyVoice2 Urdu LoRA trained + delivered · validation PASS · median CER 0.065 · model on app-server
✅ 1. Build Quranic-Arabic pipeline (canonical alignment, full tashkeel) 100%
✅ Diacritic-preserving aligner + mining/prep/train/validate scripts
✅ 2. Canonical Uthmani Quran text + alignment index 100%
✅ 6236 ayahs, full Uthmani Hafs (zabar/zer/pesh/shadda/tanwin/ghunna)
✅ Skeleton search index: rough transcript → exact ayah span
✅ 3. Mine in-domain Arabic recitation (your QTM corpus) 100%
✅ whisper-ar VAD-segment + inline Quran align · 11,900/12,000 files · 8/8 shards done
✅ clean recitation clips harvested · 11,237 clips · 1223.6 min · 3,041 distinct ayahs
✅ 4. Canonical-align → diacritized labels + confidence gate 100%
✅ perfect diacritized labels (alignment inline per clip)
✅ high-confidence gate (match≥85) + dedup · 7,238 train-grade clips
✅ 5. Train + validate + package Quranic Arabic LoRA 100%
✅ CosyVoice2 fine-tune (learn diacritized recitation) · Epoch 34/35 · loss 0.524978 · acc 0.747030
✅ validate: recites correct Arabic? (whisper-readback) · verdict: WEAK
✅ package CosyVoice2-0.5B-ar + download to app-server
✅ MT. Flagship MT benchmark (Qwen3.6-35B-A3B vs Qwen3-14B) 100%
✅ latency + Urdu/Arabic quality on real tutor sentences · results saved
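
Note on the readback metric: the Urdu validation line reports a median CER, and the Arabic validation is a whisper readback. A minimal sketch of how such a score could be computed, assuming CER means Levenshtein edit distance divided by reference length; the actual scoring script is not shown on this page, and the function names below are illustrative.

```python
from statistics import median


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert/delete/substitute = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # delete a reference character
                cur[j - 1] + 1,          # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of a readback transcript against its reference."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def median_cer(pairs) -> float:
    """Median CER over (reference, readback) pairs."""
    return median(cer(ref, hyp) for ref, hyp in pairs)
```

Under this definition a median of 0.065 would mean the typical sample needs about 6.5 character edits per 100 reference characters.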

📊 Arabic data harvest

Recitation clips mined: 11,237 · 1223.6 min clean audio
Train-grade (match≥85): 7,238 clips · 3,041 distinct ayahs covered
Source: your QTM Quran-recitation corpus (30,165 Arabic files available)
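
Gate + dedup, sketched: the train-grade figure comes from a match≥85 gate followed by dedup. The snippet below is one way that step could look. The `Clip` fields and the choice of an audio hash as the dedup key are assumptions for illustration; the pipeline's real record layout and dedup rule are not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    clip_id: str
    audio_hash: str  # hypothetical dedup key (e.g. a content hash of the audio)
    ayah_ref: str    # e.g. "2:255"
    match: float     # alignment score on a 0-100 scale


def train_grade(clips, min_match: float = 85.0):
    """Keep clips at or above the gate; per dedup key keep the best-scoring one."""
    best = {}
    for clip in clips:
        if clip.match < min_match:
            continue  # fails the confidence gate
        kept = best.get(clip.audio_hash)
        if kept is None or clip.match > kept.match:
            best[clip.audio_hash] = clip
    return list(best.values())
```

On the figures above, 7,238 of 11,237 mined clips survive the gate and dedup, roughly 64%.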

Approach: every clip's rough transcript is matched to the canonical Uthmani Quran text, so each label inherits the canonical text's full diacritics (zabar/zer/pesh/shadda/tanwin/ghunna), and non-recitation audio (Urdu explanation, repetition) is auto-filtered.
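
Sketch of the matching idea: strip diacritics from both sides to get a consonant "skeleton", find the closest canonical ayah, then take the label from the canonical (diacritized) text. This is an illustration only. It uses `difflib.SequenceMatcher` as a stand-in scorer on a 0-100 scale, a two-ayah sample in simplified spelling instead of the full 6236-ayah index, and whole-ayah matching rather than span search. The function names are hypothetical and none of this is confirmed as the pipeline's actual aligner.

```python
import re
from difflib import SequenceMatcher

# Harakat, tanwin, shadda, sukun, superscript alef, Quranic annotation signs, tatweel.
_MARKS = re.compile("[\u064B-\u065F\u0670\u06D6-\u06ED\u0640]")
# Alef with madda / hamza above / hamza below / wasla -> bare alef.
_ALEFS = re.compile("[\u0622\u0623\u0625\u0671]")

# Tiny illustrative sample (Al-Fatiha 1:1 and 1:2), not the real index.
CANON = {
    "1:1": "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ",
    "1:2": "الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ",
}


def skeleton(text: str) -> str:
    """Drop diacritics and unify alef forms so rough ASR text can be compared."""
    text = _MARKS.sub("", text)
    text = _ALEFS.sub("\u0627", text)
    return " ".join(text.split())


def best_match(rough: str, canon: dict):
    """Return (ayah_ref, diacritized_text, score) for the closest canonical ayah."""
    key = skeleton(rough)
    scored = [
        (100.0 * SequenceMatcher(None, key, skeleton(text)).ratio(), ref, text)
        for ref, text in canon.items()
    ]
    score, ref, text = max(scored)
    return ref, text, score


ref, label, score = best_match("بسم الله الرحمن الرحيم", CANON)
print(ref, round(score, 1), label)
```

A transcript that scores low against every ayah (for example Urdu commentary) would fall below the match≥85 gate, which is how non-recitation audio can be filtered out in this scheme.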