k-nurf committed · Commit 3712a25 · 1 Parent(s): c85ca22

Update README with two-stage training details

- Document both training stages (CV 17.0 and CV 23.0)
- Highlight best checkpoint at step 7,500 with WER 23.56%
- Fix base_model reference (was circular, now openai/whisper-large-v2)
- Add comprehensive training hyperparameters for both stages
- Include performance comparison and training observations
- Update metadata with both datasets

Files changed (1)
  1. README.md +223 -36
README.md CHANGED
@@ -1,11 +1,18 @@
 ---
 library_name: transformers
 license: apache-2.0
-base_model: openchs/asr-whisper-helpline-sw-v1
 tags:
 - generated_from_trainer
 datasets:
-- generator
 metrics:
 - wer
 model-index:
@@ -15,57 +22,145 @@ model-index:
       name: Automatic Speech Recognition
       type: automatic-speech-recognition
     dataset:
-      name: generator
-      type: generator
-      config: default
-      split: train
-      args: default
     metrics:
     - name: Wer
       type: wer
-      value: 25.56550424128181
 ---
 
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-
 # asr-whisper-helpline-sw-v1
 
-This model is a fine-tuned version of [openchs/asr-whisper-helpline-sw-v1](https://huggingface.co/openchs/asr-whisper-helpline-sw-v1) on the generator dataset.
-It achieves the following results on the evaluation set:
-- Loss: 0.5175
-- Wer: 25.5655
 
-## Model description
 
-More information needed
 
-## Intended uses & limitations
 
-More information needed
 
-## Training and evaluation data
 
-More information needed
 
-## Training procedure
 
-### Training hyperparameters
 
-The following hyperparameters were used during training:
-- learning_rate: 5e-06
 - train_batch_size: 16
 - eval_batch_size: 16
 - seed: 42
-- optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
 - lr_scheduler_type: cosine_with_restarts
 - lr_scheduler_warmup_steps: 500
-- training_steps: 20000
-- mixed_precision_training: Native AMP
 
-### Training results
 
-| Training Loss | Epoch | Step | Validation Loss | Wer |
 |:-------------:|:------:|:-----:|:---------------:|:-------:|
 | 0.0598 | 0.025 | 500 | 0.3869 | 24.8021 |
 | 0.0488 | 0.05 | 1000 | 0.4222 | 26.9086 |
@@ -81,17 +176,109 @@ The following hyperparameters were used during training:
 | 0.0327 | 2.0086 | 6000 | 0.4381 | 23.6923 |
 | 0.0254 | 2.0336 | 6500 | 0.4369 | 23.7512 |
 | 0.0155 | 2.0586 | 7000 | 0.4463 | 23.6216 |
-| 0.0263 | 2.0836 | 7500 | 0.4469 | 23.5627 |
 | 0.0249 | 2.1086 | 8000 | 0.4821 | 25.9189 |
 | 0.0233 | 2.1336 | 8500 | 0.4914 | 27.0500 |
 | 0.036 | 3.0129 | 9000 | 0.4738 | 24.1517 |
 | 0.0485 | 3.0379 | 9500 | 0.4758 | 24.9647 |
 | 0.0132 | 3.0629 | 10000 | 0.5175 | 25.5655 |
 
-### Framework versions
 
-- Transformers 4.56.2
-- Pytorch 2.8.0+cu128
-- Datasets 2.21.0
-- Tokenizers 0.22.1

 ---
 library_name: transformers
 license: apache-2.0
+base_model: openai/whisper-large-v2
 tags:
 - generated_from_trainer
+- swahili
+- asr
+- whisper
+- common-voice
+- tanzania
+- child-helpline
 datasets:
+- mozilla-foundation/common_voice_17_0
+- mozilla-foundation/common_voice_23_0
 metrics:
 - wer
 model-index:
...
       name: Automatic Speech Recognition
       type: automatic-speech-recognition
     dataset:
+      name: mozilla-foundation/common_voice_23_0
+      type: mozilla-foundation/common_voice_23_0
+      config: sw
+      split: validation
+      args: sw
     metrics:
     - name: Wer
       type: wer
+      value: 23.5627
 ---
 
 # asr-whisper-helpline-sw-v1
 
+This model is a fine-tuned version of [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) on the Common Voice Swahili datasets (17.0 and 23.0).
+
+## Model Description
+
+This ASR model is specifically fine-tuned for **Swahili speech recognition** in the context of the **Tanzania Child Helpline**, powered by [OpenCHS](https://github.com/openchlai/ai) (Open Source Child Helpline System). The model is designed to transcribe Swahili spoken in Tanzanian call center environments.
+
+**Performance Highlights:**
+- **Best Validation WER:** 23.56% (achieved at step 7,500 of continued training)
+- **Baseline WER:** 89.05% (Whisper Large v2 zero-shot on Common Voice 17.0)
+- **Improvement:** ~65.5 percentage point reduction in WER (~73.5% relative error reduction)
+
+This represents a significant improvement over the base Whisper Large v2 model for Swahili transcription tasks.
+
+## Training Strategy
+
+The model was trained in **two stages** (sketched in code below):
+
+1. **Stage 1 - Common Voice 17.0:** Initial fine-tuning on the Common Voice 17.0 Swahili dataset (10,000 steps)
+2. **Stage 2 - Common Voice 23.0:** Continued fine-tuning on the Common Voice 23.0 Swahili dataset (7,500 steps)
+
+**Total Training:** 17,500 effective steps, with the best checkpoint selected at step 7,500 of Stage 2 based on the lowest validation WER.
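+
+A minimal sketch of this two-stage continuation, assuming the `transformers` API; the checkpoint path is illustrative, not the project's actual training script:
+
+```python
+from transformers import WhisperForConditionalGeneration
+
+# Stage 1: initialize from the public base model
+model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
+# ... fine-tune on Common Voice 17.0 for 10,000 steps, saving checkpoints ...
+
+# Stage 2: warm-start from the best Stage 1 checkpoint (hypothetical local path)
+model = WhisperForConditionalGeneration.from_pretrained("./stage1/checkpoint-10000")
+# ... continue fine-tuning on Common Voice 23.0 with a lower learning rate ...
+```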
+
+## Intended Uses & Limitations
+
+### Intended Uses
+- **Primary:** Transcribing Swahili speech in call center environments, specifically for child helpline services in Tanzania
+- **General:** Swahili automatic speech recognition tasks
+- **Research:** Baseline for domain adaptation studies (general speech → telephony/call center audio)
+
+### Limitations
+- **Domain Shift:** The model is trained on Common Voice (clean, read speech) but intended for call center audio. Performance on actual telephony audio may differ and requires validation.
+- **Language Variety:** Training data may not fully represent all Tanzanian Swahili dialects and speaking styles.
+- **Audio Quality:** Performance may degrade with low-quality audio, background noise, or the poor recording conditions typical of telephony.
+- **Code-Switching:** May not handle code-switching between Swahili and English or other languages well.
+
+### Known Issues
+- Domain-specific evaluation on actual call center audio is pending
+
+## Training and Evaluation Data
+
+### Stage 1: Common Voice 17.0 (Swahili)
+
+**Training Configuration:**
+- **Training samples:** The entire Common Voice 17.0 Swahili training split, streamed
+- **Validation samples:** 2,000 samples
+- **Source:** [Mozilla Common Voice 17.0](https://commonvoice.mozilla.org/)
+- **Language:** Swahili (sw)
+- **Data type:** Read speech from diverse speakers
+- **Streaming mode:** Dataset streaming was used to minimize disk usage (see the sketch below)
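+
+A minimal sketch of streaming the Stage 1 training split with the `datasets` library; the Common Voice repos on the Hub are gated, so authentication may be required, and the exact loading code used in training may differ:
+
+```python
+from datasets import load_dataset
+
+# Stream the Swahili training split instead of downloading it to disk
+cv17_train = load_dataset(
+    "mozilla-foundation/common_voice_17_0",
+    "sw",
+    split="train",
+    streaming=True,
+)
+
+# Streaming datasets are iterated lazily, one example at a time
+first_example = next(iter(cv17_train))
+print(first_example["sentence"])
+```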
+
+**Stage 1 Results:**
+- Final validation WER: 23.62%
+- Training steps: 10,000
+
+### Stage 2: Common Voice 23.0 (Swahili)
+
+**Training Configuration:**
+- **Starting point:** Best checkpoint from Stage 1
+- **Training samples:** Common Voice 23.0 Swahili training split (downloaded locally)
+- **Validation samples:** 2,000 samples
+- **Source:** [Mozilla Common Voice 23.0](https://commonvoice.mozilla.org/)
+- **Language:** Swahili (sw)
+
+**Stage 2 Results:**
+- Best validation WER: **23.56%** at step 7,500
+- Training continued to 10,000 steps, but the step-7,500 checkpoint was selected retrospectively (early stopping applied after the fact)
+
+**Baseline Performance:**
+- Base Whisper Large v2 (zero-shot): **89.05% WER** on Common Voice 17.0 validation
+
+## Training Procedure
+
+### Training Hyperparameters - Stage 1 (Common Voice 17.0)
+
+These settings map directly onto `Seq2SeqTrainingArguments`; a sketch follows the list.
+
+**Optimization:**
+- learning_rate: 1e-05
+- lr_scheduler_type: linear
+- lr_scheduler_warmup_steps: 500
+- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08
+- training_steps: 10,000
+
+**Batch Configuration:**
 - train_batch_size: 16
 - eval_batch_size: 16
+- gradient_accumulation_steps: 1
+
+**Memory Optimization:**
+- gradient_checkpointing: true
+- mixed_precision_training: Native AMP (FP16)
+- dataloader_num_workers: 2
+
+**Evaluation & Checkpointing:**
+- evaluation_strategy: steps (every 500 steps)
+- save_steps: 500
+- logging_steps: 50
+
+**Other:**
 - seed: 42
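+
+A hedged reconstruction of the Stage 1 configuration as `Seq2SeqTrainingArguments`; the output directory is illustrative, and this is a sketch consistent with the list above rather than the project's actual script:
+
+```python
+from transformers import Seq2SeqTrainingArguments
+
+stage1_args = Seq2SeqTrainingArguments(
+    output_dir="./whisper-sw-stage1",  # hypothetical path
+    learning_rate=1e-5,
+    lr_scheduler_type="linear",
+    warmup_steps=500,
+    max_steps=10_000,
+    per_device_train_batch_size=16,
+    per_device_eval_batch_size=16,
+    gradient_accumulation_steps=1,
+    gradient_checkpointing=True,
+    fp16=True,  # Native AMP mixed precision
+    dataloader_num_workers=2,
+    eval_strategy="steps",  # "evaluation_strategy" on older transformers versions
+    eval_steps=500,
+    save_steps=500,
+    logging_steps=50,
+    seed=42,
+)
+```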
+
+### Training Hyperparameters - Stage 2 (Common Voice 23.0)
+
+Relative to Stage 1, the main changes are a lower learning rate and a cosine-with-restarts schedule; the changed arguments are sketched after the list.
+
+**Optimization:**
+- learning_rate: 5e-06 (reduced from Stage 1)
 - lr_scheduler_type: cosine_with_restarts
 - lr_scheduler_warmup_steps: 500
+- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08
+- training_steps: 20,000 (stopped at 10,000; best checkpoint at 7,500)
+
+**Batch Configuration:**
+- train_batch_size: 16
+- eval_batch_size: 16
+
+**Memory Optimization:**
+- mixed_precision_training: Native AMP (FP16)
+
+**Evaluation & Checkpointing:**
+- evaluation_strategy: steps (every 500 steps)
+- save_steps: 500
+- logging_steps: 50
+
+**Other:**
+- seed: 42
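+
+The Stage 2 deltas, in the same hedged `Seq2SeqTrainingArguments` style as the Stage 1 sketch (unlisted arguments carry over; the path is again illustrative):
+
+```python
+from transformers import Seq2SeqTrainingArguments
+
+stage2_args = Seq2SeqTrainingArguments(
+    output_dir="./whisper-sw-stage2",          # hypothetical path
+    learning_rate=5e-6,                        # reduced from 1e-5
+    lr_scheduler_type="cosine_with_restarts",  # replaces the linear schedule
+    warmup_steps=500,
+    max_steps=20_000,                          # run halted around step 10,000
+    per_device_train_batch_size=16,
+    per_device_eval_batch_size=16,
+    fp16=True,
+    eval_strategy="steps",
+    eval_steps=500,
+    save_steps=500,
+    logging_steps=50,
+    seed=42,
+)
+```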
+
+### Training Results - Stage 2 (Common Voice 23.0)
+
+| Training Loss | Epoch | Step | Validation Loss | WER |
 |:-------------:|:------:|:-----:|:---------------:|:-------:|
 | 0.0598 | 0.025 | 500 | 0.3869 | 24.8021 |
 | 0.0488 | 0.05 | 1000 | 0.4222 | 26.9086 |
...
 | 0.0327 | 2.0086 | 6000 | 0.4381 | 23.6923 |
 | 0.0254 | 2.0336 | 6500 | 0.4369 | 23.7512 |
 | 0.0155 | 2.0586 | 7000 | 0.4463 | 23.6216 |
+| **0.0263** | **2.0836** | **7500** | **0.4469** | **23.5627** ← **best checkpoint** |
 | 0.0249 | 2.1086 | 8000 | 0.4821 | 25.9189 |
 | 0.0233 | 2.1336 | 8500 | 0.4914 | 27.0500 |
 | 0.036 | 3.0129 | 9000 | 0.4738 | 24.1517 |
 | 0.0485 | 3.0379 | 9500 | 0.4758 | 24.9647 |
 | 0.0132 | 3.0629 | 10000 | 0.5175 | 25.5655 |
 
+**Training Observations:**
+- Initial performance on CV 23.0: 24.80% WER (step 500)
+- Progressive improvement to a best WER of **23.56%** at step 7,500
+- Performance degraded after step 7,500, indicating the onset of overfitting
+- Model weights were restored from the step-7,500 checkpoint for optimal performance
+
+### Combined Training Summary
+
+**Stage 1 (CV 17.0):**
+- Steps: 0 → 10,000
+- Starting WER: 43.68% → Final WER: 23.62%
+
+**Stage 2 (CV 23.0):**
+- Steps: 0 → 7,500 (best checkpoint)
+- Starting WER: 24.80% → Best WER: 23.56%
+
+**Total Effective Training:** ~17,500 steps across two datasets
+
+## Performance Comparison
+
+| Model | Dataset | Split | WER | Improvement over Baseline |
+|-------|---------|-------|-----|---------------------------|
+| Whisper Large v2 (baseline) | CV 17.0 | Validation | 89.05% | - |
+| **This model (Stage 1)** | **CV 17.0** | **Validation** | **23.62%** | **-65.43 pp (73.5% relative reduction)** |
+| **This model (Stage 2, best)** | **CV 23.0** | **Validation** | **23.56%** | **-65.49 pp (73.5% relative reduction)** |
+
+**Note:** The two-stage training approach with dataset progression (CV 17.0 → CV 23.0) achieved a marginal improvement in final WER while keeping the model robust across Common Voice versions. WER here is the standard word error rate metric (sketched below).
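+
+A minimal sketch of computing WER with the `evaluate` library; the transcripts below are placeholders, since in practice predictions come from the model and references from Common Voice:
+
+```python
+import evaluate
+
+# Word error rate: (substitutions + insertions + deletions) / reference words
+wer_metric = evaluate.load("wer")
+
+predictions = ["habari za asubuhi"]  # placeholder model output
+references = ["habari ya asubuhi"]   # placeholder reference transcript
+
+# compute() returns a fraction; multiply by 100 for percentages like 23.56
+wer = 100 * wer_metric.compute(predictions=predictions, references=references)
+print(f"WER: {wer:.2f}%")  # 33.33% here: one substitution in three words
+```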
+
+## Usage
+
+```python
+from transformers import pipeline
+
+# Load the model
+pipe = pipeline(
+    "automatic-speech-recognition",
+    model="openchs/asr-whisper-helpline-sw-v1",
+)
+
+# Transcribe audio
+result = pipe("path/to/swahili_audio.wav")
+print(result["text"])
+```
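+
+Since Whisper is multilingual, it can help to pin the decoding language. A hedged variant; passing `language`/`task` through `generate_kwargs` is supported on recent transformers versions, but check the API of the version you run:
+
+```python
+from transformers import pipeline
+
+pipe = pipeline("automatic-speech-recognition", model="openchs/asr-whisper-helpline-sw-v1")
+
+# Force Swahili transcription rather than relying on language detection
+result = pipe(
+    "path/to/swahili_audio.wav",
+    generate_kwargs={"language": "swahili", "task": "transcribe"},
+)
+print(result["text"])
+```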
+
+### Advanced Usage
+
+```python
+from transformers import WhisperProcessor, WhisperForConditionalGeneration
+import librosa
+
+# Load model and processor
+processor = WhisperProcessor.from_pretrained("openchs/asr-whisper-helpline-sw-v1")
+model = WhisperForConditionalGeneration.from_pretrained("openchs/asr-whisper-helpline-sw-v1")
+
+# Load audio resampled to the 16 kHz rate Whisper expects
+audio, _ = librosa.load("path/to/swahili_audio.wav", sr=16000)
+
+# Generate transcription
+input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
+predicted_ids = model.generate(input_features)
+transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
+print(transcription[0])
+```
+
+## Future Work
+
+- **Domain Evaluation:** Assessment on actual Tanzania Child Helpline call center audio to measure domain shift impact
+- **Domain Adaptation:** Fine-tuning on telephony/call center audio for improved production performance
+- **Error Analysis:** Detailed analysis of failure cases to identify improvement opportunities
+- **Test Set Evaluation:** Comprehensive evaluation on the Common Voice 23.0 test split
+
+## Citation
+
+If you use this model, please cite:
+
+```bibtex
+@misc{openchs-swahili-asr-v1,
+  title={Swahili ASR Model for Tanzania Child Helpline},
+  author={OpenCHS Team},
+  year={2025},
+  publisher={HuggingFace},
+  howpublished={\url{https://huggingface.co/openchs/asr-whisper-helpline-sw-v1}}
+}
+```
+
+## Framework Versions
+
+- Transformers: 4.56.2
+- PyTorch: 2.8.0+cu128
+- Datasets: 2.21.0
+- Tokenizers: 0.22.1
+
+## License
+
+Apache 2.0
+
+## Acknowledgments
+
+- Base model: [OpenAI Whisper Large v2](https://huggingface.co/openai/whisper-large-v2)
+- Training data: [Mozilla Common Voice 17.0](https://commonvoice.mozilla.org/) and [Mozilla Common Voice 23.0](https://commonvoice.mozilla.org/)
+- Project: [OpenCHS](https://github.com/openchlai/ai)