k-nurf committed · Commit 3712a25 · 1 Parent(s): c85ca22

Update README with two-stage training details

- Document both training stages (CV 17.0 and CV 23.0)
- Highlight best checkpoint at step 7,500 with WER 23.56%
- Fix base_model reference (was circular, now openai/whisper-large-v2)
- Add comprehensive training hyperparameters for both stages
- Include performance comparison and training observations
- Update metadata with both datasets

Files changed (1)
  1. README.md +223 -36
README.md CHANGED
@@ -1,11 +1,18 @@
 ---
 library_name: transformers
 license: apache-2.0
-base_model: openchs/asr-whisper-helpline-sw-v1
 tags:
 - generated_from_trainer
 datasets:
-- generator
 metrics:
 - wer
 model-index:
@@ -15,57 +22,145 @@ model-index:
       name: Automatic Speech Recognition
       type: automatic-speech-recognition
     dataset:
-      name: generator
-      type: generator
-      config: default
-      split: train
-      args: default
     metrics:
     - name: Wer
       type: wer
-      value: 25.56550424128181
 ---
 
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-
 # asr-whisper-helpline-sw-v1
 
-This model is a fine-tuned version of [openchs/asr-whisper-helpline-sw-v1](https://huggingface.co/openchs/asr-whisper-helpline-sw-v1) on the generator dataset.
-It achieves the following results on the evaluation set:
-- Loss: 0.5175
-- Wer: 25.5655
 
-## Model description
 
-More information needed
 
-## Intended uses & limitations
 
-More information needed
 
-## Training and evaluation data
 
-More information needed
 
-## Training procedure
 
-### Training hyperparameters
 
-The following hyperparameters were used during training:
-- learning_rate: 5e-06
 - train_batch_size: 16
 - eval_batch_size: 16
 - seed: 42
-- optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
 - lr_scheduler_type: cosine_with_restarts
 - lr_scheduler_warmup_steps: 500
-- training_steps: 20000
-- mixed_precision_training: Native AMP
 
-### Training results
 
-| Training Loss | Epoch | Step | Validation Loss | Wer |
 |:-------------:|:------:|:-----:|:---------------:|:-------:|
 | 0.0598 | 0.025 | 500 | 0.3869 | 24.8021 |
 | 0.0488 | 0.05 | 1000 | 0.4222 | 26.9086 |
@@ -81,17 +176,109 @@ The following hyperparameters were used during training:
 | 0.0327 | 2.0086 | 6000 | 0.4381 | 23.6923 |
 | 0.0254 | 2.0336 | 6500 | 0.4369 | 23.7512 |
 | 0.0155 | 2.0586 | 7000 | 0.4463 | 23.6216 |
-| 0.0263 | 2.0836 | 7500 | 0.4469 | 23.5627 |
 | 0.0249 | 2.1086 | 8000 | 0.4821 | 25.9189 |
 | 0.0233 | 2.1336 | 8500 | 0.4914 | 27.0500 |
 | 0.036 | 3.0129 | 9000 | 0.4738 | 24.1517 |
 | 0.0485 | 3.0379 | 9500 | 0.4758 | 24.9647 |
 | 0.0132 | 3.0629 | 10000 | 0.5175 | 25.5655 |
 
-### Framework versions
 
-- Transformers 4.56.2
-- Pytorch 2.8.0+cu128
-- Datasets 2.21.0
-- Tokenizers 0.22.1

 ---
 library_name: transformers
 license: apache-2.0
+base_model: openai/whisper-large-v2
 tags:
 - generated_from_trainer
+- swahili
+- asr
+- whisper
+- common-voice
+- tanzania
+- child-helpline
 datasets:
+- mozilla-foundation/common_voice_17_0
+- mozilla-foundation/common_voice_23_0
 metrics:
 - wer
 model-index:
...
       name: Automatic Speech Recognition
       type: automatic-speech-recognition
     dataset:
+      name: mozilla-foundation/common_voice_23_0
+      type: mozilla-foundation/common_voice_23_0
+      config: sw
+      split: validation
+      args: sw
     metrics:
     - name: Wer
       type: wer
+      value: 23.5627
 ---
 
 # asr-whisper-helpline-sw-v1
 
+This model is a fine-tuned version of [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) on the Common Voice Swahili datasets (17.0 and 23.0).
+
+## Model Description
+
+This ASR model is specifically fine-tuned for **Swahili speech recognition** in the context of the **Tanzania Child Helpline**, powered by [OpenCHS](https://github.com/openchlai/ai) (Open Source Child Helpline System). The model is designed to transcribe Swahili spoken in Tanzanian call center environments.
+
+**Performance Highlights:**
+- **Best Validation WER:** 23.56% (achieved at step 7,500 of continued training)
+- **Baseline WER:** 89.05% (Whisper Large v2 zero-shot on Common Voice 17.0)
+- **Improvement:** ~65.5 percentage point reduction in WER (~73.5% relative error reduction)
+
+This represents a significant improvement over the base Whisper Large v2 model for Swahili transcription tasks.
+
+## Training Strategy
+
+The model was trained in **two stages** (sketched in code below):
+
+1. **Stage 1 - Common Voice 17.0:** Initial fine-tuning on the Common Voice 17.0 Swahili dataset (10,000 steps)
+2. **Stage 2 - Common Voice 23.0:** Continued fine-tuning on the Common Voice 23.0 Swahili dataset (7,500 steps)
+
+**Total Training:** 17,500 effective steps, with the best checkpoint selected at step 7,500 of Stage 2 based on the lowest validation WER.
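+
+A minimal sketch of this two-stage continuation, assuming the `transformers` API; the checkpoint path is illustrative, not the project's actual training script:
+
+```python
+from transformers import WhisperForConditionalGeneration
+
+# Stage 1: initialize from the public base model
+model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
+# ... fine-tune on Common Voice 17.0 for 10,000 steps, saving checkpoints ...
+
+# Stage 2: warm-start from the best Stage 1 checkpoint (hypothetical local path)
+model = WhisperForConditionalGeneration.from_pretrained("./stage1/checkpoint-10000")
+# ... continue fine-tuning on Common Voice 23.0 with a lower learning rate ...
+```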
+
+## Intended Uses & Limitations
+
+### Intended Uses
+- **Primary:** Transcribing Swahili speech in call center environments, specifically for child helpline services in Tanzania
+- **General:** Swahili automatic speech recognition tasks
+- **Research:** Baseline for domain adaptation studies (general speech → telephony/call center audio)
+
+### Limitations
+- **Domain Shift:** The model is trained on Common Voice (clean, read speech) but intended for call center audio. Performance on actual telephony audio may differ and requires validation.
+- **Language Variety:** Training data may not fully represent all Tanzanian Swahili dialects and speaking styles.
+- **Audio Quality:** Performance may degrade with low-quality audio, background noise, or the poor recording conditions typical of telephony.
+- **Code-Switching:** May not handle code-switching between Swahili and English or other languages well.
+
+### Known Issues
+- Domain-specific evaluation on actual call center audio is pending
+
+## Training and Evaluation Data
+
+### Stage 1: Common Voice 17.0 (Swahili)
+
+**Training Configuration:**
+- **Training samples:** The entire Common Voice 17.0 Swahili training split, streamed
+- **Validation samples:** 2,000 samples
+- **Source:** [Mozilla Common Voice 17.0](https://commonvoice.mozilla.org/)
+- **Language:** Swahili (sw)
+- **Data type:** Read speech from diverse speakers
+- **Streaming mode:** Dataset streaming was used to minimize disk usage (see the sketch below)
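+
+A minimal sketch of streaming the Stage 1 training split with the `datasets` library; the Common Voice repos on the Hub are gated, so authentication may be required, and the exact loading code used in training may differ:
+
+```python
+from datasets import load_dataset
+
+# Stream the Swahili training split instead of downloading it to disk
+cv17_train = load_dataset(
+    "mozilla-foundation/common_voice_17_0",
+    "sw",
+    split="train",
+    streaming=True,
+)
+
+# Streaming datasets are iterated lazily, one example at a time
+first_example = next(iter(cv17_train))
+print(first_example["sentence"])
+```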
+
+**Stage 1 Results:**
+- Final validation WER: 23.62%
+- Training steps: 10,000
+
+### Stage 2: Common Voice 23.0 (Swahili)
+
+**Training Configuration:**
+- **Starting point:** Best checkpoint from Stage 1
+- **Training samples:** Common Voice 23.0 Swahili training split (downloaded locally)
+- **Validation samples:** 2,000 samples
+- **Source:** [Mozilla Common Voice 23.0](https://commonvoice.mozilla.org/)
+- **Language:** Swahili (sw)
+
+**Stage 2 Results:**
+- Best validation WER: **23.56%** at step 7,500
+- Training continued to 10,000 steps, but the step-7,500 checkpoint was selected retrospectively (early stopping applied after the fact)
+
+**Baseline Performance:**
+- Base Whisper Large v2 (zero-shot): **89.05% WER** on Common Voice 17.0 validation
+
+## Training Procedure
+
+### Training Hyperparameters - Stage 1 (Common Voice 17.0)
+
+These settings map directly onto `Seq2SeqTrainingArguments`; a sketch follows the list.
+
+**Optimization:**
+- learning_rate: 1e-05
+- lr_scheduler_type: linear
+- lr_scheduler_warmup_steps: 500
+- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08
+- training_steps: 10,000
+
+**Batch Configuration:**
 - train_batch_size: 16
 - eval_batch_size: 16
+- gradient_accumulation_steps: 1
+
+**Memory Optimization:**
+- gradient_checkpointing: true
+- mixed_precision_training: Native AMP (FP16)
+- dataloader_num_workers: 2
+
+**Evaluation & Checkpointing:**
+- evaluation_strategy: steps (every 500 steps)
+- save_steps: 500
+- logging_steps: 50
+
+**Other:**
 - seed: 42
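+
+A hedged reconstruction of the Stage 1 configuration as `Seq2SeqTrainingArguments`; the output directory is illustrative, and this is a sketch consistent with the list above rather than the project's actual script:
+
+```python
+from transformers import Seq2SeqTrainingArguments
+
+stage1_args = Seq2SeqTrainingArguments(
+    output_dir="./whisper-sw-stage1",  # hypothetical path
+    learning_rate=1e-5,
+    lr_scheduler_type="linear",
+    warmup_steps=500,
+    max_steps=10_000,
+    per_device_train_batch_size=16,
+    per_device_eval_batch_size=16,
+    gradient_accumulation_steps=1,
+    gradient_checkpointing=True,
+    fp16=True,  # Native AMP mixed precision
+    dataloader_num_workers=2,
+    eval_strategy="steps",  # "evaluation_strategy" on older transformers versions
+    eval_steps=500,
+    save_steps=500,
+    logging_steps=50,
+    seed=42,
+)
+```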
+
+### Training Hyperparameters - Stage 2 (Common Voice 23.0)
+
+Relative to Stage 1, the main changes are a lower learning rate and a cosine-with-restarts schedule; the changed arguments are sketched after the list.
+
+**Optimization:**
+- learning_rate: 5e-06 (reduced from Stage 1)
 - lr_scheduler_type: cosine_with_restarts
 - lr_scheduler_warmup_steps: 500
+- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08
+- training_steps: 20,000 (stopped at 10,000; best checkpoint at 7,500)
+
+**Batch Configuration:**
+- train_batch_size: 16
+- eval_batch_size: 16
+
+**Memory Optimization:**
+- mixed_precision_training: Native AMP (FP16)
+
+**Evaluation & Checkpointing:**
+- evaluation_strategy: steps (every 500 steps)
+- save_steps: 500
+- logging_steps: 50
+
+**Other:**
+- seed: 42
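+
+The Stage 2 deltas, in the same hedged `Seq2SeqTrainingArguments` style as the Stage 1 sketch (unlisted arguments carry over; the path is again illustrative):
+
+```python
+from transformers import Seq2SeqTrainingArguments
+
+stage2_args = Seq2SeqTrainingArguments(
+    output_dir="./whisper-sw-stage2",          # hypothetical path
+    learning_rate=5e-6,                        # reduced from 1e-5
+    lr_scheduler_type="cosine_with_restarts",  # replaces the linear schedule
+    warmup_steps=500,
+    max_steps=20_000,                          # run halted around step 10,000
+    per_device_train_batch_size=16,
+    per_device_eval_batch_size=16,
+    fp16=True,
+    eval_strategy="steps",
+    eval_steps=500,
+    save_steps=500,
+    logging_steps=50,
+    seed=42,
+)
+```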
+
+### Training Results - Stage 2 (Common Voice 23.0)
+
+| Training Loss | Epoch | Step | Validation Loss | WER |
 |:-------------:|:------:|:-----:|:---------------:|:-------:|
 | 0.0598 | 0.025 | 500 | 0.3869 | 24.8021 |
 | 0.0488 | 0.05 | 1000 | 0.4222 | 26.9086 |
...
 | 0.0327 | 2.0086 | 6000 | 0.4381 | 23.6923 |
 | 0.0254 | 2.0336 | 6500 | 0.4369 | 23.7512 |
 | 0.0155 | 2.0586 | 7000 | 0.4463 | 23.6216 |
+| **0.0263** | **2.0836** | **7500** | **0.4469** | **23.5627** ← **best checkpoint** |
 | 0.0249 | 2.1086 | 8000 | 0.4821 | 25.9189 |
 | 0.0233 | 2.1336 | 8500 | 0.4914 | 27.0500 |
 | 0.036 | 3.0129 | 9000 | 0.4738 | 24.1517 |
 | 0.0485 | 3.0379 | 9500 | 0.4758 | 24.9647 |
 | 0.0132 | 3.0629 | 10000 | 0.5175 | 25.5655 |
 
+**Training Observations:**
+- Initial performance on CV 23.0: 24.80% WER (step 500)
+- Progressive improvement to a best WER of **23.56%** at step 7,500
+- Performance degraded after step 7,500, indicating the onset of overfitting
+- Model weights were restored from the step-7,500 checkpoint for optimal performance
+
+### Combined Training Summary
+
+**Stage 1 (CV 17.0):**
+- Steps: 0 → 10,000
+- Starting WER: 43.68% → Final WER: 23.62%
+
+**Stage 2 (CV 23.0):**
+- Steps: 0 → 7,500 (best checkpoint)
+- Starting WER: 24.80% → Best WER: 23.56%
+
+**Total Effective Training:** ~17,500 steps across two datasets
+
+## Performance Comparison
+
+| Model | Dataset | Split | WER | Improvement over Baseline |
+|-------|---------|-------|-----|---------------------------|
+| Whisper Large v2 (baseline) | CV 17.0 | Validation | 89.05% | - |
+| **This model (Stage 1)** | **CV 17.0** | **Validation** | **23.62%** | **-65.43 pp (73.5% relative reduction)** |
+| **This model (Stage 2, best)** | **CV 23.0** | **Validation** | **23.56%** | **-65.49 pp (73.5% relative reduction)** |
+
+**Note:** The two-stage training approach with dataset progression (CV 17.0 → CV 23.0) achieved a marginal improvement in final WER while keeping the model robust across Common Voice versions. WER here is the standard word error rate metric (sketched below).
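+
+A minimal sketch of computing WER with the `evaluate` library; the transcripts below are placeholders, since in practice predictions come from the model and references from Common Voice:
+
+```python
+import evaluate
+
+# Word error rate: (substitutions + insertions + deletions) / reference words
+wer_metric = evaluate.load("wer")
+
+predictions = ["habari za asubuhi"]  # placeholder model output
+references = ["habari ya asubuhi"]   # placeholder reference transcript
+
+# compute() returns a fraction; multiply by 100 for percentages like 23.56
+wer = 100 * wer_metric.compute(predictions=predictions, references=references)
+print(f"WER: {wer:.2f}%")  # 33.33% here: one substitution in three words
+```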
+
+## Usage
+
+```python
+from transformers import pipeline
+
+# Load the model
+pipe = pipeline(
+    "automatic-speech-recognition",
+    model="openchs/asr-whisper-helpline-sw-v1",
+)
+
+# Transcribe audio
+result = pipe("path/to/swahili_audio.wav")
+print(result["text"])
+```
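+
+Since Whisper is multilingual, it can help to pin the decoding language. A hedged variant; passing `language`/`task` through `generate_kwargs` is supported on recent transformers versions, but check the API of the version you run:
+
+```python
+from transformers import pipeline
+
+pipe = pipeline("automatic-speech-recognition", model="openchs/asr-whisper-helpline-sw-v1")
+
+# Force Swahili transcription rather than relying on language detection
+result = pipe(
+    "path/to/swahili_audio.wav",
+    generate_kwargs={"language": "swahili", "task": "transcribe"},
+)
+print(result["text"])
+```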
+
+### Advanced Usage
+
+```python
+from transformers import WhisperProcessor, WhisperForConditionalGeneration
+import librosa
+
+# Load model and processor
+processor = WhisperProcessor.from_pretrained("openchs/asr-whisper-helpline-sw-v1")
+model = WhisperForConditionalGeneration.from_pretrained("openchs/asr-whisper-helpline-sw-v1")
+
+# Load audio resampled to the 16 kHz rate Whisper expects
+audio, _ = librosa.load("path/to/swahili_audio.wav", sr=16000)
+
+# Generate transcription
+input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
+predicted_ids = model.generate(input_features)
+transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
+print(transcription[0])
+```
+
+## Future Work
+
+- **Domain Evaluation:** Assessment on actual Tanzania Child Helpline call center audio to measure domain shift impact
+- **Domain Adaptation:** Fine-tuning on telephony/call center audio for improved production performance
+- **Error Analysis:** Detailed analysis of failure cases to identify improvement opportunities
+- **Test Set Evaluation:** Comprehensive evaluation on the Common Voice 23.0 test split
+
+## Citation
+
+If you use this model, please cite:
+
+```bibtex
+@misc{openchs-swahili-asr-v1,
+  title={Swahili ASR Model for Tanzania Child Helpline},
+  author={OpenCHS Team},
+  year={2025},
+  publisher={HuggingFace},
+  howpublished={\url{https://huggingface.co/openchs/asr-whisper-helpline-sw-v1}}
+}
+```
+
+## Framework Versions
+
+- Transformers: 4.56.2
+- PyTorch: 2.8.0+cu128
+- Datasets: 2.21.0
+- Tokenizers: 0.22.1
+
+## License
+
+Apache 2.0
+
+## Acknowledgments
+
+- Base model: [OpenAI Whisper Large v2](https://huggingface.co/openai/whisper-large-v2)
+- Training data: [Mozilla Common Voice 17.0](https://commonvoice.mozilla.org/) and [Mozilla Common Voice 23.0](https://commonvoice.mozilla.org/)
+- Project: [OpenCHS](https://github.com/openchlai/ai)