|
|
--- |
|
|
title: Whisper Fine-Tune vs Commercial APIs |
|
|
emoji: 🎤 |
|
|
colorFrom: purple |
|
|
colorTo: blue |
|
|
sdk: static |
|
|
pinned: false |
|
|
license: mit |
|
|
short_description: Local fine-tunes beat commercial STT APIs |
|
|
tags: |
|
|
- whisper |
|
|
- speech-to-text |
|
|
- evaluation |
|
|
- benchmark |
|
|
- api-comparison |
|
|
--- |
|
|
|
|
|
# Whisper Fine-Tune vs Commercial APIs |
|
|
|
|
|
Interactive visualization showing a locally fine-tuned Whisper model beating commercial STT APIs (OpenAI Whisper, Assembly, Gladia) on transcription accuracy.
|
|
|
|
|
## Overview |
|
|
|
|
|
This Space presents a comprehensive evaluation of **7 models**: |
|
|
- 4 fine-tuned Whisper variants (Large V3 Turbo, Small, Base, Tiny)
|
|
- 3 commercial STT APIs (Assembly, Gladia, OpenAI Whisper) |
|
|
|
|
|
All models were tested on the same 137-word audio sample against a verified ground-truth transcription.
|
|
|
|
|
## Key Findings |
|
|
|
|
|
**Winner:** Whisper Large V3 Turbo (Fine-Tune) - Local |
|
|
- **Accuracy: 94.16%** |
|
|
- **Beats best commercial API** (Assembly at 92.70%) |
|
|
- **Zero deletions** - no lost content |
|
|
- Production-ready + privacy-focused + zero per-minute costs |
|
|
|
|
|
## Visualizations |
|
|
|
|
|
This Space includes interactive charts for: |
|
|
1. **WER Comparison** - Overall transcription accuracy |
|
|
2. **Error Breakdown** - Substitutions, deletions, insertions by model |
|
|
3. **Information Preserved** - Word information metrics (WIP/WIL)
|
|
4. **Detailed Metrics** - Complete performance breakdown |
|
|
|
|
|
## Methodology |
|
|
|
|
|
- **Ground Truth:** Manual transcription verification |
|
|
- **Metrics:** WER, MER, WIL, WIP using `jiwer` library |
|
|
- **Framework:** Hugging Face Transformers pipeline |
|
|
- **Environment:** Python 3.12, CPU inference |
|
|
- **Test Sample:** 137-word narrative passage |
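
As a rough illustration of the local inference step, the sketch below runs a Whisper checkpoint through the Hugging Face Transformers `automatic-speech-recognition` pipeline on CPU. The model ID and audio path are placeholders, not the exact fine-tuned checkpoints evaluated here.

```python
# Minimal sketch of local CPU inference (placeholder model ID and audio path).
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",  # swap in the fine-tuned checkpoint under test
    device=-1,                     # -1 = CPU, matching the test environment
)

# Chunk long-form audio into Whisper's native 30-second windows.
result = asr("sample.wav", chunk_length_s=30)
print(result["text"])
```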
|
|
|
|
|
## Results Summary |
|
|
|
|
|
| Rank | Model | Type | Accuracy | WER | |
|
|
|------|-------|------|----------|-----| |
|
|
| 1 | Whisper Large V3 Turbo (Fine-Tune) | Local | 94.16% | 5.84% | |
|
|
| 2 | Assembly API | Commercial | 92.70% | 7.30% | |
|
|
| 3 | Gladia API | Commercial | 91.97% | 8.03% | |
|
|
| 4 | Whisper Small (Fine-Tune) | Local | 91.24% | 8.76% | |
|
|
| 5 | Whisper (OpenAI API) | Commercial | 91.24% | 8.76% | |
|
|
| 6 | Whisper Base (Fine-Tune) | Local | 85.40% | 14.60% | |
|
|
| 7 | Whisper Tiny (Fine-Tune) | Local | 85.40% | 14.60% | |
|
|
|
|
|
## Key Insights |
|
|
|
|
|
### 1. Local Fine-Tunes Beat Commercial Whisper APIs |
|
|
The fine-tuned Whisper Large V3 Turbo achieved **94.16% accuracy**, beating the best commercial service (Assembly at 92.70%). On this test, targeted fine-tuning outperformed premium APIs built on the same base model.
|
|
|
|
|
### 2. Cost & Privacy Advantages |
|
|
Local models eliminate per-minute API costs and keep sensitive audio data on-premises. The performance advantage makes this even more compelling. |
|
|
|
|
|
### 3. Commercial APIs Are Competitive |
|
|
All three commercial APIs delivered production-ready performance (91-93% accuracy). They're viable alternatives when local inference isn't feasible. |
|
|
|
|
|
### 4. Production Recommendations |
|
|
|
|
|
**Best Overall:** |
|
|
- Whisper Large V3 Turbo (Fine-Tune) - 94.16% accuracy, local deployment |
|
|
|
|
|
**Best Commercial:** |
|
|
- Assembly API - 92.70% accuracy if cloud deployment required |
|
|
|
|
|
**Balanced Local:** |
|
|
- Whisper Small (Fine-Tune) - 91.24% accuracy; matches the OpenAI Whisper API's accuracy with faster inference
|
|
|
|
|
## Resources |
|
|
|
|
|
- **Evaluation Framework:** Python-based automated testing |
|
|
- **Models Used:** OpenAI Whisper variants and FUTO fine-tunes |
|
|
- **Metrics Library:** [jiwer](https://github.com/jitsi/jiwer) |
|
|
- **Visualization:** Chart.js for interactive charts |
|
|
|
|
|
## License |
|
|
|
|
|
MIT License - See full evaluation data and methodology in the Space. |
|
|
|
|
|
## Author |
|
|
|
|
|
**Daniel Rosehill** |
|
|
- Website: [danielrosehill.com](https://danielrosehill.com) |
|
|
- Email: public@danielrosehill.com |
|
|
|
|
|
--- |
|
|
|
|
|
*Generated by automated Whisper evaluation framework | November 2025* |
|
|
|
|
|
## Technical Details |
|
|
|
|
|
### Evaluation Metrics Explained |
|
|
|
|
|
- **WER (Word Error Rate):** Primary metric - substitutions, deletions, and insertions divided by the number of words in the reference
|
|
- 0-10%: Excellent/Production ready |
|
|
- 10-20%: Good/Acceptable |
|
|
- 20%+: Needs improvement |
|
|
|
|
|
- **MER (Match Error Rate):** Similar to WER, but normalized by matches plus errors, so it never exceeds 100%
|
|
- **WIL (Word Information Lost):** Proportion of word information lost between reference and transcript
|
|
- **WIP (Word Information Preserved):** Complement of WIL (WIP = 1 - WIL) - higher is better
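
A minimal sketch of how these four metrics are computed with `jiwer`; the reference and hypothesis strings below are placeholders rather than the actual 137-word passage.

```python
# Minimal sketch: computing WER, MER, WIL, and WIP with jiwer (placeholder strings).
import jiwer

reference = "the quiet coastal town woke slowly under a grey morning sky"
hypothesis = "the quiet coastal town woke slowly under a gray morning sky"

print(f"WER: {jiwer.wer(reference, hypothesis):.4f}")  # word error rate
print(f"MER: {jiwer.mer(reference, hypothesis):.4f}")  # match error rate
print(f"WIL: {jiwer.wil(reference, hypothesis):.4f}")  # word information lost
print(f"WIP: {jiwer.wip(reference, hypothesis):.4f}")  # word information preserved
```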
|
|
|
|
|
### Error Types |
|
|
|
|
|
- **Substitutions:** Incorrect word transcribed |
|
|
- **Deletions:** Missing words from output |
|
|
- **Insertions:** Extra words added (hallucinations) |
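
The per-model substitution, deletion, and insertion counts behind the error-breakdown chart can be obtained with `jiwer.process_words` (available in jiwer 3.x); this is a sketch of that usage with placeholder strings, not the evaluation harness itself.

```python
# Sketch: counting error types with jiwer >= 3.0 (placeholder strings).
import jiwer

reference = "the harbour lights flickered over the dark water"
hypothesis = "the harbor lights flicker over dark water"

out = jiwer.process_words(reference, hypothesis)
print(f"substitutions: {out.substitutions}")  # wrong word where a reference word was expected
print(f"deletions:     {out.deletions}")      # reference words missing from the transcript
print(f"insertions:    {out.insertions}")     # extra words not present in the reference
print(f"WER:           {out.wer:.4f}")
```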
|
|
|
|
|
### Test Environment |
|
|
|
|
|
- **Hardware:** CPU inference (no GPU) |
|
|
- **Python:** 3.12 |
|
|
- **Framework:** Hugging Face Transformers |
|
|
- **Audio Format:** WAV, 137 words |
|
|
- **Content:** Narrative passage about a coastal town
|
|
|
|
|
--- |
|
|
|
|
|
View the full interactive results above! 👆 |
|
|
|