---
title: Whisper Fine-Tune vs Commercial APIs
emoji: 🎤
colorFrom: purple
colorTo: blue
sdk: static
pinned: false
license: mit
short_description: Local fine-tunes beat commercial STT APIs
tags:
- whisper
- speech-to-text
- evaluation
- benchmark
- api-comparison
---
# Whisper Fine-Tune vs Commercial APIs
Interactive visualization showing a locally fine-tuned Whisper model beating commercial STT APIs (OpenAI Whisper, Assembly, Gladia) on transcription accuracy.
## Overview
This Space presents a comprehensive evaluation of **7 models**:
- 4 fine-tuned Whisper variants (Large V3 Turbo, Small, Base, Tiny)
- 3 commercial STT APIs (Assembly, Gladia, OpenAI Whisper)
All models were tested on the same 137-word audio sample with a verified ground-truth transcription.
## Key Findings
**Winner:** Whisper Large V3 Turbo (Fine-Tune) - Local
- **Accuracy: 94.16%**
- **Beats the best commercial API** (Assembly at 92.70%)
- **Zero deletions** - no lost content
- Production-ready + privacy-focused + zero per-minute costs
## Visualizations
This Space includes interactive charts for:
1. **WER Comparison** - Overall transcription accuracy
2. **Error Breakdown** - Substitutions, deletions, insertions by model
3. **Information Preserved** - Semantic accuracy metrics
4. **Detailed Metrics** - Complete performance breakdown
## Methodology
- **Ground Truth:** Manual transcription verification
- **Metrics:** WER, MER, WIL, WIP using the `jiwer` library (see the sketch after this list)
- **Framework:** Hugging Face Transformers pipeline
- **Environment:** Python 3.12, CPU inference
- **Test Sample:** 137-word narrative passage
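The scoring loop is small enough to sketch. Below is a minimal, hypothetical version of it, assuming `transformers` and `jiwer` are installed; the checkpoint path and file names are placeholders, not the actual artifacts behind this Space.

```python
import jiwer
from transformers import pipeline

# CPU-only inference, matching the test environment (no GPU).
asr = pipeline(
    "automatic-speech-recognition",
    model="./whisper-large-v3-turbo-finetune",  # placeholder local checkpoint
    device=-1,           # -1 = CPU
    chunk_length_s=30,   # Whisper processes audio in 30-second windows
)

# Placeholder file names standing in for the 137-word test sample.
reference = open("ground_truth.txt").read().strip()
hypothesis = asr("sample.wav")["text"]

print(f"WER: {jiwer.wer(reference, hypothesis):.4f}")
print(f"MER: {jiwer.mer(reference, hypothesis):.4f}")
print(f"WIL: {jiwer.wil(reference, hypothesis):.4f}")
print(f"WIP: {jiwer.wip(reference, hypothesis):.4f}")
```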
## Results Summary
| Rank | Model | Type | Accuracy | WER |
|------|-------|------|----------|-----|
| 1 | Whisper Large V3 Turbo (Fine-Tune) | Local | 94.16% | 5.84% |
| 2 | Assembly API | Commercial | 92.70% | 7.30% |
| 3 | Gladia API | Commercial | 91.97% | 8.03% |
| 4 | Whisper Small (Fine-Tune) | Local | 91.24% | 8.76% |
| 5 | Whisper (OpenAI API) | Commercial | 91.24% | 8.76% |
| 6 | Whisper Base (Fine-Tune) | Local | 85.40% | 14.60% |
| 7 | Whisper Tiny (Fine-Tune) | Local | 85.40% | 14.60% |
## Key Insights
### 1. Local Fine-Tunes Beat Commercial Whisper APIs
The fine-tuned Whisper Large V3 Turbo achieved **94.16% accuracy**, beating the best commercial service (Assembly at 92.70%). This demonstrates that targeted fine-tuning can outperform premium APIs built on the same base model.
### 2. Cost & Privacy Advantages
Local models eliminate per-minute API costs and keep sensitive audio data on-premises. The performance advantage makes this even more compelling.
### 3. Commercial APIs Are Competitive
All three commercial APIs delivered production-ready performance (91-93% accuracy). They're viable alternatives when local inference isn't feasible.
### 4. Production Recommendations
**Best Overall:**
- Whisper Large V3 Turbo (Fine-Tune) - 94.16% accuracy, local deployment
**Best Commercial:**
- Assembly API - 92.70% accuracy if cloud deployment required
**Balanced Local:**
- Whisper Small (Fine-Tune) - 91.24% accuracy, matches OpenAI with faster inference
## Resources
- **Evaluation Framework:** Python-based automated testing
- **Models Used:** OpenAI Whisper variants and FUTO fine-tunes
- **Metrics Library:** [jiwer](https://github.com/jitsi/jiwer)
- **Visualization:** Chart.js for interactive charts
## License
MIT License - See full evaluation data and methodology in the Space.
## Author
**Daniel Rosehill**
- Website: [danielrosehill.com](https://danielrosehill.com)
- Email: public@danielrosehill.com
---
*Generated by automated Whisper evaluation framework | November 2025*
## Technical Details
### Evaluation Metrics Explained
- **WER (Word Error Rate):** Primary metric - percentage of words transcribed incorrectly
- 0-10%: Excellent/Production ready
- 10-20%: Good/Acceptable
- 20%+: Needs improvement
- **MER (Match Error Rate):** Like WER, but normalized by the total number of aligned words (hits plus errors), so it is always between 0 and 1
- **WIL (Word Information Lost):** Measures the proportion of word-level information lost between reference and hypothesis
- **WIP (Word Information Preserved):** Complement of WIL (WIP = 1 − WIL) - higher is better; see the formulas below
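For reference, these are the standard definitions `jiwer` implements, where $H$, $S$, $D$, and $I$ are the hit, substitution, deletion, and insertion counts, and $N_{\text{ref}}$, $N_{\text{hyp}}$ are the word counts of the reference and hypothesis:

$$
\mathrm{WER} = \frac{S + D + I}{N_{\text{ref}}}
\qquad
\mathrm{MER} = \frac{S + D + I}{H + S + D + I}
$$

$$
\mathrm{WIP} = \frac{H}{N_{\text{ref}}} \cdot \frac{H}{N_{\text{hyp}}}
\qquad
\mathrm{WIL} = 1 - \mathrm{WIP}
$$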
### Error Types
- **Substitutions:** Incorrect word transcribed
- **Deletions:** Missing words from output
- **Insertions:** Extra words added to the output (hallucinations); a small `jiwer` example follows below
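As a concrete illustration, the toy pair below (not the actual test sample) triggers one of each error type; `jiwer.process_words` (available in jiwer ≥ 3.0) returns the per-type counts and the alignment:

```python
import jiwer

# Toy pair chosen to produce one substitution, one deletion, and one insertion.
reference  = "the quiet coastal town woke early"
hypothesis = "the the quite coastal town woke"

out = jiwer.process_words(reference, hypothesis)

print("hits:         ", out.hits)           # 4
print("substitutions:", out.substitutions)  # 1
print("deletions:    ", out.deletions)      # 1 ("early" is dropped)
print("insertions:   ", out.insertions)     # 1 (an extra "the")

# Render the word-level alignment with S/D/I markers.
print(jiwer.visualize_alignment(out))
```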
### Test Environment
- **Hardware:** CPU inference (no GPU)
- **Python:** 3.12
- **Framework:** Hugging Face Transformers
- **Audio Format:** WAV, 137 words
- **Content:** Narrative passage about coastal town
---
View the full interactive results above! 👆