---
title: Whisper Fine-Tune vs Commercial APIs
emoji: 🎤
colorFrom: purple
colorTo: blue
sdk: static
pinned: false
license: mit
short_description: Local fine-tunes beat commercial STT APIs
tags:
- whisper
- speech-to-text
- evaluation
- benchmark
- api-comparison
---
# Whisper Fine-Tune vs Commercial APIs
Interactive visualization showing local fine-tuned Whisper models beating commercial STT APIs (OpenAI Whisper, Assembly, Gladia) on transcription accuracy.
## Overview
This Space presents a comprehensive evaluation of 7 models:
- 4 fine-tuned Whisper variants (Large V3 Turbo, Small, Base, Tiny)
- 3 commercial STT APIs (Assembly, Gladia, OpenAI Whisper)

All models were tested on the same 137-word audio sample with a verified ground-truth transcription.
## Key Findings

### Winner: Whisper Large V3 Turbo (Fine-Tune, Local)
- Accuracy: 94.16%
- Beats best commercial API (Assembly at 92.70%)
- Zero deletions - no lost content
- Production ready, privacy preserving, and free of per-minute API costs
## Visualizations
This Space includes interactive charts for:
- **WER Comparison**: Overall transcription accuracy
- **Error Breakdown**: Substitutions, deletions, and insertions by model
- **Information Preserved**: Semantic accuracy metrics
- **Detailed Metrics**: Complete performance breakdown
## Methodology
- Ground Truth: Manual transcription verification
- Metrics: WER, MER, WIL, WIP using the `jiwer` library
- Framework: Hugging Face Transformers pipeline
- Environment: Python 3.12, CPU inference
- Test Sample: 137-word narrative passage
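The evaluation itself computes these metrics with `jiwer`; as a self-contained illustration of the core WER calculation (not the Space's actual code), a word-level Levenshtein distance over the reference yields the same number:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: minimum edits turning the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return dp[-1][-1] / len(ref)

# A single substituted word in a 4-word reference gives WER = 25%
print(word_error_rate("the quick brown fox", "the quick brown box"))  # 0.25
```

With `jiwer` installed, the equivalent call is `jiwer.wer(reference, hypothesis)`.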
## Results Summary
| Rank | Model | Type | Accuracy | WER |
|---|---|---|---|---|
| 1 | Whisper Large V3 Turbo (Fine-Tune) | Local | 94.16% | 5.84% |
| 2 | Assembly API | Commercial | 92.70% | 7.30% |
| 3 | Gladia API | Commercial | 91.97% | 8.03% |
| 4 | Whisper Small (Fine-Tune) | Local | 91.24% | 8.76% |
| 5 | Whisper (OpenAI API) | Commercial | 91.24% | 8.76% |
| 6 | Whisper Base (Fine-Tune) | Local | 85.40% | 14.60% |
| 7 | Whisper Tiny (Fine-Tune) | Local | 85.40% | 14.60% |
## Key Insights

### 1. Local Fine-Tunes Beat Commercial Whisper APIs
The fine-tuned Whisper Large V3 Turbo achieved 94.16% accuracy, beating the best commercial service (Assembly at 92.70%). This demonstrates that targeted fine-tuning can outperform premium APIs, including OpenAI's hosted Whisper, which shares the same base model.
### 2. Cost & Privacy Advantages
Local models eliminate per-minute API costs and keep sensitive audio data on-premises. The performance advantage makes this even more compelling.
### 3. Commercial APIs Are Competitive
All three commercial APIs delivered production-ready performance (91-93% accuracy). They're viable alternatives when local inference isn't feasible.
### 4. Production Recommendations

**Best Overall**
- Whisper Large V3 Turbo (Fine-Tune): 94.16% accuracy, local deployment

**Best Commercial**
- Assembly API: 92.70% accuracy if cloud deployment is required

**Balanced Local**
- Whisper Small (Fine-Tune): 91.24% accuracy; matches the OpenAI API with faster inference
## Resources
- Evaluation Framework: Python-based automated testing
- Models Used: OpenAI Whisper variants and FUTO fine-tunes
- Metrics Library: `jiwer`
- Visualization: Chart.js for interactive charts
## License
MIT License - See full evaluation data and methodology in the Space.
## Author
Daniel Rosehill
- Website: danielrosehill.com
- Email: public@danielrosehill.com
*Generated by automated Whisper evaluation framework | November 2025*
## Technical Details

### Evaluation Metrics Explained
**WER (Word Error Rate)**: Primary metric; the percentage of reference words transcribed incorrectly
- 0-10%: Excellent / production ready
- 10-20%: Good / acceptable
- 20%+: Needs improvement

**MER (Match Error Rate)**: Like WER, but normalized by the total number of aligned words (hits plus errors) rather than the reference length, so it is always bounded between 0 and 1

**WIL (Word Information Lost)**: Measures semantic information loss

**WIP (Word Information Preserved)**: Complement of WIL (WIP = 1 - WIL); higher is better
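Given the word-alignment counts (hits, substitutions, deletions, insertions) that `jiwer` reports, all four metrics follow from simple ratios. A sketch with hypothetical counts, not figures taken from this evaluation (though any split totalling 8 errors against the 137-word reference reproduces the headline 5.84% WER):

```python
def metrics_from_counts(hits: int, subs: int, dels: int, ins: int) -> dict:
    """Derive WER, MER, WIL, and WIP from word-alignment counts."""
    n_ref = hits + subs + dels   # words in the reference
    n_hyp = hits + subs + ins    # words in the hypothesis
    errors = subs + dels + ins
    wer = errors / n_ref                   # can exceed 1.0 with many insertions
    mer = errors / (hits + errors)         # bounded to [0, 1]
    wip = (hits / n_ref) * (hits / n_hyp)  # information preserved
    wil = 1.0 - wip                        # information lost
    return {"wer": wer, "mer": mer, "wil": wil, "wip": wip}

# Hypothetical counts for a 137-word reference: 130 hits, 5 subs, 2 dels, 1 ins
m = metrics_from_counts(130, 5, 2, 1)
print(f"WER = {m['wer']:.2%}")  # WER = 5.84%
```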
### Error Types
- Substitutions: Incorrect word transcribed
- Deletions: Missing words from output
- Insertions: Extra words added (hallucinations)
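These three error types can be counted from a word-level alignment. A stdlib-only sketch using `difflib` (note: `difflib`'s alignment is not guaranteed to match the minimum-edit-distance alignment `jiwer` uses, so counts can differ on some inputs):

```python
import difflib

def error_counts(reference: str, hypothesis: str) -> dict:
    """Count hits, substitutions, deletions, and insertions between word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    hits = subs = dels = ins = 0
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=ref, b=hyp).get_opcodes():
        if tag == "equal":
            hits += i2 - i1
        elif tag == "replace":
            # paired words are substitutions; any length mismatch spills
            # over into deletions or insertions
            subs += min(i2 - i1, j2 - j1)
            dels += max(0, (i2 - i1) - (j2 - j1))
            ins += max(0, (j2 - j1) - (i2 - i1))
        elif tag == "delete":
            dels += i2 - i1
        elif tag == "insert":
            ins += j2 - j1
    return {"hits": hits, "substitutions": subs, "deletions": dels, "insertions": ins}

# One substitution ("sat" -> "sit") and one deletion ("the" dropped)
print(error_counts("the cat sat on the mat", "the cat sit on mat"))
```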
### Test Environment
- Hardware: CPU inference (no GPU)
- Python: 3.12
- Framework: Hugging Face Transformers
- Audio Sample: WAV format, 137 words
- Content: Narrative passage about coastal town
View the full interactive results above! 👆