---
title: Whisper Fine-Tune vs Commercial APIs
emoji: 🎤
colorFrom: purple
colorTo: blue
sdk: static
pinned: false
license: mit
short_description: Local fine-tunes beat commercial STT APIs
tags:
- whisper
- speech-to-text
- evaluation
- benchmark
- api-comparison
---
# Whisper Fine-Tune vs Commercial APIs
Interactive visualization showing local fine-tuned Whisper models beating commercial STT APIs (OpenAI Whisper, Assembly, Gladia) on transcription accuracy.
## Overview
This Space presents a comprehensive evaluation of 7 models:
- 4 fine-tuned Whisper variants (Large V3 Turbo, Small, Base, Tiny)
- 3 commercial STT APIs (Assembly, Gladia, OpenAI Whisper)

All models were tested on the same 137-word audio sample with a verified ground-truth transcription.
## Key Findings

### Winner: Whisper Large V3 Turbo (Fine-Tune, Local)
- Accuracy: 94.16%
- Beats best commercial API (Assembly at 92.70%)
- Zero deletions - no lost content
- Production ready, privacy preserving, and free of per-minute API costs
## Visualizations
This Space includes interactive charts for:
- **WER Comparison**: Overall transcription accuracy
- **Error Breakdown**: Substitutions, deletions, and insertions by model
- **Information Preserved**: Semantic accuracy metrics
- **Detailed Metrics**: Complete performance breakdown
## Methodology
- Ground Truth: Manual transcription verification
- Metrics: WER, MER, WIL, WIP using the `jiwer` library
- Framework: Hugging Face Transformers pipeline
- Environment: Python 3.12, CPU inference
- Test Sample: 137-word narrative passage
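The evaluation itself computes these metrics with `jiwer`; as a self-contained illustration of the core WER calculation (not the Space's actual code), a word-level Levenshtein distance over the reference yields the same number:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: minimum edits turning the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return dp[-1][-1] / len(ref)

# A single substituted word in a 4-word reference gives WER = 25%
print(word_error_rate("the quick brown fox", "the quick brown box"))  # 0.25
```

With `jiwer` installed, the equivalent call is `jiwer.wer(reference, hypothesis)`.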
## Results Summary
| Rank | Model | Type | Accuracy | WER |
|---|---|---|---|---|
| 1 | Whisper Large V3 Turbo (Fine-Tune) | Local | 94.16% | 5.84% |
| 2 | Assembly API | Commercial | 92.70% | 7.30% |
| 3 | Gladia API | Commercial | 91.97% | 8.03% |
| 4 | Whisper Small (Fine-Tune) | Local | 91.24% | 8.76% |
| 5 | Whisper (OpenAI API) | Commercial | 91.24% | 8.76% |
| 6 | Whisper Base (Fine-Tune) | Local | 85.40% | 14.60% |
| 7 | Whisper Tiny (Fine-Tune) | Local | 85.40% | 14.60% |
## Key Insights

### 1. Local Fine-Tunes Beat Commercial Whisper APIs
The fine-tuned Whisper Large V3 Turbo achieved 94.16% accuracy, beating the best commercial service (Assembly at 92.70%). This demonstrates that targeted fine-tuning can outperform premium APIs, including OpenAI's hosted Whisper, which shares the same base model.
### 2. Cost & Privacy Advantages
Local models eliminate per-minute API costs and keep sensitive audio data on-premises. The performance advantage makes this even more compelling.
### 3. Commercial APIs Are Competitive
All three commercial APIs delivered production-ready performance (91-93% accuracy). They're viable alternatives when local inference isn't feasible.
### 4. Production Recommendations

**Best Overall**
- Whisper Large V3 Turbo (Fine-Tune): 94.16% accuracy, local deployment

**Best Commercial**
- Assembly API: 92.70% accuracy if cloud deployment is required

**Balanced Local**
- Whisper Small (Fine-Tune): 91.24% accuracy; matches the OpenAI API with faster inference
## Resources
- Evaluation Framework: Python-based automated testing
- Models Used: OpenAI Whisper variants and FUTO fine-tunes
- Metrics Library: `jiwer`
- Visualization: Chart.js for interactive charts
## License
MIT License - See full evaluation data and methodology in the Space.
## Author
Daniel Rosehill
- Website: danielrosehill.com
- Email: public@danielrosehill.com
*Generated by automated Whisper evaluation framework | November 2025*
## Technical Details

### Evaluation Metrics Explained
**WER (Word Error Rate)**: Primary metric; the percentage of reference words transcribed incorrectly
- 0-10%: Excellent / production ready
- 10-20%: Good / acceptable
- 20%+: Needs improvement

**MER (Match Error Rate)**: Like WER, but normalized by the total number of aligned words (hits plus errors) rather than the reference length, so it is always bounded between 0 and 1

**WIL (Word Information Lost)**: Measures semantic information loss

**WIP (Word Information Preserved)**: Complement of WIL (WIP = 1 - WIL); higher is better
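Given the word-alignment counts (hits, substitutions, deletions, insertions) that `jiwer` reports, all four metrics follow from simple ratios. A sketch with hypothetical counts, not figures taken from this evaluation (though any split totalling 8 errors against the 137-word reference reproduces the headline 5.84% WER):

```python
def metrics_from_counts(hits: int, subs: int, dels: int, ins: int) -> dict:
    """Derive WER, MER, WIL, and WIP from word-alignment counts."""
    n_ref = hits + subs + dels   # words in the reference
    n_hyp = hits + subs + ins    # words in the hypothesis
    errors = subs + dels + ins
    wer = errors / n_ref                   # can exceed 1.0 with many insertions
    mer = errors / (hits + errors)         # bounded to [0, 1]
    wip = (hits / n_ref) * (hits / n_hyp)  # information preserved
    wil = 1.0 - wip                        # information lost
    return {"wer": wer, "mer": mer, "wil": wil, "wip": wip}

# Hypothetical counts for a 137-word reference: 130 hits, 5 subs, 2 dels, 1 ins
m = metrics_from_counts(130, 5, 2, 1)
print(f"WER = {m['wer']:.2%}")  # WER = 5.84%
```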
### Error Types
- Substitutions: Incorrect word transcribed
- Deletions: Missing words from output
- Insertions: Extra words added (hallucinations)
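These three error types can be counted from a word-level alignment. A stdlib-only sketch using `difflib` (note: `difflib`'s alignment is not guaranteed to match the minimum-edit-distance alignment `jiwer` uses, so counts can differ on some inputs):

```python
import difflib

def error_counts(reference: str, hypothesis: str) -> dict:
    """Count hits, substitutions, deletions, and insertions between word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    hits = subs = dels = ins = 0
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=ref, b=hyp).get_opcodes():
        if tag == "equal":
            hits += i2 - i1
        elif tag == "replace":
            # paired words are substitutions; any length mismatch spills
            # over into deletions or insertions
            subs += min(i2 - i1, j2 - j1)
            dels += max(0, (i2 - i1) - (j2 - j1))
            ins += max(0, (j2 - j1) - (i2 - i1))
        elif tag == "delete":
            dels += i2 - i1
        elif tag == "insert":
            ins += j2 - j1
    return {"hits": hits, "substitutions": subs, "deletions": dels, "insertions": ins}

# One substitution ("sat" -> "sit") and one deletion ("the" dropped)
print(error_counts("the cat sat on the mat", "the cat sit on mat"))
```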
### Test Environment
- Hardware: CPU inference (no GPU)
- Python: 3.12
- Framework: Hugging Face Transformers
- Audio Sample: WAV format, 137 words
- Content: Narrative passage about coastal town
View the full interactive results above! 👆