---
title: Whisper Fine-Tune vs Commercial APIs
emoji: 🎤
colorFrom: purple
colorTo: blue
sdk: static
pinned: false
license: mit
short_description: Local fine-tunes beat commercial STT APIs
tags:
- whisper
- speech-to-text
- evaluation
- benchmark
- api-comparison
---
# Whisper Fine-Tune vs Commercial APIs
Interactive visualization showing a locally fine-tuned Whisper model beating commercial STT APIs (OpenAI Whisper, Assembly, Gladia) on transcription accuracy.
## Overview
This Space presents a comprehensive evaluation of **7 models**:
- 4 fine-tuned Whisper variants (Large V3 Turbo, Small, Base, Tiny)
- 3 commercial STT APIs (Assembly, Gladia, OpenAI Whisper)
All models were tested on the same 137-word audio sample with a verified ground-truth transcription.
## Key Findings
**Winner:** Whisper Large V3 Turbo (Fine-Tune) - Local
- **Accuracy: 94.16%**
- **Beats the best commercial API** (Assembly at 92.70%)
- **Zero deletions** - no lost content
- Production-ready + privacy-focused + zero per-minute costs
## Visualizations
This Space includes interactive charts for:
1. **WER Comparison** - Overall transcription accuracy
2. **Error Breakdown** - Substitutions, deletions, insertions by model
3. **Information Preserved** - Semantic accuracy metrics
4. **Detailed Metrics** - Complete performance breakdown
## Methodology
- **Ground Truth:** Manual transcription verification
- **Metrics:** WER, MER, WIL, WIP using the `jiwer` library (see the sketch after this list)
- **Framework:** Hugging Face Transformers pipeline
- **Environment:** Python 3.12, CPU inference
- **Test Sample:** 137-word narrative passage
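The scoring loop is small enough to sketch. Below is a minimal, hypothetical version of it, assuming `transformers` and `jiwer` are installed; the checkpoint path and file names are placeholders, not the actual artifacts behind this Space.

```python
import jiwer
from transformers import pipeline

# CPU-only inference, matching the test environment (no GPU).
asr = pipeline(
    "automatic-speech-recognition",
    model="./whisper-large-v3-turbo-finetune",  # placeholder local checkpoint
    device=-1,           # -1 = CPU
    chunk_length_s=30,   # Whisper processes audio in 30-second windows
)

# Placeholder file names standing in for the 137-word test sample.
reference = open("ground_truth.txt").read().strip()
hypothesis = asr("sample.wav")["text"]

print(f"WER: {jiwer.wer(reference, hypothesis):.4f}")
print(f"MER: {jiwer.mer(reference, hypothesis):.4f}")
print(f"WIL: {jiwer.wil(reference, hypothesis):.4f}")
print(f"WIP: {jiwer.wip(reference, hypothesis):.4f}")
```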
## Results Summary
| Rank | Model | Type | Accuracy | WER |
|------|-------|------|----------|-----|
| 1 | Whisper Large V3 Turbo (Fine-Tune) | Local | 94.16% | 5.84% |
| 2 | Assembly API | Commercial | 92.70% | 7.30% |
| 3 | Gladia API | Commercial | 91.97% | 8.03% |
| 4 | Whisper Small (Fine-Tune) | Local | 91.24% | 8.76% |
| 5 | Whisper (OpenAI API) | Commercial | 91.24% | 8.76% |
| 6 | Whisper Base (Fine-Tune) | Local | 85.40% | 14.60% |
| 7 | Whisper Tiny (Fine-Tune) | Local | 85.40% | 14.60% |
## Key Insights
### 1. Local Fine-Tunes Beat Commercial Whisper APIs
The fine-tuned Whisper Large V3 Turbo achieved **94.16% accuracy**, beating the best commercial service (Assembly at 92.70%). This demonstrates that targeted fine-tuning can outperform premium APIs built on the same base model.
### 2. Cost & Privacy Advantages
Local models eliminate per-minute API costs and keep sensitive audio data on-premises. The performance advantage makes this even more compelling.
### 3. Commercial APIs Are Competitive
All three commercial APIs delivered production-ready performance (91-93% accuracy). They're viable alternatives when local inference isn't feasible.
### 4. Production Recommendations
**Best Overall:**
- Whisper Large V3 Turbo (Fine-Tune) - 94.16% accuracy, local deployment
**Best Commercial:**
- Assembly API - 92.70% accuracy if cloud deployment required
**Balanced Local:**
- Whisper Small (Fine-Tune) - 91.24% accuracy, matches OpenAI with faster inference
## Resources
- **Evaluation Framework:** Python-based automated testing
- **Models Used:** OpenAI Whisper variants and FUTO fine-tunes
- **Metrics Library:** [jiwer](https://github.com/jitsi/jiwer)
- **Visualization:** Chart.js for interactive charts
## License
MIT License - See full evaluation data and methodology in the Space.
## Author
**Daniel Rosehill**
- Website: [danielrosehill.com](https://danielrosehill.com)
- Email: public@danielrosehill.com
---
*Generated by automated Whisper evaluation framework | November 2025*
## Technical Details
### Evaluation Metrics Explained
- **WER (Word Error Rate):** Primary metric - percentage of words transcribed incorrectly
- 0-10%: Excellent/Production ready
- 10-20%: Good/Acceptable
- 20%+: Needs improvement
- **MER (Match Error Rate):** Like WER, but normalized by the total number of aligned words (hits plus errors), so it is always between 0 and 1
- **WIL (Word Information Lost):** Measures the proportion of word-level information lost between reference and hypothesis
- **WIP (Word Information Preserved):** Complement of WIL (WIP = 1 − WIL) - higher is better; see the formulas below
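For reference, these are the standard definitions `jiwer` implements, where $H$, $S$, $D$, and $I$ are the hit, substitution, deletion, and insertion counts, and $N_{\text{ref}}$, $N_{\text{hyp}}$ are the word counts of the reference and hypothesis:

$$
\mathrm{WER} = \frac{S + D + I}{N_{\text{ref}}}
\qquad
\mathrm{MER} = \frac{S + D + I}{H + S + D + I}
$$

$$
\mathrm{WIP} = \frac{H}{N_{\text{ref}}} \cdot \frac{H}{N_{\text{hyp}}}
\qquad
\mathrm{WIL} = 1 - \mathrm{WIP}
$$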
### Error Types
- **Substitutions:** Incorrect word transcribed
- **Deletions:** Missing words from output
- **Insertions:** Extra words added to the output (hallucinations); a small `jiwer` example follows below
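As a concrete illustration, the toy pair below (not the actual test sample) triggers one of each error type; `jiwer.process_words` (available in jiwer ≥ 3.0) returns the per-type counts and the alignment:

```python
import jiwer

# Toy pair chosen to produce one substitution, one deletion, and one insertion.
reference  = "the quiet coastal town woke early"
hypothesis = "the the quite coastal town woke"

out = jiwer.process_words(reference, hypothesis)

print("hits:         ", out.hits)           # 4
print("substitutions:", out.substitutions)  # 1
print("deletions:    ", out.deletions)      # 1 ("early" is dropped)
print("insertions:   ", out.insertions)     # 1 (an extra "the")

# Render the word-level alignment with S/D/I markers.
print(jiwer.visualize_alignment(out))
```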
### Test Environment
- **Hardware:** CPU inference (no GPU)
- **Python:** 3.12
- **Framework:** Hugging Face Transformers
- **Audio Format:** WAV, 137 words
- **Content:** Narrative passage about coastal town
---
View the full interactive results above! 👆