---
title: Whisper Fine-Tune vs Commercial APIs
emoji: 🎤
colorFrom: purple
colorTo: blue
sdk: static
pinned: false
license: mit
short_description: Local fine-tunes beat commercial STT APIs
tags:
  - whisper
  - speech-to-text
  - evaluation
  - benchmark
  - api-comparison
---

# Whisper Fine-Tune vs Commercial APIs

An interactive visualization showing local fine-tuned Whisper models outperforming commercial STT APIs (OpenAI Whisper, Assembly, Gladia) on transcription accuracy.

## Overview

This Space presents an evaluation of 7 models:

- 4 fine-tuned Whisper variants (Large V3 Turbo, Small, Base, Tiny)
- 3 commercial STT APIs (Assembly, Gladia, OpenAI Whisper)

All models were tested on the same 137-word audio sample against a verified ground-truth transcription.

## Key Findings

### Winner: Whisper Large V3 Turbo (Fine-Tune) - Local

- **Accuracy**: 94.16%
- Beats the best commercial API (Assembly at 92.70%)
- Zero deletions: no content lost
- Production-ready, privacy-focused, and zero per-minute costs

## Visualizations

This Space includes interactive charts for:

1. **WER Comparison**: overall transcription accuracy
2. **Error Breakdown**: substitutions, deletions, and insertions by model
3. **Information Preserved**: semantic accuracy metrics
4. **Detailed Metrics**: complete performance breakdown

## Methodology

- **Ground Truth**: manual transcription verification
- **Metrics**: WER, MER, WIL, WIP via the jiwer library
- **Framework**: Hugging Face Transformers pipeline (see the sketch below)
- **Environment**: Python 3.12, CPU inference
- **Test Sample**: 137-word narrative passage
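
A minimal sketch of the transcribe-and-score loop under these settings. The model ID, audio path, and ground-truth file below are illustrative placeholders, not the exact fine-tuned checkpoints or files used in this evaluation:

```python
# Minimal transcribe-and-score sketch; paths and model ID are placeholders.
from transformers import pipeline
import jiwer

# CPU inference to match the test environment (device=-1 selects CPU).
# chunk_length_s lets the pipeline handle audio longer than Whisper's
# 30-second window.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",  # swap in a fine-tuned checkpoint
    device=-1,
    chunk_length_s=30,
)

with open("ground_truth.txt") as f:  # hypothetical ground-truth file
    reference = f.read().strip()

hypothesis = asr("sample.wav")["text"]  # hypothetical test audio

print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")
```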

## Results Summary

| Rank | Model | Type | Accuracy | WER |
|------|-------|------|----------|-----|
| 1 | Whisper Large V3 Turbo (Fine-Tune) | Local | 94.16% | 5.84% |
| 2 | Assembly API | Commercial | 92.70% | 7.30% |
| 3 | Gladia API | Commercial | 91.97% | 8.03% |
| 4 | Whisper Small (Fine-Tune) | Local | 91.24% | 8.76% |
| 5 | Whisper (OpenAI API) | Commercial | 91.24% | 8.76% |
| 6 | Whisper Base (Fine-Tune) | Local | 85.40% | 14.60% |
| 7 | Whisper Tiny (Fine-Tune) | Local | 85.40% | 14.60% |

## Key Insights

### 1. Local Fine-Tunes Beat Commercial Whisper APIs

The fine-tuned Whisper Large V3 Turbo achieved 94.16% accuracy, beating the best commercial service (Assembly at 92.70%). This shows that targeted fine-tuning can outperform premium APIs built on the same base model.

### 2. Cost & Privacy Advantages

Local models eliminate per-minute API costs and keep sensitive audio data on-premises. The performance advantage makes this even more compelling.

### 3. Commercial APIs Are Competitive

All three commercial APIs delivered production-ready performance (91-93% accuracy). They're viable alternatives when local inference isn't feasible.

### 4. Production Recommendations

**Best Overall:**

- Whisper Large V3 Turbo (Fine-Tune): 94.16% accuracy, local deployment

**Best Commercial:**

- Assembly API: 92.70% accuracy, if cloud deployment is required

**Balanced Local:**

- Whisper Small (Fine-Tune): 91.24% accuracy, matches the OpenAI API with faster inference

## Resources

- **Evaluation Framework**: Python-based automated testing
- **Models Used**: OpenAI Whisper variants and FUTO fine-tunes
- **Metrics Library**: jiwer
- **Visualization**: Chart.js for interactive charts

## License

MIT License. See the full evaluation data and methodology in the Space.

## Author

Daniel Rosehill


---

*Generated by an automated Whisper evaluation framework | November 2025*

## Technical Details

### Evaluation Metrics Explained

- **WER (Word Error Rate)**: the primary metric; the fraction of words transcribed incorrectly, computed as (substitutions + deletions + insertions) / reference word count
  - 0-10%: excellent / production-ready
  - 10-20%: good / acceptable
  - 20%+: needs improvement
- **MER (Match Error Rate)**: similar to WER, but normalized by the total number of aligned words (matches plus errors) rather than reference length
- **WIL (Word Information Lost)**: measures semantic information loss
- **WIP (Word Information Preserved)**: 1 - WIL; higher is better

All four metrics are computed with jiwer, as in the sketch below.
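
A minimal sketch computing all four metrics in one pass, assuming jiwer 3.x (where `process_words` returns a `WordOutput` with these attributes); the sentences are illustrative only:

```python
# Compute WER, MER, WIL, and WIP in a single alignment pass (jiwer >= 3.0).
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

out = jiwer.process_words(reference, hypothesis)
print(f"WER: {out.wer:.2%}  MER: {out.mer:.2%}")
print(f"WIL: {out.wil:.2%}  WIP: {out.wip:.2%}")
```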

### Error Types

- **Substitutions**: a wrong word transcribed in place of the correct one
- **Deletions**: reference words missing from the output
- **Insertions**: extra words added (hallucinations)

The same `WordOutput` object exposes these counts, as shown below.
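
A short continuation of the jiwer sketch above that prints per-type error counts and a word-level alignment (again assuming jiwer 3.x; the sentences are illustrative):

```python
# Break down errors by type and visualize the word alignment (jiwer >= 3.0).
import jiwer

out = jiwer.process_words(
    "she sells sea shells by the sea shore",   # reference
    "she sells shells by the the sea shore",   # hypothesis
)
print(f"substitutions={out.substitutions} "
      f"deletions={out.deletions} insertions={out.insertions}")
print(jiwer.visualize_alignment(out))  # aligned words tagged by error type
```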

### Test Environment

- **Hardware**: CPU inference (no GPU)
- **Python**: 3.12
- **Framework**: Hugging Face Transformers
- **Audio Format**: WAV, 137 words
- **Content**: narrative passage about a coastal town

View the full interactive results above! 👆