Whisper Model WER Evaluation

Quick single-sample test comparing Word Error Rate (WER) performance of fine-tuned versus stock Whisper models on local hardware.

Test Setup

  • Fine-tuning: Performed on Modal using A100 GPU
  • Training data: 1 hour of audio, chunked and timestamped using WhisperX (see the sketch after this list)
  • Evaluation: Single audio sample (137 words) tested on local hardware
  • Test audio: eval/test-audio.wav with ground truth in eval/truth.txt
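
The Modal fine-tuning pipeline itself is not part of this repository, but the WhisperX chunking step can be sketched roughly as follows. This is a sketch only, assuming the whisperx Python package; the audio path, model size, and batch size are illustrative rather than taken from the actual pipeline.

# Sketch only: segment/word timestamps from WhisperX, used to cut the
# training audio into chunks. The real data-prep code may differ.
import whisperx

device = "cuda"
audio = whisperx.load_audio("training-audio.wav")  # placeholder path

# Rough transcription with segment-level timestamps
model = whisperx.load_model("large-v2", device)
result = model.transcribe(audio, batch_size=16)

# Forced alignment for tighter word-level timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)

# Each aligned segment carries start/end times usable as chunk boundaries
for seg in aligned["segments"]:
    print(seg["start"], seg["end"], seg["text"])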

Quick Results

Best performer: Whisper Large Turbo (Fine-tuned) - 5.84% WER

See docs/EVALUATION_SUMMARY.md for the detailed analysis and results/latest/ for the raw data.

Models Tested

Fine-Tuned Models

  1. Whisper Large Turbo - 5.84% WER (Production Ready)
  2. Whisper Small (FUTO) - 8.76% WER (Production Ready)
  3. Whisper Tiny (FUTO) - 14.60% WER
  4. Whisper Base - 14.60% WER

Stock Baseline

  • Whisper Small (OpenAI) - 11.68% WER (baseline)

Key Finding: The fine-tuned Large Turbo achieved a 50% lower WER than the stock Whisper Small baseline on this test sample (5.84% vs. 11.68%).

Repository Structure

Local-STT-Fine-Tune-Tests/
├── README.md                    # This file
├── requirements.txt             # Python dependencies
│
├── scripts/                     # Evaluation scripts
│   ├── evaluate_models.py       # Main evaluation script
│   └── run_evaluation.sh        # Convenience runner
│
├── docs/                        # Documentation
│   ├── EVALUATION_SUMMARY.md    # Comprehensive analysis & recommendations
│   └── paths.md                 # Model path reference
│
├── eval/                        # Test data
│   ├── test-audio.wav           # Test audio file (137 words)
│   └── truth.txt                # Ground truth transcription
│
└── results/                     # Evaluation outputs
    ├── latest/                  # Most recent results
    │   ├── report.txt           # Human-readable report
    │   ├── results.json         # Machine-readable data
    │   └── model_comparison_chart.txt
    ├── transcriptions/          # Individual model outputs
    ├── archive/                 # Historical runs
    └── evaluation_*.txt         # Timestamped reports

Quick Start

Run the evaluation with a single command:

./scripts/run_evaluation.sh

Or manually:

# Create venv if needed
uv venv
source .venv/bin/activate

# Install dependencies
uv pip install -r requirements.txt

# Run evaluation
python scripts/evaluate_models.py

What Gets Evaluated

For each model, the script calculates the metrics below (a minimal scoring sketch follows the list):

  • WER (Word Error Rate) - Primary metric
  • MER (Match Error Rate)
  • WIL (Word Information Lost)
  • WIP (Word Information Preserved)
  • Error breakdown:
    • Hits (correct words)
    • Substitutions
    • Deletions
    • Insertions
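
The scoring step can be sketched with the Hugging Face ASR pipeline plus jiwer. This is a minimal sketch, not the actual contents of scripts/evaluate_models.py; the model ID and chunk length are assumptions, and a recent jiwer (3.x) is assumed for process_words.

# Minimal sketch: transcribe the test clip, then compute every metric above with jiwer
import jiwer
from transformers import pipeline

reference = open("eval/truth.txt").read().strip()

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",  # or a local fine-tuned checkpoint directory
    chunk_length_s=30,             # Whisper's native 30-second window
)
hypothesis = asr("eval/test-audio.wav")["text"]

measures = jiwer.process_words(reference, hypothesis)
print(f"WER {measures.wer:.2%}  MER {measures.mer:.2%}  WIL {measures.wil:.2%}  WIP {measures.wip:.2%}")
print(f"hits={measures.hits} sub={measures.substitutions} del={measures.deletions} ins={measures.insertions}")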

Output Files

Results are saved to the results/ directory, with timestamped filenames generated as sketched after this list:

  • evaluation_report_YYYYMMDD_HHMMSS.txt - Human-readable report
  • evaluation_results_YYYYMMDD_HHMMSS.json - Machine-readable results
  • transcription_<model_name>.txt - Individual transcriptions
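
The timestamp suffix follows the standard YYYYMMDD_HHMMSS pattern; below is a sketch of how such names can be generated (assumed, not copied from the script).

# Sketch: timestamped output paths in the results/ directory
from datetime import datetime

stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
report_path = f"results/evaluation_report_{stamp}.txt"
results_path = f"results/evaluation_results_{stamp}.json"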

Understanding WER

WER is calculated as:

WER = (Substitutions + Deletions + Insertions) / Total Words in Reference
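
For example, with the 137-word reference used here, the best result corresponds to roughly 8 word-level errors: 8 / 137 ≈ 5.84%, the WER reported for the fine-tuned Large Turbo.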

Interpretation:

  • < 5%: Excellent - Near-human level
  • 5-10%: Very Good - Production ready
  • 10-20%: Good - Acceptable for most uses
  • 20-30%: Fair - May need post-processing
  • > 30%: Poor - Needs improvement

Report Format

The evaluation report includes:

  1. Ranked Results - Models sorted by WER (best to worst)
  2. Detailed Metrics - Full breakdown for each model
  3. Conclusions - Best/worst performers and improvement analysis
  4. WER Interpretation - Context for the results

Requirements

  • Python 3.8+
  • PyTorch
  • Transformers (Hugging Face)
  • jiwer
  • CUDA (optional, for GPU acceleration)

Documentation

  • docs/EVALUATION_SUMMARY.md - Comprehensive analysis & recommendations
  • docs/paths.md - Model path reference

Notes

  • The script automatically detects CUDA and uses the GPU if available (see the sketch after this list)
  • Each run generates timestamped outputs for comparison tracking
  • Transcriptions are saved individually for manual review
  • Failed model loads are reported separately in the evaluation report
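
The CUDA auto-detection mentioned above usually amounts to the standard PyTorch pattern; a minimal sketch follows (the exact logic in scripts/evaluate_models.py may differ).

# Sketch: pick the GPU when available, otherwise fall back to CPU
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running evaluation on: {device}")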

Contributing

To test additional models:

  1. Add the model path to the MODELS dictionary in scripts/evaluate_models.py (see the sketch after this list)
  2. Run ./scripts/run_evaluation.sh
  3. Check results/latest/ for updated rankings
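
The exact structure of the MODELS dictionary is not documented here, so the entries below are a hypothetical example mapping a display name to a Hugging Face model ID or a local checkpoint directory; adapt them to whatever shape the script actually uses.

# Hypothetical MODELS entries -- names and paths are placeholders
MODELS = {
    "whisper-small-stock": "openai/whisper-small",                      # Hugging Face Hub ID
    "whisper-large-turbo-finetuned": "/models/whisper-large-turbo-ft",  # local checkpoint (placeholder)
}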

License

MIT License - See LICENSE file for details