# Whisper Model WER Evaluation
Quick single-sample test comparing Word Error Rate (WER) performance of fine-tuned versus stock Whisper models on local hardware.
## Test Setup
- Fine-tuning: Performed on Modal using A100 GPU
- Training data: 1 hour of audio, chunked and timestamped using WhisperX
- Evaluation: Single audio sample (137 words) tested on local hardware
- Test audio: `eval/test-audio.wav` with ground truth in `eval/truth.txt`
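A rough sketch of how a single model's transcription of this sample can be reproduced with the Hugging Face `pipeline` API; the model identifier below is a placeholder (the actual evaluation is driven by `scripts/evaluate_models.py`):

```python
# Sketch: transcribe the test sample with one model (compare the output
# against eval/truth.txt by hand or with jiwer, as shown further below).
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",  # placeholder: stock baseline or a local fine-tuned checkpoint
    chunk_length_s=30,             # Whisper's native 30 s window
)

print(asr("eval/test-audio.wav")["text"])
```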
## Quick Results
Best performer: Whisper Large Turbo (Fine-tuned) - 5.84% WER
See the Evaluation Summary (`docs/EVALUATION_SUMMARY.md`) for detailed analysis and the Latest Results (`results/latest/`) for raw data.
## Models Tested

### Fine-Tuned Models
- Whisper Large Turbo - 5.84% WER (Production Ready)
- Whisper Small (FUTO) - 8.76% WER (Production Ready)
- Whisper Tiny (FUTO) - 14.60% WER
- Whisper Base - 14.60% WER
### Stock Baseline
- Whisper Small (OpenAI) - 11.68% WER (baseline)
Key Finding: The fine-tuned Large Turbo achieved a 50% relative WER reduction versus the stock Whisper Small baseline (5.84% vs. 11.68%) on this test sample.
## Repository Structure
```
Local-STT-Fine-Tune-Tests/
├── README.md                    # This file
├── requirements.txt             # Python dependencies
│
├── scripts/                     # Evaluation scripts
│   ├── evaluate_models.py       # Main evaluation script
│   └── run_evaluation.sh        # Convenience runner
│
├── docs/                        # Documentation
│   ├── EVALUATION_SUMMARY.md    # Comprehensive analysis & recommendations
│   └── paths.md                 # Model path reference
│
├── eval/                        # Test data
│   ├── test-audio.wav           # Test audio file (137 words)
│   └── truth.txt                # Ground truth transcription
│
└── results/                     # Evaluation outputs
    ├── latest/                  # Most recent results
    │   ├── report.txt           # Human-readable report
    │   ├── results.json         # Machine-readable data
    │   └── model_comparison_chart.txt
    ├── transcriptions/          # Individual model outputs
    └── archive/                 # Historical runs
        └── evaluation_*.txt     # Timestamped reports
```
## Quick Start
Run the evaluation with a single command:
```bash
./scripts/run_evaluation.sh
```
Or manually:
```bash
# Create venv if needed
uv venv
source .venv/bin/activate

# Install dependencies
uv pip install -r requirements.txt

# Run evaluation
python scripts/evaluate_models.py
```
## What Gets Evaluated
For each model, the script calculates:
- WER (Word Error Rate) - Primary metric
- MER (Match Error Rate)
- WIL (Word Information Lost)
- WIP (Word Information Preserved)
- Error breakdown:
  - Hits (correct words)
  - Substitutions
  - Deletions
  - Insertions
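These metrics come from the jiwer package listed in `requirements.txt`. A minimal sketch of the computation, assuming jiwer 3.x's `process_words` API (the placeholder hypothesis string stands in for a model transcription):

```python
# Sketch of the per-model metric computation with jiwer (3.x process_words API assumed).
import jiwer

reference = open("eval/truth.txt").read()
hypothesis = "the model's transcription of eval/test-audio.wav"  # e.g. from the pipeline sketch above

out = jiwer.process_words(reference, hypothesis)
print(f"WER: {out.wer:.2%}  MER: {out.mer:.2%}  WIL: {out.wil:.2%}  WIP: {out.wip:.2%}")
print(f"hits={out.hits}  substitutions={out.substitutions}  deletions={out.deletions}  insertions={out.insertions}")
```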
## Output Files
Results are saved to the `results/` directory:

- `evaluation_report_YYYYMMDD_HHMMSS.txt` - Human-readable report
- `evaluation_results_YYYYMMDD_HHMMSS.json` - Machine-readable results
- `transcription_<model_name>.txt` - Individual transcriptions
## Understanding WER
WER is calculated as:
```
WER = (Substitutions + Deletions + Insertions) / Total Words in Reference
```
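For example, on the 137-word reference used here, the best result of 5.84% WER corresponds to 8 word-level errors (8 / 137 ≈ 0.0584), while the stock baseline's 11.68% corresponds to 16 (16 / 137 ≈ 0.1168).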
Interpretation:
- < 5%: Excellent - Near-human level
- 5-10%: Very Good - Production ready
- 10-20%: Good - Acceptable for most uses
- 20-30%: Fair - May need post-processing
- > 30%: Poor - Needs improvement
## Report Format
The evaluation report includes:
- Ranked Results - Models sorted by WER (best to worst)
- Detailed Metrics - Full breakdown for each model
- Conclusions - Best/worst performers and improvement analysis
- WER Interpretation - Context for the results
## Requirements
- Python 3.8+
- PyTorch
- Transformers
- jiwer
- CUDA (optional, for GPU acceleration)
## Documentation
- Evaluation Summary (`docs/EVALUATION_SUMMARY.md`) - Detailed analysis with recommendations
- Model Paths (`docs/paths.md`) - Reference for model locations
- Latest Results (`results/latest/`) - Most recent evaluation outputs
- Comparison Chart (`results/latest/model_comparison_chart.txt`) - Visual WER comparison
## Notes
- The script automatically detects CUDA and uses GPU if available
- Each run generates timestamped outputs for comparison tracking
- Transcriptions are saved individually for manual review
- Failed model loads are reported separately in the evaluation report
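A rough sketch of the device auto-detection and timestamped naming described in the notes above; the variable names and output path are assumptions, not the actual contents of `scripts/evaluate_models.py`:

```python
# Illustrative only: typical CUDA auto-detection and timestamped report naming.
from datetime import datetime
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
report_path = f"results/evaluation_report_{timestamp}.txt"  # matches the evaluation_report_YYYYMMDD_HHMMSS.txt pattern
print(f"Device: {device} -> report will be written to {report_path}")
```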
## Contributing
To test additional models:
- Add the model path to `scripts/evaluate_models.py` in the `MODELS` dictionary (a hypothetical example follows this list)
- Run `./scripts/run_evaluation.sh`
- Check `results/latest/` for updated rankings
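For reference, a hypothetical `MODELS` entry, assuming the dictionary maps display names to Hugging Face model IDs or local checkpoint paths (check the script for its actual structure):

```python
# Hypothetical shape of the MODELS dictionary in scripts/evaluate_models.py;
# keys and values here are placeholders, not the script's real entries.
MODELS = {
    "Whisper Small (OpenAI)": "openai/whisper-small",                      # stock baseline
    "Whisper Large Turbo (Fine-tuned)": "/path/to/finetuned-large-turbo",  # local checkpoint
}
```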
## License
MIT License - See LICENSE file for details