Whisper Model WER Evaluation

Quick single-sample test comparing Word Error Rate (WER) performance of fine-tuned versus stock Whisper models on local hardware.

Test Setup

  • Fine-tuning: Performed on Modal using A100 GPU
  • Training data: 1 hour of audio, chunked and timestamped using WhisperX (see the sketch after this list)
  • Evaluation: Single audio sample (137 words) tested on local hardware
  • Test audio: eval/test-audio.wav with ground truth in eval/truth.txt
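
The Modal fine-tuning pipeline itself is not part of this repository, but the WhisperX chunking step can be sketched roughly as follows. This is a sketch only, assuming the whisperx Python package; the audio path, model size, and batch size are illustrative rather than taken from the actual pipeline.

# Sketch only: segment/word timestamps from WhisperX, used to cut the
# training audio into chunks. The real data-prep code may differ.
import whisperx

device = "cuda"
audio = whisperx.load_audio("training-audio.wav")  # placeholder path

# Rough transcription with segment-level timestamps
model = whisperx.load_model("large-v2", device)
result = model.transcribe(audio, batch_size=16)

# Forced alignment for tighter word-level timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)

# Each aligned segment carries start/end times usable as chunk boundaries
for seg in aligned["segments"]:
    print(seg["start"], seg["end"], seg["text"])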

Quick Results

Best performer: Whisper Large Turbo (Fine-tuned) - 5.84% WER

See docs/EVALUATION_SUMMARY.md for the detailed analysis and results/latest/ for the raw data.

Models Tested

Fine-Tuned Models

  1. Whisper Large Turbo - 5.84% WER (Production Ready)
  2. Whisper Small (FUTO) - 8.76% WER (Production Ready)
  3. Whisper Tiny (FUTO) - 14.60% WER
  4. Whisper Base - 14.60% WER

Stock Baseline

  • Whisper Small (OpenAI) - 11.68% WER (baseline)

Key Finding: The fine-tuned Large Turbo achieved a 50% lower WER than the stock Whisper Small baseline on this test sample (5.84% vs. 11.68%).

Repository Structure

Local-STT-Fine-Tune-Tests/
├── README.md                    # This file
├── requirements.txt             # Python dependencies
│
├── scripts/                     # Evaluation scripts
│   ├── evaluate_models.py       # Main evaluation script
│   └── run_evaluation.sh        # Convenience runner
│
├── docs/                        # Documentation
│   ├── EVALUATION_SUMMARY.md    # Comprehensive analysis & recommendations
│   └── paths.md                 # Model path reference
│
├── eval/                        # Test data
│   ├── test-audio.wav           # Test audio file (137 words)
│   └── truth.txt                # Ground truth transcription
│
└── results/                     # Evaluation outputs
    ├── latest/                  # Most recent results
    │   ├── report.txt           # Human-readable report
    │   ├── results.json         # Machine-readable data
    │   └── model_comparison_chart.txt
    ├── transcriptions/          # Individual model outputs
    ├── archive/                 # Historical runs
    └── evaluation_*.txt         # Timestamped reports

Quick Start

Run the evaluation with a single command:

./scripts/run_evaluation.sh

Or manually:

# Create venv if needed
uv venv
source .venv/bin/activate

# Install dependencies
uv pip install -r requirements.txt

# Run evaluation
python scripts/evaluate_models.py

What Gets Evaluated

For each model, the script calculates the metrics below (a minimal scoring sketch follows the list):

  • WER (Word Error Rate) - Primary metric
  • MER (Match Error Rate)
  • WIL (Word Information Lost)
  • WIP (Word Information Preserved)
  • Error breakdown:
    • Hits (correct words)
    • Substitutions
    • Deletions
    • Insertions
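
The scoring step can be sketched with the Hugging Face ASR pipeline plus jiwer. This is a minimal sketch, not the actual contents of scripts/evaluate_models.py; the model ID and chunk length are assumptions, and a recent jiwer (3.x) is assumed for process_words.

# Minimal sketch: transcribe the test clip, then compute every metric above with jiwer
import jiwer
from transformers import pipeline

reference = open("eval/truth.txt").read().strip()

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",  # or a local fine-tuned checkpoint directory
    chunk_length_s=30,             # Whisper's native 30-second window
)
hypothesis = asr("eval/test-audio.wav")["text"]

measures = jiwer.process_words(reference, hypothesis)
print(f"WER {measures.wer:.2%}  MER {measures.mer:.2%}  WIL {measures.wil:.2%}  WIP {measures.wip:.2%}")
print(f"hits={measures.hits} sub={measures.substitutions} del={measures.deletions} ins={measures.insertions}")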

Output Files

Results are saved to the results/ directory, with timestamped filenames generated as sketched after this list:

  • evaluation_report_YYYYMMDD_HHMMSS.txt - Human-readable report
  • evaluation_results_YYYYMMDD_HHMMSS.json - Machine-readable results
  • transcription_<model_name>.txt - Individual transcriptions
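
The timestamp suffix follows the standard YYYYMMDD_HHMMSS pattern; below is a sketch of how such names can be generated (assumed, not copied from the script).

# Sketch: timestamped output paths in the results/ directory
from datetime import datetime

stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
report_path = f"results/evaluation_report_{stamp}.txt"
results_path = f"results/evaluation_results_{stamp}.json"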

Understanding WER

WER is calculated as:

WER = (Substitutions + Deletions + Insertions) / Total Words in Reference
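
For example, with the 137-word reference used here, the best result corresponds to roughly 8 word-level errors: 8 / 137 ≈ 5.84%, the WER reported for the fine-tuned Large Turbo.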

Interpretation:

  • < 5%: Excellent - Near-human level
  • 5-10%: Very Good - Production ready
  • 10-20%: Good - Acceptable for most uses
  • 20-30%: Fair - May need post-processing
  • > 30%: Poor - Needs improvement

Report Format

The evaluation report includes:

  1. Ranked Results - Models sorted by WER (best to worst)
  2. Detailed Metrics - Full breakdown for each model
  3. Conclusions - Best/worst performers and improvement analysis
  4. WER Interpretation - Context for the results

Requirements

  • Python 3.8+
  • PyTorch
  • Transformers (Hugging Face)
  • jiwer
  • CUDA (optional, for GPU acceleration)

Documentation

  • docs/EVALUATION_SUMMARY.md - Comprehensive analysis & recommendations
  • docs/paths.md - Model path reference

Notes

  • The script automatically detects CUDA and uses the GPU if available (see the sketch after this list)
  • Each run generates timestamped outputs for comparison tracking
  • Transcriptions are saved individually for manual review
  • Failed model loads are reported separately in the evaluation report
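
The CUDA auto-detection mentioned above usually amounts to the standard PyTorch pattern; a minimal sketch follows (the exact logic in scripts/evaluate_models.py may differ).

# Sketch: pick the GPU when available, otherwise fall back to CPU
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running evaluation on: {device}")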

Contributing

To test additional models:

  1. Add the model path to the MODELS dictionary in scripts/evaluate_models.py (see the sketch after this list)
  2. Run ./scripts/run_evaluation.sh
  3. Check results/latest/ for updated rankings
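
The exact structure of the MODELS dictionary is not documented here, so the entries below are a hypothetical example mapping a display name to a Hugging Face model ID or a local checkpoint directory; adapt them to whatever shape the script actually uses.

# Hypothetical MODELS entries -- names and paths are placeholders
MODELS = {
    "whisper-small-stock": "openai/whisper-small",                      # Hugging Face Hub ID
    "whisper-large-turbo-finetuned": "/models/whisper-large-turbo-ft",  # local checkpoint (placeholder)
}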

License

MIT License - See LICENSE file for details