Morpheme-Aware Whisper for Low-Resource Armenian ASR
This model is a fine-tuned version of Whisper Tiny that uses a frozen encoder and a custom morpheme tokenizer to deliver cost-effective, accurate speech-to-text for Armenian. Trained on the Common Voice 20.0 dataset, it outperforms the standard OpenAI Whisper checkpoints on Armenian in both speed and accuracy.
Model Details
- Model Architecture: Whisper Tiny (Frozen Encoder, Retrained Decoder)
- Language: Armenian (`hy`)
- Tokenizer: Custom Morpheme Tokenizer
- Dataset: Chillarmo/common_voice_20_armenian
- Paper: Morpheme-Aware Whisper for Low-Resource Armenian ASR (Movsesyan, 2025)
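A minimal transcription sketch using the `transformers` pipeline. The model id `Chillarmo/Whisper-Tiny-New-Vocab` is this repository's id; the audio path is a placeholder, and depending on how the custom morpheme tokenizer is packaged you may need to load it explicitly rather than relying on pipeline defaults:

```python
from transformers import pipeline

# Load the fine-tuned checkpoint from this repository.
asr = pipeline(
    "automatic-speech-recognition",
    model="Chillarmo/Whisper-Tiny-New-Vocab",
)

# "sample.wav" is a placeholder path. Force Armenian transcription
# rather than language detection or translation.
result = asr("sample.wav", generate_kwargs={"language": "hy", "task": "transcribe"})
print(result["text"])
```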
Abstract & Motivation
For a language such as Armenian, the ability to translate speech into text accurately and cost-effectively would be enormously valuable. Because Armenian is a low-resource language, however, no such capability exists yet. This project uses Whisper Tiny as the core technology to close that gap.
The problem with current generic models is that they are not robust enough for real-world use. A business professional or an Armenian organization cannot simply deploy existing methods and expect them to work; the high error rates produce nonsensical output. This creates accessibility problems at events where the speaker is Armenian but the audience may not fully comprehend the language, forcing listeners to focus on decoding the speech rather than understanding the concepts.
Evaluation Results
| Metric | Value |
|---|---|
| WER (Word Error Rate) | 40.16% |
| Exact Match | 12.14% |
| Eval Loss | 1.31 |
Note: All metrics were computed on the Common Voice 20.0 test set. WER was calculated using strict Armenian text normalization (punctuation and non-Armenian characters removed).
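The exact normalizer is not published here; the sketch below shows one plausible implementation of the described normalization using the `evaluate` library, keeping only lowercased Armenian letters (Unicode range U+0561–U+0587) and treating everything else as whitespace:

```python
import re

import evaluate  # pip install evaluate jiwer

# Hypothetical normalizer: anything outside the lowercase Armenian
# letter block (U+0561-U+0587) is treated as a word separator.
_NON_ARMENIAN = re.compile(r"[^\u0561-\u0587]+")

def normalize_hy(text: str) -> str:
    return " ".join(_NON_ARMENIAN.split(text.lower()))

wer = evaluate.load("wer")
refs = ["Բարեւ ձեզ"]    # example reference
preds = ["բարեւ, ձեզ"]  # example hypothesis
score = wer.compute(
    predictions=[normalize_hy(p) for p in preds],
    references=[normalize_hy(r) for r in refs],
)
print(f"WER: {score:.2%}")  # 0.00% once case and punctuation are stripped
```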
Training Procedure
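Fine-tuning retrains only the decoder: the encoder stays frozen, and the decoder's embedding table is resized to the morpheme vocabulary. A minimal sketch of that setup, assuming the custom morpheme tokenizer loads like a standard `transformers` tokenizer:

```python
from transformers import WhisperForConditionalGeneration, WhisperTokenizer

# Start from the pretrained Whisper Tiny checkpoint.
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Assumption: the custom morpheme tokenizer ships with this repository.
tokenizer = WhisperTokenizer.from_pretrained("Chillarmo/Whisper-Tiny-New-Vocab")

# Resize the decoder embeddings to match the morpheme vocabulary.
model.resize_token_embeddings(len(tokenizer))

# Freeze the encoder so only the decoder is updated during training.
for param in model.model.encoder.parameters():
    param.requires_grad = False
```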
Hyperparameters
The following hyperparameters were used during training (a sketch mapping them onto `Seq2SeqTrainingArguments` follows the list):
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 16
- seed: 42
- optimizer: adamw_torch_fused (betas=(0.9,0.999), epsilon=1e-08)
- lr_scheduler_type: linear
- num_epochs: 3.0
- mixed_precision_training: Native AMP
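As a rough mapping onto `transformers`, the list above corresponds to the following `Seq2SeqTrainingArguments`; `output_dir` is a placeholder, and anything not listed above is left at its default:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-tiny-hy-morpheme",  # placeholder, not from the card
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    seed=42,
    optim="adamw_torch_fused",  # AdamW, betas=(0.9, 0.999), eps=1e-8
    lr_scheduler_type="linear",
    num_train_epochs=3.0,
    fp16=True,                  # mixed precision via native AMP
)
```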
Framework Versions
- Transformers 4.56.2
- PyTorch 2.8.0+cu129
- Datasets 3.5.0
- Tokenizers 0.22.1
Citation
If you use this model, please cite the following work:
@inproceedings{movsesyan2025morpheme,
author = {Movses Movsesyan},
title = {Morpheme-Aware Whisper for Low-Resource Armenian ASR},
booktitle = {ACM},
year = {2025},
url = {https://doi.org/10.1145/nnnnnnn.nnnnnnn}
}