Overview
This repository presents a comprehensive and publicly accessible collection of Natural Language Processing (NLP) resources for the Tigre language, an under‑resourced South Semitic language within the Afro‑Asiatic family. The release integrates both text and speech modalities and provides strong baseline models for key NLP tasks, including language modeling, automatic speech recognition (ASR), and machine translation.
All models were trained on a large‑scale Tigre corpus and are designed to support downstream research and technology development for this low‑resource language community.
w2v-bert-2.0-tig-asr: Automatic Speech Recognition for Tigre (ትግረ)
Model Description
This model is based on the Wav2Vec2‑BERT architecture (facebook/w2v‑bert‑2.0) and fine‑tuned specifically for Automatic Speech Recognition (ASR) in the Tigre language (ትግረ), spoken primarily in Eritrea and Sudan.
Tigre is written using the Ge’ez (Ethiopic) script, and the model is optimized to generate high‑quality transcriptions in this script.
Fine‑tuning was performed using the Hugging Face transformers library, with the addition of an adapter layer to enhance performance under low‑resource constraints.
Training Data
The model was trained on the Tigre subset of Mozilla Common Voice.
- Dataset Source: Common Voice Tigre
- Text Pre‑processing (see the sketch after this list):
  - Removal of punctuation
  - Removal of Latin characters and other non‑Tigre symbols
  - Normalization to a clean Ethiopic character set for CTC decoding
- Training Samples: 17,723 audio clips
- Evaluation Samples: 1,970 audio clips
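The exact preprocessing script is not included in this card. The snippet below is a minimal sketch of the cleanup described above, assuming transcripts are filtered with a regular expression that keeps only characters from the basic Ethiopic Unicode block plus whitespace; the function name and example string are illustrative only.

```python
import re

# Minimal sketch of the transcript cleanup described above; the exact script
# used for this model is not part of the card. Keeps Ethiopic characters and
# spaces, drops punctuation, Latin letters, and other non-Tigre symbols.
NON_ETHIOPIC = re.compile(r"[^\u1200-\u137F\s]")  # basic Ethiopic Unicode block

def clean_transcript(text: str) -> str:
    """Normalize a transcript to a clean Ethiopic character set for CTC training."""
    text = NON_ETHIOPIC.sub("", text)
    # Collapse any whitespace left behind by the removals
    return re.sub(r"\s+", " ", text).strip()

print(clean_transcript("ሰላም, hello!"))  # -> "ሰላም"
```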
Evaluation Results
Performance was measured using Word Error Rate (WER) on the Common Voice test split.
| Metric | Value |
|---|---|
| Final Validation WER | 0.088010 (~ 8.8%) |
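The evaluation script itself is not shown in the card. The following is a sketch of how WER is conventionally computed in Hugging Face CTC fine‑tuning recipes, using the `evaluate` library; treating that as the tooling used here is an assumption.

```python
import evaluate

# Sketch of a standard WER computation with the Hugging Face `evaluate` library;
# the exact evaluation code for this model is not shown in the card.
wer_metric = evaluate.load("wer")

predictions = ["<decoded model output>"]    # transcriptions decoded from the model
references = ["<ground-truth transcript>"]  # reference texts from the test split

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.4f}")  # the reported final validation WER is about 0.0880
```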
Notes on Training Dynamics
Training was conducted over 10 epochs. Loss and validation WER decreased steadily, with the most rapid improvement in the early steps.
A divergence between training and validation loss around step ~2,700 indicates the onset of mild overfitting, a common behavior when training on limited‑size corpora.
Training Details
| Parameter | Value | Notes |
|---|---|---|
| Base Model | facebook/w2v-bert-2.0 | Pre‑trained multilingual speech model |
| Tokenizer | Custom Wav2Vec2CTCTokenizer | Based on Tigre Ge’ez characters (vocab size: 197) |
| Processor | Wav2Vec2BertProcessor | Combines tokenizer + SeamlessM4TFeatureExtractor |
| Optimizer | AdamW | Standard HF configuration |
| Learning Rate | 5e‑5 | Stable for adapter‑based tuning |
| Epochs | 10 | Full fine‑tuning schedule |
| Effective Batch Size | 32 | Using gradient_accumulation_steps=16 |
| Architecture Modification | add_adapter=True | Parameter‑efficient tuning |
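The sketch below shows how these hyperparameters map onto a standard transformers CTC fine‑tuning setup. The vocabulary file name, special tokens, per‑device batch size, and mixed‑precision flag are assumptions for illustration, not the exact training script.

```python
from transformers import (
    Wav2Vec2CTCTokenizer,
    SeamlessM4TFeatureExtractor,
    Wav2Vec2BertProcessor,
    Wav2Vec2BertForCTC,
    TrainingArguments,
)

# Custom CTC tokenizer built from the 197-character Tigre Ge'ez vocabulary
# (vocab.json and the special tokens are assumed names for illustration)
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = SeamlessM4TFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
processor = Wav2Vec2BertProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Base model with an adapter layer for parameter-efficient fine-tuning
model = Wav2Vec2BertForCTC.from_pretrained(
    "facebook/w2v-bert-2.0",
    add_adapter=True,
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)

# Hyperparameters from the table above; a per-device batch size of 2 with
# gradient_accumulation_steps=16 gives the effective batch size of 32
training_args = TrainingArguments(
    output_dir="w2v-bert-2.0-tig-asr",
    per_device_train_batch_size=2,   # assumed split of the effective batch size
    gradient_accumulation_steps=16,
    learning_rate=5e-5,
    num_train_epochs=10,
    fp16=True,                       # mixed precision (assumed)
)
```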
Summary
The model achieves a Word Error Rate of approximately 8.8%, which is competitive for a low‑resource ASR task over the Ge’ez script.
These results demonstrate the effectiveness of transfer learning combined with adapter‑based parameter‑efficient fine‑tuning for building ASR systems in low‑resource linguistic settings.
Usage
🐍 Quick Inference with Hugging Face Pipeline
```python
from transformers import pipeline

# Load the fine-tuned Tigre ASR model from the Hugging Face Hub
model_id = "BeitTigreAI/tigre-asr-Wav2Vec2Bert"
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    device=0,  # GPU index; use device="cpu" to run on CPU
)

# Transcribe a local audio file (replace with the path to your recording)
result = asr_pipeline("audio_path.wav")
print(result["text"])
```
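For finer control over decoding, the model can also be loaded directly with its processor and CTC head. The snippet below is a sketch assuming 16 kHz mono input and greedy CTC decoding; librosa is used here only as a convenient way to load and resample the audio.

```python
import torch
import librosa
from transformers import Wav2Vec2BertProcessor, Wav2Vec2BertForCTC

model_id = "BeitTigreAI/tigre-asr-Wav2Vec2Bert"
processor = Wav2Vec2BertProcessor.from_pretrained(model_id)
model = Wav2Vec2BertForCTC.from_pretrained(model_id)

# Load audio at the 16 kHz sampling rate expected by the feature extractor
speech, _ = librosa.load("audio_path.wav", sr=16000)
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding back to Ge'ez-script text
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
```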