Overview

This repository presents a comprehensive and publicly accessible collection of Natural Language Processing (NLP) resources for the Tigre language, an under‑resourced South Semitic language within the Afro‑Asiatic family. The release integrates both text and speech modalities and provides strong baseline models for key NLP tasks, including language modeling, automatic speech recognition (ASR), and machine translation.
All models were trained on a large‑scale Tigre corpus and are designed to support downstream research and technology development for this low‑resource language community.

w2v-bert-2.0-tig-asr: Automatic Speech Recognition for Tigre (ትግረ)

Model Description

This model is based on the Wav2Vec2‑BERT architecture (facebook/w2v‑bert‑2.0) and fine‑tuned specifically for Automatic Speech Recognition (ASR) in the Tigre language (ትግረ), spoken primarily in Eritrea and Sudan.
Tigre is written using the Ge’ez (Ethiopic) script, and the model is optimized to generate high‑quality transcriptions in this script. Fine‑tuning was performed using the Hugging Face transformers library, with the addition of an adapter layer to enhance performance under low‑resource constraints.
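
For reference, a minimal sketch of how such a checkpoint can be loaded for CTC fine‑tuning with an adapter using the transformers library is shown below; the vocabulary size matches the tokenizer described under Training Details, while the remaining keyword arguments are assumptions rather than the exact released configuration.

from transformers import Wav2Vec2BertForCTC

# Load the pre-trained encoder and attach a freshly initialised CTC head plus
# adapter. vocab_size=197 matches the custom tokenizer described below;
# ctc_loss_reduction="mean" is an assumption, not the released configuration.
model = Wav2Vec2BertForCTC.from_pretrained(
    "facebook/w2v-bert-2.0",
    vocab_size=197,
    add_adapter=True,
    ctc_loss_reduction="mean",
)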

Training Data

The model was trained on the Tigre subset of Mozilla Common Voice.

  • Dataset Source: Common Voice Tigre
  • Text Pre‑processing (a sketch of these steps follows this list):
    • Removal of punctuation
    • Removal of Latin characters and other non‑Tigre symbols
    • Normalization to a clean Ethiopic character set for CTC decoding
  • Training Samples: 17,723 audio clips
  • Evaluation Samples: 1,970 audio clips
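
The cleaning steps listed above can be approximated with a few regular expressions. The exact rules used for this release are not published, so the character classes below are assumptions.

import re

# Rough approximation of the cleaning steps listed above (assumed rules).
PUNCTUATION = r"[\.\,\;\:\?\!\"\'\(\)\[\]«»፡።፣፤፥፦፧፨]"   # Latin + Ethiopic punctuation marks
NON_ETHIOPIC = r"[A-Za-z0-9]"                            # Latin letters and digits

def clean_transcript(text: str) -> str:
    text = re.sub(PUNCTUATION, "", text)      # drop punctuation
    text = re.sub(NON_ETHIOPIC, "", text)     # drop Latin characters and digits
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace for clean CTC labels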

Evaluation Results

Performance was measured using Word Error Rate (WER) on the Common Voice test split.

Metric                 Value
Final Validation WER   0.088010 (~8.8%)
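
WER can be computed with the evaluate library, as sketched below; the strings shown are placeholders, not actual model output or reference transcripts.

import evaluate

wer_metric = evaluate.load("wer")

# Placeholder strings; in practice these are decoded predictions and reference
# transcripts from the Common Voice Tigre test split.
predictions = ["ሀለሐ መሠ", "ረሰቀ"]
references  = ["ሀለሐ መሰ", "ረሰቀ"]

print(wer_metric.compute(predictions=predictions, references=references))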

Notes on Training Dynamics

Training was conducted over 10 epochs. Both loss and WER decreased steadily, with most of the improvement occurring early in training.
A divergence between training and validation loss around step 2700 indicates the onset of mild overfitting, a common behavior when working with limited‑size corpora.

Training Details

Parameter                  Value                          Notes
Base Model                 facebook/w2v-bert-2.0          Pre-trained multilingual speech model
Tokenizer                  Custom Wav2Vec2CTCTokenizer    Based on Tigre Ge’ez characters (vocab size: 197)
Processor                  Wav2Vec2BertProcessor          Combines tokenizer + SeamlessM4TFeatureExtractor
Optimizer                  AdamW                          Standard HF configuration
Learning Rate              5e-5                           Stable for adapter-based tuning
Epochs                     10                             Full fine-tuning schedule
Effective Batch Size       32                             Using gradient_accumulation_steps=16
Architecture Modification  add_adapter=True               Parameter-efficient tuning
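
The sketch below assembles a processor and training configuration consistent with the table above. "vocab.json" is a hypothetical local vocabulary file, and the per‑device batch size and fp16 flag are assumptions chosen to be consistent with the effective batch size of 32.

from transformers import (
    SeamlessM4TFeatureExtractor,
    TrainingArguments,
    Wav2Vec2BertProcessor,
    Wav2Vec2CTCTokenizer,
)

# Custom Ge'ez CTC tokenizer + SeamlessM4T feature extractor, combined into a
# single processor as in the table above. "vocab.json" is a hypothetical file
# holding the 197-character Tigre vocabulary.
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = SeamlessM4TFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
processor = Wav2Vec2BertProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Hyperparameters consistent with the table; per_device_train_batch_size=2 is
# an assumption so that 2 * gradient_accumulation_steps (16) = 32.
training_args = TrainingArguments(
    output_dir="w2v-bert-2.0-tig-asr",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    learning_rate=5e-5,
    num_train_epochs=10,
    fp16=True,  # assumption; not stated in the table
)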

Summary

The model achieves a Word Error Rate of 8.8%, which is competitive for a low‑resource ASR task using the Ge’ez script.
The results demonstrate the effectiveness of combining transfer learning with adapter‑based, parameter‑efficient fine‑tuning for building ASR systems in low‑resource linguistic settings.

Usage

🐍 Quick Inference with Hugging Face Pipeline

from transformers import pipeline

model_id = "BeitTigreAI/tigre-asr-Wav2Vec2Bert"

asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    device=0,  # GPU index; pass device="cpu" to run on CPU
)

result = asr_pipeline("audio_path.wav")  # path to a local audio file
print(result["text"])
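
For more control than the pipeline offers, the processor and model can also be loaded directly, as sketched below. The use of librosa for loading and resampling is an assumption; any 16 kHz mono waveform array works.

import torch
import librosa
from transformers import AutoModelForCTC, AutoProcessor

model_id = "BeitTigreAI/tigre-asr-Wav2Vec2Bert"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCTC.from_pretrained(model_id)

# Load and resample the clip to 16 kHz, the rate the feature extractor expects.
speech, _ = librosa.load("audio_path.wav", sr=16_000)
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])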