Overview

This repository presents a comprehensive and publicly accessible collection of Natural Language Processing (NLP) resources for the Tigre language, an under‑resourced South Semitic language within the Afro‑Asiatic family. The release integrates both text and speech modalities and provides strong baseline models for key NLP tasks, including language modeling, automatic speech recognition (ASR), and machine translation.
All models were trained on a large‑scale Tigre corpus and are designed to support downstream research and technology development for this low‑resource language community.

w2v-bert-2.0-tig-asr: Automatic Speech Recognition for Tigre (ትግረ)

Model Description

This model is based on the Wav2Vec2‑BERT architecture (facebook/w2v‑bert‑2.0) and fine‑tuned specifically for Automatic Speech Recognition (ASR) in the Tigre language (ትግረ), spoken primarily in Eritrea and Sudan.
Tigre is written using the Ge’ez (Ethiopic) script, and the model is optimized to generate high‑quality transcriptions in this script. Fine‑tuning was performed using the Hugging Face transformers library, with the addition of an adapter layer to enhance performance under low‑resource constraints.
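
For reference, a minimal sketch of how such a checkpoint can be loaded for CTC fine‑tuning with an adapter using the transformers library is shown below; the vocabulary size matches the tokenizer described under Training Details, while the remaining keyword arguments are assumptions rather than the exact released configuration.

from transformers import Wav2Vec2BertForCTC

# Load the pre-trained encoder and attach a freshly initialised CTC head plus
# adapter. vocab_size=197 matches the custom tokenizer described below;
# ctc_loss_reduction="mean" is an assumption, not the released configuration.
model = Wav2Vec2BertForCTC.from_pretrained(
    "facebook/w2v-bert-2.0",
    vocab_size=197,
    add_adapter=True,
    ctc_loss_reduction="mean",
)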

Training Data

The model was trained on the Tigre subset of Mozilla Common Voice.

  • Dataset Source: Common Voice Tigre
  • Text Pre‑processing (a sketch of these steps follows this list):
    • Removal of punctuation
    • Removal of Latin characters and other non‑Tigre symbols
    • Normalization to a clean Ethiopic character set for CTC decoding
  • Training Samples: 17,723 audio clips
  • Evaluation Samples: 1,970 audio clips
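
The cleaning steps listed above can be approximated with a few regular expressions. The exact rules used for this release are not published, so the character classes below are assumptions.

import re

# Rough approximation of the cleaning steps listed above (assumed rules).
PUNCTUATION = r"[\.\,\;\:\?\!\"\'\(\)\[\]«»፡።፣፤፥፦፧፨]"   # Latin + Ethiopic punctuation marks
NON_ETHIOPIC = r"[A-Za-z0-9]"                            # Latin letters and digits

def clean_transcript(text: str) -> str:
    text = re.sub(PUNCTUATION, "", text)      # drop punctuation
    text = re.sub(NON_ETHIOPIC, "", text)     # drop Latin characters and digits
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace for clean CTC labels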

Evaluation Results

Performance was measured using Word Error Rate (WER) on the Common Voice test split.

Metric                 Value
Final Validation WER   0.088010 (~8.8%)
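
WER can be computed with the evaluate library, as sketched below; the strings shown are placeholders, not actual model output or reference transcripts.

import evaluate

wer_metric = evaluate.load("wer")

# Placeholder strings; in practice these are decoded predictions and reference
# transcripts from the Common Voice Tigre test split.
predictions = ["ሀለሐ መሠ", "ረሰቀ"]
references  = ["ሀለሐ መሰ", "ረሰቀ"]

print(wer_metric.compute(predictions=predictions, references=references))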

Notes on Training Dynamics

Training was conducted over 10 epochs. Both loss and WER decreased steadily, with most of the improvement occurring early in training.
A divergence between training and validation loss around step 2700 indicates the onset of mild overfitting, a common behavior when working with limited‑size corpora.

Training Details

Parameter                  Value                          Notes
Base Model                 facebook/w2v-bert-2.0          Pre-trained multilingual speech model
Tokenizer                  Custom Wav2Vec2CTCTokenizer    Based on Tigre Ge’ez characters (vocab size: 197)
Processor                  Wav2Vec2BertProcessor          Combines tokenizer + SeamlessM4TFeatureExtractor
Optimizer                  AdamW                          Standard HF configuration
Learning Rate              5e-5                           Stable for adapter-based tuning
Epochs                     10                             Full fine-tuning schedule
Effective Batch Size       32                             Using gradient_accumulation_steps=16
Architecture Modification  add_adapter=True               Parameter-efficient tuning
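
The sketch below assembles a processor and training configuration consistent with the table above. "vocab.json" is a hypothetical local vocabulary file, and the per‑device batch size and fp16 flag are assumptions chosen to be consistent with the effective batch size of 32.

from transformers import (
    SeamlessM4TFeatureExtractor,
    TrainingArguments,
    Wav2Vec2BertProcessor,
    Wav2Vec2CTCTokenizer,
)

# Custom Ge'ez CTC tokenizer + SeamlessM4T feature extractor, combined into a
# single processor as in the table above. "vocab.json" is a hypothetical file
# holding the 197-character Tigre vocabulary.
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = SeamlessM4TFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
processor = Wav2Vec2BertProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Hyperparameters consistent with the table; per_device_train_batch_size=2 is
# an assumption so that 2 * gradient_accumulation_steps (16) = 32.
training_args = TrainingArguments(
    output_dir="w2v-bert-2.0-tig-asr",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    learning_rate=5e-5,
    num_train_epochs=10,
    fp16=True,  # assumption; not stated in the table
)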

Summary

The model achieves a Word Error Rate of 8.8%, which is competitive for a low‑resource ASR task using the Ge’ez script.
The results demonstrate the effectiveness of combining transfer learning with adapter‑based, parameter‑efficient fine‑tuning for building ASR systems in low‑resource linguistic settings.

Usage

🐍 Quick Inference with Hugging Face Pipeline

from transformers import pipeline

model_id = "BeitTigreAI/tigre-asr-Wav2Vec2Bert"

asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    device=0,  # GPU index; pass device="cpu" to run on CPU
)

result = asr_pipeline("audio_path.wav")  # path to a local audio file
print(result["text"])
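
For more control than the pipeline offers, the processor and model can also be loaded directly, as sketched below. The use of librosa for loading and resampling is an assumption; any 16 kHz mono waveform array works.

import torch
import librosa
from transformers import AutoModelForCTC, AutoProcessor

model_id = "BeitTigreAI/tigre-asr-Wav2Vec2Bert"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCTC.from_pretrained(model_id)

# Load and resample the clip to 16 kHz, the rate the feature extractor expects.
speech, _ = librosa.load("audio_path.wav", sr=16_000)
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])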