
mmBERT Arabic Quality Classifier

A text quality classifier for Arabic pretraining data, fine-tuned from mmBERT-small. It was used to create AraMix-HQ.

This model implements the FineWeb2-HQ approach (Messmer et al., 2025) but uses mmBERT as the encoder for improved Arabic understanding.
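
As a rough illustration of how a quality score is typically turned into a filtered dataset, the sketch below ranks documents by score and keeps a top fraction. This is an assumption about the general shape of score-based filtering, not a description of how AraMix-HQ was actually built: the helper name `keep_top_fraction` is hypothetical, and the default fraction of 0.1 is a placeholder rather than a value stated in this card.

```python
import math
from typing import List, Sequence, Tuple


def keep_top_fraction(
    scored_docs: Sequence[Tuple[str, float]],
    fraction: float = 0.1,  # placeholder value, not taken from this model card
) -> List[str]:
    """Return the documents whose scores fall in the top `fraction`.

    `scored_docs` is a sequence of (text, quality_score) pairs, where a
    higher score is assumed to mean higher quality.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    # Rank from highest to lowest score.
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)

    # Round up so that a non-empty input always keeps at least one document.
    n_keep = math.ceil(len(ranked) * fraction)
    return [text for text, _ in ranked[:n_keep]]
```

A fixed score threshold is a common alternative to a top fraction; which of the two matches the original pipeline should be checked against the paper cited below.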

Usage

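from transformers import pipeline

# Load the classifier from the Hugging Face Hub.
classifier = pipeline("text-classification", model="AdaMLLab/mmBERT-Arabic-Quality-Classifier")
# The input is an Arabic placeholder meaning "Arabic text here".
result = classifier("النص العربي هنا")
print(result)  # a list of dicts with "label" and "score" keys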
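
For filtering more than one document, a small wrapper around the classifier can batch the inputs and keep only texts predicted as high quality. The sketch below is a minimal example under stated assumptions: `filter_by_quality` is a hypothetical helper, the label name `"LABEL_1"` and the threshold of 0.5 are placeholders (the actual label names should be checked in the model's `id2label` config), and the classifier is assumed to return one `{"label", "score"}` dict per input text, as the `text-classification` pipeline does when given a list.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def filter_by_quality(
    texts: Iterable[str],
    classifier: Callable[[List[str]], List[Dict]],
    positive_label: str = "LABEL_1",  # assumption: verify against the model's id2label
    threshold: float = 0.5,           # placeholder cutoff, not from the model card
    batch_size: int = 32,
) -> List[Tuple[str, float]]:
    """Return (text, score) pairs for texts predicted as high quality."""
    texts = list(texts)
    kept: List[Tuple[str, float]] = []

    # Process the texts in fixed-size batches.
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        predictions = classifier(batch)  # one {"label", "score"} dict per text

        for text, pred in zip(batch, predictions):
            if pred["label"] == positive_label and pred["score"] >= threshold:
                kept.append((text, pred["score"]))

    return kept
```

Passing the `classifier` object created above as the second argument should work as-is. Documents longer than the model's maximum sequence length may need truncation enabled when calling the pipeline.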

Citation

@misc{alrashed2025mixminmatch,
      title={Mix, MinHash, and Match: Cross-Source Agreement for Multilingual Pretraining Datasets}, 
      author={Sultan Alrashed and Francesco Orabona},
      year={2025},
      eprint={2512.18834v2},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.18834v2}, 
}

Model size: 0.1B params (Safetensors, tensor type F32)