---
language:
- ar
- fr
license: mit
pipeline_tag: text-classification
tags:
- misinformation-detection
- fake-news
- text-classification
- algerian-darija
- arabic
- xlm-roberta
base_model: xlm-roberta-large
---

# DziriBERT: Algerian Darija Misinformation Detection

**DziriBERT** is a fine-tuned **XLM-RoBERTa-large** model for detecting misinformation in **Algerian Darija** text from social media and news.

- **Base model**: `xlm-roberta-large` (about 560M parameters)
- **Task**: Multi-class text classification (5 classes)
- **Classes**:
  - **F**: Fake
  - **R**: Real
  - **N**: Non-new
  - **M**: Misleading
  - **S**: Satire

---

## Performance (Test set: 3,344 samples)

- **Accuracy**: 78.32%
- **Macro F1**: 68.22%
- **Weighted F1**: 78.43%

**Per-class F1**:
- Fake (F): 85.04%
- Real (R): 80.44%
- Non-new (N): 83.23%
- Misleading (M): 64.57%
- Satire (S): 27.83%
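
Macro F1 is the unweighted mean of the per-class F1 scores, which is why the weak Satire score pulls it well below accuracy and weighted F1. As a quick consistency check, the short sketch below recomputes it from the figures reported above (the variable names are illustrative only). Weighted F1 cannot be recomputed the same way, because the per-class sample counts are not listed here.

```python
# Per-class F1 scores (in percent) as reported in this model card.
per_class_f1 = {
    "F": 85.04,
    "R": 80.44,
    "N": 83.23,
    "M": 64.57,
    "S": 27.83,
}

# Macro F1 gives every class the same weight, regardless of its size.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(f"Macro F1: {macro_f1:.2f}%")  # prints "Macro F1: 68.22%"
```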

---

## Training Summary

- **Max sequence length**: 128 tokens
- **Epochs**: 3 (with early stopping)
- **Batch size**: 8 (effective 16 with gradient accumulation)
- **Learning rate**: 1e-5
- **Loss**: Weighted cross-entropy
- **Data augmentation**: Applied to minority classes (M, S)
- **Seed**: 42
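
This card does not state the class weights or the gradient accumulation setting, so the following is only a sketch of how those two items typically work, not the actual training code. It assumes inverse-frequency class weights derived from hypothetical class counts, and an accumulation factor of 2 inferred from 8 x 2 = 16. The weighted mean it computes is the same quantity that PyTorch's `torch.nn.CrossEntropyLoss(weight=...)` returns with its default `reduction="mean"`.

```python
import math

# HYPOTHETICAL class counts, for illustration only (not from this model card).
class_counts = {"F": 4000, "R": 3500, "N": 2500, "M": 800, "S": 200}
labels = list(class_counts)

# Inverse-frequency weights: total / (num_classes * count).
# Rare classes such as Satire receive larger weights.
total = sum(class_counts.values())
class_weights = [total / (len(labels) * class_counts[c]) for c in labels]


def weighted_cross_entropy(logits, targets, weights):
    """Weighted mean of per-sample cross-entropy losses.

    logits: list of per-sample score lists; targets: list of class indices.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for scores, target in zip(logits, targets):
        # Numerically stable log-sum-exp over the class scores.
        max_score = max(scores)
        log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        nll = log_norm - scores[target]  # negative log-likelihood of the target
        weighted_sum += weights[target] * nll
        weight_total += weights[target]
    return weighted_sum / weight_total


# Effective batch size = per-device batch size x accumulation steps.
per_device_batch_size = 8
gradient_accumulation_steps = 2  # assumed: inferred from 8 x 2 = 16
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

# Two toy samples: one Fake (index 0) and one Satire (index 4).
toy_logits = [[2.0, 0.5, 0.1, 0.0, -1.0], [0.3, 0.2, 0.1, 0.0, 0.4]]
toy_targets = [0, 4]
loss = weighted_cross_entropy(toy_logits, toy_targets, class_weights)
print(f"class weights: {[round(w, 2) for w in class_weights]}")
print(f"weighted loss: {loss:.4f}, effective batch size: {effective_batch_size}")
```

With weights like these, an error on a Satire sample contributes far more to the loss than an error on a Fake sample, which is the usual motivation for weighting the loss on an imbalanced dataset.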

---

## Strengths & Limitations

**Strengths**
- Strong performance on the Fake, Real, and Non-new classes (F1 between 80% and 85%)
- Handles Darija, Arabic, and French code-switching well

**Limitations**
- Low F1 on Satire (27.83%), attributed to the limited number of Satire samples
- The Misleading class remains challenging (64.57% F1)

---

## Usage

```python
import os

# Set the backend flags before importing transformers,
# which reads them at import time.
os.environ["USE_TF"] = "0"
os.environ["USE_TORCH"] = "1"

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "Rahilgh/model4_2"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).to(DEVICE)
model.eval()  # disable dropout for inference

# Map class indices to label codes, and label codes to readable names.
LABEL_MAP = {0: "F", 1: "R", 2: "N", 3: "M", 4: "S"}
LABEL_NAMES = {
    "F": "Fake",
    "R": "Real",
    "N": "Non-new",
    "M": "Misleading",
    "S": "Satire",
}

texts = [
    "الجزائر فازت ببطولة امم افريقيا 2019",
    "صورة زعيم عالمي يرتدي ملابس غريبة تثير السخرية",
]

for text in texts:
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=128,  # same limit as used in training
        truncation=True,
        padding=True,
    ).to(DEVICE)

    with torch.no_grad():
        outputs = model(**inputs)

    # Convert logits to probabilities and pick the most likely class.
    probs = torch.softmax(outputs.logits, dim=1)
    pred_id = probs.argmax(dim=1).item()
    confidence = probs[0, pred_id].item()

    label = LABEL_MAP[pred_id]
    print(f"Text: {text}")
    print(f"Prediction: {LABEL_NAMES[label]} ({label}), confidence {confidence:.2%}")
```