	---
	language:
	- ar
	- fr
	license: mit
	pipeline_tag: text-classification
	tags:
	- misinformation-detection
	- fake-news
	- text-classification
	- algerian-darija
	- arabic
	- xlm-roberta

	base_model: xlm-roberta-large
	---

	# DziriBERT — Algerian Darija Misinformation Detection

	DziriBERT is a fine-tuned XLM-RoBERTa-large model for detecting misinformation in Algerian Darija text from social media and news.

	- Base model: `xlm-roberta-large` (about 560M parameters)
	- Task: Multi-class text classification (5 classes)
	- Classes:
	  - F: Fake
	  - R: Real
	  - N: Non-new
	  - M: Misleading
	  - S: Satire

	---

	## Performance (Test set: 3,344 samples)

	- Accuracy: 78.32%
	- Macro F1: 68.22%
	- Weighted F1: 78.43%

	Per-class F1:
	- Fake (F): 85.04%
	- Real (R): 80.44%
	- Non-new (N): 83.23%
	- Misleading (M): 64.57%
	- Satire (S): 27.83%
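
As a quick sanity check, the reported macro F1 is the unweighted mean of the five per-class F1 scores listed above. The short sketch below reproduces it from those numbers. The weighted F1 cannot be reproduced the same way, because the card does not report per-class sample counts.

```python
# Per-class F1 scores (in percent), copied from the list above.
per_class_f1 = {
    "Fake": 85.04,
    "Real": 80.44,
    "Non-new": 83.23,
    "Misleading": 64.57,
    "Satire": 27.83,
}

# Macro F1 gives every class equal weight, regardless of how many samples it has.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
weakest = min(per_class_f1, key=per_class_f1.get)

print(f"Macro F1: {macro_f1:.2f}%")
print(f"Weakest class: {weakest}")
```

Because macro averaging treats all classes equally, the low Satire score pulls the macro F1 well below the accuracy and weighted F1.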

	---

	## Training Summary

	- Max sequence length: 128
	- Epochs: 3 (early stopping)
	- Batch size: 8 (effective 16 with gradient accumulation)
	- Learning rate: 1e-5
	- Loss: Weighted CrossEntropy
	- Data augmentation: Applied to minority classes (M, S)
	- Seed: 42
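
The card does not say how the class weights for the weighted cross-entropy loss were chosen. One common option is inverse-frequency weighting, sketched below with PyTorch's `nn.CrossEntropyLoss`. The class counts are hypothetical placeholders, not the actual training distribution, and the weighting formula is an assumption rather than the author's confirmed method.

```python
import torch
import torch.nn as nn

# HYPOTHETICAL per-class training counts in the order F, R, N, M, S.
# Illustrative only: the model card does not report the real counts.
class_counts = torch.tensor([4000.0, 3500.0, 3000.0, 1200.0, 300.0])

# Inverse-frequency weights (an assumed scheme): rarer classes get larger weights.
# With perfectly balanced classes, every weight would be 1.0.
class_weights = class_counts.sum() / (len(class_counts) * class_counts)

loss_fn = nn.CrossEntropyLoss(weight=class_weights)

# Dummy batch of 8 examples with 5 logits each, matching the batch size
# and number of classes listed above.
torch.manual_seed(42)
logits = torch.randn(8, 5)
labels = torch.tensor([0, 1, 2, 3, 4, 0, 1, 2])

loss = loss_fn(logits, labels)
print("class weights:", [round(w, 2) for w in class_weights.tolist()])
print(f"weighted loss: {loss.item():.4f}")
```

With weights like these, a mistake on a Satire example contributes more to the loss than a mistake on a Fake example, which is the usual motivation for weighting when minority classes are underrepresented.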

	---

	## Strengths & Limitations

	Strengths
	- Strong performance on Fake, Real, and Non-new classes
	- Handles Darija, Arabic, and French code-switching well

	Limitations
	- Low performance on Satire (27.83% F1), attributed to the limited number of samples
	- The Misleading class remains challenging (64.57% F1)

	---

	## Usage

	```python
	import os

	# Select the PyTorch backend. Set these before importing transformers,
	# since the library reads them at import time.
	os.environ["USE_TF"] = "0"
	os.environ["USE_TORCH"] = "1"

	import torch
	from transformers import AutoTokenizer, AutoModelForSequenceClassification

	MODEL_ID = "Rahilgh/model4_2"
	DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

	tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=False)
	model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).to(DEVICE)
	model.eval()  # disable dropout for inference

	# Class index -> short code -> readable name
	LABEL_MAP = {0: "F", 1: "R", 2: "N", 3: "M", 4: "S"}
	LABEL_NAMES = {
	    "F": "Fake",
	    "R": "Real",
	    "N": "Non-new",
	    "M": "Misleading",
	    "S": "Satire",
	}

	texts = [
	    "الجزائر فازت ببطولة امم افريقيا 2019",
	    "صورة زعيم عالمي يرتدي ملابس غريبة تثير السخرية",
	]

	for text in texts:
	    inputs = tokenizer(
	        text,
	        return_tensors="pt",
	        max_length=128,  # same maximum sequence length as in training
	        truncation=True,
	        padding=True,
	    ).to(DEVICE)

	    with torch.no_grad():
	        outputs = model(**inputs)

	    # One text per forward pass, so probs has shape (1, 5).
	    probs = torch.softmax(outputs.logits, dim=1)
	    pred_id = probs.argmax(dim=1).item()
	    confidence = probs[0, pred_id].item()

	    label = LABEL_MAP[pred_id]
	    print(f"Text: {text}")
	    print(f"Prediction: {LABEL_NAMES[label]} ({label}) - {confidence:.2%}")
	```