---
base_model: dicta-il/dictalm2.0-instruct
library_name: peft
model_name: offensive_v5_dpo
tags:
- dpo
- lora
- transformers
- trl
- hebrew
- offensive-language-detection
- content-moderation
- explainable-ai
- reasoning
license: mit
language:
- he
pipeline_tag: text-classification
---

# Hebrew Offensive Language Detection with Reasoning (offensive_v5_dpo)

This model is a fine-tuned version of [dicta-il/dictalm2.0-instruct](https://huggingface.co/dicta-il/dictalm2.0-instruct) specialized for **detecting offensive language in Hebrew text** while providing **explainable rationales** in Hebrew.

**Model Repository:** [KevynKrancenblum/hebrew-offensive-detection](https://huggingface.co/KevynKrancenblum/hebrew-offensive-detection)

## What Does This Model Do?

This model performs **binary classification** of Hebrew text to determine whether it contains offensive language, with the added capability of **explaining its reasoning** in Hebrew. It addresses several persistent challenges in Hebrew NLP, outlined below.

### Key Capabilities

1. **Offensive Language Detection**: Classifies Hebrew text as offensive (label: 1) or non-offensive (label: 0)
2. **Explainable Predictions**: Generates Hebrew rationales explaining why text is classified as offensive or not
3. **Cultural Awareness**: Fine-tuned on Hebrew-specific offensive patterns including:
   - Cultural insults and slurs (קללות)
   - Political and ethnic hate speech (הסתה)
   - Threats and aggressive language (איומים)
   - Context-dependent offensiveness in Israeli discourse

### Performance Metrics

| Dataset | Accuracy | Precision | Recall | F1-Score |
|---------|----------|-----------|--------|----------|
| OlaH-5000 (test) | **0.85** | **0.85** | **0.85** | **0.85** |
| HeDetox (cross-domain) | **0.91** | **0.92** | **0.91** | **0.91** |

**Comparison with baselines:**
- AlephBERT (fine-tuned): 0.84 F1 (no explanations)
- heBERT (fine-tuned): 0.85 F1 (no explanations)
- GPT-5 (zero-shot): 0.77 F1 (lacks Hebrew cultural grounding)

## Quick Start

### Installation

```bash
pip install transformers torch peft bitsandbytes accelerate
```

### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Load model and tokenizer
model_name = "KevynKrancenblum/hebrew-offensive-detection"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # 4-bit quantization for efficiency
    device_map="auto"
)

# System prompt in Hebrew. English translation: "You are an expert in identifying offensive
# content in Hebrew. Analyze the following text and explain your reasoning. Based on the
# reasoning, give a label: 1 for offensive or 0 for non-offensive."
SYSTEM_PROMPT = """אתה מומחה לזיהוי תוכן פוגעני בעברית. נתח את הטקסט הבא והסבר את הנימוק שלך.
בהתבסס על הנימוק, תן תווית: 1 לפוגעני או 0 ללא פוגעני."""

# Classification function
def classify_hebrew_text(text: str) -> dict:
    prompt = f"{SYSTEM_PROMPT}\n\nטקסט: \"{text}\""

    messages = [{"role": "user", "content": prompt}]
    input_text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.2,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

    # Decode only the newly generated tokens, so the prompt (which itself
    # mentions both labels) is not picked up by the parsing below
    generated_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

    # Parse response
    label = None
    reason = None

    for line in response.split('\n'):
        stripped = line.strip()
        if 'תווית:' in stripped or 'label:' in stripped.lower():
            # Extract label (0 or 1)
            if '1' in stripped:
                label = 1
            elif '0' in stripped:
                label = 0
        elif len(stripped) > 10 and reason is None:
            # The rationale is typically the longer text that precedes the label line
            reason = stripped

    return {
        "label": label,  # 1 = offensive, 0 = non-offensive
        "reason": reason,  # Hebrew explanation
        "full_response": response
    }

# Example usage
text = "יא מטומטם, לך תמות"
result = classify_hebrew_text(text)

print(f"Label: {result['label']}")
print(f"Reason: {result['reason']}")
```
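
Since the repository ships LoRA adapter weights (`library_name: peft`), the adapters can also be attached to the base model explicitly. A minimal sketch, assuming the repository contains a standard `adapter_config.json` pointing at `dicta-il/dictalm2.0-instruct`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_name = "dicta-il/dictalm2.0-instruct"
adapter_name = "KevynKrancenblum/hebrew-offensive-detection"

# Load the 4-bit quantized base model, then attach the fine-tuned LoRA adapters
tokenizer = AutoTokenizer.from_pretrained(base_name)
base_model = AutoModelForCausalLM.from_pretrained(
    base_name,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, adapter_name)
model.eval()  # inference only
```

The `classify_hebrew_text` helper above works unchanged with a model loaded this way.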

### Example Output

**Input:** "יא מטומטם, לך תמות"

**Output:**
```
Label: 1 (Offensive)
Reason: הטקסט מכיל קללה ("מטומטם") ואיום ("לך תמות"), שניהם ביטויים פוגעניים המטרתם להשפיל ולאיים.
```

**Translation:** "The text contains an insult ('idiot') and a threat ('go die'), both offensive expressions intended to humiliate and threaten."

## Training Methodology

### Three-Stage Alignment Pipeline

This model was developed through a sophisticated **three-stage training process** combining teacher-student learning with preference optimization:

#### Stage 1: Teacher-Generated Reasoning Supervision
- **Teacher Model:** GPT-5 (gpt-5-preview)
- **Task:** Generate high-quality Hebrew rationales explaining offensive/non-offensive classifications
- **Dataset:** ~8,000 annotated samples from OlaH-5000
- **Output:** Structured reasoning corpus in Hebrew

#### Stage 2: Supervised Fine-Tuning (SFT)
- **Base Model:** DictaLM-2.0-Instruct (7B parameters, Mistral architecture)
- **Method:** Parameter-Efficient Fine-Tuning (PEFT) using QLoRA
- **Training Details** (see the configuration sketch after this list):
  - LoRA adapters: rank=256, alpha=512
  - 4-bit quantization (bitsandbytes)
  - Chain-of-thought supervision (model learns to generate rationale → label)
  - Training time: ~12 hours on RTX 4080 SUPER (16GB VRAM)
- **Results:** 74% F1 (improved neutrality handling)
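
As a rough illustration, the Stage 2 setup above corresponds to a QLoRA configuration along the following lines. This is a sketch under stated assumptions: the NF4 quantization type, the LoRA dropout value, and the Mistral-style `target_modules` are typical QLoRA choices, not values reported in this card.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit quantization of the frozen DictaLM-2.0 base model (QLoRA);
# NF4 + bfloat16 compute are standard QLoRA settings and are assumed here
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapters with the rank/alpha reported above (r=256, alpha=512);
# dropout and target_modules are assumptions, not taken from the training run
lora_config = LoraConfig(
    r=256,
    lora_alpha=512,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```

These objects would then be passed to `AutoModelForCausalLM.from_pretrained(..., quantization_config=bnb_config)` and to TRL's `SFTTrainer` via `peft_config=lora_config` for the chain-of-thought supervised stage.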

#### Stage 3: Direct Preference Optimization (DPO)
- **Method:** Iterative DPO alignment without reward model
- **Preference Pairs** (see the data-format sketch below):
  - **Chosen:** GPT-5 teacher rationale (correct label + explanation)
  - **Rejected:** GPT-5-mini rationale (incorrect label + plausible but wrong explanation)
- **Three Iterations:**
  - Round 1: 80% F1 (balanced precision-recall)
  - Round 2: 82% F1 (refined calibration)
  - **Round 3 (this model): 85% F1** (optimal performance, stable explanations)
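
The preference data for this stage uses TRL's standard `prompt` / `chosen` / `rejected` columns. A minimal, purely illustrative sketch (the Hebrew strings below are short paraphrases, not actual corpus entries):

```python
from datasets import Dataset

# Illustrative preference pair: the chosen completion carries the teacher's
# correct label and rationale, the rejected one a plausible but wrong rationale.
# chosen   ≈ 'The text contains an insult ("idiot") and a threat ("go die"). Label: 1'
# rejected ≈ 'The text only expresses criticism and is not offensive. Label: 0'
preference_pairs = [
    {
        "prompt": 'טקסט: "יא מטומטם, לך תמות"',
        "chosen": 'הטקסט מכיל קללה ("מטומטם") ואיום ("לך תמות"). תווית: 1',
        "rejected": 'הטקסט מביע ביקורת בלבד ואינו פוגעני. תווית: 0',
    },
]

dpo_dataset = Dataset.from_list(preference_pairs)  # consumed by DPOTrainer below
```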

### Why DPO?

Direct Preference Optimization was chosen over traditional RLHF/PPO because:
- ✅ No separate reward model required
- ✅ Computationally efficient (trainable on consumer GPUs)
- ✅ Single-stage optimization
- ✅ Comparable or superior performance to full RLHF
- ✅ More stable training dynamics

### Training Configuration

**Hardware:**
- Single NVIDIA RTX 4080 SUPER (16GB VRAM)
- Total training time: ~32 hours (all stages)

**Hyperparameters** (see the configuration sketch at the end of this section):
- Epochs: 50 (SFT), 3 (DPO iterations)
- Batch size: 2 per device, gradient accumulation: 16 (effective batch = 32)
- Learning rate: 2×10⁻⁵ (linear warmup)
- Max sequence length: 512 tokens
- Precision: bfloat16
- Optimizer: AdamW

**Memory Optimization:**
- QLoRA reduces memory from ~28GB (FP16) to <7GB (4-bit)
- Gradient checkpointing enabled
- LoRA adapters: ~67M trainable parameters (~0.96% of base model)
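
Combining the hyperparameters and memory settings above, one DPO round corresponds roughly to the following TRL sketch. `sft_model`, `tokenizer`, and `dpo_dataset` are placeholder names for the Stage 2 model, its tokenizer, and the preference dataset; `beta` and the warmup ratio are assumptions not reported in this card.

```python
from trl import DPOConfig, DPOTrainer

training_args = DPOConfig(
    output_dir="offensive_v5_dpo",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,  # effective batch size = 32
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.05,               # assumed; the card only states "linear warmup"
    bf16=True,
    max_length=512,
    gradient_checkpointing=True,
    beta=0.1,                        # assumed DPO temperature
)

trainer = DPOTrainer(
    model=sft_model,            # Stage 2 model, already carrying the LoRA adapters
    args=training_args,
    train_dataset=dpo_dataset,  # prompt / chosen / rejected pairs as sketched above
    processing_class=tokenizer,
)
trainer.train()
```

When the policy is a PEFT model and no explicit `ref_model` is passed, TRL derives the reference log-probabilities by temporarily disabling the adapters, which helps keep the whole stage within a single 16GB GPU.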

## Use Cases

This model is designed for:

1. **Content Moderation**: Automated detection of offensive content in Hebrew social media, forums, and comment sections
2. **Educational Tools**: Teaching about offensive language patterns with explainable feedback
3. **Research**: Studying Hebrew offensive language and cultural hate speech patterns
4. **Compliance**: Helping platforms enforce community guidelines in Hebrew

## Datasets Used

- **OlaH-5000**: Primary training dataset for Hebrew offensive language
- **HeDetox**: Cross-domain evaluation dataset for Hebrew text detoxification

## Limitations

- **Slang and Youth Language**: May struggle with emerging slang, metaphorical insults, or internet-specific Hebrew
- **Spelling Variations**: Performance degrades with unconventional spellings or corrupted text
- **Domain Specificity**: Optimized for social media text (Twitter/Facebook style)
- **Cultural Subjectivity**: Inherits biases from training data annotations
- **Context Length**: Limited to 512 tokens (may miss context in very long texts)

## Ethical Considerations

⚠️ **Important:** This model reflects cultural and contextual interpretations of offensiveness in Israeli Hebrew discourse. Classifications should be:
- Used as **decision support**, not sole determinant
- Combined with **human review** for sensitive moderation decisions
- Regularly evaluated for **bias and fairness**
- Contextualized to specific use cases and communities

## Training Procedure

This model was trained with **Direct Preference Optimization (DPO)**, a method introduced in [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://huggingface.co/papers/2305.18290).

[<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/kevynkrancenblum-sami-shamoon/huggingface/runs/ep1pizjj)

### Framework Versions

- PEFT: 0.17.0
- TRL: 0.21.0
- Transformers: 4.55.2
- PyTorch: 2.6.0+cu124
- Datasets: 4.0.0
- Tokenizers: 0.21.4
- bitsandbytes (used for 4-bit quantization; version not pinned)
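
To reproduce this environment, the pinned versions above can be installed in one step; install PyTorch separately for your CUDA version, and note that the bitsandbytes version is not pinned in this card:

```bash
pip install "transformers==4.55.2" "trl==0.21.0" "peft==0.17.0" "datasets==4.0.0" "tokenizers==0.21.4" accelerate bitsandbytes
```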

## Repository and Resources

- **GitHub Repository:** [KevynKrancenblum/hebrew-offensive-detection](https://github.com/KevynKrancenblum/hebrew-offensive-detection)
- **Interactive Demo:** Streamlit web interface included in repository
- **Documentation:** Comprehensive README with usage examples

## Citation

If you use this model in your research, please cite:

```bibtex
@mastersthesis{krancenblum2025hebrew,
  title={Developing Reasoning-Augmented Language Models for Hebrew Offensive Language Detection},
  author={Krancenblum, Kevyn},
  year={2025},
  school={Sami Shamoon College of Engineering},
  note={Model: https://huggingface.co/KevynKrancenblum/hebrew-offensive-detection}
}
```

### Cite DPO Method

```bibtex
@inproceedings{rafailov2023direct,
    title        = {{Direct Preference Optimization: Your Language Model is Secretly a Reward Model}},
    author       = {Rafael Rafailov and Archit Sharma and Eric Mitchell and Christopher D. Manning and Stefano Ermon and Chelsea Finn},
    year         = 2023,
    booktitle    = {Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023},
    url          = {http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html}
}
```

### Cite TRL Framework

```bibtex
@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}
```

## License

MIT License - See LICENSE file for details

## Acknowledgments

- **Dicta Research Center** for DictaLM-2.0-Instruct base model
- **OpenAI** for GPT-5 teacher supervision
- **Hugging Face** for model hosting and transformers library
- **OlaH-5000** and **HeDetox** dataset creators
- **TRL Team** for Direct Preference Optimization implementation