---
base_model: unsloth/llama-3-8b-Instruct-bnb-4bit
license: apache-2.0
language:
- en
tags:
- llama.cpp
- gguf
- quantized
- q4_k_m
- q8_0
- named-entity-recognition
- ner
- news-analysis
- indian-news
- conflibert
---

# ConflLlama-NER-TOI: Large-Scale Named Entity Recognition for Indian News

<p align="center">
  <img src="images/logo.png" alt="ConflLlama-NER Logo" width="300"/>
</p>

---

## ⚠️ Important: Read Before Using

**This model requires exact prompt formatting to work correctly.** Please read the [Critical: Inference & Prompt Formatting](#critical-inference--prompt-formatting) section below before attempting to use this model. Using incorrect prompts will result in poor performance or hallucinations.

---

**ConflLlama-NER-TOI** is a large-scale named entity recognition model fine-tuned on **100,000 Indian news articles** from the Times of India. Built on **Llama-3 8B Instruct** (`unsloth/llama-3-8b-Instruct-bnb-4bit`), it identifies and classifies four entity types across diverse news domains:

- **Location**: Geographic entities (cities, regions, countries)
- **Organisation**: Companies, government bodies, institutions, political parties
- **Person**: Named individuals, public figures, officials
- **Temporal**: Time expressions, dates, periods

This model is a substantial scale-up from the original ConflLlama-NER: it draws on 100,000 articles where the original used 1,094 sentences, with the goal of broader domain coverage and better generalization.

---

## Key Features

- **Large-Scale Training**: Fine-tuned on 100,000 news articles with high-confidence entity annotations
- **Multi-Domain Coverage**: Trained on diverse news topics including politics, sports, crime, economy, and entertainment
- **ConfliBERT Annotations**: Training labels were generated automatically with the ConfliBERT NER model (`eventdata-utd/conflibert-named-entity-recognition`)
- **High-Quality Filtering**: Only entities with ≥0.9 confidence score included in training
- **JSON Output Format**: Returns structured entity lists with text, type, position, and confidence scores
- **Efficient Deployment**: Available as GGUF quantizations (Q4_K_M, Q8_0) and in 16-bit precision (BF16)
- **Instruction-Tuned**: Built on Llama-3 8B Instruct for robust prompt following

---

## Training Data

### Dataset: Times of India Corpus with ConfliBERT Annotations

The model was trained on a large-scale dataset derived from Indian news articles:

- **Total Articles**: ~1.5 million Times of India articles
- **Sampled Articles**: 100,000 (randomly sampled from the corpus)
- **Training/Test Split**: 95,000 train / 5,000 test (95/5 split)
- **Text Processing**: Articles truncated to 510 tokens (~400-500 words) to fit context limits
- **Annotation Method**: Automated using `eventdata-utd/conflibert-named-entity-recognition`
- **Quality Filter**: Only entities with confidence score ≥0.9 included
- **Total Entities**: Hundreds of thousands of high-confidence annotations
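
As a rough illustration of the sampling, split, and confidence filter described above, the sketch below uses only the standard library. The function names, the fixed seed, and the in-memory record layout are assumptions made for this example; the actual preprocessing scripts are not part of this card.

```python
import random

MIN_SCORE = 0.9  # confidence threshold stated in this card


def filter_entities(entities, min_score=MIN_SCORE):
    """Keep only annotations whose confidence score meets the threshold."""
    return [e for e in entities if e["score"] >= min_score]


def sample_and_split(article_ids, n_samples, test_fraction=0.05, seed=0):
    """Randomly sample articles, then split them into train and test sets.

    The seed is a hypothetical choice so that this sketch is reproducible.
    """
    rng = random.Random(seed)
    sampled = rng.sample(list(article_ids), n_samples)
    n_test = int(round(n_samples * test_fraction))
    return sampled[n_test:], sampled[:n_test]


# Toy annotations using the same field layout as the model's JSON output
annotations = [
    {"text": "Mumbai", "type": "Location", "start": 0, "end": 6, "score": 0.99},
    {"text": "evening", "type": "Temporal", "start": 10, "end": 17, "score": 0.62},
]
kept = filter_entities(annotations)
train_ids, test_ids = sample_and_split(range(1000), n_samples=100)
print(len(kept), len(train_ids), len(test_ids))
```

With `n_samples=100_000` the same 5% test fraction yields the 95,000 / 5,000 split listed above.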

### Entity Type Definitions

| Entity Type | Description | Examples |
|-------------|-------------|----------|
| **Location** | Geographic entities including cities, states, countries, regions | "Mumbai", "Punjab", "India", "United States" |
| **Organisation** | Companies, government bodies, institutions, political parties, agencies | "Congress", "Reserve Bank of India", "Google", "Supreme Court" |
| **Person** | Named individuals, politicians, celebrities, officials | "Narendra Modi", "Virat Kohli", "Amit Shah" |
| **Temporal** | Time expressions, dates, periods, durations | "Monday", "2023", "last week", "evening" |

### Data Coverage

The training data spans multiple news domains:
- **Politics**: Government actions, elections, policy announcements
- **Crime**: Police reports, legal proceedings, incidents
- **Sports**: Matches, players, tournaments
- **Economy**: Business news, market reports, financial decisions
- **Entertainment**: Celebrity news, film industry, cultural events
- **Local News**: City-level events, regional developments

---

## Model Architecture

- **Base Model**: `unsloth/llama-3-8b-Instruct-bnb-4bit`
- **Fine-tuning Method**: QLoRA (Quantized Low-Rank Adaptation)
- **Quantization**: 4-bit with bitsandbytes
- **Maximum Sequence Length**: 2048 tokens
- **LoRA Configuration**:
  - Rank (r): 16
  - Alpha (lora_alpha): 16
  - Target Modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
  - Dropout: 0
  - Gradient Checkpointing: Enabled (Unsloth optimization)
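
For a sense of scale, the number of trainable LoRA parameters implied by this configuration can be worked out by hand. The sketch below assumes the publicly documented Llama-3 8B dimensions (hidden size 4096, MLP size 14336, 32 layers, 8 key/value heads of dimension 128); those values are not stated in this card and should be checked against the base model's config.

```python
# Assumed Llama-3 8B dimensions (from the public model config, not this card)
HIDDEN = 4096
INTERMEDIATE = 14336
NUM_LAYERS = 32
KV_DIM = 8 * 128  # 8 key/value heads x head dimension 128
RANK = 16         # LoRA rank listed above

# (input features, output features) of each adapted projection
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank=RANK):
    """A LoRA pair adds a (rank x in) matrix and an (out x rank) matrix."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} total")
```

Under these assumptions the adapters add roughly 42 million trainable parameters, a small fraction of the 8B base weights.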

### Training Configuration

- **Optimizer**: AdamW 8-bit (memory efficient)
- **Learning Rate**: 2e-4 with 3% warmup (150 steps)
- **Batch Size**: 2 per device with 4 gradient accumulation steps (effective batch size: 8)
- **Training Steps**: 5,000 steps (~42% of one epoch)
- **Precision**: BFloat16 (when supported)
- **Hardware**: NVIDIA A100-SXM4-40GB GPU on NCSA Delta
- **Training Time**: ~3-4 hours
- **Memory Footprint**: ~8 GB VRAM
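
The step counts above follow from simple arithmetic. The short sketch below reproduces them, assuming a single GPU (so the effective batch size is just per-device batch times accumulation steps) and the 95,000-article training split; the variable names are illustrative only.

```python
TRAIN_SAMPLES = 95_000
PER_DEVICE_BATCH = 2
GRAD_ACCUM_STEPS = 4
TRAINING_STEPS = 5_000
WARMUP_RATIO = 0.03

# Single-GPU assumption: no extra factor for data parallelism
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS
steps_per_epoch = TRAIN_SAMPLES / effective_batch
samples_seen = TRAINING_STEPS * effective_batch
epoch_fraction = TRAINING_STEPS / steps_per_epoch
warmup_steps = round(TRAINING_STEPS * WARMUP_RATIO)

print(f"effective batch size: {effective_batch}")
print(f"steps per full epoch: {steps_per_epoch:.0f}")
print(f"samples seen:         {samples_seen}")
print(f"fraction of an epoch: {epoch_fraction:.1%}")
print(f"warmup steps:         {warmup_steps}")
```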

### Training Dynamics

- **Initial Loss**: ~2.5
- **Final Loss**: ~0.90 (plateau observed)
- **Gradient Norm**: Stable at 0.25-0.35 throughout training
- **Learning Rate Schedule**: Linear decay from 2e-4 to 0
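
The schedule can be sketched as a small function. This assumes the warmup ramps linearly from 0 to the peak over the first 150 steps, followed by the linear decay to 0 at step 5,000 described above; the linear warmup shape is an assumption, since the card only states the warmup length.

```python
PEAK_LR = 2e-4
WARMUP_STEPS = 150
TOTAL_STEPS = 5_000


def learning_rate(step):
    """Linear warmup to PEAK_LR, then linear decay to 0 at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 75, 150, 2_575, 5_000):
    print(f"step {step:>5}: lr = {learning_rate(step):.2e}")
```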

---

## Critical: Inference & Prompt Formatting

**IMPORTANT**: This model requires **exact prompt formatting** to function correctly. The model was instruction-tuned with a specific template. Deviating from this format will result in poor performance or hallucinations.

### Option 1: Using LM Studio / Llama.cpp (Recommended)

**System Prompt:**
```
Extract all named entities from the following text. Return them as a JSON list with 'text', 'type', 'start', 'end', and 'score' fields.
```

**User Message:**
```
Text: [YOUR ARTICLE TEXT HERE]
Entities: 
```
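
If you call the model through a chat-style API instead of typing into a UI, the same two messages can be built programmatically. The sketch below is a minimal helper under that assumption; `build_messages` is a hypothetical name, and the role/content dictionary layout is the common chat-completions convention rather than anything specific to this model.

```python
SYSTEM_PROMPT = (
    "Extract all named entities from the following text. Return them as a JSON "
    "list with 'text', 'type', 'start', 'end', and 'score' fields."
)


def build_messages(article_text):
    """Return the system and user messages in the format shown above."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Text: {article_text}\nEntities: "},
    ]


messages = build_messages("Chief Minister Amarinder Singh spoke in Chandigarh.")
for message in messages:
    print(message["role"], "->", repr(message["content"]))
```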

### Option 2: Direct API/Python Integration

If your platform requires a single concatenated prompt:

```python
prompt = """<|start_header_id|>system<|end_header_id|>

Extract all named entities from the following text. Return them as a JSON list with 'text', 'type', 'start', 'end', and 'score' fields.<|eot_id|><|start_header_id|>user<|end_header_id|>

Text: {your_text_here}
Entities: <|eot_id|><|start_header_id|>assistant<|end_header_id|>"""
```
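
To avoid retyping the template, it can be wrapped in a small helper. This is a sketch only: `build_prompt` and `PROMPT_TEMPLATE` are hypothetical names, and the string is intended to be character-for-character identical to the template above.

```python
SYSTEM_PROMPT = (
    "Extract all named entities from the following text. Return them as a JSON "
    "list with 'text', 'type', 'start', 'end', and 'score' fields."
)

PROMPT_TEMPLATE = (
    "<|start_header_id|>system<|end_header_id|>\n\n"
    "{system}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    "Text: {text}\nEntities: <|eot_id|><|start_header_id|>assistant<|end_header_id|>"
)


def build_prompt(article_text):
    """Fill the single-string prompt; braces inside the article are kept as-is."""
    return PROMPT_TEMPLATE.format(system=SYSTEM_PROMPT, text=article_text)


print(build_prompt("The Reserve Bank of India announced new policies in Mumbai."))
```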

### Example Usage

**Input:**
```
Text: The Punjab government has placed under suspension two officials following the Bathinda incident. Chief Minister Amarinder Singh announced the decision in Chandigarh on Monday.
Entities:
```

**Expected Output:**
```json
[
  {"text": "Punjab government", "type": "Organisation", "start": 4, "end": 21, "score": 0.995},
  {"text": "Bathinda", "type": "Location", "start": 78, "end": 86, "score": 0.980},
  {"text": "Amarinder Singh", "type": "Person", "start": 112, "end": 127, "score": 0.998},
  {"text": "Chandigarh", "type": "Location", "start": 154, "end": 164, "score": 0.992},
  {"text": "Monday", "type": "Temporal", "start": 168, "end": 174, "score": 0.975}
]
```
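
Because `start` and `end` are generated by a language model, it is worth checking them against the input before relying on them. The sketch below assumes 0-based character offsets into the article text with an exclusive `end`, which matches the first entity in the example (characters 4 to 21). The helper names are hypothetical, and `realign` only finds the first occurrence of an entity string, so repeated mentions need extra care.

```python
def check_offsets(text, entities):
    """Split entities into those whose offsets match the text and those that do not."""
    valid, mismatched = [], []
    for entity in entities:
        span = text[entity["start"]:entity["end"]]
        (valid if span == entity["text"] else mismatched).append(entity)
    return valid, mismatched


def realign(text, entity):
    """Return a copy with offsets of the first occurrence, or None if absent."""
    start = text.find(entity["text"])
    if start == -1:
        return None
    return {**entity, "start": start, "end": start + len(entity["text"])}


text = "The Reserve Bank of India announced new policies in Mumbai on Tuesday."
entities = [
    {"text": "Reserve Bank of India", "type": "Organisation", "start": 4, "end": 25, "score": 0.99},
    # Deliberately wrong offsets to show the repair step
    {"text": "Mumbai", "type": "Location", "start": 50, "end": 56, "score": 0.98},
]

valid, mismatched = check_offsets(text, entities)
repaired = [realign(text, e) for e in mismatched]
print(len(valid), "valid;", len(mismatched), "mismatched")
print(repaired)
```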

---

## Installation & Usage

### Using Llama.cpp (Recommended for Local Deployment)

```bash
# Download the Q4_K_M GGUF model
wget https://huggingface.co/shreyasmeher/ConflLlama-NER-TOI-GGUF/resolve/main/model-unsloth-Q4_K_M.gguf

# Run with llama.cpp
./llama-cli -m model-unsloth-Q4_K_M.gguf \
  --system-prompt "Extract all named entities from the following text. Return them as a JSON list with 'text', 'type', 'start', 'end', and 'score' fields." \
  --prompt "Text: [YOUR TEXT]\nEntities: " \
  --temp 0.3 \
  --n-predict 512
```

### Using Transformers (Python)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import json

# Load model and tokenizer
model_name = "shreyasmeher/ConflLlama-NER-TOI"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Prepare input with correct formatting
text = "The Reserve Bank of India announced new policies in Mumbai on Tuesday."
prompt = f"""<|start_header_id|>system<|end_header_id|>

Extract all named entities from the following text. Return them as a JSON list with 'text', 'type', 'start', 'end', and 'score' fields.<|eot_id|><|start_header_id|>user<|end_header_id|>

Text: {text}
Entities: <|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

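# Generate (greedy decoding, so no temperature is needed)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id
)

# Decode only the newly generated tokens, not the echoed prompt
prompt_length = inputs["input_ids"].shape[1]
result = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
# Parse JSON output (raises json.JSONDecodeError if the output is malformed)
entities = json.loads(result.strip())
print(entities)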
```

### Using Unsloth (For Further Fine-tuning)

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="shreyasmeher/ConflLlama-NER-TOI",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

# Enable inference mode
FastLanguageModel.for_inference(model)

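# Build the prompt in the exact training format
text = "The Reserve Bank of India announced new policies in Mumbai on Tuesday."
prompt = f"""<|start_header_id|>system<|end_header_id|>

Extract all named entities from the following text. Return them as a JSON list with 'text', 'type', 'start', 'end', and 'score' fields.<|eot_id|><|start_header_id|>user<|end_header_id|>

Text: {text}
Entities: <|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

# Generate (sampling is enabled explicitly so the temperature takes effect)
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.3)
result = tokenizer.batch_decode(outputs)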
```

---

## Intended Use

This model is designed for **research and analysis** that relies on extracting structured information from news text:

1. **News Analytics**: Extract structured information from news articles at scale
2. **Information Retrieval**: Build searchable databases of entities from news corpora
3. **Knowledge Graph Construction**: Create entity relationship networks from news
4. **Content Analysis**: Identify key actors, locations, and organizations in news coverage
5. **Media Monitoring**: Track entity mentions across large news datasets
6. **Social Science Research**: Analyze patterns in news coverage and entity representation

### Example Applications

- Building entity databases from news archives
- Tracking political figures and organizations in media
- Analyzing geographic focus of news coverage
- Preprocessing for downstream NLP tasks (summarization, Q&A, classification)
- Creating entity timelines and co-occurrence networks
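
As one example of the downstream uses listed above, a co-occurrence network can be started from simple pair counts. The sketch below assumes you already have one parsed entity list per article in the model's JSON layout; `cooccurrence_counts` is a hypothetical helper, and entity strings are compared verbatim with no alias resolution.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(articles_entities):
    """Count how often each pair of distinct entity strings shares an article."""
    counts = Counter()
    for entities in articles_entities:
        names = sorted({e["text"] for e in entities})  # de-duplicate within an article
        counts.update(combinations(names, 2))
    return counts


articles = [
    [
        {"text": "Narendra Modi", "type": "Person"},
        {"text": "Amit Shah", "type": "Person"},
        {"text": "Mumbai", "type": "Location"},
    ],
    [
        {"text": "Narendra Modi", "type": "Person"},
        {"text": "Amit Shah", "type": "Person"},
    ],
]

pair_counts = cooccurrence_counts(articles)
for pair, count in pair_counts.most_common():
    print(pair, count)
```

The resulting counter can be fed to a graph library as weighted edges if a full network is needed.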

---

## Limitations

1. **Training Data Scope**:
   - Trained primarily on **Indian news** from Times of India
   - May show bias toward Indian entities, locations, and naming conventions
   - Performance on news from other countries/regions may vary

2. **Entity Type Coverage**:
   - Limited to 4 entity types (Location, Organisation, Person, Temporal)
   - Does not recognize: Events, Products, Diseases, Currencies, etc.
   - No nested or overlapping entity support

3. **Training Completeness**:
   - Model only saw ~42% of training data (5,000 steps vs ~11,875 for full epoch)
   - Loss plateaued at 0.90, suggesting potential for improvement with longer training
   - Stopping at the fixed 5,000-step budget, before a full epoch, may have left performance gains on the table

4. **Annotation Quality**:
   - Training labels generated by ConfliBERT model (not human-annotated)
   - Inherits any biases or errors from the annotation model
   - Confidence threshold (≥0.9) may have excluded valid but uncertain entities

5. **Context Window**: 
   - Articles truncated to 510 tokens (~400-500 words)
   - Entities appearing later in long articles were not seen during training

6. **Language**: English-only, optimized for Indian English conventions

7. **Output Parsing**: Model outputs JSON, but may occasionally produce malformed JSON. Implement robust parsing with error handling.

8. **Prompt Sensitivity**: **Critical limitation** - requires exact prompt formatting. Deviations will significantly degrade performance.
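
For the output-parsing limitation above, a defensive parser might look like the following. It is a minimal sketch rather than a complete solution: `parse_entities` is a hypothetical helper that extracts the outermost JSON array, returns an empty list on any parse failure, and drops items that lack the five expected fields or use an unknown entity type.

```python
import json

REQUIRED_KEYS = {"text", "type", "start", "end", "score"}
ENTITY_TYPES = {"Location", "Organisation", "Person", "Temporal"}


def parse_entities(raw_output):
    """Best-effort parse of the model output; returns [] if nothing usable is found."""
    start = raw_output.find("[")
    end = raw_output.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        parsed = json.loads(raw_output[start:end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    return [
        item for item in parsed
        if isinstance(item, dict)
        and REQUIRED_KEYS.issubset(item)
        and isinstance(item["type"], str)
        and item["type"] in ENTITY_TYPES
    ]


good = 'assistant\n\n[{"text": "Mumbai", "type": "Location", "start": 0, "end": 6, "score": 0.99}]'
truncated = '[{"text": "Mumbai", "type": '
print(parse_entities(good))
print(parse_entities(truncated))
```

Returning an empty list keeps batch jobs running; in practice you may prefer to log or retry the failing articles.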

---

## Performance Considerations

### Expected Behavior

- **Strong Performance**: Indian locations, political organizations, Hindi-origin names
- **Moderate Performance**: International entities, technical/specialized terms
- **Potential Challenges**: 
  - Ambiguous entity boundaries in complex phrases
  - Entities with unconventional capitalization
  - Code-mixed text (Hindi-English)
  - Very long or very short articles

### Recommended Use Cases

✅ **Good Fit:**
- Extracting entities from Indian news articles
- Processing Times of India or similar English-language Indian news
- Large-scale batch processing of news corpora
- Research requiring entities with confidence scores

❌ **Poor Fit:**
- Real-time critical applications (e.g., emergency response)
- Non-news text (scientific papers, social media, legal documents)
- Languages other than English
- Applications requiring 100% precision

---

## Ethical Considerations

### Responsible Use

1. **Training Data Provenance**: 
   - Model trained on Times of India articles without explicit permission
   - Users should verify compliance with data usage policies for their applications

2. **Bias and Representation**:
   - Training data reflects editorial decisions and biases of Times of India
   - May overrepresent certain geographic regions, political perspectives, or demographics
   - Entity recognition accuracy may vary across different communities

3. **Privacy Concerns**:
   - Model extracts names of real individuals from text
   - Users must handle personal information in compliance with privacy laws (GDPR, etc.)
   - Do not use for surveillance or unauthorized profiling

4. **Quality and Errors**:
   - Automated annotations (ConfliBERT) may contain errors
   - Critical applications should validate entity extractions manually
   - Do not use for legal, medical, or safety-critical decisions

5. **Dual-Use Potential**: While designed for research, entity extraction could support:
   - ❌ Surveillance or profiling of individuals/groups
   - ❌ Manipulation of public discourse
   - ❌ Unauthorized data harvesting
   - ✅ Academic research and journalistic analysis (intended use)

### Transparency Requirements

Users should:
- Disclose when findings are based on automated NER
- Report model limitations in publications
- Validate critical findings with manual review or alternative methods
- Cite both the model and the ConfliBERT annotation source

---

## Comparison with Original ConflLlama-NER

| Aspect | ConflLlama-NER (CAMEO) | ConflLlama-NER-TOI (This Model) |
|--------|------------------------|----------------------------------|
| **Training Data** | 1,094 sentences | 100,000 articles |
| **Domain** | Conflict events | General news |
| **Entity Types** | Source, Target, Related (role-based) | Location, Org, Person, Temporal (type-based) |
| **Annotation** | Manual (CAMEO-coded) | Automated (ConfliBERT) |
| **Geographic Focus** | International conflicts | Indian news |
| **Use Case** | Political violence research | General information extraction |
| **Training Steps** | 2,000 | 5,000 |
| **Scale** | Specialized, small | General, large |

**When to use which:**
- **CAMEO version**: Conflict analysis, event coding, semantic role extraction
- **TOI version (this)**: General news analysis, entity databases, broad NER tasks

---

## Citation


```bibtex
@article{meher2025confllama,
  title={ConflLlama: Domain-specific adaptation of large language models for conflict event classification},
  author={Meher, Shreyas and Brandt, Patrick T.},
  journal={Research \& Politics},
  volume={12},
  number={3},
  year={2025},
  publisher={SAGE Publications},
  doi={10.1177/20531680251356282}
}
```


---

## Acknowledgments

- **Funding**: NSF Award 2311142
- **Computing Resources**: Delta system at NCSA (University of Illinois) through ACCESS allocation CIS220162
- **Base Model**: [Unsloth](https://github.com/unslothai/unsloth) team for Llama-3 8B Instruct optimizations
- **Annotation Model**: Event Data UTD team for ConfliBERT
- **Data Source**: Times of India corpus
- **Infrastructure**: Hugging Face for model hosting and transformers library

---

## Model Versions

This repository contains the following model variants:

- **16-bit merged**: Unquantized 16-bit weights, largest file size, highest quality
- **4-bit quantized**: Smaller, faster inference with minimal quality loss
- **GGUF (Q4_K_M, Q8_0)**: Optimized for llama.cpp deployment
- **LoRA adapters**: For efficient storage and further fine-tuning

---

## Future Work

Potential improvements for future iterations:

1. **Complete Training**: Run for 2-3 full epochs to reduce loss below 0.90
2. **Hyperparameter Tuning**: Increase learning rate, warmup, and LoRA rank
3. **Enable Evaluation**: Add validation set monitoring for optimal checkpoint selection
4. **Expand Entity Types**: Include Events, Quantities, Miscellaneous categories
5. **Multilingual Support**: Add Hindi, other Indian languages
6. **Human Validation**: Sample-based quality assessment of ConfliBERT annotations
7. **Cross-Domain Testing**: Evaluate on non-Indian news sources

---

## License

Apache 2.0 - See [LICENSE](LICENSE) for details.

---

## Related Resources

- **Original ConflLlama (Attack Classification)**: [shreyasmeher/confllama](https://huggingface.co/shreyasmeher/confllama)
- **ConflLlama-NER (CAMEO version)**: [shreyasmeher/confllama-ner-sft](https://huggingface.co/shreyasmeher/confllama-ner-sft)
- **ConfliBERT Annotation Model**: [eventdata-utd/conflibert-named-entity-recognition](https://huggingface.co/eventdata-utd/conflibert-named-entity-recognition)
- **Research Paper**: [https://doi.org/10.1177/20531680251356282](https://doi.org/10.1177/20531680251356282)
- **Unsloth Framework**: [https://github.com/unslothai/unsloth](https://github.com/unslothai/unsloth)

---

## Contact

For questions, issues, or collaboration inquiries:
- **GitHub Issues**: [Repository Issues](https://github.com/shreyasmeher/confllama)
- **Hugging Face**: [@shreyasmeher](https://huggingface.co/shreyasmeher)
- **Email**: shreyas.meher@utdallas.edu

---

<p align="center">
  <img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>
</p>