---
base_model: unsloth/llama-3-8b-Instruct-bnb-4bit
license: apache-2.0
language:
- en
tags:
- llama.cpp
- gguf
- quantized
- q4_k_m
- q8_0
- named-entity-recognition
- ner
- news-analysis
- indian-news
- conflibert
---
# ConflLlama-NER-TOI: Large-Scale Named Entity Recognition for Indian News
---
## ⚠️ Important: Read Before Using
**This model requires exact prompt formatting to work correctly.** Please read the [Critical: Inference & Prompt Formatting](#critical-inference--prompt-formatting) section below before attempting to use this model. Using incorrect prompts will result in poor performance or hallucinations.
---
**ConflLlama-NER-TOI** is a large-scale named entity recognition model fine-tuned on **100,000 Indian news articles** from the Times of India. Built on **Llama-3 8B Instruct**, the model identifies and classifies four entity types across diverse news domains:
- **Location**: Geographic entities (cities, regions, countries)
- **Organisation**: Companies, government bodies, institutions, political parties
- **Person**: Named individuals, public figures, officials
- **Temporal**: Time expressions, dates, periods
This model is a significant scale-up from the original ConflLlama-NER: it was trained on roughly 1,300x more text (100,000 articles vs. 1,094 sentences), for broader domain coverage and improved generalization.
---
## Key Features
- **Large-Scale Training**: Fine-tuned on 100,000 news articles with high-confidence entity annotations
- **Multi-Domain Coverage**: Trained on diverse news topics including politics, sports, crime, economy, and entertainment
- **ConfliBERT Annotations**: Training labels generated by the state-of-the-art ConfliBERT entity recognition model
- **High-Quality Filtering**: Only entities with ≥0.9 confidence score included in training
- **JSON Output Format**: Returns structured entity lists with text, type, position, and confidence scores
- **Efficient Deployment**: Available in multiple quantization formats (Q4_K_M, Q8_0, BF16)
- **Instruction-Tuned**: Built on Llama-3 Instruct for robust prompt following
---
## Training Data
### Dataset: Times of India Corpus with ConfliBERT Annotations
The model was trained on a large-scale dataset derived from Indian news articles:
- **Total Articles**: ~1.5 million Times of India articles
- **Training Samples**: 100,000 (randomly sampled)
- **Training/Test Split**: 95,000 train / 5,000 test (95/5 split)
- **Text Processing**: Articles truncated to 510 tokens (~400-500 words) to fit the annotation model's 512-token context limit
- **Annotation Method**: Automated using `eventdata-utd/conflibert-named-entity-recognition`
- **Quality Filter**: Only entities with confidence score ≥0.9 included
- **Total Entities**: Hundreds of thousands of high-confidence annotations
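For illustration, here is a minimal sketch of the annotation-and-filtering step described above, using the public ConfliBERT checkpoint through the standard `transformers` pipeline. The exact preprocessing used to build the dataset is not published; only the model name and the 0.9 threshold come from this card, everything else is an assumption.
```python
# Sketch of the annotation step: tag an article with ConfliBERT and keep
# only entities at or above the 0.9 confidence threshold used for training.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="eventdata-utd/conflibert-named-entity-recognition",
    aggregation_strategy="simple",  # merge sub-word pieces into whole entities
)

article = "The Reserve Bank of India announced new policies in Mumbai on Tuesday."
entities = ner(article)

# High-confidence filter (>=0.9), mirroring the quality filter listed above
high_conf = [e for e in entities if e["score"] >= 0.9]
for e in high_conf:
    print(e["word"], e["entity_group"], e["start"], e["end"], round(float(e["score"]), 3))
```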
### Entity Type Definitions
| Entity Type | Description | Examples |
|-------------|-------------|----------|
| **Location** | Geographic entities including cities, states, countries, regions | "Mumbai", "Punjab", "India", "United States" |
| **Organisation** | Companies, government bodies, institutions, political parties, agencies | "Congress", "Reserve Bank of India", "Google", "Supreme Court" |
| **Person** | Named individuals, politicians, celebrities, officials | "Narendra Modi", "Virat Kohli", "Amit Shah" |
| **Temporal** | Time expressions, dates, periods, durations | "Monday", "2023", "last week", "evening" |
### Data Coverage
The training data spans multiple news domains:
- **Politics**: Government actions, elections, policy announcements
- **Crime**: Police reports, legal proceedings, incidents
- **Sports**: Matches, players, tournaments
- **Economy**: Business news, market reports, financial decisions
- **Entertainment**: Celebrity news, film industry, cultural events
- **Local News**: City-level events, regional developments
---
## Model Architecture
- **Base Model**: `unsloth/llama-3-8b-Instruct-bnb-4bit`
- **Fine-tuning Method**: QLoRA (Quantized Low-Rank Adaptation)
- **Quantization**: 4-bit with bitsandbytes
- **Maximum Sequence Length**: 2048 tokens
- **LoRA Configuration**:
- Rank (r): 16
- Alpha (lora_alpha): 16
- Target Modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
- Dropout: 0
- Gradient Checkpointing: Enabled (Unsloth optimization)
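A reconstruction of this LoRA setup in Unsloth's API, using only the hyperparameters listed above (this is a sketch, not the original training script):
```python
# QLoRA setup reconstructed from the hyperparameters listed above.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,  # 4-bit bitsandbytes quantization
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0,
    use_gradient_checkpointing="unsloth",  # Unsloth's memory optimization
)
```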
### Training Configuration
- **Optimizer**: AdamW 8-bit (memory efficient)
- **Learning Rate**: 2e-4 with 3% warmup (150 steps)
- **Batch Size**: 2 per device with 4 gradient accumulation steps (effective batch size: 8)
- **Training Steps**: 5,000 steps (~42% of one epoch)
- **Precision**: BFloat16 (when supported)
- **Hardware**: NVIDIA A100-SXM4-40GB GPU on NCSA Delta
- **Training Time**: ~3-4 hours
- **Memory Footprint**: ~8 GB VRAM
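Expressed as Hugging Face `TrainingArguments`, the configuration above would look roughly as follows (a reconstruction from the listed values, not the original script):
```python
# Training arguments reconstructed from the configuration listed above.
import torch
from transformers import TrainingArguments

args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,        # effective batch size: 8
    learning_rate=2e-4,
    warmup_steps=150,                     # 3% of 5,000 steps
    max_steps=5000,
    lr_scheduler_type="linear",           # linear decay from 2e-4 to 0
    optim="adamw_8bit",                   # memory-efficient 8-bit AdamW
    bf16=torch.cuda.is_bf16_supported(),  # BFloat16 when supported
    output_dir="outputs",
)
```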
### Training Dynamics
- **Initial Loss**: ~2.5
- **Final Loss**: ~0.90 (plateau observed)
- **Gradient Norm**: Stable at 0.25-0.35 throughout training
- **Learning Rate Schedule**: Linear decay from 2e-4 to 0
---
## Critical: Inference & Prompt Formatting
**IMPORTANT**: This model requires **exact prompt formatting** to function correctly. The model was instruction-tuned with a specific template. Deviating from this format will result in poor performance or hallucinations.
### Option 1: Using LM Studio / Llama.cpp (Recommended)
**System Prompt:**
```
Extract all named entities from the following text. Return them as a JSON list with 'text', 'type', 'start', 'end', and 'score' fields.
```
**User Message:**
```
Text: [YOUR ARTICLE TEXT HERE]
Entities:
```
### Option 2: Direct API/Python Integration
If your platform requires a single concatenated prompt:
```python
prompt = """<|start_header_id|>system<|end_header_id|>
Extract all named entities from the following text. Return them as a JSON list with 'text', 'type', 'start', 'end', and 'score' fields.<|eot_id|><|start_header_id|>user<|end_header_id|>
Text: {your_text_here}
Entities: <|eot_id|><|start_header_id|>assistant<|end_header_id|>"""
```
### Example Usage
**Input:**
```
Text: The Punjab government has placed under suspension two officials following the Bathinda incident. Chief Minister Amarinder Singh announced the decision in Chandigarh on Monday.
Entities:
```
**Expected Output:**
```json
[
{"text": "Punjab government", "type": "Organisation", "start": 4, "end": 21, "score": 0.995},
{"text": "Bathinda", "type": "Location", "start": 72, "end": 80, "score": 0.980},
{"text": "Amarinder Singh", "type": "Person", "start": 101, "end": 116, "score": 0.998},
{"text": "Chandigarh", "type": "Location", "start": 145, "end": 155, "score": 0.992},
{"text": "Monday", "type": "Temporal", "start": 159, "end": 165, "score": 0.975}
]
```
---
## Installation & Usage
### Using Llama.cpp (Recommended for Local Deployment)
```bash
# Download the Q4_K_M GGUF model
wget https://huggingface.co/shreyasmeher/ConflLlama-NER-TOI-GGUF/resolve/main/model-unsloth-Q4_K_M.gguf
# Run with llama.cpp
# Run with llama.cpp (flag names can vary across llama.cpp versions)
./llama-cli -m model-unsloth-Q4_K_M.gguf \
    --system-prompt "Extract all named entities from the following text. Return them as a JSON list with 'text', 'type', 'start', 'end', and 'score' fields." \
    --prompt "Text: [YOUR TEXT]\nEntities: " \
    --temp 0.3 \
    --n-predict 512
```
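Alternatively, the same GGUF file can be driven from Python via the `llama-cpp-python` bindings (not covered elsewhere in this card; a minimal sketch under that assumption):
```python
# Sketch using llama-cpp-python; maps the system/user format above onto its chat API.
from llama_cpp import Llama

llm = Llama(model_path="model-unsloth-Q4_K_M.gguf", n_ctx=2048)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "Extract all named entities from the following text. Return them as a JSON list with 'text', 'type', 'start', 'end', and 'score' fields."},
        {"role": "user", "content": "Text: [YOUR TEXT]\nEntities: "},
    ],
    temperature=0.3,
    max_tokens=512,
)
print(response["choices"][0]["message"]["content"])
```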
### Using Transformers (Python)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import json
# Load model and tokenizer
model_name = "shreyasmeher/ConflLlama-NER-TOI"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)
# Prepare input with correct formatting
text = "The Reserve Bank of India announced new policies in Mumbai on Tuesday."
prompt = f"""<|start_header_id|>system<|end_header_id|>
Extract all named entities from the following text. Return them as a JSON list with 'text', 'type', 'start', 'end', and 'score' fields.<|eot_id|><|start_header_id|>user<|end_header_id|>
Text: {text}
Entities: <|eot_id|><|start_header_id|>assistant<|end_header_id|>"""
# Generate
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=False,  # greedy decoding; temperature only applies when do_sample=True
    pad_token_id=tokenizer.eos_token_id,
)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Parse JSON output (the decoded string includes the prompt, so take the text after "Entities: ")
entities = json.loads(result.split("Entities: ")[-1])
print(entities)
```
### Using Unsloth (For Further Fine-tuning)
```python
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="shreyasmeher/ConflLlama-NER-TOI",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)
# Enable inference mode
FastLanguageModel.for_inference(model)
# Generate (reuses the `prompt` string built in the Transformers example above)
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.3)
result = tokenizer.batch_decode(outputs)
```
---
## Intended Use
This model is designed for **research and analysis** of news and information extraction:
1. **News Analytics**: Extract structured information from news articles at scale
2. **Information Retrieval**: Build searchable databases of entities from news corpora
3. **Knowledge Graph Construction**: Create entity relationship networks from news
4. **Content Analysis**: Identify key actors, locations, and organizations in news coverage
5. **Media Monitoring**: Track entity mentions across large news datasets
6. **Social Science Research**: Analyze patterns in news coverage and entity representation
### Example Applications
- Building entity databases from news archives
- Tracking political figures and organizations in media
- Analyzing geographic focus of news coverage
- Preprocessing for downstream NLP tasks (summarization, Q&A, classification)
- Creating entity timelines and co-occurrence networks
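As a toy illustration of the last application, co-occurrence counts can be accumulated directly from per-article entity lists (the entity data below is made up):
```python
# Toy example: count entity co-occurrences across articles.
from collections import Counter
from itertools import combinations

# Hypothetical model output for two articles (entity text only)
articles_entities = [
    ["Narendra Modi", "Congress", "Delhi", "Monday"],
    ["Narendra Modi", "Delhi", "Supreme Court"],
]

cooccurrence = Counter()
for ents in articles_entities:
    for pair in combinations(sorted(set(ents)), 2):
        cooccurrence[pair] += 1

print(cooccurrence.most_common(3))  # e.g. (('Delhi', 'Narendra Modi'), 2) first
```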
---
## Limitations
1. **Training Data Scope**:
- Trained primarily on **Indian news** from Times of India
- May show bias toward Indian entities, locations, and naming conventions
- Performance on news from other countries/regions may vary
2. **Entity Type Coverage**:
- Limited to 4 entity types (Location, Organisation, Person, Temporal)
- Does not recognize: Events, Products, Diseases, Currencies, etc.
- No nested or overlapping entity support
3. **Training Completeness**:
- Model only saw ~42% of training data (5,000 steps vs ~11,875 for full epoch)
- Loss plateaued at 0.90, suggesting potential for improvement with longer training
- Early stopping may have left performance gains on the table
4. **Annotation Quality**:
- Training labels generated by ConfliBERT model (not human-annotated)
- Inherits any biases or errors from the annotation model
- Confidence threshold (≥0.9) may have excluded valid but uncertain entities
5. **Context Window**:
- Articles truncated to 510 tokens (~400-500 words)
- Entities appearing later in long articles were not seen during training
6. **Language**: English-only, optimized for Indian English conventions
7. **Output Parsing**: The model outputs JSON but may occasionally produce malformed JSON; implement robust parsing with error handling (see the defensive sketch after this list).
8. **Prompt Sensitivity**: **Critical limitation** - requires exact prompt formatting. Deviations will significantly degrade performance.
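A minimal defensive parser for point 7, assuming only that a valid generation contains a single JSON array (the function name is illustrative):
```python
# Defensive parsing: extract the first JSON array from the raw generation
# and fail gracefully instead of raising on malformed output.
import json
import re

def parse_entities(raw_output: str) -> list:
    """Return the entity list, or [] if no valid JSON array is found."""
    match = re.search(r"\[.*\]", raw_output, re.DOTALL)
    if match is None:
        return []
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
```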
---
## Performance Considerations
### Expected Behavior
- **Strong Performance**: Indian locations, political organizations, Hindi-origin names
- **Moderate Performance**: International entities, technical/specialized terms
- **Potential Challenges**:
- Ambiguous entity boundaries in complex phrases
- Entities with unconventional capitalization
- Code-mixed text (Hindi-English)
- Very long or very short articles
### Recommended Use Cases
✅ **Good Fit:**
- Extracting entities from Indian news articles
- Processing Times of India or similar English-language Indian news
- Large-scale batch processing of news corpora
- Research requiring entities with confidence scores
❌ **Poor Fit:**
- Real-time critical applications (e.g., emergency response)
- Non-news text (scientific papers, social media, legal documents)
- Languages other than English
- Applications requiring 100% precision
---
## Ethical Considerations
### Responsible Use
1. **Training Data Provenance**:
- Model trained on Times of India articles without explicit permission
- Users should verify compliance with data usage policies for their applications
2. **Bias and Representation**:
- Training data reflects editorial decisions and biases of Times of India
- May overrepresent certain geographic regions, political perspectives, or demographics
- Entity recognition accuracy may vary across different communities
3. **Privacy Concerns**:
- Model extracts names of real individuals from text
- Users must handle personal information in compliance with privacy laws (GDPR, etc.)
- Do not use for surveillance or unauthorized profiling
4. **Quality and Errors**:
- Automated annotations (ConfliBERT) may contain errors
- Critical applications should validate entity extractions manually
- Do not use for legal, medical, or safety-critical decisions
5. **Dual-Use Potential**: While designed for research, entity extraction could support:
- ❌ Surveillance or profiling of individuals/groups
- ❌ Manipulation of public discourse
- ❌ Unauthorized data harvesting
- ✅ Academic research and journalistic analysis (intended use)
### Transparency Requirements
Users should:
- Disclose when findings are based on automated NER
- Report model limitations in publications
- Validate critical findings with manual review or alternative methods
- Cite both the model and the ConfliBERT annotation source
---
## Comparison with Original ConflLlama-NER
| Aspect | ConflLlama-NER (CAMEO) | ConflLlama-NER-TOI (This Model) |
|--------|------------------------|----------------------------------|
| **Training Data** | 1,094 sentences | 100,000 articles |
| **Domain** | Conflict events | General news |
| **Entity Types** | Source, Target, Related (role-based) | Location, Org, Person, Temporal (type-based) |
| **Annotation** | Manual (CAMEO-coded) | Automated (ConfliBERT) |
| **Geographic Focus** | International conflicts | Indian news |
| **Use Case** | Political violence research | General information extraction |
| **Training Steps** | 2,000 | 5,000 |
| **Scale** | Specialized, small | General, large |
**When to use which:**
- **CAMEO version**: Conflict analysis, event coding, semantic role extraction
- **TOI version (this)**: General news analysis, entity databases, broad NER tasks
---
## Citation
```bibtex
@article{meher2025confllama,
  title={ConflLlama: Domain-specific adaptation of large language models for conflict event classification},
  author={Meher, Shreyas and Brandt, Patrick T.},
  journal={Research \& Politics},
  volume={12},
  number={3},
  year={2025},
  publisher={SAGE Publications},
  doi={10.1177/20531680251356282}
}
```
---
## Acknowledgments
- **Funding**: NSF Award 2311142
- **Computing Resources**: Delta system at NCSA (University of Illinois) through ACCESS allocation CIS220162
- **Base Model**: [Unsloth](https://github.com/unslothai/unsloth) team for Llama-3 8B Instruct optimizations
- **Annotation Model**: Event Data UTD team for ConfliBERT
- **Data Source**: Times of India corpus
- **Infrastructure**: Hugging Face for model hosting and transformers library
---
## Model Versions
This repository contains the following model variants:
- **16-bit merged**: Full precision, largest file size, highest quality
- **4-bit quantized**: Smaller, faster inference with minimal quality loss
- **GGUF (Q4_K_M, Q8_0)**: Optimized for llama.cpp deployment
- **LoRA adapters**: For efficient storage and further fine-tuning
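A specific variant can be fetched programmatically with `huggingface_hub`; the filename below is the one used in the llama.cpp example above and may differ for other variants:
```python
# Download one GGUF variant to the local Hugging Face cache.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="shreyasmeher/ConflLlama-NER-TOI-GGUF",
    filename="model-unsloth-Q4_K_M.gguf",
)
print(path)  # local path of the downloaded file
```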
---
## Future Work
Potential improvements for future iterations:
1. **Complete Training**: Run for 2-3 full epochs to reduce loss below 0.90
2. **Hyperparameter Tuning**: Increase learning rate, warmup, and LoRA rank
3. **Enable Evaluation**: Add validation set monitoring for optimal checkpoint selection
4. **Expand Entity Types**: Include Events, Quantities, Miscellaneous categories
5. **Multilingual Support**: Add Hindi, other Indian languages
6. **Human Validation**: Sample-based quality assessment of ConfliBERT annotations
7. **Cross-Domain Testing**: Evaluate on non-Indian news sources
---
## License
Apache 2.0 - See [LICENSE](LICENSE) for details.
---
## Related Resources
- **Original ConflLlama (Attack Classification)**: [shreyasmeher/confllama](https://huggingface.co/shreyasmeher/confllama)
- **ConflLlama-NER (CAMEO version)**: [shreyasmeher/confllama-ner-sft](https://huggingface.co/shreyasmeher/confllama-ner-sft)
- **ConfliBERT Annotation Model**: [eventdata-utd/conflibert-named-entity-recognition](https://huggingface.co/eventdata-utd/conflibert-named-entity-recognition)
- **Research Paper**: [https://doi.org/10.1177/20531680251356282](https://doi.org/10.1177/20531680251356282)
- **Unsloth Framework**: [https://github.com/unslothai/unsloth](https://github.com/unslothai/unsloth)
---
## Contact
For questions, issues, or collaboration inquiries:
- **GitHub Issues**: [Repository Issues](https://github.com/shreyasmeher/confllama)
- **Hugging Face**: [@shreyasmeher](https://huggingface.co/shreyasmeher)
- **Email**: shreyas.meher@utdallas.edu
---