ConflLlama-NER-TOI: Large-Scale Named Entity Recognition for Indian News
⚠️ Important: Read Before Using
This model requires exact prompt formatting to work correctly. Please read the Critical: Inference & Prompt Formatting section below before attempting to use this model. Using incorrect prompts will result in poor performance or hallucinations.
ConflLlama-NER-TOI is a large-scale named entity recognition model fine-tuned on 100,000 Indian news articles from the Times of India. Built on Llama-3 8B Instruct, this model identifies and classifies four entity types across diverse news domains:
- Location: Geographic entities (cities, regions, countries)
- Organisation: Companies, government bodies, institutions, political parties
- Person: Named individuals, public figures, officials
- Temporal: Time expressions, dates, periods
This model represents a significant scale-up from the original ConflLlama-NER, trained on 1,300x more data for broader domain coverage and improved generalization.
Key Features
- Large-Scale Training: Fine-tuned on 100,000 news articles with high-confidence entity annotations
- Multi-Domain Coverage: Trained on diverse news topics including politics, sports, crime, economy, and entertainment
- ConfliBERT Annotations: Training labels generated automatically with the ConfliBERT entity recognition model
- High-Quality Filtering: Only entities with ≥0.9 confidence score included in training
- JSON Output Format: Returns structured entity lists with text, type, position, and confidence scores
- Efficient Deployment: Available in multiple quantization formats (Q4_K_M, Q8_0, BF16)
- Instruction-Tuned: Built on Llama-3 Instruct for robust prompt following
Training Data
Dataset: Times of India Corpus with ConfliBERT Annotations
The model was trained on a large-scale dataset derived from Indian news articles:
- Total Articles: ~1.5 million Times of India articles
- Training Samples: 100,000 (randomly sampled)
- Training/Test Split: 95,000 train / 5,000 test (95/5 split)
- Text Processing: Articles truncated to 510 tokens (~400-500 words) to fit context limits
- Annotation Method: Automated using `eventdata-utd/conflibert-named-entity-recognition`
- Quality Filter: Only entities with confidence score ≥0.9 included
- Total Entities: Hundreds of thousands of high-confidence annotations
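The annotation step described above can be approximated with the standard `transformers` NER pipeline. A minimal sketch (the character-based truncation shortcut and the `annotate` helper are illustrative assumptions, not the published pipeline):

```python
from transformers import pipeline

# ConfliBERT NER pipeline used to label the training data
ner = pipeline(
    "ner",
    model="eventdata-utd/conflibert-named-entity-recognition",
    aggregation_strategy="simple",  # merge word pieces into whole entities
)

def annotate(article_text: str, min_score: float = 0.9):
    # Rough character-level stand-in for the 510-token truncation
    entities = ner(article_text[:2000])
    return [
        {
            "text": e["word"],
            "type": e["entity_group"],
            "start": e["start"],
            "end": e["end"],
            "score": round(float(e["score"]), 3),
        }
        for e in entities
        if e["score"] >= min_score  # keep only high-confidence entities
    ]
```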
Entity Type Definitions
| Entity Type | Description | Examples |
|---|---|---|
| Location | Geographic entities including cities, states, countries, regions | "Mumbai", "Punjab", "India", "United States" |
| Organisation | Companies, government bodies, institutions, political parties, agencies | "Congress", "Reserve Bank of India", "Google", "Supreme Court" |
| Person | Named individuals, politicians, celebrities, officials | "Narendra Modi", "Virat Kohli", "Amit Shah" |
| Temporal | Time expressions, dates, periods, durations | "Monday", "2023", "last week", "evening" |
Data Coverage
The training data spans multiple news domains:
- Politics: Government actions, elections, policy announcements
- Crime: Police reports, legal proceedings, incidents
- Sports: Matches, players, tournaments
- Economy: Business news, market reports, financial decisions
- Entertainment: Celebrity news, film industry, cultural events
- Local News: City-level events, regional developments
Model Architecture
- Base Model: `unsloth/llama-3-8b-Instruct-bnb-4bit`
- Fine-tuning Method: QLoRA (Quantized Low-Rank Adaptation)
- Quantization: 4-bit with bitsandbytes
- Maximum Sequence Length: 2048 tokens
- LoRA Configuration:
- Rank (r): 16
- Alpha (lora_alpha): 16
- Target Modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
- Dropout: 0
- Gradient Checkpointing: Enabled (Unsloth optimization)
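A minimal sketch of how this configuration maps onto the Unsloth API (the exact training script is not published with this card):

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,  # 4-bit quantization via bitsandbytes
)

# Attach LoRA adapters with the settings listed above
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # Unsloth-optimized checkpointing
)
```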
Training Configuration
- Optimizer: AdamW 8-bit (memory efficient)
- Learning Rate: 2e-4 with 3% warmup (150 steps)
- Batch Size: 2 per device with 4 gradient accumulation steps (effective batch size: 8)
- Training Steps: 5,000 steps (~42% of one epoch)
- Precision: BFloat16 (when supported)
- Hardware: NVIDIA A100-SXM4-40GB GPU on NCSA Delta
- Training Time: ~3-4 hours
- Memory Footprint: ~8 GB VRAM
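These hyperparameters map roughly onto Hugging Face `TrainingArguments` (typically passed to TRL's `SFTTrainer` in Unsloth workflows); a sketch, not the published script:

```python
import torch
from transformers import TrainingArguments

args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,        # effective batch size: 2 x 4 = 8
    learning_rate=2e-4,
    warmup_steps=150,                     # ~3% of 5,000 steps
    max_steps=5000,
    optim="adamw_8bit",                   # memory-efficient AdamW
    lr_scheduler_type="linear",           # decay from 2e-4 to 0
    bf16=torch.cuda.is_bf16_supported(),  # BFloat16 when supported
    output_dir="outputs",
)
```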
Training Dynamics
- Initial Loss: ~2.5
- Final Loss: ~0.90 (plateau observed)
- Gradient Norm: Stable at 0.25-0.35 throughout training
- Learning Rate Schedule: Linear decay from 2e-4 to 0
Critical: Inference & Prompt Formatting
IMPORTANT: This model requires exact prompt formatting to function correctly. The model was instruction-tuned with a specific template. Deviating from this format will result in poor performance or hallucinations.
Option 1: Using LM Studio / Llama.cpp (Recommended)
System Prompt:
```
Extract all named entities from the following text. Return them as a JSON list with 'text', 'type', 'start', 'end', and 'score' fields.
```
User Message:
```
Text: [YOUR ARTICLE TEXT HERE]
Entities:
```
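If you drive LM Studio (or any OpenAI-compatible server) from code, the same two-part prompt can be sent as chat messages. A sketch, assuming LM Studio's default port and a placeholder model identifier:

```python
from openai import OpenAI

# LM Studio serves an OpenAI-compatible API on port 1234 by default
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="confllama-ner-toi",  # placeholder; use the identifier your server reports
    messages=[
        {"role": "system", "content": (
            "Extract all named entities from the following text. Return them as a "
            "JSON list with 'text', 'type', 'start', 'end', and 'score' fields."
        )},
        {"role": "user", "content": "Text: [YOUR ARTICLE TEXT HERE]\nEntities: "},
    ],
    temperature=0.3,
)
print(response.choices[0].message.content)
```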
Option 2: Direct API/Python Integration
If your platform requires a single concatenated prompt:
prompt = """<|start_header_id|>system<|end_header_id|>
Extract all named entities from the following text. Return them as a JSON list with 'text', 'type', 'start', 'end', and 'score' fields.<|eot_id|><|start_header_id|>user<|end_header_id|>
Text: {your_text_here}
Entities: <|eot_id|><|start_header_id|>assistant<|end_header_id|>"""
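Alternatively, the prompt can be built from the tokenizer's chat template, assuming the tokenizer ships the standard Llama-3 template (which yields the header/eot layout shown above):

```python
# `tokenizer` loaded as in the Transformers example below; `your_text_here`
# is a placeholder for the article text.
messages = [
    {"role": "system", "content": (
        "Extract all named entities from the following text. Return them as a "
        "JSON list with 'text', 'type', 'start', 'end', and 'score' fields."
    )},
    {"role": "user", "content": f"Text: {your_text_here}\nEntities: "},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
```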
Example Usage
Input:
```
Text: The Punjab government has placed under suspension two officials following the Bathinda incident. Chief Minister Amarinder Singh announced the decision in Chandigarh on Monday.
Entities:
```
Expected Output:
```json
[
  {"text": "Punjab government", "type": "Organisation", "start": 4, "end": 21, "score": 0.995},
  {"text": "Bathinda", "type": "Location", "start": 72, "end": 80, "score": 0.980},
  {"text": "Amarinder Singh", "type": "Person", "start": 101, "end": 116, "score": 0.998},
  {"text": "Chandigarh", "type": "Location", "start": 145, "end": 155, "score": 0.992},
  {"text": "Monday", "type": "Temporal", "start": 159, "end": 165, "score": 0.975}
]
```
Installation & Usage
Using Llama.cpp (Recommended for Local Deployment)
```bash
# Download the Q4_K_M GGUF model
wget https://huggingface.co/shreyasmeher/ConflLlama-NER-TOI-GGUF/resolve/main/model-unsloth-Q4_K_M.gguf

# Run with llama.cpp (recent builds name the flag --system-prompt / -sys)
./llama-cli -m model-unsloth-Q4_K_M.gguf \
  --system-prompt "Extract all named entities from the following text. Return them as a JSON list with 'text', 'type', 'start', 'end', and 'score' fields." \
  --prompt "Text: [YOUR TEXT]\nEntities: " \
  --temp 0.3 \
  --n-predict 512
```
Using Transformers (Python)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import json

# Load model and tokenizer
model_name = "shreyasmeher/ConflLlama-NER-TOI"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Prepare input with correct formatting
text = "The Reserve Bank of India announced new policies in Mumbai on Tuesday."
prompt = f"""<|start_header_id|>system<|end_header_id|>

Extract all named entities from the following text. Return them as a JSON list with 'text', 'type', 'start', 'end', and 'score' fields.<|eot_id|><|start_header_id|>user<|end_header_id|>

Text: {text}
Entities: <|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

# Generate (sampling at the recommended temperature; with do_sample=False
# the temperature setting would be silently ignored)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.3,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Parse JSON output (the decoded string still contains the prompt,
# so take everything after the final "Entities:" marker)
entities = json.loads(result.split("Entities:")[-1].strip())
print(entities)
```
Using Unsloth (For Further Fine-tuning)
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="shreyasmeher/ConflLlama-NER-TOI",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

# Enable inference mode
FastLanguageModel.for_inference(model)

# Generate (reuses the `prompt` string built as in the Transformers example above)
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.3, do_sample=True)
result = tokenizer.batch_decode(outputs)
```
Intended Use
This model is designed for research and analysis of news and information extraction:
- News Analytics: Extract structured information from news articles at scale
- Information Retrieval: Build searchable databases of entities from news corpora
- Knowledge Graph Construction: Create entity relationship networks from news
- Content Analysis: Identify key actors, locations, and organizations in news coverage
- Media Monitoring: Track entity mentions across large news datasets
- Social Science Research: Analyze patterns in news coverage and entity representation
Example Applications
- Building entity databases from news archives
- Tracking political figures and organizations in media
- Analyzing geographic focus of news coverage
- Preprocessing for downstream NLP tasks (summarization, Q&A, classification)
- Creating entity timelines and co-occurrence networks
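As an illustration of the last item, a minimal co-occurrence count over the model's per-article output (the helper and input shape are assumptions):

```python
from collections import Counter
from itertools import combinations

def cooccurrence(per_article_entities):
    """Count how often two entity strings appear in the same article.

    `per_article_entities` is a list of entity lists as returned by the
    model: one list of {"text", "type", ...} dicts per article."""
    pairs = Counter()
    for entities in per_article_entities:
        names = sorted({e["text"] for e in entities})
        pairs.update(combinations(names, 2))
    return pairs
```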
Limitations
Training Data Scope:
- Trained primarily on Indian news from Times of India
- May show bias toward Indian entities, locations, and naming conventions
- Performance on news from other countries/regions may vary
Entity Type Coverage:
- Limited to 4 entity types (Location, Organisation, Person, Temporal)
- Does not recognize: Events, Products, Diseases, Currencies, etc.
- No nested or overlapping entity support
Training Completeness:
- Model only saw ~42% of training data (5,000 steps vs ~11,875 for full epoch)
- Loss plateaued at 0.90, suggesting potential for improvement with longer training
- Early stopping may have left performance gains on the table
Annotation Quality:
- Training labels generated by ConfliBERT model (not human-annotated)
- Inherits any biases or errors from the annotation model
- Confidence threshold (≥0.9) may have excluded valid but uncertain entities
Context Window:
- Articles truncated to 510 tokens (~400-500 words)
- Entities appearing later in long articles were not seen during training
Language: English-only, optimized for Indian English conventions
Output Parsing: The model outputs JSON but may occasionally produce malformed JSON; implement robust parsing with error handling (see the sketch after this list).
Prompt Sensitivity: Critical limitation - requires exact prompt formatting. Deviations will significantly degrade performance.
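Given the output-parsing caveat above, a defensive parser along these lines is advisable (a sketch; the helper name and empty-list fallback are design choices, not part of the model):

```python
import json
import re

def parse_entities(raw_output: str):
    """Best-effort parse of the model's JSON entity list.

    Returns [] on malformed output instead of raising."""
    # Keep only the text after the last "Entities:" marker, if present
    tail = raw_output.split("Entities:")[-1]
    # Grab the bracketed list in case the model adds trailing text
    match = re.search(r"\[.*\]", tail, re.DOTALL)
    if match is None:
        return []
    try:
        entities = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    # Drop records missing the expected fields
    return [e for e in entities
            if isinstance(e, dict) and {"text", "type"} <= e.keys()]
```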
Performance Considerations
Expected Behavior
- Strong Performance: Indian locations, political organizations, Hindi-origin names
- Moderate Performance: International entities, technical/specialized terms
- Potential Challenges:
- Ambiguous entity boundaries in complex phrases
- Entities with unconventional capitalization
- Code-mixed text (Hindi-English)
- Very long or very short articles
Recommended Use Cases
✅ Good Fit:
- Extracting entities from Indian news articles
- Processing Times of India or similar English-language Indian news
- Large-scale batch processing of news corpora
- Research requiring entities with confidence scores
❌ Poor Fit:
- Real-time critical applications (e.g., emergency response)
- Non-news text (scientific papers, social media, legal documents)
- Languages other than English
- Applications requiring 100% precision
Ethical Considerations
Responsible Use
Training Data Provenance:
- Model trained on Times of India articles without explicit permission
- Users should verify compliance with data usage policies for their applications
Bias and Representation:
- Training data reflects editorial decisions and biases of Times of India
- May overrepresent certain geographic regions, political perspectives, or demographics
- Entity recognition accuracy may vary across different communities
Privacy Concerns:
- Model extracts names of real individuals from text
- Users must handle personal information in compliance with privacy laws (GDPR, etc.)
- Do not use for surveillance or unauthorized profiling
Quality and Errors:
- Automated annotations (ConfliBERT) may contain errors
- Critical applications should validate entity extractions manually
- Do not use for legal, medical, or safety-critical decisions
Dual-Use Potential: While designed for research, entity extraction could support:
- ❌ Surveillance or profiling of individuals/groups
- ❌ Manipulation of public discourse
- ❌ Unauthorized data harvesting
- ✅ Academic research and journalistic analysis (intended use)
Transparency Requirements
Users should:
- Disclose when findings are based on automated NER
- Report model limitations in publications
- Validate critical findings with manual review or alternative methods
- Cite both the model and the ConfliBERT annotation source
Comparison with Original ConflLlama-NER
| Aspect | ConflLlama-NER (CAMEO) | ConflLlama-NER-TOI (This Model) |
|---|---|---|
| Training Data | 1,094 sentences | 100,000 articles |
| Domain | Conflict events | General news |
| Entity Types | Source, Target, Related (role-based) | Location, Org, Person, Temporal (type-based) |
| Annotation | Manual (CAMEO-coded) | Automated (ConfliBERT) |
| Geographic Focus | International conflicts | Indian news |
| Use Case | Political violence research | General information extraction |
| Training Steps | 2,000 | 5,000 |
| Scale | Specialized, small | General, large |
When to use which:
- CAMEO version: Conflict analysis, event coding, semantic role extraction
- TOI version (this): General news analysis, entity databases, broad NER tasks
Citation
```bibtex
@article{meher2025confllama,
  title={ConflLlama: Domain-specific adaptation of large language models for conflict event classification},
  author={Meher, Shreyas and Brandt, Patrick T.},
  journal={Research \& Politics},
  volume={12},
  number={3},
  year={2025},
  publisher={SAGE Publications},
  doi={10.1177/20531680251356282}
}
```
Acknowledgments
- Funding: NSF Award 2311142
- Computing Resources: Delta system at NCSA (University of Illinois) through ACCESS allocation CIS220162
- Base Model: Unsloth team for Llama-3 8B Instruct optimizations
- Annotation Model: Event Data UTD team for ConfliBERT
- Data Source: Times of India corpus
- Infrastructure: Hugging Face for model hosting and transformers library
Model Versions
This repository contains the following model variants:
- 16-bit merged: Full precision, largest file size, highest quality
- 4-bit quantized: Smaller, faster inference with minimal quality loss
- GGUF (Q4_K_M, Q8_0): Optimized for llama.cpp deployment
- LoRA adapters: For efficient storage and further fine-tuning
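To fetch a specific variant programmatically, `huggingface_hub` works as expected; the repo and filename below match the llama.cpp example above:

```python
from huggingface_hub import hf_hub_download

# Downloads to the local HF cache and returns the file path
gguf_path = hf_hub_download(
    repo_id="shreyasmeher/ConflLlama-NER-TOI-GGUF",
    filename="model-unsloth-Q4_K_M.gguf",
)
print(gguf_path)  # ready to pass to llama.cpp via -m
```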
Future Work
Potential improvements for future iterations:
- Complete Training: Run for 2-3 full epochs to reduce loss below 0.90
- Hyperparameter Tuning: Increase learning rate, warmup, and LoRA rank
- Enable Evaluation: Add validation set monitoring for optimal checkpoint selection
- Expand Entity Types: Include Events, Quantities, Miscellaneous categories
- Multilingual Support: Add Hindi, other Indian languages
- Human Validation: Sample-based quality assessment of ConfliBERT annotations
- Cross-Domain Testing: Evaluate on non-Indian news sources
License
Apache 2.0 - See LICENSE for details.
Related Resources
- Original ConflLlama (Attack Classification): shreyasmeher/confllama
- ConflLlama-NER (CAMEO version): shreyasmeher/confllama-ner-sft
- ConfliBERT Annotation Model: eventdata-utd/conflibert-named-entity-recognition
- Research Paper: https://doi.org/10.1177/20531680251356282
- Unsloth Framework: https://github.com/unslothai/unsloth
Contact
For questions, issues, or collaboration inquiries:
- GitHub Issues: Repository Issues
- Hugging Face: @shreyasmeher
- Email: shreyas.meher@utdallas.edu