ConflLlama-NER-TOI: Large-Scale Named Entity Recognition for Indian News



⚠️ Important: Read Before Using

This model requires exact prompt formatting to work correctly. Please read the Critical: Inference & Prompt Formatting section below before attempting to use this model. Using incorrect prompts will result in poor performance or hallucinations.


ConflLlama-NER-TOI is a large-scale named entity recognition model fine-tuned on 100,000 Indian news articles from the Times of India. Built upon Llama-3.1 8B Instruct, this model identifies and classifies four entity types across diverse news domains:

  • Location: Geographic entities (cities, regions, countries)
  • Organisation: Companies, government bodies, institutions, political parties
  • Person: Named individuals, public figures, officials
  • Temporal: Time expressions, dates, periods

This model represents a significant scale-up from the original ConflLlama-NER, trained on 1,300x more data for broader domain coverage and improved generalization.


Key Features

  • Large-Scale Training: Fine-tuned on 100,000 news articles with high-confidence entity annotations
  • Multi-Domain Coverage: Trained on diverse news topics including politics, sports, crime, economy, and entertainment
  • ConfliBERT Annotations: Uses a state-of-the-art entity recognition model for training data generation
  • High-Quality Filtering: Only entities with ≥0.9 confidence score included in training
  • JSON Output Format: Returns structured entity lists with text, type, position, and confidence scores
  • Efficient Deployment: Available in multiple precision and quantization formats (Q4_K_M, Q8_0, BF16)
  • Instruction-Tuned: Built on Llama-3.1 Instruct for robust prompt following

Training Data

Dataset: Times of India Corpus with ConfliBERT Annotations

The model was trained on a large-scale dataset derived from Indian news articles:

  • Total Articles: ~1.5 million Times of India articles
  • Training Samples: 100,000 (randomly sampled)
  • Training/Test Split: 95,000 train / 5,000 test (95/5 split)
  • Text Processing: Articles truncated to 510 tokens (~400-500 words) to fit context limits
  • Annotation Method: Automated using eventdata-utd/conflibert-named-entity-recognition
  • Quality Filter: Only entities with a confidence score ≥0.9 included (see the annotation sketch after this list)
  • Total Entities: Hundreds of thousands of high-confidence annotations
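
A minimal sketch of how such annotations can be generated and filtered with the standard transformers pipeline; the exact truncation and label-mapping logic used to build the dataset is not published, so treat the details here (word-based truncation, aggregation strategy) as assumptions:

from transformers import pipeline

# ConfliBERT model cited above as the source of the training labels
ner = pipeline(
    "ner",
    model="eventdata-utd/conflibert-named-entity-recognition",
    aggregation_strategy="simple",  # merge word pieces into whole entity spans
)

def annotate(article: str) -> list:
    # Rough stand-in for the 510-token truncation described above
    truncated = " ".join(article.split()[:450])
    # Keep only high-confidence entities, mirroring the >=0.9 filter
    return [e for e in ner(truncated) if e["score"] >= 0.9]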

Entity Type Definitions

| Entity Type | Description | Examples |
|---|---|---|
| Location | Geographic entities including cities, states, countries, regions | "Mumbai", "Punjab", "India", "United States" |
| Organisation | Companies, government bodies, institutions, political parties, agencies | "Congress", "Reserve Bank of India", "Google", "Supreme Court" |
| Person | Named individuals, politicians, celebrities, officials | "Narendra Modi", "Virat Kohli", "Amit Shah" |
| Temporal | Time expressions, dates, periods, durations | "Monday", "2023", "last week", "evening" |

Data Coverage

The training data spans multiple news domains:

  • Politics: Government actions, elections, policy announcements
  • Crime: Police reports, legal proceedings, incidents
  • Sports: Matches, players, tournaments
  • Economy: Business news, market reports, financial decisions
  • Entertainment: Celebrity news, film industry, cultural events
  • Local News: City-level events, regional developments

Model Architecture

  • Base Model: unsloth/llama-3-8b-Instruct-bnb-4bit
  • Fine-tuning Method: QLoRA (Quantized Low-Rank Adaptation)
  • Quantization: 4-bit with bitsandbytes
  • Maximum Sequence Length: 2048 tokens
  • LoRA Configuration (reproduced in the sketch after this list):
    • Rank (r): 16
    • Alpha (lora_alpha): 16
    • Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
    • Dropout: 0
    • Gradient Checkpointing: Enabled (Unsloth optimization)
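
A minimal Unsloth sketch reproducing this adapter setup (hyperparameters copied from the list above; everything else is illustrative):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # Unsloth's memory-saving checkpointing
)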

Training Configuration

  • Optimizer: AdamW 8-bit (memory efficient)
  • Learning Rate: 2e-4 with 3% warmup (150 steps)
  • Batch Size: 2 per device with 4 gradient accumulation steps (effective batch size: 8)
  • Training Steps: 5,000 steps (~42% of one epoch)
  • Precision: BFloat16 (when supported)
  • Hardware: NVIDIA A100-SXM4-40GB GPU on NCSA Delta
  • Training Time: ~3-4 hours
  • Memory Footprint: ~8 GB VRAM
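
This configuration maps onto a standard TRL SFTTrainer run; a hedged sketch (argument names vary across TRL versions, `dataset` stands in for the 95,000 formatted training samples, and the output path is arbitrary):

from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,  # hypothetical: formatted prompt/response pairs
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size: 8
        warmup_steps=150,               # 3% of 5,000 steps
        max_steps=5000,
        learning_rate=2e-4,
        optim="adamw_8bit",
        lr_scheduler_type="linear",
        bf16=True,
        output_dir="outputs",
    ),
)
trainer.train()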

Training Dynamics

  • Initial Loss: ~2.5
  • Final Loss: ~0.90 (plateau observed)
  • Gradient Norm: Stable at 0.25-0.35 throughout training
  • Learning Rate Schedule: Linear decay from 2e-4 to 0

Critical: Inference & Prompt Formatting

IMPORTANT: This model requires exact prompt formatting to function correctly. The model was instruction-tuned with a specific template. Deviating from this format will result in poor performance or hallucinations.

Option 1: Using LM Studio / Llama.cpp (Recommended)

System Prompt:

Extract all named entities from the following text. Return them as a JSON list with 'text', 'type', 'start', 'end', and 'score' fields.

User Message:

Text: [YOUR ARTICLE TEXT HERE]
Entities: 

Option 2: Direct API/Python Integration

If your platform requires a single concatenated prompt:

prompt = """<|start_header_id|>system<|end_header_id|>

Extract all named entities from the following text. Return them as a JSON list with 'text', 'type', 'start', 'end', and 'score' fields.<|eot_id|><|start_header_id|>user<|end_header_id|>

Text: {your_text_here}
Entities: <|eot_id|><|start_header_id|>assistant<|end_header_id|>"""
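
If you are working with the transformers tokenizer, apply_chat_template should produce an equivalent prompt without hand-assembling special tokens, assuming the repository ships the standard Llama-3 chat template (note that the template also prepends <|begin_of_text|>, which the hand-written string above omits):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("shreyasmeher/ConflLlama-NER-TOI")
messages = [
    {"role": "system", "content": "Extract all named entities from the following text. Return them as a JSON list with 'text', 'type', 'start', 'end', and 'score' fields."},
    {"role": "user", "content": "Text: {your_text_here}\nEntities: "},
]
# add_generation_prompt=True appends the assistant header so the model starts answering
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)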

Example Usage

Input:

Text: The Punjab government has placed under suspension two officials following the Bathinda incident. Chief Minister Amarinder Singh announced the decision in Chandigarh on Monday.
Entities:

Expected Output:

[
  {"text": "Punjab government", "type": "Organisation", "start": 4, "end": 21, "score": 0.995},
  {"text": "Bathinda", "type": "Location", "start": 78, "end": 86, "score": 0.980},
  {"text": "Amarinder Singh", "type": "Person", "start": 112, "end": 127, "score": 0.998},
  {"text": "Chandigarh", "type": "Location", "start": 154, "end": 164, "score": 0.992},
  {"text": "Monday", "type": "Temporal", "start": 168, "end": 174, "score": 0.975}
]

(start and end are 0-indexed character offsets into the input text, with end exclusive.)

Installation & Usage

Using Llama.cpp (Recommended for Local Deployment)

# Download the Q4_K_M GGUF model
wget https://huggingface.co/shreyasmeher/ConflLlama-NER-TOI-GGUF/resolve/main/model-unsloth-Q4_K_M.gguf

# Run with llama.cpp
./llama-cli -m model-unsloth-Q4_K_M.gguf \
  --system "Extract all named entities from the following text. Return them as a JSON list with 'text', 'type', 'start', 'end', and 'score' fields." \
  --prompt "Text: [YOUR TEXT]\nEntities: " \
  --temp 0.3 \
  --n-predict 512

Using Transformers (Python)

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import json

# Load model and tokenizer
model_name = "shreyasmeher/ConflLlama-NER-TOI"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Prepare input with correct formatting
text = "The Reserve Bank of India announced new policies in Mumbai on Tuesday."
prompt = f"""<|start_header_id|>system<|end_header_id|>

Extract all named entities from the following text. Return them as a JSON list with 'text', 'type', 'start', 'end', and 'score' fields.<|eot_id|><|start_header_id|>user<|end_header_id|>

Text: {text}
Entities: <|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

# Generate (enable sampling so the temperature setting actually takes effect)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.3,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

# Decode only the newly generated tokens so the prompt is not echoed back
generated = outputs[0][inputs["input_ids"].shape[1]:]
result = tokenizer.decode(generated, skip_special_tokens=True)
# Parse JSON output (wrap in try/except in production; see Limitations)
entities = json.loads(result.strip())
print(entities)

Using Unsloth (For Further Fine-tuning)

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="shreyasmeher/ConflLlama-NER-TOI",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

# Enable inference mode
FastLanguageModel.for_inference(model)

# Generate (reuse the formatted prompt from the Transformers example above)
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.3, do_sample=True)
result = tokenizer.batch_decode(outputs)[0]

Intended Use

This model is designed for research and analysis of news and information extraction:

  1. News Analytics: Extract structured information from news articles at scale
  2. Information Retrieval: Build searchable databases of entities from news corpora
  3. Knowledge Graph Construction: Create entity relationship networks from news
  4. Content Analysis: Identify key actors, locations, and organizations in news coverage
  5. Media Monitoring: Track entity mentions across large news datasets
  6. Social Science Research: Analyze patterns in news coverage and entity representation

Example Applications

  • Building entity databases from news archives
  • Tracking political figures and organizations in media
  • Analyzing geographic focus of news coverage
  • Preprocessing for downstream NLP tasks (summarization, Q&A, classification)
  • Creating entity timelines and co-occurrence networks (see the sketch below)
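
As an illustration of the last item, co-occurrence networks can be built directly from the model's JSON output; a minimal sketch, where entity_lists is a hypothetical list of per-article extraction results:

from collections import Counter
from itertools import combinations

def cooccurrence_counts(entity_lists):
    # Count how often two entity surface forms appear in the same article
    counts = Counter()
    for entities in entity_lists:
        names = sorted({e["text"] for e in entities})
        counts.update(combinations(names, 2))
    return counts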

Limitations

  1. Training Data Scope:

    • Trained primarily on Indian news from Times of India
    • May show bias toward Indian entities, locations, and naming conventions
    • Performance on news from other countries/regions may vary
  2. Entity Type Coverage:

    • Limited to 4 entity types (Location, Organisation, Person, Temporal)
    • Does not recognize: Events, Products, Diseases, Currencies, etc.
    • No nested or overlapping entity support
  3. Training Completeness:

    • Model only saw ~42% of training data (5,000 steps vs ~11,875 for full epoch)
    • Loss plateaued at 0.90, suggesting potential for improvement with longer training
    • Early stopping may have left performance gains on the table
  4. Annotation Quality:

    • Training labels generated by ConfliBERT model (not human-annotated)
    • Inherits any biases or errors from the annotation model
    • Confidence threshold (≥0.9) may have excluded valid but uncertain entities
  5. Context Window:

    • Articles truncated to 510 tokens (~400-500 words)
    • Entities appearing later in long articles were not seen during training
  6. Language: English-only, optimized for Indian English conventions

  7. Output Parsing: The model outputs JSON, but may occasionally produce malformed JSON. Implement robust parsing with error handling (see the sketch after this list).

  8. Prompt Sensitivity: Critical limitation - requires exact prompt formatting. Deviations will significantly degrade performance.
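
For the output-parsing caveat in item 7, a minimal defensive parser (illustrative; adapt the fallback behavior to your application):

import json
import re

def parse_entities(generation: str) -> list:
    # Grab the first JSON array in the output; fall back to an empty list
    match = re.search(r"\[.*\]", generation, re.DOTALL)
    if match is None:
        return []
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return []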


Performance Considerations

Expected Behavior

  • Strong Performance: Indian locations, political organizations, Hindi-origin names
  • Moderate Performance: International entities, technical/specialized terms
  • Potential Challenges:
    • Ambiguous entity boundaries in complex phrases
    • Entities with unconventional capitalization
    • Code-mixed text (Hindi-English)
    • Very long or very short articles

Recommended Use Cases

Good Fit:

  • Extracting entities from Indian news articles
  • Processing Times of India or similar English-language Indian news
  • Large-scale batch processing of news corpora
  • Research requiring entities with confidence scores

Poor Fit:

  • Real-time critical applications (e.g., emergency response)
  • Non-news text (scientific papers, social media, legal documents)
  • Languages other than English
  • Applications requiring 100% precision

Ethical Considerations

Responsible Use

  1. Training Data Provenance:

    • Model trained on Times of India articles without explicit permission
    • Users should verify compliance with data usage policies for their applications
  2. Bias and Representation:

    • Training data reflects editorial decisions and biases of Times of India
    • May overrepresent certain geographic regions, political perspectives, or demographics
    • Entity recognition accuracy may vary across different communities
  3. Privacy Concerns:

    • Model extracts names of real individuals from text
    • Users must handle personal information in compliance with privacy laws (GDPR, etc.)
    • Do not use for surveillance or unauthorized profiling
  4. Quality and Errors:

    • Automated annotations (ConfliBERT) may contain errors
    • Critical applications should validate entity extractions manually
    • Do not use for legal, medical, or safety-critical decisions
  5. Dual-Use Potential: While designed for research, entity extraction could support:

    • ❌ Surveillance or profiling of individuals/groups
    • ❌ Manipulation of public discourse
    • ❌ Unauthorized data harvesting
    • ✅ Academic research and journalistic analysis (intended use)

Transparency Requirements

Users should:

  • Disclose when findings are based on automated NER
  • Report model limitations in publications
  • Validate critical findings with manual review or alternative methods
  • Cite both the model and the ConfliBERT annotation source

Comparison with Original ConflLlama-NER

| Aspect | ConflLlama-NER (CAMEO) | ConflLlama-NER-TOI (This Model) |
|---|---|---|
| Training Data | 1,094 sentences | 100,000 articles |
| Domain | Conflict events | General news |
| Entity Types | Source, Target, Related (role-based) | Location, Organisation, Person, Temporal (type-based) |
| Annotation | Manual (CAMEO-coded) | Automated (ConfliBERT) |
| Geographic Focus | International conflicts | Indian news |
| Use Case | Political violence research | General information extraction |
| Training Steps | 2,000 | 5,000 |
| Scale | Specialized, small | General, large |

When to use which:

  • CAMEO version: Conflict analysis, event coding, semantic role extraction
  • TOI version (this): General news analysis, entity databases, broad NER tasks

Citation

@article{meher2025confllama,
  title={ConflLlama: Domain-specific adaptation of large language models for conflict event classification},
  author={Meher, Shreyas and Brandt, Patrick T.},
  journal={Research \& Politics},
  volume={12},
  number={3},
  year={2025},
  publisher={SAGE Publications},
  doi={10.1177/20531680251356282}
}

Acknowledgments

  • Funding: NSF Award 2311142
  • Computing Resources: Delta system at NCSA (University of Illinois) through ACCESS allocation CIS220162
  • Base Model: Unsloth team for Llama-3.1 8B Instruct optimizations
  • Annotation Model: Event Data UTD team for ConfliBERT
  • Data Source: Times of India corpus
  • Infrastructure: Hugging Face for model hosting and transformers library

Model Versions

This repository contains the following model variants:

  • 16-bit merged: Unquantized merged weights, largest file size, highest quality
  • 4-bit quantized: Smaller, faster inference with minimal quality loss
  • GGUF (Q4_K_M, Q8_0): Optimized for llama.cpp deployment
  • LoRA adapters: For efficient storage and further fine-tuning

Future Work

Potential improvements for future iterations:

  1. Complete Training: Run for 2-3 full epochs to reduce loss below 0.90
  2. Hyperparameter Tuning: Increase learning rate, warmup, and LoRA rank
  3. Enable Evaluation: Add validation set monitoring for optimal checkpoint selection
  4. Expand Entity Types: Include Events, Quantities, Miscellaneous categories
  5. Multilingual Support: Add Hindi, other Indian languages
  6. Human Validation: Sample-based quality assessment of ConfliBERT annotations
  7. Cross-Domain Testing: Evaluate on non-Indian news sources

License

Apache 2.0 - See LICENSE for details.


Related Resources


Contact

For questions, issues, or collaboration inquiries:

