ConflLlama-NER-TOI: Large-Scale Named Entity Recognition for Indian News



⚠️ Important: Read Before Using

This model requires exact prompt formatting to work correctly. Please read the Critical: Inference & Prompt Formatting section below before attempting to use this model. Using incorrect prompts will result in poor performance or hallucinations.


ConflLlama-NER-TOI is a large-scale named entity recognition model fine-tuned on 100,000 Indian news articles from the Times of India. Built upon Llama-3.1 8B Instruct, this model identifies and classifies four entity types across diverse news domains:

  • Location: Geographic entities (cities, regions, countries)
  • Organisation: Companies, government bodies, institutions, political parties
  • Person: Named individuals, public figures, officials
  • Temporal: Time expressions, dates, periods

This model represents a significant scale-up from the original ConflLlama-NER, trained on 1,300x more data for broader domain coverage and improved generalization.


Key Features

  • Large-Scale Training: Fine-tuned on 100,000 news articles with high-confidence entity annotations
  • Multi-Domain Coverage: Trained on diverse news topics including politics, sports, crime, economy, and entertainment
  • ConfliBERT Annotations: Uses a state-of-the-art entity recognition model for training data generation
  • High-Quality Filtering: Only entities with ≥0.9 confidence score included in training
  • JSON Output Format: Returns structured entity lists with text, type, position, and confidence scores
  • Efficient Deployment: Available in multiple precision and quantization formats (Q4_K_M, Q8_0, BF16)
  • Instruction-Tuned: Built on Llama-3.1 Instruct for robust prompt following

Training Data

Dataset: Times of India Corpus with ConfliBERT Annotations

The model was trained on a large-scale dataset derived from Indian news articles:

  • Total Articles: ~1.5 million Times of India articles
  • Training Samples: 100,000 (randomly sampled)
  • Training/Test Split: 95,000 train / 5,000 test (95/5 split)
  • Text Processing: Articles truncated to 510 tokens (~400-500 words) to fit context limits
  • Annotation Method: Automated using eventdata-utd/conflibert-named-entity-recognition
  • Quality Filter: Only entities with a confidence score ≥0.9 included (see the annotation sketch after this list)
  • Total Entities: Hundreds of thousands of high-confidence annotations
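
A minimal sketch of how such annotations can be generated and filtered with the standard transformers pipeline; the exact truncation and label-mapping logic used to build the dataset is not published, so treat the details here (word-based truncation, aggregation strategy) as assumptions:

from transformers import pipeline

# ConfliBERT model cited above as the source of the training labels
ner = pipeline(
    "ner",
    model="eventdata-utd/conflibert-named-entity-recognition",
    aggregation_strategy="simple",  # merge word pieces into whole entity spans
)

def annotate(article: str) -> list:
    # Rough stand-in for the 510-token truncation described above
    truncated = " ".join(article.split()[:450])
    # Keep only high-confidence entities, mirroring the >=0.9 filter
    return [e for e in ner(truncated) if e["score"] >= 0.9]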

Entity Type Definitions

| Entity Type | Description | Examples |
|---|---|---|
| Location | Geographic entities including cities, states, countries, regions | "Mumbai", "Punjab", "India", "United States" |
| Organisation | Companies, government bodies, institutions, political parties, agencies | "Congress", "Reserve Bank of India", "Google", "Supreme Court" |
| Person | Named individuals, politicians, celebrities, officials | "Narendra Modi", "Virat Kohli", "Amit Shah" |
| Temporal | Time expressions, dates, periods, durations | "Monday", "2023", "last week", "evening" |

Data Coverage

The training data spans multiple news domains:

  • Politics: Government actions, elections, policy announcements
  • Crime: Police reports, legal proceedings, incidents
  • Sports: Matches, players, tournaments
  • Economy: Business news, market reports, financial decisions
  • Entertainment: Celebrity news, film industry, cultural events
  • Local News: City-level events, regional developments

Model Architecture

  • Base Model: unsloth/llama-3-8b-Instruct-bnb-4bit
  • Fine-tuning Method: QLoRA (Quantized Low-Rank Adaptation)
  • Quantization: 4-bit with bitsandbytes
  • Maximum Sequence Length: 2048 tokens
  • LoRA Configuration (reproduced in the sketch after this list):
    • Rank (r): 16
    • Alpha (lora_alpha): 16
    • Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
    • Dropout: 0
    • Gradient Checkpointing: Enabled (Unsloth optimization)
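
A minimal Unsloth sketch reproducing this adapter setup (hyperparameters copied from the list above; everything else is illustrative):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # Unsloth's memory-saving checkpointing
)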

Training Configuration

  • Optimizer: AdamW 8-bit (memory efficient)
  • Learning Rate: 2e-4 with 3% warmup (150 steps)
  • Batch Size: 2 per device with 4 gradient accumulation steps (effective batch size: 8)
  • Training Steps: 5,000 steps (~42% of one epoch)
  • Precision: BFloat16 (when supported)
  • Hardware: NVIDIA A100-SXM4-40GB GPU on NCSA Delta
  • Training Time: ~3-4 hours
  • Memory Footprint: ~8 GB VRAM
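
This configuration maps onto a standard TRL SFTTrainer run; a hedged sketch (argument names vary across TRL versions, `dataset` stands in for the 95,000 formatted training samples, and the output path is arbitrary):

from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,  # hypothetical: formatted prompt/response pairs
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size: 8
        warmup_steps=150,               # 3% of 5,000 steps
        max_steps=5000,
        learning_rate=2e-4,
        optim="adamw_8bit",
        lr_scheduler_type="linear",
        bf16=True,
        output_dir="outputs",
    ),
)
trainer.train()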

Training Dynamics

  • Initial Loss: ~2.5
  • Final Loss: ~0.90 (plateau observed)
  • Gradient Norm: Stable at 0.25-0.35 throughout training
  • Learning Rate Schedule: Linear decay from 2e-4 to 0

Critical: Inference & Prompt Formatting

IMPORTANT: This model requires exact prompt formatting to function correctly. The model was instruction-tuned with a specific template. Deviating from this format will result in poor performance or hallucinations.

Option 1: Using LM Studio / Llama.cpp (Recommended)

System Prompt:

Extract all named entities from the following text. Return them as a JSON list with 'text', 'type', 'start', 'end', and 'score' fields.

User Message:

Text: [YOUR ARTICLE TEXT HERE]
Entities: 

Option 2: Direct API/Python Integration

If your platform requires a single concatenated prompt:

prompt = """<|start_header_id|>system<|end_header_id|>

Extract all named entities from the following text. Return them as a JSON list with 'text', 'type', 'start', 'end', and 'score' fields.<|eot_id|><|start_header_id|>user<|end_header_id|>

Text: {your_text_here}
Entities: <|eot_id|><|start_header_id|>assistant<|end_header_id|>"""
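
If you are working with the transformers tokenizer, apply_chat_template should produce an equivalent prompt without hand-assembling special tokens, assuming the repository ships the standard Llama-3 chat template (note that the template also prepends <|begin_of_text|>, which the hand-written string above omits):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("shreyasmeher/ConflLlama-NER-TOI")
messages = [
    {"role": "system", "content": "Extract all named entities from the following text. Return them as a JSON list with 'text', 'type', 'start', 'end', and 'score' fields."},
    {"role": "user", "content": "Text: {your_text_here}\nEntities: "},
]
# add_generation_prompt=True appends the assistant header so the model starts answering
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)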

Example Usage

Input:

Text: The Punjab government has placed under suspension two officials following the Bathinda incident. Chief Minister Amarinder Singh announced the decision in Chandigarh on Monday.
Entities:

Expected Output:

[
  {"text": "Punjab government", "type": "Organisation", "start": 4, "end": 21, "score": 0.995},
  {"text": "Bathinda", "type": "Location", "start": 78, "end": 86, "score": 0.980},
  {"text": "Amarinder Singh", "type": "Person", "start": 112, "end": 127, "score": 0.998},
  {"text": "Chandigarh", "type": "Location", "start": 154, "end": 164, "score": 0.992},
  {"text": "Monday", "type": "Temporal", "start": 168, "end": 174, "score": 0.975}
]

(start and end are 0-indexed character offsets into the input text, with end exclusive.)

Installation & Usage

Using Llama.cpp (Recommended for Local Deployment)

# Download the Q4_K_M GGUF model
wget https://huggingface.co/shreyasmeher/ConflLlama-NER-TOI-GGUF/resolve/main/model-unsloth-Q4_K_M.gguf

# Run with llama.cpp
./llama-cli -m model-unsloth-Q4_K_M.gguf \
  --system "Extract all named entities from the following text. Return them as a JSON list with 'text', 'type', 'start', 'end', and 'score' fields." \
  --prompt "Text: [YOUR TEXT]\nEntities: " \
  --temp 0.3 \
  --n-predict 512

Using Transformers (Python)

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import json

# Load model and tokenizer
model_name = "shreyasmeher/ConflLlama-NER-TOI"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Prepare input with correct formatting
text = "The Reserve Bank of India announced new policies in Mumbai on Tuesday."
prompt = f"""<|start_header_id|>system<|end_header_id|>

Extract all named entities from the following text. Return them as a JSON list with 'text', 'type', 'start', 'end', and 'score' fields.<|eot_id|><|start_header_id|>user<|end_header_id|>

Text: {text}
Entities: <|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

# Generate (enable sampling so the temperature setting actually takes effect)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.3,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

# Decode only the newly generated tokens so the prompt is not echoed back
generated = outputs[0][inputs["input_ids"].shape[1]:]
result = tokenizer.decode(generated, skip_special_tokens=True)
# Parse JSON output (wrap in try/except in production; see Limitations)
entities = json.loads(result.strip())
print(entities)

Using Unsloth (For Further Fine-tuning)

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="shreyasmeher/ConflLlama-NER-TOI",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

# Enable inference mode
FastLanguageModel.for_inference(model)

# Generate (reuse the formatted prompt from the Transformers example above)
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.3, do_sample=True)
result = tokenizer.batch_decode(outputs)[0]

Intended Use

This model is designed for research and analysis of news and information extraction:

  1. News Analytics: Extract structured information from news articles at scale
  2. Information Retrieval: Build searchable databases of entities from news corpora
  3. Knowledge Graph Construction: Create entity relationship networks from news
  4. Content Analysis: Identify key actors, locations, and organizations in news coverage
  5. Media Monitoring: Track entity mentions across large news datasets
  6. Social Science Research: Analyze patterns in news coverage and entity representation

Example Applications

  • Building entity databases from news archives
  • Tracking political figures and organizations in media
  • Analyzing geographic focus of news coverage
  • Preprocessing for downstream NLP tasks (summarization, Q&A, classification)
  • Creating entity timelines and co-occurrence networks (see the sketch below)
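
As an illustration of the last item, co-occurrence networks can be built directly from the model's JSON output; a minimal sketch, where entity_lists is a hypothetical list of per-article extraction results:

from collections import Counter
from itertools import combinations

def cooccurrence_counts(entity_lists):
    # Count how often two entity surface forms appear in the same article
    counts = Counter()
    for entities in entity_lists:
        names = sorted({e["text"] for e in entities})
        counts.update(combinations(names, 2))
    return counts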

Limitations

  1. Training Data Scope:

    • Trained primarily on Indian news from Times of India
    • May show bias toward Indian entities, locations, and naming conventions
    • Performance on news from other countries/regions may vary
  2. Entity Type Coverage:

    • Limited to 4 entity types (Location, Organisation, Person, Temporal)
    • Does not recognize: Events, Products, Diseases, Currencies, etc.
    • No nested or overlapping entity support
  3. Training Completeness:

    • Model only saw ~42% of training data (5,000 steps vs ~11,875 for full epoch)
    • Loss plateaued at 0.90, suggesting potential for improvement with longer training
    • Early stopping may have left performance gains on the table
  4. Annotation Quality:

    • Training labels generated by ConfliBERT model (not human-annotated)
    • Inherits any biases or errors from the annotation model
    • Confidence threshold (≥0.9) may have excluded valid but uncertain entities
  5. Context Window:

    • Articles truncated to 510 tokens (~400-500 words)
    • Entities appearing later in long articles were not seen during training
  6. Language: English-only, optimized for Indian English conventions

  7. Output Parsing: The model outputs JSON, but may occasionally produce malformed JSON. Implement robust parsing with error handling (see the sketch after this list).

  8. Prompt Sensitivity: Critical limitation - requires exact prompt formatting. Deviations will significantly degrade performance.
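
For the output-parsing caveat in item 7, a minimal defensive parser (illustrative; adapt the fallback behavior to your application):

import json
import re

def parse_entities(generation: str) -> list:
    # Grab the first JSON array in the output; fall back to an empty list
    match = re.search(r"\[.*\]", generation, re.DOTALL)
    if match is None:
        return []
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return []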


Performance Considerations

Expected Behavior

  • Strong Performance: Indian locations, political organizations, Hindi-origin names
  • Moderate Performance: International entities, technical/specialized terms
  • Potential Challenges:
    • Ambiguous entity boundaries in complex phrases
    • Entities with unconventional capitalization
    • Code-mixed text (Hindi-English)
    • Very long or very short articles

Recommended Use Cases

Good Fit:

  • Extracting entities from Indian news articles
  • Processing Times of India or similar English-language Indian news
  • Large-scale batch processing of news corpora
  • Research requiring entities with confidence scores

Poor Fit:

  • Real-time critical applications (e.g., emergency response)
  • Non-news text (scientific papers, social media, legal documents)
  • Languages other than English
  • Applications requiring 100% precision

Ethical Considerations

Responsible Use

  1. Training Data Provenance:

    • Model trained on Times of India articles without explicit permission
    • Users should verify compliance with data usage policies for their applications
  2. Bias and Representation:

    • Training data reflects editorial decisions and biases of Times of India
    • May overrepresent certain geographic regions, political perspectives, or demographics
    • Entity recognition accuracy may vary across different communities
  3. Privacy Concerns:

    • Model extracts names of real individuals from text
    • Users must handle personal information in compliance with privacy laws (GDPR, etc.)
    • Do not use for surveillance or unauthorized profiling
  4. Quality and Errors:

    • Automated annotations (ConfliBERT) may contain errors
    • Critical applications should validate entity extractions manually
    • Do not use for legal, medical, or safety-critical decisions
  5. Dual-Use Potential: While designed for research, entity extraction could support:

    • ❌ Surveillance or profiling of individuals/groups
    • ❌ Manipulation of public discourse
    • ❌ Unauthorized data harvesting
    • ✅ Academic research and journalistic analysis (intended use)

Transparency Requirements

Users should:

  • Disclose when findings are based on automated NER
  • Report model limitations in publications
  • Validate critical findings with manual review or alternative methods
  • Cite both the model and the ConfliBERT annotation source

Comparison with Original ConflLlama-NER

| Aspect | ConflLlama-NER (CAMEO) | ConflLlama-NER-TOI (This Model) |
|---|---|---|
| Training Data | 1,094 sentences | 100,000 articles |
| Domain | Conflict events | General news |
| Entity Types | Source, Target, Related (role-based) | Location, Organisation, Person, Temporal (type-based) |
| Annotation | Manual (CAMEO-coded) | Automated (ConfliBERT) |
| Geographic Focus | International conflicts | Indian news |
| Use Case | Political violence research | General information extraction |
| Training Steps | 2,000 | 5,000 |
| Scale | Specialized, small | General, large |

When to use which:

  • CAMEO version: Conflict analysis, event coding, semantic role extraction
  • TOI version (this): General news analysis, entity databases, broad NER tasks

Citation

@article{meher2025confllama,
  title={ConflLlama: Domain-specific adaptation of large language models for conflict event classification},
  author={Meher, Shreyas and Brandt, Patrick T.},
  journal={Research \& Politics},
  volume={12},
  number={3},
  year={2025},
  publisher={SAGE Publications},
  doi={10.1177/20531680251356282}
}

Acknowledgments

  • Funding: NSF Award 2311142
  • Computing Resources: Delta system at NCSA (University of Illinois) through ACCESS allocation CIS220162
  • Base Model: Unsloth team for Llama-3.1 8B Instruct optimizations
  • Annotation Model: Event Data UTD team for ConfliBERT
  • Data Source: Times of India corpus
  • Infrastructure: Hugging Face for model hosting and transformers library

Model Versions

This repository contains the following model variants:

  • 16-bit merged: Unquantized merged weights, largest file size, highest quality
  • 4-bit quantized: Smaller, faster inference with minimal quality loss
  • GGUF (Q4_K_M, Q8_0): Optimized for llama.cpp deployment
  • LoRA adapters: For efficient storage and further fine-tuning

Future Work

Potential improvements for future iterations:

  1. Complete Training: Run for 2-3 full epochs to reduce loss below 0.90
  2. Hyperparameter Tuning: Increase learning rate, warmup, and LoRA rank
  3. Enable Evaluation: Add validation set monitoring for optimal checkpoint selection
  4. Expand Entity Types: Include Events, Quantities, Miscellaneous categories
  5. Multilingual Support: Add Hindi, other Indian languages
  6. Human Validation: Sample-based quality assessment of ConfliBERT annotations
  7. Cross-Domain Testing: Evaluate on non-Indian news sources

License

Apache 2.0 - See LICENSE for details.


Related Resources


Contact

For questions, issues, or collaboration inquiries:

