---
base_model: unsloth/llama-3-8b-Instruct-bnb-4bit
license: apache-2.0
language:
- en
tags:
- llama.cpp
- gguf
- quantized
- q4_k_m
- q8_0
- named-entity-recognition
- ner
- news-analysis
- indian-news
- conflibert
---

# ConflLlama-NER-TOI: Large-Scale Named Entity Recognition for Indian News

---

## ⚠️ Important: Read Before Using

**This model requires exact prompt formatting to work correctly.** Please read the [Critical: Inference & Prompt Formatting](#critical-inference--prompt-formatting) section below before using this model. Incorrect prompts lead to poor performance or hallucinated output.

---

**ConflLlama-NER-TOI** is a large-scale named entity recognition model fine-tuned on **100,000 Indian news articles** from the Times of India. Built on **Llama 3 8B Instruct** (`unsloth/llama-3-8b-Instruct-bnb-4bit`), it identifies and classifies four entity types across diverse news domains:

- **Location**: Geographic entities (cities, regions, countries)
- **Organisation**: Companies, government bodies, institutions, political parties
- **Person**: Named individuals, public figures, officials
- **Temporal**: Time expressions, dates, periods

This model is a substantial scale-up from the original ConflLlama-NER: it was trained on roughly 90x more examples (100,000 articles versus 1,094 sentences) for broader domain coverage and improved generalization.
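
Downstream code usually needs to confirm that each returned record carries one of the four entity types and all of the JSON fields this card describes (`text`, `type`, `start`, `end`, `score`). The following is a minimal sketch, not part of the released code: `parse_entities`, `ENTITY_TYPES`, and `REQUIRED_FIELDS` are hypothetical names, and the choice to return an empty list on malformed output is an assumption made for illustration.

```python
import json
from typing import Any

# The four entity types named in this model card.
ENTITY_TYPES = {"Location", "Organisation", "Person", "Temporal"}
# The JSON fields the card says each entity record contains.
REQUIRED_FIELDS = ("text", "type", "start", "end", "score")


def parse_entities(raw: str) -> list[dict[str, Any]]:
    """Extract and validate the entity list from raw model output.

    Hypothetical helper: returns [] when no valid JSON list is found.
    """
    # Keep only the outermost [...] span so stray text around the JSON is ignored.
    first, last = raw.find("["), raw.rfind("]")
    if first == -1 or last <= first:
        return []
    try:
        items = json.loads(raw[first:last + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []

    valid = []
    for item in items:
        # Drop records that are not objects, miss a field, or use an unknown type.
        if not isinstance(item, dict):
            continue
        if any(field not in item for field in REQUIRED_FIELDS):
            continue
        if item["type"] not in ENTITY_TYPES:
            continue
        valid.append(item)
    return valid
```

Silently dropping invalid records keeps batch jobs running; a stricter pipeline might log or raise instead, depending on how much malformed output it can tolerate.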

---

## Key Features

- **Large-Scale Training**: Fine-tuned on 100,000 news articles with high-confidence entity annotations
- **Multi-Domain Coverage**: Trained on diverse news topics including politics, sports, crime, economy, and entertainment
- **ConfliBERT Annotations**: Training labels were generated with the ConfliBERT named entity recognition model
- **High-Quality Filtering**: Only entities with a confidence score ≥0.9 were included in training
- **JSON Output Format**: Returns structured entity lists with text, type, position, and confidence scores
- **Efficient Deployment**: Available in multiple formats (Q4_K_M, Q8_0, BF16)
- **Instruction-Tuned**: Built on Llama 3 Instruct for robust prompt following

---

## Training Data

### Dataset: Times of India Corpus with ConfliBERT Annotations

The model was trained on a large-scale dataset derived from Indian news articles:

- **Total Articles**: ~1.5 million Times of India articles
- **Training Samples**: 100,000 (randomly sampled)
- **Training/Test Split**: 95,000 train / 5,000 test (95/5 split)
- **Text Processing**: Articles truncated to 510 tokens (~400-500 words) to fit context limits
- **Annotation Method**: Automated using `eventdata-utd/conflibert-named-entity-recognition`
- **Quality Filter**: Only entities with confidence score ≥0.9 included
- **Total Entities**: Hundreds of thousands of high-confidence annotations

### Entity Type Definitions

| Entity Type | Description | Examples |
|-------------|-------------|----------|
| **Location** | Geographic entities including cities, states, countries, regions | "Mumbai", "Punjab", "India", "United States" |
| **Organisation** | Companies, government bodies, institutions, political parties, agencies | "Congress", "Reserve Bank of India", "Google", "Supreme Court" |
| **Person** | Named individuals, politicians, celebrities, officials | "Narendra Modi", "Virat Kohli", "Amit Shah" |
| **Temporal** | Time expressions, dates, periods, durations | "Monday", "2023", "last week", "evening" |

### Data Coverage

The training data spans multiple news domains:

- **Politics**: Government actions, elections, policy announcements
- **Crime**: Police reports, legal proceedings, incidents
- **Sports**: Matches, players, tournaments
- **Economy**: Business news, market reports, financial decisions
- **Entertainment**: Celebrity news, film industry, cultural events
- **Local News**: City-level events, regional developments

---

## Model Architecture

- **Base Model**: `unsloth/llama-3-8b-Instruct-bnb-4bit`
- **Fine-tuning Method**: QLoRA (Quantized Low-Rank Adaptation)
- **Quantization**: 4-bit with bitsandbytes
- **Maximum Sequence Length**: 2048 tokens
- **LoRA Configuration**:
  - Rank (r): 16
  - Alpha (lora_alpha): 16
  - Target Modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
  - Dropout: 0
  - Gradient Checkpointing: Enabled (Unsloth optimization)

### Training Configuration

- **Optimizer**: AdamW 8-bit (memory efficient)
- **Learning Rate**: 2e-4 with 3% warmup (150 steps)
- **Batch Size**: 2 per device with 4 gradient accumulation steps (effective batch size: 8)
- **Training Steps**: 5,000 steps (~42% of one epoch)
- **Precision**: BFloat16 (when supported)
- **Hardware**: NVIDIA A100-SXM4-40GB GPU on NCSA Delta
- **Training Time**: ~3-4 hours
- **Memory Footprint**: ~8 GB VRAM

### Training Dynamics

- **Initial Loss**: ~2.5
- **Final Loss**: ~0.90 (plateau observed)
- **Gradient Norm**: Stable at 0.25-0.35 throughout training
- **Learning Rate Schedule**: Linear decay from 2e-4 to 0

---

## Critical: Inference & Prompt Formatting

**IMPORTANT**: This model requires **exact prompt formatting** to function correctly. It was instruction-tuned with a specific template, and deviating from that template leads to poor performance or hallucinated output.

### Option 1: Using LM Studio / Llama.cpp (Recommended)

**System Prompt:**

```
Extract all named entities from the following text. Return them as a JSON list with 'text', 'type', 'start', 'end', and 'score' fields.
```

**User Message:**

```
Text: [YOUR ARTICLE TEXT HERE]
Entities:
```

### Option 2: Direct API/Python Integration

If your platform requires a single concatenated prompt:

```python
# Header lines are followed by a blank line, as in the standard Llama 3 chat template.
# Fill in the article with prompt.format(your_text_here=...).
prompt = """<|start_header_id|>system<|end_header_id|>

Extract all named entities from the following text. Return them as a JSON list with 'text', 'type', 'start', 'end', and 'score' fields.<|eot_id|><|start_header_id|>user<|end_header_id|>

Text: {your_text_here}
Entities: <|eot_id|><|start_header_id|>assistant<|end_header_id|>"""
```

### Example Usage

**Input:**

```
Text: The Punjab government has placed under suspension two officials following the Bathinda incident. Chief Minister Amarinder Singh announced the decision in Chandigarh on Monday.
Entities:
```

**Expected Output:**

```json
[
  {"text": "Punjab government", "type": "Organisation", "start": 4, "end": 21, "score": 0.995},
  {"text": "Bathinda", "type": "Location", "start": 78, "end": 86, "score": 0.980},
  {"text": "Amarinder Singh", "type": "Person", "start": 112, "end": 127, "score": 0.998},
  {"text": "Chandigarh", "type": "Location", "start": 154, "end": 164, "score": 0.992},
  {"text": "Monday", "type": "Temporal", "start": 168, "end": 174, "score": 0.975}
]
```

---

## Installation & Usage

### Using Llama.cpp (Recommended for Local Deployment)

```bash
# Download the Q4_K_M GGUF model
wget https://huggingface.co/shreyasmeher/ConflLlama-NER-TOI-GGUF/resolve/main/model-unsloth-Q4_K_M.gguf

# Run with llama.cpp
./llama-cli -m model-unsloth-Q4_K_M.gguf \
  --system "Extract all named entities from the following text. Return them as a JSON list with 'text', 'type', 'start', 'end', and 'score' fields." \
  --prompt "Text: [YOUR TEXT]\nEntities: " \
  --temp 0.3 \
  --n-predict 512
```

### Using Transformers (Python)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import json

# Load model and tokenizer
model_name = "shreyasmeher/ConflLlama-NER-TOI"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Prepare input with correct formatting
text = "The Reserve Bank of India announced new policies in Mumbai on Tuesday."
prompt = f"""<|start_header_id|>system<|end_header_id|>

Extract all named entities from the following text. Return them as a JSON list with 'text', 'type', 'start', 'end', and 'score' fields.<|eot_id|><|start_header_id|>user<|end_header_id|>

Text: {text}
Entities: <|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

# Generate with greedy decoding (temperature has no effect when do_sample=False, so it is omitted)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id
)

# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
result = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# Parse JSON output; the model may occasionally emit malformed JSON
try:
    entities = json.loads(result)
except json.JSONDecodeError:
    entities = []
print(entities)
```

### Using Unsloth (For Further Fine-tuning)

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="shreyasmeher/ConflLlama-NER-TOI",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

# Enable inference mode
FastLanguageModel.for_inference(model)

# Generate; `prompt` is the formatted prompt string built as in the Transformers example above
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.3)
result = tokenizer.batch_decode(outputs)
```

---

## Intended Use

This model is designed for **research and analysis** of news and information extraction:

1. **News Analytics**: Extract structured information from news articles at scale
2. **Information Retrieval**: Build searchable databases of entities from news corpora
3. **Knowledge Graph Construction**: Create entity relationship networks from news
4. **Content Analysis**: Identify key actors, locations, and organizations in news coverage
5. **Media Monitoring**: Track entity mentions across large news datasets
6. **Social Science Research**: Analyze patterns in news coverage and entity representation

### Example Applications

- Building entity databases from news archives
- Tracking political figures and organizations in media
- Analyzing geographic focus of news coverage
- Preprocessing for downstream NLP tasks (summarization, Q&A, classification)
- Creating entity timelines and co-occurrence networks

---

## Limitations

1. **Training Data Scope**:
   - Trained primarily on **Indian news** from Times of India
   - May show bias toward Indian entities, locations, and naming conventions
   - Performance on news from other countries/regions may vary
2. **Entity Type Coverage**:
   - Limited to 4 entity types (Location, Organisation, Person, Temporal)
   - Does not recognize: Events, Products, Diseases, Currencies, etc.
   - No nested or overlapping entity support
3. **Training Completeness**:
   - Model only saw ~42% of the training data (5,000 steps vs ~11,875 for a full epoch)
   - Loss plateaued at 0.90, suggesting potential for improvement with longer training
   - Stopping at a fixed step budget may have left performance gains unrealized
4. **Annotation Quality**:
   - Training labels generated by the ConfliBERT model (not human-annotated)
   - Inherits any biases or errors from the annotation model
   - Confidence threshold (≥0.9) may have excluded valid but uncertain entities
5. **Context Window**:
   - Articles truncated to 510 tokens (~400-500 words)
   - Entities appearing later in long articles were not seen during training
6. **Language**: English-only, optimized for Indian English conventions
7. **Output Parsing**: Model outputs JSON, but may occasionally produce malformed JSON.
   Implement robust parsing with error handling.
8. **Prompt Sensitivity**: **Critical limitation** - requires exact prompt formatting. Deviations will significantly degrade performance.

---

## Performance Considerations

### Expected Behavior

- **Strong Performance**: Indian locations, political organizations, Hindi-origin names
- **Moderate Performance**: International entities, technical/specialized terms
- **Potential Challenges**:
  - Ambiguous entity boundaries in complex phrases
  - Entities with unconventional capitalization
  - Code-mixed text (Hindi-English)
  - Very long or very short articles

### Recommended Use Cases

✅ **Good Fit:**

- Extracting entities from Indian news articles
- Processing Times of India or similar English-language Indian news
- Large-scale batch processing of news corpora
- Research requiring entities with confidence scores

❌ **Poor Fit:**

- Real-time critical applications (e.g., emergency response)
- Non-news text (scientific papers, social media, legal documents)
- Languages other than English
- Applications requiring 100% precision

---

## Ethical Considerations

### Responsible Use

1. **Training Data Provenance**:
   - Model trained on Times of India articles without explicit permission
   - Users should verify compliance with data usage policies for their applications
2. **Bias and Representation**:
   - Training data reflects editorial decisions and biases of Times of India
   - May overrepresent certain geographic regions, political perspectives, or demographics
   - Entity recognition accuracy may vary across different communities
3. **Privacy Concerns**:
   - Model extracts names of real individuals from text
   - Users must handle personal information in compliance with privacy laws (GDPR, etc.)
   - Do not use for surveillance or unauthorized profiling
4. **Quality and Errors**:
   - Automated annotations (ConfliBERT) may contain errors
   - Critical applications should validate entity extractions manually
   - Do not use for legal, medical, or safety-critical decisions
5. **Dual-Use Potential**: While designed for research, entity extraction could support:
   - ❌ Surveillance or profiling of individuals/groups
   - ❌ Manipulation of public discourse
   - ❌ Unauthorized data harvesting
   - ✅ Academic research and journalistic analysis (intended use)

### Transparency Requirements

Users should:

- Disclose when findings are based on automated NER
- Report model limitations in publications
- Validate critical findings with manual review or alternative methods
- Cite both the model and the ConfliBERT annotation source

---

## Comparison with Original ConflLlama-NER

| Aspect | ConflLlama-NER (CAMEO) | ConflLlama-NER-TOI (This Model) |
|--------|------------------------|----------------------------------|
| **Training Data** | 1,094 sentences | 100,000 articles |
| **Domain** | Conflict events | General news |
| **Entity Types** | Source, Target, Related (role-based) | Location, Org, Person, Temporal (type-based) |
| **Annotation** | Manual (CAMEO-coded) | Automated (ConfliBERT) |
| **Geographic Focus** | International conflicts | Indian news |
| **Use Case** | Political violence research | General information extraction |
| **Training Steps** | 2,000 | 5,000 |
| **Scale** | Specialized, small | General, large |

**When to use which:**

- **CAMEO version**: Conflict analysis, event coding, semantic role extraction
- **TOI version (this)**: General news analysis, entity databases, broad NER tasks

---

## Citation

```bibtex
@article{meher2025confllama,
  title={ConflLlama: Domain-specific adaptation of large language models for conflict event classification},
  author={Meher, Shreyas and Brandt, Patrick T.},
  journal={Research \& Politics},
  volume={12},
  number={3},
  year={2025},
  publisher={SAGE Publications},
  doi={10.1177/20531680251356282}
}
```

---

## Acknowledgments

- **Funding**: NSF Award 2311142
- **Computing Resources**: Delta system at NCSA (University of Illinois) through ACCESS allocation CIS220162
- **Base Model**: [Unsloth](https://github.com/unslothai/unsloth) team for Llama 3 8B Instruct optimizations
- **Annotation Model**: Event Data UTD team for ConfliBERT
- **Data Source**: Times of India corpus
- **Infrastructure**: Hugging Face for model hosting and transformers library

---

## Model Versions

This repository contains the following model variants:

- **16-bit merged**: Unquantized weights, largest file size, highest quality
- **4-bit quantized**: Smaller, faster inference with minimal quality loss
- **GGUF (Q4_K_M, Q8_0)**: Optimized for llama.cpp deployment
- **LoRA adapters**: For efficient storage and further fine-tuning

---

## Future Work

Potential improvements for future iterations:

1. **Complete Training**: Run for 2-3 full epochs to reduce loss below 0.90
2. **Hyperparameter Tuning**: Increase learning rate, warmup, and LoRA rank
3. **Enable Evaluation**: Add validation set monitoring for optimal checkpoint selection
4. **Expand Entity Types**: Include Events, Quantities, Miscellaneous categories
5. **Multilingual Support**: Add Hindi, other Indian languages
6. **Human Validation**: Sample-based quality assessment of ConfliBERT annotations
7. **Cross-Domain Testing**: Evaluate on non-Indian news sources

---

## License

Apache 2.0 - See [LICENSE](LICENSE) for details.

---

## Related Resources

- **Original ConflLlama (Attack Classification)**: [shreyasmeher/confllama](https://huggingface.co/shreyasmeher/confllama)
- **ConflLlama-NER (CAMEO version)**: [shreyasmeher/confllama-ner-sft](https://huggingface.co/shreyasmeher/confllama-ner-sft)
- **ConfliBERT Annotation Model**: [eventdata-utd/conflibert-named-entity-recognition](https://huggingface.co/eventdata-utd/conflibert-named-entity-recognition)
- **Research Paper**: [https://doi.org/10.1177/20531680251356282](https://doi.org/10.1177/20531680251356282)
- **Unsloth Framework**: [https://github.com/unslothai/unsloth](https://github.com/unslothai/unsloth)

---

## Contact

For questions, issues, or collaboration inquiries:

- **GitHub Issues**: [Repository Issues](https://github.com/shreyasmeher/confllama)
- **Hugging Face**: [@shreyasmeher](https://huggingface.co/shreyasmeher)
- **Email**: shreyas.meher@utdallas.edu

---