---
language: en
library_name: transformers
pipeline_tag: token-classification
tags:
- ner
- token-classification
- cybersecurity
- threat-intelligence
- secureBert
license: mit
metrics:
- accuracy
base_model:
- answerdotai/ModernBERT-large
---

# Model Overview

**SecureModernBERT-NER** represents a new generation of cybersecurity-focused language models, combining the **state-of-the-art architecture of ModernBERT** with one of the **largest and most diverse CTI-labelled NER corpora ever built**.

Unlike conventional NER systems, SecureModernBERT-NER recognises **22 finely grained, security-specific entity types**, covering the full spectrum of cyber-threat intelligence, from `THREAT-ACTOR` and `MALWARE` to `CVE`, `IPV4`, `DOMAIN`, and `REGISTRY-KEYS`. Trained on more than **half a million manually curated spans** sourced from real-world threat reports, vulnerability advisories, and incident analyses, it achieves an exceptional balance of **accuracy, generalisation, and contextual depth**.

This model is designed to **parse complex security narratives with high precision**, extracting both contextual metadata (e.g., `ORG`, `PRODUCT`, `PLATFORM`) and highly technical indicators (e.g., hashes such as `MD5` and `SHA256`, `URL`, and network addresses such as `IPV4` and `IPV6`) within a single unified framework.

SecureModernBERT-NER sets a new standard for **automated CTI entity recognition**, enabling the next wave of **threat-intelligence automation, enrichment, and analytics**.

## Quick Start

```python
from transformers import pipeline

model_id = "attack-vector/SecureModernBERT-NER"
pipe = pipeline(
    task="token-classification",
    model=model_id,
    tokenizer=model_id,
    aggregation_strategy="first",
)

text = "TrickBot connects to hxxp://185.222.202.55 to exfiltrate data from Windows hosts."
predictions = pipe(text)
for pred in predictions:
    print(pred)
```

Sample output:

```
{'entity_group': 'MALWARE', 'score': np.float32(0.9615546), 'word': 'TrickBot', 'start': 0, 'end': 8}
{'entity_group': 'URL', 'score': np.float32(0.9905957), 'word': ' hxxp://185.222.202.55', 'start': 20, 'end': 42}
{'entity_group': 'PLATFORM', 'score': np.float32(0.92317337), 'word': ' Windows', 'start': 66, 'end': 74}
```

## Intended Use & Limitations

- **Use cases:** automated tagging of CTI reports, IOC extraction pipelines, knowledge-base enrichment, security-focused RAG systems.
- **Languages:** English (the model was trained and evaluated on English sources only).
- **Input format:** free-form prose or long-form CTI articles; maximum sequence length was 128 tokens during training.
- **Limitations:** noisy or ambiguous extractions may occur, especially with rare entity types (`IPV6`, `EMAIL`) and obfuscated strings. The model neither normalises entities (e.g., deobfuscating `hxxp`) nor validates indicator authenticity. Always pair it with downstream validation and human review.

## Training Data

- **Size:** 502,726 labelled text spans before filtering; 22 distinct entity classes in BIO format.
- **Label distribution (spans):** `ORG` (approx. 198k), `PRODUCT` (approx. 79k), `MALWARE` (approx. 67k), `PLATFORM` (approx. 57k), `THREAT-ACTOR` (approx. 49k), `SERVICE` (approx. 46k), `CVE` (approx. 41k), `LOC` (approx. 38k), `SECTOR` (approx. 34k), `TOOL` (approx. 29k), plus indicator types such as `URL`, `IPV4`, `SHA256`, `MD5`, and `REGISTRY-KEYS`.
- **Pre-processing:** JSONL articles were tokenised and converted to BIO tags; conflicting spans were resolved manually and via automated heuristics before upload.

## Label Mapping

| Label | Description | Example mention |
|-------|-------------|-----------------|
| URL | Web address or obfuscated link used in campaigns. | `hxxp://185.222.202.55` |
| ORG | Organisations such as companies, CERTs, or research groups. | `Microsoft Threat Intelligence` |
| SERVICE | Online or cloud services referenced in attacks. | `Google Ads` |
| SECTOR | Industry sectors or verticals targeted. | `critical infrastructure` |
| FILEPATH | File system paths observed in malware samples. | `C:\Windows\System32\svchost.exe` |
| DOMAIN | Fully qualified domains or subdomains. | `malicious-domain[.]com` |
| PLATFORM | Operating systems or computing platforms. | `Windows Server` |
| THREAT-ACTOR | Named adversary groups or aliases. | `LockBit` |
| PRODUCT | Commercial or open-source software products. | `VMware ESXi` |
| MALWARE | Malware families, strains, or toolkits. | `TrickBot` |
| LOC | Countries, cities, or regions. | `United States` |
| CVE | CVE identifiers for vulnerabilities. | `CVE-2023-23397` |
| TOOL | Legitimate or dual-use tools leveraged in incidents. | `Cobalt Strike` |
| IPV4 | IPv4 addresses. | `185.222.202.55` |
| MITRE-TACTIC | MITRE ATT&CK tactic categories. | `Credential Access` |
| MD5 | MD5 cryptographic hashes. | `d41d8cd98f00b204e9800998ecf8427e` |
| CAMPAIGN | Named operations or campaigns. | `Operation Cronos` |
| SHA1 | SHA-1 hashes. | `da39a3ee5e6b4b0d3255bfef95601890afd80709` |
| SHA256 | SHA-256 hashes. | `9e107d9d372bb6826bd81d3542a419d6...` |
| EMAIL | Email addresses. | `alerts@example.com` |
| IPV6 | IPv6 addresses. | `2001:0db8:85a3:0000:0000:8a2e:0370:7334` |
| REGISTRY-KEYS | Windows registry keys or paths. | `HKLM\Software\Microsoft\Windows\CurrentVersion\Run` |

## Training Procedure

- **Base model:** [`answerdotai/ModernBERT-large`](https://huggingface.co/answerdotai/ModernBERT-large).
- **Hardware:** single NVIDIA L40S instance (8 vCPU / 62 GB RAM / 48 GB VRAM).
- **Optimisation setup:** mixed precision `fp16`, `adamw_torch` optimiser, cosine learning-rate scheduler, gradient accumulation `1`.
- **Key hyperparameters:** learning rate `5e-5`, batch size `128`, epochs `5`, maximum sequence length `128`.
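
The hyperparameters above map roughly onto a Hugging Face `TrainingArguments` configuration. The sketch below is illustrative only: `output_dir` is a hypothetical path, and the exact AutoTrain invocation (dataset wiring, `Trainer` setup) is not reproduced here.

```python
from transformers import TrainingArguments

# Sketch of a Trainer configuration matching the reported hyperparameters.
# output_dir is hypothetical; AutoTrain's actual invocation may differ.
args = TrainingArguments(
    output_dir="securemodernbert-ner",  # hypothetical
    learning_rate=5e-5,
    per_device_train_batch_size=128,
    num_train_epochs=5,
    fp16=True,                          # mixed precision
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    gradient_accumulation_steps=1,
)
```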
| Parameter | Value |
|-----------|-------|
| Mixed precision | `fp16` |
| Batch size | `128` |
| Learning rate | `5e-5` |
| Optimiser | `adamw_torch` |
| Scheduler | `cosine` |
| Epochs | `5` |
| Gradient accumulation | `1` |
| Max sequence length | `128` |

## Evaluation

AutoTrain reports the following micro-averaged metrics on its validation split (`seqeval` scoring at the entity level):

| Metric | Score |
|-----------|--------|
| Precision | 0.8468 |
| Recall | 0.8484 |
| F1 | 0.8476 |
| Accuracy | 0.9589 |

An independent re-evaluation against a consolidated CTI set (same taxonomy as this model) produced the label-level accuracy breakdown below. These scores are macro-averaged across labels and therefore not numerically comparable to the micro metrics above, but they provide insight into class balance and span quality.

| Label | Used | Accuracy |
|-------|------|----------|
| CAMPAIGN | 1,817 | 0.7980 |
| CVE | 28,293 | 0.9995 |
| DOMAIN | 12,182 | 0.8878 |
| EMAIL | 731 | 0.8495 |
| FILEPATH | 13,889 | 0.7957 |
| IPV4 | 1,164 | 0.9631 |
| IPV6 | 563 | 0.7425 |
| LOC | 7,915 | 0.9557 |
| MALWARE | 10,405 | 0.9087 |
| MD5 | 389 | 0.9100 |
| MITRE-TACTIC | 2,181 | 0.7093 |
| ORG | 36,324 | 0.9301 |
| PLATFORM | 8,036 | 0.8977 |
| PRODUCT | 18,720 | 0.8432 |
| REGISTRY-KEYS | 1,589 | 0.8490 |
| SECTOR | 6,453 | 0.8309 |
| SERVICE | 8,533 | 0.8179 |
| SHA1 | 222 | 0.9189 |
| SHA256 | 2,146 | 0.9874 |
| THREAT-ACTOR | 9,532 | 0.9418 |
| TOOL | 4,874 | 0.7895 |
| URL | 7,470 | 0.9801 |

- **Macro accuracy:** 0.8776

Because micro vs. macro averaging and dataset composition differ, expect numerical gaps between the two evaluations even though both describe the same checkpoint.

## External Benchmarks

The following tables report detailed results on a shared CTI validation set.
**Do not compare the per-label values across models directly:** each checkpoint uses a different taxonomy or remapping strategy, so accuracy percentages can be misleading when labels are aligned or collapsed differently. Use the per-model tables to understand performance within a single schema, and interpret macro-accuracy scores with caution.

### CyberPeace-Institute/SecureBERT-NER

| Label | Used | Accuracy |
|-------|------|----------|
| ACT | 3,945 | 0.1706 |
| APT | 9,518 | 0.5331 |
| DOM | 10,694 | 0.0196 |
| EMAIL | 731 | 0.0000 |
| FILE | 31,864 | 0.0747 |
| IP | 1,251 | 0.0088 |
| LOC | 7,895 | 0.8711 |
| MAL | 10,341 | 0.6076 |
| MD5 | 354 | 0.8672 |
| O | 16,275 | 0.4700 |
| OS | 7,974 | 0.6598 |
| SECTEAM | 36,083 | 0.3509 |
| SHA1 | 191 | 0.0209 |
| SHA2 | 1,647 | 0.9709 |
| TOOL | 4,816 | 0.4043 |
| URL | 6,997 | 0.0795 |
| VULID | 27,586 | 0.3849 |

- **Macro accuracy:** 0.3820

### PranavaKailash/CyNER-2.0-DeBERTa-v3-base

| Label | Used | Accuracy |
|-------|------|----------|
| Indicator | 35,936 | 0.7878 |
| Location | 7,895 | 0.0113 |
| Malware | 12,125 | 0.7800 |
| O | 2,896 | 0.7652 |
| Organization | 42,537 | 0.6556 |
| System | 35,063 | 0.7259 |
| TOOL | 4,820 | 0.0000 |
| Threat Group | 9,522 | 0.0000 |
| Vulnerability | 27,673 | 0.1876 |

- **Macro accuracy:** 0.4348

### cisco-ai/SecureBERT2.0-NER

| Label | Used | Accuracy |
|-------|------|----------|
| Indicator | 35,789 | 0.8854 |
| Malware | 16,926 | 0.6204 |
| O | 10,786 | 0.6813 |
| Organization | 51,993 | 0.5579 |
| System | 34,955 | 0.6600 |
| Vulnerability | 27,525 | 0.2552 |

- **Macro accuracy:** 0.6100

## Responsible Use

- Confirm entity detections before acting on indicators (e.g., automated blocking).
- Combine with enrichment and scoring systems to filter false positives.
- Monitor for drift if applying the model to new domains (e.g., non-English sources, informal channels).
- Respect the licensing and confidentiality of any proprietary CTI sources used for inference.
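
As noted in the limitations, the model emits surface forms verbatim, including defanged indicators such as `hxxp://` or `[.]`. A minimal post-processing sketch for refanging extracted indicators before downstream validation (the helper name and rewrite rules are illustrative, not part of the model):

```python
import re

def refang(indicator: str) -> str:
    """Normalise common CTI defanging conventions (illustrative, extend as needed)."""
    out = indicator.strip()
    # hxxp:// and hxxps:// back to http(s)://
    out = re.sub(r"(?i)\bhxxp(s?)://", r"http\1://", out)
    # bracketed separators back to their literal characters
    out = out.replace("[.]", ".").replace("(.)", ".").replace("[:]", ":")
    return out

# Applied to the Quick Start prediction spans:
print(refang("hxxp://185.222.202.55"))   # http://185.222.202.55
print(refang("malicious-domain[.]com"))  # malicious-domain.com
```

Refanged output should still pass through indicator validation (syntax checks, allow-lists) before any automated action, per the guidance above.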
## Support & Connect

* ❤️ **Like the repo** if you found it useful
* ☕ **Support me:** Say thanks by buying me a coffee! [https://buymeacoffee.com/juanmcristobal](https://buymeacoffee.com/juanmcristobal)
* 💼 **Open to work:** [https://www.linkedin.com/in/jmcristobal/](https://www.linkedin.com/in/jmcristobal/)

If you use SecureModernBERT-NER in a project, feel free to share it in the Discussions/Issues; I love seeing real-world use cases.

## Citation

If you find this model useful, please cite the repository and the base model:

```
@software{securemodernbert_ner_2025,
  author    = {Juan Manuel Cristóbal Moreno},
  title     = {SecureModernBERT-NER: Cyber Threat Intelligence Named Entity Recogniser},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/attack-vector/SecureModernBERT-NER}
}
```

## Contact

Questions or feedback? Open an issue on the Hugging Face model repository or reach out at [`@juanmcristobal`](https://huggingface.co/juanmcristobal).