---
language: en
library_name: transformers
pipeline_tag: token-classification
tags:
- ner
- token-classification
- cybersecurity
- threat-intelligence
- secureBert
license: mit
metrics:
- accuracy
base_model:
- answerdotai/ModernBERT-large
---

# Model Overview
|
**SecureModernBERT-NER** is a cybersecurity-focused language model that pairs the **state-of-the-art ModernBERT architecture** with one of the **largest and most diverse CTI-labelled NER corpora built to date**.

Unlike conventional NER systems, SecureModernBERT-NER recognises **22 fine-grained, security-specific entity types** covering the full spectrum of cyber-threat intelligence, from `THREAT-ACTOR` and `MALWARE` to `CVE`, `IPV4`, `DOMAIN`, and `REGISTRY-KEYS`.

Trained on more than **half a million manually curated spans** sourced from real-world threat reports, vulnerability advisories, and incident analyses, it balances **accuracy, generalisation, and contextual depth**.

The model is designed to **parse complex security narratives**, extracting both contextual metadata (e.g., `ORG`, `PRODUCT`, `PLATFORM`) and highly technical indicators (e.g., hashes, URLs, and network addresses) within a single unified framework.

SecureModernBERT-NER is built for **automated CTI entity recognition**, supporting threat-intelligence automation, enrichment, and analytics.
|
## Quick Start

```python
from transformers import pipeline

model_id = "attack-vector/SecureModernBERT-NER"

pipe = pipeline(
    task="token-classification",
    model=model_id,
    tokenizer=model_id,
    aggregation_strategy="first",
)

text = "TrickBot connects to hxxp://185.222.202.55 to exfiltrate data from Windows hosts."
predictions = pipe(text)
for pred in predictions:
    print(pred)
```
|
Sample output:

```
{'entity_group': 'MALWARE', 'score': np.float32(0.9615546), 'word': 'TrickBot', 'start': 0, 'end': 8}
{'entity_group': 'URL', 'score': np.float32(0.9905957), 'word': ' hxxp://185.222.202.55', 'start': 20, 'end': 42}
{'entity_group': 'PLATFORM', 'score': np.float32(0.92317337), 'word': ' Windows', 'start': 66, 'end': 74}
```
|
## Intended Use & Limitations

- **Use cases:** automated tagging of CTI reports, IOC extraction pipelines, knowledge-base enrichment, security-focused RAG systems.
- **Languages:** English (the model was trained and evaluated on English sources only).
- **Input format:** free-form prose or long-form CTI articles; the maximum sequence length during training was 128 tokens.
- **Limitations:** noisy or ambiguous extractions may occur, especially with rare entity types (`IPV6`, `EMAIL`) and obfuscated strings. The model neither normalises entities (e.g., deobfuscating `hxxp`) nor validates indicator authenticity; always pair it with downstream validation and human review. A sketch of such post-processing follows this list.
|
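The model leaves defanged indicators exactly as written, so pipelines typically refang them after extraction. A minimal sketch of that post-processing step (the `refang` helper is hypothetical, not part of this repository):

```python
import re

def refang(indicator: str) -> str:
    """Undo common CTI defanging conventions (illustrative only)."""
    indicator = re.sub(r"hxxp", "http", indicator, flags=re.IGNORECASE)
    return indicator.replace("[.]", ".").replace("[:]", ":")

print(refang("hxxp://185.222.202.55"))   # -> http://185.222.202.55
print(refang("malicious-domain[.]com"))  # -> malicious-domain.com
```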
## Training Data

- **Size:** 502,726 labelled text spans before filtering; 22 distinct entity classes in BIO format.
- **Label distribution (spans):** `ORG` (approx. 198k), `PRODUCT` (approx. 79k), `MALWARE` (approx. 67k), `PLATFORM` (approx. 57k), `THREAT-ACTOR` (approx. 49k), `SERVICE` (approx. 46k), `CVE` (approx. 41k), `LOC` (approx. 38k), `SECTOR` (approx. 34k), `TOOL` (approx. 29k), plus indicator types such as `URL`, `IPV4`, `SHA256`, `MD5`, and `REGISTRY-KEYS`.
- **Pre-processing:** JSONL articles were tokenised and converted to BIO tags; conflicting spans were resolved manually and via automated heuristics before upload. A toy illustration of the BIO conversion follows this list.
|
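For intuition, here is a toy version of the span-to-BIO conversion described above. The real pipeline's tokenisation and field layout are not published, so this sketch assumes word-level tokens and `(start, end, label)` span tuples:

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, label) token spans to BIO tags.

    Toy illustration only: assumes non-overlapping, word-level spans.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):     # any continuation tokens
            tags[i] = f"I-{label}"
    return tags

tokens = ["TrickBot", "connects", "to", "hxxp://185.222.202.55"]
print(spans_to_bio(tokens, [(0, 1, "MALWARE"), (3, 4, "URL")]))
# ['B-MALWARE', 'O', 'O', 'B-URL']
```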
## Label Mapping

| Label | Description | Example mention |
|-------|-------------|-----------------|
| URL | Web address or obfuscated link used in campaigns. | `hxxp://185.222.202.55` |
| ORG | Organisations such as companies, CERTs, or research groups. | `Microsoft Threat Intelligence` |
| SERVICE | Online or cloud services referenced in attacks. | `Google Ads` |
| SECTOR | Industry sectors or verticals targeted. | `critical infrastructure` |
| FILEPATH | File system paths observed in malware samples. | `C:\Windows\System32\svchost.exe` |
| DOMAIN | Fully qualified domains or subdomains. | `malicious-domain[.]com` |
| PLATFORM | Operating systems or computing platforms. | `Windows Server` |
| THREAT-ACTOR | Named adversary groups or aliases. | `LockBit` |
| PRODUCT | Commercial or open-source software products. | `VMware ESXi` |
| MALWARE | Malware families, strains, or toolkits. | `TrickBot` |
| LOC | Countries, cities, or regions. | `United States` |
| CVE | CVE identifiers for vulnerabilities. | `CVE-2023-23397` |
| TOOL | Legitimate or dual-use tools leveraged in incidents. | `Cobalt Strike` |
| IPV4 | IPv4 addresses. | `185.222.202.55` |
| MITRE-TACTIC | MITRE ATT&CK tactic categories. | `Credential Access` |
| MD5 | MD5 cryptographic hashes. | `d41d8cd98f00b204e9800998ecf8427e` |
| CAMPAIGN | Named operations or campaigns. | `Operation Cronos` |
| SHA1 | SHA-1 hashes. | `da39a3ee5e6b4b0d3255bfef95601890afd80709` |
| SHA256 | SHA-256 hashes. | `9e107d9d372bb6826bd81d3542a419d6...` |
| EMAIL | Email addresses. | `alerts@example.com` |
| IPV6 | IPv6 addresses. | `2001:0db8:85a3:0000:0000:8a2e:0370:7334` |
| REGISTRY-KEYS | Windows registry keys or paths. | `HKLM\Software\Microsoft\Windows\CurrentVersion\Run` |
|
## Training Procedure

- **Base model:** [`answerdotai/ModernBERT-large`](https://huggingface.co/answerdotai/ModernBERT-large).
- **Hardware:** single NVIDIA L40S instance (8 vCPU / 62 GB RAM / 48 GB VRAM).
- **Optimisation setup and key hyperparameters:** summarised in the table below; a sketch of the equivalent `transformers` configuration follows.

| Parameter | Value |
|-----------|-------|
| Mixed precision | `fp16` |
| Batch size | `128` |
| Learning rate | `5e-5` |
| Optimiser | `adamw_torch` |
| Scheduler | `cosine` |
| Epochs | `5` |
| Gradient accumulation | `1` |
| Max sequence length | `128` |
|
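The run was managed by AutoTrain, but the table maps naturally onto standard `transformers` settings. A minimal sketch of an equivalent configuration (dataset preparation and the `Trainer` call are omitted, and `num_labels=45` is an assumption derived from 22 entity types in BIO format plus `O`):

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    TrainingArguments,
)

base = "answerdotai/ModernBERT-large"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForTokenClassification.from_pretrained(
    base,
    num_labels=45,  # assumed: 22 entity types * (B- + I-) + "O"
)

# Mirrors the hyperparameter table above.
args = TrainingArguments(
    output_dir="securemodernbert-ner",
    learning_rate=5e-5,
    per_device_train_batch_size=128,
    num_train_epochs=5,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    gradient_accumulation_steps=1,
    fp16=True,
)
```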
## Evaluation

AutoTrain reports the following micro-averaged metrics on its validation split (seqeval entity scoring; a toy example of entity-level scoring follows the table):

| Metric | Score |
|-----------|--------|
| Precision | 0.8468 |
| Recall | 0.8484 |
| F1 | 0.8476 |
| Accuracy | 0.9589 |
|
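For reference, seqeval scores whole entities rather than individual tokens; a toy example with made-up tags:

```python
from seqeval.metrics import f1_score, precision_score, recall_score

# Two gold entities, one predicted (and correct): P = 1.0, R = 0.5.
y_true = [["B-MALWARE", "O", "O", "B-URL"]]
y_pred = [["B-MALWARE", "O", "O", "O"]]

print(precision_score(y_true, y_pred))  # 1.0
print(recall_score(y_true, y_pred))     # 0.5
print(f1_score(y_true, y_pred))         # 0.666...
```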
An independent re-evaluation against a consolidated CTI set (using the same taxonomy as this model) produced the label-level accuracy breakdown below. The summary score is macro-averaged across labels and is therefore not numerically comparable to the micro-averaged metrics above, but the breakdown gives insight into class balance and span quality.

| Label | Spans used | Accuracy |
|-------|------------|----------|
| CAMPAIGN | 1,817 | 0.7980 |
| CVE | 28,293 | 0.9995 |
| DOMAIN | 12,182 | 0.8878 |
| EMAIL | 731 | 0.8495 |
| FILEPATH | 13,889 | 0.7957 |
| IPV4 | 1,164 | 0.9631 |
| IPV6 | 563 | 0.7425 |
| LOC | 7,915 | 0.9557 |
| MALWARE | 10,405 | 0.9087 |
| MD5 | 389 | 0.9100 |
| MITRE-TACTIC | 2,181 | 0.7093 |
| ORG | 36,324 | 0.9301 |
| PLATFORM | 8,036 | 0.8977 |
| PRODUCT | 18,720 | 0.8432 |
| REGISTRY-KEYS | 1,589 | 0.8490 |
| SECTOR | 6,453 | 0.8309 |
| SERVICE | 8,533 | 0.8179 |
| SHA1 | 222 | 0.9189 |
| SHA256 | 2,146 | 0.9874 |
| THREAT-ACTOR | 9,532 | 0.9418 |
| TOOL | 4,874 | 0.7895 |
| URL | 7,470 | 0.9801 |

- **Macro accuracy:** 0.8776

Because the two evaluations differ in both averaging scheme (micro vs. macro) and dataset composition, expect numerical gaps between them even though both describe the same checkpoint.
|
## External Benchmarks

The following tables report detailed results on a shared CTI validation set. **Do not compare per-label values across models directly:** each checkpoint uses a different taxonomy or remapping strategy, so accuracy percentages can be misleading when labels are aligned or collapsed differently. Use each per-model table to understand performance within a single schema, and interpret the macro-accuracy scores with caution. The sketch after this paragraph illustrates why such remapping is inherently lossy.
|
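As an illustration of why cross-model numbers diverge, aligning this model's fine-grained taxonomy with a coarser external schema requires a mapping along these lines (the mapping below is hypothetical, not the one used for these benchmarks):

```python
# Hypothetical alignment of fine-grained labels onto a coarser schema.
REMAP = {
    "MALWARE": "Malware",
    "THREAT-ACTOR": "Threat Group",
    "CVE": "Vulnerability",
    "URL": "Indicator",
    "IPV4": "Indicator",
    "SHA256": "Indicator",
}

def to_coarse(label: str) -> str:
    # Labels without a counterpart collapse to "O", which skews accuracy.
    return REMAP.get(label, "O")

print(to_coarse("IPV4"))          # Indicator
print(to_coarse("MITRE-TACTIC"))  # O
```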
### CyberPeace-Institute/SecureBERT-NER

| Label | Spans used | Accuracy |
|-------|------------|----------|
| ACT | 3,945 | 0.1706 |
| APT | 9,518 | 0.5331 |
| DOM | 10,694 | 0.0196 |
| EMAIL | 731 | 0.0000 |
| FILE | 31,864 | 0.0747 |
| IP | 1,251 | 0.0088 |
| LOC | 7,895 | 0.8711 |
| MAL | 10,341 | 0.6076 |
| MD5 | 354 | 0.8672 |
| O | 16,275 | 0.4700 |
| OS | 7,974 | 0.6598 |
| SECTEAM | 36,083 | 0.3509 |
| SHA1 | 191 | 0.0209 |
| SHA2 | 1,647 | 0.9709 |
| TOOL | 4,816 | 0.4043 |
| URL | 6,997 | 0.0795 |
| VULID | 27,586 | 0.3849 |

- **Macro accuracy:** 0.3820
|
### PranavaKailash/CyNER-2.0-DeBERTa-v3-base

| Label | Spans used | Accuracy |
|-------|------------|----------|
| Indicator | 35,936 | 0.7878 |
| Location | 7,895 | 0.0113 |
| Malware | 12,125 | 0.7800 |
| O | 2,896 | 0.7652 |
| Organization | 42,537 | 0.6556 |
| System | 35,063 | 0.7259 |
| TOOL | 4,820 | 0.0000 |
| Threat Group | 9,522 | 0.0000 |
| Vulnerability | 27,673 | 0.1876 |

- **Macro accuracy:** 0.4348
|
### cisco-ai/SecureBERT2.0-NER

| Label | Spans used | Accuracy |
|-------|------------|----------|
| Indicator | 35,789 | 0.8854 |
| Malware | 16,926 | 0.6204 |
| O | 10,786 | 0.6813 |
| Organization | 51,993 | 0.5579 |
| System | 34,955 | 0.6600 |
| Vulnerability | 27,525 | 0.2552 |

- **Macro accuracy:** 0.6100
|
## Responsible Use

- Confirm entity detections before acting on indicators (e.g., automated blocking); a minimal confidence-gate sketch follows this list.
- Combine with enrichment and scoring systems to filter false positives.
- Monitor for drift when applying the model to new domains (e.g., non-English sources, informal channels).
- Respect the licensing and confidentiality of any proprietary CTI sources used for inference.
|
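As a starting point for the first item above, a minimal confidence gate over the Quick Start pipeline (the `0.90` threshold is an illustrative assumption; tune it on your own data):

```python
from transformers import pipeline

pipe = pipeline(
    task="token-classification",
    model="attack-vector/SecureModernBERT-NER",
    aggregation_strategy="first",
)

MIN_SCORE = 0.90  # illustrative threshold, not a recommendation

predictions = pipe("LockBit exploited CVE-2023-23397 against European targets.")
# Only pass high-confidence detections to automated actions;
# route everything else to human review.
actionable = [p for p in predictions if p["score"] >= MIN_SCORE]
for p in actionable:
    print(p["entity_group"], p["word"], round(float(p["score"]), 3))
```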
## Support & Connect

* ❤️ **Like the repo** if you found it useful.
* ☕ **Support me:** say thanks by buying me a coffee! [https://buymeacoffee.com/juanmcristobal](https://buymeacoffee.com/juanmcristobal)
* 💼 **Open to work:** [https://www.linkedin.com/in/jmcristobal/](https://www.linkedin.com/in/jmcristobal/)

If you use SecureModernBERT-NER in a project, feel free to share it in the Discussions/Issues — I love seeing real-world use cases.
|
## Citation

If you find this model useful, please cite the repository and the base model:

```
@software{securemodernbert_ner_2025,
  author = {Juan Manuel Cristóbal Moreno},
  title = {SecureModernBERT-NER: Cyber Threat Intelligence Named Entity Recogniser},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/attack-vector/SecureModernBERT-NER}
}
```
|
## Contact

Questions or feedback? Open an issue on the Hugging Face model repository or reach out at [`@juanmcristobal`](https://huggingface.co/juanmcristobal).