---
language: en
library_name: transformers
pipeline_tag: token-classification
tags:
- ner
- token-classification
- cybersecurity
- threat-intelligence
- secureBert
license: mit
metrics:
- accuracy
base_model:
- answerdotai/ModernBERT-large
---
# Model Overview
**SecureModernBERT-NER** is a cybersecurity-focused token-classification model that combines the **ModernBERT architecture** with one of the **largest and most diverse CTI-labelled NER corpora built to date**.
Unlike general-purpose NER systems, SecureModernBERT-NER recognises **22 fine-grained, security-specific entity types** covering the full spectrum of cyber-threat intelligence, from `THREAT-ACTOR` and `MALWARE` to `CVE`, `IPV4`, `DOMAIN`, and `REGISTRY-KEYS`.
Trained on more than **half a million manually curated spans** sourced from real-world threat reports, vulnerability advisories, and incident analyses, it balances **accuracy, generalisation, and contextual depth**.
The model is designed to **parse complex security narratives**, extracting both contextual metadata (e.g., `ORG`, `PRODUCT`, `PLATFORM`) and highly technical indicators (e.g., `MD5`/`SHA256` hashes, `URL`s, `IPV4`/`IPV6` addresses) within a single unified framework.
SecureModernBERT-NER is intended as a foundation for **automated CTI entity recognition**, supporting **threat-intelligence automation, enrichment, and analytics**.
## Quick Start
```python
from transformers import pipeline

model_id = "attack-vector/SecureModernBERT-NER"
pipe = pipeline(
    task="token-classification",
    model=model_id,
    tokenizer=model_id,
    aggregation_strategy="first",
)

text = "TrickBot connects to hxxp://185.222.202.55 to exfiltrate data from Windows hosts."
predictions = pipe(text)
for pred in predictions:
    print(pred)
```
Sample output:
```
{'entity_group': 'MALWARE', 'score': np.float32(0.9615546), 'word': 'TrickBot', 'start': 0, 'end': 8}
{'entity_group': 'URL', 'score': np.float32(0.9905957), 'word': ' hxxp://185.222.202.55', 'start': 20, 'end': 42}
{'entity_group': 'PLATFORM', 'score': np.float32(0.92317337), 'word': ' Windows', 'start': 66, 'end': 74}
```
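For downstream pipelines it is often convenient to collapse these predictions into a simple label-to-mentions map. A minimal sketch using the sample output above (`group_entities` is an illustrative helper, not part of the model's API):

```python
from collections import defaultdict

def group_entities(predictions):
    """Group pipeline predictions into {entity_group: [surface strings]}."""
    grouped = defaultdict(list)
    for pred in predictions:
        # Aggregated pipeline words may carry a leading space; strip it.
        grouped[pred["entity_group"]].append(pred["word"].strip())
    return dict(grouped)

# Predictions as returned by the pipeline above (scores rounded)
predictions = [
    {"entity_group": "MALWARE", "score": 0.9616, "word": "TrickBot", "start": 0, "end": 8},
    {"entity_group": "URL", "score": 0.9906, "word": " hxxp://185.222.202.55", "start": 20, "end": 42},
    {"entity_group": "PLATFORM", "score": 0.9232, "word": " Windows", "start": 66, "end": 74},
]
print(group_entities(predictions))
# {'MALWARE': ['TrickBot'], 'URL': ['hxxp://185.222.202.55'], 'PLATFORM': ['Windows']}
```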
## Intended Use & Limitations
- **Use cases:** automated tagging of CTI reports, IOC extraction pipelines, knowledge-base enrichment, security-focused RAG systems.
- **Languages:** English (model was trained and evaluated on English sources only).
- **Input format:** free-form prose or long-form CTI articles; maximum sequence length 128 tokens during training.
- **Limitations:** noisy or ambiguous extractions may occur, especially with rare entity types (`IPV6`, `EMAIL`) and obfuscated strings. The model does not normalise entities (e.g., deobfuscating `hxxp`) nor validate indicator authenticity. Always pair with downstream validation and human review.
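Since the model leaves defanged indicators as-is, a small normalisation step is usually wanted before enrichment or blocking. A heuristic refang sketch (real feeds use many more defanging variants than the few handled here):

```python
import re

def refang(ioc):
    """Undo common defanging conventions (hxxp, [.], (.), [:]).
    Heuristic sketch only; extend for the conventions your feeds use."""
    ioc = re.sub(r"hxxp", "http", ioc, flags=re.IGNORECASE)
    return ioc.replace("[.]", ".").replace("(.)", ".").replace("[:]", ":")

print(refang("hxxp://185.222.202[.]55"))  # http://185.222.202.55
print(refang("malicious-domain[.]com"))   # malicious-domain.com
```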
## Training Data
- **Size:** 502,726 labelled text spans before filtering; 22 distinct entity classes in BIO format.
- **Label distribution (spans):** `ORG` (approx. 198k), `PRODUCT` (approx. 79k), `MALWARE` (approx. 67k), `PLATFORM` (approx. 57k), `THREAT-ACTOR` (approx. 49k), `SERVICE` (approx. 46k), `CVE` (approx. 41k), `LOC` (approx. 38k), `SECTOR` (approx. 34k), `TOOL` (approx. 29k), plus indicator types such as `URL`, `IPV4`, `SHA256`, `MD5`, and `REGISTRY-KEYS`.
- **Pre-processing:** JSONL articles were tokenised and converted to BIO tags; spans in conflict were resolved manually and via automated heuristics before upload.
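The span-to-BIO conversion step can be illustrated with a simplified, whitespace-tokenised sketch. The actual pre-processing pipeline is not published; this only demonstrates the BIO convention on character-offset spans:

```python
def spans_to_bio(text, spans):
    """Convert character-level (start, end, label) spans to BIO tags
    over whitespace tokens. Simplified illustration of the convention."""
    tags = []
    offset = 0
    for token in text.split():
        start = text.index(token, offset)
        end = start + len(token)
        offset = end
        label = "O"
        for s, e, lab in spans:
            if start >= s and end <= e:
                label = ("B-" if start == s else "I-") + lab
                break
        tags.append((token, label))
    return tags

print(spans_to_bio("Cobalt Strike beacon", [(0, 13, "TOOL")]))
# [('Cobalt', 'B-TOOL'), ('Strike', 'I-TOOL'), ('beacon', 'O')]
```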
## Label Mapping
| Label | Description | Example mention |
|-------|-------------|-----------------|
| URL | Web address or obfuscated link used in campaigns. | `hxxp://185.222.202.55` |
| ORG | Organisations such as companies, CERTs, or research groups. | `Microsoft Threat Intelligence` |
| SERVICE | Online or cloud services referenced in attacks. | `Google Ads` |
| SECTOR | Industry sectors or verticals targeted. | `critical infrastructure` |
| FILEPATH | File system paths observed in malware samples. | `C:\Windows\System32\svchost.exe` |
| DOMAIN | Fully qualified domains or subdomains. | `malicious-domain[.]com` |
| PLATFORM | Operating systems or computing platforms. | `Windows Server` |
| THREAT-ACTOR | Named adversary groups or aliases. | `LockBit` |
| PRODUCT | Commercial or open-source software products. | `VMware ESXi` |
| MALWARE | Malware families, strains, or toolkits. | `TrickBot` |
| LOC | Countries, cities, or regions. | `United States` |
| CVE | CVE identifiers for vulnerabilities. | `CVE-2023-23397` |
| TOOL | Legitimate or dual-use tools leveraged in incidents. | `Cobalt Strike` |
| IPV4 | IPv4 addresses. | `185.222.202.55` |
| MITRE-TACTIC | MITRE ATT&CK tactic categories. | `Credential Access` |
| MD5 | MD5 cryptographic hashes. | `d41d8cd98f00b204e9800998ecf8427e` |
| CAMPAIGN | Named operations or campaigns. | `Operation Cronos` |
| SHA1 | SHA-1 hashes. | `da39a3ee5e6b4b0d3255bfef95601890afd80709` |
| SHA256 | SHA-256 hashes. | `9e107d9d372bb6826bd81d3542a419d6...` |
| EMAIL | Email addresses. | `alerts@example.com` |
| IPV6 | IPv6 addresses. | `2001:0db8:85a3:0000:0000:8a2e:0370:7334` |
| REGISTRY-KEYS | Windows registry keys or paths. | `HKLM\Software\Microsoft\Windows\CurrentVersion\Run` |
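Because the model does not validate indicator syntax, extracted `CVE`, `IPV4`/`IPV6`, and hash entities benefit from a cheap syntactic filter before they enter downstream systems. A stdlib-only sketch (note that defanged values such as `185.222.202[.]55` must be refanged first):

```python
import ipaddress
import re

HASH_LENGTHS = {"MD5": 32, "SHA1": 40, "SHA256": 64}

def is_plausible(label, value):
    """Cheap syntactic check for indicator-style labels.
    Returns True for labels without a syntactic rule (e.g. ORG)."""
    if label == "CVE":
        return bool(re.fullmatch(r"CVE-\d{4}-\d{4,}", value, re.IGNORECASE))
    if label in ("IPV4", "IPV6"):
        try:
            return ipaddress.ip_address(value).version == (4 if label == "IPV4" else 6)
        except ValueError:
            return False
    if label in HASH_LENGTHS:
        return bool(re.fullmatch(rf"[0-9a-fA-F]{{{HASH_LENGTHS[label]}}}", value))
    return True  # prose-like labels: no syntactic rule

print(is_plausible("CVE", "CVE-2023-23397"))  # True
print(is_plausible("IPV4", "999.1.1.1"))      # False
```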
## Training Procedure
- **Base model:** [`answerdotai/ModernBERT-large`](https://huggingface.co/answerdotai/ModernBERT-large).
- **Hardware:** single Nvidia L40S instance (8 vCPU / 62 GB RAM / 48 GB VRAM).
- **Optimisation setup:** mixed precision `fp16`, optimiser `adamw_torch`, cosine learning-rate scheduler, gradient accumulation `1`.
- **Key hyperparameters:** learning rate `5e-5`, batch size `128`, epochs `5`, maximum sequence length `128`.
| Parameter | Value |
|-----------|-------|
| Mixed precision | `fp16` |
| Batch size | `128` |
| Learning rate | `5e-5` |
| Optimiser | `adamw_torch` |
| Scheduler | `cosine` |
| Epochs | `5` |
| Gradient accumulation | `1` |
| Max sequence length | `128` |
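Given the 128-token training length, long CTI articles should be windowed before inference. A naive word-based sketch with overlap so entities straddling a boundary appear whole in at least one window (the 90-words-per-window figure is a rough assumption; subword-to-word ratios vary by text, so tune against your tokenizer):

```python
def chunk_text(text, max_words=90, overlap=20):
    """Split text into overlapping word windows so each stays within
    the model's 128-token training length. Assumes ~90 words fit in
    128 subword tokens, which is corpus-dependent."""
    words = text.split()
    step = max_words - overlap
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, max(len(words) - overlap, 1), step)
    ]

article = " ".join(f"w{i}" for i in range(200))
print(len(chunk_text(article)))  # 3
```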
## Evaluation
AutoTrain reports the following micro-averaged metrics on its validation split (seqeval entity scoring):
| Metric | Score |
|------------|--------|
| Precision | 0.8468 |
| Recall | 0.8484 |
| F1 | 0.8476 |
| Accuracy | 0.9589 |
An independent re-evaluation against a consolidated CTI set (same taxonomy as this model) produced the label-level accuracy breakdown below. These scores are macro-averaged across labels and therefore are not numerically comparable to the micro metrics above, but they provide insight into class balance and span quality.
| Label | Used | Accuracy |
|-------|------|----------|
| CAMPAIGN | 1,817 | 0.7980 |
| CVE | 28,293 | 0.9995 |
| DOMAIN | 12,182 | 0.8878 |
| EMAIL | 731 | 0.8495 |
| FILEPATH | 13,889 | 0.7957 |
| IPV4 | 1,164 | 0.9631 |
| IPV6 | 563 | 0.7425 |
| LOC | 7,915 | 0.9557 |
| MALWARE | 10,405 | 0.9087 |
| MD5 | 389 | 0.9100 |
| MITRE-TACTIC | 2,181 | 0.7093 |
| ORG | 36,324 | 0.9301 |
| PLATFORM | 8,036 | 0.8977 |
| PRODUCT | 18,720 | 0.8432 |
| REGISTRY-KEYS | 1,589 | 0.8490 |
| SECTOR | 6,453 | 0.8309 |
| SERVICE | 8,533 | 0.8179 |
| SHA1 | 222 | 0.9189 |
| SHA256 | 2,146 | 0.9874 |
| THREAT-ACTOR | 9,532 | 0.9418 |
| TOOL | 4,874 | 0.7895 |
| URL | 7,470 | 0.9801 |
- **Macro accuracy:** 0.8776
Because micro- vs macro-averaging and dataset composition differ, expect numerical gaps between the two evaluations even though both describe the same checkpoint. The AutoTrain metrics above were computed with the `seqeval` micro-average at the entity level, whereas the per-label table reports span-level accuracy macro-averaged across labels.
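For reference, seqeval-style entity-level micro averaging counts a prediction as correct only when both the label and the span boundaries match exactly. A dependency-free sketch of that scoring, assuming well-formed BIO sequences:

```python
def extract_spans(tags):
    """Collect (label, start, end) entity spans from a BIO tag sequence.
    Assumes well-formed BIO (every entity opens with B-)."""
    spans, label, start = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if label is not None:
                spans.append((label, start, i))
            label, start = tag[2:], i
        elif tag.startswith("I-") and label == tag[2:]:
            continue  # entity continues
        else:  # "O" or label mismatch closes the current entity
            if label is not None:
                spans.append((label, start, i))
            label, start = None, None
    if label is not None:
        spans.append((label, start, len(tags)))
    return set(spans)

def micro_f1(true_seqs, pred_seqs):
    """Entity-level micro F1: exact label + boundary match counts as TP."""
    tp = fp = fn = 0
    for t, p in zip(true_seqs, pred_seqs):
        gold, pred = extract_spans(t), extract_spans(p)
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

print(micro_f1([["B-MALWARE", "O", "B-PLATFORM"]],
               [["B-MALWARE", "O", "O"]]))  # 0.666...
```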
## External Benchmarks
The following tables report detailed results on a shared CTI validation set. **Do not compare the per-label values across models directly:** each checkpoint uses a different taxonomy or remapping strategy, so accuracy percentages can be misleading when labels are aligned or collapsed differently. Use the per-model tables to understand performance within a single schema, and interpret macro-accuracy scores with caution.
### CyberPeace-Institute/SecureBERT-NER
| Label | Used | Accuracy |
|-------|------|----------|
| ACT | 3,945 | 0.1706 |
| APT | 9,518 | 0.5331 |
| DOM | 10,694 | 0.0196 |
| EMAIL | 731 | 0.0000 |
| FILE | 31,864 | 0.0747 |
| IP | 1,251 | 0.0088 |
| LOC | 7,895 | 0.8711 |
| MAL | 10,341 | 0.6076 |
| MD5 | 354 | 0.8672 |
| O | 16,275 | 0.4700 |
| OS | 7,974 | 0.6598 |
| SECTEAM | 36,083 | 0.3509 |
| SHA1 | 191 | 0.0209 |
| SHA2 | 1,647 | 0.9709 |
| TOOL | 4,816 | 0.4043 |
| URL | 6,997 | 0.0795 |
| VULID | 27,586 | 0.3849 |
- **Macro accuracy:** 0.3820
### PranavaKailash/CyNER-2.0-DeBERTa-v3-base
| Label | Used | Accuracy |
|-------|------|----------|
| Indicator | 35,936 | 0.7878 |
| Location | 7,895 | 0.0113 |
| Malware | 12,125 | 0.7800 |
| O | 2,896 | 0.7652 |
| Organization | 42,537 | 0.6556 |
| System | 35,063 | 0.7259 |
| TOOL | 4,820 | 0.0000 |
| Threat Group | 9,522 | 0.0000 |
| Vulnerability | 27,673 | 0.1876 |
- **Macro accuracy:** 0.4348
### cisco-ai/SecureBERT2.0-NER
| Label | Used | Accuracy |
|-------|------|----------|
| Indicator | 35,789 | 0.8854 |
| Malware | 16,926 | 0.6204 |
| O | 10,786 | 0.6813 |
| Organization | 51,993 | 0.5579 |
| System | 34,955 | 0.6600 |
| Vulnerability | 27,525 | 0.2552 |
- **Macro accuracy:** 0.6100
## Responsible Use
- Confirm entity detections before acting on indicators (e.g., automated blocking).
- Combine with enrichment and scoring systems to filter false positives.
- Monitor for drift if applying to new domains (e.g., non-English sources, informal channels).
- Respect licensing and confidentiality of any proprietary CTI sources used for inference.
## Support & Connect
* ❤️ **Like the repo** if you found it useful
* ☕ **Support me:** Say thanks by buying me a coffee! [https://buymeacoffee.com/juanmcristobal](https://buymeacoffee.com/juanmcristobal)
* 💼 **Open to work:** [https://www.linkedin.com/in/jmcristobal/](https://www.linkedin.com/in/jmcristobal/)
If you use SecureModernBERT-NER in a project, feel free to share it in the Discussions/Issues — I love seeing real-world use cases.
## Citation
If you find this model useful, please cite the repository and the base model:
```bibtex
@software{securemodernbert_ner_2025,
  author    = {Juan Manuel Cristóbal Moreno},
  title     = {SecureModernBERT-NER: Cyber Threat Intelligence Named Entity Recogniser},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/attack-vector/SecureModernBERT-NER}
}
```
## Contact
Questions or feedback? Open an issue on the Hugging Face model repository or reach out at [`@juanmcristobal`](https://huggingface.co/juanmcristobal).