---
language: en
library_name: transformers
pipeline_tag: token-classification
tags:
- ner
- token-classification
- cybersecurity
- threat-intelligence
- secureBert
license: mit
metrics:
- accuracy
base_model:
- answerdotai/ModernBERT-large
---

# Model Overview

**SecureModernBERT-NER** combines the **ModernBERT architecture** with one of the **largest and most diverse CTI-labelled NER corpora built to date**.

Unlike conventional NER systems, SecureModernBERT-NER recognises **22 fine-grained, security-specific entity types**, covering the full spectrum of cyber-threat intelligence — from `THREAT-ACTOR` and `MALWARE` to `CVE`, `IPV4`, `DOMAIN`, and `REGISTRY-KEYS`.

Trained on more than **half a million manually curated spans** sourced from real-world threat reports, vulnerability advisories, and incident analyses, it balances **accuracy, generalisation, and contextual depth**.

The model is designed to **parse complex security narratives**, extracting both contextual metadata (e.g., `ORG`, `PRODUCT`, `PLATFORM`) and highly technical indicators (e.g., `SHA256`, `URL`, `IPV4`) within a single unified framework — enabling **threat-intelligence automation, enrichment, and analytics**.

## Quick Start

```python
from transformers import pipeline

model_id = "attack-vector/SecureModernBERT-NER"

pipe = pipeline(
    task="token-classification",
    model=model_id,
    tokenizer=model_id,
    aggregation_strategy="first",
)

text = "TrickBot connects to hxxp://185.222.202.55 to exfiltrate data from Windows hosts."
predictions = pipe(text)
for pred in predictions:
    print(pred)
```

Sample output:

```
{'entity_group': 'MALWARE', 'score': np.float32(0.9615546), 'word': 'TrickBot', 'start': 0, 'end': 8}
{'entity_group': 'URL', 'score': np.float32(0.9905957), 'word': ' hxxp://185.222.202.55', 'start': 20, 'end': 42}
{'entity_group': 'PLATFORM', 'score': np.float32(0.92317337), 'word': ' Windows', 'start': 66, 'end': 74}
```

## Intended Use & Limitations

- **Use cases:** automated tagging of CTI reports, IOC extraction pipelines, knowledge-base enrichment, security-focused RAG systems.
- **Languages:** English (model was trained and evaluated on English sources only).
- **Input format:** free-form prose or long-form CTI articles; maximum sequence length 128 tokens during training.
- **Limitations:** noisy or ambiguous extractions may occur, especially with rare entity types (`IPV6`, `EMAIL`) and obfuscated strings. The model does not normalise entities (e.g., deobfuscating `hxxp`) nor validate indicator authenticity. Always pair with downstream validation and human review.

## Training Data

- **Size:** 502,726 labelled text spans before filtering; 22 distinct entity classes in BIO format.
- **Label distribution (spans):** `ORG` (approx. 198k), `PRODUCT` (approx. 79k), `MALWARE` (approx. 67k), `PLATFORM` (approx. 57k), `THREAT-ACTOR` (approx. 49k), `SERVICE` (approx. 46k), `CVE` (approx. 41k), `LOC` (approx. 38k), `SECTOR` (approx. 34k), `TOOL` (approx. 29k), plus indicator types such as `URL`, `IPV4`, `SHA256`, `MD5`, and `REGISTRY-KEYS`.
- **Pre-processing:** JSONL articles were tokenised and converted to BIO tags; spans in conflict were resolved manually and via automated heuristics before upload.
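
The span-to-BIO conversion step above can be sketched as follows. This is a minimal illustration assuming token-aligned, non-overlapping spans (conflicts already resolved, as described), not the exact pre-processing script used for this dataset.

```python
def spans_to_bio(tokens, spans):
    """Convert (start_token, end_token, label) spans to BIO tags.

    `spans` use half-open token indices; overlapping spans are assumed
    to have been resolved beforehand.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):     # any continuation tokens
            tags[i] = f"I-{label}"
    return tags

tokens = ["TrickBot", "contacted", "hxxp://185.222.202.55", "yesterday"]
tags = spans_to_bio(tokens, [(0, 1, "MALWARE"), (2, 3, "URL")])
print(tags)  # ['B-MALWARE', 'O', 'B-URL', 'O']
```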

## Label Mapping

| Label | Description | Example mention |
|-------|-------------|-----------------|
| URL | Web address or obfuscated link used in campaigns. | `hxxp://185.222.202.55` |
| ORG | Organisations such as companies, CERTs, or research groups. | `Microsoft Threat Intelligence` |
| SERVICE | Online or cloud services referenced in attacks. | `Google Ads` |
| SECTOR | Industry sectors or verticals targeted. | `critical infrastructure` |
| FILEPATH | File system paths observed in malware samples. | `C:\Windows\System32\svchost.exe` |
| DOMAIN | Fully qualified domains or subdomains. | `malicious-domain[.]com` |
| PLATFORM | Operating systems or computing platforms. | `Windows Server` |
| THREAT-ACTOR | Named adversary groups or aliases. | `LockBit` |
| PRODUCT | Commercial or open-source software products. | `VMware ESXi` |
| MALWARE | Malware families, strains, or toolkits. | `TrickBot` |
| LOC | Countries, cities, or regions. | `United States` |
| CVE | CVE identifiers for vulnerabilities. | `CVE-2023-23397` |
| TOOL | Legitimate or dual-use tools leveraged in incidents. | `Cobalt Strike` |
| IPV4 | IPv4 addresses. | `185.222.202.55` |
| MITRE-TACTIC | MITRE ATT&CK tactic categories. | `Credential Access` |
| MD5 | MD5 cryptographic hashes. | `d41d8cd98f00b204e9800998ecf8427e` |
| CAMPAIGN | Named operations or campaigns. | `Operation Cronos` |
| SHA1 | SHA-1 hashes. | `da39a3ee5e6b4b0d3255bfef95601890afd80709` |
| SHA256 | SHA-256 hashes. | `9e107d9d372bb6826bd81d3542a419d6...` |
| EMAIL | Email addresses. | `alerts@example.com` |
| IPV6 | IPv6 addresses. | `2001:0db8:85a3:0000:0000:8a2e:0370:7334` |
| REGISTRY-KEYS | Windows registry keys or paths. | `HKLM\Software\Microsoft\Windows\CurrentVersion\Run` |
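
The taxonomy above naturally splits into contextual entities and technical indicators, which downstream pipelines often route differently (e.g., indicators into an IOC store). The sketch below groups pipeline output along that split; the label grouping itself is an editorial choice for illustration, not something the model outputs, and the sample dicts mirror the Quick Start output shape.

```python
# Indicator-style labels from the table above; this split is an
# illustrative grouping, not part of the model's output.
INDICATOR_LABELS = {
    "URL", "DOMAIN", "IPV4", "IPV6", "EMAIL", "MD5", "SHA1", "SHA256",
    "FILEPATH", "REGISTRY-KEYS", "CVE",
}

def split_predictions(predictions):
    """Split pipeline output into technical indicators and context entities."""
    indicators, context = [], []
    for pred in predictions:
        bucket = indicators if pred["entity_group"] in INDICATOR_LABELS else context
        bucket.append((pred["entity_group"], pred["word"].strip()))
    return indicators, context

sample = [
    {"entity_group": "MALWARE", "word": "TrickBot"},
    {"entity_group": "URL", "word": " hxxp://185.222.202.55"},
    {"entity_group": "PLATFORM", "word": " Windows"},
]
indicators, context = split_predictions(sample)
print(indicators)  # [('URL', 'hxxp://185.222.202.55')]
print(context)     # [('MALWARE', 'TrickBot'), ('PLATFORM', 'Windows')]
```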

## Training Procedure

- **Base model:** [`answerdotai/ModernBERT-large`](https://huggingface.co/answerdotai/ModernBERT-large).
- **Hardware:** single Nvidia L40S instance (8 vCPU / 62 GB RAM / 48 GB VRAM).
- **Optimisation setup:** mixed precision `fp16`, optimiser `adamw_torch`, cosine learning-rate scheduler, gradient accumulation `1`.
- **Key hyperparameters:** learning rate `5e-5`, batch size `128`, epochs `5`, maximum sequence length `128`.

| Parameter | Value |
|-----------|-------|
| Mixed precision | `fp16` |
| Batch size | `128` |
| Learning rate | `5e-5` |
| Optimiser | `adamw_torch` |
| Scheduler | `cosine` |
| Epochs | `5` |
| Gradient accumulation | `1` |
| Max sequence length | `128` |
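
For readers reproducing the run with the vanilla `Trainer` API, the hyperparameters above translate roughly into the configuration below. The actual training used AutoTrain, so details such as the output directory, warmup, and logging cadence are illustrative assumptions.

```python
from transformers import TrainingArguments

# Approximate re-creation of the table above; output_dir is a
# hypothetical path, not the directory used for the original run.
args = TrainingArguments(
    output_dir="securemodernbert-ner",
    learning_rate=5e-5,
    per_device_train_batch_size=128,
    num_train_epochs=5,
    lr_scheduler_type="cosine",
    optim="adamw_torch",
    gradient_accumulation_steps=1,
    fp16=True,
)
```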

## Evaluation

AutoTrain reports the following micro-averaged metrics on its validation split (seqeval entity scoring):

| Metric     | Score  |
|------------|--------|
| Precision  | 0.8468 |
| Recall     | 0.8484 |
| F1         | 0.8476 |
| Accuracy   | 0.9589 |

An independent re-evaluation against a consolidated CTI set (same taxonomy as this model) produced the label-level accuracy breakdown below. The macro accuracy reported under the table is the unweighted mean of these per-label scores and is therefore not numerically comparable to the micro-averaged metrics above, but it provides insight into class balance and span quality.

| Label | Used | Accuracy |
|-------|------|----------|
| CAMPAIGN | 1,817 | 0.7980 |
| CVE | 28,293 | 0.9995 |
| DOMAIN | 12,182 | 0.8878 |
| EMAIL | 731 | 0.8495 |
| FILEPATH | 13,889 | 0.7957 |
| IPV4 | 1,164 | 0.9631 |
| IPV6 | 563 | 0.7425 |
| LOC | 7,915 | 0.9557 |
| MALWARE | 10,405 | 0.9087 |
| MD5 | 389 | 0.9100 |
| MITRE-TACTIC | 2,181 | 0.7093 |
| ORG | 36,324 | 0.9301 |
| PLATFORM | 8,036 | 0.8977 |
| PRODUCT | 18,720 | 0.8432 |
| REGISTRY-KEYS | 1,589 | 0.8490 |
| SECTOR | 6,453 | 0.8309 |
| SERVICE | 8,533 | 0.8179 |
| SHA1 | 222 | 0.9189 |
| SHA256 | 2,146 | 0.9874 |
| THREAT-ACTOR | 9,532 | 0.9418 |
| TOOL | 4,874 | 0.7895 |
| URL | 7,470 | 0.9801 |

- **Macro accuracy:** 0.8776

Because micro vs macro averaging and dataset composition differ, expect numerical gaps between the two evaluations even though both describe the same checkpoint: the AutoTrain figures are `seqeval` micro-averages at the entity level, while the table above reports simple per-label accuracies.
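
The macro figure is just the unweighted mean of the per-label accuracies in the table, so rare classes such as `IPV6` count exactly as much as high-volume classes such as `ORG`. A quick check in plain Python:

```python
# Per-label accuracies copied from the table above.
per_label_accuracy = {
    "CAMPAIGN": 0.7980, "CVE": 0.9995, "DOMAIN": 0.8878, "EMAIL": 0.8495,
    "FILEPATH": 0.7957, "IPV4": 0.9631, "IPV6": 0.7425, "LOC": 0.9557,
    "MALWARE": 0.9087, "MD5": 0.9100, "MITRE-TACTIC": 0.7093, "ORG": 0.9301,
    "PLATFORM": 0.8977, "PRODUCT": 0.8432, "REGISTRY-KEYS": 0.8490,
    "SECTOR": 0.8309, "SERVICE": 0.8179, "SHA1": 0.9189, "SHA256": 0.9874,
    "THREAT-ACTOR": 0.9418, "TOOL": 0.7895, "URL": 0.9801,
}

# Macro accuracy = unweighted mean over labels.
macro = sum(per_label_accuracy.values()) / len(per_label_accuracy)
print(round(macro, 4))  # 0.8776
```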

## External Benchmarks

The following tables report detailed results on a shared CTI validation set. **Do not compare the per-label values across models directly:** each checkpoint uses a different taxonomy or remapping strategy, so accuracy percentages can be misleading when labels are aligned or collapsed differently. Use the per-model tables to understand performance within a single schema, and interpret macro-accuracy scores with caution.


### CyberPeace-Institute/SecureBERT-NER

| Label | Used | Accuracy |
|-------|------|----------|
| ACT | 3,945 | 0.1706 |
| APT | 9,518 | 0.5331 |
| DOM | 10,694 | 0.0196 |
| EMAIL | 731 | 0.0000 |
| FILE | 31,864 | 0.0747 |
| IP | 1,251 | 0.0088 |
| LOC | 7,895 | 0.8711 |
| MAL | 10,341 | 0.6076 |
| MD5 | 354 | 0.8672 |
| O | 16,275 | 0.4700 |
| OS | 7,974 | 0.6598 |
| SECTEAM | 36,083 | 0.3509 |
| SHA1 | 191 | 0.0209 |
| SHA2 | 1,647 | 0.9709 |
| TOOL | 4,816 | 0.4043 |
| URL | 6,997 | 0.0795 |
| VULID | 27,586 | 0.3849 |

- **Macro accuracy:** 0.3820

### PranavaKailash/CyNER-2.0-DeBERTa-v3-base

| Label | Used | Accuracy |
|-------|------|----------|
| Indicator | 35,936 | 0.7878 |
| Location | 7,895 | 0.0113 |
| Malware | 12,125 | 0.7800 |
| O | 2,896 | 0.7652 |
| Organization | 42,537 | 0.6556 |
| System | 35,063 | 0.7259 |
| TOOL | 4,820 | 0.0000 |
| Threat Group | 9,522 | 0.0000 |
| Vulnerability | 27,673 | 0.1876 |

- **Macro accuracy:** 0.4348

### cisco-ai/SecureBERT2.0-NER

| Label | Used | Accuracy |
|-------|------|----------|
| Indicator | 35,789 | 0.8854 |
| Malware | 16,926 | 0.6204 |
| O | 10,786 | 0.6813 |
| Organization | 51,993 | 0.5579 |
| System | 34,955 | 0.6600 |
| Vulnerability | 27,525 | 0.2552 |

- **Macro accuracy:** 0.6100


## Responsible Use

- Confirm entity detections before acting on indicators (e.g., automated blocking).
- Combine with enrichment and scoring systems to filter false positives.
- Monitor for drift if applying to new domains (e.g., non-English sources, informal channels).
- Respect licensing and confidentiality of any proprietary CTI sources used for inference.


## Support & Connect

* ❤️ **Like the repo** if you found it useful
* ☕ **Support me:** Say thanks by buying me a coffee! [https://buymeacoffee.com/juanmcristobal](https://buymeacoffee.com/juanmcristobal)
* 💼 **Open to work:** [https://www.linkedin.com/in/jmcristobal/](https://www.linkedin.com/in/jmcristobal/)

If you use SecureModernBERT-NER in a project, feel free to share it in the Discussions/Issues — I love seeing real-world use cases.

## Citation

If you find this model useful, please cite the repository and the base model:

```
@software{securemodernbert_ner_2025,
  author = {Juan Manuel Cristóbal Moreno},
  title = {SecureModernBERT-NER: Cyber Threat Intelligence Named Entity Recogniser},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/attack-vector/SecureModernBERT-NER}
}
```

## Contact

Questions or feedback? Open an issue on the Hugging Face model repository or reach out at [`@juanmcristobal`](https://huggingface.co/juanmcristobal).