---
language: en
library_name: transformers
pipeline_tag: token-classification
tags:
- ner
- token-classification
- cybersecurity
- threat-intelligence
- secureBert
license: mit
metrics:
- accuracy
base_model:
- answerdotai/ModernBERT-large
---
# Model Overview
**SecureModernBERT-NER** is a cybersecurity-focused token-classification model that combines the **ModernBERT architecture** with one of the **largest and most diverse CTI-labelled NER corpora built to date**.
Unlike general-purpose NER systems, SecureModernBERT-NER recognises **22 fine-grained, security-specific entity types** covering the full spectrum of cyber-threat intelligence, from `THREAT-ACTOR` and `MALWARE` to `CVE`, `IPV4`, `DOMAIN`, and `REGISTRY-KEYS`.
Trained on more than **half a million manually curated spans** sourced from real-world threat reports, vulnerability advisories, and incident analyses, it balances **accuracy, generalisation, and contextual depth**.
The model is designed to **parse complex security narratives**, extracting both contextual metadata (e.g., `ORG`, `PRODUCT`, `PLATFORM`) and highly technical indicators (e.g., `MD5`/`SHA256` hashes, `URL`s, `IPV4`/`IPV6` addresses) within a single unified framework.
SecureModernBERT-NER is intended as a foundation for **automated CTI entity recognition**, supporting **threat-intelligence automation, enrichment, and analytics**.
## Quick Start
```python
from transformers import pipeline

model_id = "attack-vector/SecureModernBERT-NER"
pipe = pipeline(
    task="token-classification",
    model=model_id,
    tokenizer=model_id,
    aggregation_strategy="first",
)

text = "TrickBot connects to hxxp://185.222.202.55 to exfiltrate data from Windows hosts."
predictions = pipe(text)
for pred in predictions:
    print(pred)
```
Sample output:
```
{'entity_group': 'MALWARE', 'score': np.float32(0.9615546), 'word': 'TrickBot', 'start': 0, 'end': 8}
{'entity_group': 'URL', 'score': np.float32(0.9905957), 'word': ' hxxp://185.222.202.55', 'start': 20, 'end': 42}
{'entity_group': 'PLATFORM', 'score': np.float32(0.92317337), 'word': ' Windows', 'start': 66, 'end': 74}
```
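For downstream pipelines it is often convenient to collapse these predictions into a simple label-to-mentions map. A minimal sketch using the sample output above (`group_entities` is an illustrative helper, not part of the model's API):

```python
from collections import defaultdict

def group_entities(predictions):
    """Group pipeline predictions into {entity_group: [surface strings]}."""
    grouped = defaultdict(list)
    for pred in predictions:
        # Aggregated pipeline words may carry a leading space; strip it.
        grouped[pred["entity_group"]].append(pred["word"].strip())
    return dict(grouped)

# Predictions as returned by the pipeline above (scores rounded)
predictions = [
    {"entity_group": "MALWARE", "score": 0.9616, "word": "TrickBot", "start": 0, "end": 8},
    {"entity_group": "URL", "score": 0.9906, "word": " hxxp://185.222.202.55", "start": 20, "end": 42},
    {"entity_group": "PLATFORM", "score": 0.9232, "word": " Windows", "start": 66, "end": 74},
]
print(group_entities(predictions))
# {'MALWARE': ['TrickBot'], 'URL': ['hxxp://185.222.202.55'], 'PLATFORM': ['Windows']}
```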
## Intended Use & Limitations
- **Use cases:** automated tagging of CTI reports, IOC extraction pipelines, knowledge-base enrichment, security-focused RAG systems.
- **Languages:** English (model was trained and evaluated on English sources only).
- **Input format:** free-form prose or long-form CTI articles; maximum sequence length 128 tokens during training.
- **Limitations:** noisy or ambiguous extractions may occur, especially with rare entity types (`IPV6`, `EMAIL`) and obfuscated strings. The model does not normalise entities (e.g., deobfuscating `hxxp`) nor validate indicator authenticity. Always pair with downstream validation and human review.
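Since the model leaves defanged indicators as-is, a small normalisation step is usually wanted before enrichment or blocking. A heuristic refang sketch (real feeds use many more defanging variants than the few handled here):

```python
import re

def refang(ioc):
    """Undo common defanging conventions (hxxp, [.], (.), [:]).
    Heuristic sketch only; extend for the conventions your feeds use."""
    ioc = re.sub(r"hxxp", "http", ioc, flags=re.IGNORECASE)
    return ioc.replace("[.]", ".").replace("(.)", ".").replace("[:]", ":")

print(refang("hxxp://185.222.202[.]55"))  # http://185.222.202.55
print(refang("malicious-domain[.]com"))   # malicious-domain.com
```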
## Training Data
- **Size:** 502,726 labelled text spans before filtering; 22 distinct entity classes in BIO format.
- **Label distribution (spans):** `ORG` (approx. 198k), `PRODUCT` (approx. 79k), `MALWARE` (approx. 67k), `PLATFORM` (approx. 57k), `THREAT-ACTOR` (approx. 49k), `SERVICE` (approx. 46k), `CVE` (approx. 41k), `LOC` (approx. 38k), `SECTOR` (approx. 34k), `TOOL` (approx. 29k), plus indicator types such as `URL`, `IPV4`, `SHA256`, `MD5`, and `REGISTRY-KEYS`.
- **Pre-processing:** JSONL articles were tokenised and converted to BIO tags; spans in conflict were resolved manually and via automated heuristics before upload.
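The span-to-BIO conversion step can be illustrated with a simplified, whitespace-tokenised sketch. The actual pre-processing pipeline is not published; this only demonstrates the BIO convention on character-offset spans:

```python
def spans_to_bio(text, spans):
    """Convert character-level (start, end, label) spans to BIO tags
    over whitespace tokens. Simplified illustration of the convention."""
    tags = []
    offset = 0
    for token in text.split():
        start = text.index(token, offset)
        end = start + len(token)
        offset = end
        label = "O"
        for s, e, lab in spans:
            if start >= s and end <= e:
                label = ("B-" if start == s else "I-") + lab
                break
        tags.append((token, label))
    return tags

print(spans_to_bio("Cobalt Strike beacon", [(0, 13, "TOOL")]))
# [('Cobalt', 'B-TOOL'), ('Strike', 'I-TOOL'), ('beacon', 'O')]
```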
## Label Mapping
| Label | Description | Example mention |
|-------|-------------|-----------------|
| URL | Web address or obfuscated link used in campaigns. | `hxxp://185.222.202.55` |
| ORG | Organisations such as companies, CERTs, or research groups. | `Microsoft Threat Intelligence` |
| SERVICE | Online or cloud services referenced in attacks. | `Google Ads` |
| SECTOR | Industry sectors or verticals targeted. | `critical infrastructure` |
| FILEPATH | File system paths observed in malware samples. | `C:\Windows\System32\svchost.exe` |
| DOMAIN | Fully qualified domains or subdomains. | `malicious-domain[.]com` |
| PLATFORM | Operating systems or computing platforms. | `Windows Server` |
| THREAT-ACTOR | Named adversary groups or aliases. | `LockBit` |
| PRODUCT | Commercial or open-source software products. | `VMware ESXi` |
| MALWARE | Malware families, strains, or toolkits. | `TrickBot` |
| LOC | Countries, cities, or regions. | `United States` |
| CVE | CVE identifiers for vulnerabilities. | `CVE-2023-23397` |
| TOOL | Legitimate or dual-use tools leveraged in incidents. | `Cobalt Strike` |
| IPV4 | IPv4 addresses. | `185.222.202.55` |
| MITRE-TACTIC | MITRE ATT&CK tactic categories. | `Credential Access` |
| MD5 | MD5 cryptographic hashes. | `d41d8cd98f00b204e9800998ecf8427e` |
| CAMPAIGN | Named operations or campaigns. | `Operation Cronos` |
| SHA1 | SHA-1 hashes. | `da39a3ee5e6b4b0d3255bfef95601890afd80709` |
| SHA256 | SHA-256 hashes. | `9e107d9d372bb6826bd81d3542a419d6...` |
| EMAIL | Email addresses. | `alerts@example.com` |
| IPV6 | IPv6 addresses. | `2001:0db8:85a3:0000:0000:8a2e:0370:7334` |
| REGISTRY-KEYS | Windows registry keys or paths. | `HKLM\Software\Microsoft\Windows\CurrentVersion\Run` |
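Because the model does not validate indicator syntax, extracted `CVE`, `IPV4`/`IPV6`, and hash entities benefit from a cheap syntactic filter before they enter downstream systems. A stdlib-only sketch (note that defanged values such as `185.222.202[.]55` must be refanged first):

```python
import ipaddress
import re

HASH_LENGTHS = {"MD5": 32, "SHA1": 40, "SHA256": 64}

def is_plausible(label, value):
    """Cheap syntactic check for indicator-style labels.
    Returns True for labels without a syntactic rule (e.g. ORG)."""
    if label == "CVE":
        return bool(re.fullmatch(r"CVE-\d{4}-\d{4,}", value, re.IGNORECASE))
    if label in ("IPV4", "IPV6"):
        try:
            return ipaddress.ip_address(value).version == (4 if label == "IPV4" else 6)
        except ValueError:
            return False
    if label in HASH_LENGTHS:
        return bool(re.fullmatch(rf"[0-9a-fA-F]{{{HASH_LENGTHS[label]}}}", value))
    return True  # prose-like labels: no syntactic rule

print(is_plausible("CVE", "CVE-2023-23397"))  # True
print(is_plausible("IPV4", "999.1.1.1"))      # False
```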
## Training Procedure
- **Base model:** [`answerdotai/ModernBERT-large`](https://huggingface.co/answerdotai/ModernBERT-large).
- **Hardware:** single Nvidia L40S instance (8 vCPU / 62 GB RAM / 48 GB VRAM).
- **Optimisation setup:** mixed precision `fp16`, optimiser `adamw_torch`, cosine learning-rate scheduler, gradient accumulation `1`.
- **Key hyperparameters:** learning rate `5e-5`, batch size `128`, epochs `5`, maximum sequence length `128`.
| Parameter | Value |
|-----------|-------|
| Mixed precision | `fp16` |
| Batch size | `128` |
| Learning rate | `5e-5` |
| Optimiser | `adamw_torch` |
| Scheduler | `cosine` |
| Epochs | `5` |
| Gradient accumulation | `1` |
| Max sequence length | `128` |
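Given the 128-token training length, long CTI articles should be windowed before inference. A naive word-based sketch with overlap so entities straddling a boundary appear whole in at least one window (the 90-words-per-window figure is a rough assumption; subword-to-word ratios vary by text, so tune against your tokenizer):

```python
def chunk_text(text, max_words=90, overlap=20):
    """Split text into overlapping word windows so each stays within
    the model's 128-token training length. Assumes ~90 words fit in
    128 subword tokens, which is corpus-dependent."""
    words = text.split()
    step = max_words - overlap
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, max(len(words) - overlap, 1), step)
    ]

article = " ".join(f"w{i}" for i in range(200))
print(len(chunk_text(article)))  # 3
```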
## Evaluation
AutoTrain reports the following micro-averaged metrics on its validation split (seqeval entity scoring):
| Metric | Score |
|------------|--------|
| Precision | 0.8468 |
| Recall | 0.8484 |
| F1 | 0.8476 |
| Accuracy | 0.9589 |
An independent re-evaluation against a consolidated CTI set (same taxonomy as this model) produced the label-level accuracy breakdown below. These scores are macro-averaged across labels and therefore are not numerically comparable to the micro metrics above, but they provide insight into class balance and span quality.
| Label | Used | Accuracy |
|-------|------|----------|
| CAMPAIGN | 1,817 | 0.7980 |
| CVE | 28,293 | 0.9995 |
| DOMAIN | 12,182 | 0.8878 |
| EMAIL | 731 | 0.8495 |
| FILEPATH | 13,889 | 0.7957 |
| IPV4 | 1,164 | 0.9631 |
| IPV6 | 563 | 0.7425 |
| LOC | 7,915 | 0.9557 |
| MALWARE | 10,405 | 0.9087 |
| MD5 | 389 | 0.9100 |
| MITRE-TACTIC | 2,181 | 0.7093 |
| ORG | 36,324 | 0.9301 |
| PLATFORM | 8,036 | 0.8977 |
| PRODUCT | 18,720 | 0.8432 |
| REGISTRY-KEYS | 1,589 | 0.8490 |
| SECTOR | 6,453 | 0.8309 |
| SERVICE | 8,533 | 0.8179 |
| SHA1 | 222 | 0.9189 |
| SHA256 | 2,146 | 0.9874 |
| THREAT-ACTOR | 9,532 | 0.9418 |
| TOOL | 4,874 | 0.7895 |
| URL | 7,470 | 0.9801 |
- **Macro accuracy:** 0.8776
Because micro- vs macro-averaging and dataset composition differ, expect numerical gaps between the two evaluations even though both describe the same checkpoint. The AutoTrain metrics above were computed with the `seqeval` micro-average at the entity level, whereas the per-label table reports span-level accuracy macro-averaged across labels.
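For reference, seqeval-style entity-level micro averaging counts a prediction as correct only when both the label and the span boundaries match exactly. A dependency-free sketch of that scoring, assuming well-formed BIO sequences:

```python
def extract_spans(tags):
    """Collect (label, start, end) entity spans from a BIO tag sequence.
    Assumes well-formed BIO (every entity opens with B-)."""
    spans, label, start = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if label is not None:
                spans.append((label, start, i))
            label, start = tag[2:], i
        elif tag.startswith("I-") and label == tag[2:]:
            continue  # entity continues
        else:  # "O" or label mismatch closes the current entity
            if label is not None:
                spans.append((label, start, i))
            label, start = None, None
    if label is not None:
        spans.append((label, start, len(tags)))
    return set(spans)

def micro_f1(true_seqs, pred_seqs):
    """Entity-level micro F1: exact label + boundary match counts as TP."""
    tp = fp = fn = 0
    for t, p in zip(true_seqs, pred_seqs):
        gold, pred = extract_spans(t), extract_spans(p)
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

print(micro_f1([["B-MALWARE", "O", "B-PLATFORM"]],
               [["B-MALWARE", "O", "O"]]))  # 0.666...
```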
## External Benchmarks
The following tables report detailed results on a shared CTI validation set. **Do not compare the per-label values across models directly:** each checkpoint uses a different taxonomy or remapping strategy, so accuracy percentages can be misleading when labels are aligned or collapsed differently. Use the per-model tables to understand performance within a single schema, and interpret macro-accuracy scores with caution.
### CyberPeace-Institute/SecureBERT-NER
| Label | Used | Accuracy |
|-------|------|----------|
| ACT | 3,945 | 0.1706 |
| APT | 9,518 | 0.5331 |
| DOM | 10,694 | 0.0196 |
| EMAIL | 731 | 0.0000 |
| FILE | 31,864 | 0.0747 |
| IP | 1,251 | 0.0088 |
| LOC | 7,895 | 0.8711 |
| MAL | 10,341 | 0.6076 |
| MD5 | 354 | 0.8672 |
| O | 16,275 | 0.4700 |
| OS | 7,974 | 0.6598 |
| SECTEAM | 36,083 | 0.3509 |
| SHA1 | 191 | 0.0209 |
| SHA2 | 1,647 | 0.9709 |
| TOOL | 4,816 | 0.4043 |
| URL | 6,997 | 0.0795 |
| VULID | 27,586 | 0.3849 |
- **Macro accuracy:** 0.3820
### PranavaKailash/CyNER-2.0-DeBERTa-v3-base
| Label | Used | Accuracy |
|-------|------|----------|
| Indicator | 35,936 | 0.7878 |
| Location | 7,895 | 0.0113 |
| Malware | 12,125 | 0.7800 |
| O | 2,896 | 0.7652 |
| Organization | 42,537 | 0.6556 |
| System | 35,063 | 0.7259 |
| TOOL | 4,820 | 0.0000 |
| Threat Group | 9,522 | 0.0000 |
| Vulnerability | 27,673 | 0.1876 |
- **Macro accuracy:** 0.4348
### cisco-ai/SecureBERT2.0-NER
| Label | Used | Accuracy |
|-------|------|----------|
| Indicator | 35,789 | 0.8854 |
| Malware | 16,926 | 0.6204 |
| O | 10,786 | 0.6813 |
| Organization | 51,993 | 0.5579 |
| System | 34,955 | 0.6600 |
| Vulnerability | 27,525 | 0.2552 |
- **Macro accuracy:** 0.6100
## Responsible Use
- Confirm entity detections before acting on indicators (e.g., automated blocking).
- Combine with enrichment and scoring systems to filter false positives.
- Monitor for drift if applying to new domains (e.g., non-English sources, informal channels).
- Respect licensing and confidentiality of any proprietary CTI sources used for inference.
## Support & Connect
* ❤️ **Like the repo** if you found it useful
* ☕ **Support me:** Say thanks by buying me a coffee! [https://buymeacoffee.com/juanmcristobal](https://buymeacoffee.com/juanmcristobal)
* 💼 **Open to work:** [https://www.linkedin.com/in/jmcristobal/](https://www.linkedin.com/in/jmcristobal/)
If you use SecureModernBERT-NER in a project, feel free to share it in the Discussions/Issues — I love seeing real-world use cases.
## Citation
If you find this model useful, please cite the repository and the base model:
```bibtex
@software{securemodernbert_ner_2025,
  author    = {Juan Manuel Cristóbal Moreno},
  title     = {SecureModernBERT-NER: Cyber Threat Intelligence Named Entity Recogniser},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/attack-vector/SecureModernBERT-NER}
}
```
## Contact
Questions or feedback? Open an issue on the Hugging Face model repository or reach out at [`@juanmcristobal`](https://huggingface.co/juanmcristobal).