---
license: apache-2.0
tags:
- domain-generation-algorithm
- cybersecurity
- domain-classification
- security
- malware-detection
language:
- en
library_name: transformers
pipeline_tag: text-classification
base_model: answerdotai/ModernBERT-base
---
# ModernBERT DGA Detector
This model is designed to classify domains as either legitimate or generated by Domain Generation Algorithms (DGA).
## Model Description
- **Model Type:** BERT-based sequence classification
- **Task:** Binary classification (Legitimate vs DGA domains)
- **Base Model:** ModernBERT-base
- **Training Data:** Domain names dataset
- **Author:** Reynier Leyva La O, Carlos A. Catania
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Reynier/modernbert-dga-detector")
model = AutoModelForSequenceClassification.from_pretrained("Reynier/modernbert-dga-detector")

# Example prediction
def predict_domain(domain):
    inputs = tokenizer(domain, return_tensors="pt", max_length=64, truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.softmax(outputs.logits, dim=-1)
    legit_prob = predictions[0][0].item()
    dga_prob = predictions[0][1].item()
    return {
        "prediction": "DGA" if dga_prob > legit_prob else "LEGITIMATE",
        "confidence": max(legit_prob, dga_prob),
    }

# Test examples
domains = ["google.com", "xkvbzpqr.net", "facebook.com", "abcdef123456.com"]
for domain in domains:
    result = predict_domain(domain)
    print(f"{domain} -> {result['prediction']} (confidence: {result['confidence']:.3f})")
```
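When classifying many domains at once, batching amortizes tokenizer and model overhead. The sketch below is an assumed extension of the example above: it reuses the same `model` and `tokenizer` objects and is shown as a function only (it is not executed here, and `predict_batch` is a name introduced for illustration):

```python
def predict_batch(model, tokenizer, domains, max_length=64):
    """Classify a list of domains in a single forward pass.

    Expects the model/tokenizer objects loaded in the usage example above.
    """
    import torch  # imported lazily so the function can be defined without torch installed

    inputs = tokenizer(domains, return_tensors="pt", max_length=max_length,
                       truncation=True, padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)

    results = []
    for domain, p in zip(domains, probs):
        legit_prob, dga_prob = p[0].item(), p[1].item()
        results.append({
            "domain": domain,
            "prediction": "DGA" if dga_prob > legit_prob else "LEGITIMATE",
            "confidence": max(legit_prob, dga_prob),
        })
    return results
```

For large feeds (e.g. a day of DNS logs), chunking the input list into fixed-size batches before calling this function keeps padding overhead and memory bounded.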
## Model Architecture
The model is based on ModernBERT and fine-tuned for domain classification:
- Input: Domain names (text)
- Output: Binary classification (0=LEGITIMATE, 1=DGA)
- Max sequence length: 64 tokens
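The logit-to-label mapping described above can be sketched in plain Python (no model download needed); the example logits here are made up for illustration:

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_from_logits(logits):
    """Map a 2-way logit pair to the card's label scheme: index 0=LEGITIMATE, 1=DGA."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return ("LEGITIMATE", "DGA")[idx], probs[idx]

# Hypothetical logits where the DGA class dominates:
label, confidence = label_from_logits([-1.2, 2.3])
print(label, round(confidence, 3))  # DGA 0.971
```

This is exactly what the `torch.softmax` + `argmax` logic in the usage example computes, written out for clarity.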
## Training Details
This model was fine-tuned on a dataset of legitimate and DGA-generated domains using:
- Base model: answerdotai/ModernBERT-base
- Framework: Transformers/PyTorch
- Task: Binary sequence classification
## Performance
Reported performance (mean ± standard deviation):
- Accuracy: 0.9658 ± 0.0153
- Precision: 0.9704 ± 0.0253
- Recall: 0.9582 ± 0.0147
- F1-Score: 0.9579 ± 0.0167
- FPR: 0.0267 ± 0.0233
- TPR: 0.9582 ± 0.0147
- Query time: 0.1226 ± 0.0253 on CPU (no GPU required)
## Use Cases
- **Cybersecurity**: Detect malicious domains generated by malware
- **Network Security**: Filter potentially harmful domains
- **Threat Intelligence**: Analyze domain patterns in security feeds
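As a sketch of the network-filtering use case, the snippet below wires a classifier callable into a simple allow/block decision. To keep the example self-contained, a stand-in heuristic (long consonant runs) replaces the real model; in practice the `predict_domain` function from the usage example would be passed in instead. `BLOCK_THRESHOLD`, `stub_classifier`, and `filter_domains` are all illustrative names, not part of the model:

```python
BLOCK_THRESHOLD = 0.9  # illustrative confidence cutoff, not tuned on real data

def stub_classifier(domain):
    """Stand-in for predict_domain(): flags labels with a long consonant run."""
    label = domain.split(".")[0]
    run = best = 0
    for ch in label:
        run = run + 1 if ch.isalpha() and ch not in "aeiou" else 0
        best = max(best, run)
    prediction = "DGA" if best >= 5 else "LEGITIMATE"
    return {"prediction": prediction, "confidence": 0.95}

def filter_domains(domains, classify=stub_classifier):
    """Partition domains into (allowed, blocked) based on classifier verdicts."""
    allowed, blocked = [], []
    for d in domains:
        result = classify(d)
        if result["prediction"] == "DGA" and result["confidence"] >= BLOCK_THRESHOLD:
            blocked.append(d)
        else:
            allowed.append(d)
    return allowed, blocked

allowed, blocked = filter_domains(["google.com", "xkvbzpqr.net"])
print(blocked)  # ['xkvbzpqr.net'] under this stub heuristic
```

Thresholding on confidence rather than blocking every "DGA" verdict gives operators a knob to trade the reported ~2.7% false-positive rate against recall.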
## Limitations
- This model is trained specifically for domain classification
- Performance may vary on domains from different TLDs or languages
- Regular retraining may be needed as DGA techniques evolve
- Model performance depends on the quality and diversity of training data
## Citation
If you use this model in your research or applications, please cite it appropriately.
## Related Models
Check out the author's other security models:
- [Llama3_8B-DGA-Detector](https://huggingface.co/Reynier/Llama3_8B-DGA-Detector)