---
license: apache-2.0
tags:
- domain-generation-algorithm
- cybersecurity
- domain-classification
- security
- malware-detection
language:
- en
library_name: transformers
pipeline_tag: text-classification
base_model: answerdotai/ModernBERT-base
---

# ModernBERT DGA Detector

This model classifies domain names as either legitimate or generated by a Domain Generation Algorithm (DGA).

## Model Description

- **Model Type:** BERT-based sequence classification
- **Task:** Binary classification (Legitimate vs DGA domains)
- **Base Model:** ModernBERT-base
- **Training Data:** Domain names dataset
- **Authors:** Reynier Leyva La O, Carlos A. Catania

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Reynier/modernbert-dga-detector")
model = AutoModelForSequenceClassification.from_pretrained("Reynier/modernbert-dga-detector")

# Classify a single domain and return the predicted label with its probability
def predict_domain(domain):
    inputs = tokenizer(domain, return_tensors="pt", max_length=64, truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Softmax over the two logits: index 0 = LEGITIMATE, index 1 = DGA
    probs = torch.softmax(outputs.logits, dim=-1)[0]
    legit_prob, dga_prob = probs[0].item(), probs[1].item()
    return {
        "prediction": "DGA" if dga_prob > legit_prob else "LEGITIMATE",
        "confidence": max(legit_prob, dga_prob),
    }

# Test examples
domains = ["google.com", "xkvbzpqr.net", "facebook.com", "abcdef123456.com"]
for domain in domains:
    result = predict_domain(domain)
    print(f"{domain} -> {result['prediction']} (confidence: {result['confidence']:.3f})")
```

## Model Architecture

The model is based on ModernBERT and fine-tuned for domain classification:
- Input: Domain names (text)
- Output: Binary classification (0=LEGITIMATE, 1=DGA)
- Max sequence length: 64 tokens
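
To illustrate how the two output logits map onto the labels listed above, here is a minimal pure-Python sketch. The function name `decode_logits` and the example logits are hypothetical and are not real model output; only the label mapping (0=LEGITIMATE, 1=DGA) comes from this card.

```python
import math

# Label mapping as stated in this model card: 0=LEGITIMATE, 1=DGA.
ID2LABEL = {0: "LEGITIMATE", 1: "DGA"}


def decode_logits(logits):
    """Turn a pair of raw logits into a label and softmax probabilities."""
    # Subtract the largest logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Pick the index with the highest probability; ties resolve to index 0.
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "probabilities": probs}


# Hypothetical logits chosen for illustration only.
print(decode_logits([0.3, 2.1]))
```

A tie between the two classes resolves to LEGITIMATE here, which matches the `dga_prob > legit_prob` comparison in the usage example.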

## Training Details

This model was fine-tuned on a dataset of legitimate and DGA-generated domains using:
- Base model: answerdotai/ModernBERT-base
- Framework: Transformers/PyTorch
- Task: Binary sequence classification

## Performance

Reported metrics:
- Accuracy: 0.9658 ± 0.0153
- Precision: 0.9704 ± 0.0253
- Recall: 0.9582 ± 0.0147
- F1-Score: 0.9579 ± 0.0167
- FPR: 0.0267 ± 0.0233
- TPR: 0.9582 ± 0.0147
- Query time: 0.1226 ± 0.0253 on CPU (no GPU required)
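
For readers less familiar with these metrics, the sketch below shows the standard definitions computed from confusion-matrix counts. It assumes DGA is the positive class, and the counts used are hypothetical, not the evaluation data behind the figures above. Under these definitions TPR and recall are the same quantity, which is why they share a value in the list.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion counts.

    Assumes DGA is the positive class: tp = DGA correctly flagged,
    fp = legitimate domains wrongly flagged, and so on.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # identical to the true positive rate (TPR)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),  # share of legitimate domains wrongly flagged
    }


# Hypothetical counts for illustration only.
print(binary_metrics(tp=90, fp=5, tn=95, fn=10))
```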

## Use Cases

- **Cybersecurity**: Detect malicious domains generated by malware
- **Network Security**: Filter potentially harmful domains
- **Threat Intelligence**: Analyze domain patterns in security feeds
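
As a rough sketch of the filtering use case, the helper below flags domains whose DGA probability meets a threshold. The names `flag_domains` and `stub_scorer`, the threshold, and the scores are all hypothetical. In practice the scorer would wrap the model and return the softmax probability for the DGA class.

```python
def flag_domains(domains, dga_probability, threshold=0.5):
    """Return (domain, score) pairs whose DGA probability meets the threshold.

    `dga_probability` is any callable mapping a domain string to a float
    in [0, 1]; it is a hypothetical hook, not part of the model's API.
    """
    flagged = []
    for domain in domains:
        score = dga_probability(domain)
        if score >= threshold:
            flagged.append((domain, score))
    return flagged


def stub_scorer(domain):
    """Stand-in scorer with made-up probabilities, used only for this demo."""
    hypothetical_scores = {"google.com": 0.02, "xkvbzpqr.net": 0.97}
    return hypothetical_scores.get(domain, 0.5)


print(flag_domains(["google.com", "xkvbzpqr.net"], stub_scorer, threshold=0.9))
```

Raising the threshold trades recall for a lower false positive rate, which may matter when legitimate traffic is being blocked outright.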

## Limitations

- This model is trained specifically for domain classification
- Performance may vary on domains from different TLDs or languages
- Regular retraining may be needed as DGA techniques evolve
- Model performance depends on the quality and diversity of training data

## Citation

If you use this model in your research or applications, please cite it appropriately.

## Related Models

Check out the author's other security models:
- [Llama3_8B-DGA-Detector](https://huggingface.co/Reynier/Llama3_8B-DGA-Detector)