license: apache-2.0
tags:
  - domain-generation-algorithm
  - cybersecurity
  - domain-classification
  - security
  - malware-detection
language:
  - en
library_name: transformers
pipeline_tag: text-classification
base_model: answerdotai/ModernBERT-base

ModernBERT DGA Detector

This model classifies domain names as either legitimate or generated by a Domain Generation Algorithm (DGA).

Model Description

Model Type: BERT-based sequence classification
Task: Binary classification (Legitimate vs DGA domains)
Base Model: ModernBERT-base
Training Data: Domain names dataset
Authors: Reynier Leyva La O, Carlos A. Catania

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Reynier/modernbert-dga-detector")
model = AutoModelForSequenceClassification.from_pretrained("Reynier/modernbert-dga-detector")

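# Example prediction
def predict_domain(domain):
    """Classify one domain name and return the predicted label with its probability."""
    # Tokenize a single domain; 64 tokens is the model's maximum sequence length
    inputs = tokenizer(domain, return_tensors="pt", max_length=64, truncation=True, padding=True)
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    # Softmax over the two logits: index 0 = LEGITIMATE, index 1 = DGA
    predictions = torch.softmax(outputs.logits, dim=-1)
    legit_prob = predictions[0][0].item()
    dga_prob = predictions[0][1].item()
    return {
        "prediction": "DGA" if dga_prob > legit_prob else "LEGITIMATE",
        "confidence": max(legit_prob, dga_prob),
    }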

# Test examples
domains = ["google.com", "xkvbzpqr.net", "facebook.com", "abcdef123456.com"]
for domain in domains:
    result = predict_domain(domain)
    print(f"{domain} -> {result['prediction']} (confidence: {result['confidence']:.3f})")

Model Architecture

The model is based on ModernBERT and fine-tuned for domain classification:

Input: Domain names (text)
Output: Binary classification (0=LEGITIMATE, 1=DGA)
Max sequence length: 64 tokens
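
The output mapping above can be illustrated without loading the model. The following is a minimal sketch in plain Python of how a pair of logits becomes a label and a probability. The names `ID2LABEL`, `softmax`, and `decode` are illustrative and not part of the released model, and the logits are made-up values.

```python
import math

# Label mapping as stated in this model card: 0=LEGITIMATE, 1=DGA.
ID2LABEL = {0: "LEGITIMATE", 1: "DGA"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Map a pair of logits to (label, probability of that label).

    On an exact tie the first index wins, so the result is LEGITIMATE,
    matching the strict `dga_prob > legit_prob` rule in the Usage code.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Hypothetical logits, chosen only for illustration.
label, prob = decode([-1.2, 2.3])
print(label, round(prob, 3))
```

With real model output, `outputs.logits[0].tolist()` from the Usage section would take the place of the hypothetical list.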

Training Details

This model was fine-tuned on a dataset of legitimate and DGA-generated domains using:

Base model: answerdotai/ModernBERT-base
Framework: Transformers/PyTorch
Task: Binary sequence classification

Performance

Reported evaluation metrics:

Accuracy: 0.9658 ± 0.0153
Precision: 0.9704 ± 0.0253
Recall: 0.9582 ± 0.0147
F1-Score: 0.9579 ± 0.0167
FPR: 0.0267 ± 0.0233
TPR: 0.9582 ± 0.0147
Query Time: 0.1226 ± 0.0253 on CPU (no GPU required)
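
For readers comparing these figures with their own evaluation, the sketch below shows how the listed metrics follow from the four confusion-matrix counts, treating DGA as the positive class. TPR is the same quantity as recall, which is why both carry the same value above. The function name `binary_metrics` and the counts in the example are hypothetical. They are not the authors' evaluation data.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion counts.

    tp: DGA domains flagged as DGA          fp: legitimate domains flagged as DGA
    tn: legitimate domains passed through   fn: DGA domains missed
    """
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)  # identical to TPR
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": _ratio(2 * precision * recall, precision + recall),
        "fpr": _ratio(fp, fp + tn),
        "tpr": recall,
    }


# Hypothetical counts for a 200-domain test set (illustration only).
metrics = binary_metrics(tp=95, fp=3, tn=97, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```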

Use Cases

Cybersecurity: Detect malicious domains generated by malware
Network Security: Filter potentially harmful domains
Threat Intelligence: Analyze domain patterns in security feeds
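
As a rough illustration of the filtering use case, the sketch below flags domains whose DGA probability meets a configurable threshold. It assumes a scoring function that returns the DGA probability for one domain, for example a thin wrapper around the Usage code that returns `dga_prob`. The names `flag_domains` and `fake_score`, the threshold, and the scores are hypothetical stand-ins and not model output.

```python
def flag_domains(domains, score_fn, threshold=0.5):
    """Return (domain, dga_probability) pairs at or above the threshold.

    score_fn is any callable mapping a domain string to a DGA probability
    in [0, 1]. Results are sorted from most to least suspicious.
    """
    flagged = []
    for domain in domains:
        probability = score_fn(domain)
        if probability >= threshold:
            flagged.append((domain, probability))
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


def fake_score(domain):
    """Stand-in scorer with made-up probabilities (illustration only)."""
    scores = {"google.com": 0.02, "xkvbzpqr.net": 0.97, "abcdef123456.com": 0.61}
    return scores.get(domain, 0.0)


feed = ["google.com", "xkvbzpqr.net", "facebook.com", "abcdef123456.com"]
for domain, probability in flag_domains(feed, fake_score, threshold=0.6):
    print(f"{domain}: {probability:.2f}")
```

Raising the threshold trades recall for a lower false-positive rate, which can matter when flagged domains are blocked outright rather than only logged.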

Limitations

This model is trained specifically for domain classification
Performance may vary on domains from different TLDs or languages
Regular retraining may be needed as DGA techniques evolve
Model performance depends on the quality and diversity of training data

Citation

If you use this model in your research or applications, please cite it appropriately.

Related Models

Check out the author's other security models:

Llama3_8B-DGA-Detector