# GLiNER Guard — Unified Multitask Guardrail
One encoder model that replaces your entire guardrail stack: safety classification, PII detection, adversarial attack detection, intent and tone analysis — all in a single forward pass.

145M params · GLiNER2 · biencoder · modernbert multilingual · zero-shot classification, NER and more · no LLM required
## Installation

Install the dependencies (currently via our fork; we'll update this section once the PR to the GLiNER2 repo is merged):

```bash
pip install "gliner2 @ git+https://github.com/bogdanminko/GLiNER2.git@feature/bi-encoder"
```
## Usage

Classify harmful messages and detect PII in a single forward pass:

```python
from gliner2 import GLiNER2

model = GLiNER2.from_pretrained("raft-security-lab/gliner-guard-biencoder")
model.config.cache_labels = True

PII_LABELS = ["person", "location", "email", "phone"]
SAFETY_LABELS = ["safe", "unsafe"]

schema = (
    model.create_schema()
    .entities(entity_types=PII_LABELS, threshold=0.4)
    .classification(task="safety", labels=SAFETY_LABELS)
)

result = model.extract(
    "Send $500 to John Smith at john.smith@gmail.com or I'll leak your photos",
    schema=schema,
)
```
Output:

```python
{'entities': {'person': ['John Smith'],
              'location': [],
              'email': ['john.smith@gmail.com'],
              'phone': []},
 'safety': 'unsafe'}
```
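In an application you will usually act on this result, e.g. scrub the extracted spans before logging or forwarding the message. Below is a minimal post-processing sketch in plain Python, assuming the `entities` dict shape shown in the output above; the `redact` helper is ours, not part of the gliner2 API:

```python
# Hypothetical helper: mask every extracted PII span with a [LABEL]
# placeholder. Assumes the `entities` dict shape shown above.
def redact(text: str, entities: dict[str, list[str]]) -> str:
    for label, spans in entities.items():
        for span in spans:
            text = text.replace(span, f"[{label.upper()}]")
    return text

# The `result` dict mirrors the example output above.
result = {
    "entities": {
        "person": ["John Smith"],
        "location": [],
        "email": ["john.smith@gmail.com"],
        "phone": [],
    },
    "safety": "unsafe",
}

print(redact(
    "Send $500 to John Smith at john.smith@gmail.com or I'll leak your photos",
    result["entities"],
))
# → Send $500 to [PERSON] at [EMAIL] or I'll leak your photos
```

Plain `str.replace` is enough for a sketch; production redaction should prefer character offsets if the library exposes them, so repeated or overlapping spans are handled deterministically.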
## Supported Tasks

GLiNER Guard is purpose-built for six guardrail tasks via a shared encoder — no LLM required.
Thanks to zero-shot generalization, it can also handle custom labels outside the training taxonomy.

| Task | Type | # Labels | Key labels |
|---|---|---|---|
| Safety | single-label | 2 | `safe` `unsafe` |
| PII / NER | span extraction | 32 | `person` `email` `phone` `card_number` `address` |
| Adversarial Detection | multi-label | 15 | `jailbreak_persona` `prompt_injection` `instruction_override` `data_exfiltration` |
| Harmful Content | multi-label | 30 | `hate_speech` `violence` `child_exploitation` `fraud` `pii_exposure` |
| Intent | single-label | 13 | `informational` `adversarial` `threatening` `solicitation` |
| Tone of Voice | single-label | 10 | `neutral` `aggressive` `manipulative` `deceptive` |
### Safety — all 2 labels

Classifies whether a message is safe or unsafe. Single-label.

```python
SAFETY_LABELS = ["safe", "unsafe"]
```

| Label | Description |
|---|---|
| `safe` | Message does not contain harmful or policy-violating content |
| `unsafe` | Message contains harmful, dangerous, or policy-violating content |
### NER / PII — all 32 entity types

Span extraction across 7 groups. Use labels from this list for best results — out-of-taxonomy labels may work via zero-shot generalization but are not benchmarked.

| Group | Labels |
|---|---|
| Person | `person` `first_name` `last_name` `alias` `title` |
| Location | `country` `region` `city` `district` `street` `building` `unit` `postal_code` `landmark` `address` |
| Organization | `company` `government` `education` `media` `product` |
| Contact | `email` `phone` `social_account` `messenger` |
| Identity | `passport` `national_id` `document_id` |
| Temporal | `date_of_birth` `event_date` |
| Financial | `card_number` `bank_account` `crypto_wallet` |

```python
PII_LABELS = [
    "person", "first_name", "last_name", "alias", "title",
    "country", "region", "city", "district", "street",
    "building", "unit", "postal_code", "landmark", "address",
    "company", "government", "education", "media", "product",
    "email", "phone", "social_account", "messenger",
    "passport", "national_id", "document_id",
    "date_of_birth", "event_date",
    "card_number", "bank_account", "crypto_wallet",
]
```
### Adversarial Detection — all 15 labels

Detects attacks against LLM-based systems. Multi-label: a single message can combine multiple attack vectors.

| Subgroup | Labels |
|---|---|
| Jailbreak | `jailbreak_persona` `jailbreak_hypothetical` `jailbreak_roleplay` |
| Injection | `prompt_injection` `indirect_prompt_injection` `instruction_override` |
| Extraction | `data_exfiltration` `system_prompt_extraction` `context_manipulation` `token_manipulation` |
| Advanced | `tool_abuse` `social_engineering` `multi_turn_escalation` `schema_poisoning` |
| Clean | `none` |

```python
ADVERSARIAL_LABELS = [
    "jailbreak_persona", "jailbreak_hypothetical", "jailbreak_roleplay",
    "prompt_injection", "indirect_prompt_injection", "instruction_override",
    "data_exfiltration", "system_prompt_extraction", "context_manipulation", "token_manipulation",
    "tool_abuse", "social_engineering", "multi_turn_escalation", "schema_poisoning",
    "none",
]
```
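Because the task is multi-label, a guardrail usually reduces the triggered labels to a single allow/block decision. A minimal sketch in plain Python, assuming the task returns a list of triggered label strings (possibly just `["none"]`); this result shape is our assumption, not documented gliner2 output:

```python
# Block if any attack label other than the sentinel "none" fired.
# Assumption: multi-label results arrive as a list of label strings.
def is_attack(adversarial_labels: list[str]) -> bool:
    return any(label != "none" for label in adversarial_labels)

assert is_attack(["prompt_injection", "instruction_override"])
assert not is_attack(["none"])
assert not is_attack([])  # nothing triggered at the chosen threshold
```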
### Harmful Content — all 30 labels

Detects harmful content categories. Multi-label: a message can belong to multiple categories simultaneously.

| Subgroup | Labels |
|---|---|
| Interpersonal | `harassment` `hate_speech` `discrimination` `doxxing` `bullying` |
| Violence & Danger | `violence` `dangerous_instructions` `weapons` `drugs` `self_harm` |
| Sexual & Exploitation | `sexual_content` `child_exploitation` `grooming` `sextortion` |
| Deception | `fraud` `scam` `social_engineering` `impersonation` |
| Sensitive Topics | `profanity` `extremism` `political` `war` `espionage` `cybersecurity` `religious` `lgbt` |
| Information | `misinformation` `copyright_violation` `pii_exposure` |
| Clean | `none` |

```python
HARMFUL_LABELS = [
    "harassment", "hate_speech", "discrimination", "doxxing", "bullying",
    "violence", "dangerous_instructions", "weapons", "drugs", "self_harm",
    "sexual_content", "child_exploitation", "grooming", "sextortion",
    "fraud", "scam", "social_engineering", "impersonation",
    "profanity", "extremism", "political", "war", "espionage", "cybersecurity", "religious", "lgbt",
    "misinformation", "copyright_violation", "pii_exposure",
    "none",
]
```
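Not every harmful category warrants the same response. Below is an illustrative severity policy in plain Python; the tier assignments and action names are our assumptions, not part of the model card:

```python
# Illustrative tiers only: map a few harmful-content labels to actions
# and return the strictest action among everything that triggered.
SEVERITY = {
    "child_exploitation": "block",
    "violence": "block",
    "self_harm": "escalate",
    "profanity": "flag",
}

def action_for(labels: list[str]) -> str:
    order = ["block", "escalate", "flag", "allow"]  # strictest first
    actions = {SEVERITY.get(label, "allow") for label in labels}
    return next(a for a in order if a in actions)

print(action_for(["profanity", "violence"]))  # → block
print(action_for(["none"]))                   # → allow
```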
### Intent — all 13 labels

Classifies the intent behind a message. Single-label.

| Group | Labels |
|---|---|
| Benign | `informational` `instructional` `conversational` `persuasive` `creative` `transactional` `emotional_support` `testing` |
| Ambiguous | `ambiguous` `extractive` |
| Malicious | `adversarial` `threatening` `solicitation` |

```python
INTENT_LABELS = [
    "informational", "instructional", "conversational", "persuasive",
    "creative", "transactional", "emotional_support", "testing",
    "ambiguous", "extractive",
    "adversarial", "threatening", "solicitation",
]
```
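The benign/ambiguous/malicious grouping suggests a natural routing policy. A sketch mirroring that grouping; the routing targets (`reject`, `human_review`, `pass_through`) are illustrative names, not part of the model:

```python
# Route a message on its single predicted intent label.
# The label groups mirror the intent table; targets are illustrative.
MALICIOUS = {"adversarial", "threatening", "solicitation"}
AMBIGUOUS = {"ambiguous", "extractive"}

def route(intent: str) -> str:
    if intent in MALICIOUS:
        return "reject"
    if intent in AMBIGUOUS:
        return "human_review"
    return "pass_through"  # all benign intents

assert route("threatening") == "reject"
assert route("extractive") == "human_review"
assert route("informational") == "pass_through"
```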
### Tone of Voice — all 10 labels

Classifies the tone of a message. Single-label.

| Label | Description |
|---|---|
| `neutral` | Matter-of-fact, no strong emotional coloring |
| `formal` | Professional or official register |
| `humorous` | Playful, joking, or light-hearted |
| `sarcastic` | Ironic or mocking tone |
| `distressed` | Anxious, upset, or overwhelmed |
| `confused` | Unclear intent, disoriented phrasing |
| `pleading` | Urgent requests, begging for help or compliance |
| `aggressive` | Hostile, confrontational, or threatening |
| `manipulative` | Attempts to exploit, deceive, or coerce |
| `deceptive` | Deliberately misleading or false framing |

```python
TOV_LABELS = [
    "neutral", "formal", "humorous", "sarcastic",
    "distressed", "confused", "pleading",
    "aggressive", "manipulative", "deceptive",
]
```
Base model: `jhu-clsp/mmBERT-small`