GLiNER Guard — Unified Multitask Guardrail

One encoder model that replaces your entire guardrail stack: safety classification, PII detection, adversarial attack detection, and intent and tone analysis, all in a single forward pass.

(Figure: GLiNER Guard architecture)

145M params · GLiNER2 · bi-encoder · ModernBERT multilingual · zero-shot classification, NER, and more · no LLM required

Installation

Install dependencies
(currently installed from our fork; we'll update this section once the PR to the GLiNER2 repo is merged)

pip install "gliner2 @ git+https://github.com/bogdanminko/GLiNER2.git@feature/bi-encoder"

Usage

Classify harmful messages and detect PII in a single forward pass

from gliner2 import GLiNER2

model = GLiNER2.from_pretrained("raft-security-lab/gliner-guard-biencoder")
model.config.cache_labels = True

PII_LABELS = ["person", "location", "email", "phone"]
SAFETY_LABELS = ["safe", "unsafe"]
schema = (
    model.create_schema()
    .entities(entity_types=PII_LABELS, threshold=0.4)
    .classification(task="safety", labels=SAFETY_LABELS)
)

result = model.extract(
    "Send $500 to John Smith at john.smith@gmail.com or I'll leak your photos",
    schema=schema,
)

Output:

{'entities': {'person': ['John Smith'],
  'location': [],
  'email': ['john.smith@gmail.com'],
  'phone': []},
 'safety': 'unsafe'}
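The nested result dict can drive a simple policy gate downstream. A minimal sketch, assuming the output shape shown above; the `guard_decision` helper and its block/redact/allow policy are illustrative, not part of the library:

```python
def guard_decision(result: dict) -> str:
    """Return 'block' for unsafe messages, 'redact' when any PII span was
    extracted, and 'allow' otherwise."""
    if result.get("safety") == "unsafe":
        return "block"
    entities = result.get("entities", {})
    if any(spans for spans in entities.values()):
        return "redact"
    return "allow"

# Using the example output above:
result = {"entities": {"person": ["John Smith"],
                       "location": [],
                       "email": ["john.smith@gmail.com"],
                       "phone": []},
          "safety": "unsafe"}
print(guard_decision(result))  # block
```

Because both tasks run in the same forward pass, one `extract` call is enough to feed this gate.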

Supported Tasks

GLiNER Guard is purpose-built for 6 guardrail tasks via a shared encoder — no LLM required.
Thanks to zero-shot generalization, it can also handle custom labels outside the training taxonomy.

| Task | Type | Labels | Key labels |
|---|---|---|---|
| Safety | single-label | 2 | `safe` `unsafe` |
| PII / NER | span extraction | 32 | `person` `email` `phone` `card_number` `address` |
| Adversarial Detection | multi-label | 15 | `jailbreak_persona` `prompt_injection` `instruction_override` `data_exfiltration` |
| Harmful Content | multi-label | 30 | `hate_speech` `violence` `child_exploitation` `fraud` `pii_exposure` |
| Intent | single-label | 13 | `informational` `adversarial` `threatening` `solicitation` |
| Tone of Voice | single-label | 10 | `neutral` `aggressive` `manipulative` `deceptive` |
Safety — all 2 labels

Classifies whether a message is safe or unsafe. Single-label.

SAFETY_LABELS = ["safe", "unsafe"]
| Label | Description |
|---|---|
| `safe` | Message does not contain harmful or policy-violating content |
| `unsafe` | Message contains harmful, dangerous, or policy-violating content |
NER / PII — all 32 entity types

Span extraction across 7 groups. Use labels from this list for best results — out-of-taxonomy labels may work via zero-shot generalization but are not benchmarked.

| Group | Labels |
|---|---|
| Person | `person` `first_name` `last_name` `alias` `title` |
| Location | `country` `region` `city` `district` `street` `building` `unit` `postal_code` `landmark` `address` |
| Organization | `company` `government` `education` `media` `product` |
| Contact | `email` `phone` `social_account` `messenger` |
| Identity | `passport` `national_id` `document_id` |
| Temporal | `date_of_birth` `event_date` |
| Financial | `card_number` `bank_account` `crypto_wallet` |
PII_LABELS = [
    "person", "first_name", "last_name", "alias", "title",
    "country", "region", "city", "district", "street",
    "building", "unit", "postal_code", "landmark", "address",
    "company", "government", "education", "media", "product",
    "email", "phone", "social_account", "messenger",
    "passport", "national_id", "document_id",
    "date_of_birth", "event_date",
    "card_number", "bank_account", "crypto_wallet",
]
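Extracted spans are plain substrings of the input, so they can be masked before the text is logged or forwarded. A minimal redaction sketch over the `entities` dict shape shown earlier; the `redact` helper and its placeholder format are illustrative, not a library API:

```python
def redact(text: str, entities: dict[str, list[str]]) -> str:
    """Replace every extracted span with a typed placeholder like [EMAIL]."""
    for label, spans in entities.items():
        for span in spans:
            text = text.replace(span, f"[{label.upper()}]")
    return text

msg = "Send $500 to John Smith at john.smith@gmail.com"
ents = {"person": ["John Smith"], "email": ["john.smith@gmail.com"]}
print(redact(msg, ents))
# Send $500 to [PERSON] at [EMAIL]
```

For production use you would likely redact by character offsets rather than string matching, to handle repeated or overlapping spans.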
Adversarial Detection — all 15 labels

Detects attacks against LLM-based systems. Multi-label: a single message can combine multiple attack vectors.

| Subgroup | Labels |
|---|---|
| Jailbreak | `jailbreak_persona` `jailbreak_hypothetical` `jailbreak_roleplay` |
| Injection | `prompt_injection` `indirect_prompt_injection` `instruction_override` |
| Extraction | `data_exfiltration` `system_prompt_extraction` `context_manipulation` `token_manipulation` |
| Advanced | `tool_abuse` `social_engineering` `multi_turn_escalation` `schema_poisoning` |
| Clean | `none` |
ADVERSARIAL_LABELS = [
    "jailbreak_persona", "jailbreak_hypothetical", "jailbreak_roleplay",
    "prompt_injection", "indirect_prompt_injection", "instruction_override",
    "data_exfiltration", "system_prompt_extraction", "context_manipulation", "token_manipulation",
    "tool_abuse", "social_engineering", "multi_turn_escalation", "schema_poisoning",
    "none",
]
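Since this task is multi-label, downstream code typically thresholds per-label scores rather than taking a single argmax. A sketch of that filtering step, with `none` as the fallback when no attack label fires; the score-dict shape and threshold value are assumptions for illustration, not the documented output format:

```python
def active_labels(scores: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Keep attack labels scoring at or above the threshold;
    fall back to ['none'] when nothing fires."""
    hits = [label for label, score in scores.items()
            if label != "none" and score >= threshold]
    return hits or ["none"]

scores = {"prompt_injection": 0.91, "instruction_override": 0.62,
          "jailbreak_persona": 0.08, "none": 0.03}
print(active_labels(scores))  # ['prompt_injection', 'instruction_override']
```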
Harmful Content — all 30 labels

Detects harmful content categories. Multi-label: a message can belong to multiple categories simultaneously.

| Subgroup | Labels |
|---|---|
| Interpersonal | `harassment` `hate_speech` `discrimination` `doxxing` `bullying` |
| Violence & Danger | `violence` `dangerous_instructions` `weapons` `drugs` `self_harm` |
| Sexual & Exploitation | `sexual_content` `child_exploitation` `grooming` `sextortion` |
| Deception | `fraud` `scam` `social_engineering` `impersonation` |
| Sensitive Topics | `profanity` `extremism` `political` `war` `espionage` `cybersecurity` `religious` `lgbt` |
| Information | `misinformation` `copyright_violation` `pii_exposure` |
| Clean | `none` |
HARMFUL_LABELS = [
    "harassment", "hate_speech", "discrimination", "doxxing", "bullying",
    "violence", "dangerous_instructions", "weapons", "drugs", "self_harm",
    "sexual_content", "child_exploitation", "grooming", "sextortion",
    "fraud", "scam", "social_engineering", "impersonation",
    "profanity", "extremism", "political", "war", "espionage", "cybersecurity", "religious", "lgbt",
    "misinformation", "copyright_violation", "pii_exposure",
    "none",
]
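A message can carry several harmful labels at once, so applications often map labels to severity tiers and escalate on the worst one present. The tier assignments below are an illustrative policy example, not part of the model card:

```python
# Illustrative policy: map a few harmful-content labels to severity tiers.
# Unlisted labels (including "none") default to tier 0.
SEVERITY = {
    "child_exploitation": 3, "violence": 3, "self_harm": 3,
    "fraud": 2, "scam": 2, "doxxing": 2,
    "profanity": 1, "political": 1,
}

def worst_severity(labels: list[str]) -> int:
    """Return the highest severity tier among the predicted labels."""
    return max((SEVERITY.get(label, 0) for label in labels), default=0)

print(worst_severity(["profanity", "fraud"]))  # 2
print(worst_severity(["none"]))                # 0
```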
Intent — all 13 labels

Classifies the intent behind a message. Single-label.

| Group | Labels |
|---|---|
| Benign | `informational` `instructional` `conversational` `persuasive` `creative` `transactional` `emotional_support` `testing` |
| Ambiguous | `ambiguous` `extractive` |
| Malicious | `adversarial` `threatening` `solicitation` |
INTENT_LABELS = [
    "informational", "instructional", "conversational", "persuasive",
    "creative", "transactional", "emotional_support", "testing",
    "ambiguous", "extractive",
    "adversarial", "threatening", "solicitation",
]
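The benign/ambiguous/malicious grouping from the table above can be materialized as a lookup, so a single predicted intent label maps straight to a coarse risk bucket. A small sketch; only the grouping itself comes from the table, the dict names are illustrative:

```python
# Intent tiers as listed in the table above.
INTENT_GROUPS = {
    "benign": ["informational", "instructional", "conversational", "persuasive",
               "creative", "transactional", "emotional_support", "testing"],
    "ambiguous": ["ambiguous", "extractive"],
    "malicious": ["adversarial", "threatening", "solicitation"],
}

# Invert to a flat label -> group lookup.
LABEL_TO_GROUP = {label: group
                  for group, labels in INTENT_GROUPS.items()
                  for label in labels}

print(LABEL_TO_GROUP["threatening"])    # malicious
print(LABEL_TO_GROUP["informational"])  # benign
```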
Tone of Voice — all 10 labels

Classifies the tone of a message. Single-label.

| Label | Description |
|---|---|
| `neutral` | Matter-of-fact, no strong emotional coloring |
| `formal` | Professional or official register |
| `humorous` | Playful, joking, or light-hearted |
| `sarcastic` | Ironic or mocking tone |
| `distressed` | Anxious, upset, or overwhelmed |
| `confused` | Unclear intent, disoriented phrasing |
| `pleading` | Urgent requests, begging for help or compliance |
| `aggressive` | Hostile, confrontational, or threatening |
| `manipulative` | Attempts to exploit, deceive, or coerce |
| `deceptive` | Deliberately misleading or false framing |
TOV_LABELS = [
    "neutral", "formal", "humorous", "sarcastic",
    "distressed", "confused", "pleading",
    "aggressive", "manipulative", "deceptive",
]
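Because intent and tone come out of the same forward pass, they can be combined into cross-task rules. An illustrative escalation rule, assuming the malicious intents from the Intent table and the hostile end of the tone taxonomy; the pairing policy itself is an example, not from the model card:

```python
# Hostile tones and malicious intents, taken from the label tables above.
HOSTILE_TONES = {"aggressive", "manipulative", "deceptive"}
MALICIOUS_INTENTS = {"adversarial", "threatening", "solicitation"}

def needs_review(intent: str, tone: str) -> bool:
    """Flag a message for human review when a malicious intent
    co-occurs with a hostile tone."""
    return intent in MALICIOUS_INTENTS and tone in HOSTILE_TONES

print(needs_review("threatening", "aggressive"))  # True
print(needs_review("informational", "neutral"))   # False
```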