# GLiNER Guard — Unified Multitask Guardrail
One encoder model that replaces your entire guardrail stack: safety classification, PII detection, adversarial attack detection, intent and tone analysis — all in a single forward pass.

145M params · GLiNER2 · biencoder · modernbert multilingual · zero-shot classification, NER and more · no LLM required
## Installation

Install the dependencies (currently via our fork; we'll update this section once the PR to the GLiNER2 repo is merged):

```bash
pip install "gliner2 @ git+https://github.com/bogdanminko/GLiNER2.git@feature/bi-encoder"
```
## Usage

Classify harmful messages and detect PII in a single forward pass:

```python
from gliner2 import GLiNER2

model = GLiNER2.from_pretrained("raft-security-lab/gliner-guard-biencoder")
model.config.cache_labels = True

PII_LABELS = ["person", "location", "email", "phone"]
SAFETY_LABELS = ["safe", "unsafe"]

schema = (
    model.create_schema()
    .entities(entity_types=PII_LABELS, threshold=0.4)
    .classification(task="safety", labels=SAFETY_LABELS)
)

result = model.extract(
    "Send $500 to John Smith at john.smith@gmail.com or I'll leak your photos",
    schema=schema,
)
```
Output:

```python
{'entities': {'person': ['John Smith'],
              'location': [],
              'email': ['john.smith@gmail.com'],
              'phone': []},
 'safety': 'unsafe'}
```
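In an application you will usually act on this result, e.g. scrub the extracted spans before logging or forwarding the message. Below is a minimal post-processing sketch in plain Python, assuming the `entities` dict shape shown in the output above; the `redact` helper is ours, not part of the gliner2 API:

```python
# Hypothetical helper: mask every extracted PII span with a [LABEL]
# placeholder. Assumes the `entities` dict shape shown above.
def redact(text: str, entities: dict[str, list[str]]) -> str:
    for label, spans in entities.items():
        for span in spans:
            text = text.replace(span, f"[{label.upper()}]")
    return text

# The `result` dict mirrors the example output above.
result = {
    "entities": {
        "person": ["John Smith"],
        "location": [],
        "email": ["john.smith@gmail.com"],
        "phone": [],
    },
    "safety": "unsafe",
}

print(redact(
    "Send $500 to John Smith at john.smith@gmail.com or I'll leak your photos",
    result["entities"],
))
# → Send $500 to [PERSON] at [EMAIL] or I'll leak your photos
```

Plain `str.replace` is enough for a sketch; production redaction should prefer character offsets if the library exposes them, so repeated or overlapping spans are handled deterministically.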
## Supported Tasks

GLiNER Guard is purpose-built for six guardrail tasks via a shared encoder — no LLM required.
Thanks to zero-shot generalization, it can also handle custom labels outside the training taxonomy.

| Task | Type | # Labels | Key labels |
|---|---|---|---|
| Safety | single-label | 2 | `safe` `unsafe` |
| PII / NER | span extraction | 32 | `person` `email` `phone` `card_number` `address` |
| Adversarial Detection | multi-label | 15 | `jailbreak_persona` `prompt_injection` `instruction_override` `data_exfiltration` |
| Harmful Content | multi-label | 30 | `hate_speech` `violence` `child_exploitation` `fraud` `pii_exposure` |
| Intent | single-label | 13 | `informational` `adversarial` `threatening` `solicitation` |
| Tone of Voice | single-label | 10 | `neutral` `aggressive` `manipulative` `deceptive` |
### Safety — all 2 labels

Classifies whether a message is safe or unsafe. Single-label.

```python
SAFETY_LABELS = ["safe", "unsafe"]
```

| Label | Description |
|---|---|
| `safe` | Message does not contain harmful or policy-violating content |
| `unsafe` | Message contains harmful, dangerous, or policy-violating content |
### NER / PII — all 32 entity types

Span extraction across 7 groups. Use labels from this list for best results — out-of-taxonomy labels may work via zero-shot generalization but are not benchmarked.

| Group | Labels |
|---|---|
| Person | `person` `first_name` `last_name` `alias` `title` |
| Location | `country` `region` `city` `district` `street` `building` `unit` `postal_code` `landmark` `address` |
| Organization | `company` `government` `education` `media` `product` |
| Contact | `email` `phone` `social_account` `messenger` |
| Identity | `passport` `national_id` `document_id` |
| Temporal | `date_of_birth` `event_date` |
| Financial | `card_number` `bank_account` `crypto_wallet` |

```python
PII_LABELS = [
    "person", "first_name", "last_name", "alias", "title",
    "country", "region", "city", "district", "street",
    "building", "unit", "postal_code", "landmark", "address",
    "company", "government", "education", "media", "product",
    "email", "phone", "social_account", "messenger",
    "passport", "national_id", "document_id",
    "date_of_birth", "event_date",
    "card_number", "bank_account", "crypto_wallet",
]
```
### Adversarial Detection — all 15 labels

Detects attacks against LLM-based systems. Multi-label: a single message can combine multiple attack vectors.

| Subgroup | Labels |
|---|---|
| Jailbreak | `jailbreak_persona` `jailbreak_hypothetical` `jailbreak_roleplay` |
| Injection | `prompt_injection` `indirect_prompt_injection` `instruction_override` |
| Extraction | `data_exfiltration` `system_prompt_extraction` `context_manipulation` `token_manipulation` |
| Advanced | `tool_abuse` `social_engineering` `multi_turn_escalation` `schema_poisoning` |
| Clean | `none` |

```python
ADVERSARIAL_LABELS = [
    "jailbreak_persona", "jailbreak_hypothetical", "jailbreak_roleplay",
    "prompt_injection", "indirect_prompt_injection", "instruction_override",
    "data_exfiltration", "system_prompt_extraction", "context_manipulation", "token_manipulation",
    "tool_abuse", "social_engineering", "multi_turn_escalation", "schema_poisoning",
    "none",
]
```
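Because the task is multi-label, a guardrail usually reduces the triggered labels to a single allow/block decision. A minimal sketch in plain Python, assuming the task returns a list of triggered label strings (possibly just `["none"]`); this result shape is our assumption, not documented gliner2 output:

```python
# Block if any attack label other than the sentinel "none" fired.
# Assumption: multi-label results arrive as a list of label strings.
def is_attack(adversarial_labels: list[str]) -> bool:
    return any(label != "none" for label in adversarial_labels)

assert is_attack(["prompt_injection", "instruction_override"])
assert not is_attack(["none"])
assert not is_attack([])  # nothing triggered at the chosen threshold
```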
### Harmful Content — all 30 labels

Detects harmful content categories. Multi-label: a message can belong to multiple categories simultaneously.

| Subgroup | Labels |
|---|---|
| Interpersonal | `harassment` `hate_speech` `discrimination` `doxxing` `bullying` |
| Violence & Danger | `violence` `dangerous_instructions` `weapons` `drugs` `self_harm` |
| Sexual & Exploitation | `sexual_content` `child_exploitation` `grooming` `sextortion` |
| Deception | `fraud` `scam` `social_engineering` `impersonation` |
| Sensitive Topics | `profanity` `extremism` `political` `war` `espionage` `cybersecurity` `religious` `lgbt` |
| Information | `misinformation` `copyright_violation` `pii_exposure` |
| Clean | `none` |

```python
HARMFUL_LABELS = [
    "harassment", "hate_speech", "discrimination", "doxxing", "bullying",
    "violence", "dangerous_instructions", "weapons", "drugs", "self_harm",
    "sexual_content", "child_exploitation", "grooming", "sextortion",
    "fraud", "scam", "social_engineering", "impersonation",
    "profanity", "extremism", "political", "war", "espionage", "cybersecurity", "religious", "lgbt",
    "misinformation", "copyright_violation", "pii_exposure",
    "none",
]
```
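Not every harmful category warrants the same response. Below is an illustrative severity policy in plain Python; the tier assignments and action names are our assumptions, not part of the model card:

```python
# Illustrative tiers only: map a few harmful-content labels to actions
# and return the strictest action among everything that triggered.
SEVERITY = {
    "child_exploitation": "block",
    "violence": "block",
    "self_harm": "escalate",
    "profanity": "flag",
}

def action_for(labels: list[str]) -> str:
    order = ["block", "escalate", "flag", "allow"]  # strictest first
    actions = {SEVERITY.get(label, "allow") for label in labels}
    return next(a for a in order if a in actions)

print(action_for(["profanity", "violence"]))  # → block
print(action_for(["none"]))                   # → allow
```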
### Intent — all 13 labels

Classifies the intent behind a message. Single-label.

| Group | Labels |
|---|---|
| Benign | `informational` `instructional` `conversational` `persuasive` `creative` `transactional` `emotional_support` `testing` |
| Ambiguous | `ambiguous` `extractive` |
| Malicious | `adversarial` `threatening` `solicitation` |

```python
INTENT_LABELS = [
    "informational", "instructional", "conversational", "persuasive",
    "creative", "transactional", "emotional_support", "testing",
    "ambiguous", "extractive",
    "adversarial", "threatening", "solicitation",
]
```
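The benign/ambiguous/malicious grouping suggests a natural routing policy. A sketch mirroring that grouping; the routing targets (`reject`, `human_review`, `pass_through`) are illustrative names, not part of the model:

```python
# Route a message on its single predicted intent label.
# The label groups mirror the intent table; targets are illustrative.
MALICIOUS = {"adversarial", "threatening", "solicitation"}
AMBIGUOUS = {"ambiguous", "extractive"}

def route(intent: str) -> str:
    if intent in MALICIOUS:
        return "reject"
    if intent in AMBIGUOUS:
        return "human_review"
    return "pass_through"  # all benign intents

assert route("threatening") == "reject"
assert route("extractive") == "human_review"
assert route("informational") == "pass_through"
```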
### Tone of Voice — all 10 labels

Classifies the tone of a message. Single-label.

| Label | Description |
|---|---|
| `neutral` | Matter-of-fact, no strong emotional coloring |
| `formal` | Professional or official register |
| `humorous` | Playful, joking, or light-hearted |
| `sarcastic` | Ironic or mocking tone |
| `distressed` | Anxious, upset, or overwhelmed |
| `confused` | Unclear intent, disoriented phrasing |
| `pleading` | Urgent requests, begging for help or compliance |
| `aggressive` | Hostile, confrontational, or threatening |
| `manipulative` | Attempts to exploit, deceive, or coerce |
| `deceptive` | Deliberately misleading or false framing |

```python
TOV_LABELS = [
    "neutral", "formal", "humorous", "sarcastic",
    "distressed", "confused", "pleading",
    "aggressive", "manipulative", "deceptive",
]
```
Base model: `jhu-clsp/mmBERT-small`