BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents

Highlights

BrowseSafe is a specialized security model designed to protect AI browser agents from prompt injection attacks embedded in real-world web content. It is part of a multi-layered defense strategy that combines architectural and model-based defenses against evolving prompt injection attacks.

  • State-of-the-Art Detection: Achieves a 90.4% F1 score on the BrowseSafe-Bench test set.

  • Real-Time Latency: Optimized for agent loops, enabling async security checks without degrading user experience (see the sketch after this list).

  • Robustness to Distractors: Specifically trained to distinguish between malicious instructions and benign, structure-rich HTML "noise" (e.g., accessibility attributes, hidden form fields) that often confuses standard detectors.

  • Comprehensive Coverage: Validated against 11 attack types with different security criticality levels, 9 injection strategies, 5 distractor types, 5 context-aware generation types, 5 domains, 3 linguistic styles, and 5 evaluation metrics, ensuring broad-spectrum defense capabilities.
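
As a minimal sketch of the async-check pattern behind the "Real-Time Latency" highlight, the snippet below runs a security check concurrently with an agent step using asyncio; agent_step and check_page are hypothetical placeholders, not part of BrowseSafe's API:

import asyncio

async def agent_step(page_html: str) -> str:
    # Hypothetical placeholder: the agent acts on the page.
    return "agent result"

async def check_page(page_html: str) -> bool:
    # Hypothetical placeholder: call the BrowseSafe classifier
    # (e.g., an async request to a model server) and return True
    # if an injection is detected.
    return False

async def process_page(page_html: str) -> None:
    # Launch the security check without blocking the agent's next action.
    check = asyncio.create_task(check_page(page_html))
    result = await agent_step(page_html)
    if await check:
        print("Injection detected; discarding result:", result)
    else:
        print("Page is clean; result:", result)

asyncio.run(process_page("<html>...</html>"))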

Model Overview

BrowseSafe is based on the Qwen3-30B-A3B architecture.

  • Type: Causal language model (MoE), fine-tuned (SFT) for binary classification
  • Training Stage: Post-training (Fine-tuning on BrowseSafe-Bench)
  • Dataset: BrowseSafe-Bench
  • Base Model: Qwen/Qwen3-30B-A3B-Instruct-2507
  • Context Length: Up to 16,384 tokens
  • Input: Raw HTML content
  • Output: Single-token classification ("yes" or "no")
  • License: MIT

Performance

We evaluated BrowseSafe on BrowseSafe-Bench, a realistic benchmark comprising 3,691 test samples of complex HTML payloads.

Model Name          Config          F1 Score   Precision   Recall   Balanced Accuracy   Refusals
PromptGuard-2       22M             0.350      0.975       0.213    0.606               0
PromptGuard-2       86M             0.360      0.983       0.221    0.611               0
gpt-oss-safeguard   20B / Low       0.790      0.986       0.658    0.826               0
gpt-oss-safeguard   20B / Medium    0.796      0.994       0.664    0.832               0
gpt-oss-safeguard   120B / Low      0.730      0.994       0.577    0.788               0
gpt-oss-safeguard   120B / Medium   0.741      0.997       0.589    0.795               0
GPT-5 mini          Minimal         0.750      0.735       0.767    0.746               0
GPT-5 mini          Low             0.854      0.949       0.776    0.868               0
GPT-5 mini          Medium          0.853      0.945       0.777    0.866               0
GPT-5 mini          High            0.852      0.957       0.768    0.868               0
GPT-5               Minimal         0.849      0.881       0.819    0.855               0
GPT-5               Low             0.854      0.928       0.791    0.866               0
GPT-5               Medium          0.855      0.930       0.792    0.867               0
GPT-5               High            0.840      0.882       0.802    0.848               0
Haiku 4.5           No Thinking     0.810      0.760       0.866    0.798               0
Haiku 4.5           1K              0.809      0.755       0.872    0.795               0
Haiku 4.5           8K              0.805      0.751       0.868    0.792               0
Haiku 4.5           32K             0.808      0.760       0.863    0.796               0
Sonnet 4.5          No Thinking     0.807      0.763       0.855    0.796               419
Sonnet 4.5          1K              0.862      0.929       0.803    0.872               613
Sonnet 4.5          8K              0.863      0.931       0.805    0.873               650
Sonnet 4.5          32K             0.863      0.935       0.801    0.873               669
BrowseSafe          -               0.904      0.978       0.841    0.912               0

Evaluation Metrics

BrowseSafe-Bench evaluates models across five metrics. Full details can be found in the paper.

Quickstart

The code for Qwen3-MoE is included in the latest Hugging Face transformers library. We recommend using transformers>=4.55.4.

Below is a code snippet illustrating how to use BrowseSafe.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "perplexity-ai/browsesafe-bench"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "<html>...</html>"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# generate the classification verdict (a single token, "yes" or "no")
generated_ids = model.generate(**model_inputs, max_new_tokens=1)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True)

print("content:", content)

Processing Long HTML Contexts

Web pages often exceed standard context windows. To handle this, BrowseSafe utilizes a chunking strategy (as described in the paper) to process content that exceeds the model's effective context limit.

  • Strategy: Partition the document into non-overlapping chunks at token boundaries.
  • Aggregation: Apply a conservative "OR" logic—if any single chunk is classified as VIOLATES, the entire document is flagged. This ensures that malicious payloads hidden deep within long pages are not missed.

A reference implementation can be found here.
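
Below is a minimal sketch of this chunk-and-aggregate strategy (not the reference implementation). It assumes the tokenizer and model loaded in the Quickstart, treats a "yes" verdict as a detected injection (an assumption about the label mapping), and uses a chunk size below the 16,384-token limit to leave headroom for the chat template:

MAX_CHUNK_TOKENS = 12000  # assumption: headroom below the 16,384-token context limit

def classify_chunk(html_chunk):
    # Reuse the Quickstart flow to obtain the single-token verdict for one chunk.
    messages = [{"role": "user", "content": html_chunk}]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=1)
    verdict = tokenizer.decode(
        out[0][len(inputs.input_ids[0]):], skip_special_tokens=True
    )
    return verdict.strip().lower()

def classify_long_html(html):
    # Partition into non-overlapping chunks at token boundaries, then apply
    # "OR" aggregation: if any chunk is flagged, the whole document is flagged.
    token_ids = tokenizer(html, add_special_tokens=False).input_ids
    for start in range(0, len(token_ids), MAX_CHUNK_TOKENS):
        chunk = tokenizer.decode(token_ids[start:start + MAX_CHUNK_TOKENS])
        if classify_chunk(chunk) == "yes":  # assumption: "yes" == injection
            return True
    return False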

Best Practices

To achieve optimal defense performance, pass the full, raw HTML content to the model. Running the model on extracted text may degrade detection, since injections often hide in markup such as hidden form fields or accessibility attributes.
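
A minimal sketch of this practice, assuming the requests library and the classify_long_html helper sketched above (both assumptions, not part of an official API):

import requests

# Fetch the page and keep the raw markup: injections often hide in
# attributes and hidden elements that text extraction would discard.
raw_html = requests.get("https://example.com", timeout=10).text

print("injection detected:", classify_long_html(raw_html))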

Citation

If you use or reference this work, please cite:

@article{browsesafe2025,
  title         = {BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents},
  author        = {Kaiyuan Zhang and Mark Tenenholtz and Kyle Polley and Jerry Ma and Denis Yarats and Ninghui Li},
  eprint        = {2511.20597},
  archivePrefix = {arXiv},
  year          = {2025}
}