BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents

Highlights

BrowseSafe is a specialized security model designed to protect AI browser agents from prompt injection attacks embedded in real-world web content. It is part of a multi-layered defense strategy that combines architectural and model-based defenses against evolving prompt injection attacks.

  • State-of-the-Art Detection: Achieves a 90.4% F1 score on the BrowseSafe-Bench test set.

  • Real-Time Latency: Optimized for agent loops, enabling async security checks without degrading user experience (see the sketch after this list).

  • Robustness to Distractors: Specifically trained to distinguish between malicious instructions and benign, structure-rich HTML "noise" (e.g., accessibility attributes, hidden form fields) that often confuses standard detectors.

  • Comprehensive Coverage: Validated against 11 attack types with different security criticality levels, 9 injection strategies, 5 distractor types, 5 context-aware generation types, 5 domains, 3 linguistic styles, and 5 evaluation metrics, ensuring broad-spectrum defense capabilities.
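
As a minimal sketch of the async-check pattern behind the "Real-Time Latency" highlight, the snippet below runs a security check concurrently with an agent step using asyncio; agent_step and check_page are hypothetical placeholders, not part of BrowseSafe's API:

import asyncio

async def agent_step(page_html: str) -> str:
    # Hypothetical placeholder: the agent acts on the page.
    return "agent result"

async def check_page(page_html: str) -> bool:
    # Hypothetical placeholder: call the BrowseSafe classifier
    # (e.g., an async request to a model server) and return True
    # if an injection is detected.
    return False

async def process_page(page_html: str) -> None:
    # Launch the security check without blocking the agent's next action.
    check = asyncio.create_task(check_page(page_html))
    result = await agent_step(page_html)
    if await check:
        print("Injection detected; discarding result:", result)
    else:
        print("Page is clean; result:", result)

asyncio.run(process_page("<html>...</html>"))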

Model Overview

BrowseSafe is based on the Qwen3-30B-A3B architecture.

  • Type: Causal language model (MoE), fine-tuned (SFT) for binary classification
  • Training Stage: Post-training (Fine-tuning on BrowseSafe-Bench)
  • Dataset: BrowseSafe-Bench
  • Base Model: Qwen/Qwen3-30B-A3B-Instruct-2507
  • Context Length: Up to 16,384 tokens
  • Input: Raw HTML content
  • Output: Single-token classification ("yes" or "no")
  • License: MIT

Performance

We evaluated BrowseSafe on BrowseSafe-Bench, a realistic benchmark comprising 3,691 test samples of complex HTML payloads.

Model Name          Config          F1 Score   Precision   Recall   Balanced Accuracy   Refusals
PromptGuard-2       22M             0.350      0.975       0.213    0.606               0
PromptGuard-2       86M             0.360      0.983       0.221    0.611               0
gpt-oss-safeguard   20B / Low       0.790      0.986       0.658    0.826               0
gpt-oss-safeguard   20B / Medium    0.796      0.994       0.664    0.832               0
gpt-oss-safeguard   120B / Low      0.730      0.994       0.577    0.788               0
gpt-oss-safeguard   120B / Medium   0.741      0.997       0.589    0.795               0
GPT-5 mini          Minimal         0.750      0.735       0.767    0.746               0
GPT-5 mini          Low             0.854      0.949       0.776    0.868               0
GPT-5 mini          Medium          0.853      0.945       0.777    0.866               0
GPT-5 mini          High            0.852      0.957       0.768    0.868               0
GPT-5               Minimal         0.849      0.881       0.819    0.855               0
GPT-5               Low             0.854      0.928       0.791    0.866               0
GPT-5               Medium          0.855      0.930       0.792    0.867               0
GPT-5               High            0.840      0.882       0.802    0.848               0
Haiku 4.5           No Thinking     0.810      0.760       0.866    0.798               0
Haiku 4.5           1K              0.809      0.755       0.872    0.795               0
Haiku 4.5           8K              0.805      0.751       0.868    0.792               0
Haiku 4.5           32K             0.808      0.760       0.863    0.796               0
Sonnet 4.5          No Thinking     0.807      0.763       0.855    0.796               419
Sonnet 4.5          1K              0.862      0.929       0.803    0.872               613
Sonnet 4.5          8K              0.863      0.931       0.805    0.873               650
Sonnet 4.5          32K             0.863      0.935       0.801    0.873               669
BrowseSafe          -               0.904      0.978       0.841    0.912               0

Evaluation Metrics

BrowseSafe-Bench evaluates models across five metrics. Full details can be found in the paper.

Quickstart

The code for Qwen3-MoE is included in the latest Hugging Face transformers library. We recommend using transformers>=4.55.4.

Below is a code snippet illustrating how to use BrowseSafe.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "perplexity-ai/browsesafe-bench"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "<html>...</html>"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# generate the classification verdict (a single token, "yes" or "no")
generated_ids = model.generate(**model_inputs, max_new_tokens=1)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True)

print("content:", content)

Processing Long HTML Contexts

Web pages often exceed standard context windows. To handle this, BrowseSafe utilizes a chunking strategy (as described in the paper) to process content that exceeds the model's effective context limit.

  • Strategy: Partition the document into non-overlapping chunks at token boundaries.
  • Aggregation: Apply a conservative "OR" logic—if any single chunk is classified as VIOLATES, the entire document is flagged. This ensures that malicious payloads hidden deep within long pages are not missed.

A reference implementation can be found here.
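
Below is a minimal sketch of this chunk-and-aggregate strategy (not the reference implementation). It assumes the tokenizer and model loaded in the Quickstart, treats a "yes" verdict as a detected injection (an assumption about the label mapping), and uses a chunk size below the 16,384-token limit to leave headroom for the chat template:

MAX_CHUNK_TOKENS = 12000  # assumption: headroom below the 16,384-token context limit

def classify_chunk(html_chunk):
    # Reuse the Quickstart flow to obtain the single-token verdict for one chunk.
    messages = [{"role": "user", "content": html_chunk}]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=1)
    verdict = tokenizer.decode(
        out[0][len(inputs.input_ids[0]):], skip_special_tokens=True
    )
    return verdict.strip().lower()

def classify_long_html(html):
    # Partition into non-overlapping chunks at token boundaries, then apply
    # "OR" aggregation: if any chunk is flagged, the whole document is flagged.
    token_ids = tokenizer(html, add_special_tokens=False).input_ids
    for start in range(0, len(token_ids), MAX_CHUNK_TOKENS):
        chunk = tokenizer.decode(token_ids[start:start + MAX_CHUNK_TOKENS])
        if classify_chunk(chunk) == "yes":  # assumption: "yes" == injection
            return True
    return False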

Best Practices

To achieve optimal defense performance, pass the full, raw HTML content to the model. Running the model on extracted text may degrade detection, since injections often hide in markup such as hidden form fields or accessibility attributes.
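
A minimal sketch of this practice, assuming the requests library and the classify_long_html helper sketched above (both assumptions, not part of an official API):

import requests

# Fetch the page and keep the raw markup: injections often hide in
# attributes and hidden elements that text extraction would discard.
raw_html = requests.get("https://example.com", timeout=10).text

print("injection detected:", classify_long_html(raw_html))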

Citation

If you use or reference this work, please cite:

@article{browsesafe2025,
  title         = {BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents},
  author        = {Kaiyuan Zhang and Mark Tenenholtz and Kyle Polley and Jerry Ma and Denis Yarats and Ninghui Li},
  eprint        = {2511.20597},
  archivePrefix = {arXiv},
  year          = {2025}
}