Ensuring the safety of large language models (LLMs) in vertical domains (Education, Finance, Management) is critical. While current alignment efforts primarily target explicit risks like bias and violence, they often fail to address deeper, domain-specific implicit risks. We introduce a comprehensive dataset categorizing risks into Green (Guide), Yellow (Reflect), and Red (Deny), and MENTOR, a framework using a Rule Evolution Cycle (REC) and Activation Steering (RV) to effectively detect and mitigate these subtle risks.
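The three tiers map directly to distinct response policies: actively guide the user for Green, prompt reflection for Yellow, and refuse for Red. The minimal sketch below illustrates one way such a triage step could be represented; the `RiskTier` enum and `triage_response` helper are illustrative assumptions, not part of the released dataset or the MENTOR implementation.

```python
from enum import Enum


class RiskTier(Enum):
    """Three-tier risk taxonomy from the abstract (tier names are the paper's; the enum is illustrative)."""
    GREEN = "guide"      # benign but sensitive: answer and actively guide the user
    YELLOW = "reflect"   # ambiguous intent: answer cautiously and prompt the user to reflect
    RED = "deny"         # clearly harmful: refuse the request


def triage_response(tier: RiskTier, draft_answer: str) -> str:
    """Route a drafted answer through the tier-specific policy (hypothetical helper)."""
    if tier is RiskTier.RED:
        return "I can't help with that request."
    if tier is RiskTier.YELLOW:
        return f"{draft_answer}\n\nBefore acting on this, please consider the possible consequences."
    return draft_answer  # GREEN: answer and guide normally


print(triage_response(RiskTier.YELLOW, "Here is a general overview..."))
```

In practice the tier would be assigned by a classifier or rule set rather than passed in directly; the point of the sketch is only the one-to-one mapping from tier to response policy.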
A domain-specific risk evaluation benchmark covering various queries.
Figure 1: The "Litmus Strip" framework (partial examples). The area below the dashed line illustrates the implicit risks hidden deep within vertical domains such as Education, Finance, and Management, much as a litmus strip reveals hidden chemical components.
Metacognition-Driven Self-Evolution for Implicit Risk Mitigation
Figure 2: The MENTOR Architecture.
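Figure 2 is not reproduced here, but the components named in the abstract suggest an outer self-evolution loop (the Rule Evolution Cycle) that repeatedly evaluates the current rule set, inspects failures, and refines the rules until the jailbreak rate stops improving. The sketch below is a toy rendering of that loop under those assumptions; `evaluate`, `refine_rules`, and `rule_evolution_cycle` are hypothetical, deliberately simplistic stand-ins rather than MENTOR's actual API, and the activation-steering (RV) component is omitted.

```python
import random
from typing import List, Tuple


def evaluate(rules: List[str], probes: List[str]) -> float:
    """Toy evaluator: a probe 'slips through' if no rule keyword appears in it (stand-in for a real judge)."""
    failures = [p for p in probes if not any(r in p for r in rules)]
    return len(failures) / len(probes)


def refine_rules(rules: List[str], probes: List[str]) -> List[str]:
    """Toy 'metacognitive' step: turn one uncovered failure into a new rule keyword."""
    uncovered = [p for p in probes if not any(r in p for r in rules)]
    if uncovered:
        return rules + [random.choice(uncovered).split()[0]]
    return rules


def rule_evolution_cycle(rules: List[str], probes: List[str],
                         max_iters: int = 10, tol: float = 1e-3) -> Tuple[List[str], float]:
    """Iterate evaluate -> refine until the jailbreak rate stops improving."""
    best = evaluate(rules, probes)
    for _ in range(max_iters):
        candidate = refine_rules(rules, probes)
        rate = evaluate(candidate, probes)
        if best - rate < tol:   # no meaningful improvement: stop evolving
            break
        rules, best = candidate, rate
    return rules, best


# Hypothetical domain-flavored probes, purely for demonstration.
probes = ["hide fees in the tuition contract", "grade students by family income"]
rules, rate = rule_evolution_cycle(["fees"], probes)
print(rules, rate)
```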
Main Benchmark Performance
| Model | Jailbreak (Overall, Original) | Jailbreak (Overall, +Rules) | Jailbreak (Overall, +Meta1) | Jailbreak (Overall, +Meta2) | Jailbreak (Edu, Original) | Jailbreak (Mgt, Original) | Jailbreak (Fin, Original) | Immunity Score |
|---|---|---|---|---|---|---|---|---|
| GPT-5-2025-08-07 | 0.308 | 0.098 | 0.042 | 0.027 | 0.364 | 0.370 | 0.190 | 0.855 |
| Doubao-Seed-1.6 | 0.628 | 0.055 | 0.021 | 0.011 | 0.576 | 0.616 | 0.692 | 0.680 |
| Llama-4-Maverick | 0.752 | 0.227 | 0.131 | 0.088 | 0.696 | 0.844 | 0.716 | 0.581 |
| GPT-4o | 0.834 | 0.135 | 0.061 | 0.038 | 0.804 | 0.872 | 0.826 | 0.543 |
| Qwen3-235B | 0.437 | 0.061 | 0.030 | 0.019 | 0.492 | 0.518 | 0.300 | 0.771 |
| DeepSeek-R1-0528 | 0.625 | 0.070 | 0.035 | 0.021 | 0.672 | 0.682 | 0.522 | 0.659 |
| o3-high-2025-04 | 0.473 | 0.067 | 0.020 | 0.011 | 0.608 | 0.482 | 0.328 | 0.749 |
| Llama-3.1-8B | 0.661 | 0.268 | 0.172 | 0.131 | 0.658 | 0.724 | 0.600 | 0.617 |
| Mistral-large | 0.874 | 0.148 | 0.073 | 0.059 | 0.790 | 0.920 | 0.912 | 0.496 |
| Claude Sonnet 4 | 0.208 | 0.029 | 0.009 | 0.003 | 0.280 | 0.170 | 0.174 | 0.906 |
| Grok-4 | 0.631 | 0.073 | 0.029 | 0.016 | 0.810 | 0.596 | 0.486 | 0.659 |
| Gemini-2.5-Pro | 0.440 | 0.017 | 0.003 | 0.002 | 0.418 | 0.502 | 0.400 | 0.761 |
| Kimi-K2-Instruct | 0.331 | 0.015 | 0.005 | 0.003 | 0.426 | 0.346 | 0.220 | 0.831 |
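For orientation, the jailbreak-rate columns can be read as the fraction of adversarial probes whose responses a judge marks as unsafe, aggregated overall and per domain. The sketch below shows that aggregation; the `Judgment` record layout is an assumption made for illustration, and the Immunity Score follows its own scoring rule that is not reconstructed here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Judgment:
    """One judged model response (field layout is an assumption for illustration)."""
    domain: str        # "Edu", "Mgt", or "Fin"
    jailbroken: bool   # True if the judge deemed the response unsafe


def jailbreak_rates(records: list) -> dict:
    """Overall and per-domain jailbreak rates, as reported in the table above."""
    by_domain = defaultdict(list)
    for r in records:
        by_domain[r.domain].append(r.jailbroken)
    rates = {d: sum(v) / len(v) for d, v in by_domain.items()}
    rates["Overall"] = sum(r.jailbroken for r in records) / len(records)
    return rates


demo = [Judgment("Edu", True), Judgment("Edu", False), Judgment("Fin", False)]
print(jailbreak_rates(demo))  # {'Edu': 0.5, 'Fin': 0.0, 'Overall': 0.333...}
```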