MARL: Runtime Middleware That Reduces LLM Hallucination Without Fine-Tuning
TL;DR: MARL (Model-Agnostic Runtime Middleware for LLMs) is middleware that inserts a multi-stage self-verification pipeline at runtime to reduce hallucination, without touching model weights. Change one line (base_url) and it works instantly with any OpenAI API-compatible LLM: GPT-5.4, Claude, Gemini, Llama, and more.
pip install marl-middleware
- Demo: VIDraft/MARL
- PyPI: marl-middleware
- GitHub: Vidraft/MARL
- ClawHub: marl-middleware
Motivation: The Metacognitive Gap (MA-ER Gap)
MMLU has crossed 90%. GPQA is saturating. HumanEval has hit its ceiling. Yet every single one of these benchmarks shares a common blind spot: none of them has ever measured whether AI can recognize its own errors and correct them.
Cognitive psychology calls this ability metacognition: the capacity to know what you know and what you don't. It is the real dividing line between human experts and novices, and a prerequisite for AGI.
In February 2026, we released FINAL Bench, the world's first benchmark dedicated to measuring AI metacognition. We evaluated 9 state-of-the-art models (GPT-5.2, Claude Opus 4.6, Gemini 3 Pro, and others) across 1,800 assessment runs. The results were striking:
| Metric | Description | Mean |
|---|---|---|
| MA (Metacognitive Accuracy) | Ability to say "I might be wrong" | 0.694 |
| ER (Error Recovery) | Ability to actually find and fix errors | 0.302 |
| MA-ER Gap | The chasm between knowing and doing | 0.392 |
AI models sense that they could be wrong, but they cannot actually fix what's broken.
The structural cause is clear. Current LLMs are autoregressive: once token generation starts, each token is conditioned on the previous ones. The model cannot stop mid-stream and say "wait, I was wrong." If the initial framing is flawed, it rides that trajectory to the end, which is exactly why hallucinations are generated with high confidence.
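MARL was built to address both limitations: the gap between sensing an error and actually fixing it, and the inability of a single generation pass to revise a flawed trajectory.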
Core Architecture: Multi-Agent Self-Verification Pipeline
MARL decomposes a single LLM call into multiple independent specialist roles arranged in a pipeline:
User Query
       │
       ▼
┌─────────────────────────────────────────────────────────┐
│ S1: Hypothesis  → Designs optimal approach              │
│      │                                                  │
│      ▼                                                  │
│ S2: Solver      → Performs deep reasoning               │
│      │                                                  │
│      ▼                                                  │
│ S3: Auditor     → Audits for gaps, contradictions       │
│      │                                                  │
│      ▼                                                  │
│ S4: Verifier    → Adversarial cross-validation          │
│      │                                                  │
│      ▼                                                  │
│ S5: Synthesizer → Integrates all feedback,              │
│                   generates entirely new final response │
└─────────────────────────────────────────────────────────┘
       │
       ▼
Final Response (only the refined answer reaches the user)
Inter-agent communication uses a proprietary Weighted Attention Matrix, with two mechanisms operating simultaneously:
- Cooperative Reinforcement: Knowledge accumulates and compounds across S1 → S2 → S3
- Adversarial Cross-Validation: S4 deliberately challenges S2's conclusions from an opposing perspective
This dual mechanism is the key. A single LLM call cannot structurally negate itself. But in MARL, the Verifier (S4) re-examines the draft for errors, and the Synthesizer (S5) produces a completely new final response that incorporates all feedback. MARL transforms "answer in one shot" into "think, doubt, correct, and rewrite."
Our FINAL Bench research showed that applying metacognitive scaffolding improved performance on the highest-difficulty tasks by over 70%, and 94.8% of that improvement came from Error Recovery, confirming that this structural transformation works in practice.
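To make the staged flow concrete, here is a minimal sketch of a five-stage chain in which each stage sees the question plus every earlier stage's output. It is an illustration only: the stage instructions, `run_pipeline`, and `echo_llm` are hypothetical names invented for this example, and MARL's actual prompts and Weighted Attention Matrix are proprietary and not modeled here.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stage names and instructions, paraphrased from the diagram.
# These are NOT MARL's real prompts, which are not published.
STAGES: List[Tuple[str, str]] = [
    ("S1_hypothesis", "Design the best approach to the question."),
    ("S2_solver", "Solve the question following the approach."),
    ("S3_auditor", "List gaps and contradictions in the draft."),
    ("S4_verifier", "Argue against the draft's conclusions."),
    ("S5_synthesizer", "Write a new final answer using all feedback."),
]


def run_pipeline(query: str, llm: Callable[[str], str]) -> Dict[str, object]:
    """Run each stage in order, feeding every earlier output into the next prompt."""
    trace: List[Dict[str, str]] = []
    for name, instruction in STAGES:
        context = "\n".join(f"[{t['stage']}] {t['output']}" for t in trace)
        prompt = f"{instruction}\nQuestion: {query}\n{context}".strip()
        trace.append({"stage": name, "output": llm(prompt)})
    # Only the last stage's output is the answer; the trace is kept for inspection.
    return {"answer": trace[-1]["output"], "trace": trace}


def echo_llm(prompt: str) -> str:
    """Offline stand-in for a real model call: reports how many prompt lines it saw."""
    return f"saw {len(prompt.splitlines())} lines"


result = run_pipeline("What is 2 + 2?", echo_llm)
print(result["answer"])
```

Swapping `echo_llm` for a function that calls a real model would turn this into a working, if naive, chain; the per-stage `trace` plays the role of the stage-by-stage log.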
Not Fine-Tuning. Not RAG. A Third Approach.
| | Fine-Tuning | RAG | MARL |
|---|---|---|---|
| Target | Modifies model weights | Supplements external knowledge | Restructures reasoning process |
| Cost | Tens of thousands USD in GPU | Vector DB infrastructure | One line of code |
| Time | Weeks | Days | Instant |
| Model lock-in | Tied to specific model | Model-agnostic | Model-agnostic |
| Problem solved | Domain adaptation | Knowledge gaps | Reasoning errors & hallucination |
Because MARL never touches model weights, you can switch from GPT-5.4 to Claude to open-source Llama, and the MARL layer stays intact. For organizations running multi-LLM strategies, this means consistent quality assurance without vendor lock-in.
Quick Start
Installation
Four ways to get started immediately:
# PyPI
pip install marl-middleware
# Docker
docker pull vidraft/marl:latest
docker run -p 8080:8080 vidraft/marl:latest
# ClawHub (OpenClaw AI agent ecosystem)
clawhub install marl-middleware
# GitHub
git clone https://github.com/Vidraft/MARL.git
cd MARL && pip install -e .
Integration with Existing Code
Change one line in your existing OpenAI API code:
from openai import OpenAI
# Before
client = OpenAI(api_key="sk-...")
# After: just add base_url
client = OpenAI(
    api_key="sk-...",
    base_url="http://localhost:8080/v1",  # MARL server
)
# Everything else stays the same.
# All calls now pass through the multi-stage pipeline automatically.
response = client.chat.completions.create(
    model="gpt-5.4",
    messages=[{"role": "user", "content": "Explain error correction in quantum computing"}],
)
Switching Emergence Engines
Append ::mode to the model name to activate any of 9 specialized engines:
# Pharmaceutical emergence engine
response = client.chat.completions.create(
    model="gpt-5.4::pharma",
    messages=[{"role": "user", "content": "Propose 3rd-line target candidates for EGFR-mutant NSCLC"}],
)
# Legal emergence engine
response = client.chat.completions.create(
    model="claude-opus-4.6::law",
    messages=[{"role": "user", "content": "Analyze copyright attribution for AI-generated content"}],
)
# Works with any LLM model name
response = client.chat.completions.create(
    model="llama-3.3-70b::create",
    messages=[{"role": "user", "content": "Outline a short story set in a time-traveling library"}],
)
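The `::mode` suffix convention can be pictured as a simple string split on the model name. The sketch below is an assumption about how such a tag might be separated, not MARL's actual parser: `split_model_tag`, `KNOWN_MODES`, and the choice to reject unknown tags are all hypothetical. The mode names come from the engine table in this article, and "insight" is the default mode it mentions.

```python
from typing import Tuple

# Mode tags listed in this article; "insight" is the default mode named in the text.
KNOWN_MODES = {"invent", "create", "doc", "recipe", "pharma",
               "genomics", "chemistry", "ecology", "law"}
DEFAULT_MODE = "insight"


def split_model_tag(model: str) -> Tuple[str, str]:
    """Split 'name::mode' into (name, mode); use the default mode when no tag is given."""
    name, sep, mode = model.partition("::")
    if not sep:
        return name, DEFAULT_MODE
    if mode not in KNOWN_MODES:
        # Rejecting unknown tags is a design choice of this sketch, not documented behavior.
        raise ValueError(f"unknown mode tag: {mode!r}")
    return name, mode


print(split_model_tag("gpt-5.4::pharma"))
print(split_model_tag("llama-3.3-70b"))
```

Because the tag rides inside the `model` string, client code needs no new parameters, which is what keeps the integration to a single changed line.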
9 Domain-Specific Emergence Engines
Beyond the default reasoning enhancement (Insight mode), MARL ships with 9 specialized emergence engines. Each engine operates on a domain-specific knowledge matrix with cross-layer combination rules that generate ideas no single LLM call can produce.
| Mode Tag | Engine | Data Scale | Cross-Layer Structure |
|---|---|---|---|
| ::invent | Invention & Patents | 4,275 items | 6 layers × 6 emergence rules |
| ::create | General Creative | 493 seeds | 11 categories × 5 rules |
| ::doc | Document Generation | 16 seeds | Precision verification optimized |
| ::recipe | Culinary | 5 layers | Cooking × texture × architecture × cultural grammar |
| ::pharma | Drug Discovery | 172 items | Target × mechanism × delivery × disease × molecular |
| ::genomics | Genomics & Bio | 104 items | Gene × protein × pathway × phenotype × platform |
| ::chemistry | Chemistry & Materials | 135 items | Element × bond × structure × property × application |
| ::ecology | Ecology & Environment | 105 items | Species × ecosystem × service × threat × strategy |
| ::law | Legal & Regulatory | 59 items | Jurisdiction × domain × instrument × mechanism × dispute |
A total of 5,538 expert data items are cross-combined across multiple layers. Each engine has 5 independent emergence rules and 10 cross-layer bonus pairs.
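One plausible reading of the "10 cross-layer bonus pairs" figure is combinatorial: five layers yield exactly C(5, 2) = 10 unordered pairs. Whether MARL defines its bonus pairs this way is not stated, so treat the sketch below as an assumption; the layer names are copied from the ::pharma row purely for illustration.

```python
from itertools import combinations

# Layer names taken from the ::pharma row; treating them as "the five layers" is an assumption.
layers = ["target", "mechanism", "delivery", "disease", "molecular"]

# Every unordered pair of distinct layers.
pairs = list(combinations(layers, 2))

print(len(pairs))
for first, second in pairs:
    print(f"{first} x {second}")
```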
Open Core: Protecting IP While Ensuring Transparency
MARL follows an Open Core model that balances intellectual property protection with developer accessibility:
| Component | Visibility | Rationale |
|---|---|---|
| Core reasoning engine (pipeline, attention matrix, agent prompts) | Compiled binary (.so) | Proprietary technology protection |
| Installation, testing, API integration interfaces | Fully open | Immediate adoption & integration |
| A/B test demo (Raw LLM vs MARL) | HF Spaces | Instant effect verification |
| Stage-by-stage reasoning logs | Transparent | Glass Box traceability |
Critically, the metacognitive process, in which the AI discovers its own errors and corrects them, is recorded transparently at each stage. If traditional LLMs are black boxes that only show results, MARL provides a Glass Box for reasoning. For the first time, users can trace why an answer was produced, where an error was caught, and how it was corrected.
Connection to FINAL Bench
MARL and FINAL Bench are twin projects born from the same research:
FINAL Bench (Diagnosis)              MARL (Treatment)
───────────────────────              ───────────────────────
"Quantitatively measures             "Actually closes
 AI metacognitive ability"            that gap"
MA-ER Gap = 0.392               →    Multi-stage self-verification
                                     strengthens ER capability
TICOS 8-type taxonomy           →    Optimized verification
                                     strategy per type
from datasets import load_dataset
dataset = load_dataset("FINAL-Bench/Metacognitive", split="train")
print(f"Total tasks: {len(dataset)}") # 100
print("Domains: 15 / TICOS types: 8 / Difficulty grades: 3")
- Paper: SSRN (under review at a leading international AI venue)
- Dataset: FINAL-Bench/Metacognitive (HuggingFace Global Trending #5)
- Leaderboard: FINAL-Bench/Leaderboard (Space of the Week)
OpenClaw ClawHub Integration
MARL is officially registered on ClawHub, the skill marketplace of OpenClaw, an AI agent platform with 260K+ GitHub stars. Among 3,200+ registered AI agent skills, MARL is the first middleware skill in the Reasoning Enhancement category.
clawhub install marl-middleware
Once installed, point your agent's baseURL to the MARL server. Before your agent sends an email, writes code, or analyzes a document, MARL's multi-stage pipeline automatically intervenes: thinking deeply, questioning assumptions, and correcting errors before execution.
If OpenClaw agents are execution specialists, MARL is the metacognition upgrade for their brain. It structurally solves the biggest weakness of current AI agents: acting without thinking enough first.
Live Demo
Try the side-by-side A/B comparison of Raw LLM vs MARL-Enhanced on HuggingFace Spaces:
VIDraft/MARL
Each reasoning stage (S1 through S5) is recorded transparently, letting you trace "why this answer was produced" for the first time.
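The same A/B comparison can be approximated locally by sending one prompt through two clients, one pointed at the provider and one at a MARL server. The sketch below is a hypothetical helper, not part of the MARL package: `ab_compare` and `fake_client` are names invented here, and the fake client only mimics the `chat.completions.create` call shape of the OpenAI Python SDK so the example runs offline.

```python
from types import SimpleNamespace
from typing import Any, Dict


def ab_compare(prompt: str, raw_client: Any, marl_client: Any, model: str) -> Dict[str, str]:
    """Send the same prompt through both clients and return both answers."""
    messages = [{"role": "user", "content": prompt}]
    answers: Dict[str, str] = {}
    for label, client in (("raw", raw_client), ("marl", marl_client)):
        response = client.chat.completions.create(model=model, messages=messages)
        answers[label] = response.choices[0].message.content
    return answers


def fake_client(tag: str) -> Any:
    """Offline stand-in exposing the same chat.completions.create shape."""
    def create(model: str, messages: list) -> Any:
        text = f"{tag}:{model}:{messages[-1]['content']}"
        message = SimpleNamespace(content=text)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])
    return SimpleNamespace(chat=SimpleNamespace(completions=SimpleNamespace(create=create)))


print(ab_compare("hello", fake_client("raw"), fake_client("marl"), "gpt-5.4"))
```

With the real SDK, `raw_client` would be `OpenAI(api_key=...)` and `marl_client` would be `OpenAI(api_key=..., base_url="http://localhost:8080/v1")`, as in the Quick Start.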
Roadmap
- MARL Enterprise Edition: Private deployment, custom pipelines, SLA support (H1 2026)
- Academic validation: Quantitative MARL effectiveness analysis via FINAL Bench, then international journal submission
- Global expansion: US market PoC completed, localization in progress
- Multi-environment support: Python 3.10/3.11/3.12/3.13, Windows & Mac native builds rolling out
VIDRAFT Track Record
| Achievement | Details |
|---|---|
| FINAL Bench | World's first AI metacognition benchmark, HF Dataset Global Trending #5 |
| FACTS Grounding | Google DeepMind Medical AI World #2 (verified by CNRS, France) |
| STAR AI TOP 12 | HuggingFace 2024, only selectee from South Korea |
| Heatmap Leaderboard | HF Global #4 |
| Community | 2M monthly active users, 30M cumulative visits, 1,500+ public AI models |
| Media Coverage | Seoul Shinmun, Asia Economy, IT Chosun, Bizhind |
Links
| Resource | URL |
|---|---|
| Live Demo | VIDraft/MARL |
| PyPI | marl-middleware |
| GitHub | Vidraft/MARL |
| ClawHub | marl-middleware |
| FINAL Bench Leaderboard | FINAL-Bench/Leaderboard |
| FINAL Bench Dataset | FINAL-Bench/Metacognitive |
| FINAL Bench Paper | SSRN |
| VIDRAFT Website | vidraft.net |
Authors: Minsik Kim (@Cutechicken99)


