MARL: Runtime Middleware That Reduces LLM Hallucination Without Fine-Tuning

Community Article Published March 9, 2026

TL;DR: MARL (Model-Agnostic Runtime Middleware for LLMs) is a middleware that inserts a multi-stage self-verification pipeline at runtime to reduce hallucination, without touching model weights. Change one line (base_url) and it works instantly with any OpenAI API-compatible LLM: GPT-5.4, Claude, Gemini, Llama, and more.

pip install marl-middleware

Motivation: The Metacognitive Gap (MA-ER Gap)

MMLU has crossed 90%. GPQA is saturating. HumanEval has hit its ceiling. Yet every single one of these benchmarks shares a common blind spot: none of them has ever measured whether AI can recognize its own errors and correct them.

Cognitive psychology calls this ability metacognition: the capacity to know what you know and what you don't. It is the real dividing line between human experts and novices, and a prerequisite for AGI.

In February 2026, we released FINAL Bench, the world's first benchmark dedicated to measuring AI metacognition. We evaluated 9 state-of-the-art models (GPT-5.2, Claude Opus 4.6, Gemini 3 Pro, and others) across 1,800 assessment runs. The results were striking:

| Metric | Description | Mean |
|---|---|---|
| MA (Metacognitive Accuracy) | Ability to say "I might be wrong" | 0.694 |
| ER (Error Recovery) | Ability to actually find and fix errors | 0.302 |
| MA-ER Gap | The chasm between knowing and doing | 0.392 |

AI models sense that they could be wrong, but they cannot actually fix what's broken.
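
The reported gap is consistent with a plain difference between the two means. A quick check, assuming the gap is defined as MA minus ER (the article gives only the three figures, not the formula):

```python
# Mean scores reported in the FINAL Bench results above.
ma_mean = 0.694  # Metacognitive Accuracy
er_mean = 0.302  # Error Recovery

# Assumption: the MA-ER Gap is the difference between the two means.
ma_er_gap = round(ma_mean - er_mean, 3)
print(ma_er_gap)  # 0.392
```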

The structural cause is clear. Current LLMs are autoregressive: once token generation starts, each token is conditioned on the previous ones. The model cannot stop mid-stream and say "wait, I was wrong." If the initial framing is flawed, it rides that trajectory to the end, and this is exactly why hallucinations are generated with high confidence.

MARL was built to solve both of these limitations.



Core Architecture: Multi-Agent Self-Verification Pipeline

MARL decomposes a single LLM call into multiple independent specialist roles arranged in a pipeline:

User Query
    │
    ▼
┌──────────────────────────────────────────────────────┐
│  S1: Hypothesis   - Designs optimal approach         │
│         │                                            │
│         ▼                                            │
│  S2: Solver       - Performs deep reasoning          │
│         │                                            │
│         ▼                                            │
│  S3: Auditor      - Audits for gaps, contradictions  │
│         │                                            │
│         ▼                                            │
│  S4: Verifier     - Adversarial cross-validation     │
│         │                                            │
│         ▼                                            │
│  S5: Synthesizer  - Integrates all feedback,         │
│                     generates entirely new final     │
│                     response                         │
└──────────────────────────────────────────────────────┘
    │
    ▼
Final Response (only the refined answer reaches the user)

Inter-agent communication uses a proprietary Weighted Attention Matrix, with two mechanisms operating simultaneously:

  • Cooperative Reinforcement: Knowledge accumulates and compounds across S1 → S2 → S3
  • Adversarial Cross-Validation: S4 deliberately challenges S2's conclusions from an opposing perspective

This dual mechanism is the key. A single LLM call cannot structurally negate itself. But in MARL, the Verifier (S4) re-examines the draft for errors, and the Synthesizer (S5) produces a completely new final response that incorporates all feedback. MARL transforms "answer in one shot" into "think, doubt, correct, and rewrite."
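
To make the control flow concrete, here is a minimal sketch of a five-stage chain in which each stage sees the query and all earlier outputs. It is not MARL's implementation: the core engine is proprietary, the role instructions below are invented for illustration, the Weighted Attention Matrix is not modeled, and `run_pipeline` and `echo_llm` are hypothetical names.

```python
from typing import Callable, Dict

# Hypothetical role instructions; MARL's real agent prompts are not public.
STAGES = [
    ("S1_hypothesis", "Design the best approach for answering the query."),
    ("S2_solver", "Follow the approach and reason through a full draft answer."),
    ("S3_auditor", "List gaps and contradictions in the draft."),
    ("S4_verifier", "Argue against the draft's conclusions and flag errors."),
    ("S5_synthesizer", "Write a new final answer that incorporates all feedback."),
]


def run_pipeline(query: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Run the five stages in order, passing each stage the earlier outputs."""
    trace: Dict[str, str] = {}
    for name, instruction in STAGES:
        # Each stage sees the query plus everything produced so far.
        context = "\n".join(f"[{k}] {v}" for k, v in trace.items())
        prompt = f"{instruction}\n\nQuery: {query}\n\nEarlier stages:\n{context}"
        trace[name] = llm(prompt)
    return trace


def echo_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns the instruction line."""
    return prompt.splitlines()[0]


trace = run_pipeline("What is 2 + 2?", echo_llm)
final_answer = trace["S5_synthesizer"]  # only this would reach the user
```

In a real setup, `llm` would wrap a chat-completion call. The `trace` dictionary keeps every intermediate output, which is the kind of stage-by-stage record the article describes later.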

Our FINAL Bench research showed that applying metacognitive scaffolding improved performance on the highest-difficulty tasks by over 70%, and 94.8% of that improvement came from Error Recovery, confirming that this structural transformation works in practice.


Not Fine-Tuning. Not RAG. A Third Approach.

| | Fine-Tuning | RAG | MARL |
|---|---|---|---|
| Target | Modifies model weights | Supplements external knowledge | Restructures reasoning process |
| Cost | Tens of thousands of USD in GPU costs | Vector DB infrastructure | One line of code |
| Time | Weeks | Days | Instant |
| Model lock-in | Tied to specific model | Model-agnostic | Model-agnostic |
| Problem solved | Domain adaptation | Knowledge gaps | Reasoning errors & hallucination |

Because MARL never touches model weights, you can switch from GPT-5.4 to Claude to open-source Llama, and the MARL layer stays intact. For organizations running multi-LLM strategies, this means consistent quality assurance without vendor lock-in.


Quick Start

Installation

Four ways to get started immediately:

# PyPI
pip install marl-middleware

# Docker
docker pull vidraft/marl:latest
docker run -p 8080:8080 vidraft/marl:latest

# ClawHub (OpenClaw AI agent ecosystem)
clawhub install marl-middleware

# GitHub
git clone https://github.com/Vidraft/MARL.git
cd MARL && pip install -e .
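
Before pointing a client at the Docker container, it can help to confirm that something is listening on the mapped port. The following is a generic TCP check, not a MARL feature; `is_listening` is a hypothetical helper, and port 8080 is taken from the `docker run` command above.

```python
import socket


def is_listening(host: str = "localhost", port: int = 8080, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False


if __name__ == "__main__":
    print("MARL server reachable:", is_listening())
```

This only shows that the port accepts connections. It does not verify that the server behind it is MARL or that the pipeline is healthy.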

Integration with Existing Code

Change one line in your existing OpenAI API code:

from openai import OpenAI

# Before
client = OpenAI(api_key="sk-...")

# After: just add base_url
client = OpenAI(
    api_key="sk-...",
    base_url="http://localhost:8080/v1"  # MARL server
)

# Everything else stays the same.
# All calls now pass through the multi-stage pipeline automatically.
response = client.chat.completions.create(
    model="gpt-5.4",
    messages=[{"role": "user", "content": "Explain error correction in quantum computing"}]
)
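
For A/B comparisons it can be convenient to switch between the direct API and MARL without editing code. A minimal sketch, assuming an environment variable named `MARL_BASE_URL` (a hypothetical name, not something MARL defines) and a hypothetical helper `client_kwargs`:

```python
import os
from typing import Dict, Mapping, Optional


def client_kwargs(api_key: str, env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build keyword arguments for OpenAI(...).

    base_url is added only when the (hypothetical) MARL_BASE_URL
    variable is set to a non-empty value.
    """
    env = os.environ if env is None else env
    kwargs = {"api_key": api_key}
    marl_url = env.get("MARL_BASE_URL", "").strip()
    if marl_url:
        kwargs["base_url"] = marl_url
    return kwargs


# Explicit env mappings are passed here so the example does not depend on the shell.
direct = client_kwargs("sk-...", env={})
via_marl = client_kwargs("sk-...", env={"MARL_BASE_URL": "http://localhost:8080/v1"})
```

The result can be unpacked into the client constructor, for example `OpenAI(**client_kwargs("sk-..."))`, so the same script talks to the raw model or to MARL depending on the environment.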

Switching Emergence Engines

Append ::mode to the model name to activate any of 9 specialized engines:

# Pharmaceutical emergence engine
response = client.chat.completions.create(
    model="gpt-5.4::pharma",
    messages=[{"role": "user", "content": "Propose 3rd-line target candidates for EGFR-mutant NSCLC"}]
)

# Legal emergence engine
response = client.chat.completions.create(
    model="claude-opus-4.6::law",
    messages=[{"role": "user", "content": "Analyze copyright attribution for AI-generated content"}]
)

# Works with any LLM model name
response = client.chat.completions.create(
    model="llama-3.3-70b::create",
    messages=[{"role": "user", "content": "Outline a short story set in a time-traveling library"}]
)
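
Because the mode is carried inside the model string, a typo in the tag is easy to miss. A small client-side sketch that builds and splits `model::mode` names, using the nine tags listed in the engine table in this article; `with_mode` and `split_mode` are hypothetical helpers, and how the MARL server itself parses the string is not documented here.

```python
# The nine mode tags listed in the engine table of this article.
MODES = {"invent", "create", "doc", "recipe", "pharma",
         "genomics", "chemistry", "ecology", "law"}


def with_mode(model: str, mode: str) -> str:
    """Return 'model::mode', rejecting tags that are not in MODES."""
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(MODES)}")
    return f"{model}::{mode}"


def split_mode(model_name: str):
    """Split 'model::mode' into (model, mode); mode is None if no tag is present."""
    model, sep, mode = model_name.partition("::")
    return (model, mode) if sep else (model, None)
```

For example, `with_mode("gpt-5.4", "pharma")` yields the same string used in the first call above, while `with_mode("gpt-5.4", "pharmacy")` raises a `ValueError` instead of sending an unrecognized tag to the server.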


9 Domain-Specific Emergence Engines

Beyond the default reasoning enhancement (Insight mode), MARL ships with 9 specialized emergence engines. Each engine operates on a domain-specific knowledge matrix with cross-layer combination rules that generate ideas no single LLM call can produce.

| Mode Tag | Engine | Data Scale | Cross-Layer Structure |
|---|---|---|---|
| ::invent | 🔬 Invention & Patents | 4,275 items | 6 layers × 6 emergence rules |
| ::create | 🎨 General Creative | 493 seeds | 11 categories × 5 rules |
| ::doc | 📝 Document Generation | 16 seeds | Precision verification optimized |
| ::recipe | 🍳 Culinary | 5 layers | Cooking × texture × architecture × cultural grammar |
| ::pharma | 💊 Drug Discovery | 172 items | Target × mechanism × delivery × disease × molecular |
| ::genomics | 🧬 Genomics & Bio | 104 items | Gene × protein × pathway × phenotype × platform |
| ::chemistry | 🧪 Chemistry & Materials | 135 items | Element × bond × structure × property × application |
| ::ecology | 🌍 Ecology & Environment | 105 items | Species × ecosystem × service × threat × strategy |
| ::law | ⚖️ Legal & Regulatory | 59 items | Jurisdiction × domain × instrument × mechanism × dispute |

A total of 5,538 expert data items are cross-combined across multiple layers. Each engine has 5 independent emergence rules and 10 cross-layer bonus pairs.
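
The article does not define a "cross-layer bonus pair". One reading that matches the stated count of 10 is that each pair is an unordered combination of two of an engine's five layers. A sketch under that assumption, using the five layers listed for the ::pharma engine:

```python
from itertools import combinations

# The five layers listed for the ::pharma engine in the table above.
pharma_layers = ["target", "mechanism", "delivery", "disease", "molecular"]

# Assumption: a bonus pair is an unordered pair of two distinct layers.
bonus_pairs = list(combinations(pharma_layers, 2))

for first, second in bonus_pairs:
    print(f"{first} x {second}")

print(len(bonus_pairs))  # 10
```

Five layers give C(5, 2) = 10 pairs, which agrees with the figure quoted above. Whether MARL actually enumerates pairs this way is not confirmed by the source.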


Open Core: Protecting IP While Ensuring Transparency

MARL follows an Open Core model that balances intellectual property protection with developer accessibility:

| Component | Visibility | Rationale |
|---|---|---|
| Core reasoning engine (pipeline, attention matrix, agent prompts) | 🔒 Compiled binary (.so) | Proprietary technology protection |
| Installation, testing, API integration interfaces | 🔓 Fully open | Immediate adoption & integration |
| A/B test demo (Raw LLM vs MARL) | 🔓 HF Spaces | Instant effect verification |
| Stage-by-stage reasoning logs | 🔓 Transparent | Glass Box traceability |

Critically, the metacognitive process, in which the AI discovers its own errors and corrects them, is recorded transparently at each stage. If traditional LLMs are black boxes that only show results, MARL provides a Glass Box for reasoning. For the first time, users can trace why an answer was produced, where an error was caught, and how it was corrected.


Connection to FINAL Bench

MARL and FINAL Bench are twin projects born from the same research:

FINAL Bench (Diagnosis)         MARL (Treatment)
───────────────────────         ───────────────────────
"Quantitatively measures        "Actually closes
 AI metacognitive ability"       that gap"

MA-ER Gap = 0.392        →     Multi-stage self-verification
                                strengthens ER capability

TICOS 8-type taxonomy    →     Optimized verification
                                strategy per type

The FINAL Bench dataset can be loaded from the Hugging Face Hub:

from datasets import load_dataset

dataset = load_dataset("FINAL-Bench/Metacognitive", split="train")
print(f"Total tasks: {len(dataset)}")  # 100
print("Domains: 15 / TICOS types: 8 / Difficulty grades: 3")


OpenClaw ClawHub Integration

MARL is officially registered on ClawHub, the skill marketplace of OpenClaw, an AI agent platform with 260K+ GitHub stars. Among 3,200+ registered AI agent skills, MARL is the first middleware skill in the Reasoning Enhancement category.

clawhub install marl-middleware

Once installed, point your agent's baseURL to the MARL server. Before your agent sends an email, writes code, or analyzes a document, MARL's multi-stage pipeline automatically intervenes: thinking deeply, questioning assumptions, and correcting errors before execution.

If OpenClaw agents are execution specialists, MARL is the metacognition upgrade for their brain. It structurally solves the biggest weakness of current AI agents: acting without thinking enough first.


Live Demo

Try the side-by-side A/B comparison of Raw LLM vs MARL-Enhanced on HuggingFace Spaces:

👉 VIDraft/MARL

Each reasoning stage (S1 through S5) is recorded transparently, letting you trace "why this answer was produced" for the first time.


Roadmap

  • MARL Enterprise Edition: Private deployment, custom pipelines, SLA support (H1 2026)
  • Academic validation: Quantitative MARL effectiveness analysis via FINAL Bench → international journal submission
  • Global expansion: US market PoC completed, localization in progress
  • Multi-environment support: Python 3.10/3.11/3.12/3.13, Windows & Mac native builds rolling out

VIDRAFT Track Record

| Achievement | Details |
|---|---|
| 🏆 FINAL Bench | World's first AI metacognition benchmark, HF Dataset Global Trending #5 |
| 🥈 FACTS Grounding | Google DeepMind Medical AI World #2 (verified by CNRS, France) |
| 🌟 STAR AI TOP 12 | HuggingFace 2024, only selectee from South Korea |
| 📊 Heatmap Leaderboard | HF Global #4 |
| 👥 Community | 2M monthly active users, 30M cumulative visits, 1,500+ public AI models |
| 📰 Media Coverage | Seoul Shinmun, Asia Economy, IT Chosun, Bizhind |

Links

| Resource | URL |
|---|---|
| 🤗 Live Demo | VIDraft/MARL |
| 📦 PyPI | marl-middleware |
| 🐙 GitHub | Vidraft/MARL |
| 🦀 ClawHub | marl-middleware |
| 📊 FINAL Bench Leaderboard | FINAL-Bench/Leaderboard |
| 🧬 FINAL Bench Dataset | FINAL-Bench/Metacognitive |
| 📄 FINAL Bench Paper | SSRN |
| 🌐 VIDRAFT Website | vidraft.net |

Authors: Minsik Kim (@Cutechicken99)
