MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents
Abstract
MMDeepResearch-Bench evaluates multimodal research agents on report generation with visual evidence, revealing trade-offs between prose quality, citation accuracy, and visual grounding.
Deep Research Agents (DRAs) generate citation-rich reports through multi-step search and synthesis, yet existing benchmarks mainly target text-only settings or short-form multimodal QA and therefore miss end-to-end use of multimodal evidence. We introduce MMDeepResearch-Bench (MMDR-Bench), a benchmark of 140 expert-crafted tasks spanning 21 domains, in which each task provides an image-text bundle for evaluating multimodal understanding and citation-grounded report generation. Compared with prior setups, MMDR-Bench emphasizes report-style synthesis with explicit evidence use: models must connect visual artifacts to sourced claims and maintain consistency across narrative, citations, and visual references. We further propose a unified, interpretable evaluation pipeline: Formula-LLM Adaptive Evaluation (FLAE) for report quality, Trustworthy Retrieval-Aligned Citation Evaluation (TRACE) for citation-grounded evidence alignment, and Multimodal Support-Aligned Integrity Check (MOSAIC) for text-visual integrity, each producing fine-grained signals that support error diagnosis beyond a single overall score. Experiments with 25 state-of-the-art models reveal systematic trade-offs between generation quality, citation discipline, and multimodal grounding, showing that strong prose alone does not guarantee faithful evidence use and that multimodal integrity remains a key bottleneck for deep research agents.
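As a rough illustration of how the pipeline's per-task signals could be consumed for error diagnosis (the class name, field names, 0-1 score scale, and diagnosis rule below are assumptions made for this sketch, not the paper's actual schema), a record carrying the three evaluator scores might look like this in Python:

```python
from dataclasses import dataclass


@dataclass
class TaskEvaluation:
    """Hypothetical per-task record holding the three evaluator signals."""
    task_id: str
    flae_report_quality: float    # FLAE: report quality (assumed 0-1 scale)
    trace_citation_score: float   # TRACE: citation-grounded evidence alignment
    mosaic_integrity: float       # MOSAIC: text-visual integrity

    def weakest_dimension(self) -> str:
        """Name the lowest-scoring dimension, i.e. where to look first when diagnosing errors."""
        scores = {
            "report_quality": self.flae_report_quality,
            "citation_alignment": self.trace_citation_score,
            "multimodal_integrity": self.mosaic_integrity,
        }
        return min(scores, key=scores.get)


if __name__ == "__main__":
    # Example consistent with the paper's finding that strong prose does not
    # guarantee faithful evidence use or multimodal integrity.
    record = TaskEvaluation("task-001", 0.86, 0.71, 0.42)
    print(record.weakest_dimension())  # -> multimodal_integrity
```

Keeping the three scores separate, rather than collapsing them into a single number, is what allows a diagnosis like this to point at citation discipline or visual grounding specifically.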
Community
Introducing MMDeepResearch-Bench, a benchmark for multimodal deep research agents.
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Report (2026)
- DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation (2026)
- SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature (2026)
- DEER: A Benchmark for Evaluating Deep Research Agents on Expert Report Generation (2025)
- DeepSynth-Eval: Objectively Evaluating Information Consolidation in Deep Survey Writing (2026)
- A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos (2025)
- FysicsWorld: A Unified Full-Modality Benchmark for Any-to-Any Understanding, Generation, and Reasoning (2025)