SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding
Abstract
A comprehensive benchmark for evaluating multimodal large language models on sequential audio-video data across real-world conversational domains with human-verified annotations and demographic metadata.
Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work targets static image understanding, while the ability of these models to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark that systematically evaluates MLLM performance in real-world settings. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on key tasks, including open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed- and open-source models reveal clear limitations. While the gap in MCQ accuracy between the two model families is relatively small, we observe a substantial 22.6% difference in temporal localization between the best-performing closed-source and open-source models. Performance further degrades across demographic groups, indicating persistent disparities in model behavior. Overall, SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding. We release SONIC-O1 for reproducibility and research:
- Project page: https://vectorinstitute.github.io/sonic-o1/
- Dataset: https://huggingface.co/datasets/vector-institute/sonic-o1
- GitHub: https://github.com/vectorinstitute/sonic-o1
- Leaderboard: https://huggingface.co/spaces/vector-institute/sonic-o1-leaderboard
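Since the annotations are released on the Hugging Face Hub, a quick way to explore them is via the `datasets` library. The sketch below is a minimal example, not official usage from the paper: it loads the default configuration of the repository linked above and inspects the available splits and columns; the exact configuration names and schema are assumptions.

```python
# Minimal sketch: loading the SONIC-O1 annotations from the Hugging Face Hub.
# The repository id comes from the release links above; the default config and
# the field layout are assumptions, so we just load and inspect what exists.
from datasets import load_dataset

ds = load_dataset("vector-institute/sonic-o1")  # default config (assumed to exist)

print(ds)  # shows the available splits and their sizes

first_split = next(iter(ds.values()))
print(first_split.column_names)  # e.g. question/answer/timestamp fields (schema assumed)
```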
Community
SONIC-O1: A Real-World Benchmark for Evaluating Multimodal LLMs on Audio-Video Understanding
SONIC-O1 is a fully human-verified benchmark for real-world audio–video conversations: 13 conversational domains, 4,958 annotated instances, and demographic metadata for group-wise analysis. It evaluates models on open-ended summarization, multiple-choice QA, and temporal localization with supporting rationales, and we find temporal grounding is still a major pain point (e.g., a 22.6% gap in temporal localization between the best-performing closed- and open-source models). A small scoring sketch for that task follows below.
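To make the temporal localization task concrete, here is a minimal sketch of how such predictions are commonly scored with temporal intersection-over-union (IoU). This is an illustration of the general metric, not necessarily the exact protocol or thresholds used in the paper; the interval values are toy numbers.

```python
# Minimal sketch: scoring a temporal localization prediction with temporal IoU.
# Illustrative only; the benchmark's exact metric and thresholds may differ.
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two [start, end] intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

# Example: a predicted segment vs. a human-annotated segment (toy numbers).
print(temporal_iou((12.0, 20.0), (10.0, 18.0)))  # 0.6
```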
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation (2025)
- A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos (2025)
- FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs (2026)
- AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding (2025)
- Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning (2026)
- QMAVIS: Long Video-Audio Understanding using Fusion of Large Multimodal Models (2026)
- VNU-Bench: A Benchmarking Dataset for Multi-Source Multimodal News Video Understanding (2026)