Papers
arxiv:2603.19466

ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models

Published on Mar 19
Submitted by Thomas De Min on Mar 23

AI-generated summary

MLLMs demonstrate limited proactive behavior in requesting user interventions for challenging tasks, with performance hindered by conversational context and in-context learning biases, though reinforcement learning fine-tuning shows potential for learning such behaviors.

Abstract

Effective collaboration begins with knowing when to ask for help. For example, when trying to identify an occluded object, a human would ask someone to remove the obstruction. Can MLLMs exhibit a similar "proactive" behavior by requesting simple user interventions? To investigate this, we introduce ProactiveBench, a benchmark built from seven repurposed datasets that tests proactiveness across different tasks such as recognizing occluded objects, enhancing image quality, and interpreting coarse sketches. We evaluate 22 MLLMs on ProactiveBench, showing that (i) they generally lack proactiveness; (ii) proactiveness does not correlate with model capacity; (iii) "hinting" at proactiveness yields only marginal gains. Surprisingly, we found that conversation histories and in-context learning introduce negative biases, hindering performance. Finally, we explore a simple fine-tuning strategy based on reinforcement learning: its results suggest that proactiveness can be learned, even generalizing to unseen scenarios. We publicly release ProactiveBench as a first step toward building proactive multimodal models.

Community

Paper submitter

We introduce ProactiveBench, a benchmark to evaluate whether MLLMs can ask for help when faced with unanswerable visual queries, e.g., suggesting to move an occluding object rather than hallucinating or abstaining. We repurpose 7 datasets into 7 distinct proactive scenarios (occlusion removal, camera movement, image quality enhancement, sketch completion, and more), totaling 108k+ images across 18k samples.

Evaluating 22 MLLMs, we find that models broadly lack proactiveness regardless of size, and that hinting, conversation history, and in-context learning offer only marginal or even negative gains. Encouragingly, we show that proactiveness can be learned via RL fine-tuning (GRPO) and generalizes to unseen scenarios. We release ProactiveBench as a first step toward building more collaborative multimodal models.
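To make the RL setup concrete, here is a minimal sketch of what a proactiveness reward for GRPO-style fine-tuning could look like. The function name, the help-request cues, and the exact reward values are my own illustrative assumptions, not the paper's recipe; the only idea taken from the discussion above is that the model should be rewarded for requesting an intervention when the query is unanswerable and for answering directly when it is not.

```python
# Hypothetical proactiveness reward for GRPO-style post-training.
# The cue list and reward magnitudes are illustrative assumptions.

def proactiveness_reward(response: str, answerable: bool, correct: bool) -> float:
    """Score one sampled response to a visual query that is either
    answerable as-is or requires a user intervention (e.g., moving
    an occluding object)."""
    asked_for_help = any(
        cue in response.lower()
        for cue in ("could you", "can you", "please move", "please remove")
    )
    if not answerable:
        # Unanswerable query: reward asking for an intervention,
        # penalize guessing or silently abstaining.
        return 1.0 if asked_for_help else -1.0
    if asked_for_help:
        # Answerable query: penalize an unnecessary request for help.
        return -0.5
    return 1.0 if correct else 0.0
```

In a GRPO loop, this scalar would score each of the sampled completions for a prompt, and the group-normalized scores would form the advantages; the balance between the "ask" and "guess" branches is exactly the reward-design question raised below.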


The part that actually clicked for me is the reinforcement-learning fine-tuning recipe used to teach proactiveness: it's a lightweight post-training loop rather than a full-blown RLAIF pipeline, which makes the results surprisingly generalizable. The claim that proactiveness generalizes to unseen domains hinges a lot on the reward design and on how asking for help is balanced against just guessing. The finding that conversation history and chat prompts bias models toward unnecessary proactivity and actually hurt accuracy is a nice sanity check. For context, arxivlens has a solid walkthrough that dissects the RL setup and evaluation subtleties; the breakdown at https://arxivlens.com/PaperView/Details/proactivebench-benchmarking-proactiveness-in-multimodal-large-language-models-4766-fa4798d2 helped me parse this part. One quick question for the authors: did you try ablations with alternative reward signals, or compare against a supervised proactiveness objective on the same data?


Thank you very much for your interest in our work! 😁

The RL fine-tuning experiment was conceived as a proof of concept for future research on this task. For this reason, we deliberately focused on a simplified setting to assess whether the approach is viable.

To address your questions:

  1. In addition to the main setup, we experimented with scaling rewards for correct class predictions and proactive suggestions based on the model’s confidence, as well as on the increase in confidence following a proactive suggestion. This approach generally led to improved performance. However, we considered it somewhat overengineered for inclusion as a baseline in a benchmark paper.
  2. Providing explicit supervision is quite challenging, as it would require specifying, for each frame, the correct response the model should produce. This, in turn, demands annotations that are difficult to obtain and inherently subjective. Different models may exhibit varying levels of visual perception, enabling some to make accurate predictions earlier in the sequence than others. For this reason, we opted for a reinforcement learning approach, allowing each model to learn a policy tailored to its own visual capabilities.

