LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization
Abstract
LongVPO is a two-stage Direct Preference Optimization framework that enables short-context vision-language models to understand ultra-long videos through synthetic preference triples and recursive captioning, achieving state-of-the-art performance with minimal human annotation.
We present LongVPO, a novel two-stage Direct Preference Optimization (DPO) framework that enables short-context vision-language models to robustly understand ultra-long videos without any long-video annotations. In Stage 1, we synthesize preference triples by anchoring questions to individual short clips, interleaving those clips with distractors, and applying visual-similarity and question-specificity filtering to mitigate positional bias and ensure unambiguous supervision. We also approximate the reference model's scoring over long contexts by evaluating only the anchor clip, which reduces computational overhead. In Stage 2, we run a recursive captioning pipeline over long videos to generate scene-level metadata, then use a large language model to craft multi-segment reasoning queries and dispreferred responses, aligning the model toward reasoning that spans multiple segments. With only 16K synthetic examples and no costly human labels, LongVPO outperforms state-of-the-art open-source models on multiple long-video benchmarks while maintaining strong short-video performance (e.g., on MVBench), offering a scalable paradigm for efficient long-form video understanding.
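To make the Stage-1 objective concrete, below is a minimal sketch (not the authors' code) of a DPO loss in which the policy scores responses conditioned on the full interleaved long video, while the frozen reference model scores only the short anchor clip, as the abstract describes. The function and argument names are illustrative assumptions, not an API from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss_with_anchor_reference(
    policy_logp_chosen: torch.Tensor,       # log pi_theta(y_w | interleaved long video, question)
    policy_logp_rejected: torch.Tensor,     # log pi_theta(y_l | interleaved long video, question)
    ref_logp_chosen_anchor: torch.Tensor,   # log pi_ref(y_w | anchor clip only, question)
    ref_logp_rejected_anchor: torch.Tensor, # log pi_ref(y_l | anchor clip only, question)
    beta: float = 0.1,
) -> torch.Tensor:
    """Standard DPO loss, except the reference log-probabilities are computed on
    the short anchor clip instead of the full long context (the approximation
    described in the abstract), so the reference model never sees the long video."""
    policy_margin = policy_logp_chosen - policy_logp_rejected
    ref_margin = ref_logp_chosen_anchor - ref_logp_rejected_anchor
    # -log sigmoid(beta * (policy margin - reference margin)), averaged over the batch
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```

Since the chosen answer is anchored to a single clip, scoring the reference model on that clip alone is a reasonable stand-in for scoring it on the full interleaved sequence, which is what makes the long-context training tractable for a short-context model.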
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning (2026)
- VideoWeave: A Data-Centric Approach for Efficient Video Understanding (2026)
- Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning (2026)
- Video Evidence to Reasoning Efficient Video Understanding via Explicit Evidence Grounding (2026)
- A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos (2025)
- CounterVid: Counterfactual Video Generation for Mitigating Action and Temporal Hallucinations in Video-Language Models (2026)
- What Happens Next? Next Scene Prediction with a Unified Video Model (2025)