Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models
Abstract
MMHNet enables long-form audio generation from video by integrating a hierarchical method and non-causal Mamba, outperforming existing video-to-audio approaches.
Scaling multimodal alignment between video and audio is challenging, particularly due to limited data and the mismatch between text descriptions and frame-level video information. In this work, we tackle the scaling challenge in multimodal-to-audio generation, examining whether models trained on short instances can generalize to longer ones at test time. To this end, we present a multimodal hierarchical network, MMHNet, an extension of state-of-the-art video-to-audio models. Our approach integrates a hierarchical method and non-causal Mamba to support long-form audio generation, extending generated audio to durations beyond 5 minutes. We also demonstrate that training on short clips and testing on long ones is feasible in video-to-audio generation, without training on the longer durations. In our experiments, the proposed method achieves strong results on long-video-to-audio benchmarks, outperforming prior video-to-audio work. Moreover, our model can generate audio longer than 5 minutes, where prior video-to-audio methods fall short.
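The abstract attributes length generalization to a non-causal Mamba backbone, but gives no implementation details. As a minimal sketch of the general idea, not of MMHNet's actual design, the snippet below shows how a non-causal sequence operator can be built from two causal linear recurrences (one forward, one over the time-reversed sequence), so every frame attends to both past and future context. The `decay` parameter and the fixed-decay recurrence are illustrative simplifications of a selective state-space scan.

```python
import numpy as np

def causal_scan(x, decay=0.9):
    # Causal linear recurrence: h[t] = decay * h[t-1] + x[t].
    # x has shape (time, channels); each step sees only the past.
    h = np.zeros_like(x)
    acc = np.zeros(x.shape[-1])
    for t in range(x.shape[0]):
        acc = decay * acc + x[t]
        h[t] = acc
    return h

def noncausal_scan(x, decay=0.9):
    # Non-causal variant: combine a forward scan with a scan over
    # the time-reversed input, then reverse the latter back.
    # x[t] itself is counted in both passes, so subtract it once.
    fwd = causal_scan(x, decay)
    bwd = causal_scan(x[::-1], decay)[::-1]
    return fwd + bwd - x
```

With this construction, the output at the first frame already reflects events at the last frame, which a purely causal scan cannot do; whether MMHNet realizes non-causality this way, or via a different bidirectional mechanism, is not specified in the abstract.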
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- LTX-2: Efficient Joint Audio-Visual Foundation Model (2026)
- Apollo: Unified Multi-Task Audio-Video Joint Generation (2026)
- MOVA: Towards Scalable and Synchronized Video-Audio Generation (2026)
- JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation (2026)
- GMS-CAVP: Improving Audio-Video Correspondence with Multi-Scale Contrastive and Generative Pretraining (2026)
- SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model (2026)
- MM-Sonate: Multimodal Controllable Audio-Video Generation with Zero-Shot Voice Cloning (2026)
