arxiv:2601.20499

Efficient Autoregressive Video Diffusion with Dummy Head

Published on Jan 28

Submitted by Hang Guo on Feb 5

Microsoft Research


Authors:

Hang Guo

Abstract

Autoregressive video diffusion models suffer from inefficient attention mechanisms that underutilize historical frames, but a new method called Dummy Forcing improves efficiency through heterogeneous memory allocation and dynamic head programming while maintaining quality.

AI-generated summary

Autoregressive video diffusion models have recently gained considerable research interest due to their causal modeling and iterative denoising. In this work, we identify that the multi-head self-attention in these models under-utilizes historical frames: approximately 25% of heads attend almost exclusively to the current frame, and discarding their KV caches incurs only minor performance degradation. Building upon this, we propose Dummy Forcing, a simple yet effective method to control context accessibility across different heads. Specifically, the proposed heterogeneous memory allocation reduces head-wise context redundancy, accompanied by dynamic head programming to adaptively classify head types. Moreover, we develop a context packing technique to achieve more aggressive cache compression. Without additional training, Dummy Forcing delivers up to 2.0x speedup over the baseline, supporting video generation at 24.3 FPS with less than 0.5% quality drop. The project page is available at https://csguoh.github.io/project/DummyForcing/.
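To make the "dummy head" observation concrete, here is a minimal sketch of one way such heads could be flagged: measure how much of each head's attention mass lands on current-frame keys, then mark the top fraction as dummy. This is an illustration only, not the paper's dynamic head programming. The function names, the `[history..., current]` key ordering, and the fixed 25% cutoff are all assumptions made for the example.

```python
from typing import List


def current_frame_mass(attn_row: List[float], num_current: int) -> float:
    """Fraction of one query's attention that lands on current-frame keys.

    Assumption: keys are ordered [history..., current], so the last
    `num_current` entries of the row belong to the current frame.
    """
    total = sum(attn_row)
    if total <= 0:
        return 0.0
    return sum(attn_row[-num_current:]) / total


def classify_dummy_heads(
    attn: List[List[List[float]]],
    num_current: int,
    dummy_fraction: float = 0.25,
) -> List[int]:
    """Return indices of heads with the highest mean current-frame mass.

    attn[h][q][k] holds the attention weight of head h, query q, key k.
    `dummy_fraction` is a hypothetical knob; 0.25 mirrors the roughly 25%
    figure quoted in the abstract.
    """
    scores = []
    for head_index, head in enumerate(attn):
        mean_mass = sum(current_frame_mass(row, num_current) for row in head) / len(head)
        scores.append((mean_mass, head_index))
    num_dummy = int(len(attn) * dummy_fraction)
    top = sorted(scores, reverse=True)[:num_dummy]
    return sorted(head_index for _, head_index in top)


# Toy input: 4 heads, 1 query each, 4 keys (2 history keys, then 2 current keys).
attn_demo = [
    [[0.40, 0.40, 0.10, 0.10]],  # mostly attends to history
    [[0.01, 0.01, 0.49, 0.49]],  # almost exclusively current frame
    [[0.25, 0.25, 0.25, 0.25]],  # uniform
    [[0.30, 0.30, 0.20, 0.20]],  # leans toward history
]
dummy_heads = classify_dummy_heads(attn_demo, num_current=2)
print(dummy_heads)
```

In this toy input only the second head concentrates its attention on the current frame, so it is the one selected. A real system would gather these statistics from the model's own attention maps rather than from hand-written rows.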


Community

HangGuo

Paper author Paper submitter about 3 hours ago

Dummy Forcing is built on the observation that about 25% of attention heads in existing autoregressive video diffusion models are "dummy" heads: they attend almost exclusively to the current frame despite having access to historical context. Based on this observation, Dummy Forcing automatically identifies dummy heads and allocates a different amount of context to each head type. Leveraging this "dummy property", we can enable:
1. Efficient video generation at a real-time speed of 24.3 FPS.
2. High-resolution video generation, supporting 720P and 1080P with a 2.0x speedup.
3. Long-context video generation, enlarging the context window by 6.58x without losing efficiency.
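As a back-of-the-envelope illustration of why giving dummy heads less context saves memory, the sketch below counts cached history entries when a fraction of heads keeps no historical KV cache at all. It is a simplified accounting under stated assumptions, not the paper's heterogeneous memory allocation: the function name and all parameters are hypothetical, context packing is not modeled, and the result is not meant to reproduce the reported 2.0x or 6.58x figures.

```python
from typing import Tuple


def history_cache_entries(
    num_heads: int,
    context_frames: int,
    tokens_per_frame: int,
    dummy_fraction: float = 0.25,
) -> Tuple[int, int]:
    """Count cached history tokens (summed over heads) with and without dummy heads.

    Assumption: a dummy head stores no historical KV entries, while every
    other head stores the full `context_frames` window.
    """
    num_dummy = int(num_heads * dummy_fraction)
    per_head = context_frames * tokens_per_frame
    uniform = num_heads * per_head                       # every head caches all history
    heterogeneous = (num_heads - num_dummy) * per_head   # dummy heads cache none
    return uniform, heterogeneous


# Hypothetical sizes chosen only for the example.
uniform_entries, hetero_entries = history_cache_entries(
    num_heads=16, context_frames=8, tokens_per_frame=100
)
print(uniform_entries, hetero_entries, hetero_entries / uniform_entries)
```

Under these assumptions, dropping history for a quarter of the heads removes a quarter of the cached entries. The larger gains quoted above presumably rely on the additional techniques described in the abstract, such as context packing.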
