iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation


iMontage Teaser

iMontage is a unified framework that repurposes a powerful video model into an all-in-one image generator. It consumes and produces variable-length image sets, unifying a wide array of image generation and editing tasks. This design lets the model acquire broad image-manipulation capabilities without corrupting the video model's original motion priors. iMontage excels across several mainstream many-in-many-out tasks, maintaining strong cross-image contextual consistency and generating scenes with extraordinary dynamics.

πŸ“¦ Features

  • ⚑ High-dynamic, high-consistency image generation from flexible inputs
  • πŸŽ›οΈ Robust instruction following across heterogeneous tasks
  • πŸŒ€ Video-like temporal coherence, even for non-video image sets
  • πŸ† SOTA results across different tasks

πŸš€ Sample Usage

For detailed installation instructions and more complex inference examples, please refer to the GitHub repository.

Here is a simple Python code snippet demonstrating image generation with iMontage:

from inference_solver import FlexARInferenceSolver
from PIL import Image

# ******************** Image Generation ********************
inference_solver = FlexARInferenceSolver(
    model_path="Kr1sJ/iMontage", # Use this Hugging Face model
    precision="bf16",
    target_size=768, # Ensure target_size is consistent with the checkpoint
)

q1 = "Generate an image of 768x768 according to the following prompt:\n" \
     "Image of a dog playing in water, with a waterfall in the background."

# generated: tuple of (generated response, list of generated images)
generated = inference_solver.generate(
    images=[],
    qas=[[q1, None]],
    max_gen_len=8192,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
)

a1, new_image = generated[0], generated[1][0]

# Display or save the image
new_image.show()
# new_image.save("generated_dog.png")
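
The `cfg` and `image_top_k` arguments passed to `create_logits_processor` control classifier-free guidance strength and top-k filtering over image-token logits. As a rough illustration (not the repository's actual implementation), such a processor typically blends conditional and unconditional logits, then masks everything outside the top k:

```python
import numpy as np

def cfg_topk_logits(cond_logits, uncond_logits, cfg=4.0, top_k=2000):
    """Illustrative classifier-free guidance followed by top-k filtering."""
    # Push the conditional distribution away from the unconditional one.
    guided = uncond_logits + cfg * (cond_logits - uncond_logits)
    # Keep only the top_k highest logits; mask the rest to -inf.
    if top_k < guided.shape[-1]:
        kth_largest = np.partition(guided, -top_k)[-top_k]
        guided = np.where(guided >= kth_largest, guided, -np.inf)
    return guided

cond = np.array([2.0, 1.0, 0.5, -1.0])
uncond = np.array([1.0, 1.0, 0.0, 0.0])
out = cfg_topk_logits(cond, uncond, cfg=4.0, top_k=2)
# Only the two largest guided logits survive; the others become -inf.
```

A higher `cfg` sharpens adherence to the prompt at the cost of diversity, while a larger `image_top_k` admits more candidate image tokens per sampling step.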

The model supports various tasks, illustrated by the example prompts below (input and output images are shown on the project page):

| Task Type | Example Input Prompt |
|---|---|
| `image_editing` | Change the material of the lava to silver. |
| `cref` | Confucius from the first image, Moses from the second… |
| `conditioned_cref` | depth |
| `sref` | (empty) |
| `multiview` | 1. Shift left; 2. Look up; 3. Zoom out. |
| `storyboard` | Vintage film: 1. Hepburn carrying the yellow bag… |
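
For tasks that take reference images, such as `image_editing` or `cref`, the same `generate` call accepts input images. The sketch below mirrors the text-to-image example above; the image and prompt are placeholders, and the actual call is commented out because it requires the downloaded checkpoint:

```python
from PIL import Image

# Hypothetical editing request: one reference image plus an instruction.
ref = Image.new("RGB", (768, 768))  # stand-in for a real input image
q1 = "Change the material of the lava to silver."

images = [ref]
qas = [[q1, None]]  # one (question, answer) round; the answer is filled by the model

# generated = inference_solver.generate(
#     images=images,
#     qas=qas,
#     max_gen_len=8192,
#     temperature=1.0,
#     logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
# )
# edited_image = generated[1][0]
```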

πŸ’– Acknowledgment

We sincerely thank the open-source community for providing strong foundations that enabled this work.
In particular, we acknowledge the following projects for their models, datasets, and valuable insights:

  • HunyuanVideo-T2V, HunyuanVideo-I2V – Provided base generative model designs and code.
  • FastVideo – Contributed key components and open-source utilities that supported our development.

These contributions have greatly influenced our research and helped shape the design of iMontage.


πŸ“ Citation

If you find iMontage useful for your research or applications, please consider starring ⭐ the repo and citing our paper:

@article{fu2025iMontage,
  title={iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation}, 
  author={Zhoujie Fu and Xianfang Zeng and Jinghong Lan and Xinyao Liao and Cheng Chen and Junyi Chen and Jiacheng Wei and Wei Cheng and Shiyu Liu and Yunuo Chen and Gang Yu and Guosheng Lin},
  journal={arXiv preprint arXiv:2511.20635},
  year={2025},   
}