# iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation
iMontage is a unified framework that repurposes a powerful video model into an all-in-one image generator. The framework consumes and produces variable-length image sets, unifying a wide array of image generation and editing tasks. This design lets the model acquire broad image manipulation capabilities without corrupting the valuable motion priors of the underlying video model. iMontage excels across several mainstream many-in-many-out tasks, maintaining strong cross-image contextual consistency and generating scenes with high dynamics.
## Features

- High-dynamic, high-consistency image generation from flexible inputs
- Robust instruction following across heterogeneous tasks
- Video-like temporal coherence, even for non-video image sets
- State-of-the-art results across diverse tasks
## Sample Usage
For detailed installation instructions and more complex inference examples, please refer to the GitHub repository.
Here is a simple Python code snippet demonstrating image generation with iMontage:
```python
from inference_solver import FlexARInferenceSolver
from PIL import Image

# ******************** Image Generation ********************
inference_solver = FlexARInferenceSolver(
    model_path="Kr1sJ/iMontage",  # Use this Hugging Face model
    precision="bf16",
    target_size=768,  # Ensure target_size is consistent with the checkpoint
)

q1 = (
    "Generate an image of 768x768 according to the following prompt:\n"
    "Image of a dog playing in water, with a waterfall in the background."
)

# generated: tuple of (generated response, list of generated images)
generated = inference_solver.generate(
    images=[],
    qas=[[q1, None]],
    max_gen_len=8192,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
)

a1, new_image = generated[0], generated[1][0]

# Display or save the image
new_image.show()
# new_image.save("generated_dog.png")
```
The model supports various tasks as illustrated below:
| Task Type | Input | Prompt | Output |
|---|---|---|---|
| image_editing | ![]() | Change the material of the lava to silver. | ![]() |
| cref | ![]() | Confucius from the first image, Moses from the second… | ![]() |
| conditioned_cref | ![]() | depth | ![]() |
| sref | ![]() | (empty) | ![]() |
| multiview | ![]() | 1. Shift left; 2. Look up; 3. Zoom out. | ![]() |
| storyboard | ![]() | Vintage film: 1. Hepburn carrying the yellow bag… | ![]() |
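For tasks that take input images (such as `image_editing`), a call along the following lines seems plausible, mirroring the generation snippet above. Note that this is a hedged sketch: the helper function, the editing prompt template, and the placeholder file name `example_scene.png` are illustrative assumptions, not the repository's documented API.

```python
# Hypothetical helper (not part of the iMontage API): builds the question
# string for an editing-style call, following the instruction-style prompts
# shown in the task table. The exact template iMontage expects is an assumption.
def build_editing_query(instruction: str) -> str:
    return f"Edit the given image according to the following prompt: {instruction}"


# Assumed usage with the same FlexARInferenceSolver API as above
# ("example_scene.png" is a placeholder input image):
#
#   from PIL import Image
#   q1 = build_editing_query("Change the material of the lava to silver.")
#   generated = inference_solver.generate(
#       images=[Image.open("example_scene.png")],
#       qas=[[q1, None]],
#       max_gen_len=8192,
#       temperature=1.0,
#       logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
#   )
#   edited_image = generated[1][0]
```

See the GitHub repository for the task-specific prompt formats actually used by each mode in the table above.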
## Acknowledgment
We sincerely thank the open-source community for providing strong foundations that enabled this work.
In particular, we acknowledge the following projects for their models, datasets, and valuable insights:
- HunyuanVideo-T2V and HunyuanVideo-I2V: provided base generative model designs and code.
- FastVideo: contributed key components and open-source utilities that supported our development.
These contributions have greatly influenced our research and helped shape the design of iMontage.
## Citation
If you find iMontage useful for your research or applications, please consider starring the repo and citing our paper:
```bibtex
@article{fu2025iMontage,
  title={iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation},
  author={Zhoujie Fu and Xianfang Zeng and Jinghong Lan and Xinyao Liao and Cheng Chen and Junyi Chen and Jiacheng Wei and Wei Cheng and Shiyu Liu and Yunuo Chen and Gang Yu and Guosheng Lin},
  journal={arXiv preprint arXiv:2511.20635},
  year={2025},
}
```