# AMoE: Agglomerative MoE Vision Foundation Models

CVPR 2026. A family of vision encoders distilled from DINOv3 and SigLIP2, available in MoE and dense variants.
AMoE is a Mixture-of-Experts (MoE) vision foundation model distilled from DINOv3 and SigLIP2 teachers, supporting multi-resolution image understanding.
```bash
pip install torch transformers einops pillow
```
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoImageProcessor

# Load model and processor
model_id = "tiiuae/amoe"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).to("cuda", dtype=torch.bfloat16)
processor = AutoImageProcessor.from_pretrained(model_id, trust_remote_code=True)

# Preprocess image
image = Image.open("image.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt").to("cuda")
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

# Inference
with torch.no_grad():
    outputs = model(**inputs)

# Access specialized features
# Options: 'amoe' (768d), 'siglip2' (1152d), 'dinov3' (1024d)
patch_features = outputs["patch_features"]["amoe"]        # (Batch, Tokens, 768)
summary_features = outputs["summary_features"]["siglip2"] # (Batch, 1152)
```
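
Each feature space can be consumed on its own. Below is a minimal sketch of one common use: dense patch matching between two images via cosine similarity over the DINOv3-aligned patch features. It reuses the `model` and `processor` loaded above; the file names are placeholders, and the choice of the `'dinov3'` space (over `'amoe'` or `'siglip2'`) is illustrative.

```python
import torch.nn.functional as F

def embed(path):
    # Reuses `model` and `processor` from the snippet above.
    img = Image.open(path).convert("RGB")
    x = processor(img, return_tensors="pt").to("cuda")
    x["pixel_values"] = x["pixel_values"].to(torch.bfloat16)
    with torch.no_grad():
        out = model(**x)
    # DINOv3-aligned patch features for a single image: (Tokens, 1024)
    return out["patch_features"]["dinov3"][0].float()

feats_a = embed("image_a.jpg")  # placeholder file names
feats_b = embed("image_b.jpg")

# Cosine similarity between every patch in A and every patch in B
sim = F.normalize(feats_a, dim=-1) @ F.normalize(feats_b, dim=-1).T  # (Tokens_a, Tokens_b)
best_match = sim.argmax(dim=-1)  # nearest patch in B for each patch in A
```

The `'siglip2'` summary features (one 1152-d vector per image) can be compared the same way for whole-image similarity or retrieval.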
If you use AMoE in your research, please cite:
```bibtex
@article{chaybouti2025amoe,
  title={AMOE: Agglomerative Mixture-of-Experts Vision Foundation Models},
  author={Chaybouti, Sofian and Narayan, Sanath and Dahou, Yasser and Le Khac, Phuc H. and Singh, Ankit and Huynh, Ngoc Dung and Para, Wamiq Reyaz and Kuehne, Hilde and Hacid, Hakim},
  journal={arXiv preprint arXiv:2512.20157},
  year={2025}
}
```