Beyond Language Modeling: An Exploration of Multimodal Pretraining Paper • 2603.03276 • Published 1 day ago • 61
Solaris: Building a Multiplayer Video World Model in Minecraft Paper • 2602.22208 • Published 8 days ago • 27
Solaris-Models Collection Model weights for Solaris: Building a Multiplayer Video World Model in Minecraft • 1 item • Updated 3 days ago • 3
Solaris-Data Collection Training and evaluation datasets collected for Solaris: Building a Multiplayer Video World Model in Minecraft • 2 items • Updated 10 days ago • 3
Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders Paper • 2601.16208 • Published Jan 22 • 53
SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding Paper • 2511.04668 • Published Nov 6, 2025 • 5
Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts Paper • 2511.04655 • Published Nov 6, 2025 • 8
Energy-Based Transformers are Scalable Learners and Thinkers Paper • 2507.02092 • Published Jul 2, 2025 • 69
SAT: Dynamic Spatial Aptitude Training for Multimodal Language Models Paper • 2412.07755 • Published Dec 10, 2024 • 2
Cosmos World Foundation Model Platform for Physical AI Paper • 2501.03575 • Published Jan 7, 2025 • 82
Byte Latent Transformer: Patches Scale Better Than Tokens Paper • 2412.09871 • Published Dec 13, 2024 • 108
Adaptive Length Image Tokenization via Recurrent Allocation Paper • 2411.02393 • Published Nov 4, 2024 • 13
A failed experiment: Infini-Attention, and why we should keep trying? Article • Published Aug 14, 2024 • 75
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs Paper • 2406.16860 • Published Jun 24, 2024 • 63
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models Paper • 2308.01390 • Published Aug 2, 2023 • 34