World Craft: Agentic Framework to Create Visualizable Worlds via Text
Abstract
World Craft enables non-expert users to create executable and visualizable AI environments through textual descriptions by combining structured scaffolding and multi-agent intent analysis.
Large Language Models (LLMs) motivate generative agent simulations (e.g., AI Town) that create "dynamic worlds," holding immense value across entertainment and research. However, for non-experts, especially those without programming skills, it is difficult to customize a visualizable environment on their own. In this paper, we introduce World Craft, an agentic world-creation framework that builds an executable and visualizable AI Town from user textual descriptions. It consists of two main modules: World Scaffold and World Guild. World Scaffold is a structured and concise standard for developing interactive game scenes, serving as an efficient scaffold for LLMs to customize an executable AI Town-like environment. World Guild is a multi-agent framework that progressively analyzes users' intents from rough descriptions and synthesizes the required structured content (e.g., environment layout and assets) for World Scaffold. Moreover, we construct a high-quality error-correction dataset via reverse engineering to enhance spatial knowledge and improve the stability and controllability of layout generation, and we report multi-dimensional evaluation metrics for further analysis. Extensive experiments demonstrate that our framework significantly outperforms existing commercial code agents (Cursor and Antigravity) and LLMs (Qwen3 and Gemini-3-Pro) in scene construction and narrative intent conveyance, providing a scalable solution for democratizing environment creation.
Community
https://github.com/HerzogFL/World-Craft
The following papers were recommended by the Semantic Scholar API
- RoomPilot: Controllable Synthesis of Interactive Indoor Environments via Multimodal Semantic Parsing (2025)
- TangramPuzzle: Evaluating Multimodal Large Language Models with Compositional Spatial Reasoning (2026)
- V-CAGE: Context-Aware Generation and Verification for Scalable Long-Horizon Embodied Tasks (2026)
- Zero-shot 3D Map Generation with LLM Agents: A Dual-Agent Architecture for Procedural Content Generation (2025)
- LogicEnvGen: Task-Logic Driven Generation of Diverse Simulated Environments for Embodied AI (2026)
- The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation (2026)
- A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning (2025)