OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding
Abstract
OpenVoxel enables open-vocabulary 3D scene understanding through training-free grouping and captioning of sparse voxels using Vision Language Models and Multi-modal Large Language Models.
We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for open-vocabulary 3D scene understanding tasks. Given a sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, OpenVoxel produces meaningful groups that describe the different objects in the scene. By leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), OpenVoxel then builds an informative scene map by captioning each group, enabling further 3D scene understanding tasks such as open-vocabulary segmentation (OVS) and referring expression segmentation (RES). Unlike previous methods, our method is training-free and does not rely on embeddings from a CLIP/BERT text encoder; instead, we perform text-to-text search directly with MLLMs. Through extensive experiments, our method demonstrates superior performance compared to recent studies, particularly on complex RES tasks. The code will be released.
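The abstract describes a three-stage pipeline: group the sparse voxels, caption each group with a VLM/MLLM to form a text-based scene map, and answer queries by text-to-text search. The sketch below illustrates that flow under stated assumptions; it is not the authors' code, and the names (`VoxelGroup`, `build_scene_map`, `answer_query`, the `caption_fn`/`match_fn` callables) are hypothetical placeholders for whatever grouping, captioning, and matching components the paper actually uses.

```python
# Minimal sketch of an OpenVoxel-style pipeline (hypothetical interfaces, not the official API).
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class VoxelGroup:
    voxel_ids: List[int]   # indices into the sparse voxel grid (assumed grouping output)
    caption: str = ""      # MLLM-generated description of the group


def build_scene_map(
    groups: Sequence[VoxelGroup],
    caption_fn: Callable[[VoxelGroup], str],
) -> Dict[int, str]:
    """Caption every voxel group to form a text-based scene map."""
    scene_map: Dict[int, str] = {}
    for gid, group in enumerate(groups):
        group.caption = caption_fn(group)   # e.g. query a VLM/MLLM on rendered views of the group
        scene_map[gid] = group.caption
    return scene_map


def answer_query(
    query: str,
    scene_map: Dict[int, str],
    match_fn: Callable[[str, str], float],
) -> int:
    """Text-to-text search: return the group whose caption best matches the query."""
    return max(scene_map, key=lambda gid: match_fn(query, scene_map[gid]))
```

For RES or OVS, the returned group id would then be mapped back to its voxels to produce a 3D segmentation mask; `match_fn` stands in for the MLLM-based text matching the abstract mentions, rather than a CLIP/BERT embedding comparison.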
Community
OpenVoxel provides training-free grouping and captioning of sparse voxels for open-vocabulary 3D scene understanding using VLMs/MLLMs and text search, enabling RES and OVS without CLIP embeddings.
Thank you for submitting the paper here!
project page: https://peterjohnsonhuang.github.io/openvoxel-pages/
- Compared to current training-based methods such as ReferSplat, OpenVoxel achieves a notable 13.2-point improvement in mIoU (29.2% → 42.4%) on the Ref-LeRF RES dataset.
- In addition, when equipping a reconstructed 3D scene (3DGS, SVRaster) with text-guided scene understanding, our training-free OpenVoxel is 10x faster than training-based methods.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- SegSplat: Feed-forward Gaussian Splatting and Open-Set Semantic Segmentation (2025)
- OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation (2025)
- UniC-Lift: Unified 3D Instance Segmentation via Contrastive Learning (2025)
- Retrieving Objects from 3D Scenes with Box-Guided Open-Vocabulary Instance Segmentation (2025)
- FLEG: Feed-Forward Language Embedded Gaussian Splatting from Any Views (2025)
- Unified Semantic Transformer for 3D Scene Understanding (2025)
- LLaVA3: Representing 3D Scenes like a Cubist Painter to Boost 3D Scene Understanding of VLMs (2025)