Boosting Unsupervised Video Instance Segmentation with Automatic Quality-Guided Self-Training
Abstract
AutoQ-VIS achieves state-of-the-art results in unsupervised Video Instance Segmentation using quality-guided self-training to bridge the synthetic-to-real domain gap.
Video Instance Segmentation (VIS) faces significant annotation challenges because it requires both pixel-level masks and temporal consistency labels. While recent unsupervised methods like VideoCutLER eliminate optical flow dependencies through synthetic data, they remain constrained by the synthetic-to-real domain gap. We present AutoQ-VIS, a novel unsupervised framework that bridges this gap through quality-guided self-training. Our approach establishes a closed-loop system between pseudo-label generation and automatic quality assessment, enabling progressive adaptation from synthetic to real videos. Experiments demonstrate state-of-the-art performance with 52.6 AP_{50} on the YouTubeVIS-2019 val set, surpassing the previous state-of-the-art VideoCutLER by 4.4%, while requiring no human annotations. This demonstrates the viability of quality-aware self-training for unsupervised VIS. We will release the code at https://github.com/wcbup/AutoQ-VIS.
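The closed loop between pseudo-label generation and automatic quality assessment described in the abstract can be illustrated with a generic self-training skeleton. This is a minimal sketch and not the authors' implementation: the function name `quality_guided_self_training`, the callbacks `predict`, `score_quality`, and `retrain`, and the `rounds` and `threshold` parameters are all hypothetical, since the abstract does not specify how pseudo-labels are scored, selected, or accumulated.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

# A pseudo-labeled sample: (video id, predicted instance track).
# The track type is left abstract; the real representation is not given in the abstract.
Sample = Tuple[str, Hashable]


def quality_guided_self_training(
    model,
    videos: Sequence[str],
    predict: Callable[[object, str], Sequence[Hashable]],
    score_quality: Callable[[str, Hashable], float],
    retrain: Callable[[object, List[Sample]], object],
    rounds: int = 3,          # assumed number of self-training rounds
    threshold: float = 0.75,  # assumed quality cut-off
):
    """Generic loop: predict pseudo-labels, keep high-quality ones, retrain."""
    pseudo_set: List[Sample] = []
    for _ in range(rounds):
        # 1. Pseudo-label generation on unlabeled real videos.
        candidates = [(v, t) for v in videos for t in predict(model, v)]
        # 2. Automatic quality assessment: keep only tracks scored above the cut-off.
        kept = [s for s in candidates if score_quality(*s) >= threshold]
        # 3. Grow the training set without duplicating earlier pseudo-labels.
        for sample in kept:
            if sample not in pseudo_set:
                pseudo_set.append(sample)
        # 4. Retrain on the accumulated pseudo-labels and feed the model back in.
        model = retrain(model, pseudo_set)
    return model, pseudo_set
```

In this sketch the quality scorer acts as the gate that decides which predictions on real videos are trusted enough to train on, so each round can only add samples the scorer accepts. How the actual method trains its quality predictor and schedules rounds is documented in the paper and repository, not here.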
Community
Accepted at WACV'26!
Keywords: Video Instance Segmentation; Unsupervised Learning; Segmentation Quality Assessment