---
license: cc-by-4.0
datasets:
- UWMadAbility/DISCOVR
language:
- en
base_model:
- Ultralytics/YOLOv8
pipeline_tag: object-detection
tags:
- yolo
- yolov8
- object-detection
- accessibility
- vr
- virtual-reality
- social-vr
- screen-reader
library_name: ultralytics
---
# VRSight Object Detection Model
Fine-tuned YOLOv8n model for detecting UI elements and interactive objects in virtual reality environments. This model powers the [VRSight system](https://github.com/MadisonAbilityLab/VRSight), a post hoc 3D screen reader for blind and low vision VR users.
**Model Weights:** `best.pt` (available in the Files tab)
**Full System:** [github.com/MadisonAbilityLab/VRSight](https://github.com/MadisonAbilityLab/VRSight)
**Paper:** [VRSight (UIST 2025)](https://dl.acm.org/doi/full/10.1145/3746059.3747641)
**Training Dataset:** [UWMadAbility/DISCOVR](https://huggingface.co/datasets/UWMadAbility/DISCOVR)
**Developed by:** Daniel Killough, Justin Feng, Zheng Xue Ching, Daniel Wang, Rithvik Dyava, Yapeng Tian*, Yuhang Zhao
**Affiliations:** University of Wisconsin-Madison, *University of Texas at Dallas
## Quick Start
### Installation & Download
```bash
pip install ultralytics
# Download model weights
wget -O best.pt https://huggingface.co/UWMadAbility/VRSight/resolve/main/best.pt
```
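If you prefer to stay in Python, the weights can also be fetched with the `huggingface_hub` client. A minimal sketch, assuming the `best.pt` filename at the repo root as shown in the Files tab:
```python
# pip install huggingface_hub
from huggingface_hub import hf_hub_download

# Downloads best.pt into the local Hugging Face cache and returns its path
weights_path = hf_hub_download(repo_id="UWMadAbility/VRSight", filename="best.pt")
print(weights_path)  # pass this path to YOLO(...)
```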
### Basic Usage
```python
from ultralytics import YOLO
# Load model
model = YOLO('best.pt')
# Run inference on VR screenshot
results = model('vr_screenshot.jpg')
# Process results
for result in results:
    boxes = result.boxes
    for box in boxes:
        class_id = int(box.cls[0])
        confidence = float(box.conf[0])
        bbox = box.xyxy[0].tolist()
        print(f"Class: {model.names[class_id]}")
        print(f"Confidence: {confidence:.2f}")
        print(f"BBox: {bbox}")
```
### Batch Processing
```python
results = model.predict(
    source='vr_screenshots/',
    save=True,
    conf=0.25,
    device='0'  # GPU 0, or 'cpu'
)
```
## Model Details
### Architecture
- **Base:** YOLOv8n (nano variant, optimized for real-time inference)
- **Input:** 640×640 pixels
- **Output:** Bounding boxes with class predictions and confidence scores
- **Classes:** 30 VR object types across 6 categories
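The class list and input size can be checked directly from the checkpoint. A quick sketch, assuming `best.pt` has been downloaded to the working directory:
```python
from ultralytics import YOLO

model = YOLO('best.pt')

# model.names maps class indices to the 30 VR object labels
print(len(model.names))  # expected: 30
print(model.names)       # dict of index -> label (order depends on the dataset config)

# Inference at the training resolution; Ultralytics letterboxes inputs to 640x640
results = model('vr_screenshot.jpg', imgsz=640)
```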
### Performance
| Metric | Test Set |
|--------|----------|
| **mAP@50** | **67.3%** |
| **mAP@75** | 49.5% |
| **mAP** | 46.3% |
| **Inference Speed** | 20–30+ FPS |
**Key Finding:** Base YOLOv8n trained on COCO rarely detected VR objects, demonstrating the necessity of VR-specific training data. See Table 1 in the [paper](https://dl.acm.org/doi/pdf/10.1145/3746059.3747641) for per-class metrics.
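Throughput depends heavily on hardware, so it is worth measuring on your own setup. Ultralytics attaches per-stage timing (in milliseconds) to every result; the rough sketch below converts it to a detection-only frame rate, which will differ from the paper's end-to-end numbers:
```python
from ultralytics import YOLO

model = YOLO('best.pt')
results = model('vr_screenshot.jpg')

# Milliseconds spent in each stage for this image
timing = results[0].speed  # {'preprocess': ..., 'inference': ..., 'postprocess': ...}
fps = 1000.0 / sum(timing.values())
print(f"{timing} -> ~{fps:.1f} FPS (single image, detection only)")
```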
### Object Classes (30 Total)
The model detects 6 categories of VR objects:
**Avatars:** avatar, avatar-nonhuman, chat-bubble, chat-box
**Informational:** sign-text, ui-text, sign-graphic, menu, ui-graphic, progress-bar, hud, indicator-mute
**Interactables:** interactable, button, target, portal, writing-utensil, watch, writing-surface, spawner
**Safety:** guardian, out-of-bounds
**Seating:** seat-single, table, seat-multiple, campfire
**VR System:** hand, controller, dashboard, locomotion-target
See the [paper](https://dl.acm.org/doi/pdf/10.1145/3746059.3747641) (Table 1) for detailed descriptions and per-class performance.
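The category grouping is not stored in the checkpoint; only the 30 flat class names are. If you need category-level behavior (for example, prioritizing safety boundaries), a small mapping over `model.names` is enough. A sketch, assuming the class-name spellings above match the checkpoint:
```python
from ultralytics import YOLO

SAFETY_CLASSES = {'guardian', 'out-of-bounds'}  # names as listed above

model = YOLO('best.pt')
results = model('vr_screenshot.jpg')

for box in results[0].boxes:
    name = model.names[int(box.cls[0])]
    if name in SAFETY_CLASSES:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        print(f"Safety object '{name}' at ({x1:.0f}, {y1:.0f})-({x2:.0f}, {y2:.0f})")
```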
## Training Details
### Dataset
- **DISCOVR:** 17,691 labeled images from 17 social VR apps
- **Train:** 15,207 images | **Val:** 1,645 images | **Test:** 839 images
- **Augmentation:** Horizontal/vertical flips, rotation, scaling, shearing, HSV jittering
### Training Configuration
- **GPU:** NVIDIA A100
- **Epochs:** 250
- **Image Size:** 640×640
- **Method:** Fine-tuned from YOLOv8n pretrained weights
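To reproduce or extend the fine-tune with this configuration, the standard Ultralytics training API is sufficient. A sketch, assuming a DISCOVR data YAML (here called `discovr.yaml`) that points at the train/val/test splits; the augmentation values are illustrative, not the exact ones used for the released weights:
```python
from ultralytics import YOLO

model = YOLO('yolov8n.pt')  # start from COCO-pretrained YOLOv8n

model.train(
    data='discovr.yaml',  # hypothetical dataset config for DISCOVR
    epochs=250,
    imgsz=640,
    device=0,             # e.g., a single A100
    # Augmentations roughly matching the list above (illustrative values)
    fliplr=0.5, flipud=0.5,
    degrees=10.0, scale=0.5, shear=2.0,
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,
)
```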
## VRSight System Integration
This model is one component of the complete VRSight system, which combines:
- **This object detection model** (detects VR objects)
- Depth estimation (DepthAnythingV2)
- GPT-4o (scene atmosphere and detailed descriptions)
- OCR (text reading)
- Spatial audio (TTS routed to a WebVR app, e.g., PlayCanvas)
**To use the full VRSight system**, see the [GitHub repository](https://github.com/MadisonAbilityLab/VRSight).
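For orientation, here is a deliberately simplified sketch of how the detection output could feed the rest of such a pipeline. The `estimate_depth` and `speak_spatially` helpers are hypothetical placeholders for the depth-estimation and spatial-audio components; the GitHub repository contains the actual implementation.
```python
from ultralytics import YOLO

model = YOLO('best.pt')

def estimate_depth(frame_path):
    # Hypothetical stub: the real system runs DepthAnythingV2 here and
    # returns a per-pixel depth map aligned with the input frame.
    raise NotImplementedError

def speak_spatially(label, cx, cy, distance):
    # Hypothetical stub: the real system spatializes TTS in a WebVR app.
    print(f"{label} at ({cx:.0f}, {cy:.0f}), depth {distance:.2f}")

def describe_frame(frame_path):
    """Turn one VR frame into spatialized object announcements (sketch only)."""
    detections = model(frame_path)[0]
    depth_map = estimate_depth(frame_path)

    for box in detections.boxes:
        label = model.names[int(box.cls[0])]
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2   # screen-space center of the detection
        distance = depth_map[int(cy), int(cx)]  # rough depth at that point
        speak_spatially(label, cx, cy, distance)
```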
## Limitations
- **VR-specific:** Trained on social VR apps; performance may vary in other types of VR applications
- **Lighting:** Reduced accuracy in dark environments
- **Coverage:** 30 classes cover common social VR objects but not all possible VR elements
- **Application types:** Best performance in social VR; may struggle with faster-paced games
See Section 7.2 of the [paper](https://dl.acm.org/doi/pdf/10.1145/3746059.3747641) for detailed discussion.
## Citation
If you use this model, the DISCOVR dataset, or the fine-tuned detection weights, please cite the VRSight paper:
```bibtex
@inproceedings{killough2025vrsight,
  title={VRSight: An AI-Driven Scene Description System to Improve Virtual Reality Accessibility for Blind People},
  author={Killough, Daniel and Feng, Justin and Ching, Zheng Xue and Wang, Daniel and Dyava, Rithvik and Tian, Yapeng and Zhao, Yuhang},
  booktitle={Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology},
  pages={1--17},
  year={2025},
  publisher={ACM},
  address={Busan, Republic of Korea},
  doi={10.1145/3746059.3747641}
}
```
## License
CC BY 4.0. Free to use with attribution.
## Contact
- **GitHub Issues:** [github.com/MadisonAbilityLab/VRSight/issues](https://github.com/MadisonAbilityLab/VRSight/issues)
- **Paper:** [dl.acm.org/doi/full/10.1145/3746059.3747641](https://dl.acm.org/doi/full/10.1145/3746059.3747641)
- **Lead Author:** Daniel Killough (UW-Madison MadAbility Lab)
## Related Resources
- **[VRSight GitHub](https://github.com/MadisonAbilityLab/VRSight)** - Complete system implementation
- **[DISCOVR Dataset](https://huggingface.co/datasets/UWMadAbility/DISCOVR)** - Training data
- **[UIST 2025 Paper](https://dl.acm.org/doi/full/10.1145/3746059.3747641)** - Research paper
- **[Video Demo](https://x.com/i/status/1969153746337665262)** - System in action