|
|
--- |
|
|
license: cc-by-4.0 |
|
|
datasets: |
|
|
- UWMadAbility/DISCOVR |
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- Ultralytics/YOLOv8 |
|
|
pipeline_tag: object-detection |
|
|
tags: |
|
|
- yolo |
|
|
- yolov8 |
|
|
- object-detection |
|
|
- accessibility |
|
|
- vr |
|
|
- virtual-reality |
|
|
- social-vr |
|
|
- screen-reader |
|
|
library_name: ultralytics |
|
|
--- |
|
|
|
|
|
# VRSight Object Detection Model |
|
|
|
|
|
Fine-tuned YOLOv8n model for detecting UI elements and interactive objects in virtual reality environments. This model powers the [VRSight system](https://github.com/MadisonAbilityLab/VRSight), a post hoc 3D screen reader for blind and low vision VR users. |
|
|
|
|
|
**Model Weights:** `best.pt` (available in the Files tab) |
|
|
**Full System:** [github.com/MadisonAbilityLab/VRSight](https://github.com/MadisonAbilityLab/VRSight) |
|
|
**Paper:** [VRSight (UIST 2025)](https://dl.acm.org/doi/full/10.1145/3746059.3747641) |
|
|
**Training Dataset:** [UWMadAbility/DISCOVR](https://huggingface.co/datasets/UWMadAbility/DISCOVR) |
|
|
|
|
|
**Developed by:** Daniel Killough, Justin Feng, Zheng Xue Ching, Daniel Wang, Rithvik Dyava, Yapeng Tian*, Yuhang Zhao |
|
|
**Affiliations:** University of Wisconsin-Madison, *University of Texas at Dallas |
|
|
|
|
|
## Quick Start |
|
|
|
|
|
### Installation & Download |
|
|
```bash |
|
|
pip install ultralytics |
|
|
|
|
|
# Download model weights |
|
|
wget -O best.pt https://huggingface.co/UWMadAbility/VRSight/resolve/main/best.pt |
|
|
``` |
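
Alternatively, the weights can be fetched in Python via `huggingface_hub`. This is a minimal sketch (it assumes the package is installed; the repo ID and filename match the Files tab above):

```python
from huggingface_hub import hf_hub_download

# Download best.pt into the local Hugging Face cache and return its path.
weights_path = hf_hub_download(repo_id="UWMadAbility/VRSight", filename="best.pt")
print(weights_path)
```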
|
|
|
|
|
### Basic Usage |
|
|
```python |
|
|
from ultralytics import YOLO |
|
|
|
|
|
# Load model |
|
|
model = YOLO('best.pt') |
|
|
|
|
|
# Run inference on VR screenshot |
|
|
results = model('vr_screenshot.jpg') |
|
|
|
|
|
# Process results |
|
|
for result in results:
    boxes = result.boxes
    for box in boxes:
        class_id = int(box.cls[0])
        confidence = float(box.conf[0])
        bbox = box.xyxy[0].tolist()

        print(f"Class: {model.names[class_id]}")
        print(f"Confidence: {confidence:.2f}")
        print(f"BBox: {bbox}")
|
|
``` |
|
|
|
|
|
### Batch Processing |
|
|
```python |
|
|
results = model.predict( |
|
|
source='vr_screenshots/', |
|
|
save=True, |
|
|
conf=0.25, |
|
|
device='0' # GPU 0, or 'cpu' |
|
|
) |
|
|
``` |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Architecture |
|
|
- **Base:** YOLOv8n (nano variant, optimized for real-time performance)
|
|
- **Input:** 640×640 pixels |
|
|
- **Output:** Bounding boxes with class predictions and confidence scores |
|
|
- **Classes:** 30 VR object types across 6 categories |
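
The class map is stored in the weights and can be inspected directly; a quick sanity check (assuming `best.pt` is in the working directory):

```python
from ultralytics import YOLO

model = YOLO('best.pt')
print(len(model.names))  # number of classes (expected: 30)
print(model.names)       # {class_id: class_name, ...}
```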
|
|
|
|
|
### Performance |
|
|
|
|
|
| Metric | Test Set | |
|
|
|--------|----------| |
|
|
| **mAP@50** | **67.3%** | |
|
|
| **mAP@75** | 49.5% | |
|
|
| **mAP** | 46.3% | |
|
|
| **Inference Speed** | ~20-30+ FPS | |
|
|
|
|
|
**Key Finding:** Base YOLOv8n trained on COCO rarely detected VR objects, demonstrating the necessity of VR-specific training data. See Table 1 in the [paper](https://dl.acm.org/doi/pdf/10.1145/3746059.3747641) for per-class metrics. |
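
If you have the DISCOVR test split prepared in Ultralytics dataset format, the reported metrics can in principle be recomputed with `model.val()`. The sketch below is illustrative only; `discovr.yaml` is a placeholder for a local dataset config, not a file shipped with this repo:

```python
from ultralytics import YOLO

model = YOLO('best.pt')

# 'discovr.yaml' is a hypothetical dataset config pointing at the DISCOVR
# test images and labels in Ultralytics format.
metrics = model.val(data='discovr.yaml', split='test', imgsz=640)
print(f"mAP@50:    {metrics.box.map50:.3f}")
print(f"mAP@75:    {metrics.box.map75:.3f}")
print(f"mAP@50-95: {metrics.box.map:.3f}")
```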
|
|
|
|
|
### Object Classes (30 Total) |
|
|
|
|
|
The model detects 6 categories of VR objects: |
|
|
|
|
|
**Avatars:** avatar, avatar-nonhuman, chat-bubble, chat-box |
|
|
**Informational:** sign-text, ui-text, sign-graphic, menu, ui-graphic, progress-bar, hud, indicator-mute |
|
|
**Interactables:** interactable, button, target, portal, writing-utensil, watch, writing-surface, spawner |
|
|
**Safety:** guardian, out-of-bounds |
|
|
**Seating:** seat-single, table, seat-multiple, campfire |
|
|
**VR System:** hand, controller, dashboard, locomotion-target |
|
|
|
|
|
See the [paper](https://dl.acm.org/doi/pdf/10.1145/3746059.3747641) (Table 1) for detailed descriptions and per-class performance. |
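
For downstream use it can help to collapse per-class predictions into these six categories. The snippet below is a rough illustration, not part of the released system; verify the exact class-name strings against `model.names`:

```python
from ultralytics import YOLO

# Category assignments follow the list above; two categories are elided
# for brevity and follow the same pattern.
CATEGORIES = {
    'avatars':   {'avatar', 'avatar-nonhuman', 'chat-bubble', 'chat-box'},
    'safety':    {'guardian', 'out-of-bounds'},
    'seating':   {'seat-single', 'table', 'seat-multiple', 'campfire'},
    'vr-system': {'hand', 'controller', 'dashboard', 'locomotion-target'},
}

model = YOLO('best.pt')
result = model('vr_screenshot.jpg')[0]

for box in result.boxes:
    name = model.names[int(box.cls[0])]
    category = next((c for c, members in CATEGORIES.items() if name in members), 'other')
    print(f"{name} -> {category}")
```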
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Dataset |
|
|
- **DISCOVR:** 17,691 labeled images from 17 social VR apps |
|
|
- **Train:** 15,207 images | **Val:** 1,645 images | **Test:** 839 images |
|
|
- **Augmentation:** Horizontal/vertical flips, rotation, scaling, shearing, HSV jittering |
|
|
|
|
|
### Training Configuration |
|
|
- **GPU:** NVIDIA A100 |
|
|
- **Epochs:** 250 |
|
|
- **Image Size:** 640×640 |
|
|
- **Method:** Fine-tuned from YOLOv8n pretrained weights |
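
A comparable fine-tuning run can be launched with the standard Ultralytics training API. This is a sketch under the settings listed above; `discovr.yaml` is a placeholder for a dataset config pointing at the DISCOVR splits, and any hyperparameters not listed above are left at Ultralytics defaults:

```python
from ultralytics import YOLO

# Start from COCO-pretrained YOLOv8n weights, as in the original training run.
model = YOLO('yolov8n.pt')

model.train(
    data='discovr.yaml',  # placeholder: local DISCOVR dataset config
    epochs=250,
    imgsz=640,
    device=0,             # single GPU (the paper used an NVIDIA A100)
)
```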
|
|
|
|
|
## VRSight System Integration |
|
|
|
|
|
This model is one component of the complete VRSight system, which combines: |
|
|
- **This object detection model** (detects VR objects) |
|
|
- Depth estimation (DepthAnythingV2) |
|
|
- GPT-4o (scene atmosphere and detailed descriptions) |
|
|
- OCR (text reading) |
|
|
- Spatial audio (TTS output routed to a WebVR app, e.g., PlayCanvas)
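
The sketch below illustrates just one of these integration steps: attaching a per-object depth estimate to each detection. It is illustrative only; `depth_map` is a stand-in for DepthAnythingV2 output aligned with the input frame, and nothing here reproduces the actual VRSight pipeline:

```python
import numpy as np
from ultralytics import YOLO

model = YOLO('best.pt')
result = model('vr_screenshot.jpg')[0]

# Placeholder for a DepthAnythingV2 depth map with the same HxW as the frame.
depth_map = np.zeros(result.orig_shape[:2])

for box in result.boxes:
    x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
    region = depth_map[y1:y2, x1:x2]
    median_depth = float(np.median(region)) if region.size else float('nan')
    print(f"{model.names[int(box.cls[0])]}: median depth {median_depth:.2f}")
```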
|
|
|
|
|
**To use the full VRSight system**, see the [GitHub repository](https://github.com/MadisonAbilityLab/VRSight). |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- **VR-specific:** Trained on social VR apps; performance may vary on other types of VR content
|
|
- **Lighting:** Reduced accuracy in dark environments |
|
|
- **Coverage:** 30 classes cover common social VR objects but not all possible VR elements |
|
|
- **Application types:** Best performance in social VR; may struggle with faster-paced games |
|
|
|
|
|
See Section 7.2 of the [paper](https://dl.acm.org/doi/pdf/10.1145/3746059.3747641) for detailed discussion. |
|
|
|
|
|
## Citation |
|
|
If you use this model, the DISCOVR dataset, or the full VRSight system, please cite the VRSight paper:
|
|
```bibtex |
|
|
@inproceedings{killough2025vrsight, |
|
|
title={VRSight: An AI-Driven Scene Description System to Improve Virtual Reality Accessibility for Blind People}, |
|
|
author={Killough, Daniel and Feng, Justin and Ching, Zheng Xue and Wang, Daniel and Dyava, Rithvik and Tian, Yapeng and Zhao, Yuhang}, |
|
|
booktitle={Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology}, |
|
|
pages={1--17}, |
|
|
year={2025}, |
|
|
publisher={ACM}, |
|
|
address={Busan, Republic of Korea}, |
|
|
doi={10.1145/3746059.3747641} |
|
|
} |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
CC BY 4.0: free to use with attribution.
|
|
|
|
|
## Contact |
|
|
|
|
|
- **GitHub Issues:** [github.com/MadisonAbilityLab/VRSight/issues](https://github.com/MadisonAbilityLab/VRSight/issues) |
|
|
- **Paper:** [dl.acm.org/doi/full/10.1145/3746059.3747641](https://dl.acm.org/doi/full/10.1145/3746059.3747641) |
|
|
- **Lead Author:** Daniel Killough (UW-Madison MadAbility Lab) |
|
|
|
|
|
## Related Resources |
|
|
|
|
|
- **[VRSight GitHub](https://github.com/MadisonAbilityLab/VRSight)** - Complete system implementation |
|
|
- **[DISCOVR Dataset](https://huggingface.co/datasets/UWMadAbility/DISCOVR)** - Training data |
|
|
- **[UIST 2025 Paper](https://dl.acm.org/doi/full/10.1145/3746059.3747641)** - Research paper |
|
|
- **[Video Demo](https://x.com/i/status/1969153746337665262)** - System in action |