---
license: cc-by-4.0
datasets:
- UWMadAbility/DISCOVR
language:
- en
base_model:
- Ultralytics/YOLOv8
pipeline_tag: object-detection
tags:
- yolo
- yolov8
- object-detection
- accessibility
- vr
- virtual-reality
- social-vr
- screen-reader
library_name: ultralytics
---
# VRSight Object Detection Model
Fine-tuned YOLOv8n model for detecting UI elements and interactive objects in virtual reality environments. This model powers the [VRSight system](https://github.com/MadisonAbilityLab/VRSight), a post hoc 3D screen reader for blind and low vision VR users.
**Model Weights:** `best.pt` (available in the Files tab)
**Full System:** [github.com/MadisonAbilityLab/VRSight](https://github.com/MadisonAbilityLab/VRSight)
**Paper:** [VRSight (UIST 2025)](https://dl.acm.org/doi/full/10.1145/3746059.3747641)
**Training Dataset:** [UWMadAbility/DISCOVR](https://huggingface.co/datasets/UWMadAbility/DISCOVR)
**Developed by:** Daniel Killough, Justin Feng, Zheng Xue Ching, Daniel Wang, Rithvik Dyava, Yapeng Tian*, Yuhang Zhao
**Affiliations:** University of Wisconsin-Madison, *University of Texas at Dallas
## Quick Start
### Installation & Download
```bash
pip install ultralytics
# Download model weights
wget -O best.pt https://huggingface.co/UWMadAbility/VRSight/resolve/main/best.pt
```
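If you prefer to stay in Python, the weights can also be fetched with the `huggingface_hub` client. A minimal sketch, assuming the `best.pt` filename at the repo root as shown in the Files tab:
```python
# pip install huggingface_hub
from huggingface_hub import hf_hub_download

# Downloads best.pt into the local Hugging Face cache and returns its path
weights_path = hf_hub_download(repo_id="UWMadAbility/VRSight", filename="best.pt")
print(weights_path)  # pass this path to YOLO(...)
```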
### Basic Usage
```python
from ultralytics import YOLO
# Load model
model = YOLO('best.pt')
# Run inference on VR screenshot
results = model('vr_screenshot.jpg')
# Process results
for result in results:
    boxes = result.boxes
    for box in boxes:
        class_id = int(box.cls[0])
        confidence = float(box.conf[0])
        bbox = box.xyxy[0].tolist()
        print(f"Class: {model.names[class_id]}")
        print(f"Confidence: {confidence:.2f}")
        print(f"BBox: {bbox}")
```
### Batch Processing
```python
results = model.predict(
    source='vr_screenshots/',
    save=True,
    conf=0.25,
    device='0'  # GPU 0, or 'cpu'
)
```
## Model Details
### Architecture
- **Base:** YOLOv8n (nano variant, optimized for real-time inference)
- **Input:** 640×640 pixels
- **Output:** Bounding boxes with class predictions and confidence scores
- **Classes:** 30 VR object types across 6 categories
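The class list and input size can be checked directly from the checkpoint. A quick sketch, assuming `best.pt` has been downloaded to the working directory:
```python
from ultralytics import YOLO

model = YOLO('best.pt')

# model.names maps class indices to the 30 VR object labels
print(len(model.names))  # expected: 30
print(model.names)       # dict of index -> label (order depends on the dataset config)

# Inference at the training resolution; Ultralytics letterboxes inputs to 640x640
results = model('vr_screenshot.jpg', imgsz=640)
```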
### Performance
| Metric | Test Set |
|--------|----------|
| **mAP@50** | **67.3%** |
| **mAP@75** | 49.5% |
| **mAP** | 46.3% |
| **Inference Speed** | 20–30+ FPS |
**Key Finding:** Base YOLOv8n trained on COCO rarely detected VR objects, demonstrating the necessity of VR-specific training data. See Table 1 in the [paper](https://dl.acm.org/doi/pdf/10.1145/3746059.3747641) for per-class metrics.
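Throughput depends heavily on hardware, so it is worth measuring on your own setup. Ultralytics attaches per-stage timing (in milliseconds) to every result; the rough sketch below converts it to a detection-only frame rate, which will differ from the paper's end-to-end numbers:
```python
from ultralytics import YOLO

model = YOLO('best.pt')
results = model('vr_screenshot.jpg')

# Milliseconds spent in each stage for this image
timing = results[0].speed  # {'preprocess': ..., 'inference': ..., 'postprocess': ...}
fps = 1000.0 / sum(timing.values())
print(f"{timing} -> ~{fps:.1f} FPS (single image, detection only)")
```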
### Object Classes (30 Total)
The model detects 6 categories of VR objects:
**Avatars:** avatar, avatar-nonhuman, chat-bubble, chat-box
**Informational:** sign-text, ui-text, sign-graphic, menu, ui-graphic, progress-bar, hud, indicator-mute
**Interactables:** interactable, button, target, portal, writing-utensil, watch, writing-surface, spawner
**Safety:** guardian, out-of-bounds
**Seating:** seat-single, table, seat-multiple, campfire
**VR System:** hand, controller, dashboard, locomotion-target
See the [paper](https://dl.acm.org/doi/pdf/10.1145/3746059.3747641) (Table 1) for detailed descriptions and per-class performance.
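The category grouping is not stored in the checkpoint; only the 30 flat class names are. If you need category-level behavior (for example, prioritizing safety boundaries), a small mapping over `model.names` is enough. A sketch, assuming the class-name spellings above match the checkpoint:
```python
from ultralytics import YOLO

SAFETY_CLASSES = {'guardian', 'out-of-bounds'}  # names as listed above

model = YOLO('best.pt')
results = model('vr_screenshot.jpg')

for box in results[0].boxes:
    name = model.names[int(box.cls[0])]
    if name in SAFETY_CLASSES:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        print(f"Safety object '{name}' at ({x1:.0f}, {y1:.0f})-({x2:.0f}, {y2:.0f})")
```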
## Training Details
### Dataset
- **DISCOVR:** 17,691 labeled images from 17 social VR apps
- **Train:** 15,207 images | **Val:** 1,645 images | **Test:** 839 images
- **Augmentation:** Horizontal/vertical flips, rotation, scaling, shearing, HSV jittering
### Training Configuration
- **GPU:** NVIDIA A100
- **Epochs:** 250
- **Image Size:** 640×640
- **Method:** Fine-tuned from YOLOv8n pretrained weights
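To reproduce or extend the fine-tune with this configuration, the standard Ultralytics training API is sufficient. A sketch, assuming a DISCOVR data YAML (here called `discovr.yaml`) that points at the train/val/test splits; the augmentation values are illustrative, not the exact ones used for the released weights:
```python
from ultralytics import YOLO

model = YOLO('yolov8n.pt')  # start from COCO-pretrained YOLOv8n

model.train(
    data='discovr.yaml',  # hypothetical dataset config for DISCOVR
    epochs=250,
    imgsz=640,
    device=0,             # e.g., a single A100
    # Augmentations roughly matching the list above (illustrative values)
    fliplr=0.5, flipud=0.5,
    degrees=10.0, scale=0.5, shear=2.0,
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,
)
```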
## VRSight System Integration
This model is one component of the complete VRSight system, which combines:
- **This object detection model** (detects VR objects)
- Depth estimation (DepthAnythingV2)
- GPT-4o (scene atmosphere and detailed descriptions)
- OCR (text reading)
- Spatial audio (TTS routed to a WebVR app, e.g., PlayCanvas)
**To use the full VRSight system**, see the [GitHub repository](https://github.com/MadisonAbilityLab/VRSight).
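For orientation, here is a deliberately simplified sketch of how the detection output could feed the rest of such a pipeline. The `estimate_depth` and `speak_spatially` helpers are hypothetical placeholders for the depth-estimation and spatial-audio components; the GitHub repository contains the actual implementation.
```python
from ultralytics import YOLO

model = YOLO('best.pt')

def estimate_depth(frame_path):
    # Hypothetical stub: the real system runs DepthAnythingV2 here and
    # returns a per-pixel depth map aligned with the input frame.
    raise NotImplementedError

def speak_spatially(label, cx, cy, distance):
    # Hypothetical stub: the real system spatializes TTS in a WebVR app.
    print(f"{label} at ({cx:.0f}, {cy:.0f}), depth {distance:.2f}")

def describe_frame(frame_path):
    """Turn one VR frame into spatialized object announcements (sketch only)."""
    detections = model(frame_path)[0]
    depth_map = estimate_depth(frame_path)

    for box in detections.boxes:
        label = model.names[int(box.cls[0])]
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2   # screen-space center of the detection
        distance = depth_map[int(cy), int(cx)]  # rough depth at that point
        speak_spatially(label, cx, cy, distance)
```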
## Limitations
- **VR-specific:** Trained on social VR apps; performance may vary in other types of VR applications
- **Lighting:** Reduced accuracy in dark environments
- **Coverage:** 30 classes cover common social VR objects but not all possible VR elements
- **Application types:** Best performance in social VR; may struggle with faster-paced games
See Section 7.2 of the [paper](https://dl.acm.org/doi/pdf/10.1145/3746059.3747641) for detailed discussion.
## Citation
If you use this model, the DISCOVR dataset, or the fine-tuned detection weights, please cite the VRSight paper:
```bibtex
@inproceedings{killough2025vrsight,
  title={VRSight: An AI-Driven Scene Description System to Improve Virtual Reality Accessibility for Blind People},
  author={Killough, Daniel and Feng, Justin and Ching, Zheng Xue and Wang, Daniel and Dyava, Rithvik and Tian, Yapeng and Zhao, Yuhang},
  booktitle={Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology},
  pages={1--17},
  year={2025},
  publisher={ACM},
  address={Busan, Republic of Korea},
  doi={10.1145/3746059.3747641}
}
```
## License
CC BY 4.0. Free to use with attribution.
## Contact
- **GitHub Issues:** [github.com/MadisonAbilityLab/VRSight/issues](https://github.com/MadisonAbilityLab/VRSight/issues)
- **Paper:** [dl.acm.org/doi/full/10.1145/3746059.3747641](https://dl.acm.org/doi/full/10.1145/3746059.3747641)
- **Lead Author:** Daniel Killough (UW-Madison MadAbility Lab)
## Related Resources
- **[VRSight GitHub](https://github.com/MadisonAbilityLab/VRSight)** - Complete system implementation
- **[DISCOVR Dataset](https://huggingface.co/datasets/UWMadAbility/DISCOVR)** - Training data
- **[UIST 2025 Paper](https://dl.acm.org/doi/full/10.1145/3746059.3747641)** - Research paper
- **[Video Demo](https://x.com/i/status/1969153746337665262)** - System in action