Improve model card: add metadata, links and sample usage
#1
by nielsr HF Staff - opened

README.md CHANGED

@@ -1,3 +1,57 @@
The diff removes the previous 3-line card, which contained only a front matter block declaring `license: mit`, and replaces it with the content below.
---
license: mit
library_name: transformers
pipeline_tag: image-feature-extraction
---

# OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams

OmniStream is a unified streaming visual backbone that perceives, reconstructs, and acts from diverse visual inputs. By combining causal spatiotemporal attention with 3D rotary positional embeddings (3D-RoPE), the model supports efficient, frame-by-frame online processing of video streams via a persistent KV-cache.
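The causal attention plus persistent KV-cache design can be illustrated with a minimal single-head NumPy sketch: each incoming frame attends to its own tokens and to the cached keys/values of all previous frames, so past frames never need to be re-encoded. The cache layout and projection details here are illustrative assumptions, not the model's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class StreamingAttention:
    """Single-head causal attention over a growing per-frame KV-cache (toy sketch)."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.wq = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.wk = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.wv = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.k_cache = []  # one (tokens, dim) array per past frame
        self.v_cache = []

    def step(self, frame_tokens):
        """Process one frame online: attend to this frame and every cached one."""
        q = frame_tokens @ self.wq
        self.k_cache.append(frame_tokens @ self.wk)  # cache persists across steps
        self.v_cache.append(frame_tokens @ self.wv)
        k = np.concatenate(self.k_cache)  # all tokens seen so far
        v = np.concatenate(self.v_cache)
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        return attn @ v

attn = StreamingAttention(dim=8)
out1 = attn.step(np.ones((4, 8)))  # frame 1: queries attend over 4 tokens
out2 = attn.step(np.ones((4, 8)))  # frame 2: queries attend over 8 tokens
print(out1.shape, out2.shape, len(attn.k_cache))
```

Because keys and values for past frames are reused from the cache, per-frame cost grows only with the number of new query tokens, which is what makes online, frame-by-frame inference practical.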
- **Paper:** [OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams](https://huggingface.co/papers/2603.12265)
- **Project Page:** [https://go2heart.github.io/omnistream/](https://go2heart.github.io/omnistream/)
- **Repository:** [https://github.com/Go2Heart/OmniStream](https://github.com/Go2Heart/OmniStream)

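The 3D-RoPE component extends standard rotary embeddings by assigning each token a (time, height, width) coordinate. A minimal NumPy sketch of one common formulation — the channel-group split and group sizes are assumptions; the model's actual implementation lives in the repository:

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard RoPE on the last axis: rotate consecutive channel pairs by position-dependent angles."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)  # (d/2,) frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, t, h, w):
    """Split channels into three groups, one rotated per coordinate axis (t, h, w)."""
    d = x.shape[-1]
    g = d // 3  # assumes d divisible by 6 so each group has an even size
    return np.concatenate([
        rope_1d(x[..., :g], t),
        rope_1d(x[..., g:2 * g], h),
        rope_1d(x[..., 2 * g:], w),
    ], axis=-1)

q = np.random.default_rng(0).standard_normal(12)
rotated = rope_3d(q, t=5, h=2, w=7)
print(np.allclose(np.linalg.norm(rotated), np.linalg.norm(q)))  # True: rotations preserve norm
```

As with 1D RoPE, the rotation is norm-preserving and makes query-key dot products depend on relative offsets along each axis rather than absolute positions.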
## Sample Usage

The following code snippet demonstrates how to use OmniStream for feature extraction. Note that this requires the `model.py` file from the official repository to be present in your environment.

```python
import numpy as np
import torch
from transformers import AutoImageProcessor

from model import OmnistreamMultiFrameTransformer  # from the official repository

# Load processor and model
processor = AutoImageProcessor.from_pretrained("StreamFormer/OmniStream")
model = OmnistreamMultiFrameTransformer.from_pretrained("StreamFormer/OmniStream").to("cuda")
model.eval()

# Prepare a dummy input: 16 frames of 512x512 RGB images (Time, Height, Width, Channels)
fake_pixel = np.random.randn(16, 512, 512, 3)
fake_input = processor(images=fake_pixel, return_tensors="pt").to("cuda")

# Add a batch dimension: (Batch, Time, Channels, Height, Width)
fake_input["pixel_values"] = fake_input["pixel_values"].unsqueeze(0).float()

with torch.no_grad():
    output = model(**fake_input, return_dict=True)

print(output.keys())
print(output["last_hidden_state"].shape)  # last layer's hidden states
print(output["pooler_output"].shape)      # CLS token
print(output["patch_start_idx"])          # index of the first patch of each frame
```
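If you need per-frame features, `patch_start_idx` can be used to slice `last_hidden_state` back into frames. A minimal NumPy sketch — whether special tokens are interleaved and the exact index semantics are assumptions, so check the repository for the authoritative layout:

```python
import numpy as np

def per_frame_features(last_hidden_state, patch_start_idx):
    """Mean-pool a (1, num_tokens, dim) token sequence into one feature per frame.

    Assumes patch_start_idx holds the index of the first patch of each frame,
    so frame i spans [patch_start_idx[i], patch_start_idx[i + 1]).
    """
    tokens = last_hidden_state[0]                       # (num_tokens, dim)
    bounds = list(patch_start_idx) + [tokens.shape[0]]  # append end-of-sequence
    return np.stack([
        tokens[bounds[i]:bounds[i + 1]].mean(axis=0)    # pool each frame's patches
        for i in range(len(patch_start_idx))
    ])

# Toy example: 2 frames of 3 patches each, dim 4
hidden = np.arange(24, dtype=np.float32).reshape(1, 6, 4)
feats = per_frame_features(hidden, [0, 3])
print(feats.shape)  # (2, 4)
```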

## Citation

```bibtex
@article{yan2026omnistream,
  title={OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams},
  author={Yibin Yan and Jilan Xu and Shangzhe Di and Haoning Wu and Weidi Xie},
  journal={arXiv preprint arXiv:2603.12265},
  year={2026},
  url={https://arxiv.org/abs/2603.12265}
}
```