Improve model card: add metadata, links and sample usage

#1
by nielsr (HF Staff)
Files changed (1)
  1. README.md +57 -3
README.md CHANGED
@@ -1,3 +1,57 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ library_name: transformers
+ pipeline_tag: image-feature-extraction
+ ---
+
+ # OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams
+
+ OmniStream is a unified streaming visual backbone that perceives, reconstructs, and acts on diverse visual inputs. By incorporating causal spatiotemporal attention and 3D rotary positional embeddings (3D-RoPE), the model supports efficient, frame-by-frame online processing of video streams via a persistent KV cache.
+
+ - **Paper:** [OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams](https://huggingface.co/papers/2603.12265)
+ - **Project Page:** [https://go2heart.github.io/omnistream/](https://go2heart.github.io/omnistream/)
+ - **Repository:** [https://github.com/Go2Heart/OmniStream](https://github.com/Go2Heart/OmniStream)
+
+ ## Sample Usage
+
+ The following snippet demonstrates how to use OmniStream for feature extraction. Note that it requires the `model.py` file from the official repository to be present in your environment.
+
+ ```python
+ import numpy as np
+ import torch
+ from transformers import AutoImageProcessor
+
+ from model import OmnistreamMultiFrameTransformer
+
+ # Load processor and model
+ processor = AutoImageProcessor.from_pretrained("StreamFormer/OmniStream")
+ model = OmnistreamMultiFrameTransformer.from_pretrained("StreamFormer/OmniStream").to("cuda")
+ model.eval()
+
+ # Prepare a dummy input: 16 frames of 512x512 RGB images, shaped (Time, Height, Width, Channels)
+ fake_pixel = np.random.randint(0, 256, (16, 512, 512, 3), dtype=np.uint8)
+ fake_input = processor(images=fake_pixel, return_tensors="pt").to("cuda")
+
+ # Add a batch dimension: (Batch, Time, Channels, Height, Width)
+ fake_input["pixel_values"] = fake_input["pixel_values"].unsqueeze(0).float()
+
+ with torch.no_grad():
+     output = model(**fake_input, return_dict=True)
+
+ print(output.keys())
+ print(output["last_hidden_state"].shape)  # last layer's hidden states
+ print(output["pooler_output"].shape)      # CLS token
+ print(output["patch_start_idx"])          # index of the first patch of each frame
+ ```
+
+ ## Citation
+
+ ```bibtex
+ @article{yan2026omnistream,
+   title={OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams},
+   author={Yibin Yan and Jilan Xu and Shangzhe Di and Haoning Wu and Weidi Xie},
+   journal={arXiv preprint arXiv:2603.12265},
+   year={2026},
+   url={https://arxiv.org/abs/2603.12265}
+ }
+ ```
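
The card describes frame-by-frame online processing via causal attention and a persistent KV cache. As a minimal sketch of that cache-and-append pattern (this is **not** the OmniStream implementation — the class, shapes, and single-head design here are purely illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class StreamingCausalAttention:
    """Hypothetical single-head attention that caches keys/values across
    frames, so each new frame attends to all frames seen so far (causal
    in time), without recomputing past keys/values."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.wq = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.wk = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.wv = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.k_cache = []  # persistent KV cache, grows one frame per step
        self.v_cache = []
        self.dim = dim

    def step(self, frame_tokens):
        # frame_tokens: (num_patches, dim) for the newest frame only
        q = frame_tokens @ self.wq
        self.k_cache.append(frame_tokens @ self.wk)
        self.v_cache.append(frame_tokens @ self.wv)
        k = np.concatenate(self.k_cache)  # (total_tokens_so_far, dim)
        v = np.concatenate(self.v_cache)
        attn = softmax(q @ k.T / np.sqrt(self.dim))
        return attn @ v                   # (num_patches, dim)

rng = np.random.default_rng(1)
layer = StreamingCausalAttention(dim=8)
for t in range(3):                                 # three incoming frames
    out = layer.step(rng.standard_normal((4, 8)))  # 4 patches per frame

print(out.shape)           # (4, 8): output for the newest frame only
print(len(layer.k_cache))  # 3 frames' keys cached
```

A real streaming backbone would use multi-head attention with 3D-RoPE applied to queries and keys, and typically bound the cache length; the sketch only shows why per-frame cost stays constant on the query side while context grows in the cache.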