OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams

OmniStream is a unified streaming visual backbone that perceives, reconstructs, and acts from diverse visual inputs. It combines causal spatiotemporal attention with 3D rotary positional embeddings (3D-RoPE) and keeps a persistent KV-cache, so video streams can be processed online, frame by frame, without recomputing past frames.
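
To make the KV-cache idea concrete, here is a minimal NumPy sketch, not OmniStream's actual implementation. It assumes frame-level causality (tokens attend to every token in the current and earlier frames), a single attention head, toy sizes, and no positional embeddings; all names in it are illustrative. Under those assumptions, attending frame by frame against a growing key/value cache gives the same result as one offline pass with a block-causal mask.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v, mask=None):
    """Single-head scaled dot-product attention; mask is True where attention is allowed."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
num_frames, tokens_per_frame, dim = 4, 3, 8  # toy sizes (assumption, not the model's)
n = num_frames * tokens_per_frame
q, k, v = (rng.standard_normal((n, dim)) for _ in range(3))

# Offline: one pass over all frames with a block-causal mask.
# A token may attend to any token whose frame index is <= its own.
frame_id = np.repeat(np.arange(num_frames), tokens_per_frame)
mask = frame_id[:, None] >= frame_id[None, :]
offline = attend(q, k, v, mask)

# Online: process one frame at a time, appending its keys/values to a cache.
k_cache = np.empty((0, dim))
v_cache = np.empty((0, dim))
outputs = []
for t in range(num_frames):
    sl = slice(t * tokens_per_frame, (t + 1) * tokens_per_frame)
    k_cache = np.concatenate([k_cache, k[sl]])
    v_cache = np.concatenate([v_cache, v[sl]])
    # No mask needed: the cache only ever holds current and past frames.
    outputs.append(attend(q[sl], k_cache, v_cache))
online = np.concatenate(outputs)

print(np.allclose(offline, online))
```

Each online step only computes queries for the newest frame, which is why a persistent cache makes streaming inference cheaper than re-running the full clip.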

Paper: OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams (https://arxiv.org/abs/2603.12265)
Project Page: https://go2heart.github.io/omnistream/
Repository: https://github.com/Go2Heart/OmniStream

Sample Usage

The following snippet shows how to use OmniStream for feature extraction. It requires the model.py file from the official repository to be importable (for example, placed in your working directory), and it assumes a CUDA-capable GPU.

from model import OmnistreamMultiFrameTransformer
from transformers import AutoImageProcessor
import torch
import numpy as np

# Load processor and model
processor = AutoImageProcessor.from_pretrained("StreamFormer/OmniStream")
model = OmnistreamMultiFrameTransformer.from_pretrained("StreamFormer/OmniStream").to("cuda")

model.eval()

# Prepare dummy input: 16 RGB frames of 512x512 as uint8 pixels,
# laid out as (Batch x Time, Height, Width, Channels)
fake_pixel = np.random.randint(0, 256, size=(16, 512, 512, 3), dtype=np.uint8)
fake_input = processor(images=fake_pixel, return_tensors="pt").to("cuda")

# Add a batch dimension: (Time, Channels, Height, Width) -> (Batch, Time, Channels, Height, Width)
fake_input["pixel_values"] = fake_input["pixel_values"].unsqueeze(0).float()

with torch.no_grad():
    output = model(**fake_input, return_dict=True)

print(output.keys())
print(output["last_hidden_state"].shape) # last layer's hidden states
print(output["pooler_output"].shape)      # cls token
print(output["patch_start_idx"])         # index of the first patch of each frame

Citation

@article{yan2026omnistream,
  title={OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams}, 
  author={Yibin Yan and Jilan Xu and Shangzhe Di and Haoning Wu and Weidi Xie},
  journal={arXiv preprint arXiv:2603.12265},
  year={2026},
  url={https://arxiv.org/abs/2603.12265}
}

Model size: 0.3B params (Safetensors, tensor type F32)