anjaneyasharma/GeoOSSVision

Image-Text-to-Text


anjaneyasharma committed 16 days ago

Commit 2d14ac8 · verified · 1 Parent(s): eb88986

Update README.md

Files changed (1)

README.md +0 -26

@@ -17,29 +17,3 @@ tags:
 # GeoOSSVision
-## Introduction
-**GeoOSSVision** is a state-of-the-art open-source multimodal model that advances versatility, reasoning capability, and inference efficiency through key innovations:
-- **Cascade Reinforcement Learning (Cascade RL)** – two-stage RL framework (offline RL → online RL) for superior reasoning
-- **Visual Resolution Router (ViR)** – dynamically adjusts visual token resolution per image patch
-- **Decoupled Vision-Language Deployment (DvD)** – separates vision encoder and LLM for **4.05× inference speedup**
-Achieves SOTA results among open-source multimodal models across general understanding, reasoning, OCR, agentic tasks, GUI interaction, and embodied intelligence.
-## Model Architecture
-GeoOSSVision follows the "ViT → MLP → LLM" paradigm:
-**Components:**
-- **Vision encoder**: InternViT-300M or InternViT-6B
-- **Language model**: OSS (GPT-OSS merge)
-- **Dynamic high-resolution tiling** (improved from prior work)
-- **Visual Resolution Router (ViR)** – optional for **GeoOSSVision-Flash** (50% fewer visual tokens, near-zero performance loss)
-**ViR Architecture Details:**
-In standard mode: 1024 → 256 visual tokens per patch
-In Flash mode: Additional 256 → 64 compression path
-Patch router selects optimal compression rate based on semantic richness
-![Architecture Diagram](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/architecture.jpg)
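
For readers of this diff, the token arithmetic in the removed "ViR Architecture Details" lines can be sketched as follows. This is only an illustration under stated assumptions: the function name `visual_tokens` and the parameter `high_compression_fraction` are hypothetical, and the sketch assumes every patch yields either 256 tokens (standard path) or 64 tokens (Flash path). How the actual router scores "semantic richness" is not described in the removed text and is not modeled here.

```python
def visual_tokens(num_patches: int, high_compression_fraction: float) -> int:
    """Estimate the total visual tokens for one image.

    Assumption (not from the README): each patch contributes exactly
    256 tokens on the standard path or 64 tokens on the Flash path,
    and `high_compression_fraction` of the patches take the Flash path.
    """
    if num_patches < 0:
        raise ValueError("num_patches must be non-negative")
    if not 0.0 <= high_compression_fraction <= 1.0:
        raise ValueError("high_compression_fraction must be in [0, 1]")

    # Number of patches routed to the 256 -> 64 compression path.
    compressed = round(num_patches * high_compression_fraction)
    standard = num_patches - compressed
    return standard * 256 + compressed * 64


if __name__ == "__main__":
    patches = 12
    baseline = visual_tokens(patches, 0.0)
    flash = visual_tokens(patches, 2 / 3)
    print(f"standard mode: {baseline} tokens")
    print(f"two thirds of patches on the 64-token path: {flash} tokens")
```

Under this simplified model, the "50% fewer visual tokens" figure corresponds to roughly two thirds of the patches taking the 64-token path, since 256 - (2/3) × 192 = 128 tokens per patch on average.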

