Update README.md
README.md
CHANGED
|
@@ -17,29 +17,3 @@ tags:
|
|
| 17 |
|
| 18 |
# GeoOSSVision
|
| 19 |
|
| 20 |
-
## Introduction
|
| 21 |
-
|
| 22 |
-
**GeoOSSVision** is a state-of-the-art open-source multimodal model that advances versatility, reasoning capability, and inference efficiency through key innovations:
|
| 23 |
-
|
| 24 |
-
- **Cascade Reinforcement Learning (Cascade RL)**: a two-stage RL framework (offline RL → online RL) for stronger reasoning
|
| 25 |
-
- **Visual Resolution Router (ViR)**: dynamically adjusts visual token resolution per image patch
|
| 26 |
-
- **Decoupled Vision-Language Deployment (DvD)**: separates the vision encoder and the LLM for a **4.05× inference speedup**
|
| 27 |
-
|
| 28 |
-
Achieves SOTA results among open-source multimodal models across general understanding, reasoning, OCR, agentic tasks, GUI interaction, and embodied intelligence.
|
| 29 |
-
|
| 30 |
-
## Model Architecture
|
| 31 |
-
|
| 32 |
-
GeoOSSVision follows the "ViT → MLP → LLM" paradigm:
|
| 33 |
-
|
| 34 |
-
**Components:**
|
| 35 |
-
- **Vision encoder**: InternViT-300M or InternViT-6B
|
| 36 |
-
- **Language model**: OSS (GPT-OSS merge)
|
| 37 |
-
- **Dynamic high-resolution tiling** (improved from prior work)
|
| 38 |
-
- **Visual Resolution Router (ViR)**: optional, used by **GeoOSSVision-Flash** (50% fewer visual tokens, near-zero performance loss)
|
| 39 |
-
|
| 40 |
-
**ViR Architecture Details:**
|
| 41 |
-
In standard mode: 1024 → 256 visual tokens per patch
|
| 42 |
-
In Flash mode: additional 256 → 64 compression path
|
| 43 |
-
Patch router selects optimal compression rate based on semantic richness
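
The token counts in the ViR details can be checked with a little arithmetic. The sketch below is illustrative only and is not the model's router: it assumes each patch ends up with either 256 tokens (standard path) or 64 tokens (Flash path), and it stands in for "semantic richness" with a hypothetical score in `[0, 1]` and a hypothetical threshold. The names `route_patches`, `total_visual_tokens`, and `compressed_fraction_for_reduction` are invented for this example.

```python
STANDARD_TOKENS = 256  # tokens per patch after the 1024 -> 256 path
FLASH_TOKENS = 64      # tokens per patch after the additional 256 -> 64 path


def route_patches(richness_scores, threshold=0.5):
    """Hypothetical router: rich patches keep 256 tokens, the rest get 64."""
    return [
        STANDARD_TOKENS if score >= threshold else FLASH_TOKENS
        for score in richness_scores
    ]


def total_visual_tokens(richness_scores, threshold=0.5):
    """Total visual tokens over all patches of one image."""
    return sum(route_patches(richness_scores, threshold))


def compressed_fraction_for_reduction(reduction):
    """Fraction of patches that must take the 64-token path to cut the
    total token count by `reduction` (e.g. 0.5 for 50%).

    Average tokens per patch = 256 - (256 - 64) * f, so
    reduction = (256 - 64) * f / 256.
    """
    return reduction * STANDARD_TOKENS / (STANDARD_TOKENS - FLASH_TOKENS)


if __name__ == "__main__":
    scores = [0.9, 0.2, 0.1, 0.7, 0.3, 0.05]  # made-up richness scores
    baseline = STANDARD_TOKENS * len(scores)
    routed = total_visual_tokens(scores)
    print(f"baseline={baseline}, routed={routed}")
    print(f"fraction compressed for 50% cut: {compressed_fraction_for_reduction(0.5):.3f}")
```

Under these assumptions, a 50% token reduction corresponds to roughly two thirds of the patches taking the 64-token path; the real split depends on the image content and on how the trained router behaves.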
|
| 44 |
-
|
| 45 |
-
