anjaneyasharma committed on
Commit
2d14ac8
·
verified ·
1 Parent(s): eb88986

Update README.md

  1. README.md +0 -26
README.md CHANGED
@@ -17,29 +17,3 @@ tags:
17
 
18
  # GeoOSSVision
19
 
20
- ## Introduction
21
-
22
- **GeoOSSVision** is a state-of-the-art open-source multimodal model that advances versatility, reasoning capability, and inference efficiency through key innovations:
23
-
24
- - **Cascade Reinforcement Learning (Cascade RL)** – two-stage RL framework (offline RL → online RL) for superior reasoning
25
- - **Visual Resolution Router (ViR)** – dynamically adjusts visual token resolution per image patch
26
- - **Decoupled Vision-Language Deployment (DvD)** – separates vision encoder and LLM for **4.05× inference speedup**
27
-
28
- Achieves SOTA results among open-source multimodal models across general understanding, reasoning, OCR, agentic tasks, GUI interaction, and embodied intelligence.
29
-
30
- ## Model Architecture
31
-
32
- GeoOSSVision follows the "ViT → MLP → LLM" paradigm:
33
-
34
- **Components:**
35
- - **Vision encoder**: InternViT-300M or InternViT-6B
36
- - **Language model**: OSS (GPT-OSS merge)
37
- - **Dynamic high-resolution tiling** (improved from prior work)
38
- - **Visual Resolution Router (ViR)** – optional for **GeoOSSVision-Flash** (50% fewer visual tokens, near-zero performance loss)
39
-
40
- **ViR Architecture Details:**
41
- In standard mode: 1024 → 256 visual tokens per patch
42
- In Flash mode: Additional 256 → 64 compression path
43
- Patch router selects optimal compression rate based on semantic richness
44
-
45
- ![Architecture Diagram](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/architecture.jpg)
 
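Note on the removed "ViR Architecture Details" lines: the token arithmetic they describe can be checked with a short sketch. It assumes the per-patch counts quoted in the README (256 tokens on the standard path, 64 on the Flash path) are exact. The function names are hypothetical, and the router itself is not modelled; its decisions are passed in as plain booleans.

```python
# Per-patch token counts quoted in the removed README text.
TOKENS_STANDARD = 256  # after the 1024 -> 256 compression
TOKENS_FLASH = 64      # after the additional 256 -> 64 compression


def visual_token_count(route_to_flash):
    """Total visual tokens for per-patch routing decisions.

    route_to_flash: iterable of bools, True when a (hypothetical) router
    sends that patch down the 64-token path.
    """
    return sum(TOKENS_FLASH if flash else TOKENS_STANDARD
               for flash in route_to_flash)


def flash_fraction_for_reduction(reduction):
    """Fraction of patches that must take the 64-token path to cut the
    total token count by `reduction`, relative to routing every patch
    down the 256-token path."""
    max_reduction = 1 - TOKENS_FLASH / TOKENS_STANDARD  # 0.75
    if not 0 <= reduction <= max_reduction:
        raise ValueError("reduction must lie between 0 and 0.75")
    return reduction / max_reduction


# 12 patches, 8 of them routed to the Flash path.
decisions = [True] * 8 + [False] * 4
baseline = 12 * TOKENS_STANDARD
print(visual_token_count(decisions), baseline)
print(flash_fraction_for_reduction(0.5))
```

Under these assumptions, the "50% fewer visual tokens" figure corresponds to roughly two thirds of patches taking the 64-token path. The README does not state the actual routing ratio, so this is only a consistency check on the quoted numbers.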