This model was converted to AWQ format from sensenova/SenseNova-SI-InternVL3-8B with lmdeploy, using the following commands:
```bash
conda create -n lmdeploy python=3.10 -y
conda activate lmdeploy
pip install lmdeploy "transformers>=4.53.0,<4.54" timm datasets
lmdeploy lite auto_awq sensenova/SenseNova-SI-InternVL3-8B --work-dir SenseNova-SI-InternVL3-8B-AWQ --dtype bfloat16
```
Then edit config.json so that the quantization_config block sits at the outermost (top) level; a small script for this edit is sketched after the config. The resulting config.json looks like this:
```json
{
  "architectures": [
    "InternVLChatModel"
  ],
  "auto_map": {
    "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
    "AutoModel": "modeling_internvl_chat.InternVLChatModel",
    "AutoModelForCausalLM": "sensenova/SenseNova-SI-InternVL3-8B--modeling_internvl_chat.InternVLChatModel"
  },
  "downsample_ratio": 0.5,
  "dynamic_image_size": true,
  "force_image_size": 448,
  "hidden_size": 3584,
  "image_fold": null,
  "llm_config": {
    "_name_or_path": "./pretrained/Qwen2.5-32B-Instruct",
    "architectures": [
      "Qwen2ForCausalLM"
    ],
    "attention_dropout": 0.0,
    "attn_implementation": "eager",
    "bos_token_id": 151643,
    "eos_token_id": 151643,
    "hidden_act": "silu",
    "hidden_size": 3584,
    "initializer_range": 0.02,
    "intermediate_size": 18944,
    "layer_types": [
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention"
    ],
    "max_position_embeddings": 32768,
    "max_window_layers": 70,
    "model_type": "qwen2",
    "moe_config": null,
    "num_attention_heads": 28,
    "num_hidden_layers": 28,
    "num_key_value_heads": 4,
    "rms_norm_eps": 1e-06,
    "rope_scaling": {
      "factor": 2.0,
      "rope_type": "dynamic",
      "type": "dynamic"
    },
    "rope_theta": 1000000.0,
    "sliding_window": null,
    "torch_dtype": "bfloat16",
    "use_bfloat16": true,
    "use_cache": false,
    "use_sliding_window": false,
    "vocab_size": 151674
  },
  "quantization_config": {
    "bits": 4,
    "group_size": 128,
    "quant_method": "awq",
    "version": "gemm",
    "zero_point": true,
    "modules_to_not_convert": ["vision_model", "mlp1"]
  },
  "max_dynamic_patch": 12,
  "min_dynamic_patch": 1,
  "model_type": "internvl_chat",
  "output_attentions": false,
  "pad2square": false,
  "ps_version": "v2",
  "select_layer": -1,
  "system_message": null,
  "template": "internvl2_5",
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": null,
  "use_backbone_lora": 0,
  "use_llm_lora": 0,
  "use_thumbnail": true,
  "vision_config": {
    "_name_or_path": "OpenGVLab/InternViT-6B-448px-V1-5",
    "architectures": [
      "InternVisionModel"
    ],
    "attention_dropout": 0.0,
    "auto_map": {
      "AutoConfig": "configuration_intern_vit.InternVisionConfig",
      "AutoModel": "modeling_intern_vit.InternVisionModel"
    },
    "capacity_factor": 1.2,
    "drop_path_rate": 0.0,
    "dropout": 0.0,
    "eval_capacity_factor": 1.4,
    "hidden_act": "gelu",
    "hidden_size": 1024,
    "image_size": 448,
    "initializer_factor": 0.1,
    "initializer_range": 1e-10,
    "intermediate_size": 4096,
    "laux_allreduce": "all_nodes",
    "layer_norm_eps": 1e-06,
    "model_type": "intern_vit_6b",
    "moe_coeff_ratio": 0.5,
    "moe_intermediate_size": 768,
    "moe_output_scale": 4.0,
    "noisy_gate_policy": "RSample_before",
    "norm_type": "layer_norm",
    "num_attention_heads": 16,
    "num_channels": 3,
    "num_experts": 8,
    "num_hidden_layers": 24,
    "num_routed_experts": 4,
    "num_shared_experts": 4,
    "patch_size": 14,
    "qk_normalization": false,
    "qkv_bias": true,
    "shared_expert_intermediate_size": 3072,
    "torch_dtype": "bfloat16",
    "use_bfloat16": true,
    "use_flash_attn": false,
    "use_moe": false,
    "use_residual": true,
    "use_rts": false,
    "use_weighted_residual": false
  }
}
```
SenseNova-SI: Scaling Spatial Intelligence with Multimodal Foundation Models
[EASI Codebase] [EASI Leaderboard]
Overview
Despite remarkable progress, leading multimodal models still exhibit notable deficiencies in spatial intelligence: the ability to make metric estimations, understand spatial relationships, handle viewpoint changes, and integrate information across complex scenes. We take a scaling perspective: we construct and curate a large-scale, comprehensive collection of spatial intelligence data and, through continued training on powerful multimodal foundations, cultivate multi-faceted spatial understanding in the SenseNova-SI family of models. In the future, SenseNova-SI will be integrated with larger-scale in-house models.
Release Information
Currently, we build SenseNova-SI upon popular open-source foundation models to maximize compatibility with existing research pipelines. In this release, we present SenseNova-SI-InternVL3-2B and SenseNova-SI-InternVL3-8B, which achieve state-of-the-art performance among open-source models of comparable size across four recent spatial intelligence benchmarks: VSI, MMSI, MindCube, and ViewSpatial.
| Model | VSI | MMSI | MindCube-Tiny | ViewSpatial |
|---|---|---|---|---|
| Open-source Models (~2B) | | | | |
| InternVL3-2B | 32.98 | 26.50 | 37.50 | 32.56 |
| Qwen3-VL-2B-Instruct | 50.36 | 28.90 | 34.52 | 36.97 |
| MindCube-3B-RawQA-SFT | 17.24 | 1.70 | 51.73 | 24.14 |
| MindCube-3B-Aug-CGMap-FFR-Out-SFT | 29.60 | 29.10 | 41.06 | 30.90 |
| MindCube-3B-Plain-CGMap-FFR-Out-SFT | 29.93 | 30.40 | 39.90 | 31.20 |
| SpatialLadder-3B | 44.86 | 27.40 | 43.46 | 39.85 |
| SpatialMLLM-4B | 45.98 | 26.10 | 33.46 | 34.66 |
| SenseNova-SI-InternVL3-2B | 58.47 | 35.50 | 71.35 | 40.62 |
| Open-source Models (~8B) | | | | |
| InternVL3-8B | 42.14 | 28.00 | 41.54 | 38.66 |
| Qwen3-VL-8B-Instruct | 57.90 | 31.10 | 29.42 | 42.20 |
| BAGEL-7B | 30.90 | 33.10 | 34.71 | 41.32 |
| SpaceR-7B | 36.29 | 27.40 | 37.98 | 35.85 |
| ViLaSR-7B | 44.63 | 30.20 | 35.10 | 35.71 |
| SenseNova-SI-InternVL3-8B | 62.80 | 37.90 | 89.33 | 53.92 |
| Proprietary Models | | | | |
| Gemini-2.5-pro-2025-06 | 53.57 | 38.00 | 57.60 | 46.06 |
| Grok-4-2025-07-09 | 47.92 | 37.80 | 63.56 | 43.23 |
| GPT-5-2025-08-07 | 55.03 | 41.80 | 56.30 | 45.59 |
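
For completeness, below is a minimal inference sketch for the quantized model using lmdeploy's TurboMind backend. The local model path, example image URL, prompt, and generation settings are illustrative assumptions, not part of the original card.

```python
from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
from lmdeploy.vl import load_image

# Assumption: the AWQ weights live in the work directory created above.
pipe = pipeline(
    "SenseNova-SI-InternVL3-8B-AWQ",
    backend_config=TurbomindEngineConfig(model_format="awq", session_len=8192),
)

# Placeholder image; replace with your own file path or URL.
image = load_image("https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg")

response = pipe(
    ("Describe the spatial layout of the scene.", image),
    gen_config=GenerationConfig(max_new_tokens=256),
)
print(response.text)
```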