mratsim committed on
Commit e60e696 · verified · 1 Parent(s): ce1ab37

Upload folder using huggingface_hub
README.md ADDED
---
base_model:
- TheDrummer/Behemoth-X-123B-v2
datasets:
- neuralmagic/calibration
- HuggingFaceH4/ultrachat_200k
- nvidia/OpenCodeInstruct
- CSJianYang/CodeArena
- nvidia/OpenScienceReasoning-2
- MegaScience/MegaScience
- Gryphe/Opus-WritingPrompts
- ServiceNow-AI/M2Lingual
- anthracite-org/stheno-filtered-v1.1
- zerofata/Roleplay-Anime-Characters
- zerofata/Instruct-Anime
- zerofata/Instruct-Anime-CreativeWriting
- sam-paech/gutenberg3-generalfiction-scifi-fantasy-romance-adventure-dpo
- nvidia/OpenMathInstruct-2
- fka/awesome-chatgpt-prompts
- databricks/databricks-dolly-15k
- FreedomIntelligence/SocraticChat
- ruggsea/stanford-encyclopedia-of-philosophy_instruct
- mlfoundations-dev/stackexchange_philosophy
- theoldmandthesea/17k_business_book
- anthracite-org/nopm_claude_writing_fixed
- PJMixers/grimulkan_physical-reasoning-ShareGPT
- PJMixers/grimulkan_theory-of-mind-ShareGPT
- HuggingFaceH4/no_robots
- nvidia/HelpSteer
- garage-bAInd/Open-Platypus
- AquaV/US-Army-Survival-Sharegpt
- AquaV/Interrogation-Sharegpt
- AquaV/Multi-Environment-Operations-Sharegpt
- AquaV/Resistance-Sharegpt
- PocketDoc/Dans-Kinomaxx-VanillaBackrooms
- PocketDoc/Dans-Prosemaxx-Adventure
pipeline_tag: text-generation
tags:
- text adventure
- roleplay
- rpg
- creative writing
- nvfp4
- vllm
- conversational
---
# Behemoth-X-123B-v2 (NVFP4 quant)

This repo contains Behemoth-X-123B-v2 quantized with NVFP4, a 4-bit compression format designed for maximum performance on NVIDIA RTX 50-series GPUs.

> ℹ️ This model is limited to the Hopper and Blackwell families of GPUs and will not work on RTX 30-series and RTX 40-series GPUs.
> On those GPUs, please use the NVFP4A16 model instead, or enable slow emulation with `export VLLM_USE_NVFP4_CT_EMULATIONS=1`.

- Original model:
  - [TheDrummer/Behemoth-X-123B-v2](https://huggingface.co/TheDrummer/Behemoth-X-123B-v2)
- Fallback model for RTX 30-series and 40-series GPUs:
  - TBD

NVFP4 writeups:
- https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/
- https://arxiv.org/pdf/2509.25149
## 📥 Usage & Running Instructions

The model was tested with vLLM on 1x RTX Pro 6000.

### Hardware

As of October 2025, this quantized model can only run on architectures with hardware FP4 support (Blackwell or later).
Cheaper 24GB GPUs (e.g. the RTX 5080 Super) that could run this model in pairs are expected in Q1 2026.

You may still run this model under emulation, albeit slowly, by setting `export VLLM_USE_NVFP4_CT_EMULATIONS=1`;
otherwise use the alternative [mratsim/Behemoth-X-123B-v2-NVFP4A16](https://huggingface.co/mratsim/Behemoth-X-123B-v2-NVFP4A16).
### Recommendations

It is recommended to use at most 65K context to avoid significant degradation (https://fiction.live/stories/Fiction-liveBench-Sept-29-2025/oQdzQvKHw8JyXbN87).

This model works best with "min-p" sampling. Min-p is available through both the older Text Completions API and the Chat Completions API (and there is a newer Responses API), but most LLM frontends only expose min-p when using Text Completions.
You can also use `--override-generation-config "${SAMPLER_JSONCONFIG}"` to override the server-side sampler defaults (which are a merge of `generation_config.json` and vLLM defaults).
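For readers unfamiliar with min-p: it discards tokens whose probability falls below `min_p` times the probability of the most likely token, then renormalizes. A minimal sketch of the filtering step on a toy distribution (illustrative only, not vLLM's implementation):

```python
def min_p_filter(probs, min_p):
    """Keep tokens with probability >= min_p * max(probs), then renormalize."""
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

# A peaked distribution: with min_p=0.1, the cutoff is 0.1 * 0.5 = 0.05,
# so "zebra" (0.04) and "qux" (0.01) are dropped before sampling.
probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.04, "qux": 0.01}
filtered = min_p_filter(probs, min_p=0.1)
```

Unlike top-p, the cutoff adapts to how confident the model is: a flat distribution keeps many candidates, a peaked one keeps few, which is why it behaves well at high temperatures for creative writing.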
### Running script

```bash
# Model configuration (Mandatory)
MODEL="mratsim/Behemoth-X-123B-v2-NVFP4"
MODELNAME="Behemoth-X-123B-v2"
CONTEXT_SIZE=65536
GPU_UTIL=0.95

# Sampling configuration (Optional, if departing from `generation_config.json`)
# Using default vLLM values
SAMPLER_OVERRIDE='{"temperature": 1, "min_p": 0, "top_p": 1, "repetition_penalty": 1}'

# Prevent vLLM from using 100% CPU when idle (Very Recommended)
export VLLM_SLEEP_WHEN_IDLE=1

# Use FlashInfer backend (fastest, recommended, "instant" context reprocessing)
export VLLM_ATTENTION_BACKEND=FLASHINFER

vllm serve "${MODEL}" \
  --served-model-name "${MODELNAME}" \
  --gpu-memory-utilization ${GPU_UTIL} \
  --max-model-len "${CONTEXT_SIZE}" \
  --override-generation-config "${SAMPLER_OVERRIDE}"
```
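Once the server is up, any OpenAI-compatible client can query it; vLLM accepts `min_p` as an extra sampling parameter in the request body. A sketch that builds a Text Completions request for the endpoint above (the `localhost:8000` URL is vLLM's default; the prompt text is just an example):

```python
import json

# Request body for vLLM's OpenAI-compatible /v1/completions endpoint.
# Per-request sampling parameters override the server-side defaults.
payload = {
    "model": "Behemoth-X-123B-v2",  # must match --served-model-name
    "prompt": "[INST] Narrate the opening scene of a text adventure.[/INST]",
    "max_tokens": 256,
    "temperature": 1.0,
    "min_p": 0.05,
}
body = json.dumps(payload)
# POST this to http://localhost:8000/v1/completions, e.g. with curl or httpx.
```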

> ℹ️ The FlashInfer backend may fail with an error similar to
> `Failed to allocate memory for batch_prefill_tmp_v with size XYZ and alignment 16 in AlignedAllocator`.
>
> A workaround is to run a sed replacement within the vLLM install to increase the workspace buffer size:
> ```bash
> sed -i 's/FLASHINFER_WORKSPACE_BUFFER_SIZE = 256 \* 1024 \* 1024/FLASHINFER_WORKSPACE_BUFFER_SIZE = 512 \* 1024 \* 1024/g' vllm/v1/attention/backends/flashinfer.py
> ```
> This will be fixed by PR https://github.com/vllm-project/vllm/pull/25344
119
+ ## 🔬 Quantization method
120
+
121
+ The llmcompressor library was used with the following recipe:
122
+
123
+ ```yaml
124
+ default_stage:
125
+ default_modifiers:
126
+ QuantizationModifier:
127
+ targets: [Linear]
128
+ ignore: [lm_head]
129
+ scheme: NVFP4
130
+ ```
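The NVFP4 scheme stores weights as 4-bit floats (E2M1) in groups of 16 values, each group carrying its own scale (the `group_size: 16` / `tensor_group` entries in the resulting config). A toy sketch of group-wise E2M1 quantization, assuming a simple absolute-max scale per group (real NVFP4 additionally uses FP8 group scales and a per-tensor scale):

```python
# The magnitudes representable in E2M1 (FP4), used with a sign bit.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_group(xs):
    """Quantize one group of values to the E2M1 grid with an absmax scale."""
    scale = max(abs(x) for x in xs) / 6.0 or 1.0  # map the largest value to 6.0
    out = []
    for x in xs:
        # Round the scaled magnitude to the nearest representable FP4 value.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append((mag if x >= 0 else -mag) * scale)
    return out, scale

group = [0.01 * i for i in range(16)]  # 16 weights: 0.00 .. 0.15
dq, scale = quantize_group(group)
max_err = max(abs(a - b) for a, b in zip(group, dq))
# Every dequantized value lies on the scaled E2M1 grid; the rounding error
# is bounded by half the largest grid step (1.0) times the group scale.
```

Because each group of 16 gets its own scale, one outlier weight only degrades the precision of its own group, which is why NVFP4 needs so little calibration data compared to coarser-grained schemes.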

The model was calibrated with 3 samples from each of the following datasets (96 samples total) at a sequence length of 8192:
- [neuralmagic/calibration](https://huggingface.co/datasets/neuralmagic/calibration)
- [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k)
- [nvidia/OpenCodeInstruct](https://huggingface.co/datasets/nvidia/OpenCodeInstruct)
- [CSJianYang/CodeArena](https://huggingface.co/datasets/CSJianYang/CodeArena)
- [nvidia/OpenScienceReasoning-2](https://huggingface.co/datasets/nvidia/OpenScienceReasoning-2)
- [MegaScience/MegaScience](https://huggingface.co/datasets/MegaScience/MegaScience)
- [Gryphe/Opus-WritingPrompts](https://huggingface.co/datasets/Gryphe/Opus-WritingPrompts)
- [ServiceNow-AI/M2Lingual](https://huggingface.co/datasets/ServiceNow-AI/M2Lingual)
- [anthracite-org/stheno-filtered-v1.1](https://huggingface.co/datasets/anthracite-org/stheno-filtered-v1.1)
- [zerofata/Roleplay-Anime-Characters](https://huggingface.co/datasets/zerofata/Roleplay-Anime-Characters)
- [zerofata/Instruct-Anime](https://huggingface.co/datasets/zerofata/Instruct-Anime)
- [zerofata/Instruct-Anime-CreativeWriting](https://huggingface.co/datasets/zerofata/Instruct-Anime-CreativeWriting)
- [sam-paech/gutenberg3-generalfiction-scifi-fantasy-romance-adventure-dpo](https://huggingface.co/datasets/sam-paech/gutenberg3-generalfiction-scifi-fantasy-romance-adventure-dpo)
- [nvidia/OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2)
- [fka/awesome-chatgpt-prompts](https://huggingface.co/datasets/fka/awesome-chatgpt-prompts)
- [databricks/databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k)
- [FreedomIntelligence/SocraticChat](https://huggingface.co/datasets/FreedomIntelligence/SocraticChat)
- [ruggsea/stanford-encyclopedia-of-philosophy_instruct](https://huggingface.co/datasets/ruggsea/stanford-encyclopedia-of-philosophy_instruct)
- [mlfoundations-dev/stackexchange_philosophy](https://huggingface.co/datasets/mlfoundations-dev/stackexchange_philosophy)
- [theoldmandthesea/17k_business_book](https://huggingface.co/datasets/theoldmandthesea/17k_business_book)
- [anthracite-org/nopm_claude_writing_fixed](https://huggingface.co/datasets/anthracite-org/nopm_claude_writing_fixed)
- [PJMixers/grimulkan_physical-reasoning-ShareGPT](https://huggingface.co/datasets/PJMixers/grimulkan_physical-reasoning-ShareGPT)
- [PJMixers/grimulkan_theory-of-mind-ShareGPT](https://huggingface.co/datasets/PJMixers/grimulkan_theory-of-mind-ShareGPT)
- [HuggingFaceH4/no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots)
- [nvidia/HelpSteer](https://huggingface.co/datasets/nvidia/HelpSteer)
- [garage-bAInd/Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus)
- [AquaV/US-Army-Survival-Sharegpt](https://huggingface.co/datasets/AquaV/US-Army-Survival-Sharegpt)
- [AquaV/Interrogation-Sharegpt](https://huggingface.co/datasets/AquaV/Interrogation-Sharegpt)
- [AquaV/Multi-Environment-Operations-Sharegpt](https://huggingface.co/datasets/AquaV/Multi-Environment-Operations-Sharegpt)
- [AquaV/Resistance-Sharegpt](https://huggingface.co/datasets/AquaV/Resistance-Sharegpt)
- [PocketDoc/Dans-Kinomaxx-VanillaBackrooms](https://huggingface.co/datasets/PocketDoc/Dans-Kinomaxx-VanillaBackrooms)
- [PocketDoc/Dans-Prosemaxx-Adventure](https://huggingface.co/datasets/PocketDoc/Dans-Prosemaxx-Adventure)

NVFP4 quantization requires very few samples; llmcompressor uses 20 in its examples.
By comparison, 512 samples are recommended for GPTQ and 64 for AWQ (https://minjiazhang.github.io/courses/fall24-resource/slides/awq.pdf).
chat_template.jinja ADDED

```jinja
{{ bos_token }}{% for message in messages %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + '[/INST]' }}{% elif message['role'] == 'system' %}{{ '[SYSTEM_PROMPT] ' + message['content'] + '[/SYSTEM_PROMPT]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + message['content'] + eos_token }}{% else %}{{ raise_exception('Only user, system and assistant roles are supported!') }}{% endif %}{% endfor %}
```
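The template above produces Mistral-style `[INST]` formatting. A plain-Python sketch of what it renders for a short conversation (reimplementing the template's logic rather than invoking Jinja; `<s>`/`</s>` are this model's BOS/EOS tokens per `special_tokens_map.json`):

```python
BOS, EOS = "<s>", "</s>"

def render(messages):
    """Mirror chat_template.jinja: [SYSTEM_PROMPT]/[INST] wrapping, EOS after assistant turns."""
    out = BOS
    for m in messages:
        if m["role"] == "user":
            out += "[INST] " + m["content"] + "[/INST]"
        elif m["role"] == "system":
            out += "[SYSTEM_PROMPT] " + m["content"] + "[/SYSTEM_PROMPT]"
        elif m["role"] == "assistant":
            out += " " + m["content"] + EOS
        else:
            raise ValueError("Only user, system and assistant roles are supported!")
    return out

prompt = render([
    {"role": "system", "content": "You are a narrator."},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there."},
])
# → "<s>[SYSTEM_PROMPT] You are a narrator.[/SYSTEM_PROMPT][INST] Hello[/INST] Hi there.</s>"
```

Note that only user, system, and assistant roles are accepted, and the generation prompt for a new assistant turn is simply everything after the final `[/INST]`.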
config.json ADDED

```json
{
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "dtype": "bfloat16",
  "eos_token_id": 2,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 12288,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 131072,
  "model_type": "mistral",
  "num_attention_heads": 96,
  "num_hidden_layers": 88,
  "num_key_value_heads": 8,
  "quantization_config": {
    "config_groups": {
      "group_0": {
        "format": "nvfp4-pack-quantized",
        "input_activations": {
          "actorder": null,
          "block_structure": null,
          "dynamic": "local",
          "group_size": 16,
          "num_bits": 4,
          "observer": "minmax",
          "observer_kwargs": {},
          "strategy": "tensor_group",
          "symmetric": true,
          "type": "float"
        },
        "output_activations": null,
        "targets": [
          "Linear"
        ],
        "weights": {
          "actorder": null,
          "block_structure": null,
          "dynamic": false,
          "group_size": 16,
          "num_bits": 4,
          "observer": "minmax",
          "observer_kwargs": {},
          "strategy": "tensor_group",
          "symmetric": true,
          "type": "float"
        }
      }
    },
    "format": "nvfp4-pack-quantized",
    "global_compression_ratio": null,
    "ignore": [
      "lm_head"
    ],
    "kv_cache_scheme": null,
    "quant_method": "compressed-tensors",
    "quantization_status": "compressed",
    "sparsity_config": {},
    "transform_config": {},
    "version": "0.12.2"
  },
  "rms_norm_eps": 1e-05,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "transformers_version": "4.56.2",
  "use_cache": true,
  "vocab_size": 32768
}
```
generation_config.json ADDED

```json
{
  "_from_model_config": true,
  "bos_token_id": 1,
  "do_sample": true,
  "eos_token_id": 2,
  "transformers_version": "4.56.2"
}
```
model-00001-of-00015.safetensors … model-00015-of-00015.safetensors ADDED

The 15 model shards were added as Git LFS pointers (`version https://git-lfs.github.com/spec/v1`):

| File | SHA-256 | Size (bytes) |
|---|---|---|
| model-00001-of-00015.safetensors | 77ffe9e33e22caf9e88013e6ad0880a14f59fb919e71fb4a183ae00b0f159e0f | 4882434912 |
| model-00002-of-00015.safetensors | 59d8745131f5ac223f2fb0cdd4215e41cbb412cacaa151c967f77356be63c716 | 4869903000 |
| model-00003-of-00015.safetensors | 8beb485aa1bcbd4d835209c400e87c4bc2a4febecf7dafe18c30018192c6fb1d | 4869903136 |
| model-00004-of-00015.safetensors | 0e6b8753b198d246928a1931fc6b5aa4c2f506d35bef17987479d3e782a0ab47 | 4969044352 |
| model-00005-of-00015.safetensors | 9a5bc1b84c4da8ad7aa8866a8d65ee4839dcc1c9068316998794d2323c8a3080 | 4954838264 |
| model-00006-of-00015.safetensors | 6a1b18838bfe67c377ba85196992eeeb9ef802e3510f9430aa4281162ad0a968 | 4869903136 |
| model-00007-of-00015.safetensors | a12355e837e0ca3916a8ebcf61419b2892c489a610dd7397aab40f3405816892 | 4969044352 |
| model-00008-of-00015.safetensors | 768685d30ab4ed8432bc6c55c131af3adf654d000d84bf6318f64c4856bf067d | 4954838264 |
| model-00009-of-00015.safetensors | 18909c1c2d74c76e56fc07f79bb3384df1e282fed27296b5d7df68514e3d8f1c | 4869903136 |
| model-00010-of-00015.safetensors | c6dd72e56067c65fb84aa63fce4ac83f7dd05a175cb00ba929474941c54b685c | 4969044352 |
| model-00011-of-00015.safetensors | 4dbb727a07bb719b4e7b08f7917d888b612ec21ef54a19f1bedb78cc833a0ddd | 4954838264 |
| model-00012-of-00015.safetensors | 58605fa82f1a658acf0352d1c93671f33398c0e56eca504749ee2e9c4f0908b3 | 4869903136 |
| model-00013-of-00015.safetensors | de39d447715891861389a27f90fa80503d91a9973fd63cd6beab4f4b3e4e9607 | 4969044352 |
| model-00014-of-00015.safetensors | 9f417deed8ca90ad4d60a3d50abef96c031d7658457477566c879aa5ee65397a | 4954838264 |
| model-00015-of-00015.safetensors | f14589b0368fb74769f1ca151d4a5105a8d3a48c4b33b5e8c7aa8ae99115d102 | 1201743176 |
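Summing the shard sizes gives the on-disk footprint of the quantized checkpoint, roughly 70 GB, comfortably within a single RTX Pro 6000 (96 GB) with room left for the KV cache. A quick check of that arithmetic:

```python
# Shard sizes in bytes, as listed in the LFS pointers above.
shard_sizes = [
    4882434912, 4869903000, 4869903136, 4969044352, 4954838264,
    4869903136, 4969044352, 4954838264, 4869903136, 4969044352,
    4954838264, 4869903136, 4969044352, 4954838264, 1201743176,
]
total = sum(shard_sizes)
print(total, round(total / 1e9, 2))  # ≈ 70.13 GB on disk
```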
model.safetensors.index.json ADDED
The diff for this file is too large to render.
recipe.yaml ADDED

```yaml
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: NVFP4
```
special_tokens_map.json ADDED

```json
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
```
tokenizer.json ADDED
The diff for this file is too large to render.
tokenizer.model ADDED

```
version https://git-lfs.github.com/spec/v1
oid sha256:1b968b8dc352f42192367337c78ccc61e1eaddc6d641a579372d4f20694beb7a
size 587562
```
tokenizer_config.json ADDED
The diff for this file is too large to render.