mratsim committed on
Commit e60e696 · verified · 1 Parent(s): ce1ab37

Upload folder using huggingface_hub
README.md ADDED
---
base_model:
- TheDrummer/Behemoth-X-123B-v2
datasets:
- neuralmagic/calibration
- HuggingFaceH4/ultrachat_200k
- nvidia/OpenCodeInstruct
- CSJianYang/CodeArena
- nvidia/OpenScienceReasoning-2
- MegaScience/MegaScience
- Gryphe/Opus-WritingPrompts
- ServiceNow-AI/M2Lingual
- anthracite-org/stheno-filtered-v1.1
- zerofata/Roleplay-Anime-Characters
- zerofata/Instruct-Anime
- zerofata/Instruct-Anime-CreativeWriting
- sam-paech/gutenberg3-generalfiction-scifi-fantasy-romance-adventure-dpo
- nvidia/OpenMathInstruct-2
- fka/awesome-chatgpt-prompts
- databricks/databricks-dolly-15k
- FreedomIntelligence/SocraticChat
- ruggsea/stanford-encyclopedia-of-philosophy_instruct
- mlfoundations-dev/stackexchange_philosophy
- theoldmandthesea/17k_business_book
- anthracite-org/nopm_claude_writing_fixed
- PJMixers/grimulkan_physical-reasoning-ShareGPT
- PJMixers/grimulkan_theory-of-mind-ShareGPT
- HuggingFaceH4/no_robots
- nvidia/HelpSteer
- garage-bAInd/Open-Platypus
- AquaV/US-Army-Survival-Sharegpt
- AquaV/Interrogation-Sharegpt
- AquaV/Multi-Environment-Operations-Sharegpt
- AquaV/Resistance-Sharegpt
- PocketDoc/Dans-Kinomaxx-VanillaBackrooms
- PocketDoc/Dans-Prosemaxx-Adventure
pipeline_tag: text-generation
tags:
- text adventure
- roleplay
- rpg
- creative writing
- nvfp4
- vllm
- conversational
---
# Behemoth-X-123B-v2 (NVFP4 quant)

This repo contains Behemoth-X-123B-v2 quantized with NVFP4, a 4-bit compression format designed for maximum performance on NVIDIA RTX 50-series GPUs.

> ℹ️ This model is limited to the Hopper and Blackwell families of GPUs and will not work on RTX 30-series and RTX 40-series GPUs.
> On those GPUs, please use the NVFP4A16 model instead, or enable slow emulation with `export VLLM_USE_NVFP4_CT_EMULATIONS=1`.

- Original model:
  - [TheDrummer/Behemoth-X-123B-v2](https://huggingface.co/TheDrummer/Behemoth-X-123B-v2)
- Fallback model for RTX 30-series and 40-series GPUs:
  - TBD

NVFP4 writeups:
- https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/
- https://arxiv.org/pdf/2509.25149
## 📥 Usage & Running Instructions

The model was tested with vLLM on 1x RTX Pro 6000.

### Hardware

As of October 2025, this quantized model can only run on architectures with hardware FP4 support (Blackwell or later).
Cheaper 24GB GPUs (e.g. the RTX 5080 Super) that could run this model in pairs are expected in Q1 2026.

You may still run this model under emulation, albeit slowly, by setting `export VLLM_USE_NVFP4_CT_EMULATIONS=1`;
otherwise use the alternative [mratsim/Behemoth-X-123B-v2-NVFP4A16](https://huggingface.co/mratsim/Behemoth-X-123B-v2-NVFP4A16).
### Recommendations

It is recommended to use at most 65K context to avoid significant degradation (https://fiction.live/stories/Fiction-liveBench-Sept-29-2025/oQdzQvKHw8JyXbN87).

This model works best with "min-p" sampling. Min-p is available through both the older Text Completions API and the Chat Completions API (and there is a newer Responses API), but most LLM frontends only expose min-p when using Text Completions.
You can also use `--override-generation-config "${SAMPLER_JSONCONFIG}"` to override the server-side sampler defaults (which are a merge of `generation_config.json` and vLLM defaults).
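For readers unfamiliar with min-p: it discards tokens whose probability falls below `min_p` times the probability of the most likely token, then renormalizes. A minimal sketch of the filtering step on a toy distribution (illustrative only, not vLLM's implementation):

```python
def min_p_filter(probs, min_p):
    """Keep tokens with probability >= min_p * max(probs), then renormalize."""
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

# A peaked distribution: with min_p=0.1, the cutoff is 0.1 * 0.5 = 0.05,
# so "zebra" (0.04) and "qux" (0.01) are dropped before sampling.
probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.04, "qux": 0.01}
filtered = min_p_filter(probs, min_p=0.1)
```

Unlike top-p, the cutoff adapts to how confident the model is: a flat distribution keeps many candidates, a peaked one keeps few, which is why it behaves well at high temperatures for creative writing.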
### Running script

```bash
# Model configuration (Mandatory)
MODEL="mratsim/Behemoth-X-123B-v2-NVFP4"
MODELNAME="Behemoth-X-123B-v2"
CONTEXT_SIZE=65536
GPU_UTIL=0.95

# Sampling configuration (Optional, if departing from `generation_config.json`)
# Using default vLLM values
SAMPLER_OVERRIDE='{"temperature": 1, "min_p": 0, "top_p": 1, "repetition_penalty": 1}'

# Prevent vLLM from using 100% CPU when idle (Very Recommended)
export VLLM_SLEEP_WHEN_IDLE=1

# Use FlashInfer backend (fastest, recommended, "instant" context reprocessing)
export VLLM_ATTENTION_BACKEND=FLASHINFER

vllm serve "${MODEL}" \
  --served-model-name "${MODELNAME}" \
  --gpu-memory-utilization ${GPU_UTIL} \
  --max-model-len "${CONTEXT_SIZE}" \
  --override-generation-config "${SAMPLER_OVERRIDE}"
```
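Once the server is up, any OpenAI-compatible client can query it; vLLM accepts `min_p` as an extra sampling parameter in the request body. A sketch that builds a Text Completions request for the endpoint above (the `localhost:8000` URL is vLLM's default; the prompt text is just an example):

```python
import json

# Request body for vLLM's OpenAI-compatible /v1/completions endpoint.
# Per-request sampling parameters override the server-side defaults.
payload = {
    "model": "Behemoth-X-123B-v2",  # must match --served-model-name
    "prompt": "[INST] Narrate the opening scene of a text adventure.[/INST]",
    "max_tokens": 256,
    "temperature": 1.0,
    "min_p": 0.05,
}
body = json.dumps(payload)
# POST this to http://localhost:8000/v1/completions, e.g. with curl or httpx.
```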

> ℹ️ The FlashInfer backend may fail with an error similar to
> `Failed to allocate memory for batch_prefill_tmp_v with size XYZ and alignment 16 in AlignedAllocator`.
>
> A workaround is to run a sed replacement within the vLLM install to increase the workspace buffer size:
> ```bash
> sed -i 's/FLASHINFER_WORKSPACE_BUFFER_SIZE = 256 \* 1024 \* 1024/FLASHINFER_WORKSPACE_BUFFER_SIZE = 512 \* 1024 \* 1024/g' vllm/v1/attention/backends/flashinfer.py
> ```
> This will be fixed by PR https://github.com/vllm-project/vllm/pull/25344
119
+ ## 🔬 Quantization method
120
+
121
+ The llmcompressor library was used with the following recipe:
122
+
123
+ ```yaml
124
+ default_stage:
125
+ default_modifiers:
126
+ QuantizationModifier:
127
+ targets: [Linear]
128
+ ignore: [lm_head]
129
+ scheme: NVFP4
130
+ ```
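The NVFP4 scheme stores weights as 4-bit floats (E2M1) in groups of 16 values, each group carrying its own scale (the `group_size: 16` / `tensor_group` entries in the resulting config). A toy sketch of group-wise E2M1 quantization, assuming a simple absolute-max scale per group (real NVFP4 additionally uses FP8 group scales and a per-tensor scale):

```python
# The magnitudes representable in E2M1 (FP4), used with a sign bit.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_group(xs):
    """Quantize one group of values to the E2M1 grid with an absmax scale."""
    scale = max(abs(x) for x in xs) / 6.0 or 1.0  # map the largest value to 6.0
    out = []
    for x in xs:
        # Round the scaled magnitude to the nearest representable FP4 value.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append((mag if x >= 0 else -mag) * scale)
    return out, scale

group = [0.01 * i for i in range(16)]  # 16 weights: 0.00 .. 0.15
dq, scale = quantize_group(group)
max_err = max(abs(a - b) for a, b in zip(group, dq))
# Every dequantized value lies on the scaled E2M1 grid; the rounding error
# is bounded by half the largest grid step (1.0) times the group scale.
```

Because each group of 16 gets its own scale, one outlier weight only degrades the precision of its own group, which is why NVFP4 needs so little calibration data compared to coarser-grained schemes.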

The model was calibrated with 3 samples from each of the following datasets (96 samples total) at a sequence length of 8192:
- [neuralmagic/calibration](https://huggingface.co/datasets/neuralmagic/calibration)
- [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k)
- [nvidia/OpenCodeInstruct](https://huggingface.co/datasets/nvidia/OpenCodeInstruct)
- [CSJianYang/CodeArena](https://huggingface.co/datasets/CSJianYang/CodeArena)
- [nvidia/OpenScienceReasoning-2](https://huggingface.co/datasets/nvidia/OpenScienceReasoning-2)
- [MegaScience/MegaScience](https://huggingface.co/datasets/MegaScience/MegaScience)
- [Gryphe/Opus-WritingPrompts](https://huggingface.co/datasets/Gryphe/Opus-WritingPrompts)
- [ServiceNow-AI/M2Lingual](https://huggingface.co/datasets/ServiceNow-AI/M2Lingual)
- [anthracite-org/stheno-filtered-v1.1](https://huggingface.co/datasets/anthracite-org/stheno-filtered-v1.1)
- [zerofata/Roleplay-Anime-Characters](https://huggingface.co/datasets/zerofata/Roleplay-Anime-Characters)
- [zerofata/Instruct-Anime](https://huggingface.co/datasets/zerofata/Instruct-Anime)
- [zerofata/Instruct-Anime-CreativeWriting](https://huggingface.co/datasets/zerofata/Instruct-Anime-CreativeWriting)
- [sam-paech/gutenberg3-generalfiction-scifi-fantasy-romance-adventure-dpo](https://huggingface.co/datasets/sam-paech/gutenberg3-generalfiction-scifi-fantasy-romance-adventure-dpo)
- [nvidia/OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2)
- [fka/awesome-chatgpt-prompts](https://huggingface.co/datasets/fka/awesome-chatgpt-prompts)
- [databricks/databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k)
- [FreedomIntelligence/SocraticChat](https://huggingface.co/datasets/FreedomIntelligence/SocraticChat)
- [ruggsea/stanford-encyclopedia-of-philosophy_instruct](https://huggingface.co/datasets/ruggsea/stanford-encyclopedia-of-philosophy_instruct)
- [mlfoundations-dev/stackexchange_philosophy](https://huggingface.co/datasets/mlfoundations-dev/stackexchange_philosophy)
- [theoldmandthesea/17k_business_book](https://huggingface.co/datasets/theoldmandthesea/17k_business_book)
- [anthracite-org/nopm_claude_writing_fixed](https://huggingface.co/datasets/anthracite-org/nopm_claude_writing_fixed)
- [PJMixers/grimulkan_physical-reasoning-ShareGPT](https://huggingface.co/datasets/PJMixers/grimulkan_physical-reasoning-ShareGPT)
- [PJMixers/grimulkan_theory-of-mind-ShareGPT](https://huggingface.co/datasets/PJMixers/grimulkan_theory-of-mind-ShareGPT)
- [HuggingFaceH4/no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots)
- [nvidia/HelpSteer](https://huggingface.co/datasets/nvidia/HelpSteer)
- [garage-bAInd/Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus)
- [AquaV/US-Army-Survival-Sharegpt](https://huggingface.co/datasets/AquaV/US-Army-Survival-Sharegpt)
- [AquaV/Interrogation-Sharegpt](https://huggingface.co/datasets/AquaV/Interrogation-Sharegpt)
- [AquaV/Multi-Environment-Operations-Sharegpt](https://huggingface.co/datasets/AquaV/Multi-Environment-Operations-Sharegpt)
- [AquaV/Resistance-Sharegpt](https://huggingface.co/datasets/AquaV/Resistance-Sharegpt)
- [PocketDoc/Dans-Kinomaxx-VanillaBackrooms](https://huggingface.co/datasets/PocketDoc/Dans-Kinomaxx-VanillaBackrooms)
- [PocketDoc/Dans-Prosemaxx-Adventure](https://huggingface.co/datasets/PocketDoc/Dans-Prosemaxx-Adventure)

NVFP4 quantization requires very few samples; llmcompressor uses 20 in its examples.
By comparison, 512 samples are recommended for GPTQ and 64 for AWQ (https://minjiazhang.github.io/courses/fall24-resource/slides/awq.pdf).
chat_template.jinja ADDED

```jinja
{{ bos_token }}{% for message in messages %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + '[/INST]' }}{% elif message['role'] == 'system' %}{{ '[SYSTEM_PROMPT] ' + message['content'] + '[/SYSTEM_PROMPT]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + message['content'] + eos_token }}{% else %}{{ raise_exception('Only user, system and assistant roles are supported!') }}{% endif %}{% endfor %}
```
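The template above produces Mistral-style `[INST]` formatting. A plain-Python sketch of what it renders for a short conversation (reimplementing the template's logic rather than invoking Jinja; `<s>`/`</s>` are this model's BOS/EOS tokens per `special_tokens_map.json`):

```python
BOS, EOS = "<s>", "</s>"

def render(messages):
    """Mirror chat_template.jinja: [SYSTEM_PROMPT]/[INST] wrapping, EOS after assistant turns."""
    out = BOS
    for m in messages:
        if m["role"] == "user":
            out += "[INST] " + m["content"] + "[/INST]"
        elif m["role"] == "system":
            out += "[SYSTEM_PROMPT] " + m["content"] + "[/SYSTEM_PROMPT]"
        elif m["role"] == "assistant":
            out += " " + m["content"] + EOS
        else:
            raise ValueError("Only user, system and assistant roles are supported!")
    return out

prompt = render([
    {"role": "system", "content": "You are a narrator."},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there."},
])
# → "<s>[SYSTEM_PROMPT] You are a narrator.[/SYSTEM_PROMPT][INST] Hello[/INST] Hi there.</s>"
```

Note that only user, system, and assistant roles are accepted, and the generation prompt for a new assistant turn is simply everything after the final `[/INST]`.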
config.json ADDED

```json
{
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "dtype": "bfloat16",
  "eos_token_id": 2,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 12288,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 131072,
  "model_type": "mistral",
  "num_attention_heads": 96,
  "num_hidden_layers": 88,
  "num_key_value_heads": 8,
  "quantization_config": {
    "config_groups": {
      "group_0": {
        "format": "nvfp4-pack-quantized",
        "input_activations": {
          "actorder": null,
          "block_structure": null,
          "dynamic": "local",
          "group_size": 16,
          "num_bits": 4,
          "observer": "minmax",
          "observer_kwargs": {},
          "strategy": "tensor_group",
          "symmetric": true,
          "type": "float"
        },
        "output_activations": null,
        "targets": [
          "Linear"
        ],
        "weights": {
          "actorder": null,
          "block_structure": null,
          "dynamic": false,
          "group_size": 16,
          "num_bits": 4,
          "observer": "minmax",
          "observer_kwargs": {},
          "strategy": "tensor_group",
          "symmetric": true,
          "type": "float"
        }
      }
    },
    "format": "nvfp4-pack-quantized",
    "global_compression_ratio": null,
    "ignore": [
      "lm_head"
    ],
    "kv_cache_scheme": null,
    "quant_method": "compressed-tensors",
    "quantization_status": "compressed",
    "sparsity_config": {},
    "transform_config": {},
    "version": "0.12.2"
  },
  "rms_norm_eps": 1e-05,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "transformers_version": "4.56.2",
  "use_cache": true,
  "vocab_size": 32768
}
```
generation_config.json ADDED

```json
{
  "_from_model_config": true,
  "bos_token_id": 1,
  "do_sample": true,
  "eos_token_id": 2,
  "transformers_version": "4.56.2"
}
```
model-00001-of-00015.safetensors … model-00015-of-00015.safetensors ADDED

The 15 model shards were added as Git LFS pointers (`version https://git-lfs.github.com/spec/v1`):

| File | SHA-256 | Size (bytes) |
|---|---|---|
| model-00001-of-00015.safetensors | 77ffe9e33e22caf9e88013e6ad0880a14f59fb919e71fb4a183ae00b0f159e0f | 4882434912 |
| model-00002-of-00015.safetensors | 59d8745131f5ac223f2fb0cdd4215e41cbb412cacaa151c967f77356be63c716 | 4869903000 |
| model-00003-of-00015.safetensors | 8beb485aa1bcbd4d835209c400e87c4bc2a4febecf7dafe18c30018192c6fb1d | 4869903136 |
| model-00004-of-00015.safetensors | 0e6b8753b198d246928a1931fc6b5aa4c2f506d35bef17987479d3e782a0ab47 | 4969044352 |
| model-00005-of-00015.safetensors | 9a5bc1b84c4da8ad7aa8866a8d65ee4839dcc1c9068316998794d2323c8a3080 | 4954838264 |
| model-00006-of-00015.safetensors | 6a1b18838bfe67c377ba85196992eeeb9ef802e3510f9430aa4281162ad0a968 | 4869903136 |
| model-00007-of-00015.safetensors | a12355e837e0ca3916a8ebcf61419b2892c489a610dd7397aab40f3405816892 | 4969044352 |
| model-00008-of-00015.safetensors | 768685d30ab4ed8432bc6c55c131af3adf654d000d84bf6318f64c4856bf067d | 4954838264 |
| model-00009-of-00015.safetensors | 18909c1c2d74c76e56fc07f79bb3384df1e282fed27296b5d7df68514e3d8f1c | 4869903136 |
| model-00010-of-00015.safetensors | c6dd72e56067c65fb84aa63fce4ac83f7dd05a175cb00ba929474941c54b685c | 4969044352 |
| model-00011-of-00015.safetensors | 4dbb727a07bb719b4e7b08f7917d888b612ec21ef54a19f1bedb78cc833a0ddd | 4954838264 |
| model-00012-of-00015.safetensors | 58605fa82f1a658acf0352d1c93671f33398c0e56eca504749ee2e9c4f0908b3 | 4869903136 |
| model-00013-of-00015.safetensors | de39d447715891861389a27f90fa80503d91a9973fd63cd6beab4f4b3e4e9607 | 4969044352 |
| model-00014-of-00015.safetensors | 9f417deed8ca90ad4d60a3d50abef96c031d7658457477566c879aa5ee65397a | 4954838264 |
| model-00015-of-00015.safetensors | f14589b0368fb74769f1ca151d4a5105a8d3a48c4b33b5e8c7aa8ae99115d102 | 1201743176 |
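Summing the shard sizes gives the on-disk footprint of the quantized checkpoint, roughly 70 GB, comfortably within a single RTX Pro 6000 (96 GB) with room left for the KV cache. A quick check of that arithmetic:

```python
# Shard sizes in bytes, as listed in the LFS pointers above.
shard_sizes = [
    4882434912, 4869903000, 4869903136, 4969044352, 4954838264,
    4869903136, 4969044352, 4954838264, 4869903136, 4969044352,
    4954838264, 4869903136, 4969044352, 4954838264, 1201743176,
]
total = sum(shard_sizes)
print(total, round(total / 1e9, 2))  # ≈ 70.13 GB on disk
```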
model.safetensors.index.json ADDED
The diff for this file is too large to render.
recipe.yaml ADDED

```yaml
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: NVFP4
```
special_tokens_map.json ADDED

```json
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
```
tokenizer.json ADDED
The diff for this file is too large to render.
tokenizer.model ADDED

```
version https://git-lfs.github.com/spec/v1
oid sha256:1b968b8dc352f42192367337c78ccc61e1eaddc6d641a579372d4f20694beb7a
size 587562
```
tokenizer_config.json ADDED
The diff for this file is too large to render.