Text Generation
Transformers
Safetensors
PyTorch
nvidia
conversational
Commit 0e77ec8 · verified
sudoping01 committed · 1 Parent(s): dc376c2

Update README.md

Files changed (1)
  1. README.md +8 -8
README.md CHANGED
@@ -136,7 +136,7 @@ Our models are designed and optimized to run on NVIDIA GPU-accelerated systems.
 
 The snippet below shows how to use this model with Huggingface Transformers (tested on version 4.48.3).
 
-```
+```python
 import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM
 
@@ -152,7 +152,7 @@ model = AutoModelForCausalLM.from_pretrained(
 
 Case 1: `/think` or no reasoning signal is provided in the system prompt, reasoning will be set to `True`
 
-```
+```python
 messages = [
 {"role": "system", "content": "/think"},
 {"role": "user", "content": "Write a haiku about GPUs"},
@@ -161,7 +161,7 @@ messages = [
 
 Case 2: `/no_think` is provided, reasoning will be set to `False`
 
-```
+```python
 messages = [
 {"role": "system", "content": "/no_think"},
 {"role": "user", "content": "Write a haiku about GPUs"},
@@ -172,7 +172,7 @@ Note: `/think` or `/no_think` keywords can also be provided in “user” messag
 
 The rest of the inference snippet remains the same
 
-```
+```python
 tokenized_chat = tokenizer.apply_chat_template(
 messages,
 tokenize=True,
@@ -194,7 +194,7 @@ We recommend setting `temperature` to `0.6`, `top_p` to `0.95` for reasoning Tru
 
 The snippet below shows how to use this model with TRT-LLM. We tested this on the following [commit](https://github.com/NVIDIA/TensorRT-LLM/tree/46c5a564446673cdd0f56bcda938d53025b6d04e) and followed these [instructions](https://github.com/NVIDIA/TensorRT-LLM/blob/46c5a564446673cdd0f56bcda938d53025b6d04e/docs/source/installation/build-from-source-linux.md#option-2-build-tensorrt-llm-step-by-step) to build and install TRT-LLM in a docker container.
 
-```
+```python
 from tensorrt_llm import SamplingParams
 from tensorrt_llm._torch import LLM
 from tensorrt_llm._torch.pyexecutor.config import PyTorchConfig
@@ -208,7 +208,7 @@ kv_cache_config = KvCacheConfig(
 )
 ```
 
-```
+```python
 model_id = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
@@ -260,7 +260,7 @@ Note:
 
 Alternativly, you can use Docker to launch a vLLM server.
 
-```
+```bash
 export TP_SIZE=1 # Adjust this value based on the number of GPUs you want to use
 docker run --runtime nvidia --gpus all \
 -v ~/.cache/huggingface:/root/.cache/huggingface \
@@ -498,7 +498,7 @@ Okay, let's see. The user has a bill of $100 and wants to know the amount for an
 
 We follow the jinja chat template provided below. This template conditionally adds `<think>\n` to the start of the Assistant response if `/think` is found in either the system prompt or any user message. If no reasoning signal is added, the model defaults to reasoning "on" mode. The chat template adds `<think></think>` to the start of the Assistant response if `/no_think` is found in the system prompt. Thus enforcing reasoning on/off behavior.
 
-```
+```jinja2
 {%- set ns = namespace(enable_thinking = true) %}
 
 {%- for message in messages -%}
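
For reference, a minimal sketch of the Transformers flow that these fences now highlight, assembled from the fragments visible in this diff. The dtype, device placement, `trust_remote_code`, and generation arguments are assumptions (those lines are cut off in the hunks); the sampling values follow the `temperature=0.6`, `top_p=0.95` recommendation quoted in the `@@ -194` hunk context.

```python
# Sketch only: assembled from the fragments shown in this diff, not the full README snippet.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # assumption: dtype is not visible in this diff
    trust_remote_code=True,       # assumption: may be required for this architecture
    device_map="auto",            # assumption: device placement is not visible in this diff
)

# "/think" (or no signal) turns reasoning on; "/no_think" turns it off.
messages = [
    {"role": "system", "content": "/think"},
    {"role": "user", "content": "Write a haiku about GPUs"},
]

tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,   # assumption: the remaining arguments are cut off in the diff
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    tokenized_chat,
    max_new_tokens=1024,          # assumption: value not shown in the diff
    do_sample=True,
    temperature=0.6,              # recommended for reasoning "on" per the hunk context
    top_p=0.95,
)
print(tokenizer.decode(outputs[0][tokenized_chat.shape[-1]:], skip_special_tokens=True))
```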
 
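The chat-template behavior described in the last hunk (an open `<think>\n` when `/think` or no signal is given, an empty `<think></think>` when `/no_think` appears in the system prompt) can be checked without running the model. The sketch below assumes the `tokenizer` loaded above and only renders the prompt string.

```python
# Sketch only: renders the chat template for both reasoning signals; assumes `tokenizer`
# from the snippet above. The exact rendered text depends on the full template, of which
# this diff shows only the first lines.
for signal in ("/think", "/no_think"):
    msgs = [
        {"role": "system", "content": signal},
        {"role": "user", "content": "Write a haiku about GPUs"},
    ]
    rendered = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    print(f"--- {signal} ---")
    print(rendered[-200:])  # the reasoning toggle appears at the start of the Assistant turn
```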
 
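Finally, a sketch of the Docker-based vLLM launch referenced in the `@@ -260` hunk. Only the `export TP_SIZE=1`, `docker run`, and cache-mount lines are visible in this diff; the port mapping, `--ipc=host`, image name, and serve flags below are assumptions based on common vLLM OpenAI-server usage, not the README's exact command.

```bash
# Sketch only: everything after the volume mount is assumed, not copied from the README.
export TP_SIZE=1  # Adjust this value based on the number of GPUs you want to use

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model nvidia/NVIDIA-Nemotron-Nano-9B-v2 \
    --tensor-parallel-size "$TP_SIZE" \
    --trust-remote-code
```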