Update README.md

README.md (CHANGED)
@@ -136,7 +136,7 @@ Our models are designed and optimized to run on NVIDIA GPU-accelerated systems.
 
 The snippet below shows how to use this model with Huggingface Transformers (tested on version 4.48.3).
 
-```
+```python
 import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM
 
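The Transformers snippet in the hunk above is truncated by the diff context. Below is a minimal load sketch consistent with those imports, assuming the model ID that appears later in this README; the dtype, `device_map`, and `trust_remote_code` settings are assumptions for illustration, not the README's exact arguments.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"  # model ID taken from the TRT-LLM section below

# trust_remote_code is an assumption; the architecture may ship custom modeling code.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # assumed dtype; check the model card for the recommended setting
    device_map="auto",            # assumed; place layers automatically across available GPUs
    trust_remote_code=True,
)
```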
@@ -152,7 +152,7 @@ model = AutoModelForCausalLM.from_pretrained(
 
 Case 1: If `/think` or no reasoning signal is provided in the system prompt, reasoning will be set to `True`.
 
-```
+```python
 messages = [
     {"role": "system", "content": "/think"},
     {"role": "user", "content": "Write a haiku about GPUs"},
@@ -161,7 +161,7 @@ messages = [
 
 Case 2: If `/no_think` is provided, reasoning will be set to `False`.
 
-```
+```python
 messages = [
     {"role": "system", "content": "/no_think"},
     {"role": "user", "content": "Write a haiku about GPUs"},
@@ -172,7 +172,7 @@ Note: `/think` or `/no_think` keywords can also be provided in “user” messag
 
 The rest of the inference snippet remains the same
 
-```
+```python
 tokenized_chat = tokenizer.apply_chat_template(
     messages,
     tokenize=True,
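The `apply_chat_template` call in the hunk above is likewise cut off by the diff context. A hedged sketch of how the tokenization and generation step typically completes, continuing from the `tokenizer`, `model`, and `messages` defined in the snippets above and using the `temperature`/`top_p` values recommended in the hunk header below (the remaining keyword arguments are assumptions, not the README's exact ones):

```python
# Sketch only: render the chat, move it to the model's device, and sample with the
# temperature/top_p values the README recommends for reasoning "on".
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,   # assumed; appends the Assistant header before generation
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    tokenized_chat,
    max_new_tokens=1024,          # assumed generation budget
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0][tokenized_chat.shape[-1]:], skip_special_tokens=True))
```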
@@ -194,7 +194,7 @@ We recommend setting `temperature` to `0.6`, `top_p` to `0.95` for reasoning Tru
 
 The snippet below shows how to use this model with TRT-LLM. We tested this on the following [commit](https://github.com/NVIDIA/TensorRT-LLM/tree/46c5a564446673cdd0f56bcda938d53025b6d04e) and followed these [instructions](https://github.com/NVIDIA/TensorRT-LLM/blob/46c5a564446673cdd0f56bcda938d53025b6d04e/docs/source/installation/build-from-source-linux.md#option-2-build-tensorrt-llm-step-by-step) to build and install TRT-LLM in a docker container.
 
-```
+```python
 from tensorrt_llm import SamplingParams
 from tensorrt_llm._torch import LLM
 from tensorrt_llm._torch.pyexecutor.config import PyTorchConfig
@@ -208,7 +208,7 @@ kv_cache_config = KvCacheConfig(
 )
 ```
 
-```
+```python
 model_id = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
@@ -260,7 +260,7 @@ Note:
 
 Alternatively, you can use Docker to launch a vLLM server.
 
-```
+```bash
 export TP_SIZE=1  # Adjust this value based on the number of GPUs you want to use
 docker run --runtime nvidia --gpus all \
     -v ~/.cache/huggingface:/root/.cache/huggingface \
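Once the Docker command above has the vLLM server running, it exposes an OpenAI-compatible API. A small client-side sketch for querying it; the port and `base_url` follow vLLM's defaults and are assumptions, not values taken from this README.

```python
# Sketch: query the vLLM OpenAI-compatible server started by the Docker command above.
# Assumes vLLM's default port 8000; adjust base_url if the container maps a different port.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    messages=[
        {"role": "system", "content": "/think"},
        {"role": "user", "content": "Write a haiku about GPUs"},
    ],
    temperature=0.6,   # README-recommended sampling for reasoning "on"
    top_p=0.95,
)
print(response.choices[0].message.content)
```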
@@ -498,7 +498,7 @@ Okay, let's see. The user has a bill of $100 and wants to know the amount for an
 
 We follow the jinja chat template provided below. This template conditionally adds `<think>\n` to the start of the Assistant response if `/think` is found in either the system prompt or any user message. If no reasoning signal is added, the model defaults to reasoning "on" mode. The chat template adds `<think></think>` to the start of the Assistant response if `/no_think` is found in the system prompt, thus enforcing the reasoning on/off behavior.
 
-```
+```jinja2
 {%- set ns = namespace(enable_thinking = true) %}
 
 {%- for message in messages -%}
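To observe the behavior described above without running the model, the chat template can be rendered directly. This sketch assumes the tokenizer ships the template quoted in this hunk and only inspects the rendered prompt string; `trust_remote_code` is an assumption.

```python
# Sketch: render the chat template with /think and /no_think and inspect the prompt tail.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "nvidia/NVIDIA-Nemotron-Nano-9B-v2", trust_remote_code=True
)

for signal in ("/think", "/no_think"):
    prompt = tokenizer.apply_chat_template(
        [
            {"role": "system", "content": signal},
            {"role": "user", "content": "Write a haiku about GPUs"},
        ],
        tokenize=False,
        add_generation_prompt=True,
    )
    # Per the description above, /think (or no signal) should leave "<think>\n" at the end
    # of the rendered prompt, while /no_think should end it with "<think></think>".
    print(signal, "->", repr(prompt[-40:]))
```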