Text Generation
Transformers
Safetensors
PyTorch
nvidia
conversational
Commit 0e77ec8 · verified
sudoping01 committed · 1 Parent(s): dc376c2

Update README.md

Files changed (1)
  1. README.md +8 -8
README.md CHANGED
@@ -136,7 +136,7 @@ Our models are designed and optimized to run on NVIDIA GPU-accelerated systems.
 
 The snippet below shows how to use this model with Huggingface Transformers (tested on version 4.48.3).
 
-```
+```python
 import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM
 
@@ -152,7 +152,7 @@ model = AutoModelForCausalLM.from_pretrained(
 
 Case 1: `/think` or no reasoning signal is provided in the system prompt, reasoning will be set to `True`
 
-```
+```python
 messages = [
 {"role": "system", "content": "/think"},
 {"role": "user", "content": "Write a haiku about GPUs"},
@@ -161,7 +161,7 @@ messages = [
 
 Case 2: `/no_think` is provided, reasoning will be set to `False`
 
-```
+```python
 messages = [
 {"role": "system", "content": "/no_think"},
 {"role": "user", "content": "Write a haiku about GPUs"},
@@ -172,7 +172,7 @@ Note: `/think` or `/no_think` keywords can also be provided in “user” messag
 
 The rest of the inference snippet remains the same
 
-```
+```python
 tokenized_chat = tokenizer.apply_chat_template(
 messages,
 tokenize=True,
@@ -194,7 +194,7 @@ We recommend setting `temperature` to `0.6`, `top_p` to `0.95` for reasoning Tru
 
 The snippet below shows how to use this model with TRT-LLM. We tested this on the following [commit](https://github.com/NVIDIA/TensorRT-LLM/tree/46c5a564446673cdd0f56bcda938d53025b6d04e) and followed these [instructions](https://github.com/NVIDIA/TensorRT-LLM/blob/46c5a564446673cdd0f56bcda938d53025b6d04e/docs/source/installation/build-from-source-linux.md#option-2-build-tensorrt-llm-step-by-step) to build and install TRT-LLM in a docker container.
 
-```
+```python
 from tensorrt_llm import SamplingParams
 from tensorrt_llm._torch import LLM
 from tensorrt_llm._torch.pyexecutor.config import PyTorchConfig
@@ -208,7 +208,7 @@ kv_cache_config = KvCacheConfig(
 )
 ```
 
-```
+```python
 model_id = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
@@ -260,7 +260,7 @@ Note:
 
 Alternativly, you can use Docker to launch a vLLM server.
 
-```
+```bash
 export TP_SIZE=1 # Adjust this value based on the number of GPUs you want to use
 docker run --runtime nvidia --gpus all \
 -v ~/.cache/huggingface:/root/.cache/huggingface \
@@ -498,7 +498,7 @@ Okay, let's see. The user has a bill of $100 and wants to know the amount for an
 
 We follow the jinja chat template provided below. This template conditionally adds `<think>\n` to the start of the Assistant response if `/think` is found in either the system prompt or any user message. If no reasoning signal is added, the model defaults to reasoning "on" mode. The chat template adds `<think></think>` to the start of the Assistant response if `/no_think` is found in the system prompt. Thus enforcing reasoning on/off behavior.
 
-```
+```jinja2
 {%- set ns = namespace(enable_thinking = true) %}
 
 {%- for message in messages -%}
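
For reference, a minimal sketch of the Transformers flow that these fences now highlight, assembled from the fragments visible in this diff. The dtype, device placement, `trust_remote_code`, and generation arguments are assumptions (those lines are cut off in the hunks); the sampling values follow the `temperature=0.6`, `top_p=0.95` recommendation quoted in the `@@ -194` hunk context.

```python
# Sketch only: assembled from the fragments shown in this diff, not the full README snippet.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # assumption: dtype is not visible in this diff
    trust_remote_code=True,       # assumption: may be required for this architecture
    device_map="auto",            # assumption: device placement is not visible in this diff
)

# "/think" (or no signal) turns reasoning on; "/no_think" turns it off.
messages = [
    {"role": "system", "content": "/think"},
    {"role": "user", "content": "Write a haiku about GPUs"},
]

tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,   # assumption: the remaining arguments are cut off in the diff
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    tokenized_chat,
    max_new_tokens=1024,          # assumption: value not shown in the diff
    do_sample=True,
    temperature=0.6,              # recommended for reasoning "on" per the hunk context
    top_p=0.95,
)
print(tokenizer.decode(outputs[0][tokenized_chat.shape[-1]:], skip_special_tokens=True))
```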
 
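The chat-template behavior described in the last hunk (an open `<think>\n` when `/think` or no signal is given, an empty `<think></think>` when `/no_think` appears in the system prompt) can be checked without running the model. The sketch below assumes the `tokenizer` loaded above and only renders the prompt string.

```python
# Sketch only: renders the chat template for both reasoning signals; assumes `tokenizer`
# from the snippet above. The exact rendered text depends on the full template, of which
# this diff shows only the first lines.
for signal in ("/think", "/no_think"):
    msgs = [
        {"role": "system", "content": signal},
        {"role": "user", "content": "Write a haiku about GPUs"},
    ]
    rendered = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    print(f"--- {signal} ---")
    print(rendered[-200:])  # the reasoning toggle appears at the start of the Assistant turn
```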
 
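Finally, a sketch of the Docker-based vLLM launch referenced in the `@@ -260` hunk. Only the `export TP_SIZE=1`, `docker run`, and cache-mount lines are visible in this diff; the port mapping, `--ipc=host`, image name, and serve flags below are assumptions based on common vLLM OpenAI-server usage, not the README's exact command.

```bash
# Sketch only: everything after the volume mount is assumed, not copied from the README.
export TP_SIZE=1  # Adjust this value based on the number of GPUs you want to use

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model nvidia/NVIDIA-Nemotron-Nano-9B-v2 \
    --tensor-parallel-size "$TP_SIZE" \
    --trust-remote-code
```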