burtenshaw committed
Commit 7607c06 · 1 Parent(s): 43ec400

remove ticks and backslashes from titles

app/src/content/article.mdx CHANGED
@@ -58,7 +58,7 @@ Learn more about how transformers achieves modularity in the [modular transforme
 
 Now, let's tuck into this deep dive on how nanochat relates to the lineage of transformer architectures.
 
- ## What is `nanochat`?
+ ## What is nanochat?
 
 <Sidenote>
 
@@ -80,7 +80,7 @@ Karpathy had painstakingly implemented an end-to-end build of an LLM system with
 
 Personally, I found the process to be one of the most educational experiences I can remember.
 
- ## What is `transformers` and how is it educational?
+ ## What is transformers?
 
 <Sidenote>
 
@@ -135,7 +135,7 @@ class NanoChatRMSNorm(torch.nn.Module):
 
 If we review a model in `transformers`, we can examine both sides and learn from the math and literature behind the model's implementation. Due to the educational nature of nanochat, I thought it was a perfect opportunity to explore this aspect of transformers and share what I learnt with students.
 
- ## Why do we need nanochat in `transformers`?
+ ## Why do we need nanochat in transformers?
 
 It might seem counterintuitive to support an educational model like nanochat in a production-grade library like `transformers`. After all, we can see from nanochat's benchmark scores that it does not rival state-of-the-art models like Qwen3, SmolLM3, Gemma3, or [Olmo3](https://huggingface.co/allenai/Olmo-3-32B-Think). In fact, that's exactly why we think nanochat should be in `transformers`. Here's what the community gains from its inclusion:
 
@@ -157,8 +157,8 @@ Learn about [model quantization](https://huggingface.co/docs/transformers/en/qua
 - Quantize models in llama.cpp ($0)
 - Integrate models into the browser and WebGPU ($0)
 - SFT training in TRL/torch on Google Colab ($0)
- - RL training in TRL/torch on Google Colab ($0 \- $9)
- - Agentic RL in TRL on Google Colab ($0 \- $9)
+ - RL training in TRL/torch on Google Colab ($0 - $9)
+ - Agentic RL in TRL on Google Colab ($0 - $9)
 
 
 Finally, training AI models is expensive. Running the nanochat `speedrun.sh` costs between $200 and $2k depending on the model size we use. That is little compared to the millions of dollars invested by frontier labs, but it is still a significant sum for students, who learn best by taking a few chances to fail and build experience.
@@ -169,7 +169,7 @@ The [speedrun.sh](https://github.com/karpathy/nanochat/blob/master/speedrun.sh)
 
 </Sidenote>
 
- In short, let's unlock more opportunities for education\!
+ In short, let's unlock more opportunities for education!
 
 ## The nanochat architecture
 
@@ -181,7 +181,7 @@ The original [gpt.py](https://github.com/karpathy/nanochat/blob/master/nanochat/
 
 As described by Karpathy, nanochat uses an archetypal architecture that is common across the field, which makes it an excellent choice for an educational resource because folks get to learn from what works. The core model implementation demonstrates a modern transformer architecture, with every design decision documented and justified.
 
- The configuration uses a single complexity slider: depth. Set `--depth=20` and everything else automatically adjusts. Model dimension equals depth × 64 (20 layers → 1,280 dimensions). Number of attention heads equals depth ÷ 2 (10 heads). Head dimension is fixed at 128\. This "aspect ratio philosophy" simplifies scaling: if you want a more capable model or have a bigger budget, just increase the depth to 26 ($300 budget) or 30 ($1,000 budget).
+ The configuration uses a single complexity slider: depth. Set `--depth=20` and everything else automatically adjusts. Model dimension equals depth × 64 (20 layers → 1,280 dimensions). Number of attention heads equals depth ÷ 2 (10 heads). Head dimension is fixed at 128. This "aspect ratio philosophy" simplifies scaling: if you want a more capable model or have a bigger budget, just increase the depth to 26 ($300 budget) or 30 ($1,000 budget).
 
 The architecture incorporates five key improvements over vanilla transformers. Let's work through the components of this architecture and compare them across implementations:
 
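To make the depth-based sizing in the paragraph above concrete, here is a minimal sketch of the arithmetic. The helper and dataclass names are illustrative rather than nanochat's actual API; only the relations (dimension = depth × 64, heads = depth ÷ 2, head dimension fixed at 128) come from the text.

```python
# Illustrative sketch of the "aspect ratio" sizing described above.
# The names below are hypothetical; only the arithmetic is from the article:
# model_dim = depth * 64, n_heads = depth // 2, head_dim fixed at 128.
from dataclasses import dataclass


@dataclass
class NanoChatSize:
    depth: int      # number of transformer layers, the single complexity slider
    model_dim: int  # hidden size
    n_heads: int    # attention heads
    head_dim: int   # per-head dimension (fixed)


def size_from_depth(depth: int) -> NanoChatSize:
    head_dim = 128
    n_heads = depth // 2
    model_dim = depth * 64
    # Sanity check: heads * head_dim recovers the model dimension (10 * 128 = 1280 at depth 20).
    assert n_heads * head_dim == model_dim
    return NanoChatSize(depth, model_dim, n_heads, head_dim)


print(size_from_depth(20))  # depth 20 -> 1,280 dims, 10 heads
print(size_from_depth(26))  # the $300-budget configuration
```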
app/src/content/chapters/grpo.mdx CHANGED
@@ -1,4 +1,4 @@
- # [BONUS 3] Group Relative Policy Optimization in `torch`
+ # [BONUS 3] Group Relative Policy Optimization in vanilla torch
 
 - [GRPO](https://colab.research.google.com/#fileId=https%3A//huggingface.co/datasets/nanochat-students/notebooks/blob/main/grpo.ipynb)
 
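The linked notebook covers GRPO end to end; as a brief aside, the group-relative part of the algorithm can be sketched in plain torch. The snippet below is not taken from the notebook and the reward values are made up; it only illustrates the standard GRPO advantage, where each completion's reward is normalized against the other completions sampled for the same prompt.

```python
import torch

# Minimal sketch of GRPO's group-relative advantage. `rewards` is assumed to hold
# one scalar reward per completion, arranged as (num_prompts, group_size);
# the numbers here are placeholders.
rewards = torch.tensor([[0.0, 1.0, 0.5, 1.0],
                        [0.2, 0.2, 0.9, 0.1]])

group_mean = rewards.mean(dim=-1, keepdim=True)
group_std = rewards.std(dim=-1, keepdim=True)

# Each completion is scored relative to its own group, so no value network is needed.
advantages = (rewards - group_mean) / (group_std + 1e-4)
print(advantages)
```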
app/src/content/chapters/inference.mdx CHANGED
@@ -27,7 +27,7 @@ outputs = model.generate(**inputs, max_new_tokens=100)
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
 
- ### Inference in `transformers` with `vLLM`
+ ### Inference in transformers with vLLM
 
 Next, let's use `transformers` as a backend for `vLLM` to serve the model for optimized inference.
 
@@ -56,7 +56,7 @@ curl -X POST "http://localhost:8000/v1/completions" \
 }'
 ```
 
- ### Inference on your trained `nanochat` weights
+ ### Inference on your trained nanochat weights
 
 Let's say you've followed the nanochat repo and used it to train a model. Then you can add transformers compatibility to your model and use it in other libraries.
 
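For the vLLM section above, a minimal offline-inference sketch through vLLM's Transformers backend might look like the following. The repo id is a placeholder, and `model_impl="transformers"` is my assumption about how to force that backend; check the vLLM docs for the exact argument.

```python
# Hedged sketch: running a transformers-compatible nanochat checkpoint through
# vLLM's Transformers backend. The repo id is a placeholder and the
# `model_impl="transformers"` argument is an assumption, not verified here.
from vllm import LLM, SamplingParams

llm = LLM(model="your-username/nanochat-d20", model_impl="transformers")
params = SamplingParams(max_tokens=100, temperature=0.8)

outputs = llm.generate(["Why is the sky blue?"], params)
print(outputs[0].outputs[0].text)
```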