burtenshaw committed
Commit 7607c06 · 1 Parent(s): 43ec400
remove ticks and backslashes from titles

app/src/content/article.mdx
CHANGED

@@ -58,7 +58,7 @@ Learn more about how transformers achieves modularity in the [modular transforme
 
 Now, let's tuck into this deep dive on how NanoChat relates the lineage of transformer architectures.
 
-## What is \`nanochat\`?
+## What is nanochat?
 
 <Sidenote>
 
@@ -80,7 +80,7 @@ Karpathy had painstakingly implemented an end-to-end build of an LLM system with
 
 Personally, I found the process to be one of the most educational I can remember.
 
-## What is \`transformers\`?
+## What is transformers?
 
 <Sidenote>
 
@@ -135,7 +135,7 @@ class NanoChatRMSNorm(torch.nn.Module):
 
 If we review a model in `transformers`, we can review both sides and learn from the math and literature of the model's implementation. Due to the educational nature of nanochat, I thought that it was a perfect opportunity to explore this aspect of transformers and share what I learnt with students.
 
-## Why do we need nanochat in \`transformers\`?
+## Why do we need nanochat in transformers?
 
 It might seem counterintuitive to support an educational model like nanochat in a production grade library like `transformers`. After all, we can see from nanochat's benchmark scores that it does not rival state of the art models like Qwen3, SmolLM3, Gemma3, or [Olmo3](https://huggingface.co/allenai/Olmo-3-32B-Think). In fact, that's the reason we think nanochat should be in `transformers`. Here's what the community gains from its inclusion:
 
@@ -157,8 +157,8 @@ Learn about [model quantization](https://huggingface.co/docs/transformers/en/qua
 - Quantize models in llama.cpp ($0)
 - Integrate models into the browser and WebGPU ($0)
 - SFT training in TRL/torch on Google Colab ($0)
-- RL training TRL/torch on Google Colab ($0
-- Agentic RL in TRL on Google Colab ($0
+- RL training TRL/torch on Google Colab ($0 - $9)
+- Agentic RL in TRL on Google Colab ($0 - $9)
 
 
 Finally, training AI models is expensive. Running the nanochat `speedrun.sh` costs between $200 and $2k depending on the model size we use. Which is little compared to the millions of dollars invested by frontier labs. But that is still a significant sum for students, who always learn best by taking a few chances to fail and build experience.
@@ -169,7 +169,7 @@ The [speedrun.sh](https://github.com/karpathy/nanochat/blob/master/speedrun.sh)
 
 </Sidenote>
 
-In short, let's unlock more opportunities for education
+In short, let's unlock more opportunities for education!
 
 ## The nanochat architecture
 
@@ -181,7 +181,7 @@ The original [gpt.py](https://github.com/karpathy/nanochat/blob/master/nanochat/
 
 As described by Karpathy, nanochat uses an archetypal architecture that is common across the field, which makes it an excellent choice for an educational resource because folk get to learn from what works. The core model implementation demonstrates modern transformer architecture, with every design decision documented and justified.
 
-The configuration uses a single complexity slider: depth. Set `--depth=20` and everything else automatically adjusts. Model dimension equals depth × 64 (20 layers → 1,280 dimensions). Number of attention heads equals depth ÷ 2 (10 heads). Head dimension is fixed at 128
+The configuration uses a single complexity slider: depth. Set `--depth=20` and everything else automatically adjusts. Model dimension equals depth × 64 (20 layers → 1,280 dimensions). Number of attention heads equals depth ÷ 2 (10 heads). Head dimension is fixed at 128. This "aspect ratio philosophy" simplifies scaling. So if you want a more capable model or have a bigger budget. Just increase depth to 26 ($300 budget) or 30 ($1,000 budget).
 
 The architecture incorporates five key improvements over vanilla transformers. Let's work through the components of this architecture and compare them across implementation:
 
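The depth-slider arithmetic quoted in the last hunk is easy to sanity-check. Here is a small illustrative sketch (not nanochat's actual configuration code; the function name and output format are invented) that derives the model dimensions from a single depth value:

```python
# Illustrative sketch of the "aspect ratio" rule described above: every size is
# derived from one depth setting. This is not the actual nanochat config code.
def dims_from_depth(depth: int) -> dict:
    model_dim = depth * 64   # depth 20 -> 1,280 dimensions
    n_heads = depth // 2     # depth 20 -> 10 attention heads
    head_dim = 128           # fixed, so n_heads * head_dim == model_dim for even depth
    return {"model_dim": model_dim, "n_heads": n_heads, "head_dim": head_dim}

for depth in (20, 26, 30):   # the depths mentioned in the paragraph above
    print(depth, dims_from_depth(depth))
# 20 {'model_dim': 1280, 'n_heads': 10, 'head_dim': 128}
# 26 {'model_dim': 1664, 'n_heads': 13, 'head_dim': 128}
# 30 {'model_dim': 1920, 'n_heads': 15, 'head_dim': 128}
```
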
app/src/content/chapters/grpo.mdx
CHANGED

@@ -1,4 +1,4 @@
-# [BONUS 3] Group Relative Policy Optimization in
+# [BONUS 3] Group Relative Policy Optimization in vanilla torch
 
 - [GRPO](https://colab.research.google.com/#fileId=https%3A//huggingface.co/datasets/nanochat-students/notebooks/blob/main/grpo.ipynb)
 
app/src/content/chapters/inference.mdx
CHANGED

@@ -27,7 +27,7 @@ outputs = model.generate(**inputs, max_new_tokens=100)
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
 
-### Inference in \`transformers\` with vLLM
+### Inference in transformers with vLLM
 
 Next, let's use `transformers` as a backend for `vLLM` to serve the model for optimized inference.
 
@@ -56,7 +56,7 @@ curl -X POST "http://localhost:8000/v1/completions" \
 }'
 ```
 
-### Inference on your trained \`nanochat\` weights
+### Inference on your trained nanochat weights
 
 Let's say you've followed the nanochat repo and used it to train a model. The you can add transformer compatibility to your model and use it in other libraries.
 
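For reference, here is a hedged sketch of what the vLLM hunk above describes: generating with `transformers` as the model backend. The hub id is a placeholder, and `model_impl="transformers"` assumes a recent vLLM release; the article's complete serving commands are not shown in this diff.

```python
# Minimal sketch, not taken from the article: offline generation with vLLM while
# forcing the transformers modeling code as the backend.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nanochat-students/nanochat-d20",  # placeholder model id - use your own checkpoint
    model_impl="transformers",               # use the transformers implementation as the backend
)
params = SamplingParams(max_tokens=100, temperature=0.8)
outputs = llm.generate(["Why is the sky blue?"], params)
print(outputs[0].outputs[0].text)
```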
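And a minimal, hypothetical sketch of the last point: once your own nanochat run has been saved as a transformers-compatible checkpoint directory (the path below is made up, and the conversion is assumed to have already been done), it loads through the standard auto classes:

```python
# Hypothetical sketch: load a converted nanochat checkpoint like any other
# transformers causal LM. The directory path is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./my-nanochat-run"  # placeholder path to your converted checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)

inputs = tokenizer("The chemical symbol of gold is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```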