burtenshaw committed
Commit 7607c06 · 1 Parent(s): 43ec400
remove ticks and backslashes from titles

app/src/content/article.mdx
CHANGED

@@ -58,7 +58,7 @@ Learn more about how transformers achieves modularity in the [modular transforme
 
 Now, let's tuck into this deep dive on how NanoChat relates the lineage of transformer architectures.
 
-## What is \`nanochat\`?
+## What is nanochat?
 
 <Sidenote>
 
@@ -80,7 +80,7 @@ Karpathy had painstakingly implemented an end-to-end build of an LLM system with
 
 Personally, I found the process to be one of the most educational I can remember.
 
-## What is \`transformers\`?
+## What is transformers?
 
 <Sidenote>
 
@@ -135,7 +135,7 @@ class NanoChatRMSNorm(torch.nn.Module):
 
 If we review a model in `transformers`, we can review both sides and learn from the math and literature of the model's implementation. Due to the educational nature of nanochat, I thought that it was a perfect opportunity to explore this aspect of transformers and share what I learnt with students.
 
-## Why do we need nanochat in \`transformers\`?
+## Why do we need nanochat in transformers?
 
 It might seem counterintuitive to support an educational model like nanochat in a production grade library like `transformers`. After all, we can see from nanochat's benchmark scores that it does not rival state of the art models like Qwen3, SmolLM3, Gemma3, or [Olmo3](https://huggingface.co/allenai/Olmo-3-32B-Think). In fact, that's the reason we think nanochat should be in `transformers`. Here's what the community gains from its inclusion:
 
@@ -157,8 +157,8 @@ Learn about [model quantization](https://huggingface.co/docs/transformers/en/qua
 - Quantize models in llama.cpp ($0)
 - Integrate models into the browser and WebGPU ($0)
 - SFT training in TRL/torch on Google Colab ($0)
-- RL training TRL/torch on Google Colab ($0
-- Agentic RL in TRL on Google Colab ($0
+- RL training TRL/torch on Google Colab ($0 - $9)
+- Agentic RL in TRL on Google Colab ($0 - $9)
 
 
 Finally, training AI models is expensive. Running the nanochat `speedrun.sh` costs between $200 and $2k depending on the model size we use. Which is little compared to the millions of dollars invested by frontier labs. But that is still a significant sum for students, who always learn best by taking a few chances to fail and build experience.
@@ -169,7 +169,7 @@ The [speedrun.sh](https://github.com/karpathy/nanochat/blob/master/speedrun.sh)
 
 </Sidenote>
 
-In short, let's unlock more opportunities for education
+In short, let's unlock more opportunities for education!
 
 ## The nanochat architecture
 
@@ -181,7 +181,7 @@ The original [gpt.py](https://github.com/karpathy/nanochat/blob/master/nanochat/
 
 As described by Karpathy, nanochat uses an archetypal architecture that is common across the field, which makes it an excellent choice for an educational resource because folk get to learn from what works. The core model implementation demonstrates modern transformer architecture, with every design decision documented and justified.
 
-The configuration uses a single complexity slider: depth. Set `--depth=20` and everything else automatically adjusts. Model dimension equals depth × 64 (20 layers → 1,280 dimensions). Number of attention heads equals depth ÷ 2 (10 heads). Head dimension is fixed at 128
+The configuration uses a single complexity slider: depth. Set `--depth=20` and everything else automatically adjusts. Model dimension equals depth × 64 (20 layers → 1,280 dimensions). Number of attention heads equals depth ÷ 2 (10 heads). Head dimension is fixed at 128. This "aspect ratio philosophy" simplifies scaling. So if you want a more capable model or have a bigger budget. Just increase depth to 26 ($300 budget) or 30 ($1,000 budget).
 
 The architecture incorporates five key improvements over vanilla transformers. Let's work through the components of this architecture and compare them across implementation:
 
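The depth-slider arithmetic quoted in the last hunk is easy to sanity-check. Here is a small illustrative sketch (not nanochat's actual configuration code; the function name and output format are invented) that derives the model dimensions from a single depth value:

```python
# Illustrative sketch of the "aspect ratio" rule described above: every size is
# derived from one depth setting. This is not the actual nanochat config code.
def dims_from_depth(depth: int) -> dict:
    model_dim = depth * 64   # depth 20 -> 1,280 dimensions
    n_heads = depth // 2     # depth 20 -> 10 attention heads
    head_dim = 128           # fixed, so n_heads * head_dim == model_dim for even depth
    return {"model_dim": model_dim, "n_heads": n_heads, "head_dim": head_dim}

for depth in (20, 26, 30):   # the depths mentioned in the paragraph above
    print(depth, dims_from_depth(depth))
# 20 {'model_dim': 1280, 'n_heads': 10, 'head_dim': 128}
# 26 {'model_dim': 1664, 'n_heads': 13, 'head_dim': 128}
# 30 {'model_dim': 1920, 'n_heads': 15, 'head_dim': 128}
```
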
app/src/content/chapters/grpo.mdx
CHANGED

@@ -1,4 +1,4 @@
-# [BONUS 3] Group Relative Policy Optimization in
+# [BONUS 3] Group Relative Policy Optimization in vanilla torch
 
 - [GRPO](https://colab.research.google.com/#fileId=https%3A//huggingface.co/datasets/nanochat-students/notebooks/blob/main/grpo.ipynb)
 
app/src/content/chapters/inference.mdx
CHANGED

@@ -27,7 +27,7 @@ outputs = model.generate(**inputs, max_new_tokens=100)
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
 
-### Inference in \`transformers\` with vLLM
+### Inference in transformers with vLLM
 
 Next, let's use `transformers` as a backend for `vLLM` to serve the model for optimized inference.
 
@@ -56,7 +56,7 @@ curl -X POST "http://localhost:8000/v1/completions" \
 }'
 ```
 
-### Inference on your trained \`nanochat\` weights
+### Inference on your trained nanochat weights
 
 Let's say you've followed the nanochat repo and used it to train a model. The you can add transformer compatibility to your model and use it in other libraries.
 
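For reference, here is a hedged sketch of what the vLLM hunk above describes: generating with `transformers` as the model backend. The hub id is a placeholder, and `model_impl="transformers"` assumes a recent vLLM release; the article's complete serving commands are not shown in this diff.

```python
# Minimal sketch, not taken from the article: offline generation with vLLM while
# forcing the transformers modeling code as the backend.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nanochat-students/nanochat-d20",  # placeholder model id - use your own checkpoint
    model_impl="transformers",               # use the transformers implementation as the backend
)
params = SamplingParams(max_tokens=100, temperature=0.8)
outputs = llm.generate(["Why is the sky blue?"], params)
print(outputs[0].outputs[0].text)
```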
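And a minimal, hypothetical sketch of the last point: once your own nanochat run has been saved as a transformers-compatible checkpoint directory (the path below is made up, and the conversion is assumed to have already been done), it loads through the standard auto classes:

```python
# Hypothetical sketch: load a converted nanochat checkpoint like any other
# transformers causal LM. The directory path is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./my-nanochat-run"  # placeholder path to your converted checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)

inputs = tokenizer("The chemical symbol of gold is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```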