burtenshaw commited on
Commit
f6fedae
·
1 Parent(s): 25932c2

add real content

Files changed (2)
  1. README.md +443 -65
  2. _README.md +122 -0
README.md CHANGED
@@ -1,6 +1,6 @@
1
  ---
2
- title: 'Bringing paper to life: A modern template for scientific writing'
3
- short_desc: 'A practical journey behind training SOTA LLMs'
4
  emoji: 📝
5
  colorFrom: blue
6
  colorTo: indigo
@@ -10,9 +10,7 @@ header: mini
10
  app_port: 8080
11
  tags:
12
  - research-article-template
13
- - research paper
14
- - scientific paper
15
- - data visualization
16
  thumbnail: https://HuggingFaceTB-smol-training-playbook.hf.space/thumb.png
17
  ---
18
  <div align="center">
@@ -31,92 +29,472 @@ thumbnail: https://HuggingFaceTB-smol-training-playbook.hf.space/thumb.png
31
 
32
  </div>
33
 
34
- ## 🚀 Quick Start
35
 
36
- ### Option 1: Duplicate on Hugging Face (Recommended)
37
 
38
- 1. Visit **[🤗 Research Article Template](https://huggingface.co/spaces/tfrere/research-article-template)**
39
- 2. Click **"Duplicate this Space"**
40
- 3. Clone your new repository:
41
- ```bash
42
- git clone git@hf.co:spaces/<your-username>/<your-space>
43
- cd <your-space>
44
- ```
45
 
46
- ### Option 2: Clone Directly
47
 
48
- ```bash
49
- git clone https://github.com/tfrere/research-article-template.git
50
- cd research-article-template
51
  ```
52
 
53
- ### Installation
54
 
55
- ```bash
56
- # Install Node.js 20+ (use nvm for version management)
57
- nvm install 20
58
- nvm use 20
 
59
 
60
- # Install Git LFS and pull assets
61
- git lfs install
62
- git lfs pull
63
 
64
- # Install dependencies
65
- cd app
66
- npm install
67
 
68
- # Start development server
69
- npm run dev
70
  ```
71
 
72
- Visit `http://localhost:4321` to see your site!
73
 
74
- ## 🎯 Who This Is For
75
 
76
- - **Scientists** writing modern, web-native research papers
77
- - **Educators** creating interactive, explorable lessons
78
- - **Researchers** who want to focus on ideas, not infrastructure
79
- - **Anyone** who values clear, engaging technical communication
80
 
81
- ## 🌟 Inspired by Distill
82
 
83
- This template carries forward the spirit of [Distill](https://distill.pub/) (2016–2021), pushing interactive scientific writing even further with:
84
- - Accessible, high-quality explanations
85
- - Reproducible, production-ready demos
86
- - Modern web technologies and best practices
87
 
88
- ## 🤝 Contributing
89
 
90
- We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details.
91
 
92
- ### Ways to Contribute
 
 
 
 
93
 
94
- - **Report bugs** - Open an issue with detailed information
95
- - **Suggest features** - Share ideas for improvements
96
- - **Improve documentation** - Help others get started
97
- - **Submit code** - Fix bugs or add features
98
- - **Join discussions** - Share feedback and ideas
99
 
100
- ## 📄 License
101
 
102
- This project is licensed under the [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).
103
 
104
- - **Diagrams and text**: CC-BY 4.0
105
- - **Source code**: Available on [Hugging Face](https://huggingface.co/spaces/tfrere/research-article-template)
106
- - **Third-party figures**: Excluded and marked in captions
107
 
108
- ## 🙏 Acknowledgments
109
 
110
- - Inspired by [Distill](https://distill.pub/) and the interactive scientific writing movement
111
- - Built with [Astro](https://astro.build/), [MDX](https://mdxjs.com/), and modern web technologies
112
- - Community feedback and contributions from researchers worldwide
113
 
114
- ## 📞 Support
115
 
116
- - **[Community Discussions](https://huggingface.co/spaces/tfrere/research-article-template/discussions)** - Ask questions and share ideas
117
- - **[Report Issues](https://huggingface.co/spaces/tfrere/research-article-template/discussions?status=open&type=issue)** - Bug reports and feature requests
118
- - **Contact**: [@tfrere](https://huggingface.co/tfrere) on Hugging Face
119
 
120
- ---
121
 
122
- **Made with ❤️ for the scientific community**
 
 
1
  ---
2
+ title: 'Porting nanochat to Transformers: an AI modeling history lesson'
3
+ short_desc: "An educational tour of Andrej Karpathy's nanochat and Hugging Face transformers"
4
  emoji: 📝
5
  colorFrom: blue
6
  colorTo: indigo
 
10
  app_port: 8080
11
  tags:
12
  - research-article-template
13
+ - educational paper
 
 
14
  thumbnail: https://HuggingFaceTB-smol-training-playbook.hf.space/thumb.png
15
  ---
16
  <div align="center">
 
29
 
30
  </div>
31
 
32
+ # Porting nanochat to Transformers: an AI modeling history lesson
33
 
34
+ **tldr:** There is a lot to learn about ML from nanochat, and even more to learn about the history of the transformer architecture.
35
 
36
+ Recently I was helping students of the [nanochat](https://huggingface.co/nanochat-students) project share their models and discuss their learning on Hugging Face. In the process, I thought it would be useful to integrate the model into the `transformers` library. This would let others use their nanochat models in loads of downstream libraries, like vLLM for inference or TRL for post-training.
 
 
 
 
 
 
37
 
38
+ You can now use nanochat models in transformers and tap into all those educational gains across the ecosystem. But along the way, I uncovered a further treasure trove of education about how canonical models relate to each other, and the components they share.
39
 
40
+ The lesson came from a simple teacher: class inheritance and the modular philosophy of `transformers`. If you want to learn more about that, check out the [modular transformers guide](https://huggingface.co/docs/transformers/v4.48.0/modular_transformers).
41
+
42
+ Here, let’s tuck into a deep dive on how NanoChat relates to the lineage of transformer architectures.
43
+
44
+ ## What is `nanochat`?
45
+
46
+ On October 13th 2025, Andrej Karpathy unceremoniously [dropped](https://x.com/karpathy/status/1977755427569111362) the nanochat [repo](https://github.com/karpathy/nanochat) into the unsuspecting AI world. To hype seekers, this was just a small and pretty average LLM. To ML devotees, this was nirvana: a raw, unadulterated chance to tinker, fiddle, and play with a transformer model defined in pure PyTorch. Nothing was hidden away in fancy `torch` methods or inherited from complex class structures. It was all there in a simple file.
47
+
50
+ Karpathy had painstakingly implemented an end-to-end build of an LLM system without the use of most major libraries, even though in real-world situations most people rely on transformers, tokenizers, datasets, trl, etc. This back-to-basics approach gives us the chance to genuinely learn and understand something from the ground up.
51
+
52
+ Personally, I found the process to be one of the most educational I can remember.
53
+
54
+ ## What is `transformers` and how is it educational?
55
+
56
+ Most of us know the `transformers` library as the backbone of modern machine learning, but if we dig a little deeper, it’s also a powerful educational resource.
57
+
58
+ If you don’t know, `transformers` is the de facto implementation of the modern AI models that bear the same name: ‘transformers’, the architecture behind model series like GPT, DeepSeek, and Claude. `transformers` is a special project because it contains implementations of all major open model architectures, and those architectures are modularized to reuse functionality from each other. If you want to explore the philosophy and lineage behind transformers’ modularity, check out the [modular transformers guide](https://huggingface.co/docs/transformers/v4.48.0/modular_transformers).
59
+
60
+ In general, scientists at AI research labs design, implement, and train their models in their framework of choice, be that torch, JAX, etc. When they come to share their open model with the community, they will open a PR on transformers and refactor their code to use relevant modules.
61
+
62
+ Because `transformers` contains most major model implementations, researchers have to inherit model architecture attributes from other canonical models. This makes the library, in every sense, a ‘single source of truth’.
63
+
64
+ This practical feature of the library has an amazingly educational quality to it. We can read a model implementation as a series of references to other usages of those architectural features. For example, when one model uses a certain type of [RMSNorm](https://github.com/huggingface/transformers/blob/9f5b2d1b8995daa539b757e28c337e36408055e6/src/transformers/models/nanochat/modular_nanochat.py#L44), we can plainly see that it is the same implementation as another model because it inherits that class entirely. Here is nanochat’s RMSNorm:
65
+
66
+ ```py
67
+ class NanoChatRMSNorm(Llama4TextL2Norm):
+     pass
69
  ```
70
 
71
+ The `transformers` library then converts the `modular_*` implementation into a `modeling_*` implementation, which contains the complete `torch` native implementation:
72
 
73
+ ```py
74
+ class NanoChatRMSNorm(torch.nn.Module):
+     def __init__(self, eps: float = 1e-6):
+         super().__init__()
+         self.eps = eps
+
+     def _norm(self, x):
+         return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
+
+     def forward(self, x):
+         return self._norm(x.float()).type_as(x)
+
+     def extra_repr(self):
+         return f"eps={self.eps}"
87
  ```
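To see what this normalization does numerically, here is a dependency-free sketch in plain Python. It uses lists instead of tensors, and the helper names `rms_norm` and `rms` are ours for illustration, not part of `transformers`. It mirrors the `_norm` formula above: every element is divided by the root mean square of the vector, with no learnable scale.

```python
import math


def rms_norm(x, eps=1e-6):
    """Scale a vector so that its root mean square is approximately 1."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms for v in x]


def rms(x):
    """Root mean square of a vector."""
    return math.sqrt(sum(v * v for v in x) / len(x))


vec = [3.0, -4.0, 12.0, 0.5]
normed = rms_norm(vec)

# The direction of the vector is unchanged; only its scale is fixed.
print("rms before:", round(rms(vec), 4))
print("rms after: ", round(rms(normed), 4))
```

Because there are no learnable parameters, the same function can be reused anywhere in the model without adding to the checkpoint.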
88
 
89
+ If we review a model in `transformers`, we can review both sides and learn from the math and literature of the model’s implementation. Due to the educational nature of nanochat, I thought that it was a perfect opportunity to explore this aspect of transformers and share what I learnt with students.
90
 
91
+ ## Why do we need nanochat in `transformers`?
92
 
93
+ It might seem counterintuitive to support an educational model like nanochat in a production grade library like `transformers`. After all, we can see from nanochat’s benchmark scores that it does not rival state of the art models like Qwen3, SmolLM3, Gemma3, or Olmo3.
 
 
 
94
 
95
+ Nanochat was never really intended as a production grade model. It was meant as an educational tool, and that’s the same reason why we need it in transformers. There are four main reasons:
96
 
97
+ - `transformers` as a single source of truth teaches us about `nanochat`’s lineage.
98
+ - use the `nanochat` model in other libraries.
99
+ - save money by reusing nanochat checkpoints for fine-tuning.
100
+ - compare nanochat fine-tuning with other open model checkpoints.
101
 
102
+ Firstly, as mentioned above, `transformers` teaches us about the modeling conventions that Karpathy borrows from other canonical implementations.
103
 
104
+ Secondly, because transformers is a standard within the ecosystem, it unlocks more downstream learning in post training libraries, quantisation tools, inference libraries, and device integrations. In practical terms, here are some examples nanochat students could learn on top of `transformers`:
105
 
106
+ - Quantize models in llama.cpp ($0)
107
+ - Integrate models into the browser and WebGPU ($0)
108
+ - SFT training in TRL/torch on Google Colab ($0)
109
+ - RL training in TRL/torch on Google Colab ($0 - $9)
110
+ - Agentic RL in TRL on Google Colab ($0 - $9)
111
 
112
+ Finally, training AI models is expensive. Running the `nanochat` [`speedrun.sh`](https://github.com/karpathy/nanochat/blob/master/speedrun.sh) costs between $200 and $2k depending on the model size we use. That is little compared to the millions of dollars invested by frontier labs, but it is still a significant sum for students, who always learn best by taking a few chances to fail and build experience.
 
 
 
 
113
 
114
+ In short, let’s unlock more opportunities for education!
115
 
116
+ ## The nanochat architecture
117
 
118
+ As described by Karpathy, nanochat uses an archetypal architecture that is common across the field, which makes it an excellent choice for an educational resource because folk get to learn from what works.
 
 
119
 
120
+ The core model implementation ([`nanochat/gpt.py`](https://github.com/karpathy/nanochat/blob/master/nanochat/gpt.py), 291 lines) demonstrates modern transformer architecture, with every design decision documented and justified.
121
 
122
+ The configuration uses a single complexity slider: depth. Set `--depth=20` and everything else adjusts automatically. Model dimension equals depth × 64 (20 layers → 1,280 dimensions). Number of attention heads equals depth ÷ 2 (10 heads). Head dimension is fixed at 128. This "aspect ratio philosophy" simplifies scaling: if you want a more capable model or have a bigger budget, just increase depth to 26 ($300 budget) or 30 ($1,000 budget).
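The arithmetic behind the depth slider can be sketched in a few lines of Python. This is an illustration of the rules stated above, not the code from the nanochat repo: `DerivedConfig` and `config_from_depth` are hypothetical names, and the floor division for odd depths is our assumption, since the text only covers even values.

```python
from dataclasses import dataclass

HEAD_DIM = 128  # fixed head dimension, as stated in the text


@dataclass
class DerivedConfig:
    depth: int
    model_dim: int
    num_heads: int
    head_dim: int


def config_from_depth(depth: int) -> DerivedConfig:
    """Derive model width and head count from the single depth knob."""
    model_dim = depth * 64
    # For even depths this equals depth / 2, matching the text.
    num_heads = model_dim // HEAD_DIM
    return DerivedConfig(depth, model_dim, num_heads, HEAD_DIM)


for depth in (20, 26, 30):
    print(config_from_depth(depth))
```

For `depth=20` this gives a model dimension of 1,280 and 10 heads, the same numbers quoted in the paragraph.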
 
 
123
 
124
+ The architecture incorporates five key improvements over vanilla transformers. Let’s work through the components of this architecture and compare them across the two implementations:
125
 
126
+ ### Forward pass based on the Llama Architecture
 
 
127
 
128
+ The forward pass in nanochat handles both training and generation. We can simply read that the input `x` is embedded and then updated by each layer then the head. During training, a loss is calculated and returned instead of the logits themselves.
129
+
130
+ ```py
131
+ def forward(self, x, targets=None, loss_reduction='mean'):
+     x = self.token_emb(x)
+     for layer in self.layers:
+         x = layer(x)
+     x = self.ln_f(x)
+     logits = self.lm_head(x)
+
+     if targets is not None:
+         loss = F.cross_entropy(
+             logits.view(-1, self.vocab_size),
+             targets.view(-1),
+             ignore_index=-1,
+             reduction=loss_reduction
+         )
+         return loss
+     return logits
147
+ ```
148
+
149
+ By returning loss directly when targets are provided, the training loop becomes trivial. No separate loss computation, no manual masking logic—just `loss = model(inputs, targets)` followed by `loss.backward()`.
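To make the role of `ignore_index=-1` concrete, here is a small pure-Python sketch of a mean-reduced cross-entropy that skips masked positions. It is a teaching aid written under our own assumptions (plain lists, a hypothetical `cross_entropy` helper), not the PyTorch implementation, but it follows the same definition: the loss is averaged only over positions whose target is not the ignored value.

```python
import math


def cross_entropy(logits, targets, ignore_index=-1):
    """Mean negative log-likelihood over positions that are not ignored."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # masked position contributes nothing to the loss
        # Numerically stable log-sum-exp over the vocabulary dimension
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[target]
        count += 1
    return total / count if count else float("nan")


logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
targets = [0, -1, 1]  # the middle position is masked out
print(cross_entropy(logits, targets))
```

Setting a target to `-1` is therefore all the masking logic a training loop needs, for example to exclude prompt tokens from the loss.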
150
+
151
+ `transformers` has to make things a bit more complex to facilitate the downstream ecosystem that uses logits in a broad spectrum of ways. Therefore, loss calculation is dealt with in training-specific code, and the `forward` function returns `BaseModelOutputWithPast`.
152
+
153
+ ```py
154
+ class NanoChatModel(LlamaModel):
+     def __init__(self, config: NanoChatConfig):
+         super().__init__(config)
+
+         self.initial_norm = NanoChatRMSNorm(eps=config.rms_norm_eps)
+         self.norm = NanoChatRMSNorm(eps=config.rms_norm_eps)
+
+     def forward(
+         self,
+         input_ids: Optional[torch.LongTensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_values: Optional[Cache] = None,
+         inputs_embeds: Optional[torch.FloatTensor] = None,
+         cache_position: Optional[torch.LongTensor] = None,
+         use_cache: Optional[bool] = None,
+         **kwargs: Unpack[TransformersKwargs],
+     ) -> BaseModelOutputWithPast:
+         if (input_ids is None) ^ (inputs_embeds is not None):
+             raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
+
+         if inputs_embeds is None:
+             inputs_embeds: torch.Tensor = self.embed_tokens(input_ids)
+
+         if use_cache and past_key_values is None:
+             past_key_values = DynamicCache(config=self.config)
+
+         if cache_position is None:
+             past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
+             cache_position: torch.Tensor = torch.arange(
+                 past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
+             )
+
+         if position_ids is None:
+             position_ids = cache_position.unsqueeze(0)
+
+         causal_mask = create_causal_mask(
+             config=self.config,
+             input_embeds=inputs_embeds,
+             attention_mask=attention_mask,
+             cache_position=cache_position,
+             past_key_values=past_key_values,
+             position_ids=position_ids,
+         )
+
+         hidden_states = inputs_embeds
+         position_embeddings = self.rotary_emb(hidden_states, position_ids=position_ids)
+
+         hidden_states = self.initial_norm(hidden_states)  # Additional norm before the layers
+         for decoder_layer in self.layers[: self.config.num_hidden_layers]:
+             hidden_states = decoder_layer(
+                 hidden_states,
+                 attention_mask=causal_mask,
+                 position_embeddings=position_embeddings,
+                 position_ids=position_ids,
+                 past_key_values=past_key_values,
+                 cache_position=cache_position,
+                 **kwargs,
+             )
+
+         hidden_states = self.norm(hidden_states)
+         return BaseModelOutputWithPast(
+             last_hidden_state=hidden_states,
+             past_key_values=past_key_values,
+         )
219
+
220
+ ```
221
+
222
+ ### Rotary Position Embeddings (RoPE)
223
+
224
+ Rotary Position Embeddings (RoPE) replace learned positional encodings by rotating query and key vectors using precomputed sin/cos frequencies:
225
+
226
+ ```py
227
+ def apply_rope(x, cos, sin):
+     x1, x2 = x[..., ::2], x[..., 1::2]
+     y1 = x1 * cos - x2 * sin
+     y2 = x1 * sin + x2 * cos
+     return torch.stack([y1, y2], dim=-1).flatten(-2)
232
+ ```
233
+
234
+ In transformers, the rotary embeddings are implemented like so:
235
+
236
+ ```py
237
+ from ..llama.modeling_llama import (
+     LlamaDecoderLayer,
+     LlamaModel,
+     LlamaPreTrainedModel,
+     LlamaRotaryEmbedding,
+     apply_rotary_pos_emb,
+     eager_attention_forward,
+ )
+
+
+ class NanoChatRotaryEmbedding(LlamaRotaryEmbedding):
+     pass
+
+
+ def rotate_half(x):
+     """Rotates half the hidden dims of the input with flipped signs for NanoChat."""
+     x1 = x[..., : x.shape[-1] // 2]
+     x2 = x[..., x.shape[-1] // 2 :]
+     return torch.cat((x2, -x1), dim=-1)
256
+ ```
257
+
258
+ `NanoChatRotaryEmbedding` almost entirely inherits from the original Llama series, except for a sign inversion in `rotate_half`.
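The effect of that sign inversion is easier to see on a single two-dimensional pair. The sketch below is plain Python with hypothetical helper names (`rotate`, `dot`) and a made-up frequency `theta`. It assumes the standard Llama helper returns `(-x2, x1)`, so flipping the signs to `(x2, -x1)` amounts to rotating in the opposite direction. In both conventions the query/key dot product depends only on the offset between positions, which is the property RoPE relies on.

```python
import math


def rotate(pair, angle, flipped=False):
    """Rotate a 2-D (x1, x2) pair; flipped=True uses the opposite sign convention."""
    x1, x2 = pair
    c, s = math.cos(angle), math.sin(angle)
    if flipped:
        s = -s  # equivalent to swapping the signs inside rotate_half
    return (x1 * c - x2 * s, x1 * s + x2 * c)


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


theta = 0.3  # hypothetical per-position rotation frequency
q, k = (1.0, 2.0), (0.5, -1.5)

for flipped in (False, True):
    # Position pairs (2, 5) and (10, 13) share the same offset of 3.
    near = dot(rotate(q, 2 * theta, flipped), rotate(k, 5 * theta, flipped))
    far = dot(rotate(q, 10 * theta, flipped), rotate(k, 13 * theta, flipped))
    print(f"flipped={flipped}: {near:.6f} vs {far:.6f}")
```

The practical consequence is that a checkpoint trained with one convention has to be loaded with the same one, which is why the port overrides `rotate_half` rather than reusing the Llama helper unchanged.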
259
+
260
+ ### **QK Normalization**
261
+
262
+ NanoChat applies RMSNorm to queries and keys before computing attention to stabilize training.
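One way to see why this stabilizes training: once queries and keys are RMS-normalized, each has a length of at most the square root of the head dimension, so the scaled attention logit is bounded no matter how large the raw activations grow. The sketch below checks that bound in plain Python. The helper names and the deliberately oversized random activations are our own illustrative assumptions.

```python
import math
import random


def rms_norm(x, eps=1e-6):
    """Parameter-free RMS normalization of a single vector."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms for v in x]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


head_dim = 128
random.seed(0)
# Deliberately large activations, as might appear late in an unstable run
q = [random.gauss(0.0, 50.0) for _ in range(head_dim)]
k = [random.gauss(0.0, 50.0) for _ in range(head_dim)]

scale = 1.0 / math.sqrt(head_dim)
raw_logit = dot(q, k) * scale
normed_logit = dot(rms_norm(q), rms_norm(k)) * scale

print("raw logit:   ", raw_logit)
print("normed logit:", normed_logit)
print("bound:       ", math.sqrt(head_dim))
```

By the Cauchy-Schwarz inequality the normalized logit can never exceed `sqrt(head_dim)` in magnitude, which keeps the softmax from saturating.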
263
+
264
+ In the original gpt.py, this is achieved via a functional norm helper applied directly inside the attention forward pass:
265
+
266
+ ```py
267
+ def norm(x):
+     # Purely functional rmsnorm with no learnable params
+     return F.rms_norm(x, (x.size(-1),))
+
+ class CausalSelfAttention(nn.Module):
+     ...
+     def forward(self, x, cos_sin, kv_cache):
+         B, T, C = x.size()
+
+         # Project the input to get queries, keys, and values
+         q = self.c_q(x).view(B, T, self.n_head, self.head_dim)
+         k = self.c_k(x).view(B, T, self.n_kv_head, self.head_dim)
+         v = self.c_v(x).view(B, T, self.n_kv_head, self.head_dim)
+
+         # Apply Rotary Embeddings to queries and keys to get relative positional encoding
+         cos, sin = cos_sin
+         q, k = apply_rotary_emb(q, cos, sin), apply_rotary_emb(k, cos, sin)  # QK rotary embedding
+         q, k = norm(q), norm(k)  # QK norm
+         q, k, v = q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)  # make head be batch dim, i.e. (B, T, H, D) -> (B, H, T, D)
+         ...
287
+ ```
288
+
289
+ In the modular transformers implementation, we see a fascinating mix of lineages. The `NanoChatRMSNorm` inherits directly from `Llama4TextL2Norm`, while the attention mechanism inherits from `Qwen3Attention`. We simply inject the QK normalization into the Qwen3 logic:
290
+
291
+ ```py
292
+
293
+ class NanoChatRMSNorm(Llama4TextL2Norm):
+     pass
+
+ class NanoChatAttention(Qwen3Attention):
+     def __init__(self, config: NanoChatConfig, layer_idx: int):
+         super().__init__(config, layer_idx)
+         del self.sliding_window
+         del self.layer_type
+
+         self.q_norm = NanoChatRMSNorm(eps=config.rms_norm_eps)
+         self.k_norm = NanoChatRMSNorm(eps=config.rms_norm_eps)
+
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         past_key_values: Optional[Cache] = None,
+         cache_position: Optional[torch.LongTensor] = None,
+         **kwargs: Unpack[TransformersKwargs],
+     ) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
+         input_shape = hidden_states.shape[:-1]
+         hidden_shape = (*input_shape, -1, self.head_dim)
+
+         query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+         key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+         value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+
+         cos, sin = position_embeddings
+         query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
+
+         # RoPE -> Norm (instead of usual Norm -> RoPE)
+         query_states = self.q_norm(query_states)
+         key_states = self.k_norm(key_states)
+
+         if past_key_values is not None:
+             # sin and cos are specific to RoPE models; cache_position needed for the static cache
+             cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+             key_states, value_states = past_key_values.update(key_states, value_states, self.layer_idx, cache_kwargs)
+
+         attention_interface: Callable = eager_attention_forward
+         if self.config._attn_implementation != "eager":
+             attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
+
+         attn_output, attn_weights = attention_interface(
+             self,
+             query_states,
+             key_states,
+             value_states,
+             attention_mask,
+             dropout=0.0 if not self.training else self.attention_dropout,
+             scaling=self.scaling,
+             **kwargs,
+         )
+
+         attn_output = attn_output.reshape(*input_shape, -1).contiguous()
+         attn_output = self.o_proj(attn_output)
+         return attn_output, attn_weights
351
+ ```
352
+
353
+ ### **Untied Weights**
354
+
355
+ Karpathy's implementation deliberately unties the weights between the token embedding and the language model head to provide the model with more flexibility. In gpt.py, these are initialized as two completely separate modules:
356
+
357
+ ```py
358
+ class GPT(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.config = config
+         self.transformer = nn.ModuleDict({
+             "wte": nn.Embedding(config.vocab_size, config.n_embd),
+             "h": nn.ModuleList([Block(config, layer_idx) for layer_idx in range(config.n_layer)]),
+         })
+         self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
+         # ... (rest of init)
368
+ ```
369
+
370
+ In the modular implementation, we inherit from `Gemma2ForCausalLM`. This is a powerful simplification—Gemma 2 also supports untied weights and advanced output structures. By simply inheriting the class, we pull in all the necessary machinery for causal generation, while the configuration object (defined elsewhere) ensures the weights remain untied:
371
+
372
+ ```py
373
+ class NanoChatForCausalLM(Gemma2ForCausalLM):
+     def forward(self, **super_kwargs) -> CausalLMOutputWithPast:
+         # Delegate entirely to the inherited Gemma 2 causal LM forward pass
+         return super().forward(**super_kwargs)
376
+ ```
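The cost of untying is easy to quantify, because the input embedding and the output head are each a `vocab_size × n_embd` matrix. The following back-of-the-envelope sketch uses a hypothetical vocabulary size of 65,536 (an assumption for illustration; check your own tokenizer) together with the 1,280-dimensional model from the depth-20 configuration described earlier. The function name `embedding_params` is ours.

```python
def embedding_params(vocab_size: int, n_embd: int, tied: bool) -> int:
    """Parameters spent on the token embedding plus the LM head (no bias)."""
    input_embedding = vocab_size * n_embd
    # With tied weights the head reuses the embedding matrix and adds nothing.
    output_head = 0 if tied else n_embd * vocab_size
    return input_embedding + output_head


vocab_size = 65_536  # hypothetical value, for illustration only
n_embd = 1_280       # depth 20 x 64, from the configuration section

tied = embedding_params(vocab_size, n_embd, tied=True)
untied = embedding_params(vocab_size, n_embd, tied=False)
print(f"tied:   {tied:,}")
print(f"untied: {untied:,}")
```

Untying doubles the parameters spent at the two ends of the network in exchange for letting the input and output representations specialize independently.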
377
+
379
+
380
+ ### **ReLU² Activation**
381
+
382
+ The original implementation replaces the standard GELU activation with ReLU², which is simply ReLU squared. This provides a faster alternative without performance loss. In gpt.py, this is hardcoded into the MLP block:
383
+
384
+ ```py
385
+ class MLP(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=False)
+         self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=False)
+     def forward(self, x):
+         x = self.c_fc(x)
+         x = F.relu(x).square()
+         x = self.c_proj(x)
+         return x
395
+ ```
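For readers who want to see the activation in isolation, here is a scalar version in plain Python. The function names are ours for illustration; the tensor version in the block above applies the same rule elementwise. Unlike GELU it needs no `erf` or `tanh`, only a comparison and a multiplication.

```python
def relu_squared(x: float) -> float:
    """ReLU followed by squaring: max(0, x) ** 2."""
    return max(0.0, x) ** 2


def relu_squared_grad(x: float) -> float:
    """Derivative of ReLU squared: 2 * x for positive inputs, otherwise 0."""
    return 2.0 * x if x > 0 else 0.0


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu2={relu_squared(x):.2f}  grad={relu_squared_grad(x):.2f}")
```

Note that the gradient is continuous at zero, whereas plain ReLU has a jump there.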
396
+
397
+ In the modular file, we see another surprising inheritance: `CLIPMLP`. The CLIP architecture uses a structure that fits our needs perfectly, so we inherit the structural definition from CLIP and let the configuration drive the specific activation function (ReLU²):
398
+
399
+ ```py
400
+ class NanoChatMLP(CLIPMLP):
+     def __init__(self, config):
+         super().__init__(config)
+         self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
+         self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size, bias=False)
405
+ ```
406
+
407
+ ### **Multi-Query Attention (MQA)**
408
+
409
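+ NanoChat's attention code lets several query heads share each key/value head, which reduces the memory footprint of the KV cache. Strictly speaking, Multi-Query Attention (MQA) is the special case with a single key/value head; with more than one, the scheme is Grouped-Query Attention (GQA), which is the name the code below uses. In either case the number of query heads must be a multiple of the number of key/value heads.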
410
+
411
+ In gpt.py, this logic is handled by passing distinct head counts and relying on PyTorch's functional attention to handle the broadcasting (or explicitly handling it during inference):
412
+
413
+ ```py
414
+ class CausalSelfAttention(nn.Module):
+     # ...
+     def forward(self, x, cos_sin, kv_cache):
+         # ...
+         # Attention: queries attend to keys/values autoregressively. A few cases to handle:
+         enable_gqa = self.n_head != self.n_kv_head  # Group Query Attention (GQA): duplicate key/value heads to match query heads if desired
+         if kv_cache is None or Tq == Tk:
+             # During training (no KV cache), attend as usual with causal attention
+             # And even if there is KV cache, we can still use this simple version when Tq == Tk
+             y = F.scaled_dot_product_attention(q, k, v, is_causal=True, enable_gqa=enable_gqa)
+         elif Tq == 1:
+             # During inference but with a single query in this forward pass:
+             # The query has to attend to all the keys/values in the cache
+             y = F.scaled_dot_product_attention(q, k, v, is_causal=False, enable_gqa=enable_gqa)
+         else:
+             # During inference AND we have a chunk of queries in this forward pass:
+             # First, each query attends to all the cached keys/values (i.e. full prefix)
+             attn_mask = torch.zeros((Tq, Tk), dtype=torch.bool, device=q.device)  # True = keep, False = mask
+             prefix_len = Tk - Tq
+             if prefix_len > 0:  # can't be negative but could be zero
+                 attn_mask[:, :prefix_len] = True
+             # Then, causal attention within this chunk
+             attn_mask[:, prefix_len:] = torch.tril(torch.ones((Tq, Tq), dtype=torch.bool, device=q.device))
+             y = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask, enable_gqa=enable_gqa)
+         # ...
439
+ ```
440
+
442
+
443
+ In `modular_nanochat.py`, we don't need to write this logic at all. As seen in the QK Normalization section above, `NanoChatAttention` inherits from `Qwen3Attention`. The Qwen3 implementation is robust and fully supports GQA/MQA out of the box. By using this parent class, we get production-grade attention implementation "for free," allowing us to focus solely on the unique normalizations required by NanoChat.
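What the parent class handles for us can be summarized as a head-sharing rule: consecutive groups of query heads read from the same key/value head. The sketch below spells that rule out in plain Python with hypothetical head counts (8 query heads, 4 key/value heads) and helper names of our own. It also estimates how the KV cache shrinks, assuming the cache stores one key and one value vector per layer, position, and key/value head.

```python
def kv_head_for_query_head(q_head: int, n_head: int, n_kv_head: int) -> int:
    """Index of the key/value head that a given query head attends with."""
    if n_head % n_kv_head != 0:
        raise ValueError("n_head must be a multiple of n_kv_head")
    group_size = n_head // n_kv_head
    return q_head // group_size


def kv_cache_elements(n_kv_head: int, head_dim: int, seq_len: int, n_layer: int) -> int:
    """Number of cached values: keys plus values for every layer and position."""
    return 2 * n_layer * seq_len * n_kv_head * head_dim


n_head, n_kv_head = 8, 4  # hypothetical head counts, for illustration only
mapping = [kv_head_for_query_head(h, n_head, n_kv_head) for h in range(n_head)]
print("query head -> kv head:", mapping)

full = kv_cache_elements(n_head, head_dim=128, seq_len=2048, n_layer=20)
grouped = kv_cache_elements(n_kv_head, head_dim=128, seq_len=2048, n_layer=20)
print("cache size ratio:", grouped / full)
```

Setting `n_kv_head=1` in this sketch gives the MQA case, where every query head shares one key/value head.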
444
+
445
+ ## Conclusion
446
+
447
+ Andrej Karpathy’s implementation clearly offers far more to learn from line by line than the `transformers` version, which inherits almost entirely from existing models and features. That said, the inherited modular implementation teaches something different: models like Llama, Llama4, Gemma2, Qwen3, and CLIP are all reused to create a genuinely canonical implementation of a transformer.
448
+
449
+ ## Use Nanochat in Transformers
450
+
451
+ If you’d like to try out your own nanochat models in `transformers`, follow these steps:
452
+
453
+ 1. Download the nanochat-d34 checkpoint
454
+
455
+ ```
456
+ hf download karpathy/nanochat-d34 --local-dir nanochat-d34
457
+ ```
458
+
459
+ 2. Convert the checkpoint to transformers format
460
+
461
+ ```
462
+ uv run \
463
+ --with "transformers @ git+https://github.com/huggingface/transformers.git@nanochat-implementation" \
464
+ --with "tiktoken>=0.12.0" \
465
+ https://raw.githubusercontent.com/huggingface/transformers/nanochat-implementation/src/transformers/models/nanochat/convert_nanochat_checkpoints.py \
466
+ --input_dir ./nanochat-d34 \
467
+ --output_dir ./nanochat-d3-hf
468
+ ```
469
+
470
+ 3. (optional) Upload the checkpoint to the Hugging Face Hub
471
+
472
+ ```
473
+ hf upload <username>/nanochat-d34 ./nanochat-d3-hf
474
+ ```
475
+
476
+ 4. Test the model
477
+
478
+ ```py
479
+ import torch
480
+ from transformers import AutoTokenizer, NanoChatForCausalLM
481
+
482
+ tokenizer = AutoTokenizer.from_pretrained("./nanochat-d3-hf")
483
+ model = NanoChatForCausalLM.from_pretrained("./nanochat-d3-hf")
484
+
485
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
486
+ model = model.to(device)
487
+
488
+ prompt = "Hello, how are you?"
489
+ inputs = tokenizer(prompt, return_tensors="pt").to(device)
490
+ inputs.pop("token_type_ids", None)
491
+ outputs = model.generate(**inputs, max_new_tokens=100)
492
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
493
+ ```
494
+
495
+ ## Notebooks
496
+
497
+ If you want to train with these models, you can use these Colab notebooks:
498
 
499
+ - [SFT](https://colab.research.google.com/#fileId=https%3A//huggingface.co/datasets/nanochat-students/notebooks/blob/main/sft.ipynb)
500
+ - [GRPO](https://colab.research.google.com/#fileId=https%3A//huggingface.co/datasets/nanochat-students/notebooks/blob/main/grpo.ipynb)
_README.md ADDED
@@ -0,0 +1,122 @@
1
+ ---
2
+ title: 'Bringing paper to life: A modern template for scientific writing'
3
+ short_desc: 'A practical journey behind training SOTA LLMs'
4
+ emoji: 📝
5
+ colorFrom: blue
6
+ colorTo: indigo
7
+ sdk: docker
8
+ pinned: false
9
+ header: mini
10
+ app_port: 8080
11
+ tags:
12
+ - research-article-template
13
+ - research paper
14
+ - scientific paper
15
+ - data visualization
16
+ thumbnail: https://HuggingFaceTB-smol-training-playbook.hf.space/thumb.png
17
+ ---
18
+ <div align="center">
19
+
20
+ # Research Article Template
21
+
22
+ [![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)
23
+ [![Node.js Version](https://img.shields.io/badge/node-%3E%3D20.0.0-brightgreen.svg)](https://nodejs.org/)
24
+ [![Astro](https://img.shields.io/badge/Astro-4.10.0-orange.svg)](https://astro.build/)
25
+ [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/tfrere/research-article-template)
26
+
27
+
28
+ **A modern, interactive template for scientific writing** that brings papers to life with web-native features. The web offers what static PDFs can't: **interactive diagrams**, **progressive notation**, and **exploratory views** that show how ideas behave. This template treats interactive artifacts—figures, math, code, and inspectable experiments—as **first-class** alongside prose, helping readers **build intuition** instead of skimming results—all with **minimal setup** and no web knowledge required.
29
+
30
+ **[Try the live demo & documentation →](https://huggingface.co/spaces/tfrere/research-article-template)**
31
+
32
+ </div>
33
+
34
+ ## 🚀 Quick Start
35
+
36
+ ### Option 1: Duplicate on Hugging Face (Recommended)
37
+
38
+ 1. Visit **[🤗 Research Article Template](https://huggingface.co/spaces/tfrere/research-article-template)**
39
+ 2. Click **"Duplicate this Space"**
40
+ 3. Clone your new repository:
41
+ ```bash
42
+ git clone git@hf.co:spaces/<your-username>/<your-space>
43
+ cd <your-space>
44
+ ```
45
+
46
+ ### Option 2: Clone Directly
47
+
48
+ ```bash
49
+ git clone https://github.com/tfrere/research-article-template.git
50
+ cd research-article-template
51
+ ```
52
+
53
+ ### Installation
54
+
55
+ ```bash
56
+ # Install Node.js 20+ (use nvm for version management)
57
+ nvm install 20
58
+ nvm use 20
59
+
60
+ # Install Git LFS and pull assets
61
+ git lfs install
62
+ git lfs pull
63
+
64
+ # Install dependencies
65
+ cd app
66
+ npm install
67
+
68
+ # Start development server
69
+ npm run dev
70
+ ```
71
+
72
+ Visit `http://localhost:4321` to see your site!
73
+
74
+ ## 🎯 Who This Is For
75
+
76
+ - **Scientists** writing modern, web-native research papers
77
+ - **Educators** creating interactive, explorable lessons
78
+ - **Researchers** who want to focus on ideas, not infrastructure
79
+ - **Anyone** who values clear, engaging technical communication
80
+
81
+ ## 🌟 Inspired by Distill
82
+
83
+ This template carries forward the spirit of [Distill](https://distill.pub/) (2016–2021), pushing interactive scientific writing even further with:
84
+ - Accessible, high-quality explanations
85
+ - Reproducible, production-ready demos
86
+ - Modern web technologies and best practices
87
+
88
+ ## 🤝 Contributing
89
+
90
+ We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details.
91
+
92
+ ### Ways to Contribute
93
+
94
+ - **Report bugs** - Open an issue with detailed information
95
+ - **Suggest features** - Share ideas for improvements
96
+ - **Improve documentation** - Help others get started
97
+ - **Submit code** - Fix bugs or add features
98
+ - **Join discussions** - Share feedback and ideas
99
+
100
+ ## 📄 License
101
+
102
+ This project is licensed under the [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).
103
+
104
+ - **Diagrams and text**: CC-BY 4.0
105
+ - **Source code**: Available on [Hugging Face](https://huggingface.co/spaces/tfrere/research-article-template)
106
+ - **Third-party figures**: Excluded and marked in captions
107
+
108
+ ## 🙏 Acknowledgments
109
+
110
+ - Inspired by [Distill](https://distill.pub/) and the interactive scientific writing movement
111
+ - Built with [Astro](https://astro.build/), [MDX](https://mdxjs.com/), and modern web technologies
112
+ - Community feedback and contributions from researchers worldwide
113
+
114
+ ## 📞 Support
115
+
116
+ - **[Community Discussions](https://huggingface.co/spaces/tfrere/research-article-template/discussions)** - Ask questions and share ideas
117
+ - **[Report Issues](https://huggingface.co/spaces/tfrere/research-article-template/discussions?status=open&type=issue)** - Bug reports and feature requests
118
+ - **Contact**: [@tfrere](https://huggingface.co/tfrere) on Hugging Face
119
+
120
+ ---
121
+
122
+ **Made with ❤️ for the scientific community**