xrsrke/link_nanotron_fp8_appexdix
#21 opened by neuralink

- dist/bibliography.bib (+6 -0)
- dist/index.html (+16 -1)
- src/bibliography.bib (+6 -0)
- src/index.html (+16 -1)
dist/bibliography.bib CHANGED

@@ -510,4 +510,10 @@ url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
 archivePrefix={arXiv},
 primaryClass={cs.LG},
 url={https://arxiv.org/abs/2309.14322},
 }
+@software{nanotronfp8,
+title = {nanotron's FP8 implementation},
+author = {nanotron},
+url = {https://github.com/huggingface/nanotron/pull/70},
+year = {2024}
+}
dist/index.html CHANGED

@@ -2215,7 +2215,7 @@
 </tbody>
 </table>

-<p>Overall, FP8 is still an experimental technique and methods are evolving, but will likely become the standard soon replacing bf16 mixed-precision. To follow public implementations of this, please head to the nanotron’s implementation
+<p>Overall, FP8 is still an experimental technique and methods are evolving, but will likely become the standard soon replacing bf16 mixed-precision. To follow public implementations of this, please head to the nanotron’s implementation<d-cite bibtex-key="nanotronfp8"></d-cite>. </p>

 <p>In the future, Blackwell, the next generation of NVIDIA chips, <a href="https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/">have been announced </a> to support FP4 training, further speeding up training but without a doubt also introducing a new training stability challenge.</p>

@@ -2382,6 +2382,16 @@
 <p>Training language models across compute clusters with DiLoCo.</p>
 </div>

+<div>
+<a href="https://github.com/kakaobrain/torchgpipe"><strong>torchgpipe</strong></a>
+<p>A GPipe implementation in PyTorch.</p>
+</div>
+
+<div>
+<a href="https://github.com/EleutherAI/oslo"><strong>OSLO</strong></a>
+<p>OSLO: Open Source for Large-scale Optimization.</p>
+</div>
+
 <h3>Debugging</h3>

 <div>

@@ -2499,6 +2509,11 @@
 <a href="https://www.harmdevries.com/post/context-length/"><strong>Harm's blog for long context</strong></a>
 <p>Investigation into long context training in terms of data and training cost.</p>
 </div>
+
+<div>
+<a href="https://github.com/tunib-ai/large-scale-lm-tutorials"><strong>TunibAI's 3D parallelism tutorial</strong></a>
+<p>Large-scale language modeling tutorials with PyTorch.</p>
+</div>

 <h2>Appendix</h2>
src/bibliography.bib CHANGED

@@ -510,4 +510,10 @@ url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
 archivePrefix={arXiv},
 primaryClass={cs.LG},
 url={https://arxiv.org/abs/2309.14322},
 }
+@software{nanotronfp8,
+title = {nanotron's FP8 implementation},
+author = {nanotron},
+url = {https://github.com/huggingface/nanotron/pull/70},
+year = {2024}
+}
src/index.html CHANGED

@@ -2215,7 +2215,7 @@
 </tbody>
 </table>

-<p>Overall, FP8 is still an experimental technique and methods are evolving, but will likely become the standard soon replacing bf16 mixed-precision. To follow public implementations of this, please head to the nanotron’s implementation
+<p>Overall, FP8 is still an experimental technique and methods are evolving, but will likely become the standard soon replacing bf16 mixed-precision. To follow public implementations of this, please head to the nanotron’s implementation<d-cite bibtex-key="nanotronfp8"></d-cite>. </p>

 <p>In the future, Blackwell, the next generation of NVIDIA chips, <a href="https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/">have been announced </a> to support FP4 training, further speeding up training but without a doubt also introducing a new training stability challenge.</p>

@@ -2381,6 +2381,16 @@
 <a href="https://github.com/PrimeIntellect-ai/OpenDiLoCo"><strong>DiLoco</strong></a>
 <p>Training language models across compute clusters with DiLoCo.</p>
 </div>

+<div>
+<a href="https://github.com/kakaobrain/torchgpipe"><strong>torchgpipe</strong></a>
+<p>torchgpipe: On-the-fly Pipeline Parallelism for Training Giant Models.</p>
+</div>
+
+<div>
+<a href="https://github.com/EleutherAI/oslo"><strong>OSLO</strong></a>
+<p>OSLO: Open Source for Large-scale Optimization.</p>
+</div>
+
 <h3>Debugging</h3>

@@ -2499,6 +2509,11 @@
 <a href="https://www.harmdevries.com/post/context-length/"><strong>Harm's blog for long context</strong></a>
 <p>Investigation into long context training in terms of data and training cost.</p>
 </div>
+
+<div>
+<a href="https://github.com/tunib-ai/large-scale-lm-tutorials"><strong>TunibAI's 3D parallelism tutorial</strong></a>
+<p>Large-scale language modeling tutorials with PyTorch.</p>
+</div>

 <h2>Appendix</h2>
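For context on what the index.html change does: the article is built on a distill-style template in which a <d-cite bibtex-key="..."> tag in the body is resolved against an entry in bibliography.bib, so the new nanotronfp8 BibTeX entry is what makes the added citation render as a numbered reference. Below is a minimal sketch of that mechanism; only the <d-cite bibtex-key="nanotronfp8"> tag and the BibTeX entry come from the diff above, while the script URL and the <d-bibliography> wiring are assumptions about the surrounding page, not part of this PR.

<!-- Sketch only: assumes a distill-style page; verify against the real index.html. -->
<!doctype html>
<html>
<head>
  <!-- Assumed template script; the repository may ship its own fork of it. -->
  <script src="https://distill.pub/template.v2.js"></script>
</head>
<body>
<d-article>
  <!-- The added citation: bibtex-key must match the key of the new bib entry. -->
  <p>To follow public implementations of FP8 training, see the nanotron
     implementation<d-cite bibtex-key="nanotronfp8"></d-cite>.</p>
</d-article>
<!-- The template loads bibliography.bib (which now contains
     @software{nanotronfp8, ...}) and renders the reference list. -->
<d-bibliography src="bibliography.bib"></d-bibliography>
</body>
</html>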