In a Training Loop 🔄
codelion
AI & ML interests
Creator of OptiLLM, OpenEvolve, Adaptive Classifier, and Ellora. Pioneering a new category in AI infrastructure: inference-time compute for LLMs.
Recent Activity
Reacted to their post with 🤗 2 days ago: Scaling Pedagogical Pre-training to 10 Billion Tokens
New blog post exploring what happens when you take optimal data mixing insights and scale up the data generation itself.
We built Sutra, a multi-stage framework for generating pedagogical pre-training data guided by a knowledge graph of ~2,000 concepts across 9 domains. The pipeline includes structured content generation, six-dimension quality evaluation, diversity management across 20 content styles, and a cleaning stage to prevent collapse.
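To make the quality-evaluation step concrete, here is a purely illustrative gate that averages six per-dimension scores and drops anything below a threshold; the dimension names and the 0.7 cutoff are hypothetical and are not the criteria Sutra actually uses.

```python
from statistics import mean

# Hypothetical dimension names; the real evaluator's criteria are not shown here
DIMENSIONS = ["accuracy", "clarity", "depth", "coherence", "pedagogy", "coverage"]

def passes_quality_gate(scores: dict[str, float], threshold: float = 0.7) -> bool:
    """Keep an entry only if every dimension is scored and the mean clears the threshold."""
    if any(d not in scores for d in DIMENSIONS):
        return False
    return mean(scores[d] for d in DIMENSIONS) >= threshold

# Example: an entry scoring 0.8 on every dimension passes the gate
print(passes_quality_gate({d: 0.8 for d in DIMENSIONS}))  # True
```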
The result is https://huggingface.co/datasets/codelion/sutra-10B, a 10.2 billion token pedagogical dataset with rich metadata (domain, complexity, prerequisites, quality scores) on every entry.
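A minimal sketch of loading the dataset and filtering on its metadata with the `datasets` library; the column names used below (`domain`, `quality_score`) are assumptions based on the metadata fields listed above, so check the dataset card for the actual schema.

```python
from datasets import load_dataset

# Stream the dataset so the 10.2B-token corpus is not downloaded up front
ds = load_dataset("codelion/sutra-10B", split="train", streaming=True)

# Peek at one entry to see the available fields (schema assumed, not confirmed)
example = next(iter(ds))
print(sorted(example.keys()))

# Keep only high-quality mathematics entries, assuming these metadata columns exist
filtered = ds.filter(
    lambda ex: ex.get("domain") == "mathematics" and ex.get("quality_score", 0.0) >= 0.8
)
```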
We trained https://huggingface.co/codelion/SmolLM2-70M on it for 3 full epochs (30.6B tokens) on a single A10 GPU in ~78 hours.
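For readers who want to reproduce something similar, here is a rough causal-LM training sketch with `transformers`; the hyperparameters, sequence length, and the `text` column name are placeholders rather than the settings of the actual 78-hour run.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "codelion/SmolLM2-70M"  # the checkpoint referenced above
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

# A small slice for illustration; the full run used the whole 10.2B-token dataset
raw = load_dataset("codelion/sutra-10B", split="train[:1%]")

def tokenize(batch):
    # "text" as the content column is an assumption about the dataset schema
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

args = TrainingArguments(
    output_dir="sutra-smollm2-70m",
    num_train_epochs=3,                 # matches the 3-epoch setup described above
    per_device_train_batch_size=8,      # placeholder; tune for a 24 GB A10
    gradient_accumulation_steps=8,
    learning_rate=3e-4,                 # placeholder
    bf16=True,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```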
Key finding: perplexity kept improving across epochs, but benchmark gains plateaued fast. At 70M parameters, the model hits a representational ceiling that more data alone can't break through.
Full writeup with comparisons against 7 other datasets, detailed benchmark breakdowns, and connections to recent work on synthetic data scaling, curriculum learning, and data mixing laws: https://huggingface.co/blog/codelion/scaling-pedagogical-pretraining-10-billion-tokens
All datasets at multiple scales (10M, 100M, 1B, 10B) plus seed concepts and an SFT variant are in the Sutra Pedagogical Datasets collection.
Articles
Scaling Pedagogical Pre-training: From Optimal Mixing to 10 Billion Tokens
Reverse Engineering a $500M Mystery: From HashHop to Memory-Augmented Language Models
The Optimal Architecture for Small Language Models
Ellora: Enhancing LLMs with LoRA - Standardized Recipes for Capability Enhancement
The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix
Building Enterprise-Ready Text Classifiers in Minutes with Adaptive Learning
Unsupervised Model Improvement via Internal Coherence Maximization: Outperforming Human-Supervised Methods Through Self-Elicitation
Understanding Model Reasoning Through Thought Anchors: A Comparative Study of Qwen3 and DeepSeek-R1
Automated Discovery of High-Performance GPU Kernels with OpenEvolve
Adaptive Classifier: Dynamic Text Classification with Continuous Learning
System Prompt Learning: Teaching LLMs to Learn Problem-Solving Strategies from Experience
AutoThink: Adaptive Reasoning for Large Language Models
OpenEvolve: An Open Source Implementation of Google DeepMind's AlphaEvolve
Introducing Pivotal Token Search (PTS): Targeting Critical Decision Points in LLM Training