In a Training Loop 🔄
codelion
AI & ML interests
Creator of OptiLLM, OpenEvolve, Adaptive Classifier, and Ellora. Pioneering a new category in AI infrastructure: inference-time compute for LLMs.
Recent Activity
Reacted to their post with 🤗 2 days ago: Scaling Pedagogical Pre-training to 10 Billion Tokens
New blog post exploring what happens when you take optimal data mixing insights and scale up the data generation itself.
We built Sutra, a multi-stage framework for generating pedagogical pre-training data guided by a knowledge graph of ~2,000 concepts across 9 domains. The pipeline includes structured content generation, six-dimension quality evaluation, diversity management across 20 content styles, and a cleaning stage to prevent collapse.
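To make the quality-evaluation step concrete, here is a purely illustrative gate that averages six per-dimension scores and drops anything below a threshold; the dimension names and the 0.7 cutoff are hypothetical and are not the criteria Sutra actually uses.

```python
from statistics import mean

# Hypothetical dimension names; the real evaluator's criteria are not shown here
DIMENSIONS = ["accuracy", "clarity", "depth", "coherence", "pedagogy", "coverage"]

def passes_quality_gate(scores: dict[str, float], threshold: float = 0.7) -> bool:
    """Keep an entry only if every dimension is scored and the mean clears the threshold."""
    if any(d not in scores for d in DIMENSIONS):
        return False
    return mean(scores[d] for d in DIMENSIONS) >= threshold

# Example: an entry scoring 0.8 on every dimension passes the gate
print(passes_quality_gate({d: 0.8 for d in DIMENSIONS}))  # True
```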
The result is https://huggingface.co/datasets/codelion/sutra-10B, a 10.2 billion token pedagogical dataset with rich metadata (domain, complexity, prerequisites, quality scores) on every entry.
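A minimal sketch of loading the dataset and filtering on its metadata with the `datasets` library; the column names used below (`domain`, `quality_score`) are assumptions based on the metadata fields listed above, so check the dataset card for the actual schema.

```python
from datasets import load_dataset

# Stream the dataset so the 10.2B-token corpus is not downloaded up front
ds = load_dataset("codelion/sutra-10B", split="train", streaming=True)

# Peek at one entry to see the available fields (schema assumed, not confirmed)
example = next(iter(ds))
print(sorted(example.keys()))

# Keep only high-quality mathematics entries, assuming these metadata columns exist
filtered = ds.filter(
    lambda ex: ex.get("domain") == "mathematics" and ex.get("quality_score", 0.0) >= 0.8
)
```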
We trained https://huggingface.co/codelion/SmolLM2-70M on it for 3 full epochs (30.6B tokens) on a single A10 GPU in ~78 hours.
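For readers who want to reproduce something similar, here is a rough causal-LM training sketch with `transformers`; the hyperparameters, sequence length, and the `text` column name are placeholders rather than the settings of the actual 78-hour run.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "codelion/SmolLM2-70M"  # the checkpoint referenced above
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

# A small slice for illustration; the full run used the whole 10.2B-token dataset
raw = load_dataset("codelion/sutra-10B", split="train[:1%]")

def tokenize(batch):
    # "text" as the content column is an assumption about the dataset schema
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

args = TrainingArguments(
    output_dir="sutra-smollm2-70m",
    num_train_epochs=3,                 # matches the 3-epoch setup described above
    per_device_train_batch_size=8,      # placeholder; tune for a 24 GB A10
    gradient_accumulation_steps=8,
    learning_rate=3e-4,                 # placeholder
    bf16=True,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```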
Key finding: perplexity kept improving across epochs, but benchmark gains plateaued fast. At 70M parameters, the model hits a representational ceiling that more data alone can't break through.
Full writeup with comparisons against 7 other datasets, detailed benchmark breakdowns, and connections to recent work on synthetic data scaling, curriculum learning, and data mixing laws: https://huggingface.co/blog/codelion/scaling-pedagogical-pretraining-10-billion-tokens
All datasets at multiple scales (10M, 100M, 1B, 10B) plus seed concepts and an SFT variant are in the Sutra Pedagogical Datasets collection.
Articles
Scaling Pedagogical Pre-training: From Optimal Mixing to 10 Billion Tokens
Reverse Engineering a $500M Mystery: From HashHop to Memory-Augmented Language Models
The Optimal Architecture for Small Language Models
Ellora: Enhancing LLMs with LoRA - Standardized Recipes for Capability Enhancement
The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix
Building Enterprise-Ready Text Classifiers in Minutes with Adaptive Learning
Unsupervised Model Improvement via Internal Coherence Maximization: Outperforming Human-Supervised Methods Through Self-Elicitation
Understanding Model Reasoning Through Thought Anchors: A Comparative Study of Qwen3 and DeepSeek-R1
Automated Discovery of High-Performance GPU Kernels with OpenEvolve
Adaptive Classifier: Dynamic Text Classification with Continuous Learning
System Prompt Learning: Teaching LLMs to Learn Problem-Solving Strategies from Experience
AutoThink: Adaptive Reasoning for Large Language Models
OpenEvolve: An Open Source Implementation of Google DeepMind's AlphaEvolve
Introducing Pivotal Token Search (PTS): Targeting Critical Decision Points in LLM Training