SmolLM2-135M-Dissecting

A custom implementation of the SmolLM2-135M language model architecture, trained from scratch for educational purposes. This project demonstrates building a transformer-based language model with 147.8M parameters.

Model Description

This is a custom implementation that mimics the SmolLM2-135M architecture. It was built from scratch to understand the inner workings of small language models and includes:

  • Custom transformer blocks with multi-head attention
  • Rotary Position Embeddings (RoPE)
  • SwiGLU activation functions
  • Layer normalization and residual connections

Note: This is an educational implementation trained on a small dataset. For production use, consider the official HuggingFaceTB/SmolLM2-135M model.
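The SwiGLU feed-forward block listed above can be sketched in a few lines of PyTorch. The snippet below is only an illustration using the hidden and intermediate sizes from the specifications further down; the class and attribute names are assumptions and may not match model.py exactly.

import torch.nn as nn
import torch.nn.functional as F

class SwiGLUMLP(nn.Module):
    # Illustrative sketch: down_proj(silu(gate_proj(x)) * up_proj(x))
    def __init__(self, hidden_size=576, intermediate_size=1536):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        # SiLU-gated linear unit followed by a projection back to hidden_size
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))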

Model Details

  • Model Type: Causal Language Model (Decoder-only Transformer)
  • Architecture: Custom SmolLM2-135M implementation
  • Total Parameters: 147,821,184
  • Training Dataset: Custom text dataset (1,115,394 characters)
  • Training Steps: 5,000 steps
  • Language: English
  • License: Apache 2.0

Architecture Specifications

  • Vocabulary Size: 49,152
  • Hidden Size: 576
  • Number of Layers: 30
  • Attention Heads: 9
  • Intermediate Size: 1,536
  • Max Position Embeddings: 2,048
  • Head Dimension: 64
  • Activation Function: SwiGLU
  • Position Embedding: Rotary Position Embedding (RoPE)
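These hyperparameters map naturally onto the ModelConfig object used in the Usage section below. The dataclass here is a sketch of what such a configuration might look like; the field names are assumptions and may differ from those in model.py.

from dataclasses import dataclass

@dataclass
class ModelConfig:
    vocab_size: int = 49152
    hidden_size: int = 576
    num_hidden_layers: int = 30
    num_attention_heads: int = 9
    intermediate_size: int = 1536
    max_position_embeddings: int = 2048
    head_dim: int = 64  # hidden_size // num_attention_heads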

Training Process

Initialization

The training started with model initialization on CPU:

Using device: cpu
Initializing custom model...
Total parameters: 147,821,184
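The reported total is the sum of the element counts of all parameter tensors and can be reproduced with:

total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")  # 147,821,184 for this model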

Dataset Preparation

The official SmolLM2 tokenizer was downloaded and loaded, and the input text was then tokenized:

Loading tokenizer...
tokenizer_config.json: 3.66kB [00:00, 2.50MB/s]
vocab.json: 801kB [00:00, 5.63MB/s]
merges.txt: 466kB [00:00, 5.45MB/s]
tokenizer.json: 2.10MB [00:00, 7.78MB/s]

The training dataset consisted of:

  • 666 chunks of 512 tokens each
  • Batch size: 4
  • Steps per epoch: 167
  • Total training steps: 5,000
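A minimal sketch of this preparation step is shown below, assuming a simple non-overlapping split of the token stream into 512-token chunks; the actual logic in train.py may differ.

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")

with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()  # 1,115,394 characters in this run

token_ids = tokenizer.encode(text)
chunk_size = 512
chunks = [token_ids[i:i + chunk_size]
          for i in range(0, len(token_ids) - chunk_size + 1, chunk_size)]
data = torch.tensor(chunks)  # shape (num_chunks, 512); 666 chunks in this run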

Training Progress

Loss Reduction Over Time

The model showed consistent improvement throughout training:

Step            Loss      Improvement
0 (initial)     N/A       -
500             4.6897    Baseline
1000            4.0074    -14.6%
1500            3.4715    -26.0%
2000            2.8648    -38.8%
2500            2.2658    -51.7%
3000            1.5617    -66.7%
3500            1.0885    -76.8%
4000            0.8004    -82.9%
4500            0.5178    -88.9%
5000 (final)    0.3271    -93.0%

Model Generation Quality Improvement

The model's text generation ability improved significantly:

Step 0 (Before Training):

Generated: What is English Muscle Kelly flossing towardsimatingćBind outrageroutine dreTClywood loudly brightness hardships 

Step 500:

Generated: What is Englishour.
HOLANIO:
My name you
To the king, I'll tell this in theREM;

Step 1000:

Generated: What is English's They knows no their place?
ISABELLA:
Speak me:
I am a grave to the maid and sh son.

Step 2000:

Generated: What is English'd to say theAnd I will come.
KING EDWARD IV:
Go, Warwick, in all my friends, my lords.

Step 5000 (Final):

Generated: What is English quarter
To frame of the people to himself.
CAMILLO:
God and your noble lord,
She does do much need on't.

Loss Convergence

The loss curve showed gradual but steady improvement across roughly 30 epochs (5,000 steps at 167 steps per epoch):

  • Epochs 1-3: Rapid initial decrease from ~9.6 to ~4.7
  • Epochs 4-10: Continued improvement to ~3.9
  • Epochs 11-20: Moderate improvement to ~2.0
  • Epochs 21-30: Final optimization to ~0.3

Model Architecture Verification

After training, the custom model's parameter names (state-dict keys) were compared against those of the official SmolLM2-135M:

Custom model parameters: 364
Official model parameters: 273
Matching parameters: 1
Only in custom: 363
Only in official: 272

Only one parameter name matched: the custom implementation uses its own naming scheme, so weights cannot be mapped one-to-one onto the official checkpoint even though the two models target the same architecture.
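A comparison of this kind can be reproduced with a simple set difference over the two state dicts. The sketch below assumes model is the trained CustomSmolLM instance; it is illustrative and not necessarily the exact script used in this project.

from transformers import AutoModelForCausalLM

official = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")

custom_keys = set(model.state_dict().keys())
official_keys = set(official.state_dict().keys())

print(f"Custom model parameters: {len(custom_keys)}")
print(f"Official model parameters: {len(official_keys)}")
print(f"Matching parameters: {len(custom_keys & official_keys)}")
print(f"Only in custom: {len(custom_keys - official_keys)}")
print(f"Only in official: {len(official_keys - custom_keys)}")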

Usage

Loading the Model

import torch
from model import CustomSmolLM, ModelConfig
from transformers import AutoTokenizer

# Initialize model configuration
config = ModelConfig()

# Load the model
model = CustomSmolLM(config)
model.load_state_dict(torch.load('checkpoints/final_model.pt')['model_state_dict'])
model.eval()

# Load tokenizer (uses official SmolLM2 tokenizer)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")

Text Generation

import torch.nn.functional as F

def generate_text(model, tokenizer, prompt, max_length=50, temperature=0.8):
    model.eval()
    device = next(model.parameters()).device

    # Tokenize prompt
    input_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)

    with torch.no_grad():
        for _ in range(max_length):
            outputs = model(input_ids)
            logits = outputs['logits']
            next_token_logits = logits[:, -1, :] / temperature
            probs = F.softmax(next_token_logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            input_ids = torch.cat([input_ids, next_token], dim=1)

            if next_token.item() == tokenizer.eos_token_id:
                break

    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

# Generate text
generated = generate_text(model, tokenizer, "Once upon a time", max_length=50)
print(generated)

Resuming Training

from train import load_checkpoint

# Resume from a checkpoint
model, checkpoint = load_checkpoint(model, 'checkpoints/checkpoint_step_500.pt')
print(f"Resumed from step {checkpoint['step']}")

Training Configuration

  • Learning Rate: 1e-4
  • Optimizer: AdamW with betas (0.9, 0.95)
  • Weight Decay: 0.1
  • Gradient Clipping: 1.0
  • Batch Size: 4
  • Sequence Length: 512 tokens
  • Checkpoint Frequency: Every 500 steps
  • Device: CPU (GPU recommended for faster training)
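Under these settings, the optimizer setup and the core training step can be sketched as follows. This is an illustrative outline, not a copy of train.py; batches stands in for an iterable of (batch_size, sequence_length) token tensors, and the model is assumed to return a dict with a 'logits' entry, as in the generation code above.

import torch
import torch.nn.functional as F
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.95), weight_decay=0.1)

for step, batch in enumerate(batches):
    outputs = model(batch[:, :-1])                 # predict the next token at each position
    logits = outputs['logits']
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        batch[:, 1:].reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping at 1.0
    optimizer.step()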

Intended Uses

This model is designed for:

  • Educational purposes and understanding transformer architectures
  • Experimenting with small-scale language model training
  • Learning about PyTorch implementation of modern LLM components
  • Demonstrating custom model architecture development

Limitations

  • Trained on a small dataset (1.1M characters), limiting generalization
  • Only 5,000 training steps, far fewer than production models receive
  • No evaluation on standardized benchmarks
  • Architecture has some divergence from official SmolLM2-135M parameter naming
  • Not suitable for production use cases
  • May produce inconsistent or incorrect text

Ethical Considerations

This is an educational model trained on a small dataset. Users should:

  • Not rely on it for factual information
  • Be aware it may generate biased or inappropriate content
  • Use it only for learning and experimentation
  • Consider the official SmolLM2-135M for any serious applications

Citation

If you use this implementation in your research or projects, please cite:

@misc{smollm2-135m-dissecting,
  title={SmolLM2-135M-Dissecting: A Custom Implementation for Educational Purposes},
  author={agileabhi},
  year={2025},
  howpublished={\url{https://huggingface.co/spaces/agileabhi/SmolLM2-135M-Model}}
}

Also consider citing the original SmolLM2 work from Hugging Face.

Acknowledgments

  • Based on the SmolLM2-135M architecture by Hugging Face
  • Uses the official SmolLM2 tokenizer
  • Inspired by modern transformer implementations

Repository Structure

  • model.py: Custom model architecture implementation
  • train.py: Training script with checkpointing and evaluation
  • app.py: Gradio demo interface
  • strip_weights.py: Utility for model weight management
  • upload_to_spaces.py: Hugging Face Spaces deployment script
  • checkpoints/: Model checkpoints saved during training
  • input.txt: Training data file

Contact

For questions or issues, please open an issue on the GitHub repository.
