SmolLM2-135M-Dissecting
A custom implementation of the SmolLM2-135M language model architecture, trained from scratch for educational purposes. This project demonstrates building a transformer-based language model with 147.8M parameters.
Model Description
This is a custom implementation that mimics the SmolLM2-135M architecture. It was built from scratch to understand the inner workings of small language models and includes:
- Custom transformer blocks with multi-head attention
- Rotary Position Embeddings (RoPE)
- SwiGLU activation functions
- Layer normalization and residual connections
Note: This is an educational implementation trained on a small dataset. For production use, consider the official HuggingFaceTB/SmolLM2-135M model.
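As one illustration of these components, the sketch below shows how a SwiGLU feed-forward block is commonly wired in SmolLM2-style models. The class and attribute names are illustrative and may not match the ones used in model.py.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Gated feed-forward block: down( SiLU(gate(x)) * up(x) )."""

    def __init__(self, hidden_size: int = 576, intermediate_size: int = 1536):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: the SiLU-activated gate modulates the up projection
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```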
Model Details
- Model Type: Causal Language Model (Decoder-only Transformer)
- Architecture: Custom SmolLM2-135M implementation
- Total Parameters: 147,821,184
- Training Dataset: Custom text dataset (1,115,394 characters)
- Training Steps: 5,000 steps
- Language: English
- License: Apache 2.0
Architecture Specifications
- Vocabulary Size: 49,152
- Hidden Size: 576
- Number of Layers: 30
- Attention Heads: 9
- Intermediate Size: 1,536
- Max Position Embeddings: 2,048
- Head Dimension: 64
- Activation Function: SwiGLU
- Position Embedding: Rotary Position Embedding (RoPE)
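For reference, these hyperparameters map naturally onto a small configuration object like the ModelConfig used in the usage examples below. This is an illustrative sketch; the actual field names in model.py may differ.

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Values copied from the specification above; field names are illustrative
    vocab_size: int = 49152
    hidden_size: int = 576
    num_hidden_layers: int = 30
    num_attention_heads: int = 9
    intermediate_size: int = 1536
    max_position_embeddings: int = 2048
    head_dim: int = 64  # hidden_size // num_attention_heads
```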
Training Process
Initialization
The training started with model initialization on CPU:
```
Using device: cpu
Initializing custom model...
Total parameters: 147,821,184
```
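The parameter count reported above is typically obtained by summing the element counts of all parameter tensors, for example (a minimal helper; train.py may compute it differently):

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total number of elements across all parameter tensors."""
    return sum(p.numel() for p in model.parameters())

# Usage: print(f"Total parameters: {count_parameters(model):,}")
```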
Dataset Preparation
The tokenizer loaded successfully, and the input text was tokenized:
```
Loading tokenizer...
tokenizer_config.json: 3.66kB [00:00, 2.50MB/s]
vocab.json: 801kB [00:00, 5.63MB/s]
merges.txt: 466kB [00:00, 5.45MB/s]
tokenizer.json: 2.10MB [00:00, 7.78MB/s]
```
The training dataset consisted of:
- 666 chunks of 512 tokens each
- Batch size: 4
- Steps per epoch: 167
- Total training steps: 5,000
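As a rough sketch of how fixed-length chunks like these are typically produced (illustrative only; train.py may implement the chunking differently):

```python
import torch

def chunk_tokens(token_ids: list[int], chunk_size: int = 512) -> torch.Tensor:
    """Split a flat token sequence into full chunks of `chunk_size` tokens,
    dropping any trailing remainder."""
    num_chunks = len(token_ids) // chunk_size
    trimmed = token_ids[: num_chunks * chunk_size]
    return torch.tensor(trimmed, dtype=torch.long).view(num_chunks, chunk_size)

# For this dataset: 666 chunks of 512 tokens, served in batches of 4.
```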
Training Progress
Loss Reduction Over Time
The model showed consistent improvement throughout training. Improvement percentages are relative to the step-500 baseline loss; a short snippet after the table shows the calculation:
| Step | Loss | Improvement |
|---|---|---|
| 0 (initial) | N/A | - |
| 500 | 4.6897 | Baseline |
| 1000 | 4.0074 | -14.6% |
| 1500 | 3.4715 | -26.0% |
| 2000 | 2.8648 | -38.8% |
| 2500 | 2.2658 | -51.7% |
| 3000 | 1.5617 | -66.7% |
| 3500 | 1.0885 | -76.8% |
| 4000 | 0.8004 | -82.9% |
| 4500 | 0.5178 | -88.9% |
| 5000 (final) | 0.3271 | -93.0% |
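For reference, the improvement column can be reproduced from the logged losses like this (a small illustrative snippet, not part of the training code):

```python
baseline = 4.6897  # loss at step 500 (the table's baseline)
losses = {1500: 3.4715, 2500: 2.2658, 5000: 0.3271}
for step, loss in losses.items():
    change = 100 * (loss - baseline) / baseline
    print(f"step {step}: {change:+.1f}%")  # -26.0%, -51.7%, -93.0%
```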
Model Generation Quality Improvement
The model's text generation ability improved significantly:
Step 0 (Before Training):

```
What is English Muscle Kelly flossing towardsimatingćBind outrageroutine dreTClywood loudly brightness hardships
```

Step 500:

```
What is Englishour.
HOLANIO:
My name you
To the king, I'll tell this in theREM;
```

Step 1000:

```
What is English's They knows no their place?
ISABELLA:
Speak me:
I am a grave to the maid and sh son.
```

Step 2000:

```
What is English'd to say theAnd I will come.
KING EDWARD IV:
Go, Warwick, in all my friends, my lords.
```

Step 5000 (Final):

```
What is English quarter
To frame of the people to himself.
CAMILLO:
God and your noble lord,
She does do much need on't.
```
Loss Convergence
The loss decreased steadily throughout training, with the fastest drop in the first few epochs:
- Epochs 1-3: Rapid initial decrease from ~9.6 to ~4.7
- Epochs 4-10: Continued improvement to ~3.9
- Epochs 11-20: Moderate improvement to ~2.0
- Epochs 21-30: Final optimization to ~0.3
Model Architecture Verification
After training, the custom model's architecture was compared against the official SmolLM2-135M:
```
Custom model parameters: 364
Official model parameters: 273
Matching parameters: 1
Only in custom: 363
Only in official: 272
```
These counts refer to named parameter tensors (state-dict keys), not total parameter counts. With only one key name in common, the comparison shows that the custom implementation uses a different parameter naming scheme and module layout than the official model, even though it targets the same architecture.
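A comparison like the one above can be produced by diffing state-dict key names between the two models; the snippet below is a sketch of that idea, not necessarily the exact script used in this project.

```python
import torch.nn as nn

def compare_param_names(custom_model: nn.Module, official_model: nn.Module) -> None:
    """Report overlap between the named parameters of two models."""
    custom_keys = set(custom_model.state_dict().keys())
    official_keys = set(official_model.state_dict().keys())
    print(f"Custom model parameters: {len(custom_keys)}")
    print(f"Official model parameters: {len(official_keys)}")
    print(f"Matching parameters: {len(custom_keys & official_keys)}")
    print(f"Only in custom: {len(custom_keys - official_keys)}")
    print(f"Only in official: {len(official_keys - custom_keys)}")
```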
Usage
Loading the Model
```python
import torch
from model import CustomSmolLM, ModelConfig
from transformers import AutoTokenizer

# Initialize model configuration
config = ModelConfig()

# Load the model
model = CustomSmolLM(config)
model.load_state_dict(torch.load('checkpoints/final_model.pt')['model_state_dict'])
model.eval()

# Load tokenizer (uses official SmolLM2 tokenizer)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
```
Text Generation
```python
import torch
import torch.nn.functional as F

def generate_text(model, tokenizer, prompt, max_length=50, temperature=0.8):
    model.eval()
    device = next(model.parameters()).device

    # Tokenize prompt
    input_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)

    with torch.no_grad():
        for _ in range(max_length):
            # Forward pass; the custom model returns a dict with 'logits'
            outputs = model(input_ids)
            logits = outputs['logits']

            # Sample the next token from the temperature-scaled distribution
            next_token_logits = logits[:, -1, :] / temperature
            probs = F.softmax(next_token_logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            input_ids = torch.cat([input_ids, next_token], dim=1)

            if next_token.item() == tokenizer.eos_token_id:
                break

    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

# Generate text
generated = generate_text(model, tokenizer, "Once upon a time", max_length=50)
print(generated)
```
Resuming Training
```python
from train import load_checkpoint

# Resume from a checkpoint
model, checkpoint = load_checkpoint(model, 'checkpoints/checkpoint_step_500.pt')
print(f"Resumed from step {checkpoint['step']}")
```
Training Configuration
- Learning Rate: 1e-4
- Optimizer: AdamW with betas (0.9, 0.95)
- Weight Decay: 0.1
- Gradient Clipping: 1.0
- Batch Size: 4
- Sequence Length: 512 tokens
- Checkpoint Frequency: Every 500 steps
- Device: CPU (GPU recommended for faster training)
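Taken together, this configuration corresponds to an optimizer and update step along the following lines (a sketch under the hyperparameters listed above; train.py may organize this differently):

```python
import torch
import torch.nn as nn

def build_optimizer(model: nn.Module) -> torch.optim.Optimizer:
    """AdamW with the betas and weight decay listed above."""
    return torch.optim.AdamW(
        model.parameters(), lr=1e-4, betas=(0.9, 0.95), weight_decay=0.1
    )

def optimization_step(model: nn.Module, optimizer: torch.optim.Optimizer,
                      loss: torch.Tensor) -> None:
    """Backward pass with gradient clipping at norm 1.0, then the update."""
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
```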
Intended Uses
This model is designed for:
- Educational purposes and understanding transformer architectures
- Experimenting with small-scale language model training
- Learning about PyTorch implementation of modern LLM components
- Demonstrating custom model architecture development
Limitations
- Trained on a small dataset (1.1M characters), limiting generalization
- Only 5,000 training steps - significantly less than production models
- No evaluation on standardized benchmarks
- Architecture has some divergence from official SmolLM2-135M parameter naming
- Not suitable for production use cases
- May produce inconsistent or incorrect text
Ethical Considerations
This is an educational model trained on a small dataset. Users should:
- Not rely on it for factual information
- Be aware it may generate biased or inappropriate content
- Use it only for learning and experimentation
- Consider the official SmolLM2-135M for any serious applications
Citation
If you use this implementation in your research or projects, please cite:
```bibtex
@misc{smollm2-135m-dissecting,
  title={SmolLM2-135M-Dissecting: A Custom Implementation for Educational Purposes},
  author={agileabhi},
  year={2025},
  howpublished={\url{https://huggingface.co/spaces/agileabhi/SmolLM2-135M-Model}}
}
```
Also consider citing the original SmolLM2 work from Hugging Face.
Acknowledgments
- Based on the SmolLM2-135M architecture by Hugging Face
- Uses the official SmolLM2 tokenizer
- Inspired by modern transformer implementations
Repository Structure
- model.py: Custom model architecture implementation
- train.py: Training script with checkpointing and evaluation
- app.py: Gradio demo interface
- strip_weights.py: Utility for model weight management
- upload_to_spaces.py: Hugging Face Spaces deployment script
- checkpoints/: Model checkpoints saved during training
- input.txt: Training data file
Contact
For questions or issues, please open an issue on the GitHub repository.