SmolLM2-135M-Dissecting
A custom implementation of the SmolLM2-135M language model architecture, trained from scratch for educational purposes. This project demonstrates building a transformer-based language model with 147.8M parameters.
Model Description
This is a custom implementation that mimics the SmolLM2-135M architecture. It was built from scratch to understand the inner workings of small language models and includes:
- Custom transformer blocks with multi-head attention
- Rotary Position Embeddings (RoPE)
- SwiGLU activation functions
- Layer normalization and residual connections
Note: This is an educational implementation trained on a small dataset. For production use, consider the official HuggingFaceTB/SmolLM2-135M model.
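As one illustration of these components, the sketch below shows how a SwiGLU feed-forward block is commonly wired in SmolLM2-style models. The class and attribute names are illustrative and may not match the ones used in model.py.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Gated feed-forward block: down( SiLU(gate(x)) * up(x) )."""

    def __init__(self, hidden_size: int = 576, intermediate_size: int = 1536):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: the SiLU-activated gate modulates the up projection
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```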
Model Details
- Model Type: Causal Language Model (Decoder-only Transformer)
- Architecture: Custom SmolLM2-135M implementation
- Total Parameters: 147,821,184
- Training Dataset: Custom text dataset (1,115,394 characters)
- Training Steps: 5,000 steps
- Language: English
- License: Apache 2.0
Architecture Specifications
- Vocabulary Size: 49,152
- Hidden Size: 576
- Number of Layers: 30
- Attention Heads: 9
- Intermediate Size: 1,536
- Max Position Embeddings: 2,048
- Head Dimension: 64
- Activation Function: SwiGLU
- Position Embedding: Rotary Position Embedding (RoPE)
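For reference, these hyperparameters map naturally onto a small configuration object like the ModelConfig used in the usage examples below. This is an illustrative sketch; the actual field names in model.py may differ.

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Values copied from the specification above; field names are illustrative
    vocab_size: int = 49152
    hidden_size: int = 576
    num_hidden_layers: int = 30
    num_attention_heads: int = 9
    intermediate_size: int = 1536
    max_position_embeddings: int = 2048
    head_dim: int = 64  # hidden_size // num_attention_heads
```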
Training Process
Initialization
The training started with model initialization on CPU:
```
Using device: cpu
Initializing custom model...
Total parameters: 147,821,184
```
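The parameter count reported above is typically obtained by summing the element counts of all parameter tensors, for example (a minimal helper; train.py may compute it differently):

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total number of elements across all parameter tensors."""
    return sum(p.numel() for p in model.parameters())

# Usage: print(f"Total parameters: {count_parameters(model):,}")
```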
Dataset Preparation
The tokenizer loaded successfully, and the input text was tokenized:
```
Loading tokenizer...
tokenizer_config.json: 3.66kB [00:00, 2.50MB/s]
vocab.json: 801kB [00:00, 5.63MB/s]
merges.txt: 466kB [00:00, 5.45MB/s]
tokenizer.json: 2.10MB [00:00, 7.78MB/s]
```
The training dataset consisted of:
- 666 chunks of 512 tokens each
- Batch size: 4
- Steps per epoch: 167
- Total training steps: 5,000
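As a rough sketch of how fixed-length chunks like these are typically produced (illustrative only; train.py may implement the chunking differently):

```python
import torch

def chunk_tokens(token_ids: list[int], chunk_size: int = 512) -> torch.Tensor:
    """Split a flat token sequence into full chunks of `chunk_size` tokens,
    dropping any trailing remainder."""
    num_chunks = len(token_ids) // chunk_size
    trimmed = token_ids[: num_chunks * chunk_size]
    return torch.tensor(trimmed, dtype=torch.long).view(num_chunks, chunk_size)

# For this dataset: 666 chunks of 512 tokens, served in batches of 4.
```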
Training Progress
Loss Reduction Over Time
The model showed consistent improvement throughout training. Improvement percentages are relative to the step-500 baseline loss; a short snippet after the table shows the calculation:
| Step | Loss | Improvement |
|---|---|---|
| 0 (initial) | N/A | - |
| 500 | 4.6897 | Baseline |
| 1000 | 4.0074 | -14.6% |
| 1500 | 3.4715 | -26.0% |
| 2000 | 2.8648 | -38.8% |
| 2500 | 2.2658 | -51.7% |
| 3000 | 1.5617 | -66.7% |
| 3500 | 1.0885 | -76.8% |
| 4000 | 0.8004 | -82.9% |
| 4500 | 0.5178 | -88.9% |
| 5000 (final) | 0.3271 | -93.0% |
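For reference, the improvement column can be reproduced from the logged losses like this (a small illustrative snippet, not part of the training code):

```python
baseline = 4.6897  # loss at step 500 (the table's baseline)
losses = {1500: 3.4715, 2500: 2.2658, 5000: 0.3271}
for step, loss in losses.items():
    change = 100 * (loss - baseline) / baseline
    print(f"step {step}: {change:+.1f}%")  # -26.0%, -51.7%, -93.0%
```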
Model Generation Quality Improvement
The model's text generation ability improved significantly:
Step 0 (Before Training):

```
What is English Muscle Kelly flossing towardsimatingćBind outrageroutine dreTClywood loudly brightness hardships
```

Step 500:

```
What is Englishour.
HOLANIO:
My name you
To the king, I'll tell this in theREM;
```

Step 1000:

```
What is English's They knows no their place?
ISABELLA:
Speak me:
I am a grave to the maid and sh son.
```

Step 2000:

```
What is English'd to say theAnd I will come.
KING EDWARD IV:
Go, Warwick, in all my friends, my lords.
```

Step 5000 (Final):

```
What is English quarter
To frame of the people to himself.
CAMILLO:
God and your noble lord,
She does do much need on't.
```
Loss Convergence
The loss decreased steadily throughout training, with the fastest drop in the first few epochs:
- Epochs 1-3: Rapid initial decrease from ~9.6 to ~4.7
- Epochs 4-10: Continued improvement to ~3.9
- Epochs 11-20: Moderate improvement to ~2.0
- Epochs 21-30: Final optimization to ~0.3
Model Architecture Verification
After training, the custom model's architecture was compared against the official SmolLM2-135M:
```
Custom model parameters: 364
Official model parameters: 273
Matching parameters: 1
Only in custom: 363
Only in official: 272
```
These counts refer to named parameter tensors (state-dict keys), not total parameter counts. With only one key name in common, the comparison shows that the custom implementation uses a different parameter naming scheme and module layout than the official model, even though it targets the same architecture.
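A comparison like the one above can be produced by diffing state-dict key names between the two models; the snippet below is a sketch of that idea, not necessarily the exact script used in this project.

```python
import torch.nn as nn

def compare_param_names(custom_model: nn.Module, official_model: nn.Module) -> None:
    """Report overlap between the named parameters of two models."""
    custom_keys = set(custom_model.state_dict().keys())
    official_keys = set(official_model.state_dict().keys())
    print(f"Custom model parameters: {len(custom_keys)}")
    print(f"Official model parameters: {len(official_keys)}")
    print(f"Matching parameters: {len(custom_keys & official_keys)}")
    print(f"Only in custom: {len(custom_keys - official_keys)}")
    print(f"Only in official: {len(official_keys - custom_keys)}")
```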
Usage
Loading the Model
```python
import torch
from model import CustomSmolLM, ModelConfig
from transformers import AutoTokenizer

# Initialize model configuration
config = ModelConfig()

# Load the model
model = CustomSmolLM(config)
model.load_state_dict(torch.load('checkpoints/final_model.pt')['model_state_dict'])
model.eval()

# Load tokenizer (uses official SmolLM2 tokenizer)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
```
Text Generation
```python
import torch
import torch.nn.functional as F

def generate_text(model, tokenizer, prompt, max_length=50, temperature=0.8):
    model.eval()
    device = next(model.parameters()).device

    # Tokenize prompt
    input_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)

    with torch.no_grad():
        for _ in range(max_length):
            # Forward pass; the custom model returns a dict with 'logits'
            outputs = model(input_ids)
            logits = outputs['logits']

            # Sample the next token from the temperature-scaled distribution
            next_token_logits = logits[:, -1, :] / temperature
            probs = F.softmax(next_token_logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            input_ids = torch.cat([input_ids, next_token], dim=1)

            if next_token.item() == tokenizer.eos_token_id:
                break

    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

# Generate text
generated = generate_text(model, tokenizer, "Once upon a time", max_length=50)
print(generated)
```
Resuming Training
```python
from train import load_checkpoint

# Resume from a checkpoint
model, checkpoint = load_checkpoint(model, 'checkpoints/checkpoint_step_500.pt')
print(f"Resumed from step {checkpoint['step']}")
```
Training Configuration
- Learning Rate: 1e-4
- Optimizer: AdamW with betas (0.9, 0.95)
- Weight Decay: 0.1
- Gradient Clipping: 1.0
- Batch Size: 4
- Sequence Length: 512 tokens
- Checkpoint Frequency: Every 500 steps
- Device: CPU (GPU recommended for faster training)
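Taken together, this configuration corresponds to an optimizer and update step along the following lines (a sketch under the hyperparameters listed above; train.py may organize this differently):

```python
import torch
import torch.nn as nn

def build_optimizer(model: nn.Module) -> torch.optim.Optimizer:
    """AdamW with the betas and weight decay listed above."""
    return torch.optim.AdamW(
        model.parameters(), lr=1e-4, betas=(0.9, 0.95), weight_decay=0.1
    )

def optimization_step(model: nn.Module, optimizer: torch.optim.Optimizer,
                      loss: torch.Tensor) -> None:
    """Backward pass with gradient clipping at norm 1.0, then the update."""
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
```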
Intended Uses
This model is designed for:
- Educational purposes and understanding transformer architectures
- Experimenting with small-scale language model training
- Learning about PyTorch implementation of modern LLM components
- Demonstrating custom model architecture development
Limitations
- Trained on a small dataset (1.1M characters), limiting generalization
- Only 5,000 training steps - significantly less than production models
- No evaluation on standardized benchmarks
- Architecture has some divergence from official SmolLM2-135M parameter naming
- Not suitable for production use cases
- May produce inconsistent or incorrect text
Ethical Considerations
This is an educational model trained on a small dataset. Users should:
- Not rely on it for factual information
- Be aware it may generate biased or inappropriate content
- Use it only for learning and experimentation
- Consider the official SmolLM2-135M for any serious applications
Citation
If you use this implementation in your research or projects, please cite:
```bibtex
@misc{smollm2-135m-dissecting,
  title={SmolLM2-135M-Dissecting: A Custom Implementation for Educational Purposes},
  author={agileabhi},
  year={2025},
  howpublished={\url{https://huggingface.co/spaces/agileabhi/SmolLM2-135M-Model}}
}
```
Also consider citing the original SmolLM2 work from Hugging Face.
Acknowledgments
- Based on the SmolLM2-135M architecture by Hugging Face
- Uses the official SmolLM2 tokenizer
- Inspired by modern transformer implementations
Repository Structure
- model.py: Custom model architecture implementation
- train.py: Training script with checkpointing and evaluation
- app.py: Gradio demo interface
- strip_weights.py: Utility for model weight management
- upload_to_spaces.py: Hugging Face Spaces deployment script
- checkpoints/: Model checkpoints saved during training
- input.txt: Training data file
Contact
For questions or issues, please open an issue on the GitHub repository.