Veronica-Polymorphic 24L (551M)
Veronica-Polymorphic is a decoder-only language model (≈551M params) with a polymorphic MLP:
each block contains multiple MLP branches (SwiGLU, GLU, Depthwise Causal Conv) and a soft router that blends them per-token.
The goal is adaptive capacity and incremental expansion (adding new branches later, e.g. translation), while keeping the rest of the backbone stable.
⚠️ Status: research preview, pre-training only, no external benchmarks yet.
Do not treat this as a production-ready model.
1. TL;DR
| Aspect | Value / Description |
|---|---|
| Type | Decoder-only causal LM |
| Params | ~551M |
| Layers | 24 |
| Hidden size | 768 |
| Heads | 12 |
| Positional encoding | RoPE (rotary) |
| MLP | Polymorphic (SwiGLU • GLU • DepthwiseConv) per block |
| Routing | Entropy-regularized soft routing, depth-scaled temperature |
| Precision | bf16 weights, fp32 LayerNorm |
| Context length | 1024 → 2048 (curriculum; 512 discouraged on 24L) |
| Data mix | FinePDFs-1B 50% • DCLM Baseline-1B 30% • FineWeb-Edu 20% |
| Intended use | Research on routing / branch specialization |
| Not included | Instruction tuning, RLHF, safety fine-tuning, eval suite |
2. Intended use & scope
Primary intent
This checkpoint is meant for:
- Researchers interested in:
  - Mixture-of-branches / soft routing in MLPs
  - Stability of routers on deeper (24L) architectures
  - Incremental model growth via adding branches post-pretrain
- Practitioners who want a small, hackable codebase to experiment with:
  - Polymorphic MLPs
  - Entropy-regularized routing
  - Context-length curricula
Out of scope
This model is not designed or evaluated (yet) for:
- General-purpose assistant use
- Safety-critical or high-stakes decisions
- Deployment to end-users without additional filtering, alignment, and evaluation
3. Model details
3.1 Architecture (high-level)
```
Input tokens
  ↓
Token & position embeddings (RoPE on Q/K)
  ↓
[ VeronicaBlock × 24 ]
  VeronicaBlock:
    x → Pre-LN → Multi-Head Self-Attention (RoPE) → Residual
      → Pre-LN → Polymorphic MLP (router + branches) → Residual
  ↓
Untied LM head → logits
```
Key design choices:
- Decoder-only Transformer (causal LM)
- Pre-LayerNorm blocks
- RoPE positional encoding (no learned absolute positions)
- Untied input embeddings / LM head
- Gradient checkpointing used in training runs for memory efficiency
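For orientation, here is a minimal PyTorch-style sketch of the block layout above. The class name, signatures, and the placeholder MLP are illustrative assumptions, not the repository's actual API:

```python
# Illustrative sketch only — module names and signatures are assumptions,
# not the actual Veronica codebase API.
import torch
import torch.nn as nn

class VeronicaBlockSketch(nn.Module):
    """Pre-LN decoder block: attention + polymorphic MLP, each with a residual."""

    def __init__(self, hidden_size: int = 768, n_heads: int = 12):
        super().__init__()
        self.ln_attn = nn.LayerNorm(hidden_size)        # kept in fp32 in the reference runs
        self.attn = nn.MultiheadAttention(hidden_size, n_heads, batch_first=True)
        self.ln_mlp = nn.LayerNorm(hidden_size)
        self.mlp = nn.Linear(hidden_size, hidden_size)  # placeholder for the polymorphic MLP

    def forward(self, x: torch.Tensor, attn_mask: torch.Tensor | None = None) -> torch.Tensor:
        # Pre-LN attention sub-block (the real attention applies RoPE to Q/K and a causal mask).
        h = self.ln_attn(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + attn_out
        # Pre-LN MLP sub-block (polymorphic MLP in the real model).
        x = x + self.mlp(self.ln_mlp(x))
        return x
```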
3.2 Polymorphic MLP & routing
Each block’s MLP is replaced by a polymorphic MLP:
```python
router_logits = Router(x)              # Linear → GELU → Linear
alpha = softmax(router_logits / tau)

branches = [SwiGLU(x), GLU(x), DepthwiseConvMLP(x)]

output = sum(alpha_i * branch_i for alpha_i, branch_i in zip(alpha, branches))
```
Branches:
| Branch | Role | Sketch |
|---|---|---|
| SwiGLU | Default gated MLP | Linear(up) → split → SiLU×gate → Linear(down) |
| GLU | Alternative gating dynamics | Linear(up) → split → Sigmoid×gate → Linear(down) |
| DepthwiseConv | Local token patterns / n-grams | Depthwise causal conv (k=3) → MLP |
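As a rough, self-contained illustration of the three branch shapes (module names and the 4× expansion are assumptions; the repository's implementations may differ):

```python
# Hedged sketch of the three branch families; names and exact shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUBranch(nn.Module):
    def __init__(self, d: int, expansion: int = 4):
        super().__init__()
        self.up = nn.Linear(d, 2 * expansion * d)    # up-projection, split into value/gate
        self.down = nn.Linear(expansion * d, d)

    def forward(self, x):
        v, g = self.up(x).chunk(2, dim=-1)
        return self.down(v * F.silu(g))              # SiLU gate

class GLUBranch(nn.Module):
    def __init__(self, d: int, expansion: int = 4):
        super().__init__()
        self.up = nn.Linear(d, 2 * expansion * d)
        self.down = nn.Linear(expansion * d, d)

    def forward(self, x):
        v, g = self.up(x).chunk(2, dim=-1)
        return self.down(v * torch.sigmoid(g))       # sigmoid gate

class DepthwiseConvBranch(nn.Module):
    def __init__(self, d: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(d, d, kernel_size, groups=d)  # depthwise conv over time
        self.pad = kernel_size - 1                           # left-pad only => causal
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):                            # x: (batch, seq, d)
        h = F.pad(x.transpose(1, 2), (self.pad, 0))  # causal left padding
        h = self.conv(h).transpose(1, 2)
        return self.mlp(h)
```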
Routing controls:
- Temperature schedule tau_start → tau_end (higher early = softer mixing)
- Entropy-max aux-loss: encourages non-collapsed branch usage
- Depth-scaled parameters:
  - Router temperature and aux-loss weight scaled ≈√(depth_ratio) when going from shallower (12L) to deeper (24L) models
The key property is that routing remains soft: typical healthy distributions have a dominant branch (55–65%) and minority branches (15–25%) instead of hard one-hot selection.
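A minimal sketch of these controls, assuming nothing about the repository's actual implementation (function names are illustrative; the tau=2.2 in the usage example matches the reference launch flags):

```python
# Hedged sketch: entropy-max auxiliary loss and sqrt(depth)-scaled router hyperparameters.
# Function names and schedule shapes are illustrative assumptions.
import math
import torch
import torch.nn.functional as F

def routing_weights(router_logits: torch.Tensor, tau: float) -> torch.Tensor:
    """Soft branch weights alpha = softmax(logits / tau); higher tau = softer mixing."""
    return F.softmax(router_logits / tau, dim=-1)

def entropy_max_aux_loss(alpha: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """Negative normalized entropy, averaged over tokens.

    Minimizing this term pushes the router toward higher entropy,
    i.e. away from collapsed one-hot branch selection.
    """
    entropy = -(alpha * (alpha + eps).log()).sum(dim=-1)   # per-token entropy
    entropy_norm = entropy / math.log(alpha.shape[-1])     # normalized to [0, 1]
    return -entropy_norm.mean()

def depth_scaled(value_12l: float, n_layer: int, base_layers: int = 12) -> float:
    """Scale a 12L-tuned router hyperparameter by ~sqrt(depth_ratio) for deeper models."""
    return value_12l * math.sqrt(n_layer / base_layers)

# Usage example with a dummy batch of router logits for 3 branches:
alpha = routing_weights(torch.randn(8, 3), tau=2.2)
aux = entropy_max_aux_loss(alpha)
```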
4. Training data
The pre-training data follows the codelion / DataComp LM mixture guidelines:
| Dataset | Share | Description |
|---|---|---|
| codelion/finepdfs-1B | 50% | Technical/academic PDFs (high semantic density) |
| codelion/dclm-baseline-1B | 30% | General web corpus baseline |
| codelion/fineweb-edu-1B | 20% | Educational / explanatory web data |
Target token budget for this configuration: ~60B tokens (example setting).
For licensing and detailed descriptions, please refer to each dataset on Hugging Face.
If you reuse this mixture, please also cite:
```bibtex
@article{sharma2025billion,
  title  = {The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix},
  author = {Sharma, Asankhaya},
  year   = {2025},
  url    = {https://huggingface.co/blog/codelion/optimal-dataset-mixing/}
}
```
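If you want to reproduce a similar 50/30/20 mixture with the Hugging Face `datasets` library, one possible approach is probabilistic interleaving. The dataset IDs come from the table above; the `train` split and streaming mode are assumptions, and the real preprocessing/tokenization pipeline for this checkpoint may differ:

```python
# Illustrative sketch: 50/30/20 probabilistic interleaving with Hugging Face datasets.
from datasets import load_dataset, interleave_datasets

finepdfs = load_dataset("codelion/finepdfs-1B", split="train", streaming=True)
dclm = load_dataset("codelion/dclm-baseline-1B", split="train", streaming=True)
fineweb = load_dataset("codelion/fineweb-edu-1B", split="train", streaming=True)

mix = interleave_datasets(
    [finepdfs, dclm, fineweb],
    probabilities=[0.5, 0.3, 0.2],   # FinePDFs 50% • DCLM 30% • FineWeb-Edu 20%
    seed=42,
    stopping_strategy="all_exhausted",
)

# Peek at a few mixed examples.
for i, example in enumerate(mix):
    if i >= 3:
        break
    print(example.keys())
```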
5. Training procedure
Note: numbers below describe the reference run configuration used to train this checkpoint. You can adapt them for your own experiments.
5.1 Core hyperparameters
| Hyperparameter | Value / Notes |
|---|---|
| Layers | 24 |
| Hidden size | 768 |
| Attention heads | 12 |
| MLP expansion | 4× |
| Per-device batch size | 4 |
| Grad accumulation | 8 (effective batch 32) |
| Optimizer / LR schedule | AdamW, lr=1.2e-4, cosine decay |
| Warmup | 10% of total steps |
| Weight decay | 0.01 |
| Label smoothing | 0.01 |
| Precision | bf16 + fp32 LayerNorm |
| Max steps | 60k (example target) |
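For reference, the AdamW + cosine-decay-with-warmup setup in the table can be sketched with standard PyTorch / `transformers` utilities; the actual `scripts/train_veronica.py` wiring may differ:

```python
# Hedged sketch of the optimizer / LR schedule from the table above;
# the real training script may wire this differently.
import torch
from transformers import get_cosine_schedule_with_warmup

def build_optimizer_and_scheduler(model: torch.nn.Module, max_steps: int = 60_000):
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=1.2e-4,          # table: AdamW, lr=1.2e-4
        weight_decay=0.01,  # table: weight decay 0.01
    )
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.10 * max_steps),  # table: 10% warmup
        num_training_steps=max_steps,
    )
    return optimizer, scheduler
```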
Example launch:
```bash
python scripts/train_veronica.py \
  --config configs/veronica-pretrain-24L.json \
  --dataset_paths data/mix_optimal_50_30_20 \
  --output_dir runs/veronica-pretrain-24L \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 8 \
  --max_steps 60000 \
  --learning_rate 1.2e-4 \
  --warmup_ratio 0.10 \
  --weight_decay 0.01 \
  --max_seq_len 1024 \
  --router_tau_start 2.2 --router_tau_end 1.4 --router_tau_freeze_steps 6000 \
  --router_aux_start 0.008 --router_aux_end 0.016 \
  --router_force_prob 0.10 --router_force_warmup_steps 5000 \
  --rep_alpha 0.05 \
  --seed 42
```
5.2 Context-length curriculum & “512-token trap”
Empirical finding on 24-layer models:
- Starting at 512 tokens caused router collapse around step ~3k:
  - One branch dominated (>70%), entropy dropped, other branches starved.
- Starting directly at 1024 tokens avoided collapse and produced stable, soft routing.
Recommended curriculum for 24L:
- Steps 0–20k: 1024 tokens
- Steps 20k–60k: 2048 tokens
For shallower (~12L) models, a 512→1024→2048 curriculum can work; for ≥20L, starting at 1024 is strongly recommended.
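A tiny sketch of how such a step-based sequence-length curriculum could be expressed. The 24L boundaries match the recommended schedule above; the function name and the 12L boundaries are illustrative assumptions:

```python
# Illustrative helper: map the global training step to a context length.
def seq_len_for_step(step: int, n_layer: int = 24) -> int:
    if n_layer >= 20:
        # Deeper models: start directly at 1024 to avoid the "512-token trap".
        return 1024 if step < 20_000 else 2048
    # Shallower (~12L) models can optionally start at 512
    # (the step boundaries below are illustrative, not tuned values).
    if step < 10_000:
        return 512
    if step < 20_000:
        return 1024
    return 2048
```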
5.3 Router health during training
Training logs include entries like:

```
[router] alpha=[a0, a1, a2] entropy_norm=E
```
Healthy targets (rough guideline):
| Phase | Steps | Entropy (norm) | Min branch share |
|---|---|---|---|
| Warmup | 0–5k | ≥ 0.90 | ≥ 0.25 |
| Post-freeze | 5k–10k | ≥ 0.75 | ≥ 0.12 |
| Stable | 10k+ | ≥ 0.70 | ≥ 0.15 |
Collapsed routing typically shows up as:
- Entropy < 0.65
- One branch > 80% usage for many thousands of steps
- Other branches stuck < 5–10%
The provided training script (scripts/train_veronica.py) implements the entropy-max aux-loss and router schedules out-of-the-box.
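As a hedged illustration of these thresholds (the bands come from the guideline table above; the helper itself is not part of the repository):

```python
# Illustrative router-health check based on the rough guideline table above.
def router_health(step: int, alpha: list[float], entropy_norm: float) -> bool:
    """Return True if logged routing stats fall inside the rough healthy bands."""
    if step < 5_000:            # warmup
        min_entropy, min_share = 0.90, 0.25
    elif step < 10_000:         # post-freeze
        min_entropy, min_share = 0.75, 0.12
    else:                       # stable
        min_entropy, min_share = 0.70, 0.15
    return entropy_norm >= min_entropy and min(alpha) >= min_share

# e.g. for a log line "[router] alpha=[0.58, 0.24, 0.18] entropy_norm=0.83":
print(router_health(step=12_000, alpha=[0.58, 0.24, 0.18], entropy_norm=0.83))  # True
```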
6. Evaluation
6.1 Current evaluation status
At the time of this release:
- No standardized benchmarks (e.g. lm-eval-harness) have been run yet.
- There are no public numbers for:
  - MMLU (5-shot / 0-shot)
  - ARC-e / ARC-c
  - HellaSwag, PIQA, GSM8K, etc.
Internal training logs show sensible LM loss curves and stable routing, but this is not a substitute for external evaluation.
🔎 Interpretation: This checkpoint should be treated as a router / architecture experiment, not as a drop-in replacement for existing small LMs like Llama-3.2-1B, Gemma-2B, SmolLM, etc.
6.2 Planned evaluation (suggested)
If you adopt or extend Veronica-Polymorphic, consider running:
- lm-eval-harness on:
  - mmlu, arc_challenge, arc_easy, hellaswag, piqa
- Instruction / SFT (if you fine-tune):
  - Alpaca-style or OpenAssistant subsets
- Ablations:
  - Polymorphic MLP vs vanilla SwiGLU MLP with same depth/width
  - With / without entropy-max routing
Contributions of evaluation scripts and reported metrics are very welcome.
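If the checkpoint is exported in a `transformers`-loadable format, one possible way to run the suggested tasks is lm-eval-harness's Python API. This is a sketch under that assumption; the custom architecture may need to be registered with the Auto classes (see section 7.1) before it loads:

```python
# Hedged sketch: running the suggested lm-eval-harness tasks, assuming the
# checkpoint is loadable through transformers.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=MhaWay/Veronica,dtype=bfloat16,trust_remote_code=True",
    tasks=["mmlu", "arc_challenge", "arc_easy", "hellaswag", "piqa"],
    batch_size=8,
)
print(results["results"])
```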
7. How to use
7.1 Loading from code
If you’re using the Veronica codebase directly:
```python
from veronica import VeronicaConfig, VeronicaForCausalLM

cfg = VeronicaConfig(
    n_layer=24,
    num_funcs=3,  # SwiGLU, GLU, DepthwiseConv
)
model = VeronicaForCausalLM(cfg)
model.eval()
```
You can also integrate via transformers if you register the config/model, or load the checkpoint from this repo if exported.
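If you go the `transformers` registration route mentioned above, the standard Auto-class hooks look roughly like this (assuming `VeronicaConfig` defines a unique `model_type`, e.g. `"veronica"`):

```python
# Hedged sketch: registering the custom architecture with transformers' Auto classes.
# Assumes VeronicaConfig.model_type is set to "veronica".
from transformers import AutoConfig, AutoModelForCausalLM
from veronica import VeronicaConfig, VeronicaForCausalLM

AutoConfig.register("veronica", VeronicaConfig)
AutoModelForCausalLM.register(VeronicaConfig, VeronicaForCausalLM)

# After registration, the usual Auto API works on exported checkpoints:
model = AutoModelForCausalLM.from_pretrained("MhaWay/Veronica")
```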
7.2 Simple generation example
```python
from transformers import AutoTokenizer
from veronica import VeronicaForCausalLM, VeronicaConfig

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # or your own tokenizer
config = VeronicaConfig.from_pretrained("MhaWay/Veronica")
model = VeronicaForCausalLM.from_pretrained("MhaWay/Veronica", config=config)

prompt = "The theory of relativity states that"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,   # required for temperature / top_p to take effect
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Note: this is a raw pre-train checkpoint. Expect unaligned, sometimes incoherent generations.
8. Extensibility: adding new branches
One motivation for polymorphic MLPs is incremental expansion:
You can increase capacity or add a specialized branch (e.g. translation, code, domain-specific MLP) by:
- Expanding num_funcs
- Initializing the new branch + router output slice
- Running a short fine-tune with:
  - Router + new branch trainable
  - Optionally freezing the rest of the backbone during warmup
The repository includes utilities and example code for:
- Adding a new branch type
- Copying router weights and initializing the new column
- Scheduling a short specialization fine-tune
For details, see the “Incremental Expansion” and “Translation Branch” sections in the source code and examples.
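As a rough illustration of that flow (attribute names like `model.blocks`, `mlp.branches`, and `mlp.router_out` are hypothetical stand-ins; consult the repository utilities for the real API):

```python
# Hedged sketch of incremental expansion; all attribute names are hypothetical.
import torch
import torch.nn as nn

def add_branch(model: nn.Module, make_branch) -> None:
    """Append a new branch to every block and grow the router's output by one column."""
    for block in model.blocks:                     # hypothetical attribute
        mlp = block.mlp
        mlp.branches.append(make_branch())         # new specialized branch

        old = mlp.router_out                       # final Linear of the router (hypothetical)
        new = nn.Linear(old.in_features, old.out_features + 1)
        with torch.no_grad():
            new.weight[:-1].copy_(old.weight)      # keep existing routing behavior
            new.bias[:-1].copy_(old.bias)
            new.weight[-1].zero_()                 # new column starts neutral...
            new.bias[-1].fill_(-2.0)               # ...and slightly disfavored at first
        mlp.router_out = new

def freeze_backbone_except_router_and_new_branch(model: nn.Module) -> None:
    """Optional warmup: train only the routers and the newly added branches."""
    for p in model.parameters():
        p.requires_grad = False
    for block in model.blocks:
        for p in block.mlp.router_out.parameters():
            p.requires_grad = True
        for p in block.mlp.branches[-1].parameters():
            p.requires_grad = True
```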
9. Limitations & risks
This model:
- May generate inaccurate or nonsensical text
- May reproduce biases present in the underlying datasets
- Is not instruction-tuned:
  - Does not follow natural-language instructions reliably
  - Can ignore prompts, hallucinate, or switch topics
- Has no safety layer:
  - No explicit filtering of harmful/toxic content
  - No RLHF / preference optimization
Do not use Veronica-Polymorphic for:
- Safety-critical systems
- Medical, legal, or financial advice
- Content moderation without extensive additional work
- Any setting where unfiltered, biased generations would cause harm
10. Roadmap
Planned / desired directions:
| Version | Goal |
|---|---|
| v0.1 | Core polymorphic MLP + tests |
| v0.2 | Stable router schedules + logging |
| v0.3 | Configurable attention variants / FlashAttention |
| v0.4 | Public evaluation scripts (lm-eval-harness) |
| v0.5 | Reference instruction-tuned variant |
| v0.6 | Example specialization branches (e.g. translation) |
Community PRs are welcome, especially for:
- Evaluation & ablations vs vanilla MLP baselines
- New branch types and routing strategies
- Practical recipes for SFT / alignment on top of Veronica
11. License
This model and code are released under the Apache-2.0 license.
12. Citation
If you use Veronica-Polymorphic in your work, please cite:
```bibtex
@misc{veronica-2025,
  title        = {Veronica: Entropy-Regularized Polymorphic Branching for Adaptive Language Modeling},
  author       = {Emanuele D'Angelo},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/MhaWay/Veronica}}
}
```
13. Acknowledgments
- Mixture / routing inspiration from Switch Transformer, GLaM, and broader MoE literature.
- Dataset mixture ratios guided by codelion’s DataComp LM work.
- RoPE implementation adapted from GPT-NeoX-style implementations.