Veronica-Polymorphic 24L (551M)
Veronica-Polymorphic is a decoder-only language model (≈551M params) with a polymorphic MLP:
each block contains multiple MLP branches (SwiGLU, GLU, Depthwise Causal Conv) and a soft router that blends them per-token.
The goal is adaptive capacity and incremental expansion (adding new branches later, e.g. translation), while keeping the rest of the backbone stable.
⚠️ Status: research preview, pre-training only, no external benchmarks yet.
Do not treat this as a production-ready model.
1. TL;DR
| Aspect | Value / Description |
|---|---|
| Type | Decoder-only causal LM |
| Params | ~551M |
| Layers | 24 |
| Hidden size | 768 |
| Heads | 12 |
| Positional encoding | RoPE (rotary) |
| MLP | Polymorphic (SwiGLU • GLU • DepthwiseConv) per block |
| Routing | Entropy-regularized soft routing, depth-scaled temperature |
| Precision | bf16 weights, fp32 LayerNorm |
| Context length | 1024 → 2048 (curriculum; 512 discouraged on 24L) |
| Data mix | FinePDFs-1B 50% • DCLM Baseline-1B 30% • FineWeb-Edu 20% |
| Intended use | Research on routing / branch specialization |
| Not included | Instruction tuning, RLHF, safety fine-tuning, eval suite |
2. Intended use & scope
Primary intent
This checkpoint is meant for:
- Researchers interested in:
  - Mixture-of-branches / soft routing in MLPs
  - Stability of routers on deeper (24L) architectures
  - Incremental model growth via adding branches post-pretrain
- Practitioners who want a small, hackable codebase to experiment with:
  - Polymorphic MLPs
  - Entropy-regularized routing
  - Context-length curricula
Out of scope
This model is not designed or evaluated (yet) for:
- General-purpose assistant use
- Safety-critical or high-stakes decisions
- Deployment to end-users without additional filtering, alignment, and evaluation
3. Model details
3.1 Architecture (high-level)
```
Input tokens
  ↓
Token & position embeddings (RoPE on Q/K)
  ↓
[ VeronicaBlock × 24 ]
  VeronicaBlock:
    x → Pre-LN → Multi-Head Self-Attention (RoPE) → Residual
      → Pre-LN → Polymorphic MLP (router + branches) → Residual
  ↓
Untied LM head → logits
```
Key design choices:
- Decoder-only Transformer (causal LM)
- Pre-LayerNorm blocks
- RoPE positional encoding (no learned absolute positions)
- Untied input embeddings / LM head
- Gradient checkpointing used in training runs for memory efficiency
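For orientation, here is a minimal PyTorch-style sketch of the block layout above. The class name, signatures, and the placeholder MLP are illustrative assumptions, not the repository's actual API:

```python
# Illustrative sketch only — module names and signatures are assumptions,
# not the actual Veronica codebase API.
import torch
import torch.nn as nn

class VeronicaBlockSketch(nn.Module):
    """Pre-LN decoder block: attention + polymorphic MLP, each with a residual."""

    def __init__(self, hidden_size: int = 768, n_heads: int = 12):
        super().__init__()
        self.ln_attn = nn.LayerNorm(hidden_size)        # kept in fp32 in the reference runs
        self.attn = nn.MultiheadAttention(hidden_size, n_heads, batch_first=True)
        self.ln_mlp = nn.LayerNorm(hidden_size)
        self.mlp = nn.Linear(hidden_size, hidden_size)  # placeholder for the polymorphic MLP

    def forward(self, x: torch.Tensor, attn_mask: torch.Tensor | None = None) -> torch.Tensor:
        # Pre-LN attention sub-block (the real attention applies RoPE to Q/K and a causal mask).
        h = self.ln_attn(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + attn_out
        # Pre-LN MLP sub-block (polymorphic MLP in the real model).
        x = x + self.mlp(self.ln_mlp(x))
        return x
```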
3.2 Polymorphic MLP & routing
Each block’s MLP is replaced by a polymorphic MLP:
```python
router_logits = Router(x)              # Linear → GELU → Linear
alpha = softmax(router_logits / tau)

branches = [SwiGLU(x), GLU(x), DepthwiseConvMLP(x)]

output = sum(alpha_i * branch_i for alpha_i, branch_i in zip(alpha, branches))
```
Branches:
| Branch | Role | Sketch |
|---|---|---|
| SwiGLU | Default gated MLP | Linear(up) → split → SiLU×gate → Linear(down) |
| GLU | Alternative gating dynamics | Linear(up) → split → Sigmoid×gate → Linear(down) |
| DepthwiseConv | Local token patterns / n-grams | Depthwise causal conv (k=3) → MLP |
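As a rough, self-contained illustration of the three branch shapes (module names and the 4× expansion are assumptions; the repository's implementations may differ):

```python
# Hedged sketch of the three branch families; names and exact shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUBranch(nn.Module):
    def __init__(self, d: int, expansion: int = 4):
        super().__init__()
        self.up = nn.Linear(d, 2 * expansion * d)    # up-projection, split into value/gate
        self.down = nn.Linear(expansion * d, d)

    def forward(self, x):
        v, g = self.up(x).chunk(2, dim=-1)
        return self.down(v * F.silu(g))              # SiLU gate

class GLUBranch(nn.Module):
    def __init__(self, d: int, expansion: int = 4):
        super().__init__()
        self.up = nn.Linear(d, 2 * expansion * d)
        self.down = nn.Linear(expansion * d, d)

    def forward(self, x):
        v, g = self.up(x).chunk(2, dim=-1)
        return self.down(v * torch.sigmoid(g))       # sigmoid gate

class DepthwiseConvBranch(nn.Module):
    def __init__(self, d: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(d, d, kernel_size, groups=d)  # depthwise conv over time
        self.pad = kernel_size - 1                           # left-pad only => causal
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):                            # x: (batch, seq, d)
        h = F.pad(x.transpose(1, 2), (self.pad, 0))  # causal left padding
        h = self.conv(h).transpose(1, 2)
        return self.mlp(h)
```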
Routing controls:
- Temperature schedule tau_start → tau_end (higher early = softer mixing)
- Entropy-max aux-loss: encourages non-collapsed branch usage
- Depth-scaled parameters:
  - Router temperature and aux-loss weight scaled ≈√(depth_ratio) when going from shallower (12L) to deeper (24L) models
The key property is that routing remains soft: typical healthy distributions have a dominant branch (55–65%) and minority branches (15–25%) instead of hard one-hot selection.
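A minimal sketch of these controls, assuming nothing about the repository's actual implementation (function names are illustrative; the tau=2.2 in the usage example matches the reference launch flags):

```python
# Hedged sketch: entropy-max auxiliary loss and sqrt(depth)-scaled router hyperparameters.
# Function names and schedule shapes are illustrative assumptions.
import math
import torch
import torch.nn.functional as F

def routing_weights(router_logits: torch.Tensor, tau: float) -> torch.Tensor:
    """Soft branch weights alpha = softmax(logits / tau); higher tau = softer mixing."""
    return F.softmax(router_logits / tau, dim=-1)

def entropy_max_aux_loss(alpha: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """Negative normalized entropy, averaged over tokens.

    Minimizing this term pushes the router toward higher entropy,
    i.e. away from collapsed one-hot branch selection.
    """
    entropy = -(alpha * (alpha + eps).log()).sum(dim=-1)   # per-token entropy
    entropy_norm = entropy / math.log(alpha.shape[-1])     # normalized to [0, 1]
    return -entropy_norm.mean()

def depth_scaled(value_12l: float, n_layer: int, base_layers: int = 12) -> float:
    """Scale a 12L-tuned router hyperparameter by ~sqrt(depth_ratio) for deeper models."""
    return value_12l * math.sqrt(n_layer / base_layers)

# Usage example with a dummy batch of router logits for 3 branches:
alpha = routing_weights(torch.randn(8, 3), tau=2.2)
aux = entropy_max_aux_loss(alpha)
```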
4. Training data
The pre-training data follows the codelion / DataComp LM mixture guidelines:
| Dataset | Share | Description |
|---|---|---|
| codelion/finepdfs-1B | 50% | Technical/academic PDFs (high semantic density) |
| codelion/dclm-baseline-1B | 30% | General web corpus baseline |
| codelion/fineweb-edu-1B | 20% | Educational / explanatory web data |
Target token budget for this configuration: ~60B tokens (example setting).
For licensing and detailed descriptions, please refer to each dataset on Hugging Face.
If you reuse this mixture, please also cite:
```bibtex
@article{sharma2025billion,
  title  = {The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix},
  author = {Sharma, Asankhaya},
  year   = {2025},
  url    = {https://huggingface.co/blog/codelion/optimal-dataset-mixing/}
}
```
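If you want to reproduce a similar 50/30/20 mixture with the Hugging Face `datasets` library, one possible approach is probabilistic interleaving. The dataset IDs come from the table above; the `train` split and streaming mode are assumptions, and the real preprocessing/tokenization pipeline for this checkpoint may differ:

```python
# Illustrative sketch: 50/30/20 probabilistic interleaving with Hugging Face datasets.
from datasets import load_dataset, interleave_datasets

finepdfs = load_dataset("codelion/finepdfs-1B", split="train", streaming=True)
dclm = load_dataset("codelion/dclm-baseline-1B", split="train", streaming=True)
fineweb = load_dataset("codelion/fineweb-edu-1B", split="train", streaming=True)

mix = interleave_datasets(
    [finepdfs, dclm, fineweb],
    probabilities=[0.5, 0.3, 0.2],   # FinePDFs 50% • DCLM 30% • FineWeb-Edu 20%
    seed=42,
    stopping_strategy="all_exhausted",
)

# Peek at a few mixed examples.
for i, example in enumerate(mix):
    if i >= 3:
        break
    print(example.keys())
```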
5. Training procedure
Note: numbers below describe the reference run configuration used to train this checkpoint. You can adapt them for your own experiments.
5.1 Core hyperparameters
| Hyperparameter | Value / Notes |
|---|---|
| Layers | 24 |
| Hidden size | 768 |
| Attention heads | 12 |
| MLP expansion | 4× |
| Per-device batch size | 4 |
| Grad accumulation | 8 (effective batch 32) |
| Optimizer / LR schedule | AdamW, lr=1.2e-4, cosine decay |
| Warmup | 10% of total steps |
| Weight decay | 0.01 |
| Label smoothing | 0.01 |
| Precision | bf16 + fp32 LayerNorm |
| Max steps | 60k (example target) |
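For reference, the AdamW + cosine-decay-with-warmup setup in the table can be sketched with standard PyTorch / `transformers` utilities; the actual `scripts/train_veronica.py` wiring may differ:

```python
# Hedged sketch of the optimizer / LR schedule from the table above;
# the real training script may wire this differently.
import torch
from transformers import get_cosine_schedule_with_warmup

def build_optimizer_and_scheduler(model: torch.nn.Module, max_steps: int = 60_000):
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=1.2e-4,          # table: AdamW, lr=1.2e-4
        weight_decay=0.01,  # table: weight decay 0.01
    )
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.10 * max_steps),  # table: 10% warmup
        num_training_steps=max_steps,
    )
    return optimizer, scheduler
```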
Example launch:
```bash
python scripts/train_veronica.py \
  --config configs/veronica-pretrain-24L.json \
  --dataset_paths data/mix_optimal_50_30_20 \
  --output_dir runs/veronica-pretrain-24L \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 8 \
  --max_steps 60000 \
  --learning_rate 1.2e-4 \
  --warmup_ratio 0.10 \
  --weight_decay 0.01 \
  --max_seq_len 1024 \
  --router_tau_start 2.2 --router_tau_end 1.4 --router_tau_freeze_steps 6000 \
  --router_aux_start 0.008 --router_aux_end 0.016 \
  --router_force_prob 0.10 --router_force_warmup_steps 5000 \
  --rep_alpha 0.05 \
  --seed 42
```
5.2 Context-length curriculum & “512-token trap”
Empirical finding on 24-layer models:
- Starting at 512 tokens caused router collapse around step ~3k:
  - One branch dominated (>70%), entropy dropped, other branches starved.
- Starting directly at 1024 tokens avoided collapse and produced stable, soft routing.
Recommended curriculum for 24L:
- Steps 0–20k: 1024 tokens
- Steps 20k–60k: 2048 tokens
For shallower (~12L) models, a 512→1024→2048 curriculum can work; for ≥20L, starting at 1024 is strongly recommended.
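A tiny sketch of how such a step-based sequence-length curriculum could be expressed. The 24L boundaries match the recommended schedule above; the function name and the 12L boundaries are illustrative assumptions:

```python
# Illustrative helper: map the global training step to a context length.
def seq_len_for_step(step: int, n_layer: int = 24) -> int:
    if n_layer >= 20:
        # Deeper models: start directly at 1024 to avoid the "512-token trap".
        return 1024 if step < 20_000 else 2048
    # Shallower (~12L) models can optionally start at 512
    # (the step boundaries below are illustrative, not tuned values).
    if step < 10_000:
        return 512
    if step < 20_000:
        return 1024
    return 2048
```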
5.3 Router health during training
Training logs include entries like:

```
[router] alpha=[a0, a1, a2] entropy_norm=E
```
Healthy targets (rough guideline):
| Phase | Steps | Entropy (norm) | Min branch share |
|---|---|---|---|
| Warmup | 0–5k | ≥ 0.90 | ≥ 0.25 |
| Post-freeze | 5k–10k | ≥ 0.75 | ≥ 0.12 |
| Stable | 10k+ | ≥ 0.70 | ≥ 0.15 |
Collapsed routing typically shows up as:
- Entropy < 0.65
- One branch > 80% usage for many thousands of steps
- Other branches stuck < 5–10%
The provided training script (scripts/train_veronica.py) implements the entropy-max aux-loss and router schedules out-of-the-box.
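As a hedged illustration of these thresholds (the bands come from the guideline table above; the helper itself is not part of the repository):

```python
# Illustrative router-health check based on the rough guideline table above.
def router_health(step: int, alpha: list[float], entropy_norm: float) -> bool:
    """Return True if logged routing stats fall inside the rough healthy bands."""
    if step < 5_000:            # warmup
        min_entropy, min_share = 0.90, 0.25
    elif step < 10_000:         # post-freeze
        min_entropy, min_share = 0.75, 0.12
    else:                       # stable
        min_entropy, min_share = 0.70, 0.15
    return entropy_norm >= min_entropy and min(alpha) >= min_share

# e.g. for a log line "[router] alpha=[0.58, 0.24, 0.18] entropy_norm=0.83":
print(router_health(step=12_000, alpha=[0.58, 0.24, 0.18], entropy_norm=0.83))  # True
```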
6. Evaluation
6.1 Current evaluation status
At the time of this release:
- No standardized benchmarks (e.g. lm-eval-harness) have been run yet.
- There are no public numbers for:
  - MMLU (5-shot / 0-shot)
  - ARC-e / ARC-c
  - HellaSwag, PIQA, GSM8K, etc.
Internal training logs show sensible LM loss curves and stable routing, but this is not a substitute for external evaluation.
🔎 Interpretation: This checkpoint should be treated as a router / architecture experiment, not as a drop-in replacement for existing small LMs like Llama-3.2-1B, Gemma-2B, SmolLM, etc.
6.2 Planned evaluation (suggested)
If you adopt or extend Veronica-Polymorphic, consider running:
- lm-eval-harness on:
  - mmlu, arc_challenge, arc_easy, hellaswag, piqa
- Instruction / SFT (if you fine-tune):
  - Alpaca-style or OpenAssistant subsets
- Ablations:
  - Polymorphic MLP vs vanilla SwiGLU MLP with same depth/width
  - With / without entropy-max routing
Contributions of evaluation scripts and reported metrics are very welcome.
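If the checkpoint is exported in a `transformers`-loadable format, one possible way to run the suggested tasks is lm-eval-harness's Python API. This is a sketch under that assumption; the custom architecture may need to be registered with the Auto classes (see section 7.1) before it loads:

```python
# Hedged sketch: running the suggested lm-eval-harness tasks, assuming the
# checkpoint is loadable through transformers.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=MhaWay/Veronica,dtype=bfloat16,trust_remote_code=True",
    tasks=["mmlu", "arc_challenge", "arc_easy", "hellaswag", "piqa"],
    batch_size=8,
)
print(results["results"])
```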
7. How to use
7.1 Loading from code
If you’re using the Veronica codebase directly:
```python
from veronica import VeronicaConfig, VeronicaForCausalLM

cfg = VeronicaConfig(
    n_layer=24,
    num_funcs=3,  # SwiGLU, GLU, DepthwiseConv
)
model = VeronicaForCausalLM(cfg)
model.eval()
```
You can also integrate via transformers if you register the config/model, or load the checkpoint from this repo if exported.
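If you go the `transformers` registration route mentioned above, the standard Auto-class hooks look roughly like this (assuming `VeronicaConfig` defines a unique `model_type`, e.g. `"veronica"`):

```python
# Hedged sketch: registering the custom architecture with transformers' Auto classes.
# Assumes VeronicaConfig.model_type is set to "veronica".
from transformers import AutoConfig, AutoModelForCausalLM
from veronica import VeronicaConfig, VeronicaForCausalLM

AutoConfig.register("veronica", VeronicaConfig)
AutoModelForCausalLM.register(VeronicaConfig, VeronicaForCausalLM)

# After registration, the usual Auto API works on exported checkpoints:
model = AutoModelForCausalLM.from_pretrained("MhaWay/Veronica")
```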
7.2 Simple generation example
```python
from transformers import AutoTokenizer
from veronica import VeronicaForCausalLM, VeronicaConfig

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # or your own tokenizer
config = VeronicaConfig.from_pretrained("MhaWay/Veronica")
model = VeronicaForCausalLM.from_pretrained("MhaWay/Veronica", config=config)

prompt = "The theory of relativity states that"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,   # required for temperature / top_p to take effect
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Note: this is a raw pre-train checkpoint. Expect unaligned, sometimes incoherent generations.
8. Extensibility: adding new branches
One motivation for polymorphic MLPs is incremental expansion:
You can increase capacity or add a specialized branch (e.g. translation, code, domain-specific MLP) by:
- Expanding num_funcs
- Initializing the new branch + router output slice
- Running a short fine-tune with:
  - Router + new branch trainable
  - Optionally freezing the rest of the backbone during warmup
The repository includes utilities and example code for:
- Adding a new branch type
- Copying router weights and initializing the new column
- Scheduling a short specialization fine-tune
For details, see the “Incremental Expansion” and “Translation Branch” sections in the source code and examples.
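As a rough illustration of that flow (attribute names like `model.blocks`, `mlp.branches`, and `mlp.router_out` are hypothetical stand-ins; consult the repository utilities for the real API):

```python
# Hedged sketch of incremental expansion; all attribute names are hypothetical.
import torch
import torch.nn as nn

def add_branch(model: nn.Module, make_branch) -> None:
    """Append a new branch to every block and grow the router's output by one column."""
    for block in model.blocks:                     # hypothetical attribute
        mlp = block.mlp
        mlp.branches.append(make_branch())         # new specialized branch

        old = mlp.router_out                       # final Linear of the router (hypothetical)
        new = nn.Linear(old.in_features, old.out_features + 1)
        with torch.no_grad():
            new.weight[:-1].copy_(old.weight)      # keep existing routing behavior
            new.bias[:-1].copy_(old.bias)
            new.weight[-1].zero_()                 # new column starts neutral...
            new.bias[-1].fill_(-2.0)               # ...and slightly disfavored at first
        mlp.router_out = new

def freeze_backbone_except_router_and_new_branch(model: nn.Module) -> None:
    """Optional warmup: train only the routers and the newly added branches."""
    for p in model.parameters():
        p.requires_grad = False
    for block in model.blocks:
        for p in block.mlp.router_out.parameters():
            p.requires_grad = True
        for p in block.mlp.branches[-1].parameters():
            p.requires_grad = True
```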
9. Limitations & risks
This model:
- May generate inaccurate or nonsensical text
- May reproduce biases present in the underlying datasets
- Is not instruction-tuned:
  - Does not follow natural-language instructions reliably
  - Can ignore prompts, hallucinate, or switch topics
- Has no safety layer:
  - No explicit filtering of harmful/toxic content
  - No RLHF / preference optimization
Do not use Veronica-Polymorphic for:
- Safety-critical systems
- Medical, legal, or financial advice
- Content moderation without extensive additional work
- Any setting where unfiltered, biased generations would cause harm
10. Roadmap
Planned / desired directions:
| Version | Goal |
|---|---|
| v0.1 | Core polymorphic MLP + tests |
| v0.2 | Stable router schedules + logging |
| v0.3 | Configurable attention variants / FlashAttention |
| v0.4 | Public evaluation scripts (lm-eval-harness) |
| v0.5 | Reference instruction-tuned variant |
| v0.6 | Example specialization branches (e.g. translation) |
Community PRs are welcome, especially for:
- Evaluation & ablations vs vanilla MLP baselines
- New branch types and routing strategies
- Practical recipes for SFT / alignment on top of Veronica
11. License
This model and code are released under the Apache-2.0 license.
12. Citation
If you use Veronica-Polymorphic in your work, please cite:
```bibtex
@misc{veronica-2025,
  title        = {Veronica: Entropy-Regularized Polymorphic Branching for Adaptive Language Modeling},
  author       = {Emanuele D'Angelo},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/MhaWay/Veronica}}
}
```
13. Acknowledgments
- Mixture / routing inspiration from Switch Transformer, GLaM, and broader MoE literature.
- Dataset mixture ratios guided by codelion’s DataComp LM work.
- RoPE implementation adapted from GPT-NeoX-style implementations.