ChessGPT – 432M

A decoder-only transformer trained to predict the next move in chess games using UCI notation. The model learns purely from move sequences (no board state, no evaluation) via next-token prediction on Lichess games.

Model details

| Property | Value |
|---|---|
| Architecture | LLaMA-style decoder-only transformer |
| Parameters | 432M |
| Context length | 256 tokens |
| Vocab size | 4,211 (UCI moves + 3 special tokens) |
| Training tokens | 7.87B |
| License | Apache 2.0 |

Architecture

  • d_model 1280, n_layers 21, n_heads 20 (head_dim 64), d_ff 3584
  • RMSNorm (pre-norm), Rotary Position Embeddings (RoPE), SwiGLU FFN
  • QK-Norm before RoPE (Gemma / DeepSeek-V2 practice)
  • No bias in linear layers, weight tying between embedding and output head
  • Scaled residual initialization: std / sqrt(2 * n_layers)
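As a sanity check, the sizes above can be combined into an approximate parameter count (a sketch: norm weights and other small extras are ignored, and the FFN is assumed to be a standard three-matrix SwiGLU):

```python
d_model, n_layers, d_ff, vocab = 1280, 21, 3584, 4211

embed = vocab * d_model                  # embedding, tied with the output head
attn_per_layer = 4 * d_model * d_model   # Q, K, V, O projections (no bias)
ffn_per_layer = 3 * d_model * d_ff       # SwiGLU: gate, up, down projections
total = embed + n_layers * (attn_per_layer + ffn_per_layer)

print(f"{total / 1e6:.1f}M parameters")  # ~432M
```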

Training

Data

Seven monthly snapshots of Lichess standard rated games (July 2025 to January 2026), filtered to games where both players are rated at least 1800 Elo. Games are converted to space-separated UCI move strings.

Datasets are streamed and interleaved from the Hugging Face Hub. Sequence packing concatenates games into fixed 256-token sequences to eliminate padding.
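The packing step can be sketched as follows (the token IDs and delimiter policy are assumptions; the idea is simply to concatenate tokenized games and slice fixed 256-token windows, so no padding token is ever needed):

```python
def pack_games(tokenized_games, seq_len=256, eos=2):
    """Concatenate games (each a list of token IDs), separating them
    with an EOS token, and yield fixed-length windows without padding."""
    buffer = []
    for game in tokenized_games:
        buffer.extend(game)
        buffer.append(eos)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
```

A game can therefore start mid-sequence; the EOS delimiter lets the model learn where one game ends and the next begins.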

Hyperparameters

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW (betas 0.9 / 0.95, weight decay 0.1) |
| Learning rate | 3e-4, cosine decay to 10% of peak |
| Warmup | 9,300 steps (linear) |
| Batch size | 256 × 256 tokens = 65,536 tokens/step |
| Gradient clipping | 1.0 |
| Precision | BF16 |
| Steps | 120,155 |
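The schedule above can be sketched as a single function of the step number (the exact shape of the author's implementation is an assumption; this is the conventional linear-warmup-plus-cosine form):

```python
import math

PEAK_LR = 3e-4
MIN_LR = PEAK_LR * 0.10   # decay to 10% of peak
WARMUP_STEPS = 9_300
TOTAL_STEPS = 120_155

def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + (PEAK_LR - MIN_LR) * 0.5 * (1 + math.cos(math.pi * progress))
```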

Tokenizer

Custom UCI tokenizer that maps every legal UCI move string to a unique integer:

| Range | Description | Count |
|---|---|---|
| 0 | `<PAD>` | 1 |
| 1 | `<BOS>` | 1 |
| 2 | `<EOS>` | 1 |
| 3 – 4,034 | Normal moves (src ≠ dst) | 4,032 |
| 4,035 – 4,210 | Promotion moves (file × direction × piece × color) | 176 |
| **Total** | | 4,211 |
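The counts in the table follow directly from the structure of UCI moves: 64 × 63 ordered square pairs give the normal moves, and promotions combine 8 source files, up to 3 destination files (push plus diagonal captures), 4 promotion pieces, and 2 colors. A quick enumeration (the exact ID assignment within each range is an assumption; only the counts are checked):

```python
files = "abcdefgh"
squares = [f + r for f in files for r in "12345678"]

# Normal moves: any ordered pair of distinct squares (64 * 63).
normal_moves = [s + d for s in squares for d in squares if s != d]

# Promotions: pawn advances or captures onto the last rank,
# promoting to queen, rook, bishop, or knight.
promotions = []
for src_rank, dst_rank in [("7", "8"), ("2", "1")]:  # white, black
    for i, f in enumerate(files):
        dst_files = {f}                       # straight push
        if i > 0:
            dst_files.add(files[i - 1])       # capture toward the a-file
        if i < 7:
            dst_files.add(files[i + 1])       # capture toward the h-file
        for df in sorted(dst_files):
            for piece in "qrbn":
                promotions.append(f + src_rank + df + dst_rank + piece)

print(len(normal_moves), len(promotions))  # 4032 176
```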

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "malcouffe/chessgpt", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "malcouffe/chessgpt", trust_remote_code=True
)

# Encode an opening (Italian Game)
moves = "e2e4 e7e5 g1f3 b8c6 f1c4"
input_ids = tokenizer.encode(moves, return_tensors="pt")

with torch.no_grad():
    logits = model(input_ids).logits

# Get top-5 predicted next moves
top5 = logits[0, -1].topk(5)
for score, idx in zip(top5.values, top5.indices):
    print(f"{tokenizer.decode([idx.item()]):>8s}  {score:.2f}")
```

Limitations

  • It has no access to board state: all chess knowledge is inferred from move sequences.
  • No RLHF or self-play refinement; this is a pure next-token prediction model.
  • Predictions can include illegal moves; filter them with python-chess at inference time (see the chessgpt-inference repo for legal-move masking during generation).
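A minimal sketch of that filtering step: keep the model's ranked predictions only if they are legal in the current position. The helper below is hypothetical (not from the chessgpt-inference repo); with python-chess, the legal set would come from `{m.uci() for m in board.legal_moves}`.

```python
def filter_to_legal(ranked_moves, legal_uci):
    """Keep the model's ranked UCI predictions that are legal,
    preserving the model's ranking order.

    ranked_moves: UCI strings, best first (e.g. from a top-k over logits).
    legal_uci:    set of legal UCI moves for the current position.
    """
    legal = set(legal_uci)
    return [m for m in ranked_moves if m in legal]

# After 1.e4 e5 2.Nf3 (black to move), e5e4 is illegal (the pawn is
# blocked); the hypothetical model ranking below gets filtered to
# legal moves only.
ranked = ["e5e4", "b8c6", "g8f6"]
legal = {"b8c6", "g8f6", "d7d6", "f8c5"}
print(filter_to_legal(ranked, legal))  # ['b8c6', 'g8f6']
```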

Citation

@misc{chessgpt2026,
  author       = {Matthieu Alcouffe},
  title        = {ChessGPT: A 432M Decoder-Only Transformer for UCI Move Prediction},
  year         = {2026},
  url          = {https://huggingface.co/malcouffe/chessgpt}
}