ChessGPT – 432M

A decoder-only transformer trained to predict the next move in chess games using UCI notation. The model learns purely from move sequences (no board state, no evaluation) via next-token prediction on Lichess games.

Model details

| Property | Value |
|---|---|
| Architecture | LLaMA-style decoder-only transformer |
| Parameters | 432M |
| Context length | 256 tokens |
| Vocab size | 4,211 (UCI moves + 3 special tokens) |
| Training tokens | 7.87B |
| License | Apache 2.0 |

Architecture

  • d_model 1280, n_layers 21, n_heads 20 (head_dim 64), d_ff 3584
  • RMSNorm (pre-norm), Rotary Position Embeddings (RoPE), SwiGLU FFN
  • QK-Norm before RoPE (Gemma / DeepSeek-V2 practice)
  • No bias in linear layers, weight tying between embedding and output head
  • Scaled residual initialization: std / sqrt(2 * n_layers)
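As a sanity check, the sizes above can be combined into an approximate parameter count (a sketch: norm weights and other small extras are ignored, and the FFN is assumed to be a standard three-matrix SwiGLU):

```python
d_model, n_layers, d_ff, vocab = 1280, 21, 3584, 4211

embed = vocab * d_model                  # embedding, tied with the output head
attn_per_layer = 4 * d_model * d_model   # Q, K, V, O projections (no bias)
ffn_per_layer = 3 * d_model * d_ff       # SwiGLU: gate, up, down projections
total = embed + n_layers * (attn_per_layer + ffn_per_layer)

print(f"{total / 1e6:.1f}M parameters")  # ~432M
```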

Training

Data

Seven monthly snapshots of Lichess standard rated games (July 2025 to January 2026), filtered to games where both players are rated at least 1800 Elo. Games are converted to space-separated UCI move strings.

Datasets are streamed and interleaved from the Hugging Face Hub. Sequence packing concatenates games into fixed 256-token sequences to eliminate padding.
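The packing step can be sketched as follows (the token IDs and delimiter policy are assumptions; the idea is simply to concatenate tokenized games and slice fixed 256-token windows, so no padding token is ever needed):

```python
def pack_games(tokenized_games, seq_len=256, eos=2):
    """Concatenate games (each a list of token IDs), separating them
    with an EOS token, and yield fixed-length windows without padding."""
    buffer = []
    for game in tokenized_games:
        buffer.extend(game)
        buffer.append(eos)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
```

A game can therefore start mid-sequence; the EOS delimiter lets the model learn where one game ends and the next begins.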

Hyperparameters

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW (betas 0.9 / 0.95, weight decay 0.1) |
| Learning rate | 3e-4, cosine decay to 10% of peak |
| Warmup | 9,300 steps (linear) |
| Batch size | 256 × 256 tokens = 65,536 tokens/step |
| Gradient clipping | 1.0 |
| Precision | BF16 |
| Steps | 120,155 |
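The schedule above can be sketched as a single function of the step number (the exact shape of the author's implementation is an assumption; this is the conventional linear-warmup-plus-cosine form):

```python
import math

PEAK_LR = 3e-4
MIN_LR = PEAK_LR * 0.10   # decay to 10% of peak
WARMUP_STEPS = 9_300
TOTAL_STEPS = 120_155

def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + (PEAK_LR - MIN_LR) * 0.5 * (1 + math.cos(math.pi * progress))
```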

Tokenizer

Custom UCI tokenizer that maps every legal UCI move string to a unique integer:

| Range | Description | Count |
|---|---|---|
| 0 | `<PAD>` | 1 |
| 1 | `<BOS>` | 1 |
| 2 | `<EOS>` | 1 |
| 3 – 4,034 | Normal moves (src ≠ dst) | 4,032 |
| 4,035 – 4,210 | Promotion moves (file × direction × piece × color) | 176 |
| **Total** | | 4,211 |
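The counts in the table follow directly from the structure of UCI moves: 64 × 63 ordered square pairs give the normal moves, and promotions combine 8 source files, up to 3 destination files (push plus diagonal captures), 4 promotion pieces, and 2 colors. A quick enumeration (the exact ID assignment within each range is an assumption; only the counts are checked):

```python
files = "abcdefgh"
squares = [f + r for f in files for r in "12345678"]

# Normal moves: any ordered pair of distinct squares (64 * 63).
normal_moves = [s + d for s in squares for d in squares if s != d]

# Promotions: pawn advances or captures onto the last rank,
# promoting to queen, rook, bishop, or knight.
promotions = []
for src_rank, dst_rank in [("7", "8"), ("2", "1")]:  # white, black
    for i, f in enumerate(files):
        dst_files = {f}                       # straight push
        if i > 0:
            dst_files.add(files[i - 1])       # capture toward the a-file
        if i < 7:
            dst_files.add(files[i + 1])       # capture toward the h-file
        for df in sorted(dst_files):
            for piece in "qrbn":
                promotions.append(f + src_rank + df + dst_rank + piece)

print(len(normal_moves), len(promotions))  # 4032 176
```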

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "malcouffe/chessgpt", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "malcouffe/chessgpt", trust_remote_code=True
)

# Encode an opening (Italian Game)
moves = "e2e4 e7e5 g1f3 b8c6 f1c4"
input_ids = tokenizer.encode(moves, return_tensors="pt")

with torch.no_grad():
    logits = model(input_ids).logits

# Get top-5 predicted next moves
top5 = logits[0, -1].topk(5)
for score, idx in zip(top5.values, top5.indices):
    print(f"{tokenizer.decode([idx.item()]):>8s}  {score:.2f}")
```

Limitations

  • It has no access to board state: all chess knowledge is inferred from move sequences.
  • No RLHF or self-play refinement; this is a pure next-token prediction model.
  • Predictions can include illegal moves; filter them with python-chess at inference time (see the chessgpt-inference repo for legal-move masking during generation).
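A minimal sketch of that filtering step: keep the model's ranked predictions only if they are legal in the current position. The helper below is hypothetical (not from the chessgpt-inference repo); with python-chess, the legal set would come from `{m.uci() for m in board.legal_moves}`.

```python
def filter_to_legal(ranked_moves, legal_uci):
    """Keep the model's ranked UCI predictions that are legal,
    preserving the model's ranking order.

    ranked_moves: UCI strings, best first (e.g. from a top-k over logits).
    legal_uci:    set of legal UCI moves for the current position.
    """
    legal = set(legal_uci)
    return [m for m in ranked_moves if m in legal]

# After 1.e4 e5 2.Nf3 (black to move), e5e4 is illegal (the pawn is
# blocked); the hypothetical model ranking below gets filtered to
# legal moves only.
ranked = ["e5e4", "b8c6", "g8f6"]
legal = {"b8c6", "g8f6", "d7d6", "f8c5"}
print(filter_to_legal(ranked, legal))  # ['b8c6', 'g8f6']
```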

Citation

@misc{chessgpt2026,
  author       = {Matthieu Alcouffe},
  title        = {ChessGPT: A 432M Decoder-Only Transformer for UCI Move Prediction},
  year         = {2026},
  url          = {https://huggingface.co/malcouffe/chessgpt}
}