# GLM-4.7-REAP-40p

A 40% expert-pruned version of GLM-4.7 (358B → 218B parameters), produced with the Cerebras REAP method.

πŸ™ Compute sponsored by Prime Intellect - Decentralized AI infrastructure for open research.

## Model Details

| Property | Value |
|----------|-------|
| Base Model | zai-org/GLM-4.7 (358B params, 160 experts) |
| Architecture | Mixture of Experts (MoE) with reasoning |
| Total Params | 218B (40% pruned) |
| Active Params | 32B (8 experts per token) |
| Experts per Layer | 96 (down from 160) |
| Method | REAP (Router-weighted Expert Activation Pruning) |
| Precision | BF16 |
| Size on Disk | ~407 GB |
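
Note that the 40% figure refers to experts: 64 of the 160 experts in each layer are removed, while attention and other non-expert weights are kept, so total parameters shrink by slightly less than 40% (358B → 218B is about 39%). A quick sanity check of the table's numbers:

```python
# 40% of routed experts are removed per layer...
experts_before, experts_after = 160, 96
print(f"experts pruned: {1 - experts_after / experts_before:.0%}")       # 40%

# ...while non-expert weights stay, so the total shrinks slightly less.
params_before, params_after = 358e9, 218e9
print(f"total params removed: {1 - params_after / params_before:.0%}")   # 39%
```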

## Compression Pipeline

```
zai-org/GLM-4.7 (358B, 668 GB BF16)
    ↓ REAP Pruning (40%)
GLM-4.7-REAP-40p (218B, 407 GB)
    ↓ AutoRound W4A16
GLM-4.7-REAP-40p-W4A16-AutoRound (108 GB) ← Recommended for inference
```

## Pruning Configuration

Matching Cerebras defaults:

| Parameter | Value |
|-----------|-------|
| `samples_per_category` | 1024 |
| `model_max_length` | 2048 |
| `distance_measure` | angular |
| `seed` | 42 |
| `dataset` | evol-codealpaca-v1 |
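
For intuition: REAP scores each expert by how much it actually contributes on a calibration set, roughly the router gate weight times the norm of the expert's output, and the lowest-scoring experts in each layer are dropped in one shot. The sketch below illustrates that scoring idea only; it is not the Cerebras implementation, and the tensor layout and function names are assumptions made for the example.

```python
import torch

def reap_expert_saliency(gate_probs: torch.Tensor,
                         expert_outputs: torch.Tensor) -> torch.Tensor:
    """Score experts by router-weighted activation norm over a calibration batch.

    gate_probs:     [num_tokens, num_experts] router weights after top-k masking
                    (zero for experts a token was not routed to).
    expert_outputs: [num_tokens, num_experts, hidden] per-expert outputs
                    (zeros for experts a token was not routed to).
    """
    # ||E_j(x)||_2 for each (token, expert) pair.
    out_norms = expert_outputs.norm(dim=-1)        # [num_tokens, num_experts]
    # Router-weighted activation g_j(x) * ||E_j(x)||, averaged over tokens.
    return (gate_probs * out_norms).mean(dim=0)    # [num_experts]

def experts_to_keep(saliency: torch.Tensor, prune_fraction: float = 0.4) -> torch.Tensor:
    """Indices of experts surviving a one-shot prune (e.g. 160 -> 96)."""
    num_keep = int(round(saliency.numel() * (1 - prune_fraction)))
    return torch.topk(saliency, num_keep).indices.sort().values
```

The actual pipeline then rebuilds each MoE layer with only the surviving experts and the matching router rows (160 → 96 per layer here).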

## Benchmarks

| Task | GLM-4.7-REAP-40p | GLM-4.7 (Base) |
|------|------------------|----------------|
| HumanEval | TBD | - |
| MBPP | TBD | - |
| GSM8K | TBD | - |
| ARC-Challenge | TBD | - |
| HellaSwag | TBD | - |
| MMLU | TBD | - |

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "0xSero/GLM-4.7-REAP-40p"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",           # shard the 218B weights across all visible GPUs
    torch_dtype=torch.bfloat16,  # native precision of this checkpoint
    trust_remote_code=True,
)

# Build a chat-formatted prompt and generate.
messages = [{"role": "user", "content": "Write a Python function to check if a number is prime."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
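
If you don't have enough GPU memory for the BF16 weights, the quantized variant `0xSero/GLM-4.7-REAP-40p-W4A16-AutoRound` (~108 GB) is the recommended option; see that repo's model card for loading instructions.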

## Hardware Requirements
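
The BF16 weights alone occupy roughly their on-disk size in accelerator memory, so full-precision inference needs a multi-GPU node (for example, 8× 80 GB GPUs) with headroom left for the KV cache and activations. A rough weight-only estimate (an illustrative calculation, not a measured requirement):

```python
params = 218e9                 # total parameters after pruning

bf16_bytes  = params * 2       # 2 bytes per BF16 weight
w4a16_bytes = params * 0.5     # 4 bits per weight

print(f"BF16 weights : ~{bf16_bytes / 1e9:.0f} GB")   # ~436 GB (~406 GiB, matching the ~407 GB on disk)
print(f"W4A16 weights: ~{w4a16_bytes / 1e9:.0f} GB")  # ~109 GB, close to the 108 GB quantized artifact
```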

## Citation

If you use this model, please cite the REAP paper:

```bibtex
@article{lasby-reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}
```

## Links

- Base model: [zai-org/GLM-4.7](https://huggingface.co/zai-org/GLM-4.7)
- Quantized variant: [0xSero/GLM-4.7-REAP-40p-W4A16-AutoRound](https://huggingface.co/0xSero/GLM-4.7-REAP-40p-W4A16-AutoRound)
- REAP paper: [arXiv:2510.13999](https://arxiv.org/abs/2510.13999)

## Acknowledgments

Thanks to Prime Intellect for sponsoring the compute used for this run, and to Cerebras for the REAP method.
