# GLM-4.7-REAP-40p

A 40% expert-pruned GLM-4.7 (358B → 218B), produced with the Cerebras REAP method.

Compute sponsored by Prime Intellect, decentralized AI infrastructure for open research.
## Model Details
| Property | Value |
|---|---|
| Base Model | zai-org/GLM-4.7 (358B params, 160 experts) |
| Architecture | Mixture of Experts (MoE) with reasoning |
| Total Params | 218B (40% pruned) |
| Active Params | 32B (8 experts per token) |
| Experts per Layer | 96 (down from 160) |
| Method | REAP (Router-weighted Expert Activation Pruning) |
| Precision | BF16 |
| Size on Disk | ~407 GB |
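As a quick sanity check, the pruned expert layout can be read from the model config without downloading any weights. This is a minimal sketch; the MoE field names (`n_routed_experts`, `num_experts_per_tok`) are assumptions based on GLM-style MoE configs and may differ in the actual config:

```python
from transformers import AutoConfig

# Load only the config (no weights) to inspect the pruned MoE layout.
config = AutoConfig.from_pretrained("0xSero/GLM-4.7-REAP-40p", trust_remote_code=True)

# Field names are assumptions for GLM-style MoE configs; adjust to whatever the config actually uses.
print("experts per layer:", getattr(config, "n_routed_experts", "n/a"))        # expected: 96 after pruning
print("active experts per token:", getattr(config, "num_experts_per_tok", "n/a"))  # expected: 8
```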
## Compression Pipeline

```
zai-org/GLM-4.7 (358B, 668 GB BF16)
        ↓ REAP pruning (40%)
GLM-4.7-REAP-40p (218B, 407 GB)
        ↓ AutoRound W4A16
GLM-4.7-REAP-40p-W4A16-AutoRound (108 GB)  ← recommended for inference
```
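The second pipeline step uses Intel's AutoRound for W4A16 weight-only quantization. The sketch below shows roughly what such a pass looks like; the exact settings used for the published W4A16 artifact (bits, group size, export format) are not documented here, so treat these values as assumptions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_id = "0xSero/GLM-4.7-REAP-40p"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# W4A16 = 4-bit weights, 16-bit activations. group_size=128 is a common default, not a confirmed setting.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()
autoround.save_quantized("GLM-4.7-REAP-40p-W4A16-AutoRound", format="auto_round")
```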
## Pruning Configuration

Matching the Cerebras REAP defaults:
| Parameter | Value |
|---|---|
| samples_per_category | 1024 |
| model_max_length | 2048 |
| distance_measure | angular |
| seed | 42 |
| dataset | evol-codealpaca-v1 |
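For context, REAP ranks experts by a router-weighted activation saliency over a calibration set and drops the lowest-scoring ones in a single shot. The sketch below is an illustrative reading of that criterion, not the reference implementation; the exact score (here, the mean over an expert's routed tokens of its gate weight times its output norm) and all tensor names are assumptions, so see the REAP repository for the precise definition:

```python
import torch

def reap_saliency(gate_weights: torch.Tensor, expert_out_norms: torch.Tensor) -> torch.Tensor:
    """Illustrative REAP-style saliency per expert (an assumption, not the official code).

    gate_weights:     [num_tokens, num_experts] router probabilities, zero where an expert
                      was not selected for that token.
    expert_out_norms: [num_tokens, num_experts] L2 norm of each expert's output for the
                      tokens it processed, zero elsewhere.
    """
    weighted = gate_weights * expert_out_norms              # router-weighted activation per token/expert
    token_counts = (gate_weights > 0).sum(dim=0).clamp(min=1)
    return weighted.sum(dim=0) / token_counts               # mean over tokens routed to each expert

# Keep the top 60% of experts per layer (40% pruned), e.g. 96 of 160.
num_experts, keep = 160, 96
scores = reap_saliency(torch.rand(4096, num_experts), torch.rand(4096, num_experts))
keep_idx = torch.topk(scores, k=keep).indices.sort().values
print(keep_idx.shape)  # torch.Size([96])
```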
## Benchmarks
| Task | GLM-4.7-REAP-40p | GLM-4.7 (Base) |
|---|---|---|
| HumanEval | TBD | - |
| MBPP | TBD | - |
| GSM8K | TBD | - |
| ARC-Challenge | TBD | - |
| HellaSwag | TBD | - |
| MMLU | TBD | - |
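Results are still pending. A minimal sketch for reproducing them with EleutherAI's lm-evaluation-harness Python API follows; the task names and the `parallelize=True` multi-GPU flag are assumptions about the installed harness version, so check its documentation before relying on them:

```python
import lm_eval

# Hedged sketch: model_args and task names follow recent lm-evaluation-harness conventions.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=0xSero/GLM-4.7-REAP-40p,"
        "dtype=bfloat16,trust_remote_code=True,parallelize=True"
    ),
    tasks=["gsm8k", "arc_challenge", "hellaswag", "mmlu"],
    batch_size=8,
)
print(results["results"])
```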
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "0xSero/GLM-4.7-REAP-40p"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# BF16 weights are ~407 GB, so device_map="auto" shards the model across all available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Build a chat-formatted prompt and generate.
messages = [{"role": "user", "content": "Write a Python function to check if a number is prime."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Hardware Requirements
- Minimum VRAM: ~420GB (8x H100 80GB recommended)
- For smaller setups: Use the W4A16 quantized version (~60GB VRAM)
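For multi-GPU inference, a tensor-parallel vLLM setup is a common choice. The sketch below is a hedged example; whether your vLLM build supports this pruned GLM-style MoE architecture, and the `tensor_parallel_size=8` setting for an 8x H100 node, are assumptions:

```python
from vllm import LLM, SamplingParams

# Assumption: the installed vLLM version supports this GLM-style MoE architecture.
llm = LLM(
    model="0xSero/GLM-4.7-REAP-40p",
    tensor_parallel_size=8,   # shard across 8 GPUs (e.g. 8x H100 80GB)
    dtype="bfloat16",
    trust_remote_code=True,
)

outputs = llm.chat(
    [{"role": "user", "content": "Write a Python function to check if a number is prime."}],
    SamplingParams(max_tokens=512, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```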
## Citation

If you use this model, please cite the REAP paper:

```bibtex
@article{lasby-reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}
```
## Links

- REAP Paper (arXiv:2510.13999)
- REAP Blog
- REAP GitHub
- W4A16 Quantized Version
- Prime Intellect (Compute Sponsor)
## Acknowledgments
- Cerebras for developing the REAP pruning method
- Prime Intellect for providing 8x H200 compute for pruning and quantization
- Zhipu AI / zai-org for the base GLM-4.7 model