# GLM-4.7-REAP-40p

A 40% expert-pruned GLM-4.7 (358B → 218B), produced with the Cerebras REAP method.

Compute sponsored by Prime Intellect, decentralized AI infrastructure for open research.
## Model Details
| Property | Value |
|---|---|
| Base Model | zai-org/GLM-4.7 (358B params, 160 experts) |
| Architecture | Mixture of Experts (MoE) with reasoning |
| Total Params | 218B (40% pruned) |
| Active Params | 32B (8 experts per token) |
| Experts per Layer | 96 (down from 160) |
| Method | REAP (Router-weighted Expert Activation Pruning) |
| Precision | BF16 |
| Size on Disk | ~407 GB |
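As a quick sanity check, the pruned expert layout can be read from the model config without downloading any weights. This is a minimal sketch; the MoE field names (`n_routed_experts`, `num_experts_per_tok`) are assumptions based on GLM-style MoE configs and may differ in the actual config:

```python
from transformers import AutoConfig

# Load only the config (no weights) to inspect the pruned MoE layout.
config = AutoConfig.from_pretrained("0xSero/GLM-4.7-REAP-40p", trust_remote_code=True)

# Field names are assumptions for GLM-style MoE configs; adjust to whatever the config actually uses.
print("experts per layer:", getattr(config, "n_routed_experts", "n/a"))        # expected: 96 after pruning
print("active experts per token:", getattr(config, "num_experts_per_tok", "n/a"))  # expected: 8
```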
## Compression Pipeline

```
zai-org/GLM-4.7 (358B, 668 GB BF16)
        ↓ REAP pruning (40%)
GLM-4.7-REAP-40p (218B, 407 GB)
        ↓ AutoRound W4A16
GLM-4.7-REAP-40p-W4A16-AutoRound (108 GB)  ← recommended for inference
```
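The second pipeline step uses Intel's AutoRound for W4A16 weight-only quantization. The sketch below shows roughly what such a pass looks like; the exact settings used for the published W4A16 artifact (bits, group size, export format) are not documented here, so treat these values as assumptions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_id = "0xSero/GLM-4.7-REAP-40p"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# W4A16 = 4-bit weights, 16-bit activations. group_size=128 is a common default, not a confirmed setting.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()
autoround.save_quantized("GLM-4.7-REAP-40p-W4A16-AutoRound", format="auto_round")
```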
## Pruning Configuration

Matching the Cerebras REAP defaults:
| Parameter | Value |
|---|---|
| samples_per_category | 1024 |
| model_max_length | 2048 |
| distance_measure | angular |
| seed | 42 |
| dataset | evol-codealpaca-v1 |
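For context, REAP ranks experts by a router-weighted activation saliency over a calibration set and drops the lowest-scoring ones in a single shot. The sketch below is an illustrative reading of that criterion, not the reference implementation; the exact score (here, the mean over an expert's routed tokens of its gate weight times its output norm) and all tensor names are assumptions, so see the REAP repository for the precise definition:

```python
import torch

def reap_saliency(gate_weights: torch.Tensor, expert_out_norms: torch.Tensor) -> torch.Tensor:
    """Illustrative REAP-style saliency per expert (an assumption, not the official code).

    gate_weights:     [num_tokens, num_experts] router probabilities, zero where an expert
                      was not selected for that token.
    expert_out_norms: [num_tokens, num_experts] L2 norm of each expert's output for the
                      tokens it processed, zero elsewhere.
    """
    weighted = gate_weights * expert_out_norms              # router-weighted activation per token/expert
    token_counts = (gate_weights > 0).sum(dim=0).clamp(min=1)
    return weighted.sum(dim=0) / token_counts               # mean over tokens routed to each expert

# Keep the top 60% of experts per layer (40% pruned), e.g. 96 of 160.
num_experts, keep = 160, 96
scores = reap_saliency(torch.rand(4096, num_experts), torch.rand(4096, num_experts))
keep_idx = torch.topk(scores, k=keep).indices.sort().values
print(keep_idx.shape)  # torch.Size([96])
```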
## Benchmarks
| Task | GLM-4.7-REAP-40p | GLM-4.7 (Base) |
|---|---|---|
| HumanEval | TBD | - |
| MBPP | TBD | - |
| GSM8K | TBD | - |
| ARC-Challenge | TBD | - |
| HellaSwag | TBD | - |
| MMLU | TBD | - |
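Results are still pending. A minimal sketch for reproducing them with EleutherAI's lm-evaluation-harness Python API follows; the task names and the `parallelize=True` multi-GPU flag are assumptions about the installed harness version, so check its documentation before relying on them:

```python
import lm_eval

# Hedged sketch: model_args and task names follow recent lm-evaluation-harness conventions.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=0xSero/GLM-4.7-REAP-40p,"
        "dtype=bfloat16,trust_remote_code=True,parallelize=True"
    ),
    tasks=["gsm8k", "arc_challenge", "hellaswag", "mmlu"],
    batch_size=8,
)
print(results["results"])
```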
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "0xSero/GLM-4.7-REAP-40p"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# BF16 weights are ~407 GB, so device_map="auto" shards the model across all available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Build a chat-formatted prompt and generate.
messages = [{"role": "user", "content": "Write a Python function to check if a number is prime."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Hardware Requirements
- Minimum VRAM: ~420GB (8x H100 80GB recommended)
- For smaller setups: Use the W4A16 quantized version (~60GB VRAM)
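For multi-GPU inference, a tensor-parallel vLLM setup is a common choice. The sketch below is a hedged example; whether your vLLM build supports this pruned GLM-style MoE architecture, and the `tensor_parallel_size=8` setting for an 8x H100 node, are assumptions:

```python
from vllm import LLM, SamplingParams

# Assumption: the installed vLLM version supports this GLM-style MoE architecture.
llm = LLM(
    model="0xSero/GLM-4.7-REAP-40p",
    tensor_parallel_size=8,   # shard across 8 GPUs (e.g. 8x H100 80GB)
    dtype="bfloat16",
    trust_remote_code=True,
)

outputs = llm.chat(
    [{"role": "user", "content": "Write a Python function to check if a number is prime."}],
    SamplingParams(max_tokens=512, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```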
## Citation

If you use this model, please cite the REAP paper:

```bibtex
@article{lasby-reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}
```
## Links

- REAP Paper (arXiv:2510.13999)
- REAP Blog
- REAP GitHub
- W4A16 Quantized Version
- Prime Intellect (Compute Sponsor)
## Acknowledgments
- Cerebras for developing the REAP pruning method
- Prime Intellect for providing 8x H200 compute for pruning and quantization
- Zhipu AI / zai-org for the base GLM-4.7 model