GLM-4.7-Flash-Marlin-MMFP4

MMFP4-quantized GLM-4.7-Flash — a 30B-A3B MoE model compressed to 4 bits per weight using GPTQ with actorder and Metal Marlin's E2M1 FP4 format.

| Metric | Value |
|---|---|
| Effective bits | 4.0 bpw |
| Compression | 4× vs FP16 |
| Model size | ~16 GB (vs ~60 GB FP16) |
| Parameters | 29.3B |
| Format | HuggingFace sharded safetensors |
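
The ~16 GB figure in the table above follows from simple arithmetic on the parameter count. The sketch below is illustrative only; the exact split between 4-bit and FP16 tensors is specific to this checkpoint, and the ~1B FP16 parameters are an assumption.

# Back-of-the-envelope size check for the table above (illustrative only).
# Assumption: roughly 1B parameters (embeddings, lm_head, norms, routers) stay FP16.
total_params = 29.3e9
fp16_params = 1.0e9
quant_params = total_params - fp16_params

packed_gb = quant_params * 4 / 8 / 1e9      # 4-bit packed weights        ~14.2 GB
scales_gb = quant_params / 128 * 2 / 1e9    # one FP16 scale per 128-group ~0.4 GB
fp16_gb = fp16_params * 2 / 1e9             # tensors kept at FP16         ~2.0 GB

print(packed_gb + scales_gb + fp16_gb)      # ≈ 16.6 GB, close to the ~16 GB shard total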

Model Description

This is a quantized version of zai-org/GLM-4.7-Flash, a 30B-class MoE model that balances performance and efficiency.

GLM-4.7-Flash features:

  • 30B-A3B MoE architecture (64 experts plus a shared expert, 2-4 active per token; routing is sketched after this list)
  • Multi-head Latent Attention (MLA) for 8× KV cache compression
  • State-of-the-art reasoning (91.6% on AIME 2025, 59.2% on SWE-bench Verified)
  • Bilingual (English + Chinese)
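
A minimal sketch of how top-k expert routing in such a layer works. This is illustrative Python only: the expert count and top-k come from the list above, but all module names and shapes are assumptions, not GLM's implementation.

import torch
import torch.nn as nn

def moe_forward(x, router, experts, shared_expert, top_k=2):
    # Illustrative top-k routing for a 64-expert + shared-expert MoE layer.
    # Not GLM's actual code; module names and shapes are assumed.
    probs = router(x).softmax(dim=-1)                 # [tokens, num_experts]
    weights, idx = probs.topk(top_k, dim=-1)          # pick the top_k experts per token
    weights = weights / weights.sum(dim=-1, keepdim=True)
    out = shared_expert(x)                            # the shared expert sees every token
    for t in range(x.shape[0]):                       # naive per-token dispatch
        for w, e in zip(weights[t].tolist(), idx[t].tolist()):
            out[t] = out[t] + w * experts[e](x[t])    # add the selected experts' outputs
    return out

# Toy usage: 4 tokens, hidden size 8, 64 tiny "experts"
hidden, n_experts = 8, 64
router, shared = nn.Linear(hidden, n_experts), nn.Linear(hidden, hidden)
experts = [nn.Linear(hidden, hidden) for _ in range(n_experts)]
y = moe_forward(torch.randn(4, hidden), router, experts, shared)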

Quantization Details

Quantized using MR-GPTQ (Metal Marlin GPTQ) with CUDA acceleration:

Method

  • Format: MMFP4 (E2M1 FP4) — Metal Marlin's native FP4 format
  • Quantization: GPTQ with actorder (activation-order column permutation; a simplified loop is sketched after this list)
  • Hessian calibration: Pre-computed Hessians for attention layers
  • Expert quantization: Identity Hessian with actorder (no calibration data for MoE experts)
  • Group size: 128
  • Hardware: NVIDIA RTX 3090 Ti (CUDA-accelerated Cholesky factorization)
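
The simplified loop below illustrates the idea: columns are permuted by descending Hessian diagonal, quantized one at a time to the E2M1 grid with an FP16 scale per 128-column group, and the rounding error is pushed onto the not-yet-quantized columns. This is a sketch only, not the MR-GPTQ implementation; the rounding helper, damping constant, and shapes are assumptions.

import torch

_E2M1 = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_to_e2m1(x):
    # Round each entry to the nearest representable E2M1 value (sign included).
    grid = torch.cat([-_E2M1.flip(0), _E2M1])
    idx = (x.unsqueeze(-1) - grid).abs().argmin(dim=-1)
    return grid[idx]

def gptq_actorder_quantize(W, H, group_size=128, damp=0.01):
    # Simplified GPTQ with activation ordering. W: [out, in], H: [in, in] Hessian proxy.
    n = W.shape[1]
    perm = torch.argsort(torch.diag(H), descending=True)   # actorder: most sensitive columns first
    W, H = W[:, perm].clone(), H[perm][:, perm].clone()
    H = H + damp * torch.diag(H).mean() * torch.eye(n)     # damping for numerical stability
    Hinv = torch.linalg.cholesky(
        torch.cholesky_inverse(torch.linalg.cholesky(H)), upper=True
    )
    Q = torch.zeros_like(W)
    for i in range(n):
        if i % group_size == 0:                             # new FP16 scale every 128 columns
            scale = W[:, i:i + group_size].abs().amax(dim=1, keepdim=True) / 6.0
            scale = scale.clamp(min=1e-8)                   # 6.0 is the largest |E2M1| value
        q = quantize_to_e2m1(W[:, i:i + 1] / scale) * scale
        Q[:, i:i + 1] = q
        err = (W[:, i:i + 1] - q) / Hinv[i, i]
        W[:, i + 1:] -= err @ Hinv[i:i + 1, i + 1:]         # push the error onto later columns
    return Q[:, torch.argsort(perm)]                        # undo the actorder permutation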

Quantization Statistics

| Component | Bit Width | Notes |
|---|---|---|
| Embeddings | FP16 | Full precision |
| LM Head | FP16 | Full precision |
| Attention (q/k/v/o) | 4-bit | GPTQ with Hessians |
| MoE Experts (64×) | 4-bit | GPTQ with actorder |
| Layer Norms | FP16 | Full precision |
| Router Weights | FP16 | Full precision |

  • Total tensors: 19,066
  • Shards: 48 safetensors files
  • Quantization time: ~20 minutes (RTX 3090 Ti)
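
A hypothetical filter that reproduces the precision assignment in the table above. The tensor-name patterns follow common transformers/GLM conventions and are assumptions, not the actual export script.

KEEP_FP16 = ("embed_tokens", "lm_head", "norm", "router", "mlp.gate.weight")

def target_precision(name: str) -> str:
    # Return "fp16" for tensors kept at full precision, "fp4" for everything quantized.
    if any(key in name for key in KEEP_FP16):
        return "fp16"
    return "fp4"   # attention q/k/v/o projections and all expert MLP weights

assert target_precision("model.layers.3.self_attn.q_proj.weight") == "fp4"
assert target_precision("model.layers.3.mlp.experts.17.down_proj.weight") == "fp4"
assert target_precision("model.embed_tokens.weight") == "fp16"
assert target_precision("model.layers.3.input_layernorm.weight") == "fp16"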

Files

GLM-4.7-Flash-Marlin-MMFP4/
├── model-00001-of-00048.safetensors   # Layer 0 (embeddings)
├── model-00002-of-00048.safetensors   # Layer 1
├── ...
├── model-00048-of-00048.safetensors   # Layer 47 + lm_head
├── model.safetensors.index.json       # Weight map
├── config.json                        # Model config
├── generation_config.json
├── tokenizer.json                     # Tokenizer
└── tokenizer_config.json

Usage

With Metal Marlin (Apple Silicon)

from metal_marlin import MarlinForCausalLM
from transformers import AutoTokenizer

# Load the MMFP4 checkpoint onto the Apple GPU (MPS backend)
model = MarlinForCausalLM.from_pretrained(
    "RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4",
    device="mps"
)
# The tokenizer is unchanged by quantization, so the upstream one is reused
tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-Flash")

prompt = "<|user|>\nExplain quantum computing in simple terms.\n<|assistant|>\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("mps")
output = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
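
If the upstream tokenizer config ships a chat template, the prompt can equivalently be built with apply_chat_template instead of writing the special tokens by hand:

messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("mps")
output = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))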

Tensor Format

Each quantized weight tensor has corresponding scale factors:

  • {name}.weight: Packed FP4 weights (uint8)
  • {name}.scales: FP16 per-group scales (group_size=128)
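
A minimal dequantization sketch for this layout. The nibble order, sign-bit convention, and scale shape below are assumptions; consult the Metal Marlin kernels for the actual packing.

import torch

_E2M1_MAG = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # non-negative E2M1 values

def dequantize_mmfp4(packed, scales, group_size=128):
    # packed: uint8 tensor holding two 4-bit codes per byte; scales: FP16 per-group scales.
    lo = (packed & 0x0F).long()                 # first code in the low nibble (assumed order)
    hi = ((packed >> 4) & 0x0F).long()          # second code in the high nibble
    codes = torch.stack([lo, hi], dim=-1).reshape(*packed.shape[:-1], -1)
    sign = (codes < 8).float() * 2 - 1          # assume bit 3 of each code is the sign bit
    vals = sign * _E2M1_MAG[codes & 0x7]        # map the remaining 3 bits to an E2M1 magnitude
    vals = vals.reshape(*vals.shape[:-1], -1, group_size)
    return (vals * scales.float().unsqueeze(-1)).reshape(*vals.shape[:-2], -1)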

Hardware Requirements

| Device | Memory | Notes |
|---|---|---|
| Apple M4 Max | 36 GB+ | Via Metal Marlin |
| Apple M2 Ultra | 36 GB+ | Via Metal Marlin |

Benchmarks

Original Model Performance (from Z.AI)

| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B | GPT-OSS-20B |
|---|---|---|---|
| AIME 2025 | 91.6 | 85.0 | 91.7 |
| GPQA | 75.2 | 73.4 | 71.5 |
| SWE-bench Verified | 59.2 | 22.0 | 34.0 |
| τ²-Bench | 79.5 | 49.0 | 47.7 |
| BrowseComp | 42.8 | 2.29 | 28.3 |

Quantized Model Notes

  • GPTQ with actorder reduces quality loss relative to plain round-to-nearest (RTN) quantization
  • Expected degradation: ~1-2% on benchmarks vs FP16
  • E2M1 FP4 format optimized for Metal Performance Shaders

Comparison with Trellis Quant

| Model | Format | Size | Bits | Method |
|---|---|---|---|---|
| GLM-4.7-Flash-Trellis-MM | Trellis | 14 GB | 3.78 bpw | EXL3-style mixed precision |
| This model | MMFP4 | 16 GB | 4.0 bpw | GPTQ + actorder |

Choose Trellis for the smaller footprint, and MMFP4 for a simpler tensor format and potentially better compatibility.

Limitations

  • Metal Marlin required for optimal inference on Apple Silicon
  • No speculative decoding yet
  • Quality loss: ~1-2% on benchmarks vs FP16 (typical for 4-bit quantization)

Citation

If you use this model, please cite the original GLM-4.5 paper:

@misc{glm2025glm45,
      title={GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models}, 
      author={GLM Team and Aohan Zeng and Xin Lv and others},
      year={2025},
      eprint={2508.06471},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.06471}, 
}

License

This quantized model inherits the MIT License from the original GLM-4.7-Flash model.
