what's the best Q4 quant?

#4
by SlavikF - opened
Quant        Size     Comment
Q4_0         202 GB   legacy
Q4_1         224 GB   legacy
Q4_K_M       216 GB   ?
IQ4_NL       202 GB   ?
IQ4_XS       191 GB   ?
UD-Q4_K_XL   204 GB   Unsloth Dynamic

There are also MXFP4 quants available, at 199 GB.

Can someone knowledgeable comment on the pros and cons of these quants?

System config:

  • Nvidia GPU (48 GB VRAM in my case), with most layers kept on CPU and system RAM.
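For a rough feel of what these sizes mean on a 48 GB card, here is a minimal Python sketch. It only does arithmetic on the file sizes listed above and says nothing about quality. The helper names (`ram_needed_gb`, `size_report`) are made up for illustration, and the RAM figure is an assumption-heavy lower bound: it ignores KV cache, compute buffers, and OS overhead.

```python
# File sizes in GB, copied from the table and the MXFP4 note above.
QUANT_SIZES_GB = {
    "Q4_0": 202,
    "Q4_1": 224,
    "Q4_K_M": 216,
    "IQ4_NL": 202,
    "IQ4_XS": 191,
    "UD-Q4_K_XL": 204,
    "MXFP4": 199,
}

VRAM_GB = 48  # from the system config above


def ram_needed_gb(size_gb, vram_gb=VRAM_GB):
    """Weights left for system RAM if the GPU held nothing but weights.

    Assumption: KV cache, compute buffers, and OS overhead are ignored,
    so the real requirement is higher than this number.
    """
    return max(size_gb - vram_gb, 0)


def size_report(sizes=QUANT_SIZES_GB):
    """Return (name, size, extra vs smallest, RAM needed), sorted by size."""
    smallest = min(sizes.values())
    rows = []
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1]):
        rows.append((name, size, size - smallest, ram_needed_gb(size)))
    return rows


if __name__ == "__main__":
    for name, size, extra, ram in size_report():
        print(f"{name:<11} {size:>4} GB  +{extra:>2} GB vs smallest  ~{ram} GB in RAM")
```

The spread between the smallest and largest option is 33 GB, which matters mostly for how much system RAM is left over after loading.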

A few comments I found:

Unsloth AI org

Usually I recommend the K_XL one!

Today I tried UD-Q4_K_XL and UD-Q8_XL; the thinking process is very verbose.

With Q8_XL, just sending "hi" gets a 1195-token response, most of it thinking. For the Chinese "你好" the response is OK.

With UD-Q4_K_XL, "hi" and "你好" both produce 1300 to 1700 tokens. I tried --temp 1.0, with and without --top-p 0.95 --top-k 40.

With ubergarm/GLM-4.6-IQ5_K, "hi" gets a 1300-token response.

Unsloth AI org

How many times did you test it? I tried Q4_K_XL around 10 times and got an average of roughly 1.2k-1.5k tokens.

I tried more than 10 times. I guess this is normal for GLM-4.6.
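Since the comparison here comes down to repeated runs of the same prompt, a small Python sketch for summarizing them may help keep the numbers comparable between quants. Everything in it is hypothetical: `generate` stands for whatever function wraps your backend and returns the completion-token count of one response, and no real measurements are included.

```python
import statistics


def summarize_token_counts(counts):
    """Summarize completion-token counts from repeated runs of one prompt."""
    if not counts:
        raise ValueError("need at least one run")
    return {
        "runs": len(counts),
        "min": min(counts),
        "max": max(counts),
        "mean": statistics.mean(counts),
        # Sample standard deviation needs at least two runs.
        "stdev": statistics.stdev(counts) if len(counts) > 1 else 0.0,
    }


def measure(generate, prompt, runs=10):
    """Call a hypothetical generate(prompt) -> token count `runs` times."""
    return summarize_token_counts([generate(prompt) for _ in range(runs)])
```

Running `measure` with the same prompt and sampling settings for each quant gives a mean and spread per quant, which is easier to compare than single responses.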
