what's the best Q4 quant?

by SlavikF - opened Oct 1

Oct 1

•

Model	Memory	Comment
Q4_0	202 GB	legacy
Q4_1	224 GB	legacy
Q4_K_M	216 GB	?
IQ4_NL	202 GB	?
IQ4_XS	191 GB	?
UD‑Q4_K_XL	204 GB	Unsloth Dynamics

Also, there is MXFP4 quants available - 199 GB.

Can someone knowledgeable add comment about PROs & CONs of these quants?

System config:

Nvidia GPU (in my case 48GB VRAM) and most layers on CPU & RAM.

Few comments I found:

https://github.com/ggml-org/llama.cpp/wiki/Feature-matrix
- on CPU or Vulkan, I-quants are slower than K-quants of comparable size

shimmyshimmer

Unsloth AI org Oct 1

Usually I recommend the K_XL one!

CalvinZero

Oct 2

•

edited Oct 2

today I try UD-Q4_K_XL and UD-Q8_XL, the think process is very verbose.

with Q8_XL, just send hi get 1195 token response, most of them are think. for Chinese "你好" it response OK.

with UD-Q4_K_XL, hi and 你好 both cause 1300 ~ 1700 token. I try -temp 1.0, with and without --top-p 0.95 --top-k 40.

ubergarm/GLM-4.6-IQ5_K hi response 1300 token.

shimmyshimmer

Unsloth AI org Oct 3

today I try UD-Q4_K_XL and UD-Q8_XL, the think process is very verbose.

with Q8_XL, just send hi get 1195 token response, most of them are think. for Chinese "你好" it response OK.

with UD-Q4_K_XL, hi and 你好 both cause 1300 ~ 1700 token. I try -temp 1.0, with and without --top-p 0.95 --top-k 40.

ubergarm/GLM-4.6-IQ5_K hi response 1300 token.

How many times did you test it? I tried Q4_K_XL and got an average of ~1.2k-1.5k. I tried around 10 times

CalvinZero

Oct 3

I try more than 10 times, I guess this is normal for GLM-4.6

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment