what's the best Q4 quant?
| Model | Memory | Comment |
|---|---|---|
| Q4_0 | 202 GB | legacy |
| Q4_1 | 224 GB | legacy |
| Q4_K_M | 216 GB | ? |
| IQ4_NL | 202 GB | ? |
| IQ4_XS | 191 GB | ? |
| UD-Q4_K_XL | 204 GB | Unsloth Dynamic |
Also, there are MXFP4 quants available (199 GB).
Can someone knowledgeable comment on the pros & cons of these quants?
System config:
- Nvidia GPU (48 GB VRAM in my case), with most layers offloaded to CPU & RAM.
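For a setup like this (48 GB VRAM, rest on CPU & RAM), the `-ngl` value for llama.cpp's partial offload can be estimated with a back-of-envelope calculation. A minimal sketch; the layer count and overhead reserve below are assumptions, not values from this thread — check your GGUF's metadata for the real layer count:

```shell
# Rough sketch: estimate how many layers fit in VRAM for partial offload.
# All numbers are assumptions and vary by model and context size.
model_mb=$((216 * 1024))   # Q4_K_M size from the table above
layers=93                  # assumed GLM-4.6 layer count; check the GGUF metadata
vram_mb=$((48 * 1024))     # total VRAM
reserve_mb=$((6 * 1024))   # rough guess for KV cache + compute buffers
per_layer_mb=$(( model_mb / layers ))
ngl=$(( (vram_mb - reserve_mb) / per_layer_mb ))
echo "try: llama-cli -m GLM-4.6-Q4_K_M.gguf -ngl $ngl --temp 1.0 --top-p 0.95 --top-k 40"
```

`-ngl`, `--temp`, `--top-p`, and `--top-k` are standard llama.cpp flags; start from the estimate and adjust `-ngl` down if you hit out-of-memory errors.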
A few comments I found:
- https://github.com/ggml-org/llama.cpp/wiki/Feature-matrix
- on CPU or Vulkan, I-quants are slower than K-quants of comparable size
Usually I recommend the K_XL one!
Today I tried UD-Q4_K_XL and UD-Q8_XL; the thinking process is very verbose.
With Q8_XL, just sending "hi" gets a 1195-token response, most of it thinking. For the Chinese "你好" it responds OK.
With UD-Q4_K_XL, "hi" and "你好" both produce 1300-1700 tokens. I tried --temp 1.0, with and without --top-p 0.95 --top-k 40.
ubergarm/GLM-4.6-IQ5_K: "hi" gets a ~1300-token response.
How many times did you test it? I tried Q4_K_XL and got an average of ~1.2k-1.5k tokens. I tried around 10 times.
I tried more than 10 times; I guess this is normal for GLM-4.6.
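Since response length varies a lot run-to-run, averages like "~1.2k-1.5k over ~10 runs" are easier to compare if summarized the same way. A small sketch; the token counts below are made-up placeholders, not measurements from this thread:

```python
# Summarize per-run response token counts to compare verbosity between quants.
# The run data here is hypothetical -- substitute your own measurements.
from statistics import mean

def summarize(counts):
    """Return (min, mean, max) of per-run token counts."""
    return min(counts), round(mean(counts)), max(counts)

# Made-up example runs for two quants:
runs = {
    "UD-Q4_K_XL": [1300, 1450, 1700, 1380, 1520],
    "UD-Q8_XL":   [1195, 1240, 1180, 1260, 1210],
}
for quant, counts in runs.items():
    lo, avg, hi = summarize(counts)
    print(f"{quant}: min={lo} mean={avg} max={hi} over {len(counts)} runs")
```

Reporting min/mean/max over the same prompt set makes it clearer whether a quant is actually more verbose or the runs just have high variance.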