Testing report on Intel Xeon W5-3425
#1 by SlavikF
System:
- Intel Xeon W5-3425, 12 cores
- DDR5-4800 RAM (8 channels * 64GB); Intel MLC reports ~190 GB/s (measurement sketch below)
- Ubuntu 24.04
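The bandwidth number above comes from Intel Memory Latency Checker (MLC). A minimal sketch of the invocation, assuming a recent MLC build (it usually needs root to toggle hardware prefetchers):
sudo ./mlc --max_bandwidth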
llama.cpp:
- commit 5edfe782a9d7dc1b717f9d132c42404c7a517e17 (HEAD -> qwen3_next, origin/qwen3_next)
- Date: Thu Oct 23 21:10:58 2025 +0200
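For anyone reproducing the build: a standard llama.cpp CMake flow at that commit would look roughly like the following (assuming the qwen3_next branch is reachable from your configured remote; the exact remote/fork may differ):
git fetch origin qwen3_next
git checkout 5edfe782a9d7dc1b717f9d132c42404c7a517e17
cmake -B build
cmake --build build --config Release -j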
Running:
./llama-server \
--hf-repo lefromage/Qwen3-Next-80B-A3B-Thinking-GGUF:MXFP4_MOE \
--alias "local-qwen3-next80b" \
--ctx-size 32768 \
--host 0.0.0.0 --port 38000
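Since llama-server exposes an OpenAI-compatible API, queries like the ones described below can be sent with, for example (illustrative request; the prompt here is made up, and the model name matches the --alias above):
curl http://localhost:38000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local-qwen3-next80b", "messages": [{"role": "user", "content": "Explain CPU cache associativity."}]}'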
I asked a few computer-related queries and got good-quality replies, using 3000-8000 tokens per query.
Performance is slow, but I guess that's expected at this point:
build: 7260 (5edfe782) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
system info: n_threads = 12, n_threads_batch = 12, total_threads = 12
system_info: n_threads = 12 (n_threads_batch = 12) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 |
F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | AMX_INT8 = 1 |
LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
...
prompt eval time = 338841.52 ms / 2798 tokens ( 121.10 ms per token, 8.26 tokens per second)
eval time = 1522746.65 ms / 5079 tokens ( 299.81 ms per token, 3.34 tokens per second)
total time = 1861588.18 ms / 7877 tokens
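To put those numbers in perspective: generation ran at 5079 tokens / 1522.75 s ≈ 3.34 t/s, and the whole ~7900-token exchange took ≈ 1861.6 s, i.e. about 31 minutes.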
Yeah, the new operations are CPU-only right now, so performance will be slow until a follow-up PR adds CUDA support for them.
For comparison, on the same system, running ggml-org/gpt-oss-120b-GGUF CPU-only with the same quantization (MXFP4), I'm getting TG: ~16 t/s.
And gpt-oss-120b has 5B active parameters, compared to only 3B for Qwen3-Next-80B.
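Rough back-of-envelope (my arithmetic, assuming MXFP4 is ~4.25 bits per weight): 3B active parameters is ~1.6 GB of weights read per generated token, so 190 GB/s of memory bandwidth would allow on the order of ~100 t/s if TG were purely bandwidth-bound. The observed 3.34 t/s is ~35x under that ceiling, which is consistent with the new ops being unoptimized CPU code rather than memory-bound.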
