Testing report on Intel Xeon W5-3425

#1
by SlavikF - opened

System:

  • Intel Xeon W5-3425, 12 cores
  • DDR5-4800 RAM (8 channels × 64 GB); mlc reports ~190 GB/s (see the quick sanity check after this list)
  • Ubuntu 24.04
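
For reference, the quick sanity check mentioned above; this is just arithmetic on the stated configuration, nothing extra measured:

# Theoretical peak bandwidth of 8-channel DDR5-4800:
# each channel moves 8 bytes per transfer at 4800 MT/s.
channels = 8
transfers_per_sec = 4800e6      # 4800 MT/s
bytes_per_transfer = 8

peak_gb_s = channels * transfers_per_sec * bytes_per_transfer / 1e9
print(f"theoretical peak: {peak_gb_s:.1f} GB/s")              # ~307.2 GB/s
print(f"mlc measured:     190 GB/s ({190 / peak_gb_s:.0%} of peak)")

So ~190 GB/s is roughly 62% of the theoretical peak, which is a plausible real-world number for mlc.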

llama.cpp:

  • commit 5edfe782a9d7dc1b717f9d132c42404c7a517e17 (HEAD -> qwen3_next, origin/qwen3_next)
  • Date: Thu Oct 23 21:10:58 2025 +0200

Running:

./llama-server \
      --hf-repo lefromage/Qwen3-Next-80B-A3B-Thinking-GGUF:MXFP4_MOE \
      --alias "local-qwen3-next80b" \
      --ctx-size 32768 \
      --host 0.0.0.0 --port 38000
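
In case it helps anyone reproducing this: llama-server exposes an OpenAI-compatible HTTP API, so the running server can be queried roughly like this (a minimal sketch; the prompt, max_tokens, and the use of the requests library are just examples, while the port and alias match the command above):

import requests

# Query the llama-server started above via its OpenAI-compatible chat endpoint.
resp = requests.post(
    "http://localhost:38000/v1/chat/completions",
    json={
        "model": "local-qwen3-next80b",   # matches --alias above
        "messages": [{"role": "user", "content": "Explain NUMA in one paragraph."}],
        "max_tokens": 512,
    },
)
print(resp.json()["choices"][0]["message"]["content"])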

I asked a few computer-related queries and got good-quality replies.
Each query used 3,000-8,000 tokens.

Performance is slow, but I guess it's expected at this point:

build: 7260 (5edfe782) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
system info: n_threads = 12, n_threads_batch = 12, total_threads = 12

system_info: n_threads = 12 (n_threads_batch = 12) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 |
 F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | AMX_INT8 = 1 | 
 LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
...

prompt eval time =  338841.52 ms /  2798 tokens (  121.10 ms per token,     8.26 tokens per second)
       eval time = 1522746.65 ms /  5079 tokens (  299.81 ms per token,     3.34 tokens per second)
      total time = 1861588.18 ms /  7877 tokens

Yeah, the new operations are CPU-only right now, so performance will be slow until a later PR adds CUDA support for them (or something along those lines).

On the same system, when I run ggml-org/gpt-oss-120b-GGUF on CPU only with the same quantization (MXFP4), I get ~16 t/s token generation (TG).

And gpt-oss-120b has 5B active parameters, compared to only 3B active parameters for Qwen3-Next-80B.
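
A back-of-envelope way to put both numbers against the memory-bandwidth ceiling (rough sketch only: it assumes ~4.25 bits per weight for MXFP4 including block scales, that every active parameter is read once per generated token, and it ignores KV-cache/attention traffic):

# Rough bandwidth-bound token-generation ceiling vs. measured numbers.
BW_GB_S = 190                  # mlc-measured bandwidth on this machine
BYTES_PER_PARAM = 4.25 / 8     # MXFP4: ~4.25 bits/weight with shared scales

for name, active_params, measured_tps in [
    ("Qwen3-Next-80B-A3B", 3e9, 3.34),
    ("gpt-oss-120b",       5e9, 16.0),
]:
    gb_per_token = active_params * BYTES_PER_PARAM / 1e9
    ceiling_tps = BW_GB_S / gb_per_token
    print(f"{name}: ceiling ~{ceiling_tps:.0f} t/s, measured {measured_tps} t/s")

Both runs land far below their bandwidth ceilings, so generation here looks compute-bound rather than bandwidth-bound; the much larger gap for Qwen3-Next is consistent with the new operations not having optimized kernels yet.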

I think this is something that will evolve for the better.
I am testing on an M4 Max.
