Qwen3-8B-FP8-KV

Introduction

This model was built by applying AMD Quark to Qwen/Qwen3-8B, using calibration samples from the Pile dataset.

Quantization Strategy

  • Quantized Layers: all linear layers except "lm_head" and ".mlp.experts." (see the configuration sketch after this list)
  • Weight: FP8 symmetric per-tensor
  • Activation: FP8 symmetric per-tensor
  • KV Cache: FP8 symmetric per-tensor
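
For reference, the sketch below shows roughly how such a recipe is expressed with Quark's PyTorch API. It is an illustration, not the exact script used for this checkpoint: the class names (ModelQuantizer, Config, QuantizationConfig, FP8E4M3PerTensorSpec) and the Pile calibration subset follow publicly documented Quark examples and may differ between Quark releases, and the FP8 KV-cache spec is only indicated in a comment; the complete recipe lives in Quark's quantize_quark.py example.

```python
# Illustrative sketch only. Class names and the calibration subset follow
# publicly documented Quark examples and may differ across Quark releases;
# the full recipe (including the FP8 KV-cache spec) is in quantize_quark.py.
from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from quark.torch import ModelQuantizer
from quark.torch.quantization import Config, QuantizationConfig, FP8E4M3PerTensorSpec

model_id = "Qwen/Qwen3-8B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Calibration data: the card only says "Pile dataset"; this validation subset
# is an assumed stand-in.
calib_texts = load_dataset("mit-han-lab/pile-val-backup", split="validation")["text"][:128]
tokenized = tokenizer(calib_texts, return_tensors="pt", padding=True,
                      truncation=True, max_length=512)
calib_dataloader = DataLoader(tokenized["input_ids"], batch_size=1, drop_last=True)

# FP8 (E4M3) symmetric per-tensor spec, statically calibrated with min/max observers.
fp8_spec = FP8E4M3PerTensorSpec(observer_method="min_max",
                                is_dynamic=False).to_quantization_spec()

# Same spec for the weights and input activations of every linear layer; lm_head
# is excluded. The KV cache reuses the FP8 spec on the k_proj/v_proj outputs in
# the full recipe (omitted here for brevity).
quant_config = Config(
    global_quant_config=QuantizationConfig(weight=fp8_spec, input_tensors=fp8_spec),
    exclude=["lm_head"],
)

quantizer = ModelQuantizer(quant_config)
quantized_model = quantizer.quantize_model(model, calib_dataloader)
```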

Deployment

Quark uses its own export format, which allows FP8-quantized models to be deployed efficiently with the vLLM backend.
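
As a minimal deployment sketch (not taken from this repository), the checkpoint can be loaded through vLLM's offline API; kv_cache_dtype="fp8" enables the FP8 KV cache, while the weight/activation quantization is read from the checkpoint's quantization config (depending on your vLLM version you may also need to pass a quantization argument explicitly).

```python
# Minimal vLLM sketch; arguments beyond the model ID may vary by vLLM version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="EliovpAI/Qwen3-8B-FP8-KV",
    kv_cache_dtype="fp8",  # store keys/values in FP8
)

prompts = ["Explain FP8 per-tensor quantization in one sentence."]
outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=64))
print(outputs[0].outputs[0].text)
```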

Evaluation

Quark currently uses perplexity (PPL) as the metric for accuracy loss before and after quantization. The specific PPL algorithm can be found in quantize_quark.py. The evaluation is run in pseudo-quantization mode, so the results may differ slightly from the accuracy of actual quantized inference; they are provided for reference only.
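
The exact PPL procedure is the one implemented in quantize_quark.py; as a rough, generic stand-in, a standard strided perplexity pass over wikitext2 looks like the sketch below (the window and stride values are assumptions, so it will not reproduce the reported numbers exactly).

```python
# Generic strided perplexity on wikitext2; this approximates, but is not,
# the exact algorithm in quantize_quark.py.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"  # or the quantized checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto",
                                             device_map="auto").eval()

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

max_length, stride = 4096, 4096  # assumed window; quantize_quark.py sets its own
seq_len = encodings.input_ids.size(1)
nlls, prev_end = [], 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end
    input_ids = encodings.input_ids[:, begin:end].to(model.device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # only score the new tokens in this window
    with torch.no_grad():
        nll = model(input_ids, labels=target_ids).loss
    nlls.append(nll * trg_len)  # approximate token-weighted average
    prev_end = end
    if end == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).sum() / prev_end)
print(f"wikitext2 perplexity: {ppl.item():.3f}")
```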

Evaluation scores

| Benchmark | Qwen3-8B | Qwen3-8B-FP8-KV (this model) |
|-----------|----------|------------------------------|
| Perplexity-wikitext2 | 9.531 | 9.708 |

Performance Summary

  • Accuracy Retention: 98.15% (only a 1.85% increase in wikitext2 perplexity)
  • Model Size: roughly 42% smaller than the FP16 baseline
  • Memory Efficiency: the FP8 KV cache halves per-token cache memory versus FP16, leaving room for longer contexts (see the sizing sketch below)
  • Hardware Optimization: tuned for AMD ROCm/HIP
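
To make the memory point concrete, the sketch below estimates KV-cache size at a 32k-token context. The architecture numbers (36 layers, 8 KV heads, head dim 128) are assumptions taken from the public Qwen3-8B config and should be checked against the checkpoint's config.json.

```python
# Back-of-the-envelope KV-cache sizing. Architecture values are assumptions
# taken from the public Qwen3-8B config; verify against config.json.
num_layers   = 36    # assumed
num_kv_heads = 8     # assumed (GQA)
head_dim     = 128   # assumed

def kv_cache_bytes(seq_len: int, bytes_per_elem: int) -> int:
    # 2x for keys and values, per layer, per KV head, per head dimension.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

for name, bytes_per_elem in [("FP16/BF16", 2), ("FP8", 1)]:
    gib = kv_cache_bytes(32_768, bytes_per_elem) / 2**30
    print(f"{name:9s} KV cache @ 32k tokens: {gib:.2f} GiB")
# FP8 halves the KV-cache footprint, so roughly twice the context (or batch)
# fits in the same memory budget.
```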

License

This model inherits the licensing terms of the base model, Qwen/Qwen3-8B.
