# Model Card for qwen2.5-3b-instruct-dpo-gguf

## Available Quantizations

This repository provides the model in the following GGUF quantization formats, generated with `llama.cpp`'s `llama-quantize` tool:
| Quantization Type | File Name | Size | llama-bench Prompt Processing (pp 128 tokens, t/s) | llama-bench Token Generation (tg 256 tokens, t/s) | Notes |
|---|---|---|---|---|---|
| F16 | qwen2.5-3b-instruct-dpo-f16.gguf | 5.75 GiB | 3.76 ± 1.63 | 17.25 ± 10.40 | Full-precision baseline. Highest quality but slowest inference; best for validation or re-quantization reference. |
| Q4_K_M | qwen2.5-3b-instruct-dpo-Q4_K_M.gguf | 1.79 GiB | 15.47 ± 0.74 | 10.34 ± 2.44 | Recommended balance of size, speed, and quality. |
| Q5_K_S | qwen2.5-3b-instruct-dpo-Q5_K_S.gguf | 2.02 GiB | 26.52 ± 1.14 | 14.30 ± 8.52 | Slightly higher quality than Q4_K_M. |
| Q8_0 | qwen2.5-3b-instruct-dpo-Q8_0.gguf | 3.05 GiB | 34.77 ± 1.74 | 8.83 ± 0.17 | High-fidelity quantization; larger size, moderate generation speed. |
| IQ3_S | qwen2.5-3b-instruct-dpo-IQ3_S.gguf | 1.35 GiB | 14.14 ± 16.83 | 6.05 ± 0.82 | Smallest footprint, but noticeable quality loss. |
Benchmarks were performed with `llama-bench` on CPU (hardware details omitted) using 4 threads (`-t 4`), a 128-token prompt (`-p 128`), and 256 generated tokens (`-n 256`). Results may vary depending on your hardware.
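The exact benchmark hardware is not documented, but a comparable run can be reproduced with `llama-bench` using the same flags. The snippet below is a minimal sketch only; the binary path and the choice of the Q4_K_M file are assumptions, not part of the original setup.

```python
import subprocess

# Assumed paths -- adjust to your local llama.cpp build and downloaded GGUF file.
LLAMA_BENCH = "llama.cpp/build/bin/llama-bench"
MODEL = "qwen2.5-3b-instruct-dpo-Q4_K_M.gguf"

# Mirrors the flags quoted above: 4 threads, 128 prompt tokens, 256 generated tokens.
subprocess.run(
    [LLAMA_BENCH, "-m", MODEL, "-t", "4", "-p", "128", "-n", "256"],
    check=True,
)
```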
## Model Creation
These GGUF files were created through the following steps (a hedged end-to-end sketch is shown at the end of this section):

- Base Model: Started with `Qwen/Qwen2.5-3B-Instruct`.
- Adapter Application: Loaded the DPO fine-tuned LoRA adapter `ogulcanakca/qwen2.5-3b-instruct-dpo-orca`.
- Merging: Merged the adapter weights into the base model using `peft`'s `merge_and_unload()` function to create a full fine-tuned model in `transformers` format.
- Conversion to GGUF (f16): Converted the merged model to a 16-bit float GGUF file using `llama.cpp`'s `convert_hf_to_gguf.py` script.
- Quantization: Quantized the f16 GGUF file into the various formats (Q4_K_M, Q5_K_S, Q8_0, IQ3_S) using `llama.cpp`'s `llama-quantize` tool.
The `llama.cpp` build used was commit `dd62dcfa` (build 6828).
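A minimal end-to-end sketch of these steps follows. It assumes `transformers`, `peft`, and `torch` are installed and that a local `llama.cpp` checkout provides `convert_hf_to_gguf.py` and a built `llama-quantize` binary; the paths and intermediate file names are placeholders, not necessarily the ones actually used.

```python
import subprocess
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "Qwen/Qwen2.5-3B-Instruct"
ADAPTER = "ogulcanakca/qwen2.5-3b-instruct-dpo-orca"
MERGED_DIR = "qwen2.5-3b-instruct-dpo-merged"   # placeholder output directory
F16_GGUF = "qwen2.5-3b-instruct-dpo-f16.gguf"

# Load the base model, apply the DPO LoRA adapter, and merge the weights.
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto")
merged = PeftModel.from_pretrained(base, ADAPTER).merge_and_unload()
merged.save_pretrained(MERGED_DIR)
AutoTokenizer.from_pretrained(BASE).save_pretrained(MERGED_DIR)

# Convert the merged transformers model to a 16-bit float GGUF file.
subprocess.run(
    ["python", "llama.cpp/convert_hf_to_gguf.py", MERGED_DIR,
     "--outfile", F16_GGUF, "--outtype", "f16"],
    check=True,
)

# Quantize the f16 GGUF into the published formats.
for qtype in ["Q4_K_M", "Q5_K_S", "Q8_0", "IQ3_S"]:
    subprocess.run(
        ["llama.cpp/build/bin/llama-quantize", F16_GGUF,
         f"qwen2.5-3b-instruct-dpo-{qtype}.gguf", qtype],
        check=True,
    )
```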
## Evaluation
The quality of the underlying fine-tuned model (ogulcanakca/qwen2.5-3b-instruct-dpo-orca) was evaluated using an LLM-as-a-Judge (Gemini 2.0 Flash Lite) approach on a filtered subset of the databricks/databricks-dolly-15k dataset.
Key Findings:
- The DPO model was preferred over the base model in approximately 93% of head-to-head comparisons.
- The DPO model showed slightly improved Usefulness scores compared to the base model.
For detailed evaluation results, please refer to the Adapter Model Card.
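For illustration only, a head-to-head LLM-as-a-Judge comparison of this kind can be sketched as below. The judge prompt wording, the `google-generativeai` client, and the model name string are assumptions for the sketch and do not reproduce the exact evaluation described in the Adapter Model Card.

```python
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
judge = genai.GenerativeModel("gemini-2.0-flash-lite")  # assumed judge model id

def judge_pair(instruction: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge which answer better follows the instruction; expects 'A' or 'B'."""
    prompt = (
        "You are an impartial judge. Given an instruction and two answers, "
        "reply with only the letter of the better answer: 'A' or 'B'.\n\n"
        f"Instruction:\n{instruction}\n\nAnswer A:\n{answer_a}\n\nAnswer B:\n{answer_b}"
    )
    return judge.generate_content(prompt).text.strip()
```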
Note: Quantization can lead to a slight degradation in model performance compared to the original f16 or adapter versions, especially for lower-bit quantizations such as IQ3_S.
## Bias, Risks, and Limitations
This model inherits the biases, risks, and limitations of the base Qwen2.5-3B model and the DPO fine-tuning process. Quantization might introduce additional minor performance differences. Please refer to the Adapter Model Card for a detailed discussion. Users should be aware of potential hallucinations, biases, and the model's limitations, especially when used in sensitive applications.