Model Card for qwen2.5-3b-instruct-dpo-gguf

Available Quantizations

This repository provides the model in the following GGUF quantization formats, generated with llama.cpp's llama-quantize tool:

| Quantization Type | File Name | Size | llama-bench pp 128 (t/s) | llama-bench tg 256 (t/s) | Notes |
|---|---|---|---|---|---|
| F16 | qwen2.5-3b-instruct-dpo-f16.gguf | 5.75 GiB | 3.76 ± 1.63 | 17.25 ± 10.40 | Full-precision baseline. Highest quality but slowest inference; best for validation or re-quantization reference. |
| Q4_K_M | qwen2.5-3b-instruct-dpo-Q4_K_M.gguf | 1.79 GiB | 15.47 ± 0.74 | 10.34 ± 2.44 | Recommended balance of size, speed, and quality. |
| Q5_K_S | qwen2.5-3b-instruct-dpo-Q5_K_S.gguf | 2.02 GiB | 26.52 ± 1.14 | 14.30 ± 8.52 | Slightly higher quality than Q4_K_M. |
| Q8_0 | qwen2.5-3b-instruct-dpo-Q8_0.gguf | 3.05 GiB | 34.77 ± 1.74 | 8.83 ± 0.17 | High-fidelity quantization; larger size, moderate generation speed. |
| IQ3_S | qwen2.5-3b-instruct-dpo-IQ3_S.gguf | 1.35 GiB | 14.14 ± 16.83 | 6.05 ± 0.82 | Smallest footprint, but noticeable quality loss. |

Benchmarks were run on CPU (details omitted) with llama-bench using 4 threads (-t 4), a 128-token prompt (-p 128), and 256 generated tokens (-n 256). Your results will vary depending on hardware.
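The quantized files can be used with any GGUF-compatible runtime. Below is a minimal sketch using the llama-cpp-python bindings; the filename, context size, and generation parameters are illustrative, and the chat template is taken from the GGUF metadata:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Illustrative: load the recommended Q4_K_M file with 4 CPU threads.
llm = Llama(
    model_path="qwen2.5-3b-instruct-dpo-Q4_K_M.gguf",
    n_ctx=4096,
    n_threads=4,
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain DPO in one sentence."}],
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])
```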

Model Creation

These GGUF files were created through the following steps:

  1. Base Model: Started with Qwen/Qwen2.5-3B-Instruct.
  2. Adapter Application: Loaded the DPO fine-tuned LoRA adapter ogulcanakca/qwen2.5-3b-instruct-dpo-orca.
  3. Merging: Merged the adapter weights into the base model using peft's merge_and_unload() function to create a full fine-tuned model in transformers format (see the Python sketch below).
  4. Conversion to GGUF (f16): Converted the merged model to a 16-bit float GGUF file using llama.cpp's convert_hf_to_gguf.py script.
  5. Quantization: Quantized the f16 GGUF file into the various formats (Q4_K_M, Q5_K_S, Q8_0, IQ3_S) using llama.cpp's llama-quantize tool.

The llama.cpp build used was dd62dcfa (6828).
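A minimal sketch of steps 2–5 follows. Output paths are illustrative and the exact commands used may have differed slightly:

```python
# Steps 2-3: load the base model, apply the DPO LoRA adapter, and merge.
# Requires transformers and peft; output directory name is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "ogulcanakca/qwen2.5-3b-instruct-dpo-orca")

merged = model.merge_and_unload()  # fold the LoRA weights into the base weights
merged.save_pretrained("qwen2.5-3b-instruct-dpo-merged")

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
tokenizer.save_pretrained("qwen2.5-3b-instruct-dpo-merged")

# Steps 4-5: convert and quantize with llama.cpp (shown as comments; run from a
# llama.cpp checkout). Output filenames are illustrative.
#   python convert_hf_to_gguf.py qwen2.5-3b-instruct-dpo-merged --outtype f16 \
#       --outfile qwen2.5-3b-instruct-dpo-f16.gguf
#   ./llama-quantize qwen2.5-3b-instruct-dpo-f16.gguf \
#       qwen2.5-3b-instruct-dpo-Q4_K_M.gguf Q4_K_M
```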

Evaluation

The quality of the underlying fine-tuned model (ogulcanakca/qwen2.5-3b-instruct-dpo-orca) was evaluated using an LLM-as-a-Judge (Gemini 2.0 Flash Lite) approach on a filtered subset of the databricks/databricks-dolly-15k dataset.
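For illustration only, a head-to-head LLM-as-a-Judge comparison of this kind can be set up roughly as sketched below. This assumes the google-generativeai client; the prompt wording and judging criteria here are hypothetical, and the actual evaluation setup is documented in the Adapter Model Card:

```python
# Illustrative LLM-as-a-Judge sketch: ask Gemini which of two answers is better.
# Assumes the google-generativeai package; the judge prompt is hypothetical.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
judge = genai.GenerativeModel("gemini-2.0-flash-lite")

def judge_pair(instruction: str, answer_a: str, answer_b: str) -> str:
    prompt = (
        "You are an impartial judge. Given an instruction and two answers, "
        "reply with only 'A' or 'B' for the more useful and correct answer.\n\n"
        f"Instruction: {instruction}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}"
    )
    return judge.generate_content(prompt).text.strip()
```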

Key Findings:

  • The DPO model was preferred over the base model in approximately 93% of head-to-head comparisons.
  • The DPO model showed slightly improved Usefulness scores compared to the base model.

For detailed evaluation results, please refer to the Adapter Model Card.

Note: Quantization can cause a slight degradation in model quality compared to the original f16 or adapter versions, especially for lower-bit quantizations such as IQ3_S.

Bias, Risks, and Limitations

This model inherits the biases, risks, and limitations of the base Qwen2.5-3B model and the DPO fine-tuning process. Quantization might introduce additional minor performance differences. Please refer to the Adapter Model Card for a detailed discussion. Users should be aware of potential hallucinations, biases, and the model's limitations, especially when used in sensitive applications.
