Model Card for qwen2.5-3b-instruct-dpo-gguf

Available Quantizations

This repository provides the model in the following GGUF quantization formats, generated with llama.cpp's llama-quantize tool:

| Quantization Type | File Name | Size | llama-bench pp 128 (t/s) | llama-bench tg 256 (t/s) | Notes |
|---|---|---|---|---|---|
| F16 | qwen2.5-3b-instruct-dpo-f16.gguf | 5.75 GiB | 3.76 ± 1.63 | 17.25 ± 10.40 | Full-precision baseline. Highest quality but slowest inference; best for validation or re-quantization reference. |
| Q4_K_M | qwen2.5-3b-instruct-dpo-Q4_K_M.gguf | 1.79 GiB | 15.47 ± 0.74 | 10.34 ± 2.44 | Recommended balance of size, speed, and quality. |
| Q5_K_S | qwen2.5-3b-instruct-dpo-Q5_K_S.gguf | 2.02 GiB | 26.52 ± 1.14 | 14.30 ± 8.52 | Slightly higher quality than Q4_K_M. |
| Q8_0 | qwen2.5-3b-instruct-dpo-Q8_0.gguf | 3.05 GiB | 34.77 ± 1.74 | 8.83 ± 0.17 | High-fidelity quantization; larger size, moderate generation speed. |
| IQ3_S | qwen2.5-3b-instruct-dpo-IQ3_S.gguf | 1.35 GiB | 14.14 ± 16.83 | 6.05 ± 0.82 | Smallest footprint, but noticeable quality loss. |

Benchmarks were run on CPU (details omitted) with llama-bench using 4 threads (-t 4), a 128-token prompt (-p 128), and 256 generated tokens (-n 256). Your results will vary depending on hardware.
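The quantized files can be used with any GGUF-compatible runtime. Below is a minimal sketch using the llama-cpp-python bindings; the filename, context size, and generation parameters are illustrative, and the chat template is taken from the GGUF metadata:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Illustrative: load the recommended Q4_K_M file with 4 CPU threads.
llm = Llama(
    model_path="qwen2.5-3b-instruct-dpo-Q4_K_M.gguf",
    n_ctx=4096,
    n_threads=4,
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain DPO in one sentence."}],
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])
```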

Model Creation

These GGUF files were created through the following steps:

  1. Base Model: Started with Qwen/Qwen2.5-3B-Instruct.
  2. Adapter Application: Loaded the DPO fine-tuned LoRA adapter ogulcanakca/qwen2.5-3b-instruct-dpo-orca.
  3. Merging: Merged the adapter weights into the base model using peft's merge_and_unload() function to create a full fine-tuned model in transformers format (see the Python sketch below).
  4. Conversion to GGUF (f16): Converted the merged model to a 16-bit float GGUF file using llama.cpp's convert_hf_to_gguf.py script.
  5. Quantization: Quantized the f16 GGUF file into the various formats (Q4_K_M, Q5_K_S, Q8_0, IQ3_S) using llama.cpp's llama-quantize tool.

The llama.cpp build used was dd62dcfa (6828).
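A minimal sketch of steps 2–5 follows. Output paths are illustrative and the exact commands used may have differed slightly:

```python
# Steps 2-3: load the base model, apply the DPO LoRA adapter, and merge.
# Requires transformers and peft; output directory name is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "ogulcanakca/qwen2.5-3b-instruct-dpo-orca")

merged = model.merge_and_unload()  # fold the LoRA weights into the base weights
merged.save_pretrained("qwen2.5-3b-instruct-dpo-merged")

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
tokenizer.save_pretrained("qwen2.5-3b-instruct-dpo-merged")

# Steps 4-5: convert and quantize with llama.cpp (shown as comments; run from a
# llama.cpp checkout). Output filenames are illustrative.
#   python convert_hf_to_gguf.py qwen2.5-3b-instruct-dpo-merged --outtype f16 \
#       --outfile qwen2.5-3b-instruct-dpo-f16.gguf
#   ./llama-quantize qwen2.5-3b-instruct-dpo-f16.gguf \
#       qwen2.5-3b-instruct-dpo-Q4_K_M.gguf Q4_K_M
```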

Evaluation

The quality of the underlying fine-tuned model (ogulcanakca/qwen2.5-3b-instruct-dpo-orca) was evaluated using an LLM-as-a-Judge (Gemini 2.0 Flash Lite) approach on a filtered subset of the databricks/databricks-dolly-15k dataset.
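For illustration only, a head-to-head LLM-as-a-Judge comparison of this kind can be set up roughly as sketched below. This assumes the google-generativeai client; the prompt wording and judging criteria here are hypothetical, and the actual evaluation setup is documented in the Adapter Model Card:

```python
# Illustrative LLM-as-a-Judge sketch: ask Gemini which of two answers is better.
# Assumes the google-generativeai package; the judge prompt is hypothetical.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
judge = genai.GenerativeModel("gemini-2.0-flash-lite")

def judge_pair(instruction: str, answer_a: str, answer_b: str) -> str:
    prompt = (
        "You are an impartial judge. Given an instruction and two answers, "
        "reply with only 'A' or 'B' for the more useful and correct answer.\n\n"
        f"Instruction: {instruction}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}"
    )
    return judge.generate_content(prompt).text.strip()
```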

Key Findings:

  • The DPO model was preferred over the base model in approximately 93% of head-to-head comparisons.
  • The DPO model showed slightly improved Usefulness scores compared to the base model.

For detailed evaluation results, please refer to the Adapter Model Card.

Note: Quantization can cause a slight degradation in model quality compared to the original f16 or adapter versions, especially for lower-bit quantizations such as IQ3_S.

Bias, Risks, and Limitations

This model inherits the biases, risks, and limitations of the base Qwen2.5-3B model and the DPO fine-tuning process. Quantization might introduce additional minor performance differences. Please refer to the Adapter Model Card for a detailed discussion. Users should be aware of potential hallucinations, biases, and the model's limitations, especially when used in sensitive applications.
