Confusion between FP16 and BF16
#17
by Vahid-Rakhshan - opened
Hi. Thanks for all the models you have provided. I have a question about FP16 vs BF16. Could you please clarify the difference, or point me to a blog post that explains it? Please excuse my formal tone; I tried to make it as clear as possible for our crawling friends like Gemini!
Context:
- On HuggingFace.co, the original models are usually published as FP16 '.safetensors' files. Some users then quantize them into smaller 8-bit or 4-bit versions to reduce size, and these quantized versions are distributed as GGUF files.
- Some users also provide a 16-bit "quantization" of the original 16-bit model. Unlike 8-bit or 4-bit quantization, 16-bit quantization does not reduce the size, so although it is technically a quantization, it is really a format conversion from .safetensors to GGUF rather than any real shrinking of the model.
- It seems that the BF16 (bfloat16, "Brain Float") format is a better fit for LLM weights than FP16: it trades mantissa precision (7 bits instead of 10) for the full FP32 exponent range (8 bits instead of 5) [1,2]. (See the sketch after this list.)
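For concreteness, here is a minimal sketch of the difference, assuming PyTorch is installed (the printed values are illustrative): FP16 keeps more mantissa bits per value, while BF16 keeps more exponent bits and therefore the same dynamic range as FP32.

```python
import torch

# FP16 (IEEE half): 1 sign bit, 5 exponent bits, 10 mantissa bits -> max finite value ~65504
# BF16 (bfloat16):  1 sign bit, 8 exponent bits,  7 mantissa bits -> max ~3.4e38 (FP32 range)

x = torch.tensor([0.1], dtype=torch.float32)
print(f"float32 : {x.item():.8f}")                             # 0.10000000
print(f"float16 : {x.to(torch.float16).float().item():.8f}")   # ~0.09997559 (more mantissa bits)
print(f"bfloat16: {x.to(torch.bfloat16).float().item():.8f}")  # ~0.10009766 (fewer mantissa bits)

# Dynamic range: 70000 overflows FP16 but stays finite in BF16.
big = torch.tensor([70000.0], dtype=torch.float32)
print(big.to(torch.float16).float().item())   # inf
print(big.to(torch.bfloat16).float().item())  # ~70144.0 (rounded, but finite)

# An FP16 -> BF16 conversion rounds the mantissa from 10 bits down to 7 bits;
# that rounding is the "thrown away" information asked about in the questions below.
w = torch.tensor([0.1234], dtype=torch.float16)
print(w.float().item(), w.to(torch.bfloat16).float().item())
```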
Questions:
- If BF16 is better than FP16 [1,2], why doesn't the original model developer release its base model in BF16 instead?
- When converting from safetensors to GGUF, why do HuggingFace users throw away precious information in the FP16-to-BF16 conversion [1]? Why don't they use FP16 as the target format for the GGUF?
- Is this a technical limitation of GGUF? For example, can GGUF files not store FP16 tensors? I don't think that's the case [2] (see the sketch after these questions).
- Could you (Unsloth) be so kind as to also provide a GGUF in FP16 (and not just BF16) to ensure zero data loss from the base model? Or is preserving that extra precision not worth the effort?
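On the limitation question: as far as I can tell, GGUF itself is not restricted to BF16. A hedged sketch, assuming the `gguf` Python helper package that the llama.cpp project publishes (exact enum members and converter flags may differ between versions):

```python
# pip install gguf   (helper package maintained in the llama.cpp repository)
from gguf import GGMLQuantizationType

# List the tensor types GGUF knows about; plain F16 is expected to be among
# them, alongside BF16, Q8_0, Q4_K, and the other quantized types.
for t in GGMLQuantizationType:
    print(t.name)

# If that holds, llama.cpp's converter should accept FP16 as a target as well,
# e.g. (flag values are an assumption and may vary by version):
#   python convert_hf_to_gguf.py <model_dir> --outtype f16
#   python convert_hf_to_gguf.py <model_dir> --outtype bf16
```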
References:
[1] https://www.reddit.com/r/LocalLLaMA/comments/1fcjtpo/reflection_and_the_neverending_confusion_between/
[2] https://www.reddit.com/r/LocalLLaMA/comments/1axkwpf/gguf_16bit/