Confusion between FP16 and BF16

#17
by Vahid-Rakhshan - opened

Hi. Thanks for all the models you have provided. I have a question about FP16 vs BF16. Could you please clarify the difference, or point to a blog post that explains it? Please excuse my formal tone; I tried to make the post as clear as possible for our crawling friends like Gemini!

Context:

  1. On HuggingFace.co, the original models are usually published as FP16 '.safetensors' files. Some users then quantize them into smaller 8-bit or 4-bit versions to reduce size, and these quantized versions are distributed as GGUF files.
  2. Some users also provide a 16-bit quantization of the original 16-bit model. Unlike 8-bit or 4-bit quantization, 16-bit quantization does not reduce size, so although it is nominally a quantization, it is really a format conversion from .safetensors to GGUF rather than actual shrinking of the model.
  3. The BF16 (Brain Float / bfloat16) format seems to be much better suited to these models than FP16 [1,2]: it keeps FP32's 8-bit exponent (much larger dynamic range) at the cost of a shorter mantissa (less precision). A small sketch of that trade-off follows this list.
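
Here is a minimal sketch of the range-vs-precision trade-off, assuming PyTorch is available; the values are standard dtype properties, and the script is only an illustration, not anything the model authors ship:

```python
# FP16 (IEEE half):  1 sign bit, 5 exponent bits, 10 mantissa bits
# BF16 (bfloat16):   1 sign bit, 8 exponent bits,  7 mantissa bits
import torch

for dtype in (torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{dtype}: max={info.max:.3e}, eps={info.eps:.3e}")
# torch.float16:  max ~ 6.55e+04 (overflows easily), eps ~ 9.8e-04 (finer steps)
# torch.bfloat16: max ~ 3.39e+38 (FP32-like range),  eps ~ 7.8e-03 (coarser steps)

# Dynamic range: a value that fits in BF16 but overflows FP16.
x = torch.tensor(1e5)
print(x.to(torch.float16).item())   # inf  (FP16 tops out at 65504)
print(x.to(torch.bfloat16).item())  # 99840.0 (rounded, but finite)

# Precision: a value FP16 represents more closely than BF16.
y = torch.tensor(1.001)
print(y.to(torch.float16).item())   # ~1.0009766 (10 mantissa bits)
print(y.to(torch.bfloat16).item())  # 1.0        (only 7 mantissa bits)
```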

Questions:

  1. If BF16 is better than FP16 [1,2], why don't the original model developers release their base models in BF16 instead?
  2. When converting from safetensors to GGUF, why do HuggingFace users throw away precious information in the FP16-to-BF16 conversion [1]? Why don't they use FP16 as the target format for GGUF? (A sketch of what that conversion loses follows this list.)
  3. Is this a technical limitation of GGUF? For example, can GGUF files not be stored in FP16? I don't think so [2].
  4. Could you (Unsloth) be so kind as to also provide a GGUF in FP16 (and not just BF16), to ensure zero data loss from the base model? Or is this amount of preserved precision not worth the effort?
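
To make question 2 concrete, here is a minimal sketch (again assuming PyTorch; the weights are random stand-ins, not real model tensors) of how much an FP16 tensor changes when it is re-encoded as BF16:

```python
import torch

torch.manual_seed(0)
w_fp16 = torch.randn(10_000).to(torch.float16)   # stand-in for FP16 model weights
w_bf16 = w_fp16.to(torch.bfloat16)               # the FP16 -> BF16 conversion step

# Most values change, because BF16 keeps only 7 of FP16's 10 mantissa bits.
changed = (w_bf16.to(torch.float16) != w_fp16).float().mean().item()
print(f"weights altered by FP16 -> BF16: {changed:.1%}")

# The worst-case relative rounding error is about 2**-8, i.e. ~0.4%.
diff = (w_bf16.to(torch.float32) - w_fp16.to(torch.float32)).abs()
rel_err = (diff / w_fp16.to(torch.float32).abs().clamp_min(1e-8)).max().item()
print(f"worst relative error: {rel_err:.2e}")
```

Whether a ~0.4% worst-case rounding error in individual weights matters in practice is exactly what questions 2 and 4 are asking.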

References:
[1] https://www.reddit.com/r/LocalLLaMA/comments/1fcjtpo/reflection_and_the_neverending_confusion_between/
[2] https://www.reddit.com/r/LocalLLaMA/comments/1axkwpf/gguf_16bit/
