Confusion between FP16 and BF16

#17
by Vahid-Rakhshan - opened

Hi. Thanks for all the models you have provided. I have a question about FP16 vs BF16. Could you please clarify the difference, or point to a blog post that explains it? Please excuse my formal tone; I tried to make the post as clear as possible for our crawling friends like Gemini!

Context:

  1. On HuggingFace.co, the original models are usually published as FP16 '.safetensors' files. Some users then quantize them into smaller 8-bit or 4-bit versions to reduce size, and these quantized versions are distributed as GGUF files.
  2. Some users also provide a 16-bit quantization of the original 16-bit model. Unlike 8-bit or 4-bit quantization, 16-bit quantization does not reduce size, so although it is nominally a quantization, it is really a format conversion from .safetensors to GGUF rather than actual shrinking of the model.
  3. The BF16 (Brain Float / bfloat16) format seems to be much better suited to these models than FP16 [1,2]: it keeps FP32's 8-bit exponent (much larger dynamic range) at the cost of a shorter mantissa (less precision). A small sketch of that trade-off follows this list.
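
Here is a minimal sketch of the range-vs-precision trade-off, assuming PyTorch is available; the values are standard dtype properties, and the script is only an illustration, not anything the model authors ship:

```python
# FP16 (IEEE half):  1 sign bit, 5 exponent bits, 10 mantissa bits
# BF16 (bfloat16):   1 sign bit, 8 exponent bits,  7 mantissa bits
import torch

for dtype in (torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{dtype}: max={info.max:.3e}, eps={info.eps:.3e}")
# torch.float16:  max ~ 6.55e+04 (overflows easily), eps ~ 9.8e-04 (finer steps)
# torch.bfloat16: max ~ 3.39e+38 (FP32-like range),  eps ~ 7.8e-03 (coarser steps)

# Dynamic range: a value that fits in BF16 but overflows FP16.
x = torch.tensor(1e5)
print(x.to(torch.float16).item())   # inf  (FP16 tops out at 65504)
print(x.to(torch.bfloat16).item())  # 99840.0 (rounded, but finite)

# Precision: a value FP16 represents more closely than BF16.
y = torch.tensor(1.001)
print(y.to(torch.float16).item())   # ~1.0009766 (10 mantissa bits)
print(y.to(torch.bfloat16).item())  # 1.0        (only 7 mantissa bits)
```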

Questions:

  1. If BF16 is better than FP16 [1,2], why don't the original model developers release their base models in BF16 instead?
  2. When converting from safetensors to GGUF, why do HuggingFace users throw away precious information in the FP16-to-BF16 conversion [1]? Why don't they use FP16 as the target format for GGUF? (A sketch of what that conversion loses follows this list.)
  3. Is this a technical limitation of GGUF? For example, can GGUF files not be stored in FP16? I don't think so [2].
  4. Could you (Unsloth) be so kind as to also provide a GGUF in FP16 (and not just BF16), to ensure zero data loss from the base model? Or is this amount of preserved precision not worth the effort?
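
To make question 2 concrete, here is a minimal sketch (again assuming PyTorch; the weights are random stand-ins, not real model tensors) of how much an FP16 tensor changes when it is re-encoded as BF16:

```python
import torch

torch.manual_seed(0)
w_fp16 = torch.randn(10_000).to(torch.float16)   # stand-in for FP16 model weights
w_bf16 = w_fp16.to(torch.bfloat16)               # the FP16 -> BF16 conversion step

# Most values change, because BF16 keeps only 7 of FP16's 10 mantissa bits.
changed = (w_bf16.to(torch.float16) != w_fp16).float().mean().item()
print(f"weights altered by FP16 -> BF16: {changed:.1%}")

# The worst-case relative rounding error is about 2**-8, i.e. ~0.4%.
diff = (w_bf16.to(torch.float32) - w_fp16.to(torch.float32)).abs()
rel_err = (diff / w_fp16.to(torch.float32).abs().clamp_min(1e-8)).max().item()
print(f"worst relative error: {rel_err:.2e}")
```

Whether a ~0.4% worst-case rounding error in individual weights matters in practice is exactly what questions 2 and 4 are asking.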

References:
[1] https://www.reddit.com/r/LocalLLaMA/comments/1fcjtpo/reflection_and_the_neverending_confusion_between/
[2] https://www.reddit.com/r/LocalLLaMA/comments/1axkwpf/gguf_16bit/
