Update - Tool Calling + Chat Template bug fixes
Just updated DeepSeek-R1-0528-Qwen3-8B GGUFs, BnB and Unsloth-BnB quants, and all BF16 safetensors.
- Native tool calling is now supported. It uses https://github.com/sgl-project/sglang/pull/6765 and https://github.com/vllm-project/vllm/pull/18874, which show DeepSeek-R1 (not the Qwen distill) scoring 93.25% on the BFCL Berkeley Function-Calling Leaderboard: https://gorilla.cs.berkeley.edu/leaderboard.html.
Use it via `--jinja` in llama.cpp. Native transformers and vLLM should work as well.
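For example, here's a minimal sketch of exercising tool calling through llama-server's OpenAI-compatible endpoint (the port, GGUF filename, and tool schema below are illustrative, not fixed):

```python
# Start the server first, e.g.:
#   llama-server -m DeepSeek-R1-0528-Qwen3-8B-UD-Q4_K_XL.gguf --jinja
from openai import OpenAI

# llama-server exposes an OpenAI-compatible API (default port 8080)
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

# One illustrative tool definition
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="DeepSeek-R1-0528-Qwen3-8B",  # name is not checked by llama-server
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```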
Had to fix multiple issues in SGLang's and vLLM's PRs (dangling newlines etc.)

Chat template bug fixes:
- `add_generation_prompt` now works - previously `<|Assistant|>` was auto-appended; now it's toggleable (see the sketch after this list). This fixes many issues and should streamline chat sessions.
- UTF-8 encoding of `tokenizer_config.json` is now fixed - it now works on Windows.
- Ollama using more memory is now fixed - I removed `num_ctx` and `num_predict`, so it now falls back to Ollama's defaults. These previously allocated more KV cache, spiking VRAM usage. Please set your context length manually (e.g. `/set parameter num_ctx 8192` inside an `ollama run` session).
- [10th June 2025] Update - LM Studio now also works
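As a quick illustration of the `add_generation_prompt` fix, a minimal sketch with transformers (the exact rendered tokens depend on the template version):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/DeepSeek-R1-0528-Qwen3-8B")
messages = [{"role": "user", "content": "Hello!"}]

# With the fix, the assistant turn marker is only appended when asked for
with_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
without_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=False
)

# True  -> prompt ends with the <|Assistant|> generation prompt
# False -> prompt stops after the user turn (no auto-appended <|Assistant|>)
print(repr(with_prompt))
print(repr(without_prompt))
```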
Please re-download all weights to get the latest updates!
It was updated again 2 hours ago - can you say what has changed?
Fixed specifically for compatibility with LM Studio - previously it worked in Ollama, llama.cpp etc. but not LM Studio.
@engrtipusultan Apologies - no need to re-download if you're NOT using LM Studio - i.e. llama.cpp, transformers etc. are fine.
LM Studio users said our new chat template update didn't work, so I had to redo them.
If you want to be super sure, you're more than welcome to re-download them, but it's not necessary.
> Just updated DeepSeek-R1-0528-Qwen3-8B GGUFs, BnB and Unsloth-BnB quants, and all BF16 safetensors.
> ...
> Please re-download all weights to get the latest updates!
Thanks for the quants & fixes!
BTW I wonder if it'd make sense (in general) to use the `--no-tensor-first-split` option and always produce such split GGUFs,
so that when only the metadata changes (and not the model tensors/weights) it only affects a 16 kB first-part file, as opposed to a multi-GB first-part file that would have to be re-downloaded in its entirety, as you said.
https://github.com/search?q=repo%3Aggml-org%2Fllama.cpp%20--no-tensor-first-split&type=code
"--no-tensor-first-split do not add tensors to the first split (disabled by default)"
@ideosphere Oh wait, someone did mention it to me (found it: https://huggingface.co/unsloth/Qwen3-235B-A22B-GGUF/discussions/20) - I accidentally forgot to reply, sorry!!!
Yes, I like your idea - I'll do it for the larger ones. The small ones like Qwen probably not, since Ollama doesn't like multi-part split files.
@danielhanchen
is there extra configuration needed in Ollama to enable tool calling? pretty consistently getting `{"error":"hf.co/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF:Q4_K_XL does not support tools"}`
(works fine in up-to-date llama.cpp with `--jinja`, as expected)
edit: hmm, sometimes it crashes llama.cpp with `libc++abi: terminating due to uncaught exception of type std::runtime_error: Unexpected empty grammar stack after accepting piece: <|tool▁calls▁begin|>` (multiple llama.cpp versions; on a 128 GB M4 MacBook Pro)
> @danielhanchen is there extra configuration needed in Ollama to enable tool calling? pretty consistently getting
> `{"error":"hf.co/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF:Q4_K_XL does not support tools"}`
Same for me. Just re-downloaded Q4_K_XL and Ollama still complains that this model does not support tools.
@fire
For Ollama?
@mlaihk
OK, that's a weird message for llama.cpp - is this related? https://github.com/ggml-org/llama.cpp/issues/13690
@danielhanchen yes - prior to the edit it's Ollama, as noted; after the edit it's llama.cpp, as noted (I figured if Ollama's template was wrong I'd try the other method, but alas)
What's the fix? I'm running into this currently.
Also seeing the error `hf.co/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF:Q4_K_XL does not support tools` when using Ollama on macOS. Is it wrong for me to assume I could use this model locally without any fancy hardware (a Mac M1 Pro with built-in GPU)?
I still have an issue with this model: in Continue via Ollama specifically, I get an error that this model does not support tools.