---
license: apache-2.0
base_model:
- Writer/palmyra-mini-thinking-b
tags:
- gguf
- qwen2
- palmyra
- thinking
- reasoning
- quantized
---

# Palmyra Mini Thinking B - GGUF

## Model Description

This repository contains GGUF quantized versions of the [palmyra-mini-thinking-b model](https://huggingface.co/Writer/palmyra-mini-thinking-b), based on the Qwen2 architecture. This model represents an advanced iteration of the thinking model series with improved reasoning capabilities and ChatML format support. GGUF quantizations are optimized for efficient inference across various hardware platforms using llama.cpp and compatible frameworks.

## Available Quantizations

### BF16 (Brain Float 16)

- **File**: `palmyra-mini-thinking-b-BF16.gguf`
- **Size**: 3.3GB
- **Precision**: 16-bit brain float
- **Use Case**: Highest quality reasoning, requires more memory

### Q8_0 (8-bit Quantization)

- **File**: `palmyra-mini-thinking-b-Q8_0.gguf`
- **Size**: 1.8GB
- **Precision**: 8-bit integer
- **Use Case**: Good balance of reasoning quality and efficiency

## Quick Start

### Installation

```bash
# Install llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# Or use a pre-built binary
```

### Usage

```bash
# Run with ChatML format
./main -m /path/to/palmyra-mini-thinking-b-BF16.gguf \
  -p "<|im_start|>user\nSolve this step by step: What is 30% of 250?<|im_end|>\n<|im_start|>assistant\n" \
  -n 512

# Interactive mode
./main -m /path/to/palmyra-mini-thinking-b-Q8_0.gguf -i
```
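The GGUF files can also be driven from Python. A minimal sketch using the `llama-cpp-python` bindings (`pip install llama-cpp-python`), assuming the model file has been downloaded to the working directory:

```python
from llama_cpp import Llama

# Load the Q8_0 quantization; adjust model_path to your local download.
llm = Llama(
    model_path="./palmyra-mini-thinking-b-Q8_0.gguf",
    n_ctx=4096,
    verbose=False,
)

# create_chat_completion applies the chat template shipped in the GGUF
# metadata (ChatML for this model); pass chat_format="chatml" explicitly
# if your version of the bindings does not pick it up automatically.
response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Solve this step by step: What is 30% of 250?"}
    ],
    max_tokens=512,
    temperature=0.7,
)

print(response["choices"][0]["message"]["content"])
```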
## LM Studio Use

Steps to download a model through the **Discover** tab can be found [here](https://lmstudio.ai/docs/app/basics/download-model).

### Ollama Use

Please see [the guide in this repo](https://huggingface.co/Writer/palmyra-mini-thinking-b-GGUF/resolve/main/ollama-README-B.md?download=true) for steps on how to load this model into Ollama.

## Technical Specifications

### Model Architecture

- **Model Type**: `qwen2` (Qwen2 Architecture)
- **Architecture**: `Qwen2ForCausalLM`
- **Parameters**: ~1.7 billion
- **Base Precision**: bfloat16
- **Specialization**: Advanced reasoning and thinking tasks

### Core Parameters

| Parameter | Value |
|-----------|-------|
| Hidden Size | 1,536 |
| Intermediate Size | 8,960 |
| Number of Layers | 28 |
| Attention Heads | 12 |
| Key-Value Heads | 2 |
| Head Dimension | 128 |
| Vocabulary Size | 151,936 |

### Attention Mechanism

- **Attention Type**: Full attention across all 28 layers
- **Max Position Embeddings**: 131,072 tokens
- **Context Length**: 4,096 tokens (default)
- **Sliding Window**: Not used

### Advanced Features

- **Extended Context**: Enhanced RoPE theta (1,000,000.0) for better long-context performance
- **ChatML Format**: Standard ChatML conversation format
- **Improved Tokenizer**: Qwen2Tokenizer with expanded vocabulary

### Quantization Comparison

| Format | Size  | Precision | Reasoning Quality | Speed  | Memory | Compression |
|--------|-------|-----------|-------------------|--------|--------|-------------|
| BF16   | 3.3GB | 16-bit    | Highest           | Slower | High   | None        |
| Q8_0   | 1.8GB | 8-bit     | High              | Faster | Medium | ~45%        |

### File Structure

```
palmyra-mini-thinking-b/GGUF/
├── palmyra-mini-thinking-b-BF16.gguf   # BF16 quantization
└── palmyra-mini-thinking-b-Q8_0.gguf   # Q8_0 quantization
```

## Performance Characteristics

### Hardware Requirements

- **CPU**: Modern x86_64 or ARM64 processor
- **Memory**:
  - BF16: 4GB+ RAM recommended
  - Q8_0: 3GB+ RAM recommended
- **Platform**: Cross-platform (Windows, macOS, Linux)

### Inference Performance

- **BF16**: Highest reasoning quality, slower inference
- **Q8_0**: ~45% smaller size, faster inference with preserved reasoning capabilities

## Training Details

### Tokenizer

- **Type**: Qwen2Tokenizer with a 151,936-token vocabulary
- **Special Tokens**:
  - EOS Token ID: 151643 (`<|endoftext|>`)
  - Pad Token ID: 151643 (`<|endoftext|>`)
  - IM Start: 151644 (`<|im_start|>`)
  - IM End: 151645 (`<|im_end|>`)

### Model Configuration

- **Hidden Activation**: SiLU (Swish)
- **Normalization**: RMSNorm (ε = 1e-06)
- **Initializer Range**: 0.02
- **Attention Dropout**: 0.0
- **Word Embeddings**: Tied

### Chat Template

The model uses the standard ChatML format:

```
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
```
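When building prompts by hand (for example, for llama.cpp's `-p` flag), the template is easy to reproduce in Python. A minimal sketch; the `to_chatml` helper is illustrative, not part of any library:

```python
def to_chatml(messages):
    """Render a list of {role, content} dicts as a ChatML prompt, ending with
    the assistant header so the model generates the reply from there."""
    prompt = ""
    for m in messages:
        prompt += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    return prompt + "<|im_start|>assistant\n"

print(to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 30% of 250?"},
]))
```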
## Usage Examples

### Reasoning Task

```bash
./main -m palmyra-mini-thinking-b-Q8_0.gguf \
  -p "<|im_start|>user\nA rectangle has a length of 15 cm and width of 10 cm. What is its area and perimeter?<|im_end|>\n<|im_start|>assistant\n" \
  -n 300 \
  --temp 0.7
```

### Problem Solving with System Message

```bash
./main -m palmyra-mini-thinking-b-BF16.gguf \
  -p "<|im_start|>system\nYou are a helpful assistant that explains concepts clearly and step by step.<|im_end|>\n<|im_start|>user\nExplain how photosynthesis works.<|im_end|>\n<|im_start|>assistant\n" \
  -n 400 \
  --temp 0.8
```

## Known Limitations

1. **Context Length**: The default context is 4,096 tokens, though the model supports up to 131,072 (see the sketch after this list)
2. **Format Dependency**: Optimized for the ChatML format; other formats may not work as well
3. **Quantization Trade-offs**: Lower-bit quantizations may affect reasoning quality
4. **Platform Optimization**: Performance varies across different hardware configurations
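If longer inputs are needed, the context window can be raised at load time. A minimal sketch via `llama-cpp-python` (the equivalent llama.cpp CLI flag is `-c`); the file path is a placeholder:

```python
from llama_cpp import Llama

# Raise the context window above the 4,096-token default. KV-cache memory
# grows with n_ctx, so only go as high as the workload actually requires.
llm = Llama(
    model_path="./palmyra-mini-thinking-b-Q8_0.gguf",  # placeholder path
    n_ctx=32768,
)
```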
## Compatibility

- **llama.cpp**: Compatible with recent versions
- **Frameworks**: Ollama, LM Studio, GPT4All, and other GGUF-compatible tools
- **Platforms**: Windows, macOS, Linux (x86_64, ARM64)
- **Chat Format**: ChatML format support required for optimal performance

## License

Apache 2.0

#### Original model card below:

---

# Palmyra-mini-thinking-b

### Model Description

- **Language(s) (NLP):** English
- **License:** Apache-2.0
- **Finetuned from model:** Qwen/Qwen2.5-1.5B
- **Context window:** 131,072 tokens
- **Parameters:** 1.7 billion

## Introduction

Palmyra-mini-thinking-b represents a significant step forward in generative AI, demonstrating exceptional capabilities in complex reasoning and problem-solving domains. This model excels in mathematical and programming challenges, showcasing a robust understanding of abstract concepts and logical structures. Its performance is not just a measure of its power but a testament to its specialized training, which has honed its ability to tackle tasks that demand deep, multi-step thinking.

## Mathematical Prowess

The model's mathematical abilities are particularly noteworthy. It achieves an impressive score of 0.925 on the AMC23 benchmark, indicating a strong grasp of advanced high school mathematics. This is further complemented by its performance on MATH500, where it scores 0.882, proving its proficiency across a wide range of mathematical problems. The model also shows its strength in competitive mathematics, scoring 0.6 on AIME24 (pass@1, avg-of-1) and 0.5733 on Olympiadbench (extractive_match). These scores highlight the model's capacity for sophisticated mathematical reasoning, making it a powerful tool for both educational and research applications.

## Excellence in Competitive Programming

Beyond mathematics, Palmyra-mini-thinking-b demonstrates strong performance in the competitive programming arena. Its score of 0.6343 on the Codeforces (pass_rate) benchmark underscores its ability to understand complex algorithmic problems and generate correct, efficient code. This capability suggests the model is well-suited for tasks involving code generation, debugging, and algorithmic design, making it a valuable asset for software developers and computer science researchers.

## Benchmark Scores

Sampling params: temperature 0.6, top_p 0.95.

**Pass@1 (avg-of-64)**

| Benchmark | Pass@1 (avg-of-64) | Majority@64 |
| :-------- | :----------------- | :---------- |
| AIME24    | 59.43%             | 71.67%      |
| AIME25    | 49.69%             | 60.00%      |
| GPQA      | 42.01%             | 47.22%      |
| HMMT25    | 27.86%             | 30.00%      |
| HLE       | 5.22%              | N/A         |
| MMLU-PRO  | 55.49%             | 60.60%      |
| MATH500   | 93.80%             | 95.40%      |
| LCB       | 34.51%             | N/A         |

LCB here is version v6_2408_2505.

**Pass@1 (avg-of-1)**

| Benchmark | Score |
|:----------|------:|
| GSM8K (strict-match) | 42.68% |
| Minerva Math (exact match) | 7.08% |
| MMLU-PRO (exact match) | 29.26% |
| MATH (Hendrycks) | 0.16% |
| IFEval (inst_level_loose_acc) | 32.97% |
| MathQA (acc) | 30.45% |
| HumanEval (pass@1) | 7.32% |
| BBH (get-answer) (exact match) | 28.80% |
| MBPP | 16.80% |
| GPQA (diamond, pass@1: 8 samples) | 39.58% |
| AIME24 (pass@1, avg-of-1) | 60.00% |
| AIME25 (pass@1, avg-of-1) | 50.00% |
| Livecodebench-codegen (livecodebench/code_generation_lite v4_v5) | 28.73% |
| AMC23 | 92.50% |
| MATH500 | 88.20% |
| Minerva | 29.41% |
| Olympiadbench (extractive_match) | 57.33% |
| Codecontests (pass_rate) | 20.18% |
| Codeforces (pass_rate) | 63.43% |
| Taco (pass_rate) | 34.56% |
| APPS (all_levels) | 5.84% |
| HMMT (Feb 2025) (extractive_match) | 23.33% |
| Average | 35.94% |

### Use with transformers

You can run conversational inference using the Transformers Auto classes with the `generate()` function. Here's an example:

```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Writer/palmyra-mini-thinking-b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)

messages = [
    {
        "role": "user",
        "content": "You have a 3-liter jug and a 5-liter jug. How can you measure exactly 4 liters of water?",
    }
]

input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

gen_conf = {
    "max_new_tokens": 256,
    "eos_token_id": tokenizer.eos_token_id,
    "do_sample": True,  # enable sampling so temperature/top_p take effect
    "temperature": 0.3,
    "top_p": 0.9,
}

with torch.inference_mode():
    output_id = model.generate(input_ids, **gen_conf)

# Decode only the newly generated tokens, skipping the prompt.
output_text = tokenizer.decode(output_id[0][input_ids.shape[1]:])
print(output_text)
```

## Running with vLLM

```bash
vllm serve Writer/palmyra-mini-thinking-b
```

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Writer/palmyra-mini-thinking-b",
    "messages": [
      {
        "role": "user",
        "content": "You have a 3-liter jug and a 5-liter jug. How can you measure exactly 4 liters of water?"
      }
    ],
    "max_tokens": 8000,
    "temperature": 0.2
  }'
```
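Because vLLM serves an OpenAI-compatible API, the same request can be issued from Python with the `openai` client. A minimal sketch, assuming the server started above is listening on localhost:8000:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server does not check the API key,
# but the client requires a non-empty value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Writer/palmyra-mini-thinking-b",
    messages=[
        {
            "role": "user",
            "content": "You have a 3-liter jug and a 5-liter jug. "
                       "How can you measure exactly 4 liters of water?",
        }
    ],
    max_tokens=8000,
    temperature=0.2,
)

print(response.choices[0].message.content)
```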
## Ethical Considerations

As with any language model, there is a potential for generating biased or inaccurate information. Users should be aware of these limitations and use the model responsibly.

### Footnotes

- Base model: This model builds on NVIDIA's OpenReasoning-Nemotron-1.5B (`https://huggingface.co/nvidia/OpenReasoning-Nemotron-1.5B`).
- Evaluation methodology:
  - Pass@1 (avg-of-1): computed using `lm_eval` and `lighteval`.
  - Pass@1 (avg-of-64) and Majority@64: computed using `nemoskills`.

### Citation and Related Information

To cite this model:

```
@misc{Palmyra-mini-thinking-b,
  author = {Writer Engineering team},
  title = {{Palmyra-mini: A powerful LLM designed for math and coding}},
  howpublished = {\url{https://dev.writer.com}},
  year = 2025,
  month = Sep
}
```

Contact: Hello@writer.com