---
license: apache-2.0
base_model:
- Writer/palmyra-mini-thinking-b
tags:
- gguf
- qwen2
- palmyra
- thinking
- reasoning
- quantized
---

# Palmyra Mini Thinking B - GGUF

## Model Description

This repository contains GGUF quantized versions of the [palmyra-mini-thinking-b model](https://huggingface.co/Writer/palmyra-mini-thinking-b), which is based on the Qwen2 architecture. The model is an advanced iteration of the thinking model series, with improved reasoning capabilities and ChatML format support. The GGUF quantizations are optimized for efficient inference across a range of hardware platforms using llama.cpp and compatible frameworks.

## Available Quantizations

### BF16 (Brain Float 16)
- **File**: `palmyra-mini-thinking-b-BF16.gguf`
- **Size**: 3.3GB
- **Precision**: 16-bit brain float
- **Use Case**: Highest reasoning quality; requires more memory

### Q8_0 (8-bit Quantization)
- **File**: `palmyra-mini-thinking-b-Q8_0.gguf`
- **Size**: 1.8GB
- **Precision**: 8-bit integer
- **Use Case**: Good balance of reasoning quality and efficiency

## Quick Start

### Installation

```bash
# Build llama.cpp from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# Or use a pre-built binary
```

### Usage

```bash
# Run with the ChatML prompt format
./main -m /path/to/palmyra-mini-thinking-b-BF16.gguf \
  -p "<|im_start|>user\nSolve this step by step: What is 30% of 250?<|im_end|>\n<|im_start|>assistant\n" \
  -n 512

# Interactive mode
./main -m /path/to/palmyra-mini-thinking-b-Q8_0.gguf -i
```

## LM Studio Use

Steps for downloading a model through the **Discover** tab can be found [here](https://lmstudio.ai/docs/app/basics/download-model).

## Ollama Use

Please see [the guide in this repo](https://huggingface.co/Writer/palmyra-mini-thinking-b-GGUF/resolve/main/ollama-README-B.md?download=true) for steps on how to load this model into Ollama. A minimal sketch of the typical workflow follows.
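The sketch below is illustrative only and assumes the Q8_0 file has been downloaded to the working directory; the local model name is arbitrary, and depending on your Ollama version the Modelfile may also need `TEMPLATE` and `PARAMETER` directives for ChatML, which the linked guide covers:

```bash
# Point a Modelfile at the downloaded GGUF (path is illustrative)
cat > Modelfile <<'EOF'
FROM ./palmyra-mini-thinking-b-Q8_0.gguf
EOF

# Register the model under a local name, then chat with it
ollama create palmyra-mini-thinking-b -f Modelfile
ollama run palmyra-mini-thinking-b "Solve this step by step: What is 30% of 250?"
```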
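## llama.cpp Server Use

llama.cpp also ships an HTTP server that exposes an OpenAI-compatible endpoint and applies the chat template embedded in the GGUF, so you do not have to hand-write ChatML markers. A minimal sketch, assuming a recent build (the binary is named `llama-server` in current releases and `server` in older ones; the port and paths are illustrative):

```bash
# Serve the model over an OpenAI-compatible HTTP API
./llama-server -m /path/to/palmyra-mini-thinking-b-Q8_0.gguf -c 4096 --port 8080

# In another shell: the server formats messages with the model's ChatML template
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Solve this step by step: What is 30% of 250?"}
        ],
        "max_tokens": 512
      }'
```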
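## Extending the Context Window

As noted under Technical Specifications and Known Limitations below, the model loads with a 4,096-token context by default but supports up to 131,072 tokens. With llama.cpp the window is set at load time via `-c`/`--ctx-size`, and memory use grows with the window size. A minimal sketch (the 8,192-token value is just an example):

```bash
# Raise the context window to 8,192 tokens; KV-cache memory grows with -c
./main -m /path/to/palmyra-mini-thinking-b-Q8_0.gguf \
  -c 8192 \
  -p "<|im_start|>user\nSummarize the following document: ...<|im_end|>\n<|im_start|>assistant\n" \
  -n 512
```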
## Technical Specifications

### Model Architecture
- **Model Type**: `qwen2` (Qwen2 architecture)
- **Architecture**: `Qwen2ForCausalLM`
- **Parameters**: ~1.7 billion
- **Base Precision**: bfloat16
- **Specialization**: Advanced reasoning and thinking tasks

### Core Parameters

| Parameter | Value |
|-----------|-------|
| Hidden Size | 1,536 |
| Intermediate Size | 8,960 |
| Number of Layers | 28 |
| Attention Heads | 12 |
| Key-Value Heads | 2 |
| Head Dimension | 128 |
| Vocabulary Size | 151,936 |

### Attention Mechanism
- **Attention Type**: Full attention across all 28 layers
- **Max Position Embeddings**: 131,072 tokens
- **Context Length**: 4,096 tokens (default)
- **Sliding Window**: Not used

### Advanced Features
- **Extended Context**: Enhanced RoPE theta (1,000,000.0) for better long-context performance
- **ChatML Format**: Standard ChatML conversation format
- **Tokenizer**: Qwen2Tokenizer with a 151,936-token vocabulary

### Quantization Comparison

| Format | Size | Precision | Reasoning Quality | Speed | Memory | Compression |
|--------|-------|-----------|-------------------|--------|--------|-------------|
| BF16 | 3.3GB | 16-bit | Highest | Slower | High | None |
| Q8_0 | 1.8GB | 8-bit | High | Faster | Medium | ~45% |

### File Structure

```
palmyra-mini-thinking-b/GGUF/
├── palmyra-mini-thinking-b-BF16.gguf   # BF16 quantization
└── palmyra-mini-thinking-b-Q8_0.gguf   # Q8_0 quantization
```

## Performance Characteristics

### Hardware Requirements
- **CPU**: Modern x86_64 or ARM64 processor
- **Memory**:
  - BF16: 4GB+ RAM recommended
  - Q8_0: 3GB+ RAM recommended
- **Platform**: Cross-platform (Windows, macOS, Linux)

### Inference Performance
- **BF16**: Highest reasoning quality, slower inference
- **Q8_0**: ~45% smaller files, faster inference with largely preserved reasoning quality

## Training Details

### Tokenizer
- **Type**: Qwen2Tokenizer with a 151,936-token vocabulary
- **Special Tokens**:
  - EOS Token ID: 151643 (`<|endoftext|>`)
  - Pad Token ID: 151643 (`<|endoftext|>`)
  - IM Start: 151644 (`<|im_start|>`)
  - IM End: 151645 (`<|im_end|>`)

### Model Configuration
- **Hidden Activation**: SiLU (Swish)
- **Normalization**: RMSNorm (ε = 1e-06)
- **Initializer Range**: 0.02
- **Attention Dropout**: 0.0
- **Word Embeddings**: Tied

### Chat Template

The model uses the standard ChatML format:

```
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
```

## Usage Examples

### Reasoning Task

```bash
./main -m palmyra-mini-thinking-b-Q8_0.gguf \
  -p "<|im_start|>user\nA rectangle has a length of 15 cm and a width of 10 cm. What is its area and perimeter?<|im_end|>\n<|im_start|>assistant\n" \
  -n 300 \
  --temp 0.7
```

### Problem Solving with a System Message

```bash
./main -m palmyra-mini-thinking-b-BF16.gguf \
  -p "<|im_start|>system\nYou are a helpful assistant that explains concepts clearly and step by step.<|im_end|>\n<|im_start|>user\nExplain how photosynthesis works.<|im_end|>\n<|im_start|>assistant\n" \
  -n 400 \
  --temp 0.8
```

## Known Limitations

1. **Context Length**: The default context is 4,096 tokens, though the model supports up to 131,072; raise it at load time, e.g., with llama.cpp's `-c`/`--ctx-size` flag (see "Extending the Context Window" above)
2. **Format Dependency**: Optimized for the ChatML format; other prompt formats may not work as well
3. **Quantization Trade-offs**: Lower-bit quantizations may reduce reasoning quality
4. **Platform Optimization**: Performance varies across hardware configurations

## Compatibility

- **llama.cpp**: Compatible with recent versions
- **Frameworks**: Ollama, LM Studio, GPT4All, and other GGUF-compatible tools
- **Platforms**: Windows, macOS, Linux (x86_64, ARM64)
- **Chat Format**: ChatML support required for optimal performance

## License

Apache 2.0

#### Original model card below:

---