coolAI committed on
Commit 5bed6c0 · verified · 1 Parent(s): ee53ad6

Update README.md

Files changed (1)
  1. README.md +165 -10
README.md CHANGED
@@ -1,19 +1,174 @@
  ---
  tags:
- - gguf
- - llama.cpp
  - unsloth

  ---

- # precis-gguf - GGUF

- This model was finetuned and converted to GGUF format using [Unsloth](https://github.com/unslothai/unsloth).

- **Example usage**:
- - For text only LLMs: **llama-cli** **--hf** repo_id/model_name **-p** "why is the sky blue?"
- - For multimodal models: **llama-mtmd-cli** **-m** model_name.gguf **--mmproj** mmproj_file.gguf

- ## Available Model files:
- - `granite-4.0-h-micro.Q8_0.gguf`
- - `granite-4.0-h-micro.Q4_K_M.gguf`
  ---
+ base_model: unsloth/granite-4.0-h-micro
  tags:
+ - text-generation-inference
+ - transformers
  - unsloth
+ - granitemoehybrid
+ - trl
+ license: apache-2.0
+ language:
+ - en
+ ---
+
+ # Precis: Document Summarization
+
+ ## Model Overview
+
+ **Precis** is a specialized document summarization model fine-tuned from IBM's Granite 4.0-H-Micro (3.2B parameters) using efficient LoRA adapters. It generates comprehensive ~300-word summaries optimized for downstream question answering, while keeping documents private through local, on-premises processing.
+
+ **Key Features:**
+ - 🔒 **Privacy-First**: Process sensitive documents entirely on your own infrastructure
+ - ⚡ **Fast**: ~0.5s inference time (typically 5-10x faster than cloud APIs)
+ - 💰 **Cost-Effective**: Zero per-document API fees
+ - 📚 **Long Context**: 128K tokens ≈ 320-380 book pages
+ - 🎯 **Specialized**: Trained on 5,500+ document-summary pairs covering millions of tokens
+
+ ## 🚀 Quick Start
+
+ ### Using with Transformers + PEFT
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from peft import PeftModel
+ import torch
+
+ # Load base model
+ base_model = AutoModelForCausalLM.from_pretrained(
+     "unsloth/granite-4.0-h-micro",
+     torch_dtype=torch.float16,
+     device_map="auto"
+ )
+
+ # Load LoRA adapters
+ model = PeftModel.from_pretrained(base_model, "cernis-intelligence/precis")
+ tokenizer = AutoTokenizer.from_pretrained("cernis-intelligence/precis")
+
+ # Generate summary
+ document = """Your long document here..."""
+
+ messages = [
+     {"role": "user", "content": f"Summarize the following document in around 300 words:\n\n{document}"}
+ ]
+
+ inputs = tokenizer.apply_chat_template(
+     messages,
+     tokenize=True,
+     add_generation_prompt=True,
+     return_tensors="pt"
+ ).to(model.device)
+
+ outputs = model.generate(
+     inputs,
+     max_new_tokens=512,
+     temperature=0.3,
+     top_p=0.9,
+     do_sample=True
+ )
+
+ summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
+ print(summary)
+ ```
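+
+ If you plan to run many requests, you can optionally merge the LoRA weights into the base model after loading; this is standard PEFT usage rather than anything specific to this adapter:
+
+ ```python
+ # Fold the adapter into the base weights; returns a plain transformers
+ # model, removing the per-layer LoRA overhead at inference time.
+ model = model.merge_and_unload()
+ ```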
+
+ ### Using with Unsloth (Recommended)
+
+ ```python
+ from unsloth import FastLanguageModel
+
+ model, tokenizer = FastLanguageModel.from_pretrained(
+     model_name="cernis-intelligence/precis",
+     max_seq_length=2048,  # increase for long documents (model supports up to 128K)
+     load_in_4bit=True,    # for lower memory usage
+ )
+
+ FastLanguageModel.for_inference(model)
+
+ messages = [
+     {"role": "user", "content": f"Summarize the following document in around 300 words:\n\n{document}"}
+ ]
+
+ inputs = tokenizer.apply_chat_template(
+     messages,
+     tokenize=True,
+     add_generation_prompt=True,
+     return_tensors="pt"
+ ).to("cuda")
+
+ outputs = model.generate(inputs, max_new_tokens=512, temperature=0.3, do_sample=True)
+ summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
+ ```
+
+ ### Using with vLLM (Production)
+
+ ```python
+ from vllm import LLM, SamplingParams
+ from vllm.lora.request import LoRARequest
+
+ # Initialize vLLM with the base model and LoRA support enabled
+ llm = LLM(
+     model="unsloth/granite-4.0-h-micro",
+     enable_lora=True,
+     max_lora_rank=32,
+     gpu_memory_utilization=0.9
+ )
+
+ # Create LoRA request: adapter name, unique integer ID, adapter repo/path
+ lora_request = LoRARequest(
+     "precis-granite",
+     1,
+     "cernis-intelligence/precis"
+ )
+
+ # Sampling parameters
+ sampling_params = SamplingParams(
+     temperature=0.3,
+     top_p=0.9,
+     max_tokens=512
+ )
+
+ # Generate (raw prompt; apply the chat template for best results)
+ prompts = ["Summarize the following document in around 300 words:\n\n" + document]
+ outputs = llm.generate(prompts, sampling_params, lora_request=lora_request)
+
+ print(outputs[0].outputs[0].text)
+ ```
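+
+ For production serving, the same adapter can also be registered with vLLM's OpenAI-compatible server, e.g. `vllm serve unsloth/granite-4.0-h-micro --enable-lora --lora-modules precis-granite=cernis-intelligence/precis`. A minimal client sketch follows; the server URL, port, and the `precis-granite` module name are illustrative assumptions, not part of this repo:
+
+ ```python
+ from openai import OpenAI
+
+ # Assumes a local vLLM server started with --enable-lora and the adapter
+ # registered under the (hypothetical) module name "precis-granite".
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+ response = client.chat.completions.create(
+     model="precis-granite",  # the LoRA module name, not the base model id
+     messages=[{
+         "role": "user",
+         "content": f"Summarize the following document in around 300 words:\n\n{document}",
+     }],
+     temperature=0.3,
+     max_tokens=512,
+ )
+ print(response.choices[0].message.content)
+ ```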

  ---

+ ## 📊 Training Details
+
+ ### Base Model
+ - **Architecture**: IBM Granite 4.0-H-Micro
+ - **Parameters**: 3.2B (38.4M trainable via LoRA)
+ - **Context Length**: 128K tokens
+ - **License**: Apache 2.0
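+
+ The 38.4M figure refers to the LoRA adapter weights. If you load the adapter with PEFT as in the Quick Start, you can verify it with PEFT's built-in helper:
+
+ ```python
+ # Prints trainable vs. total parameter counts for the loaded PeftModel
+ model.print_trainable_parameters()
+ ```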
+
+ ## 🎯 Use Cases
+
+ ### ✅ Perfect For:
+ - 📄 **Legal Document Review**: Summarize contracts while maintaining confidentiality
+ - 🏥 **Medical Records**: HIPAA-compliant summarization of patient notes
+ - 💼 **Financial Reports**: Analyze earnings reports without exposing sensitive data
+ - 📚 **Research Papers**: Quick digests of academic literature
+ - 📧 **Email Threads**: Comprehensive summaries of long conversations
+
+ ### ⚠️ Considerations:
+ - Works best with documents under ~380 pages (the 128K-token context limit; see the length-check sketch after this list)
+ - Optimized for English text (multilingual support planned)
+ - May miss some deeply nested structured data (tables, forms)
+ - For specialized needs, consider fine-tuning on domain-specific data
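+
+ To stay inside the context window, you can count tokens before summarizing. A minimal sketch, reusing `tokenizer` and `document` from the Quick Start examples (128,000 is the advertised context length):
+
+ ```python
+ # Count tokens before summarizing; very long documents should be chunked.
+ n_tokens = len(tokenizer(document)["input_ids"])
+ if n_tokens > 128_000:
+     print(f"Document is {n_tokens:,} tokens; split it into chunks first.")
+ ```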
+
+ ## 📄 License
+
+ This model is released under the **Apache 2.0 License**, the same as the base IBM Granite 4.0 model.
+
+ ```
+ Copyright 2025
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+ ```