---
base_model: Qwen/Qwen2.5-VL-7B-Instruct
library_name: peft
---
# 🩺 PointDetectCount-Qwen2.5-VL-7B-LoRA
**Model:** `SimulaMet/PointDetectCount-Qwen2.5-VL-7B-LoRA`
**Base model:** [`Qwen/Qwen2.5-VL-7B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
**Library:** `peft` (LoRA)
**Paper:** [arXiv:2505.16647](https://doi.org/10.48550/arXiv.2505.16647)
**Code:** [GitHub - simula/PointDetectCount](https://github.com/simula/PointDetectCount)
**Dataset:** [`SimulaMet/MedMultiPoints`](https://huggingface.co/datasets/SimulaMet/MedMultiPoints)
---
## 📌 Model Summary
`PointDetectCount-Qwen2.5-VL-7B-LoRA` is a **multi-task medical vision-language model** fine-tuned using **LoRA** on top of **Qwen2.5-VL-7B-Instruct**, a vision-language instruction-following model. This model performs **pointing (localization), bounding box detection**, and **object counting** on medical images using natural language prompts and structured JSON outputs.
It is trained on the [MedMultiPoints dataset](https://huggingface.co/datasets/SimulaMet/MedMultiPoints), a multimodal collection of endoscopic and microscopic images with clinical annotations.
---
## 🧠 Intended Uses
- **Medical image localization**: Predict spatial locations (points/bounding boxes) of anatomical/clinical findings.
- **Object counting**: Estimate the number of objects such as polyps, clusters, or cells in medical images.
- **Instruction-tuned VQA**: Answer natural-language instructions about medical images in a visual question-answering style (see the example prompts below).
This model is designed for **research purposes**, particularly in **medical vision-language modeling**, and should not be used directly for clinical diagnosis.
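As an illustration, task-specific instructions for the three tasks could look like the prompts sketched below. The exact prompt templates used during training are produced by `create_datasetJSON.py`, so the wording here is an assumption.
```python
# Hypothetical prompt patterns for pointing, detection, and counting.
# The actual training prompts come from create_datasetJSON.py; these are examples only.
prompts = {
    "pointing":  "Point to each polyp in the image. Respond in JSON with a 'points' list.",
    "detection": "Return bounding boxes for each polyp in the image as a 'bbox' list in JSON.",
    "counting":  "How many polyps are visible? Respond in JSON with a 'count' field.",
}
```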
---
## 🚀 How to Use
```python
import torch
from PIL import Image
from peft import PeftModel
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Load the base model and processor from the Hub, then attach the LoRA adapter
base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
model = PeftModel.from_pretrained(base_model, "SimulaMet/PointDetectCount-Qwen2.5-VL-7B-LoRA")

image = Image.open("example.jpg").convert("RGB")
prompt = "Return bounding boxes for each polyp in the image and the total count."

# Qwen2.5-VL expects the image placeholder to be inserted via the chat template
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt}]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```
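If inference latency matters, the LoRA adapter can optionally be merged into the base weights after loading. This uses standard PEFT functionality and is not specific to this checkpoint.
```python
# Optional: fold the LoRA weights into the base model for faster inference
model = model.merge_and_unload()
```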
---
## 📊 Training Details
- **Fine-tuning method:** [LoRA](https://arxiv.org/abs/2106.09685) (`rank=16`)
- **Frozen components:** Vision encoder (ViT)
- **Trained components:** LLM layers (excluding final LM head)
- **Loss function:** Language modeling loss (cross-entropy over tokens)
- **Format:** Instruction → JSON response (`{"bbox": [...], "count": n, "points": [...]}`)
- **Hardware:** Single NVIDIA A100 (80GB)
- **Epochs:** 5
- **Batch size:** 4 (gradient accumulation used)
- **Learning rate:** 2e-4
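For reference, a minimal PEFT configuration consistent with the settings above might look like the sketch below. Only the rank is reported on this card; the alpha, dropout, and target modules are assumptions.
```python
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

base = Qwen2_5_VLForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# rank=16 as reported above; alpha, dropout, and target modules are assumed values
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)

# Keep the vision encoder frozen so only the LoRA-adapted language-model layers train
for name, param in model.named_parameters():
    if "visual" in name:
        param.requires_grad = False

model.print_trainable_parameters()
```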
---
## 📁 Repository Structure
- `create_datasetJSON.py`: Converts raw annotations into instruction-response format
- `evaluate_qwen.py`: Parses and evaluates model outputs vs. ground truth
- `MedMultiPoints-images/`: Folder containing the training/validation images
---
## 🧪 Evaluation
Each model output is parsed to extract:
- Bounding box coordinates
- Point coordinates
- Object count
The parsed outputs are compared against the ground truth for each modality (GI tract, sperm, clusters, etc.). Performance is measured with precision/recall for detection, mean absolute error (MAE) for counting, and point-proximity scores for pointing.
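The evaluation logic lives in `evaluate_qwen.py`. Purely as an illustration, output parsing and the counting metric could be implemented along these lines (the function names here are hypothetical):
```python
import json
import re

def parse_prediction(text: str) -> dict:
    """Pull the first JSON object out of a model response; return {} if parsing fails."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return {}
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}

def count_mae(predictions: list[dict], references: list[dict]) -> float:
    """Mean absolute error between predicted and ground-truth object counts."""
    errors = [abs(p.get("count", 0) - r["count"]) for p, r in zip(predictions, references)]
    return sum(errors) / len(errors)
```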
---
## 🛑 Limitations
- Trained only on limited domains (GI endoscopy, microscopy).
- Not certified for real-world clinical use.
- Output quality depends on correct JSON generation; parsing may fail on malformed outputs.
---
## 📚 Citation
```bibtex
@article{Gautam2025May,
author = {Gautam, Sushant and Riegler, Michael A. and Halvorsen, Pål},
title = {Point, Detect, Count: Multi-Task Medical Image Understanding with Instruction-Tuned Vision-Language Models},
journal = {arXiv},
year = {2025},
month = {may},
eprint = {2505.16647},
doi = {10.48550/arXiv.2505.16647}
}
```
---
## 🤝 Acknowledgements
Developed by researchers at **SimulaMet**, **Simula Research Laboratory**, and **OsloMet**.
Part of ongoing efforts to enhance **instruction-tuned medical VLMs** for robust multimodal reasoning.