---
base_model: Qwen/Qwen2.5-VL-7B-Instruct
library_name: peft
---
# 🩺 PointDetectCount-Qwen2.5-VL-7B-LoRA
**Model:** `SimulaMet/PointDetectCount-Qwen2.5-VL-7B-LoRA`
**Base model:** [`Qwen/Qwen2.5-VL-7B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
**Library:** `peft` (LoRA)
**Paper:** [arXiv:2505.16647](https://doi.org/10.48550/arXiv.2505.16647)
**Code:** [GitHub - simula/PointDetectCount](https://github.com/simula/PointDetectCount)
**Dataset:** [`SimulaMet/MedMultiPoints`](https://huggingface.co/datasets/SimulaMet/MedMultiPoints)
---
## 📌 Model Summary
`PointDetectCount-Qwen2.5-VL-7B-LoRA` is a **multi-task medical vision-language model** fine-tuned using **LoRA** on top of **Qwen2.5-VL-7B-Instruct**, a vision-language instruction-following model. This model performs **pointing (localization), bounding box detection**, and **object counting** on medical images using natural language prompts and structured JSON outputs.
It is trained on the [MedMultiPoints dataset](https://huggingface.co/datasets/SimulaMet/MedMultiPoints), a multimodal collection of endoscopic and microscopic images with clinical annotations.
---
## 🧠 Intended Uses
- **Medical image localization**: Predict spatial locations (points/bounding boxes) of anatomical/clinical findings.
- **Object counting**: Estimate the number of objects such as polyps, clusters, or cells in medical images.
- **Instruction-tuned VQA**: Answer natural-language instructions about medical images in a visual question-answering style (see the example prompts below).
This model is designed for **research purposes**, particularly in **medical vision-language modeling**, and should not be used directly for clinical diagnosis.
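As an illustration, task-specific instructions for the three tasks could look like the prompts sketched below. The exact prompt templates used during training are produced by `create_datasetJSON.py`, so the wording here is an assumption.
```python
# Hypothetical prompt patterns for pointing, detection, and counting.
# The actual training prompts come from create_datasetJSON.py; these are examples only.
prompts = {
    "pointing":  "Point to each polyp in the image. Respond in JSON with a 'points' list.",
    "detection": "Return bounding boxes for each polyp in the image as a 'bbox' list in JSON.",
    "counting":  "How many polyps are visible? Respond in JSON with a 'count' field.",
}
```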
---
## 🚀 How to Use
```python
import torch
from PIL import Image
from peft import PeftModel
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Load the base model and processor from the Hub, then attach the LoRA adapter
base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
model = PeftModel.from_pretrained(base_model, "SimulaMet/PointDetectCount-Qwen2.5-VL-7B-LoRA")

image = Image.open("example.jpg").convert("RGB")
prompt = "Return bounding boxes for each polyp in the image and the total count."

# Qwen2.5-VL expects the image placeholder to be inserted via the chat template
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt}]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```
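If inference latency matters, the LoRA adapter can optionally be merged into the base weights after loading. This uses standard PEFT functionality and is not specific to this checkpoint.
```python
# Optional: fold the LoRA weights into the base model for faster inference
model = model.merge_and_unload()
```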
---
## 📊 Training Details
- **Fine-tuning method:** [LoRA](https://arxiv.org/abs/2106.09685) (`rank=16`)
- **Frozen components:** Vision encoder (ViT)
- **Trained components:** LLM layers (excluding final LM head)
- **Loss function:** Language modeling loss (cross-entropy over tokens)
- **Format:** Instruction → JSON response (`{"bbox": [...], "count": n, "points": [...]}`)
- **Hardware:** Single NVIDIA A100 (80GB)
- **Epochs:** 5
- **Batch size:** 4 (gradient accumulation used)
- **Learning rate:** 2e-4
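For reference, a minimal PEFT configuration consistent with the settings above might look like the sketch below. Only the rank is reported on this card; the alpha, dropout, and target modules are assumptions.
```python
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

base = Qwen2_5_VLForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# rank=16 as reported above; alpha, dropout, and target modules are assumed values
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)

# Keep the vision encoder frozen so only the LoRA-adapted language-model layers train
for name, param in model.named_parameters():
    if "visual" in name:
        param.requires_grad = False

model.print_trainable_parameters()
```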
---
## 📁 Repository Structure
- `create_datasetJSON.py`: Converts raw annotations into instruction-response format
- `evaluate_qwen.py`: Parses and evaluates model outputs vs. ground truth
- `MedMultiPoints-images/`: Folder containing the training/validation images
---
## 🧪 Evaluation
Each model output is parsed to extract:
- Bounding box coordinates
- Point coordinates
- Object count
The parsed outputs are compared against the ground truth for each modality (GI tract, sperm, clusters, etc.). Performance is measured with precision/recall for detection, mean absolute error (MAE) for counting, and point-proximity scores for pointing.
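The evaluation logic lives in `evaluate_qwen.py`. Purely as an illustration, output parsing and the counting metric could be implemented along these lines (the function names here are hypothetical):
```python
import json
import re

def parse_prediction(text: str) -> dict:
    """Pull the first JSON object out of a model response; return {} if parsing fails."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return {}
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}

def count_mae(predictions: list[dict], references: list[dict]) -> float:
    """Mean absolute error between predicted and ground-truth object counts."""
    errors = [abs(p.get("count", 0) - r["count"]) for p, r in zip(predictions, references)]
    return sum(errors) / len(errors)
```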
---
## 🛑 Limitations
- Trained only on limited domains (GI endoscopy, microscopy).
- Not certified for real-world clinical use.
- Output quality depends on correct JSON generation; parsing may fail on malformed outputs.
---
## 📚 Citation
```bibtex
@article{Gautam2025May,
author = {Gautam, Sushant and Riegler, Michael A. and Halvorsen, Pål},
title = {Point, Detect, Count: Multi-Task Medical Image Understanding with Instruction-Tuned Vision-Language Models},
journal = {arXiv},
year = {2025},
month = {may},
eprint = {2505.16647},
doi = {10.48550/arXiv.2505.16647}
}
```
---
## 🤝 Acknowledgements
Developed by researchers at **SimulaMet**, **Simula Research Laboratory**, and **OsloMet**.
Part of ongoing efforts to enhance **instruction-tuned medical VLMs** for robust multimodal reasoning.