---
base_model: Qwen/Qwen2.5-VL-7B-Instruct
library_name: peft
---

# 🩺 PointDetectCount-Qwen2.5-VL-7B-LoRA

- **Model:** `SimulaMet/PointDetectCount-Qwen2.5-VL-7B-LoRA`
- **Base model:** [`Qwen/Qwen2.5-VL-7B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
- **Library:** `peft` (LoRA)
- **Paper:** [arXiv:2505.16647](https://doi.org/10.48550/arXiv.2505.16647)
- **Code:** [GitHub - simula/PointDetectCount](https://github.com/simula/PointDetectCount)
- **Dataset:** [`SimulaMet/MedMultiPoints`](https://huggingface.co/datasets/SimulaMet/MedMultiPoints)

---

## 📌 Model Summary

`PointDetectCount-Qwen2.5-VL-7B-LoRA` is a **multi-task medical vision-language model** fine-tuned with **LoRA** on top of **Qwen2.5-VL-7B-Instruct**, a vision-language instruction-following model. It performs **pointing (localization)**, **bounding-box detection**, and **object counting** on medical images from natural-language prompts, returning structured JSON outputs. The model is trained on the [MedMultiPoints dataset](https://huggingface.co/datasets/SimulaMet/MedMultiPoints), a multimodal collection of endoscopic and microscopic images with clinical annotations.

---

## 🧠 Intended Uses

- **Medical image localization**: Predict spatial locations (points/bounding boxes) of anatomical or clinical findings.
- **Object counting**: Estimate the number of objects such as polyps, clusters, or cells in medical images.
- **Instruction-tuned VQA**: Answer natural-language queries that require multimodal image understanding.

This model is intended for **research purposes**, particularly in **medical vision-language modeling**, and should not be used directly for clinical diagnosis.

---

## 🚀 How to Use

```python
import torch
from PIL import Image
from peft import PeftModel
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Load the base model and processor, then attach the LoRA adapter.
base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
model = PeftModel.from_pretrained(base_model, "SimulaMet/PointDetectCount-Qwen2.5-VL-7B-LoRA")

image = Image.open("example.jpg").convert("RGB")
prompt = "Return bounding boxes for each polyp in the image and the total count."

# Qwen2.5-VL expects chat-formatted input with an image placeholder.
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt}]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```

---

## 📊 Training Details

- **Fine-tuning method:** [LoRA](https://arxiv.org/abs/2106.09685) (`rank=16`)
- **Frozen components:** Vision encoder (ViT)
- **Trained components:** LLM layers (excluding the final LM head)
- **Loss function:** Language-modeling loss (cross-entropy over tokens)
- **Format:** Instruction → JSON response (`{"bbox": [...], "count": n, "points": [...]}`)
- **Hardware:** Single NVIDIA A100 (80 GB)
- **Epochs:** 5
- **Batch size:** 4 (with gradient accumulation)
- **Learning rate:** 2e-4

---

## 📁 Repository Structure

- `create_datasetJSON.py`: Converts raw annotations into instruction-response format
- `evaluate_qwen.py`: Parses model outputs and evaluates them against ground truth
- `MedMultiPoints-images/`: Folder containing the training/validation images

---

## 🧪 Evaluation

Each model output is parsed to extract:

- Bounding-box coordinates
- Point coordinates
- Object count

The parsed outputs are compared against the ground truth for each modality (GI tract, sperm, clusters, etc.). Performance is measured with precision/recall for detection, mean absolute error for counting, and proximity scores for pointing; a minimal parsing sketch is shown below.
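Recovering the structured JSON from free-form generations is the first step of this evaluation, and it is also where malformed outputs surface (see Limitations below). The snippet below is a minimal, illustrative sketch of that step and of the counting metric; it does not reproduce `evaluate_qwen.py`, the helper names `parse_prediction` and `count_mae` are hypothetical, and the keys `bbox`, `points`, and `count` follow the response format listed under Training Details.

```python
import json
import re


def parse_prediction(generated_text: str):
    """Pull the first JSON object out of a generated string; return None if it is malformed."""
    match = re.search(r"\{.*\}", generated_text, re.DOTALL)
    if match is None:
        return None
    try:
        pred = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return {
        "bbox": pred.get("bbox", []),
        "points": pred.get("points", []),
        "count": pred.get("count", 0),
    }


def count_mae(parsed_predictions, ground_truth_counts):
    """Mean absolute error on counts, skipping outputs that failed to parse."""
    pairs = [(p["count"], gt) for p, gt in zip(parsed_predictions, ground_truth_counts) if p is not None]
    return sum(abs(c - gt) for c, gt in pairs) / max(len(pairs), 1)
```

Detection and pointing metrics would be computed in the same way from the parsed `bbox` and `points` fields.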
---

## 🛑 Limitations

- Trained only on limited domains (GI endoscopy and microscopy).
- Not certified for real-world clinical use.
- The output format depends on correct JSON generation; parsing may fail on malformed outputs.

---

## 📚 Citation

```bibtex
@article{Gautam2025May,
  author  = {Gautam, Sushant and Riegler, Michael A. and Halvorsen, Pål},
  title   = {Point, Detect, Count: Multi-Task Medical Image Understanding with Instruction-Tuned Vision-Language Models},
  journal = {arXiv},
  year    = {2025},
  month   = {may},
  eprint  = {2505.16647},
  doi     = {10.48550/arXiv.2505.16647}
}
```

---

## 🤝 Acknowledgements

Developed by researchers at **SimulaMet**, **Simula Research Laboratory**, and **OsloMet**, as part of ongoing efforts to enhance **instruction-tuned medical VLMs** for robust multimodal reasoning.