---
base_model: Qwen/Qwen2.5-VL-7B-Instruct
library_name: peft
---
|
|
# 🩺 PointDetectCount-Qwen2.5-VL-7B-LoRA |
|
|
|
|
|
**Model:** `SimulaMet/PointDetectCount-Qwen2.5-VL-7B-LoRA` |
|
|
**Base model:** [`Qwen/Qwen2.5-VL-7B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) |
|
|
**Library:** `peft` (LoRA) |
|
|
**Paper:** [arXiv:2505.16647](https://doi.org/10.48550/arXiv.2505.16647) |
|
|
**Code:** [GitHub - simula/PointDetectCount](https://github.com/simula/PointDetectCount) |
|
|
**Dataset:** [`SimulaMet/MedMultiPoints`](https://huggingface.co/datasets/SimulaMet/MedMultiPoints) |
|
|
|
|
|
--- |
|
|
|
|
|
## 📌 Model Summary |
|
|
|
|
|
`PointDetectCount-Qwen2.5-VL-7B-LoRA` is a **multi-task medical vision-language model** obtained by **LoRA** fine-tuning of the instruction-following **Qwen2.5-VL-7B-Instruct**. It performs **pointing (localization)**, **bounding-box detection**, and **object counting** on medical images, taking natural-language prompts as input and returning structured JSON outputs.
|
|
|
|
|
It is trained on the [MedMultiPoints dataset](https://huggingface.co/datasets/SimulaMet/MedMultiPoints), a multimodal collection of endoscopic and microscopic images with clinical annotations. |
|
|
|
|
|
--- |
|
|
|
|
|
## 🧠 Intended Uses |
|
|
|
|
|
- **Medical image localization**: Predict spatial locations (points/bounding boxes) of anatomical/clinical findings. |
|
|
- **Object counting**: Estimate the number of objects (e.g., polyps, clusters, or cells) in medical images.


- **Instruction-tuned VQA**: Answer natural-language queries that require multimodal understanding of the image.
|
|
|
|
|
This model is designed for **research purposes**, particularly in **medical vision-language modeling**, and should not be used directly for clinical diagnosis. |
|
|
|
|
|
--- |
|
|
|
|
|
## 🚀 How to Use |
|
|
|
|
|
```python
import torch
from PIL import Image
from peft import PeftModel
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Load the base vision-language model and attach the LoRA adapter
base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "SimulaMet/PointDetectCount-Qwen2.5-VL-7B-LoRA")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

image = Image.open("example.jpg").convert("RGB")
prompt = "Return bounding boxes for each polyp in the image and the total count."

# Qwen2.5-VL expects chat-formatted input with an image placeholder
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt}]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)

print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```
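
The model is trained to reply with a structured JSON object (see Training Details below). Continuing from the snippet above, a minimal parsing sketch, assuming the reply contains a single JSON object:

```python
import json

def parse_structured_output(decoded_text: str):
    """Extract the JSON object from the decoded generation; return None if it is malformed."""
    start, end = decoded_text.find("{"), decoded_text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(decoded_text[start:end + 1])
    except json.JSONDecodeError:
        return None

result = parse_structured_output(processor.batch_decode(outputs, skip_special_tokens=True)[0])
print(result)  # e.g. {"bbox": [...], "count": ..., "points": [...]}, or None if parsing failed
```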
|
|
|
|
|
--- |
|
|
|
|
|
## 📊 Training Details |
|
|
|
|
|
- **Fine-tuning method:** [LoRA](https://arxiv.org/abs/2106.09685) with `rank=16` (see the configuration sketch after this list)
|
|
- **Frozen components:** Vision encoder (ViT) |
|
|
- **Trained components:** LLM layers (excluding final LM head) |
|
|
- **Loss function:** Language modeling loss (cross-entropy over tokens) |
|
|
- **Format:** Instruction → JSON response (`{"bbox": [...], "count": n, "points": [...]}`) |
|
|
- **Hardware:** Single NVIDIA A100 (80GB) |
|
|
- **Epochs:** 5 |
|
|
- **Batch size:** 4 (gradient accumulation used) |
|
|
- **Learning rate:** 2e-4 |
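
A minimal PEFT setup consistent with the hyperparameters above might look like the sketch below. The card does not state `lora_alpha`, dropout, or `target_modules`, so those values are assumptions for illustration, not the configuration used in the paper.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16
)

# Freeze the vision encoder (ViT), as described above; PEFT freezes base weights anyway,
# but this makes the intent explicit.
for name, param in base.named_parameters():
    if "visual" in name:
        param.requires_grad = False

lora_config = LoraConfig(
    r=16,                                # rank=16, as stated above
    lora_alpha=32,                       # assumption
    lora_dropout=0.05,                   # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption: LLM attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```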
|
|
|
|
|
--- |
|
|
|
|
|
## 📁 Repository Structure |
|
|
|
|
|
- `create_datasetJSON.py`: Converts raw annotations into the instruction-response format (an illustrative record is sketched after this list)
|
|
- `evaluate_qwen.py`: Parses and evaluates model outputs vs. ground truth |
|
|
- `MedMultiPoints-images/`: Folder containing the training/validation images |
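
For orientation only, a single training record in the instruction-response format described under Training Details could look roughly like this; the field names are hypothetical, and the authoritative schema is whatever `create_datasetJSON.py` produces:

```python
# Hypothetical record layout; field names and values are illustrative, not the repository schema.
record = {
    "image": "MedMultiPoints-images/example.jpg",
    "instruction": "Return bounding boxes for each polyp in the image and the total count.",
    "response": {"bbox": [[120, 88, 240, 190]], "count": 1, "points": [[180, 139]]},
}
```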
|
|
|
|
|
--- |
|
|
|
|
|
## 🧪 Evaluation |
|
|
|
|
|
Each model output is parsed to extract: |
|
|
- Bounding box coordinates |
|
|
- Point coordinates |
|
|
- Object count |
|
|
|
|
|
The parsed outputs are compared against the ground truth for each modality (GI tract, sperm, clusters, etc.). Accuracy is measured through precision/recall on detection, mean absolute error for counting, and proximity scores for pointing. |
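
As a rough sketch (not the repository's `evaluate_qwen.py`), per-example counting error and a nearest-neighbour pointing proximity could be computed as follows; the exact proximity definition used in the paper may differ:

```python
import math

def count_absolute_error(pred_count: int, true_count: int) -> int:
    """Absolute counting error; averaging over the test set gives the MAE."""
    return abs(pred_count - true_count)

def point_proximity(pred_points, true_points):
    """Mean distance from each ground-truth point to its nearest predicted point."""
    if not pred_points or not true_points:
        return float("inf")
    distances = [min(math.dist(t, p) for p in pred_points) for t in true_points]
    return sum(distances) / len(distances)

print(count_absolute_error(3, 2))               # 1
print(point_proximity([[10, 10]], [[13, 14]]))  # 5.0
```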
|
|
|
|
|
--- |
|
|
|
|
|
## 🛑 Limitations |
|
|
|
|
|
- Trained only on limited domains (GI endoscopy, microscopy). |
|
|
- Not certified for real-world clinical use. |
|
|
- Output parsing relies on the model generating well-formed JSON; malformed outputs cannot be parsed.
|
|
|
|
|
--- |
|
|
|
|
|
## 📚 Citation |
|
|
|
|
|
```bibtex |
|
|
@article{Gautam2025May, |
|
|
author = {Gautam, Sushant and Riegler, Michael A. and Halvorsen, Pål}, |
|
|
title = {Point, Detect, Count: Multi-Task Medical Image Understanding with Instruction-Tuned Vision-Language Models}, |
|
|
journal = {arXiv}, |
|
|
year = {2025}, |
|
|
month = {may}, |
|
|
eprint = {2505.16647}, |
|
|
doi = {10.48550/arXiv.2505.16647} |
|
|
} |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## 🤝 Acknowledgements |
|
|
|
|
|
Developed by researchers at **SimulaMet**, **Simula Research Laboratory**, and **OsloMet**. |
|
|
Part of ongoing efforts to enhance **instruction-tuned medical VLMs** for robust multimodal reasoning. |