# TCM-VisResolve: Multimodal Recognition of Dried Herbs and Clinical MCQ Answering in TCM
TCM-VisResolve (TCM-VR) is a domain-specific multimodal large language model (MLLM) for Traditional Chinese Medicine (TCM). Fine-tuned from Qwen2.5-VL, it is designed to bridge the gap between visual recognition and symbolic reasoning in TCM, and it excels at two primary tasks:
- Recognizing images of dried medicinal herbs.
- Answering clinical-style multiple-choice questions (MCQs).

The model was trained using the LLaMA Factory framework.
## 👨‍💻 Authors and Affiliations
- Wudao Yang
  - Affiliation: CS Dept., School of Mathematics & CS, Yunnan Minzu University, Kunming, China
  - Email: wudaoyang@ymu.edu.cn
- Zhiqiang Yu* (Corresponding author)
  - Affiliation: CS Dept., School of Mathematics & CS, Yunnan Minzu University, Kunming, China
  - Email: yzqyt@ymu.edu.cn
- Chee Seng Chan* (Corresponding author)
  - Affiliation: AI Dept., Faculty of CS & IT (FSKTM), Universiti Malaya, Kuala Lumpur, Malaysia
  - Email: cs.chan@um.edu.my
## 🚀 Key Features
- Domain-Specific Expertise: Fine-tuned on a large dataset of 220,000 herb images (163 classes) and 220,000 clinical MCQs.
- High Accuracy: Achieves 96.7% accuracy on a held-out test suite of TCM-related MCQs, significantly outperforming general-purpose models such as GPT-4o and Gemini.
- Robust & Reliable: Incorporates a Cross-Transformation Memory Mechanism (CTMM) to prevent overfitting and "answer position bias", forcing the model to reason about content rather than memorize patterns.
## 🛠️ Training Procedure

### Base Model
TCM-VR uses Qwen2.5-VL as its vision-language backbone.
### Dataset
The model was fine-tuned on a comprehensive, specially curated dataset:
- Images: 220,000 real-world images of dried and processed herbs across 163 categories.
- Text: 220,000 multiple-choice questions (880,000 answer options in total), structured in a vision-language JSON format (a sketch of one record is shown below).
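The model card does not reproduce a full training record, so the snippet below is only a sketch of what one entry in such a vision-language JSON file might look like. The field names follow LLaMA Factory's sharegpt-style multimodal convention and are assumptions; the question text and image filename are abridged from Figure 2 of the paper.

```python
# Illustrative record only: field names ("messages", "images") are assumed from the
# LLaMA Factory multimodal format; the content is abridged from the paper's Figure 2.
import json

record = {
    "messages": [
        {
            "role": "user",
            "content": "<image>这是什么中药? 以下哪项不是该药材的适应症? A. ... B. ... C. ... D. ...",
        },
        {
            "role": "assistant",
            "content": "名称: 干姜  拼音: gangjiang ... 正确答案是 A ...",
        },
    ],
    "images": ["data/mcq_output/gangjiang_gangjiang_1577.jpg"],
}
print(json.dumps(record, ensure_ascii=False, indent=2))
```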
### Cross-Transformation Memory Mechanism (CTMM)
To ensure the model learns to reason rather than memorize, the CTMM was applied during training. This mechanism enforces semantic consistency by:
- Paraphrasing Prompts: Using varied linguistic structures for semantically identical questions.
- Shuffling Answer Orders: Randomizing the A, B, C, D options of each question to disrupt positional biases (a minimal sketch follows this list).
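The following is a minimal sketch of the answer-shuffling idea, not the authors' implementation: the A–D options are permuted and the gold label is remapped, so answer position carries no signal.

```python
# Minimal sketch of CTMM-style answer shuffling (not the authors' code): permute the
# options of an MCQ and remap the gold label so position cannot be memorized.
import random

def shuffle_options(question: str, options: list[str], answer_idx: int, rng: random.Random) -> tuple[str, str]:
    """Return a prompt with shuffled options and the letter of the relocated gold answer."""
    order = list(range(len(options)))
    rng.shuffle(order)
    labels = "ABCD"
    prompt = question + " " + " ".join(
        f"{labels[i]}. {options[j]}" for i, j in enumerate(order)
    )
    return prompt, labels[order.index(answer_idx)]

rng = random.Random(0)
prompt, gold = shuffle_options(
    "这是什么中药? 以下哪项不是该药材的适应症?",
    ["主治: 赤眼涩痛...", "主治: ...", "主治: 鼻衄不止...", "主治: ..."],
    answer_idx=0,
    rng=rng,
)
print(prompt, "->", gold)
```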
## 📊 Performance
On a held-out test split, TCM-VR achieves 96.7% accuracy on multimodal clinical MCQs. Case studies confirm that the model selects the correct answer even when its position is shuffled, indicating that it reasons over option content rather than position (an illustrative consistency check is sketched below).
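To make the shuffled-position claim concrete, here is an illustrative consistency check rather than the paper's actual evaluation script: the same MCQ is posed under every permutation of its options, and the chosen option content must stay the same. The `answer_mcq` callable is hypothetical; it would wrap the generation code shown in the How to Use section and return the predicted letter.

```python
# Illustrative position-bias check (not the paper's evaluation script): ask the same
# MCQ under every option permutation and verify the chosen *content* never changes.
# `answer_mcq` is a hypothetical callable that maps a prompt to "A"/"B"/"C"/"D".
from itertools import permutations
from typing import Callable

def consistent_under_shuffling(
    answer_mcq: Callable[[str], str], question: str, options: list[str]
) -> bool:
    labels = "ABCD"
    chosen = set()
    for perm in permutations(range(len(options))):
        prompt = question + " " + " ".join(
            f"{labels[i]}. {options[j]}" for i, j in enumerate(perm)
        )
        letter = answer_mcq(prompt)
        chosen.add(options[perm[labels.index(letter)]])  # map the letter back to content
    return len(chosen) == 1
```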
## 🖼️ Example Multimodal Cases
These examples illustrate typical inputs (herb images + MCQs) and the model's reasoning-style answers.
## 💡 How to Use
You can use this model similarly to other Qwen2.5-VL models for multimodal chat. The model expects an image and a structured query, as shown in the paper's training data (Figure 2).
```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

# Load the model and processor (standard Qwen2.5-VL inference recipe)
# !! Replace "your-username/TCM-VisResolve" with your model's HF path !!
model_id = "your-username/TCM-VisResolve"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
).eval()
processor = AutoProcessor.from_pretrained(model_id)

# 1. Load your herb image
# Example using the 'Gan Jiang' (dried ginger) image from the paper
image_path = "path/to/your/herb_image.jpg"  # e.g., "data/mcq_output/gangjiang_gangjiang_1577.jpg"
image = Image.open(image_path)

# 2. Format your query: the herb image plus the MCQ text
# Abridged from Figure 2 in the paper; the option texts are truncated here
question = (
    "这是什么中药? 以下哪项不是该药材的适应症? "  # "What herb is this? Which option is NOT an indication of it?"
    "A. 主治: 赤眼涩痛... "
    "B. 主治: ... "
    "C. 主治: 鼻衄不止... "
    "D. 主治: ..."
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": question},
        ],
    }
]

# Build the chat prompt and pack text + image into model inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# 3. Generate the response and strip the prompt tokens from the output
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

# Example output (abridged from Figure 2 of the paper):
# "名称: 干姜  拼音: gangjiang ... 正确答案是 A ..."
```