---
license: mit
library_name: transformers
pipeline_tag: question-answering
---

# m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning in Large Language Models

A simple test-time scaling strategy, with minimal fine-tuning, can unlock strong medical reasoning within large language models.

## ⚡ Introduction

![](assets/teaser.png)

Hi! Welcome to the repository for **m1** (📃 [Paper](https://arxiv.org/abs/2504.00869))!

**m1** is a medical LLM designed to enhance reasoning through efficient test-time scaling. It enables lightweight models to match or exceed the performance of much larger counterparts by extending inference-time "thinking." Unlike methods that rely on complex RL or expert supervision, m1 achieves strong results through:

- **Fine-tuning on a small, high-quality set of verified medical reasoning examples**, showing that even with just 1K–23K examples, m1-7B *surpasses* previous SOTA models like HuatuoGPT-o1-7B and UltraMedical-8B, and m1-32B *rivals* 70B-scale models.
- **Scaling reasoning at inference using token budgets**, which consistently improves performance across medical QA tasks up to an optimal ~4K-token budget; beyond that, performance may degrade due to overthinking (see the minimal usage sketch at the end of this README).
- **Identifying medical knowledge as the key bottleneck**, revealing that additional reasoning alone cannot overcome knowledge gaps; instead, improvements require better data quality and increased model capacity.

We open-source our models, data, and code here.

****************************************************************

**Updates:**

* 2025-03: We release our code, data, models, and paper!

****************************************************************

### 🌍 Environment

Please refer to [docs/ENV.md](docs/ENV.md).

### 👨‍⚕️ Models and Data

| Model         | Backbone             | Training Data                                                    | Link                                                  |
| ------------- | -------------------- | ---------------------------------------------------------------- | ----------------------------------------------------- |
| **m1-32b-1k** | Qwen2.5-32B-Instruct | [m1k](https://huggingface.co/datasets/UCSC-VLAA/m1k-tokenized)   | [HF Link](https://huggingface.co/UCSC-VLAA/m1-32B-1K) |
| **m1-7b-1k**  | Qwen2.5-7B-Instruct  | [m1k](https://huggingface.co/datasets/UCSC-VLAA/m1k-tokenized)   | [HF Link](https://huggingface.co/UCSC-VLAA/m1-7B-1K)  |
| **m1-7b-23k** | Qwen2.5-7B-Instruct  | [m23k](https://huggingface.co/datasets/UCSC-VLAA/m23k-tokenized) | [HF Link](https://huggingface.co/UCSC-VLAA/m1-7B-23K) |

### 🏃 Inference

(... same content as original README ...)

### 📖 Citation

```
@misc{huang2025m1UnleashPotential,
      title={m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning in Large Language Models},
      author={Xiaoke Huang and Juncheng Wu and Hui Liu and Xianfeng Tang and Yuyin Zhou},
      year={2025},
      eprint={2504.00869},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.00869},
}
```
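
### Minimal usage sketch

Since the full inference instructions are elided above, the following is a minimal, illustrative sketch of how one might query an m1 checkpoint with a capped reasoning budget via the standard `transformers` generation API. The model ID (`UCSC-VLAA/m1-7B-23K`) comes from the table above; treating the ~4K-token budget as a simple `max_new_tokens` cap is an assumption for illustration, and the official inference code may implement test-time scaling differently (e.g., with explicit budget forcing or a dedicated prompt format).

```python
# Illustrative sketch only -- not the official m1 inference script.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "UCSC-VLAA/m1-7B-23K"  # from the Models and Data table above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Example medical QA prompt (hypothetical question, for demonstration only).
question = (
    "A 45-year-old man presents with crushing chest pain radiating to the "
    "left arm. Which initial diagnostic test is most appropriate?"
)
messages = [{"role": "user", "content": question}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Assumption: cap generation at roughly the ~4K-token budget discussed in the
# introduction; much larger budgets may hurt accuracy due to overthinking.
outputs = model.generate(input_ids, max_new_tokens=4096, do_sample=False)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```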