Commit d8f6f97 · Parent: 9295141
Update README.md

README.md CHANGED
@@ -15,6 +15,9 @@ II-Medical-8B is a medical reasoning model trained on a [comprehensive dataset](
 
 
 
+Our II-Medical-8B model achieved a 40% score on HealthBench, an open-source benchmark evaluating the performance and safety of large language models in healthcare, performing comparably to OpenAI's o1 reasoning model.
+
+
 ## II. Training Methodology
 
 We collected and generated a comprehensive set of reasoning datasets for the medical domain and performed SFT fine-tuning on the **Qwen/Qwen3-8B** model. Following this, we further optimized the SFT model by training DAPO on a hard-reasoning dataset to boost performance.

@@ -47,7 +50,7 @@ Journal of Medicine, 4 Options and 5 Options splits from the MedBullets platform
 
 | Model | MedMC | MedQA | PubMed | MMLU-P | GPQA | Lancet | MedB-4 | MedB-5 | MedX | NEJM | Avg |
 |--------------------------|-------|-------|--------|--------|------|--------|--------|--------|------|-------|-------|
-
+| HuatuoGPT-o1-72B | 76.76 | 88.85 | 79.90 | 80.46 | 64.36| 70.87 | 77.27 | 73.05 |23.53 |76.29 | 71.13 |
 | QWQ 32B | 69.73 | 87.03 | 88.5 | 79.86 | 69.17| 71.3 | 72.07 | 69.01 |24.98 |75.12 | 70.68 |
 | Qwen2.5-7B-IT | 56.56 | 61.51 | 71.3 | 61.17 | 42.56| 61.17 | 46.75 | 40.58 |13.26 |59.04 | 51.39 |
 | HuatuoGPT-o1-8B | 63.97 | 74.78 | **80.10** | 63.71 | 55.38| 64.32 | 58.44 | 51.95 |15.79 |64.84 | 59.32 |
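
The Avg column of the benchmark table, including the newly added HuatuoGPT-o1-72B row, can be sanity-checked as the unweighted mean of the ten per-benchmark scores. That averaging rule is an assumption (the README does not state how Avg is computed); the model names and scores below are copied from the table:

```python
# Sanity-check the Avg column, assuming it is the plain unweighted mean
# of the ten per-benchmark scores (an assumption, not stated in the README).
rows = {
    "HuatuoGPT-o1-72B": [76.76, 88.85, 79.90, 80.46, 64.36, 70.87, 77.27, 73.05, 23.53, 76.29],
    "QWQ 32B": [69.73, 87.03, 88.5, 79.86, 69.17, 71.3, 72.07, 69.01, 24.98, 75.12],
}
reported = {"HuatuoGPT-o1-72B": 71.13, "QWQ 32B": 70.68}

for model, scores in rows.items():
    avg = round(sum(scores) / len(scores), 2)
    # Both computed values match the table: 71.13 and 70.68 respectively.
    print(f"{model}: computed {avg}, reported {reported[model]}")
```

Both the pre-existing and the newly added row check out under this rule, which supports reading Avg as a simple mean over the ten benchmarks.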