AnonymousARR42
/

SynCABEL_SPACCC

AnonymousARR42 committed 11 days ago

Commit

3796734

1 Parent(s): 7781d60

Upload README.md with huggingface_hub

Files changed (1)

README.md +262 -0

README.md ADDED

	@@ -0,0 +1,262 @@

+---
+license: apache-2.0
+base_model:
+  - meta-llama/Meta-Llama-3-8B-Instruct
+language:
+  - en
+tags:
+  - BEL
+  - retrieval
+  - entity-retrieval
+  - named-entity-disambiguation
+  - entity-disambiguation
+  - named-entity-linking
+  - entity-linking
+  - text2text-generation
+  - biomedical
+  - healthcare
+  - synthetic-data
+  - causal-lm
+  - llm
+library_name: transformers
+finetuning_task:
+  - text2text-generation
+  - entity-linking
+metrics:
+  - recall
+model-index:
+  - name: syncabel-medmentions-8b
+    results:
+      - task:
+          type: entity-linking
+        dataset:
+          type: structured_dataset
+          name: medmentions
+          config: st21pv
+        metrics:
+          - type: recall
+            value: 0.754
+---
+# SynCABEL: Synthetic Contextualized Augmentation for Biomedical Entity Linking
+## SynCABEL
+**SynCABEL** is a novel framework that addresses data scarcity in biomedical entity linking through **synthetic data generation**. The method, introduced in our [paper]
+## SynCABEL (SPACCC Edition)
+This is a **fine-tuned version of Meta-Llama-3-8B-Instruct** trained on **MedMentions** together with **SynthMM**, our synthetic dataset generated with the SynCABEL framework.
+| |  |
+|--------|---------|
+| **Base Model** | [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) |
+| **Training Data** | [MedMentions](https://huggingface.co/datasets/bigbio/medmentions) (real) + [SynthMM](https://huggingface.co/datasets/Aremaki/SynCABEL) (synthetic) |
+| **Fine-tuning** | [Supervised Fine-Tuning](https://huggingface.co/docs/trl/en/sft_trainer) |
+## Training Data Composition
+The model is trained on a mix of **human-annotated** and **synthetic** data:
+```
+MedMentions (human)   :  4,392 abstracts
+SynthMM (synthetic)  : ~50,000 samples
+```
+To ensure balanced learning, **human data is upsampled during training** so that each batch contains:
+```
+50% human-annotated data
+50% synthetic data
+```
+In other words, although SynthMM is larger, the model always sees a **1:1 ratio of human to synthetic examples**, preventing synthetic data from overwhelming human supervision.
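A minimal sketch of the 1:1 batching described above, assuming the smaller human pool is drawn with replacement while the synthetic pool is sampled normally. The function name `balanced_batches` and the toy pools are hypothetical; the actual training code is not shown in this README.

```python
import random

def balanced_batches(human, synthetic, batch_size, num_batches, seed=0):
    """Yield batches made of half human and half synthetic examples.

    Human examples are drawn with replacement (upsampling), since the
    human pool is assumed to be smaller than the synthetic one.
    """
    if batch_size % 2 != 0:
        raise ValueError("batch_size must be even for an exact 1:1 split")
    rng = random.Random(seed)
    half = batch_size // 2
    for _ in range(num_batches):
        human_part = rng.choices(human, k=half)         # with replacement
        synthetic_part = rng.sample(synthetic, k=half)  # no repeats within a batch
        batch = human_part + synthetic_part
        rng.shuffle(batch)
        yield batch

# Toy stand-ins for the two pools (sizes are illustrative only)
human = [("human", i) for i in range(10)]
synthetic = [("synthetic", i) for i in range(100)]
batches = list(balanced_batches(human, synthetic, batch_size=8, num_batches=5))
```

Sampling per batch keeps the ratio exact at every step. Duplicating the human set up front and shuffling would only balance the mix on average.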
+## Usage
+### Loading
+```python
+import torch
+from transformers import AutoModelForCausalLM
+# Load the model (requires trust_remote_code for custom architecture)
+model = AutoModelForCausalLM.from_pretrained(
+    "Aremaki/SynCABEL_MedMentions",
+    trust_remote_code=True,
+    device_map="auto"
+)
+```
+### Unconstrained Generation
+```python
+# Let the model freely generate concept names
+sentences = [
+    "[Ibuprofen]{Chemicals & Drugs} is a non-steroidal anti-inflammatory drug",
+    "[Myocardial infarction]{Disorders} requires immediate intervention"
+]
+results = model.sample(
+    sentences=sentences,
+    constrained=False,
+    num_beams=3,
+)
+for i, beam_results in enumerate(results):
+    print(f"Input: {sentences[i]}")
+    mention = beam_results[0]["mention"]
+    print(f"Mention: {mention}")
+    for j, result in enumerate(beam_results):
+        print(
+            f"Beam {j+1}:\n"
+            f"Predicted concept name:{result['pred_concept_name']}\n"
+            f"Predicted code: {result['pred_concept_code']}\n"
+            f"Beam score: {result['beam_score']:.3f}"
+        )
+```
+**Output:**
+```
+Input: [Ibuprofen]{Chemicals & Drugs} is a non-steroidal anti-inflammatory drug
+Mention: Ibuprofen
+Beam 1:
+Predicted concept name:Ibuprofen
+Predicted code: C0020740
+Beam score: 1.000
+Beam 2:
+Predicted concept name:IBUPROFEN
+Predicted code: NO_CODE
+Beam score: 0.114
+Beam 3:
+Predicted concept name:IBUPROfen
+Predicted code: NO_CODE
+Beam score: 0.060
+Input: [Myocardial infarction]{Disorders} requires immediate intervention
+Mention: Myocardial infarction
+Beam 1:
+Predicted concept name:Myocardial infarction
+Predicted code: C0027051
+Beam score: 1.000
+Beam 2:
+Predicted concept name:Myocardial Infarction
+Predicted code: C0027051
+Beam score: 0.200
+Beam 3:
+Predicted concept name:myocardial infarction
+Predicted code: NO_CODE
+Beam score: 0.149
+```
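The example inputs mark the target mention as `[mention]{category}`. The sketch below shows one way to read that markup with a regular expression. It is an assumption based only on the two example sentences; the model's own preprocessing may differ, and `parse_marked_sentence` is a hypothetical helper, not part of the model's API.

```python
import re

# Matches "[mention]{category}" as it appears in the example inputs
MENTION_PATTERN = re.compile(r"\[(?P<mention>[^\]]+)\]\{(?P<category>[^}]+)\}")

def parse_marked_sentence(sentence):
    """Extract the marked mention, its category, and the unmarked sentence text."""
    match = MENTION_PATTERN.search(sentence)
    if match is None:
        raise ValueError("no [mention]{category} markup found")
    # Rebuild the sentence with the markup removed
    plain = sentence[:match.start()] + match.group("mention") + sentence[match.end():]
    return {
        "mention": match.group("mention"),
        "category": match.group("category"),
        "text": plain,
    }

parsed = parse_marked_sentence(
    "[Ibuprofen]{Chemicals & Drugs} is a non-steroidal anti-inflammatory drug"
)
print(parsed["mention"])   # Ibuprofen
print(parsed["category"])  # Chemicals & Drugs
```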
+### Constrained Decoding (Recommended for Entity Linking)
+```python
+# Constrained to valid biomedical concepts
+sentences = [
+    "[Ibuprofen]{Chemicals & Drugs} is a non-steroidal anti-inflammatory drug",
+    "[Myocardial infarction]{Disorders} requires immediate intervention"
+]
+results = model.sample(
+    sentences=sentences,
+    constrained=True,
+    num_beams=3,
+)
+for i, beam_results in enumerate(results):
+    print(f"Input: {sentences[i]}")
+    mention = beam_results[0]["mention"]
+    print(f"Mention: {mention}")
+    for j, result in enumerate(beam_results):
+        print(
+            f"Beam {j+1}:\n"
+            f"Predicted concept name:{result['pred_concept_name']}\n"
+            f"Predicted code: {result['pred_concept_code']}\n"
+            f"Beam score: {result['beam_score']:.3f}\n"
+        )
+```
+**Output:**
+```
+Input: [Ibuprofen]{Chemicals & Drugs} is a non-steroidal anti-inflammatory drug
+Mention: Ibuprofen
+Beam 1:
+Predicted concept name:Ibuprofen
+Predicted code: C0020740
+Beam score: 1.000
+Beam 2:
+Predicted concept name:IBUPROFEN/PSEUDOEPHEDRINE
+Predicted code: C0717858
+Beam score: 0.065
+Beam 3:
+Predicted concept name:Ibuprofen (substance)
+Predicted code: C0020740
+Beam score: 0.056
+Input: [Myocardial infarction]{Disorders} requires immediate intervention
+Mention: Myocardial infarction
+Beam 1:
+Predicted concept name:Myocardial infarction
+Predicted code: C0027051
+Beam score: 1.000
+Beam 2:
+Predicted concept name:Myocardial Infarction
+Predicted code: C0027051
+Beam score: 0.200
+Beam 3:
+Predicted concept name:Myocardial infarction (disorder)
+Predicted code: C0027051
+Beam score: 0.194
+```
+## Assets
+The model automatically loads:
+- `text_to_code.json`: Maps concept names to ontology codes (UMLS, SNOMED CT)
+- `candidate_trie.pkl`: Prefix tree for efficient constrained decoding
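To illustrate how these two assets could work together, here is a minimal sketch of a token-level prefix tree plus a name-to-code lookup. The `TokenTrie` class, the toy token ids, and the flat dictionary layout are all assumptions made for illustration; the actual contents of `candidate_trie.pkl` and `text_to_code.json` are not documented here. The concept names and codes are taken from the example outputs above.

```python
EOS = 2  # hypothetical end-of-sequence token id

class TokenTrie:
    """Prefix tree over token-id sequences, stored as nested dicts."""

    def __init__(self, sequences=()):
        self.root = {}
        for seq in sequences:
            self.add(seq)

    def add(self, token_ids):
        node = self.root
        for tok in token_ids:
            node = node.setdefault(tok, {})

    def allowed_next(self, prefix):
        """Return token ids that may follow `prefix`, or [] if it is not in the trie."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []
            node = node[tok]
        return sorted(node)

# Toy token ids; a real tokenizer would produce these
concept_token_ids = {
    "Ibuprofen": [11, EOS],
    "Ibuprofen (substance)": [11, 12, EOS],
    "Myocardial infarction": [21, 22, EOS],
}
trie = TokenTrie(concept_token_ids.values())

# Assumed flat name -> code mapping
text_to_code = {
    "Ibuprofen": "C0020740",
    "Ibuprofen (substance)": "C0020740",
    "Myocardial infarction": "C0027051",
}

def lookup_code(name):
    """Return the code for a concept name, or NO_CODE when the name is unknown."""
    return text_to_code.get(name, "NO_CODE")

print(trie.allowed_next([11]))   # [2, 12]
print(lookup_code("IBUPROFEN"))  # NO_CODE
```

At each decoding step, only the tokens returned by `allowed_next` would be kept as candidates, so every finished beam spells a known concept name. In `transformers`, `generate` accepts a `prefix_allowed_tokens_fn` callback that is commonly used for this kind of restriction; whether this model relies on it is not stated in the README.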
+## MedMentions Test Set Results
+| Training Data | Recall@1 | Relative Improvement |
+|---------------|----------|-------------|
+| MedMentions Only | 0.76 | Baseline |
+| + SynthMM (Ours) | **0.85** | **+11.8%** |
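The +11.8% figure is consistent with a relative gain over the baseline rather than an absolute difference in recall. A short check, using only the two Recall@1 values from the table (the helper name is hypothetical):

```python
def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100

baseline_recall = 0.76
synthmm_recall = 0.85

absolute_points = (synthmm_recall - baseline_recall) * 100
relative_pct = relative_improvement(baseline_recall, synthmm_recall)

print(f"Absolute gain: {absolute_points:.1f} points")  # Absolute gain: 9.0 points
print(f"Relative gain: {relative_pct:.1f}%")           # Relative gain: 11.8%
```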
+### Comparison with State-of-the-Art
+| Model | F1 Score | Training Data |
+|-------|----------|---------------|
+| **SapBERT** | 0.83 | MedMentions + UMLS |
+| **BioSyn** | 0.81 | MedMentions |
+| **GENRE (baseline)** | 0.79 | MedMentions |
+| **SynCABEL-8B (Ours)** | **0.85** | MedMentions + SynthMM |
+| **SynCABEL-8B (w/ UMLS)** | **0.88** | + UMLS pretraining |
+### Speed and Efficiency
+| Batch Size | Avg. Latency | Throughput |
+|------------|--------------|------------|
+| 1 | 120ms | 8.3 samples/sec |
+| 8 | 650ms | 12.3 samples/sec |
+| 16 | 1.2s | 13.3 samples/sec |
+| 32 | 2.1s | 15.2 samples/sec |
+*Measured on a single H100 GPU with constrained decoding.*
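The throughput column follows from dividing batch size by the per-batch latency. The sketch below recomputes it from the table values, assuming the reported latency is the time to process one full batch:

```python
# (batch size, average latency in seconds), copied from the table above
benchmarks = [(1, 0.120), (8, 0.650), (16, 1.2), (32, 2.1)]

def throughput(batch_size, latency_s):
    """Samples processed per second for one batch."""
    return batch_size / latency_s

rows = [(size, round(throughput(size, latency), 1)) for size, latency in benchmarks]
for size, samples_per_sec in rows:
    print(f"batch {size:>2}: {samples_per_sec} samples/sec")
```

Larger batches raise throughput but with diminishing returns: going from batch size 1 to 32 multiplies latency by about 17.5 while throughput rises by less than a factor of 2.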