---
license: apache-2.0

base_model:
- meta-llama/Meta-Llama-3-8B-Instruct

language:
- en

tags:
- BEL
- retrieval
- entity-retrieval
- named-entity-disambiguation
- entity-disambiguation
- named-entity-linking
- entity-linking
- text2text-generation
- biomedical
- healthcare
- synthetic-data
- causal-lm
- llm

library_name: transformers
finetuning_task:
- text2text-generation
- entity-linking
metrics:
- recall
model-index:
- name: syncabel-medmentions-8b
  results:
  - task:
      type: entity-linking
    dataset:
      type: structured_dataset
      name: medmentions
      config: st21pv
    metrics:
    - type: recall
      value: 0.754
---

# SynCABEL: Synthetic Contextualized Augmentation for Biomedical Entity Linking

## SynCABEL

**SynCABEL** is a novel framework that addresses data scarcity in biomedical entity linking through **synthetic data generation**. The method is introduced in our [paper].

## SynCABEL (MedMentions Edition)

This model is a **fine-tuned version of Llama-3-8B-Instruct** trained on **MedMentions** together with **SynthMM**, our synthetic dataset generated via the SynCABEL framework.

| | |
|--------|---------|
| **Base Model** | [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) |
| **Training Data** | [MedMentions](https://huggingface.co/datasets/bigbio/medmentions) (real) + [SynthMM](https://huggingface.co/datasets/Aremaki/SynCABEL) (synthetic) |
| **Fine-tuning** | [Supervised Fine-Tuning](https://huggingface.co/docs/trl/en/sft_trainer) |
62
+ ## Training Data Composition
63
+
64
+ The model is trained on a mix of **human-annotated** and **synthetic** data:
65
+
66
+ ```
67
+ MedMentions (human) : 4,392 abstracts
68
+ SynthMM (synthetic) : ~50,000 samples
69
+ ```
70
+
71
+ To ensure balanced learning, **human data is upsampled during training** so that each batch contains:
72
+
73
+ ```
74
+ 50% human-annotated data
75
+ 50% synthetic data
76
+ ```
77
+
78
+ In other words, although SynthMM is larger, the model always sees a **1:1 ratio of human to synthetic examples**, preventing synthetic data from overwhelming human supervision.
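
The data pipeline itself is not shipped with this card; the following is a minimal sketch of one way to enforce such a 1:1 mixture per batch. All names here (`mixed_batches`, the toy pools) are illustrative assumptions.

```python
import random

def mixed_batches(human_pool, synthetic_pool, batch_size, num_batches):
    """Yield batches that are 50% human and 50% synthetic examples.

    The smaller human pool is drawn with replacement (i.e., upsampled),
    so every batch keeps a 1:1 human-to-synthetic ratio.
    """
    half = batch_size // 2
    for _ in range(num_batches):
        batch = random.choices(human_pool, k=half)      # upsample human data
        batch += random.sample(synthetic_pool, k=half)  # fresh synthetic draws
        random.shuffle(batch)
        yield batch

# Toy usage with stand-in pools
human = [f"human_{i}" for i in range(10)]
synthetic = [f"synthetic_{i}" for i in range(100)]
for batch in mixed_batches(human, synthetic, batch_size=8, num_batches=2):
    print(batch)
```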

## Usage

### Loading
```python
import torch
from transformers import AutoModelForCausalLM

# Load the model (requires trust_remote_code for the custom architecture)
model = AutoModelForCausalLM.from_pretrained(
    "Aremaki/SynCABEL_MedMentions",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # assumed half-precision load; omit for full precision
    device_map="auto",
)
```
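
Inputs tag the target mention inline as `[mention]{semantic group}`, as in the examples below. A tiny helper for building such inputs (the helper itself is illustrative, not part of the repository):

```python
def tag_mention(text: str, mention: str, group: str) -> str:
    """Wrap the first occurrence of `mention` in `text` with the
    [mention]{semantic group} markup the model expects."""
    return text.replace(mention, f"[{mention}]{{{group}}}", 1)

print(tag_mention(
    "Ibuprofen is a non-steroidal anti-inflammatory drug",
    "Ibuprofen",
    "Chemicals & Drugs",
))
# [Ibuprofen]{Chemicals & Drugs} is a non-steroidal anti-inflammatory drug
```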

### Unconstrained Generation
```python
# Let the model freely generate concept names
sentences = [
    "[Ibuprofen]{Chemicals & Drugs} is a non-steroidal anti-inflammatory drug",
    "[Myocardial infarction]{Disorders} requires immediate intervention",
]

results = model.sample(
    sentences=sentences,
    constrained=False,
    num_beams=3,
)

for i, beam_results in enumerate(results):
    print(f"Input: {sentences[i]}")

    mention = beam_results[0]["mention"]
    print(f"Mention: {mention}")

    for j, result in enumerate(beam_results):
        print(
            f"Beam {j+1}:\n"
            f"Predicted concept name: {result['pred_concept_name']}\n"
            f"Predicted code: {result['pred_concept_code']}\n"
            f"Beam score: {result['beam_score']:.3f}\n"
        )
```

**Output:**
```
Input: [Ibuprofen]{Chemicals & Drugs} is a non-steroidal anti-inflammatory drug
Mention: Ibuprofen
Beam 1:
Predicted concept name: Ibuprofen
Predicted code: C0020740
Beam score: 1.000

Beam 2:
Predicted concept name: IBUPROFEN
Predicted code: NO_CODE
Beam score: 0.114

Beam 3:
Predicted concept name: IBUPROfen
Predicted code: NO_CODE
Beam score: 0.060

Input: [Myocardial infarction]{Disorders} requires immediate intervention
Mention: Myocardial infarction
Beam 1:
Predicted concept name: Myocardial infarction
Predicted code: C0027051
Beam score: 1.000

Beam 2:
Predicted concept name: Myocardial Infarction
Predicted code: C0027051
Beam score: 0.200

Beam 3:
Predicted concept name: myocardial infarction
Predicted code: NO_CODE
Beam score: 0.149
```

Free generation can produce surface forms that do not match any entry in the concept vocabulary; those predictions are returned with the code `NO_CODE`, as in the beams above.

### Constrained Decoding (Recommended for Entity Linking)
```python
# Constrain generation to valid biomedical concept names
sentences = [
    "[Ibuprofen]{Chemicals & Drugs} is a non-steroidal anti-inflammatory drug",
    "[Myocardial infarction]{Disorders} requires immediate intervention",
]

results = model.sample(
    sentences=sentences,
    constrained=True,
    num_beams=3,
)

for i, beam_results in enumerate(results):
    print(f"Input: {sentences[i]}")

    mention = beam_results[0]["mention"]
    print(f"Mention: {mention}")

    for j, result in enumerate(beam_results):
        print(
            f"Beam {j+1}:\n"
            f"Predicted concept name: {result['pred_concept_name']}\n"
            f"Predicted code: {result['pred_concept_code']}\n"
            f"Beam score: {result['beam_score']:.3f}\n"
        )
```

**Output:**
```
Input: [Ibuprofen]{Chemicals & Drugs} is a non-steroidal anti-inflammatory drug
Mention: Ibuprofen
Beam 1:
Predicted concept name: Ibuprofen
Predicted code: C0020740
Beam score: 1.000

Beam 2:
Predicted concept name: IBUPROFEN/PSEUDOEPHEDRINE
Predicted code: C0717858
Beam score: 0.065

Beam 3:
Predicted concept name: Ibuprofen (substance)
Predicted code: C0020740
Beam score: 0.056

Input: [Myocardial infarction]{Disorders} requires immediate intervention
Mention: Myocardial infarction
Beam 1:
Predicted concept name: Myocardial infarction
Predicted code: C0027051
Beam score: 1.000

Beam 2:
Predicted concept name: Myocardial Infarction
Predicted code: C0027051
Beam score: 0.200

Beam 3:
Predicted concept name: Myocardial infarction (disorder)
Predicted code: C0027051
Beam score: 0.194
```
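
In practice you usually keep only the top beam per mention. A minimal post-processing sketch, assuming the `results` structure returned by `model.sample` above (the `top_predictions` helper is illustrative):

```python
def top_predictions(results):
    """Keep the highest-scoring beam per mention (beams are ordered by score)."""
    return [
        {
            "mention": beams[0]["mention"],
            "concept_name": beams[0]["pred_concept_name"],
            "concept_code": beams[0]["pred_concept_code"],
            "score": beams[0]["beam_score"],
        }
        for beams in results
    ]

for entry in top_predictions(results):
    print(entry["mention"], "->", entry["concept_code"])
# Ibuprofen -> C0020740
# Myocardial infarction -> C0027051
```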

## Assets
The model automatically loads:
- `text_to_code.json`: Maps concept names to ontology codes (UMLS, SNOMED CT)
- `candidate_trie.pkl`: Prefix tree for efficient constrained decoding (sketched below)
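
Conceptually, the trie restricts beam search so that every generated continuation stays on a path toward a valid concept name. A minimal sketch of the idea using the generic `prefix_allowed_tokens_fn` hook from `transformers` (this `Trie` class and the prompt handling are illustrative assumptions, not this repository's exact interface):

```python
class Trie:
    """Minimal prefix tree over token-id sequences (illustrative)."""

    def __init__(self, sequences):
        self.root = {}
        for seq in sequences:
            node = self.root
            for token_id in seq:
                node = node.setdefault(token_id, {})

    def allowed(self, prefix):
        """Token ids that may follow `prefix`; empty if `prefix` leaves the trie."""
        node = self.root
        for token_id in prefix:
            node = node.get(token_id)
            if node is None:
                return []
        return list(node.keys())

# In the real model the sequences would be the tokenized concept names.
trie = Trie([[5, 8, 2], [5, 9, 2]])

def prefix_allowed_tokens_fn(batch_id, input_ids):
    # generate() calls this at every decoding step of beam search; a real
    # implementation would first strip the prompt tokens from input_ids.
    return trie.allowed(input_ids.tolist())

# model.generate(..., prefix_allowed_tokens_fn=prefix_allowed_tokens_fn)
```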

## MedMentions Test Set Results

| Training Data | Recall@1 | Improvement |
|---------------|----------|-------------|
| MedMentions Only | 0.76 | Baseline |
| + SynthMM (Ours) | **0.85** | **+11.8%** |

*Improvement is relative: 0.85 / 0.76 ≈ 1.118, i.e., +11.8% over the baseline.*

### Comparison with State-of-the-Art

| Model | F1 Score | Training Data |
|-------|----------|---------------|
| **SapBERT** | 0.83 | MedMentions + UMLS |
| **BioSyn** | 0.81 | MedMentions |
| **GENRE (baseline)** | 0.79 | MedMentions |
| **SynCABEL-8B (Ours)** | **0.85** | MedMentions + SynthMM |
| **SynCABEL-8B (w/ UMLS)** | **0.88** | + UMLS pretraining |

### Speed and Efficiency

| Batch Size | Avg. Latency | Throughput |
|------------|--------------|------------|
| 1 | 120 ms | 8.3 samples/sec |
| 8 | 650 ms | 12.3 samples/sec |
| 16 | 1.2 s | 13.3 samples/sec |
| 32 | 2.1 s | 15.2 samples/sec |

*Measured on a single H100 GPU with constrained decoding; throughput is batch size divided by average latency (e.g., 32 / 2.1 s ≈ 15.2 samples/sec).*