BioClinical Treatment Information Detector

Model Description

This model is a specialized token classification system designed to detect treatment-related information in addiction medicine clinical notes. It is fine-tuned from thomas-sounack/BioClinical-ModernBERT-large to identify current and future treatment plans, medication decisions, and therapeutic interventions while preserving patient privacy.

Key Purpose: Prevent information leakage about locus of care and medication decisions when training clinical decision support systems in addiction medicine. This model enables researchers to mask sensitive treatment information before using clinical data for machine learning applications.

Intended Use

Primary Use Case

  • Privacy-preserving clinical AI: Mask treatment-related information from clinical notes before training decision support systems
  • Research data preparation: Identify and redact sensitive treatment details while preserving other clinical information
  • Compliance support: Help maintain patient confidentiality when sharing clinical datasets for research

What It Detects

  • Current medication prescriptions and dosages
  • Treatment plans and recommendations
  • Therapeutic interventions and procedures
  • Follow-up care instructions
  • Clinical advice and care coordination

What It Does NOT Detect

  • Past treatment history (focuses only on current/future treatments)
  • Personally Identifiable Information (PII) such as names, addresses, and phone numbers
  • General medical conditions or diagnoses
  • Demographics or personal details

Model Details

  • Model Type: Token Classification (NER)
  • Base Model: thomas-sounack/BioClinical-ModernBERT-large
  • Language: English
  • Domain: Clinical text (addiction medicine)
  • Training Data: Single-center clinical notes from addiction medicine department
  • Labels:
    • O: Outside treatment information
    • B-TREATMENT: Beginning of treatment entity
    • I-TREATMENT: Inside treatment entity
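
For illustration, the BIO scheme marks each token as beginning (B-), inside (I-), or outside (O) a treatment span. A hypothetical tagging of a short sentence (the exact tokenization and span boundaries here are assumptions, not model output):

# Token            Label
# Plan             O
# :                O
# Start            B-TREATMENT
# Buprenorphine    I-TREATMENT
# 8mg              I-TREATMENT
# twice            I-TREATMENT
# daily            I-TREATMENT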

Performance

The model achieves strong performance on treatment detection:

  • Treatment F1-Score: 0.892
  • Treatment Precision: 0.885
  • Treatment Recall: 0.899

These metrics reflect the model's ability to identify treatment-related spans accurately, with a close balance between false positives and false negatives.
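
For reference, span-level scores like these are commonly computed with the seqeval library. A minimal sketch with hypothetical gold and predicted BIO sequences (whether seqeval was used for this model's evaluation is an assumption):

from seqeval.metrics import f1_score, precision_score, recall_score

# Two hypothetical sentences: gold annotations vs. model predictions
y_true = [["O", "B-TREATMENT", "I-TREATMENT", "O"],
          ["B-TREATMENT", "I-TREATMENT", "O"]]
y_pred = [["O", "B-TREATMENT", "I-TREATMENT", "O"],
          ["B-TREATMENT", "O", "O"]]

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))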

Limitations and Bias

Domain Specificity

  • Single-center training: Model is trained exclusively on data from one addiction medicine center
  • Specialty focus: Optimized for addiction medicine; may not generalize well to other medical specialties
  • Language limitation: English-only model

Temporal Focus

  • Current/future treatments only: Does not detect historical treatment information
  • Context dependency: Performance may vary with different clinical note structures

Ethical Considerations

  • This model is designed for defensive security purposes only
  • Should be used to protect patient privacy, not to extract sensitive information
  • Users must ensure compliance with healthcare privacy regulations (HIPAA, GDPR, etc.)

Quick Start

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_name = "Lekhansh/bioclinical-treatment-detector"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

# Example clinical text
text = """
Treatment Plan:
1. Start Tablet Buprenorphine 8mg twice daily
2. Continue counseling sessions weekly
3. Follow up in outpatient clinic after 2 weeks
"""

# Tokenize with character offsets so token labels can be mapped back to spans
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512,
                   return_offsets_mapping=True)
with torch.no_grad():
    outputs = model(**{k: v for k, v in inputs.items() if k != 'offset_mapping'})

# Get per-token label predictions
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_labels = torch.argmax(predictions, dim=-1)[0]

# Map predictions to text spans
id2label = {0: "O", 1: "B-TREATMENT", 2: "I-TREATMENT"}
offset_mapping = inputs["offset_mapping"][0]

treatment_spans = []
current_span = None

for label_id, (start, end) in zip(predicted_labels, offset_mapping):
    if start == 0 and end == 0:  # Skip special tokens
        continue
    
    label = id2label[label_id.item()]
    
    if label == "B-TREATMENT":
        if current_span:
            treatment_spans.append(current_span)
        current_span = {"start": start.item(), "end": end.item()}
    elif label == "I-TREATMENT" and current_span:
        current_span["end"] = end.item()
    else:
        if current_span:
            treatment_spans.append(current_span)
            current_span = None

if current_span:
    treatment_spans.append(current_span)

# Extract treatment text
for span in treatment_spans:
    treatment_text = text[span["start"]:span["end"]]
    print(f"Treatment detected: '{treatment_text}'")

Advanced Usage

For more sophisticated inference with per-span confidence scores, wrap the model in a small helper class:

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

class TreatmentDetector:
    def __init__(self, model_name):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForTokenClassification.from_pretrained(model_name)
        self.model.eval()
        self.id2label = {0: "O", 1: "B-TREATMENT", 2: "I-TREATMENT"}
    
    def detect_treatments(self, text, confidence_threshold=0.5):
        encoding = self.tokenizer(
            text, return_tensors="pt", truncation=True, max_length=8192,
            return_offsets_mapping=True, padding=True
        )
        
        with torch.no_grad():
            outputs = self.model(**{k: v for k, v in encoding.items() if k != 'offset_mapping'})
            predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
            predicted_labels = torch.argmax(predictions, dim=-1)[0]
            confidence_scores = torch.max(predictions, dim=-1)[0][0]
        
        treatment_spans = []
        current_span = None
        
        for label_id, confidence, (start, end) in zip(
            predicted_labels, confidence_scores, encoding["offset_mapping"][0]
        ):
            if start == 0 and end == 0:
                continue
                
            label = self.id2label[label_id.item()]
            conf = confidence.item()
            
            if label == "B-TREATMENT" and conf > confidence_threshold:
                if current_span:
                    treatment_spans.append(current_span)
                current_span = {
                    "start": start.item(), "end": end.item(),
                    "confidence": conf
                }
            elif label == "I-TREATMENT" and current_span and conf > confidence_threshold:
                current_span["end"] = end.item()
                current_span["confidence"] = (current_span["confidence"] + conf) / 2
            else:
                if current_span:
                    treatment_spans.append(current_span)
                    current_span = None
        
        if current_span:
            treatment_spans.append(current_span)
        
        # Add text content
        for span in treatment_spans:
            span["text"] = text[span["start"]:span["end"]]
        
        return treatment_spans

# Usage
detector = TreatmentDetector("Lekhansh/bioclinical-treatment-detector")
clinical_text = "Plan: start Naltrexone 50mg once daily; continue weekly group therapy."
treatments = detector.detect_treatments(clinical_text)
for t in treatments:
    print(f"{t['text']!r} (confidence: {t['confidence']:.2f})")
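
The confidence_threshold argument trades recall for precision: raising it drops low-confidence spans, reducing false positives at the risk of missing treatments. A quick, illustrative way to inspect its effect on a note:

for thr in (0.5, 0.7, 0.9):
    spans = detector.detect_treatments(clinical_text, confidence_threshold=thr)
    print(thr, [s["text"] for s in spans])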

Training Details

Training Data

  • Source: Single addiction medicine center clinical notes
  • Annotation: Manual annotation of treatment-related text spans
  • Composition: Balanced dataset with both positive and negative examples
  • Preprocessing: Text segmentation with sliding windows for long documents
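
The sliding-window step can be reproduced with the tokenizer's built-in overflow support. A minimal sketch in which the window size (512) and stride (64) are assumed values, not necessarily those used in training:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("thomas-sounack/BioClinical-ModernBERT-large")

long_note = "Follow up in outpatient clinic after 2 weeks. " * 400  # stand-in long document

# Split the note into overlapping 512-token windows with 64 tokens of overlap
windows = tokenizer(
    long_note,
    truncation=True,
    max_length=512,
    stride=64,
    return_overflowing_tokens=True,
)
print(f"Produced {len(windows['input_ids'])} windows")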

Training Configuration

  • Base Model: thomas-sounack/BioClinical-ModernBERT-large
  • Training Epochs: 3
  • Batch Size: 8 (with gradient accumulation)
  • Learning Rate: 5e-5
  • Optimizer: AdamW with weight decay
  • Hardware: Single GPU training
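
For orientation, a minimal sketch of a comparable fine-tuning setup with transformers.Trainer. The epoch count, batch size, and learning rate come from the list above; the gradient-accumulation steps, weight-decay value, and dataset objects are assumptions:

from transformers import (AutoModelForTokenClassification, Trainer,
                          TrainingArguments)

model = AutoModelForTokenClassification.from_pretrained(
    "thomas-sounack/BioClinical-ModernBERT-large",
    num_labels=3,
    id2label={0: "O", 1: "B-TREATMENT", 2: "I-TREATMENT"},
    label2id={"O": 0, "B-TREATMENT": 1, "I-TREATMENT": 2},
)

args = TrainingArguments(
    output_dir="treatment-detector",
    num_train_epochs=3,              # as reported
    per_device_train_batch_size=8,   # as reported
    gradient_accumulation_steps=4,   # assumed
    learning_rate=5e-5,              # as reported
    weight_decay=0.01,               # "AdamW with weight decay"; exact value assumed
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # hypothetical tokenized, BIO-labelled datasets
    eval_dataset=eval_dataset,
)
trainer.train()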

Citation

If you use this model in your research, please cite:

@misc{bioclinical-treatment-detector,
  title={Addiction Medicine Treatment Information Detector for Clinical AI},
  author={Lekhansh S and Prakrithi SN},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Lekhansh/bioclinical-treatment-detector}}
}

Contact

For questions about this model or its applications in privacy-preserving clinical AI, please contact drlekhansh@gmail.com.

License

This model is released under the Apache 2.0 License. Please ensure compliance with all applicable healthcare privacy regulations when using this model.
