BioClinical Treatment Information Detector

Model Description

This model is a specialized token classification system designed to detect treatment-related information in addiction medicine clinical notes. It is fine-tuned from thomas-sounack/BioClinical-ModernBERT-large to identify current and future treatment plans, medication decisions, and therapeutic interventions while preserving patient privacy.

Key Purpose: Prevent information leakage about locus of care and medication decisions when training clinical decision support systems in addiction medicine. This model enables researchers to mask sensitive treatment information before using clinical data for machine learning applications.

Intended Use

Primary Use Case

  • Privacy-preserving clinical AI: Mask treatment-related information from clinical notes before training decision support systems
  • Research data preparation: Identify and redact sensitive treatment details while preserving other clinical information
  • Compliance support: Help maintain patient confidentiality when sharing clinical datasets for research

What It Detects

  • Current medication prescriptions and dosages
  • Treatment plans and recommendations
  • Therapeutic interventions and procedures
  • Follow-up care instructions
  • Clinical advice and care coordination

What It Does NOT Detect

  • Past treatment history (focuses only on current/future treatments)
  • Personally Identifiable Information (PII) such as names, addresses, and phone numbers
  • General medical conditions or diagnoses
  • Demographics or personal details

Model Details

  • Model Type: Token Classification (NER)
  • Base Model: thomas-sounack/BioClinical-ModernBERT-large
  • Language: English
  • Domain: Clinical text (addiction medicine)
  • Training Data: Single-center clinical notes from addiction medicine department
  • Labels:
    • O: Outside treatment information
    • B-TREATMENT: Beginning of treatment entity
    • I-TREATMENT: Inside treatment entity
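
For illustration, the BIO scheme marks each token as beginning (B-), inside (I-), or outside (O) a treatment span. A hypothetical tagging of a short sentence (the exact tokenization and span boundaries here are assumptions, not model output):

# Token            Label
# Plan             O
# :                O
# Start            B-TREATMENT
# Buprenorphine    I-TREATMENT
# 8mg              I-TREATMENT
# twice            I-TREATMENT
# daily            I-TREATMENT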

Performance

The model achieves strong performance on treatment detection:

  • Treatment F1-Score: 0.892
  • Treatment Precision: 0.885
  • Treatment Recall: 0.899

These metrics reflect the model's ability to identify treatment-related spans accurately, with a close balance between false positives and false negatives.
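
For reference, span-level scores like these are commonly computed with the seqeval library. A minimal sketch with hypothetical gold and predicted BIO sequences (whether seqeval was used for this model's evaluation is an assumption):

from seqeval.metrics import f1_score, precision_score, recall_score

# Two hypothetical sentences: gold annotations vs. model predictions
y_true = [["O", "B-TREATMENT", "I-TREATMENT", "O"],
          ["B-TREATMENT", "I-TREATMENT", "O"]]
y_pred = [["O", "B-TREATMENT", "I-TREATMENT", "O"],
          ["B-TREATMENT", "O", "O"]]

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))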

Limitations and Bias

Domain Specificity

  • Single-center training: Model is trained exclusively on data from one addiction medicine center
  • Specialty focus: Optimized for addiction medicine; may not generalize well to other medical specialties
  • Language limitation: English-only model

Temporal Focus

  • Current/future treatments only: Does not detect historical treatment information
  • Context dependency: Performance may vary with different clinical note structures

Ethical Considerations

  • This model is designed for defensive security purposes only
  • Should be used to protect patient privacy, not to extract sensitive information
  • Users must ensure compliance with healthcare privacy regulations (HIPAA, GDPR, etc.)

Quick Start

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_name = "Lekhansh/bioclinical-treatment-detector"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

# Example clinical text
text = """
Treatment Plan:
1. Start Tablet Buprenorphine 8mg twice daily
2. Continue counseling sessions weekly
3. Follow up in outpatient clinic after 2 weeks
"""

# Tokenize with character offsets so token labels can be mapped back to spans
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512,
                   return_offsets_mapping=True)
with torch.no_grad():
    outputs = model(**{k: v for k, v in inputs.items() if k != 'offset_mapping'})

# Get per-token label predictions
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_labels = torch.argmax(predictions, dim=-1)[0]

# Map predictions to text spans
id2label = {0: "O", 1: "B-TREATMENT", 2: "I-TREATMENT"}
offset_mapping = inputs["offset_mapping"][0]

treatment_spans = []
current_span = None

for label_id, (start, end) in zip(predicted_labels, offset_mapping):
    if start == 0 and end == 0:  # Skip special tokens
        continue
    
    label = id2label[label_id.item()]
    
    if label == "B-TREATMENT":
        if current_span:
            treatment_spans.append(current_span)
        current_span = {"start": start.item(), "end": end.item()}
    elif label == "I-TREATMENT" and current_span:
        current_span["end"] = end.item()
    else:
        if current_span:
            treatment_spans.append(current_span)
            current_span = None

if current_span:
    treatment_spans.append(current_span)

# Extract treatment text
for span in treatment_spans:
    treatment_text = text[span["start"]:span["end"]]
    print(f"Treatment detected: '{treatment_text}'")

Advanced Usage

For more sophisticated inference with per-span confidence scores, wrap the model in a small helper class:

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

class TreatmentDetector:
    def __init__(self, model_name):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForTokenClassification.from_pretrained(model_name)
        self.model.eval()
        self.id2label = {0: "O", 1: "B-TREATMENT", 2: "I-TREATMENT"}
    
    def detect_treatments(self, text, confidence_threshold=0.5):
        encoding = self.tokenizer(
            text, return_tensors="pt", truncation=True, max_length=8192,
            return_offsets_mapping=True, padding=True
        )
        
        with torch.no_grad():
            outputs = self.model(**{k: v for k, v in encoding.items() if k != 'offset_mapping'})
            predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
            predicted_labels = torch.argmax(predictions, dim=-1)[0]
            confidence_scores = torch.max(predictions, dim=-1)[0][0]
        
        treatment_spans = []
        current_span = None
        
        for label_id, confidence, (start, end) in zip(
            predicted_labels, confidence_scores, encoding["offset_mapping"][0]
        ):
            if start == 0 and end == 0:
                continue
                
            label = self.id2label[label_id.item()]
            conf = confidence.item()
            
            if label == "B-TREATMENT" and conf > confidence_threshold:
                if current_span:
                    treatment_spans.append(current_span)
                current_span = {
                    "start": start.item(), "end": end.item(),
                    "confidence": conf
                }
            elif label == "I-TREATMENT" and current_span and conf > confidence_threshold:
                current_span["end"] = end.item()
                current_span["confidence"] = (current_span["confidence"] + conf) / 2
            else:
                if current_span:
                    treatment_spans.append(current_span)
                    current_span = None
        
        if current_span:
            treatment_spans.append(current_span)
        
        # Add text content
        for span in treatment_spans:
            span["text"] = text[span["start"]:span["end"]]
        
        return treatment_spans

# Usage
detector = TreatmentDetector("Lekhansh/bioclinical-treatment-detector")
clinical_text = "Plan: start Naltrexone 50mg once daily; continue weekly group therapy."
treatments = detector.detect_treatments(clinical_text)
for t in treatments:
    print(f"{t['text']!r} (confidence: {t['confidence']:.2f})")
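
The confidence_threshold argument trades recall for precision: raising it drops low-confidence spans, reducing false positives at the risk of missing treatments. A quick, illustrative way to inspect its effect on a note:

for thr in (0.5, 0.7, 0.9):
    spans = detector.detect_treatments(clinical_text, confidence_threshold=thr)
    print(thr, [s["text"] for s in spans])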

Training Details

Training Data

  • Source: Single addiction medicine center clinical notes
  • Annotation: Manual annotation of treatment-related text spans
  • Composition: Balanced dataset with both positive and negative examples
  • Preprocessing: Text segmentation with sliding windows for long documents
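
The sliding-window step can be reproduced with the tokenizer's built-in overflow support. A minimal sketch in which the window size (512) and stride (64) are assumed values, not necessarily those used in training:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("thomas-sounack/BioClinical-ModernBERT-large")

long_note = "Follow up in outpatient clinic after 2 weeks. " * 400  # stand-in long document

# Split the note into overlapping 512-token windows with 64 tokens of overlap
windows = tokenizer(
    long_note,
    truncation=True,
    max_length=512,
    stride=64,
    return_overflowing_tokens=True,
)
print(f"Produced {len(windows['input_ids'])} windows")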

Training Configuration

  • Base Model: thomas-sounack/BioClinical-ModernBERT-large
  • Training Epochs: 3
  • Batch Size: 8 (with gradient accumulation)
  • Learning Rate: 5e-5
  • Optimizer: AdamW with weight decay
  • Hardware: Single GPU training
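
For orientation, a minimal sketch of a comparable fine-tuning setup with transformers.Trainer. The epoch count, batch size, and learning rate come from the list above; the gradient-accumulation steps, weight-decay value, and dataset objects are assumptions:

from transformers import (AutoModelForTokenClassification, Trainer,
                          TrainingArguments)

model = AutoModelForTokenClassification.from_pretrained(
    "thomas-sounack/BioClinical-ModernBERT-large",
    num_labels=3,
    id2label={0: "O", 1: "B-TREATMENT", 2: "I-TREATMENT"},
    label2id={"O": 0, "B-TREATMENT": 1, "I-TREATMENT": 2},
)

args = TrainingArguments(
    output_dir="treatment-detector",
    num_train_epochs=3,              # as reported
    per_device_train_batch_size=8,   # as reported
    gradient_accumulation_steps=4,   # assumed
    learning_rate=5e-5,              # as reported
    weight_decay=0.01,               # "AdamW with weight decay"; exact value assumed
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # hypothetical tokenized, BIO-labelled datasets
    eval_dataset=eval_dataset,
)
trainer.train()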

Citation

If you use this model in your research, please cite:

@misc{bioclinical-treatment-detector,
  title={Addiction Medicine Treatment Information Detector for Clinical AI},
  author={Lekhansh S and Prakrithi SN},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Lekhansh/bioclinical-treatment-detector}}
}

Contact

For questions about this model or its applications in privacy-preserving clinical AI, please contact drlekhansh@gmail.com.

License

This model is released under the Apache 2.0 License. Please ensure compliance with all applicable healthcare privacy regulations when using this model.
