---
language:
- en
license: mit
tags:
- vision
- image-to-text
- ocr
- medical
- paddleocr
- unsloth
- lora
- ernie-challenge
datasets:
- naazimsnh02/medocr-vision-dataset
base_model: unsloth/PaddleOCR-VL
library_name: transformers
pipeline_tag: image-to-text
---

# MedOCR-Vision: Medical Document OCR with PaddleOCR-VL

**ERNIE AI Developer Challenge Submission**

A fine-tuned PaddleOCR-VL model specialized for medical document OCR. It targets medical prescriptions, lab reports, and forms while retaining general OCR capability on invoices, receipts, and business documents.

## Model Description

MedOCR-Vision is a vision-language model fine-tuned specifically for optical character recognition (OCR) of medical documents. The model is based on PaddleOCR-VL (1B parameters) and has been fine-tuned using LoRA (Low-Rank Adaptation) on a carefully curated dataset of 2,462 medical and general documents.

### Key Features

- **Specialized for Medical Documents**: Optimized for prescriptions, lab reports, and medical forms
- **Domain-Balanced Training**: Maintains general OCR capabilities (invoices, receipts, business documents)
- **Production-Ready**: Full merged model in float16 precision
- **Efficient Fine-tuning**: LoRA-based training updates only low-rank adapter weights instead of all 1B base parameters
- **Validation Loss**: 0.5787 after 3 epochs of training

### Model Architecture

- **Base Model**: unsloth/PaddleOCR-VL (1B parameters)
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation)
- **LoRA Rank**: 64
- **LoRA Alpha**: 64
- **Target Modules**: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, out_proj, fc1, fc2, linear_1, linear_2
- **Precision**: Mixed (BF16/FP16)
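
To make the LoRA settings above concrete: in standard LoRA, a weight matrix of shape `(d_out, d_in)` receives a low-rank update `B @ A` scaled by `alpha / rank`, which adds `rank * (d_in + d_out)` trainable parameters. The sketch below computes both quantities for rank 64 and alpha 64. The layer shape is a hypothetical value chosen for illustration and is not taken from the PaddleOCR-VL configuration.

```python
def lora_scaling(rank: int, alpha: int) -> float:
    """Scaling factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one adapter: A is (rank, d_in), B is (d_out, rank)."""
    return rank * (d_in + d_out)


RANK, ALPHA = 64, 64  # values listed in this model card

# Hypothetical layer shape, used only for illustration.
EXAMPLE_D_IN, EXAMPLE_D_OUT = 1024, 1024

scaling = lora_scaling(RANK, ALPHA)
added = lora_trainable_params(EXAMPLE_D_IN, EXAMPLE_D_OUT, RANK)
full = EXAMPLE_D_IN * EXAMPLE_D_OUT

print(f"scaling = {scaling}")
print(f"adapter params = {added:,} vs full matrix = {full:,}")
```

With alpha equal to the rank, the scaling factor is 1, so the adapter output is added to the frozen layer output without extra amplification.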

## Performance Highlights

### Model Improvements Over Base Model

Compared with the base model, the fine-tuned model shows improvements in the following areas:

- ✅ **Enhanced Information Extraction**: Captures more complete medical information including headers, test values, and reference ranges
- ✅ **Better Document Understanding**: Improved coverage of document structure and context
- ✅ **Medical Domain Specialization**: Superior performance on medical terminology and clinical data
- ✅ **Comprehensive Coverage**: Extracts significantly more relevant content from medical documents

## Intended Uses

### Primary Use Cases

- **Medical Prescription Digitization**: Extract text from handwritten and printed prescriptions
- **Lab Report Processing**: Extract data from medical laboratory reports
- **Medical Form OCR**: Process various medical forms and documents
- **Healthcare Document Management**: Digitize medical records and documentation
- **General Document OCR**: Invoices, receipts, and business documents

### Out-of-Scope Uses

- Real-time medical diagnosis (this is an OCR tool, not a diagnostic system)
- Legal document verification (requires domain-specific training)
- Privacy-sensitive applications without proper data handling protocols

## Training Data

### Dataset Composition

The model was trained on **naazimsnh02/medocr-vision-dataset**, a curated dataset of 2,462 samples with the following composition:

| Dataset | Samples | Domain | Type |
|---------|---------|--------|------|
| Medical Prescriptions | 1,000 | Medical | Handwritten |
| OMR Scanned Documents | 36 | Medical | Scanned Forms |
| Medical Lab Reports | 426 | Medical | Printed Reports |
| Invoices & Receipts | 1,000 | General | Business Docs |
| **Total** | **2,462** | - | - |

### Dataset Statistics

- **Training**: 1,969 samples (80%)
- **Validation**: 246 samples (10%)
- **Test**: 247 samples (10%)
- **Domain Balance**: 59.4% Medical / 40.6% General
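
The figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the sample counts listed in this card; the `percent` helper is an illustrative name.

```python
# Sample counts copied from the Dataset Composition table.
composition = {
    "Medical Prescriptions": (1000, "Medical"),
    "OMR Scanned Documents": (36, "Medical"),
    "Medical Lab Reports": (426, "Medical"),
    "Invoices & Receipts": (1000, "General"),
}
# Split sizes copied from the Dataset Statistics list.
splits = {"train": 1969, "validation": 246, "test": 247}

total = sum(count for count, _ in composition.values())
medical = sum(count for count, domain in composition.values() if domain == "Medical")


def percent(part: int, whole: int) -> float:
    """Share of `part` in `whole`, as a percentage rounded to one decimal."""
    return round(100 * part / whole, 1)


print(f"total samples: {total}")
print(f"medical / general: {percent(medical, total)}% / {percent(total - medical, total)}%")
for name, size in splits.items():
    print(f"{name}: {size} ({percent(size, total)}%)")
```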

### Data Sources

1. [Medical Prescriptions](https://huggingface.co/datasets/chinmays18/medical-prescription-dataset)
2. [OMR Scanned Documents](https://huggingface.co/datasets/saurabh1896/OMR-scanned-documents)
3. [Medical Lab Reports](https://www.kaggle.com/datasets/dikshaasinghhh/bajaj)
4. [Invoices & Receipts](https://huggingface.co/datasets/mychen76/invoices-and-receipts_ocr_v1)

## Training Procedure

### Training Hyperparameters

```yaml
# Training Duration
num_train_epochs: 3
total_steps: 741

# Batch Configuration
per_device_train_batch_size: 4
gradient_accumulation_steps: 2
effective_batch_size: 8

# Learning Rate
learning_rate: 5e-5
warmup_steps: 50
lr_scheduler_type: linear

# Optimization
optimizer: adamw_8bit
weight_decay: 0.001

# LoRA Configuration
lora_r: 64
lora_alpha: 64
lora_dropout: 0

# Checkpointing
save_steps: 100
eval_steps: 100
save_total_limit: 5
load_best_model_at_end: true
```
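
As a rough check of how these values fit together, the sketch below derives the number of optimizer steps from the training set size and evaluates a linear warmup and decay schedule at a given step. It assumes a single GPU, no dropped last batch, and that a partial accumulation group at the end of an epoch counts as one step. The function names are illustrative and this is not the trainer's own code.

```python
import math


def optimizer_steps(num_samples: int, per_device_batch: int,
                    grad_accum: int, epochs: int) -> int:
    """Optimizer updates over the full run (single device, last batch kept)."""
    batches_per_epoch = math.ceil(num_samples / per_device_batch)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs


def linear_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup from 0 to base_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / (total_steps - warmup_steps))


total = optimizer_steps(num_samples=1969, per_device_batch=4, grad_accum=2, epochs=3)
print(f"total optimizer steps: {total}")

for step in (0, 25, 50, 400, total):
    print(step, linear_lr(step, base_lr=5e-5, warmup_steps=50, total_steps=total))
```

Under these assumptions the step count matches the `total_steps: 741` value listed above.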

### Training Results

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100  | 1.7026        | 0.7900          |
| 200  | 1.3005        | 0.6821          |
| 300  | 1.0004        | 0.6402          |
| 400  | 0.8176        | 0.6036          |
| 500  | 0.7387        | 0.5806          |
| 600  | 0.7406        | 0.5819          |
| 700  | 0.8801        | 0.5787          |

**Final Metrics:**
- **Final Validation Loss**: 0.5787
- **Training Time**: 38.36 minutes (2,301.80 seconds)
- **Peak GPU Memory**: 15.84 GB
- **GPU Utilization**: 71.89%

### Training Environment

- **GPU**: NVIDIA L4 (24GB VRAM)
- **Framework**: Unsloth + HuggingFace Transformers
- **Precision**: Mixed (BF16/FP16)
- **Memory Usage**: ~14 GB during training (15.84 GB peak)

### Training Strategy

1. **Domain-Balanced Approach**: 60/40 split between medical and general documents to prevent catastrophic forgetting
2. **LoRA Fine-tuning**: Parameter-efficient fine-tuning targeting key attention and MLP layers
3. **Checkpoint Selection**: Best model selected based on lowest validation loss
4. **Evaluation**: Regular evaluation every 100 steps to monitor convergence
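
The checkpoint selection step can be illustrated with the evaluation log from the Training Results table. This is a minimal sketch of picking the row with the lowest validation loss, not the trainer's internal logic.

```python
# (step, training_loss, validation_loss) rows from the Training Results table.
eval_log = [
    (100, 1.7026, 0.7900),
    (200, 1.3005, 0.6821),
    (300, 1.0004, 0.6402),
    (400, 0.8176, 0.6036),
    (500, 0.7387, 0.5806),
    (600, 0.7406, 0.5819),
    (700, 0.8801, 0.5787),
]

# Select the evaluated checkpoint with the lowest validation loss.
best_step, _, best_val = min(eval_log, key=lambda row: row[2])
print(f"best checkpoint: step {best_step} (validation loss {best_val})")
```

Because validation loss is the selection criterion, the rise in training loss at step 700 does not affect which checkpoint is chosen.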

## How to Use

### Installation

```bash
pip install transformers unsloth einops torch pillow
```

### Basic Usage

```python
from unsloth import FastVisionModel
from transformers import AutoProcessor
from PIL import Image

# Load model and processor
model, tokenizer = FastVisionModel.from_pretrained(
    "naazimsnh02/medocr-vision"
)
processor = AutoProcessor.from_pretrained(
    "naazimsnh02/medocr-vision",
    trust_remote_code=True
)

# Enable inference mode
FastVisionModel.for_inference(model)

# Load image and normalize to RGB (handles grayscale or RGBA scans)
image = Image.open("medical_document.jpg").convert("RGB")

# Prepare input
instruction = "Extract all text from this medical document:"
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]
}]

# Build the prompt with the chat template
text_prompt = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = processor(
    image,
    text_prompt,
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda")

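# Generate the transcription.
# Note: temperature and min_p only take effect when sampling is enabled
# (do_sample=True here or in the model's generation config).
output = model.generate(
    **inputs,
    max_new_tokens=256,
    use_cache=False,
    temperature=1.5,
    min_p=0.1
)

# Decode only the newly generated tokens, skipping the echoed prompt
generated_tokens = output[0][inputs["input_ids"].shape[1]:]
text = tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(text)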
```

### Advanced Usage with Streaming

```python
from transformers import TextStreamer

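# Reuses `model`, `tokenizer`, and `inputs` from Basic Usage.
# The streamer prints tokens to stdout as they are generated.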
text_streamer = TextStreamer(tokenizer, skip_prompt=True)

# Generate with streaming
_ = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=256,
    use_cache=False,
    temperature=1.5,
    min_p=0.1
)
```

## Limitations and Biases

### Limitations

1. **Image Quality**: Performance may degrade with very low-quality or heavily degraded images
2. **Handwriting Variability**: Extremely poor handwriting may not be accurately recognized
3. **Language**: Primarily trained on English medical documents
4. **Document Types**: Optimized for the specific document types in the training set
5. **Context Understanding**: This is an OCR model, not a medical understanding model

### Potential Biases

1. **Dataset Bias**: Training data is primarily from specific medical document sources
2. **Domain Bias**: Better performance on medical documents similar to training data
3. **Language Bias**: Primarily English-language documents
4. **Format Bias**: May perform better on document formats similar to training data

### Recommendations

- Validate outputs in critical medical applications
- Use as part of a larger system with human oversight
- Test on your specific use case before production deployment
- Consider fine-tuning on domain-specific data for specialized applications
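
One way to test the model on your own documents is to compare its output against hand-transcribed references using character error rate (CER), the edit distance divided by the reference length. CER is not a metric reported in this card. The sketch below is a plain-Python implementation, and the example strings are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(prediction: str, reference: str) -> float:
    """Edit distance normalized by reference length (0.0 means a perfect match)."""
    if not reference:
        return 0.0 if not prediction else 1.0
    return levenshtein(prediction, reference) / len(reference)


# Made-up example: one dropped character in a drug name.
reference = "Amoxicillin 500 mg"
prediction = "Amoxicilin 500 mg"
print(f"CER = {character_error_rate(prediction, reference):.4f}")
```

Averaging CER over a held-out sample of your own documents gives a simple acceptance threshold to check before deployment.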

## Ethical Considerations

### Privacy and Security

- **Medical Data**: This model processes medical documents which may contain sensitive patient information
- **HIPAA Compliance**: Users must ensure compliance with HIPAA and any other applicable healthcare data protection regulations
- **Data Handling**: Implement appropriate data security measures when using this model
- **Audit Trail**: Maintain logs of OCR processing for accountability
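
As one possible shape for such an audit trail, the sketch below builds a JSON log line that identifies the input image and the OCR output by SHA-256 hash, so the log itself holds no patient text. The function name and record fields are hypothetical, and whether this design meets a given regulation has to be assessed separately.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(image_bytes: bytes, model_id: str, output_text: str) -> str:
    """Build a JSON log line that references the document and output by hash only."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_id": model_id,
        "image_sha256": hashlib.sha256(image_bytes).hexdigest(),
        "output_sha256": hashlib.sha256(output_text.encode("utf-8")).hexdigest(),
        "output_chars": len(output_text),
    }
    return json.dumps(record, sort_keys=True)


# Placeholder inputs for illustration; in practice pass the real image bytes
# and the text returned by the model.
line = audit_record(b"placeholder image bytes", "naazimsnh02/medocr-vision", "example output")
print(line)
```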

### Responsible Use

- This model should be used as an assistive tool, not a replacement for human review
- Medical professionals should verify all extracted information
- Implement appropriate error handling and validation
- Consider the implications of automated medical document processing

## Citation

```bibtex
@misc{medocr-vision-2025,
  title={MedOCR-Vision: Medical Document OCR with PaddleOCR-VL},
  author={Syed Naazim Hussain},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/naazimsnh02/medocr-vision}}
}
```

## Additional Resources

- **Code Repository**: https://github.com/naazimsnh02/medocr-vision
- **Training Dataset**: https://huggingface.co/datasets/naazimsnh02/medocr-vision-dataset
- **Training Notebook**: Available in the repository
- **ERNIE Challenge**: Submitted to the ERNIE AI Developer Challenge

## License

This model is released under the MIT License. Please refer to individual dataset licenses for usage terms of the training data.

## Acknowledgments

- **Base Model**: unsloth/PaddleOCR-VL
- **Framework**: Unsloth for efficient training
- **Dataset Sources**: chinmays18, saurabh1896, dikshaasinghhh, mychen76
- **LLM Providers**: Nebius and Novita for data processing
- **PaddleOCR Team**: For the excellent OCR framework

---

**Model Version**: 1.0  
**Release Date**: December 2025  
**Challenge**: ERNIE AI Developer Challenge