---
language:
- en
license: mit
tags:
- vision
- image-to-text
- ocr
- medical
- paddleocr
- unsloth
- lora
- ernie-challenge
datasets:
- naazimsnh02/medocr-vision-dataset
base_model: unsloth/PaddleOCR-VL
library_name: transformers
pipeline_tag: image-to-text
---

# MedOCR-Vision: Medical Document OCR with PaddleOCR-VL

**ERNIE AI Developer Challenge Submission**

A fine-tuned PaddleOCR-VL model specialized for OCR of medical prescriptions, lab reports, and forms, while retaining general OCR capabilities.

## Model Description

MedOCR-Vision is a vision-language model fine-tuned for optical character recognition (OCR) of medical documents. It is based on PaddleOCR-VL (1B parameters) and was fine-tuned with LoRA (Low-Rank Adaptation) on a curated dataset of 2,462 medical and general documents.

### Key Features

- **Specialized for Medical Documents**: Optimized for prescriptions, lab reports, and medical forms
- **Domain-Balanced Training**: Maintains general OCR capabilities (invoices, receipts, business documents)
- **Production-Ready**: Full merged model in float16 precision
- **Efficient Fine-tuning**: LoRA-based training that updates only a small set of adapter parameters
- **Low Validation Loss**: 0.5787 at the final evaluation (step 700) of a 3-epoch run

### Model Architecture

- **Base Model**: unsloth/PaddleOCR-VL (1B parameters)
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation)
- **LoRA Rank**: 64
- **LoRA Alpha**: 64
- **Target Modules**: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`, `out_proj`, `fc1`, `fc2`, `linear_1`, `linear_2`
- **Precision**: Mixed (BF16/FP16)

## Performance Highlights

### Model Improvements Over Base Model

Compared with the base model, the fine-tuned model improves in the following areas:

- ✅ **Enhanced Information Extraction**: Captures more complete medical information, including headers, test values, and reference ranges
- ✅ **Better Document Understanding**:
Improved coverage of document structure and context
- ✅ **Medical Domain Specialization**: Superior performance on medical terminology and clinical data
- ✅ **Comprehensive Coverage**: Extracts significantly more relevant content from medical documents

## Intended Uses

### Primary Use Cases

- **Medical Prescription Digitization**: Extract text from handwritten and printed prescriptions
- **Lab Report Processing**: Extract data from medical laboratory reports
- **Medical Form OCR**: Process various medical forms and documents
- **Healthcare Document Management**: Digitize medical records and documentation
- **General Document OCR**: Invoices, receipts, and business documents

### Out-of-Scope Uses

- Real-time medical diagnosis (this is an OCR tool, not a diagnostic system)
- Legal document verification (requires domain-specific training)
- Privacy-sensitive applications without proper data handling protocols

## Training Data

### Dataset Composition

The model was trained on **naazimsnh02/medocr-vision-dataset**, a curated dataset of 2,462 samples with the following composition:

| Dataset | Samples | Domain | Type |
|---------|---------|--------|------|
| Medical Prescriptions | 1,000 | Medical | Handwritten |
| OMR Scanned Documents | 36 | Medical | Scanned Forms |
| Medical Lab Reports | 426 | Medical | Printed Reports |
| Invoices & Receipts | 1,000 | General | Business Docs |
| **Total** | **2,462** | - | - |

### Dataset Statistics

- **Training**: 1,969 samples (80%)
- **Validation**: 246 samples (10%)
- **Test**: 247 samples (10%)
- **Domain Balance**: 59.4% Medical / 40.6% General

### Data Sources

1. [Medical Prescriptions](https://huggingface.co/datasets/chinmays18/medical-prescription-dataset)
2. [OMR Scanned Documents](https://huggingface.co/datasets/saurabh1896/OMR-scanned-documents)
3. [Medical Lab Reports](https://www.kaggle.com/datasets/dikshaasinghhh/bajaj)
4. 
   [Invoices & Receipts](https://huggingface.co/datasets/mychen76/invoices-and-receipts_ocr_v1)

## Training Procedure

### Training Hyperparameters

```yaml
# Training Duration
num_train_epochs: 3
total_steps: 741

# Batch Configuration
per_device_train_batch_size: 4
gradient_accumulation_steps: 2
effective_batch_size: 8

# Learning Rate
learning_rate: 5e-5
warmup_steps: 50
lr_scheduler_type: linear

# Optimization
optimizer: adamw_8bit
weight_decay: 0.001

# LoRA Configuration
lora_r: 64
lora_alpha: 64
lora_dropout: 0

# Checkpointing
save_steps: 100
eval_steps: 100
save_total_limit: 5
load_best_model_at_end: true
```

### Training Results

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100 | 1.7026 | 0.7900 |
| 200 | 1.3005 | 0.6821 |
| 300 | 1.0004 | 0.6402 |
| 400 | 0.8176 | 0.6036 |
| 500 | 0.7387 | 0.5806 |
| 600 | 0.7406 | 0.5819 |
| 700 | 0.8801 | 0.5787 |

**Final Metrics:**

- **Final Validation Loss**: 0.5787
- **Training Time**: 38.36 minutes (2,301.80 seconds)
- **Peak GPU Memory**: 15.84 GB
- **GPU Utilization**: 71.89%

### Training Environment

- **GPU**: NVIDIA L4 (24GB VRAM)
- **Framework**: Unsloth + HuggingFace Transformers
- **Precision**: Mixed (BF16/FP16)
- **Memory Usage**: ~14 GB for training

### Training Strategy

1. **Domain-Balanced Approach**: Roughly 60/40 split (59.4% / 40.6%) between medical and general documents to limit catastrophic forgetting of general OCR
2. **LoRA Fine-tuning**: Parameter-efficient fine-tuning targeting the attention and MLP projection layers
3. **Checkpoint Selection**: Best checkpoint selected by lowest validation loss (`load_best_model_at_end: true`)
4. 
   **Evaluation**: Evaluation every 100 steps to monitor convergence

## How to Use

### Installation

```bash
pip install transformers unsloth einops torch pillow
```

### Basic Usage

```python
from unsloth import FastVisionModel
from transformers import AutoProcessor
from PIL import Image

# Load model and processor
model, tokenizer = FastVisionModel.from_pretrained(
    "naazimsnh02/medocr-vision"
)
processor = AutoProcessor.from_pretrained(
    "naazimsnh02/medocr-vision",
    trust_remote_code=True
)

# Enable inference mode
FastVisionModel.for_inference(model)

# Load image
image = Image.open("medical_document.jpg")

# Prepare input: one user turn with an image placeholder and the instruction
instruction = "Extract all text from this medical document:"
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]
}]

# Build the prompt string, then tokenize it together with the image
text_prompt = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = processor(
    image,
    text_prompt,
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda")

# Generate
output = model.generate(
    **inputs,
    max_new_tokens=256,
    use_cache=False,
    temperature=1.5,
    min_p=0.1
)

# Decode output (the decoded sequence includes the prompt tokens)
text = tokenizer.decode(output[0], skip_special_tokens=True)
print(text)
```

### Advanced Usage with Streaming

```python
from transformers import TextStreamer

# Reuses `model`, `tokenizer`, and `inputs` from Basic Usage

# Create text streamer (skip_prompt=True prints only the generated text)
text_streamer = TextStreamer(tokenizer, skip_prompt=True)

# Generate with streaming
_ = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=256,
    use_cache=False,
    temperature=1.5,
    min_p=0.1
)
```

## Limitations and Biases

### Limitations

1. **Image Quality**: Performance may degrade with very low-quality or heavily degraded images
2. **Handwriting Variability**: Extremely poor handwriting may not be accurately recognized
3. **Language**: Primarily trained on English medical documents
4. **Document Types**: Optimized for the specific document types in the training set
5. 
   **Context Understanding**: This is an OCR model, not a medical understanding model

### Potential Biases

1. **Dataset Bias**: Training data is primarily from specific medical document sources
2. **Domain Bias**: Better performance on medical documents similar to training data
3. **Language Bias**: Primarily English-language documents
4. **Format Bias**: May perform better on document formats similar to training data

### Recommendations

- Validate outputs in critical medical applications
- Use as part of a larger system with human oversight
- Test on your specific use case before production deployment
- Consider fine-tuning on domain-specific data for specialized applications

## Ethical Considerations

### Privacy and Security

- **Medical Data**: This model processes medical documents which may contain sensitive patient information
- **HIPAA Compliance**: Users must ensure compliance with relevant healthcare data protection regulations
- **Data Handling**: Implement appropriate data security measures when using this model
- **Audit Trail**: Maintain logs of OCR processing for accountability

### Responsible Use

- This model should be used as an assistive tool, not a replacement for human review
- Medical professionals should verify all extracted information
- Implement appropriate error handling and validation
- Consider the implications of automated medical document processing

## Citation

```bibtex
@misc{medocr-vision-2025,
  title={MedOCR-Vision: Medical Document OCR with PaddleOCR-VL},
  author={Syed Naazim Hussain},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/naazimsnh02/medocr-vision}}
}
```

## Additional Resources

- **Code Repository**: https://github.com/naazimsnh02/medocr-vision
- **Training Dataset**: https://huggingface.co/datasets/naazimsnh02/medocr-vision-dataset
- **Training Notebook**: Available in the repository
- **ERNIE Challenge**: Submitted for ERNIE AI Developer Challenge

## License

This model is released under the 
MIT License. Please refer to individual dataset licenses for usage terms of the training data.

## Acknowledgments

- **Base Model**: unsloth/PaddleOCR-VL
- **Framework**: Unsloth for efficient training
- **Dataset Sources**: chinmays18, saurabh1896, dikshaasinghhh, mychen76
- **LLM Providers**: Nebius and Novita for data processing
- **PaddleOCR Team**: For the excellent OCR framework

---

**Model Version**: 1.0

**Release Date**: December 2025

**Challenge**: ERNIE AI Developer Challenge
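
## Appendix: Checking the Reported Training Arithmetic

The effective batch size, total step count, and domain balance reported in this card can be reproduced from the sample counts and batch settings alone. The sketch below is only a sanity check, not the training code: the `schedule` helper is hypothetical, it assumes a single GPU, and it assumes the last partial batch of each epoch counts as a full optimizer step. The exact rounding used by a given trainer version may differ.

```python
import math

# Sample counts reported under Dataset Composition and Dataset Statistics.
MEDICAL = 1_000 + 36 + 426   # prescriptions + OMR forms + lab reports
GENERAL = 1_000              # invoices and receipts
TRAIN, VAL, TEST = 1_969, 246, 247

# Batch settings reported under Training Hyperparameters.
PER_DEVICE_BATCH = 4
GRAD_ACCUM = 2
EPOCHS = 3


def schedule(train_samples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Return (effective_batch_size, steps_per_epoch, total_steps).

    Hypothetical helper: assumes the final partial batch of each epoch
    still triggers one optimizer step (ceiling division).
    """
    effective = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(train_samples / effective)
    return effective, steps_per_epoch, steps_per_epoch * epochs


total = MEDICAL + GENERAL
effective, per_epoch, total_steps = schedule(
    TRAIN, PER_DEVICE_BATCH, GRAD_ACCUM, EPOCHS
)

print(f"total samples: {total}")
print(f"medical share: {MEDICAL / total:.1%}")
print(f"effective batch size: {effective}")
print(f"steps per epoch: {per_epoch}, total steps: {total_steps}")
```

Under these assumptions the script yields an effective batch size of 8, 247 steps per epoch, and 741 total steps, matching the `effective_batch_size` and `total_steps` values listed in the hyperparameters, and a medical share of 59.4%.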