MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning



🌟 Model Description

MemOCR is a visual memory agent that dynamically adapts information density during memory drafting and reading, and optimizes visual layouts to highlight key information. This checkpoint is fine-tuned from Qwen2.5-VL-7B-Instruct with budget-aware training objectives.

Key Capabilities

  • Adaptive Information Density: Dynamically adjusts memory content richness based on task requirements
  • Budget-Aware Memory: Optimizes memory usage with explicit token budget constraints
  • Dual-Domain Architecture: Separate memory drafting (text domain) and reading (vision domain) processes
  • Multi-Hop Reasoning: Superior performance on complex question answering tasks

πŸ—οΈ Architecture


MemOCR consists of two main components:

  1. Memory Drafting in Text Domain: An LLM agent iteratively refines rich-text memory content based on question-answering feedback
  2. Memory Reading in Vision Domain: A vision-language model processes rendered visual memory with optimized information density
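
To make the vision-domain reading step concrete, the sketch below renders a rich-text memory draft into an image, using font size to visually emphasize key facts. This is an illustrative renderer, not MemOCR's actual layout engine: the render_memory helper and its importance levels are hypothetical, and Pillow >= 10.1 is assumed for ImageFont.load_default(size=...).

```python
# Illustrative layout-aware renderer (hypothetical, not MemOCR's layout engine).
from PIL import Image, ImageDraw, ImageFont

def render_memory(entries, width=768, pad=16):
    """entries: list of (text, importance) pairs, importance in {0, 1, 2}."""
    fonts = {
        0: ImageFont.load_default(size=16),  # background detail
        1: ImageFont.load_default(size=22),  # supporting fact
        2: ImageFont.load_default(size=30),  # key evidence, visually emphasized
    }
    line_heights = [int(fonts[imp].size) + 8 for _, imp in entries]
    img = Image.new("RGB", (width, sum(line_heights) + 2 * pad), "white")
    draw = ImageDraw.Draw(img)
    y = pad
    for (text, imp), h in zip(entries, line_heights):
        draw.text((pad, y), text, font=fonts[imp], fill="black")
        y += h
    return img

memory_image = render_memory([
    ("Q: Which country hosted the 2008 Summer Olympics?", 2),
    ("Beijing, China hosted the 2008 Summer Olympics.", 2),
    ("The games ran from 8 to 24 August 2008.", 1),
    ("Mascots: the five Fuwa.", 0),
])
memory_image.save("memory.png")  # consumed by the VLM reader as visual memory
```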

The framework employs budget-aware training objectives to balance memory informativeness and token efficiency.
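
As a rough illustration of the budget-aware idea (an assumed form, not the paper's exact objective), the reward can couple answer quality with a penalty for exceeding the memory token budget:

```python
# Hypothetical budget-aware reward: answer quality minus a penalty that grows
# once the drafted memory exceeds its token budget. Not MemOCR's exact objective.
def budget_aware_reward(answer_score: float, memory_tokens: int,
                        budget: int, penalty_per_token: float = 1e-3) -> float:
    overflow = max(0, memory_tokens - budget)
    return answer_score - penalty_per_token * overflow
```

Under shaping of this kind, the drafting agent is pushed toward memories that remain informative while fitting the budget.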

Figure: data augmentation strategy.

πŸ“Š Performance

Main Results


MemOCR achieves state-of-the-art performance across multi-hop and open-domain QA benchmarks:

  • HotpotQA: Superior accuracy with efficient memory budgets
  • 2WikiMultihopQA: Strong multi-hop reasoning capabilities
  • Natural Questions & TriviaQA: Excellent knowledge retrieval performance

Ablation Studies

Figures: ablation studies; analysis of information density and token budget.

πŸš€ Usage

This model is designed to work with the MemOCR framework. Please refer to the official repository for detailed usage instructions.
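
For reference, here is a minimal standalone inference sketch using Hugging Face transformers. It treats the checkpoint as a plain Qwen2.5-VL-style model and skips MemOCR's drafting/rendering loop, which lives in the official repository. The memory.png input and the question are placeholders, and a recent transformers release with Qwen2.5-VL support is assumed.

```python
# Minimal standalone inference sketch (assumes transformers with Qwen2.5-VL support).
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "meituan/MemOCR-7B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# A rendered visual memory image plus the question form one multimodal turn.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Using the memory image above, answer: "
                                 "which country hosted the 2008 Summer Olympics?"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[Image.open("memory.png")],
                   return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(out[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```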

πŸ“š Citation

If you find MemOCR useful in your research, please consider citing:

@article{shi2026memocr,
  title={MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning},
  author={Yaorui Shi and Shugui Liu and Yu Yang and Wenyu Mao and Yuxin Chen and Qi GU and Hui Su and Xunliang Cai and Xiang Wang and An Zhang},
  journal={arXiv preprint arXiv:2601.21468},
  year={2026},
}

πŸ™ Acknowledgements

This model is built upon:

  • Qwen2.5-VL as the vision-language model backbone
  • veRL as the reinforcement learning training framework
  • MemAgent for the recurrent module and training dataset

πŸ“„ License

This model is licensed under the Apache License 2.0. See the LICENSE file for details.

Dataset License

Training and evaluation datasets are derived from:

  • HotpotQA, 2WikiMultihopQA, Natural Questions, TriviaQA: Wikipedia-derived content licensed under CC BY-SA 4.0