Abstract
AgentOCR reduces token consumption in agentic systems by representing interaction history as visual tokens and employing visual caching and self-compression techniques.
Recent advances in large language models (LLMs) enable agentic systems trained with reinforcement learning (RL) over multi-turn interaction trajectories, but practical deployment is bottlenecked by rapidly growing textual histories that inflate token budgets and memory usage. We introduce AgentOCR, a framework that exploits the superior information density of visual tokens by representing the accumulated observation-action history as a compact rendered image. To make multi-turn rollouts scalable, AgentOCR employs segment optical caching: by decomposing the history into hashable segments and maintaining a visual cache, it eliminates redundant re-rendering. Beyond fixed rendering, AgentOCR introduces agentic self-compression, in which the agent actively emits a compression rate and is trained with a compression-aware reward to adaptively balance task success against token efficiency. We conduct extensive experiments on challenging agentic benchmarks, ALFWorld and search-based QA. Results demonstrate that AgentOCR preserves over 95% of text-based agent performance while reducing token consumption by more than 50%, yielding consistent token and memory savings. Further analysis validates a 20× rendering speedup from segment optical caching and shows that self-compression learns an effective balance between task success and token cost.
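To make the caching idea concrete, here is a minimal Python sketch, not the paper's implementation: history segments are keyed by a content hash, and a segment is rendered to an image only on a cache miss. `SegmentOpticalCache`, `render_fn`, and `render_history` are hypothetical names of our own.

```python
import hashlib

class SegmentOpticalCache:
    """Illustrative sketch of segment optical caching: rendered images of
    history segments are memoized by content hash, so segments that are
    unchanged across turns are never re-rendered."""

    def __init__(self, render_fn):
        self.render_fn = render_fn  # hypothetical: text segment -> image
        self.cache = {}             # segment hash -> rendered image

    def render_history(self, segments):
        images = []
        for text in segments:
            key = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if key not in self.cache:            # render only on a miss
                self.cache[key] = self.render_fn(text)
            images.append(self.cache[key])
        return images  # composited into one history image downstream
```

In a multi-turn rollout, typically only the newest observation-action segment misses the cache each turn; that kind of reuse is what would account for the reported 20× rendering speedup on long trajectories.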
Community
We’re introducing AgentOCR, a new way to scale LLM agents by reimagining long interaction histories as compact rendered images, leveraging the higher information density of visual tokens to curb exploding context costs. To make long-horizon rollouts practical, we add segment optical caching: history is split into hashable segments and the rendered visuals are cached, so agents avoid redundant re-rendering as trajectories grow. We go beyond fixed compression with agentic self-compression: the agent actively emits a compression rate and is trained with a compression-aware reward to balance task success against token efficiency. Across ALFWorld and search-based QA, AgentOCR keeps more than 95% of text-agent performance while cutting token use by over 50% on average and roughly 80% at peak, and our analysis shows up to a 20× rendering speedup thanks to segment optical caching.
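The post does not spell out the reward, but one plausible shape for a compression-aware reward is task success plus a bonus proportional to tokens saved. The sketch below is an assumption, not the paper's formula; the weight `lam`, the budget normalization, and the function name are illustrative.

```python
def compression_aware_reward(task_success: bool,
                             tokens_used: int,
                             token_budget: int,
                             lam: float = 0.1) -> float:
    """Hypothetical reward: 1.0 for solving the task, plus lam times the
    fraction of the token budget left unused. Tuning lam trades off task
    success against token efficiency."""
    success = 1.0 if task_success else 0.0
    savings = max(0.0, 1.0 - tokens_used / token_budget)  # in [0, 1]
    return success + lam * savings
```

Under a reward of this shape, an agent that emits an aggressive compression rate earns the efficiency bonus only when it still completes the task, matching the success-versus-efficiency balance described above.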
This is an automated message from the Librarian Bot: the following similar papers were recommended by the Semantic Scholar API.
- LongVideoAgent: Multi-Agent Reasoning with Long Videos (2025)
- AstraNav-Memory: Contexts Compression for Long Memory (2025)
- FoldAct: Efficient and Stable Context Folding for Long-Horizon Search Agents (2025)
- ConMax: Confidence-Maximizing Compression for Efficient Chain-of-Thought Reasoning (2026)
- OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding (2025)
- Goal-Directed Search Outperforms Goal-Agnostic Memory Compression in Long-Context Memory Tasks (2025)
- VideoMem: Enhancing Ultra-Long Video Understanding via Adaptive Memory Management (2025)