Reasoning Cache: Continual Improvement Over Long Horizons via Short-Horizon RL
Abstract
RC, an iterative decoding algorithm, enables large language models to extrapolate and continually improve beyond their training budgets by constructing reasoning chains that consistently improve across iterations, achieving superior performance on long-horizon tasks.
Large Language Models (LLMs) that can continually improve beyond their training budgets are able to solve increasingly difficult problems by adapting at test time, a property we refer to as extrapolation. However, standard reinforcement learning (RL) operates over fixed problem distributions and training budgets, which limits extrapolation amidst distribution shift at test time. To address this, we introduce RC, an iterative decoding algorithm that replaces standard autoregressive decoding during both training and inference. RC exploits an asymmetry between the response generation and summarization capabilities of LLMs to construct reasoning chains that consistently improve across iterations. Models trained to use RC can extrapolate and continually improve over reasoning horizons more than an order of magnitude longer than those seen during training. Empirically, training a 4B model with RC using a 16k-token training budget improves performance on HMMT 2025 from 40% to nearly 70% with 0.5m tokens at test time, outperforming both comparably sized models and many larger reasoning LLMs. Finally, we also show that models trained with RC can more effectively leverage existing scaffolds to further scale test-time performance, due to the improved summary-conditioned generation abilities learned through training.
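As a rough illustration of the loop the abstract describes (bounded generation, then summarization of progress into a cache carried into the next iteration), here is a minimal Python sketch. The `model.generate(prompt, max_tokens=...)` interface, the `extract_final_answer` helper, the prompt wording, and the iteration/budget values are all assumptions made for illustration, not the authors' implementation.

```python
import re

def extract_final_answer(text: str) -> str | None:
    """Illustrative helper: pull a \\boxed{...} answer from a reasoning trace."""
    match = re.search(r"\\boxed\{([^}]*)\}", text)
    return match.group(1) if match else None

def reasoning_cache_decode(model, problem: str,
                           num_iterations: int = 8,
                           tokens_per_iteration: int = 16_000) -> str | None:
    """Sketch of iterative decoding: generate a bounded attempt, summarize it,
    and condition the next attempt on that summary (the "reasoning cache")."""
    summary, answer = "", None
    for _ in range(num_iterations):
        # Condition the next bounded attempt on the summary of all prior progress.
        prompt = problem if not summary else f"{problem}\n\nProgress so far:\n{summary}"
        attempt = model.generate(prompt, max_tokens=tokens_per_iteration)
        # Compress the attempt into an updated summary; the abstract's claim is that
        # summarization is reliably easier than response generation, so the chain
        # of summaries keeps improving across iterations.
        summary = model.generate(
            f"Summarize the verified progress and open subgoals in:\n{attempt}",
            max_tokens=1_024,
        )
        # Keep the most recent final answer, if the attempt produced one.
        answer = extract_final_answer(attempt) or answer
    return answer
```

In this sketch only the compact summary is carried between iterations, which is consistent with the comment below about the KV-cache of earlier iterations being discarded.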
Community
This is great work!
I'm wondering if there are performance optimizations that can be made to account for having to throw away the KV-cache?
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- InT: Self-Proposed Interventions Enable Credit Assignment in LLM Reasoning (2026)
- POPE: Learning to Reason on Hard Problems via Privileged On-Policy Exploration (2026)
- PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning (2026)
- iGRPO: Self-Feedback-Driven LLM Reasoning (2026)
- Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning (2026)
- Beyond Correctness: Learning Robust Reasoning via Transfer (2026)
- Structured Reasoning for Large Language Models (2026)