Truncated Step-Level Sampling with Process Rewards for Retrieval-Augmented Reasoning
Abstract
A framework called SLATE addresses credit assignment challenges in training language models with search engines by using truncated step-level sampling and dense LLM-as-judge rewards to reduce gradient variance and improve performance on complex reasoning tasks.
Training large language models to reason with search engines via reinforcement learning is hindered by a fundamental credit assignment problem: existing methods such as Search-R1 provide only a sparse outcome reward after an entire multi-step trajectory, making it infeasible to attribute success or failure to individual reasoning and retrieval decisions. Process-reward methods like StepSearch alleviate this by introducing step-level supervision, but rely on heuristic rewards such as TF-IDF overlap with gold documents, and still sample k complete trajectories per example, retaining high gradient variance. We propose SLATE, a framework built on two complementary ideas: (1) truncated step-level sampling, which generates k trajectories that share a common prefix and differ only at the next step, and (2) dense LLM-as-judge rewards, which replace heuristic scoring with a capable LLM evaluator that assesses the quality of each reasoning step, search query, and answer, providing richer and more reliable supervision. We theoretically prove that under the same dense reward structure, truncated sampling reduces the variance of advantage estimates by up to a factor of T compared to full-trajectory sampling for T-step trajectories, yielding lower-variance, better-targeted policy gradients. Experiments on seven QA benchmarks confirm that SLATE consistently outperforms both sparse-reward and process-reward baselines, with the largest gains on harder multi-hop tasks and smaller models.
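The core sampling idea in the abstract can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: `sample_step` and `judge_score` are hypothetical stand-ins for the policy model and the LLM judge, and the mean-centered advantage is one common group-relative choice (the paper's exact estimator may differ).

```python
# Toy sketch of truncated step-level sampling: from a shared trajectory
# prefix, sample k candidate next steps, score each with a (stand-in)
# judge, and compute group-relative advantages for that single step.
# All names here are illustrative, not the authors' actual code.
import random
import statistics


def sample_step(prefix: str, seed: int) -> str:
    """Stand-in for the policy: propose one candidate next step."""
    rng = random.Random(seed)
    return f"{prefix}|step:{rng.randint(0, 9)}"


def judge_score(prefix: str, step: str) -> int:
    """Stand-in for the LLM judge: score a step on {-1, 0, +1}.
    (Deterministic hash-like rule, purely for illustration.)"""
    return (sum(ord(c) for c in step) % 3) - 1


def truncated_group(prefix: str, k: int = 4):
    """Sample k continuations that share `prefix` and differ only at the
    next step; return (step, reward, advantage) triples where the
    advantage is the reward centered by the group mean."""
    steps = [sample_step(prefix, seed=i) for i in range(k)]
    rewards = [judge_score(prefix, s) for s in steps]
    mean_r = statistics.mean(rewards)
    return [(s, r, r - mean_r) for s, r in zip(steps, rewards)]
```

Because all k samples share the prefix, the advantage compares alternatives for one decision point rather than entire trajectories, which is what localizes credit to individual steps.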
Community
SLATE trains LLMs to reason with search engines via RL by (1) sampling multiple continuations from a shared trajectory prefix at each step instead of full trajectories, provably reducing gradient variance up to T-fold, and (2) using an LLM judge to score each reasoning step, query, and answer on a {-1, 0, +1} scale for dense supervision. This outperforms sparse-reward (Search-R1) and process-reward (StepSearch) baselines across 7 QA benchmarks, with the largest gains on multi-hop tasks and smaller models.
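The claimed up-to-T-fold variance reduction can be illustrated with a toy Monte-Carlo check, under the strong simplifying assumption that the T step rewards in a trajectory are i.i.d. on {-1, 0, +1} (the paper's theorem will rest on its own, weaker conditions; this is only intuition):

```python
# Toy illustration of why full-trajectory rewards have higher variance
# than single-step rewards: if a trajectory reward is the sum of T
# independent step rewards, its variance is T times the per-step
# variance. Assumes i.i.d. {-1, 0, +1} step rewards for simplicity.
import random

random.seed(0)
T, n = 8, 20000  # trajectory length and number of Monte-Carlo samples


def step_reward() -> int:
    return random.choice([-1, 0, 1])


def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


full_trajectory = [sum(step_reward() for _ in range(T)) for _ in range(n)]
single_step = [step_reward() for _ in range(n)]

ratio = var(full_trajectory) / var(single_step)  # close to T under i.i.d.
```

Under these assumptions `ratio` lands near T = 8, matching the direction of the paper's variance bound: scoring one step at a time removes the noise contributed by the other T - 1 steps.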
The following papers were recommended by the Semantic Scholar API:
- Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training (2026)
- TreePS-RAG: Tree-based Process Supervision for Reinforcement Learning in Agentic RAG (2026)
- ProRAG: Process-Supervised Reinforcement Learning for Retrieval-Augmented Generation (2026)
- Adversarial Yet Cooperative: Multi-Perspective Reasoning in Retrieved-Augmented Language Models (2026)
- Search-R2: Enhancing Search-Integrated Reasoning via Actor-Refiner Collaboration (2026)
- Optimizing Agentic Reasoning with Retrieval via Synthetic Semantic Information Gain Reward (2026)
- Training Multi-Turn Search Agent via Contrastive Dynamic Branch Sampling (2026)