OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions Paper • 2602.05843 • Published 4 days ago • 51
TIDE: Trajectory-based Diagnostic Evaluation of Test-Time Improvement in LLM Agents Paper • 2602.02196 • Published 7 days ago • 32
3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation Paper • 2602.03796 • Published 6 days ago • 54
$φ$-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation Paper • 2503.13288 • Published Mar 17, 2025 • 51
MUR: Momentum Uncertainty guided Reasoning for Large Language Models Paper • 2507.14958 • Published Jul 20, 2025 • 47
A^3-Bench: Benchmarking Memory-Driven Scientific Reasoning via Anchor and Attractor Activation Paper • 2601.09274 • Published 26 days ago • 84
MAXS: Meta-Adaptive Exploration with LLM Agents Paper • 2601.09259 • Published 26 days ago • 95
Stabilizing Reinforcement Learning with LLMs: Formulation and Practices Paper • 2512.01374 • Published Dec 1, 2025 • 104
OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows Paper • 2510.24411 • Published Oct 28, 2025 • 72
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows Paper • 2505.19897 • Published May 26, 2025 • 104