SPHINX: A Synthetic Environment for Visual Perception and Reasoning • Paper 2511.20814 • Published 11 days ago
AthenaBench: A Dynamic Benchmark for Evaluating LLMs in Cyber Threat Intelligence • Paper 2511.01144 • Published Nov 3
Limits of Generalization in RLVR: Two Case Studies in Mathematical Reasoning • Paper 2510.27044 • Published Oct 30