SPHINX: A Synthetic Environment for Visual Perception and Reasoning Paper • 2511.20814 • Published 12 days ago
Limits of Generalization in RLVR: Two Case Studies in Mathematical Reasoning Paper • 2510.27044 • Published Oct 30
AthenaBench: A Dynamic Benchmark for Evaluating LLMs in Cyber Threat Intelligence Paper • 2511.01144 • Published Nov 3
CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence Paper • 2406.07599 • Published Jun 11, 2024