YourBench: Easy Custom Evaluation Sets for Everyone Paper • 2504.01833 • Published Apr 2, 2025 • 22
BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution Paper • 2510.08697 • Published Oct 9, 2025 • 39
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language Paper • 2506.20920 • Published Jun 26, 2025 • 77
From Context to Action: Analysis of the Impact of State Representation and Context on the Generalization of Multi-Turn Web Navigation Agents Paper • 2410.23555 • Published Oct 31, 2024
Better Slow than Sorry: Introducing Positive Friction for Reliable Dialogue Systems Paper • 2501.17348 • Published Jan 28, 2025 • 1