Announcing RealPerformance, a dataset of functional issues of language models that mirrors failure patterns identified through rigorous testing in real LLM agents
🔥 Announcing FLUX-Juiced: The Fastest Image Generation Endpoint (2.6x faster)!
Optimisations are widely applied and can reduce inference time, but their impact on quality often remains unclear, so we decided to challenge the status quo and create our own optimised version of FLUX.1[dev] called FLUX-juiced.
RealHarm: A Collection of Real-World Language Model Application Failures
I'm David from Giskard, and we work on securing your Agents. Today, we are launching RealHarm: a dataset of real-world problematic interactions with AI agents, drawn from publicly reported incidents.
In this unit, you'll learn:
- Offline Evaluation: Benchmark and iterate your agent using datasets.
- Online Evaluation: Continuously track key metrics such as latency, costs, and user feedback.