New Benchmark Dataset

#2
by burtenshaw - opened

Are you maintaining an evaluation benchmark and would like it to be included in the eval results shortlist, so that reported results appear as a leaderboard?


⭐️ Comment and link to your dataset repo and sources using the benchmark.

Not sure what the specific requirements are for benchmarks to be included, but we would like to have this functionality for the language-specific benchmarks we've built. They're quite recent, so we don't have many sources yet beyond our own benchmarking efforts and EuroEval.

Manually translated and culturally adapted IFEval for Estonian.
https://huggingface.co/datasets/tartuNLP/ifeval_et

Manually translated and culturally adapted WinoGrande for Estonian.
https://huggingface.co/datasets/tartuNLP/winogrande_et

I'm not completely sure yet how to port the configs from LM Evaluation Harness to eval.yaml though.

Hi, we maintain Encyclo-K, a benchmark for evaluating LLMs with dynamically composed knowledge statements.

Dataset: https://huggingface.co/datasets/m-a-p/Encyclo-K
Paper: https://arxiv.org/abs/2512.24867
Leaderboard: https://encyclo-k.github.io/

We've added the eval.yaml file and would like to be included in the shortlist.

OpenEvals org

hey @yimingliang! Everything looks great, we will add you to the shortlist and all should be set. Very impressive work on the evals! Do you think it would be possible to open PRs on the models you evaluated with the results from your leaderboard?

OpenEvals org

hey @adorkin! Thanks for reaching out. IFEval would require custom code to run; this feature is not available yet, but will be in the future. For WinoGrande, you could absolutely make an eval.yaml file and turn it into a benchmark. You would need a small modification, though: the answer field should be either A or B instead of 1 or 2, and instead of having two columns for the choices, it would be easier to use one column with a list of choices. Then your benchmark would simply be a multichoice benchmark :) A rough sketch of the conversion is below.
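Roughly like this with the `datasets` library (just a sketch: it assumes the standard WinoGrande columns `sentence`, `option1`, `option2`, and `answer` with "1"/"2" values, plus a `test` split; adjust the names and split to whatever your dataset actually uses):

```python
from datasets import load_dataset

ds = load_dataset("tartuNLP/winogrande_et", split="test")

def to_multichoice(example):
    return {
        "question": example["sentence"],
        # single column holding the list of choices
        "choices": [example["option1"], example["option2"]],
        # map the 1/2 labels to A/B
        "answer": "A" if str(example["answer"]) == "1" else "B",
    }

ds = ds.map(to_multichoice, remove_columns=["option1", "option2"])
# push as a separate config so the original data stays untouched
ds.push_to_hub("tartuNLP/winogrande_et", config_name="multichoice")
```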

SaylorTwift pinned discussion

@SaylorTwift I see, thanks! Is the yaml expected to contain the prompt itself? It works well as a multiple-choice problem, but the formulation is nonetheless a bit non-standard, because you're filling in a gap rather than answering a question.

OpenEvals org

@adorkin yes, you can set the prompt in the yaml file like so: https://huggingface.co/datasets/cais/hle/blob/main/eval.yaml, using the multiple_choice solver instead of the system prompt. Here are the docs from Inspect.
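If it helps, this is roughly what the multiple_choice solver does in Inspect terms (a Python sketch of the equivalent task rather than the exact eval.yaml syntax; the `question` / `choices` / `answer` column names and the `test` split are assumptions matching the multichoice layout suggested above):

```python
from inspect_ai import Task, task
from inspect_ai.dataset import FieldSpec, hf_dataset
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice

@task
def winogrande_et():
    # load the Hub dataset and map its columns onto Inspect samples
    dataset = hf_dataset(
        "tartuNLP/winogrande_et",
        split="test",
        sample_fields=FieldSpec(
            input="question",
            choices="choices",
            target="answer",
        ),
    )
    # multiple_choice() builds the A/B prompt from the choices,
    # so no separate system prompt is needed
    return Task(dataset=dataset, solver=multiple_choice(), scorer=choice())
```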

@SaylorTwift I've added the eval.yaml and a custom dataset config to work with it. The dataset viewer seems to be stuck now, which may or may not be related.
https://huggingface.co/datasets/tartuNLP/winogrande_et/blob/main/eval.yaml
