Blanca committed · verified
Commit d776026 · 1 Parent(s): 95df290

Update content.py

Files changed (1)
  1. content.py (+10 -10)
content.py CHANGED
@@ -26,10 +26,10 @@ SUBMISSION_TEXT = """
## Submissions
Results can be submitted for the test set only. Scores are expressed as the percentage of correct answers for a given split.

- Evaluation is done by comparing the newly generated question to the reference questions using gemma-2-9b-it, and inheriting the label of the most similar reference. Questions were no reference is found are considered invalid.
+ Evaluation is done by comparing the newly generated question to the reference questions using Semantic Text Similarity, and inheriting the label of the most similar reference. Questions were no reference is found are considered invalid. See the evaluation function [here](https://huggingface.co/spaces/HiTZ/Critical_Questions_Leaderboard/blob/main/app.py#L141), or find more details in the [paper](https://arxiv.org/abs/2505.11341).

- We expect submissions to be json-line files with the following format.
- ```jsonl
+ We expect submissions to be json files with the following format.
+ ```json
{
    "CLINTON_1_1": {
        "intervention_id": "CLINTON_1_1",
@@ -54,20 +54,20 @@ We expect submissions to be json-line files with the following format.
}
```

- Our scoring function can be found [here](https://huggingface.co/spaces/HiTZ/Critical_Questions_Leaderboard/blob/main/scorer.py).
+ After clicking 'Submit Eval' wait for a couple of minutes before trying to refresh.
+
+ If you find any issues, please email blancacalvofigueras@gmail.com

- This leaderboard was created using as base the [GAIA-benchmark](https://huggingface.co/spaces/gaia-benchmark/leaderboard)
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""@misc{figueras2025benchmarkingcriticalquestionsgeneration,
  title={Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models},
- author={Banca Calvo Figueras and Rodrigo Agerri},
+ author={Calvo Figueras, Banca and Rodrigo Agerri},
  year={2025},
- eprint={2505.11341},
- archivePrefix={arXiv},
- primaryClass={cs.CL},
- url={https://arxiv.org/abs/2505.11341},
+ booktitle={2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023)},
+ organization={Association for Computational Linguistics (ACL)},
+ url={https://arxiv.org/abs/2505.11341},
}"""
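For context, the evaluation step described in the updated submission text (compute the semantic similarity between a generated question and each reference question, inherit the label of the most similar reference, and treat questions with no sufficiently similar reference as invalid) can be sketched roughly as below. This is a minimal illustrative sketch, not the leaderboard's actual scorer: the embedding model, the 0.6 threshold, and the example labels are assumptions; the real implementation is the app.py function linked in the diff.

```python
# Minimal sketch of similarity-based label inheritance (illustrative only; not the official scorer).
# Assumptions: a sentence-transformers embedding model, a 0.6 cosine-similarity threshold,
# and references given as {reference_question: label}.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model, not necessarily the one the Space uses


def label_generated_question(generated: str, references: dict[str, str], threshold: float = 0.6) -> str:
    """Return the label of the most similar reference question, or 'Invalid' if none is close enough."""
    ref_questions = list(references.keys())
    emb_gen = model.encode(generated, convert_to_tensor=True)
    emb_refs = model.encode(ref_questions, convert_to_tensor=True)
    scores = util.cos_sim(emb_gen, emb_refs)[0]   # cosine similarity against every reference question
    best = int(scores.argmax())
    if float(scores[best]) < threshold:           # no reference found -> question counts as invalid
        return "Invalid"
    return references[ref_questions[best]]


# Example with made-up reference questions and labels:
refs = {
    "What evidence supports this claim?": "Useful",
    "Is the speaker an expert on this topic?": "Unhelpful",
}
print(label_generated_question("Which evidence backs up the claim?", refs))  # -> "Useful"
```

The split score would then simply be the percentage of generated questions whose inherited label is correct, as stated in the submission text.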