Blanca committed · verified
Commit d776026 · 1 Parent(s): 95df290

Update content.py

Files changed (1)
  1. content.py (+10 -10)
content.py CHANGED
@@ -26,10 +26,10 @@ SUBMISSION_TEXT = """
## Submissions
Results can be submitted for the test set only. Scores are expressed as the percentage of correct answers for a given split.

- Evaluation is done by comparing the newly generated question to the reference questions using gemma-2-9b-it, and inheriting the label of the most similar reference. Questions were no reference is found are considered invalid.
+ Evaluation is done by comparing the newly generated question to the reference questions using Semantic Text Similarity, and inheriting the label of the most similar reference. Questions were no reference is found are considered invalid. See the evaluation function [here](https://huggingface.co/spaces/HiTZ/Critical_Questions_Leaderboard/blob/main/app.py#L141), or find more details in the [paper](https://arxiv.org/abs/2505.11341).

- We expect submissions to be json-line files with the following format.
- ```jsonl
+ We expect submissions to be json files with the following format.
+ ```json
{
    "CLINTON_1_1": {
        "intervention_id": "CLINTON_1_1",
@@ -54,20 +54,20 @@ We expect submissions to be json-line files with the following format.
}
```

- Our scoring function can be found [here](https://huggingface.co/spaces/HiTZ/Critical_Questions_Leaderboard/blob/main/scorer.py).
+ After clicking 'Submit Eval' wait for a couple of minutes before trying to refresh.
+
+ If you find any issues, please email blancacalvofigueras@gmail.com

- This leaderboard was created using as base the [GAIA-benchmark](https://huggingface.co/spaces/gaia-benchmark/leaderboard)
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""@misc{figueras2025benchmarkingcriticalquestionsgeneration,
  title={Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models},
- author={Banca Calvo Figueras and Rodrigo Agerri},
+ author={Calvo Figueras, Banca and Rodrigo Agerri},
  year={2025},
- eprint={2505.11341},
- archivePrefix={arXiv},
- primaryClass={cs.CL},
- url={https://arxiv.org/abs/2505.11341},
+ booktitle={2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023)},
+ organization={Association for Computational Linguistics (ACL)},
+ url={https://arxiv.org/abs/2505.11341},
}"""
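For context, the evaluation step described in the updated submission text (compute the semantic similarity between a generated question and each reference question, inherit the label of the most similar reference, and treat questions with no sufficiently similar reference as invalid) can be sketched roughly as below. This is a minimal illustrative sketch, not the leaderboard's actual scorer: the embedding model, the 0.6 threshold, and the example labels are assumptions; the real implementation is the app.py function linked in the diff.

```python
# Minimal sketch of similarity-based label inheritance (illustrative only; not the official scorer).
# Assumptions: a sentence-transformers embedding model, a 0.6 cosine-similarity threshold,
# and references given as {reference_question: label}.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model, not necessarily the one the Space uses


def label_generated_question(generated: str, references: dict[str, str], threshold: float = 0.6) -> str:
    """Return the label of the most similar reference question, or 'Invalid' if none is close enough."""
    ref_questions = list(references.keys())
    emb_gen = model.encode(generated, convert_to_tensor=True)
    emb_refs = model.encode(ref_questions, convert_to_tensor=True)
    scores = util.cos_sim(emb_gen, emb_refs)[0]   # cosine similarity against every reference question
    best = int(scores.argmax())
    if float(scores[best]) < threshold:           # no reference found -> question counts as invalid
        return "Invalid"
    return references[ref_questions[best]]


# Example with made-up reference questions and labels:
refs = {
    "What evidence supports this claim?": "Useful",
    "Is the speaker an expert on this topic?": "Unhelpful",
}
print(label_generated_question("Which evidence backs up the claim?", refs))  # -> "Useful"
```

The split score would then simply be the percentage of generated questions whose inherited label is correct, as stated in the submission text.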