TITLE = """<h1 align="center" id="space-title">Critical Questions Leaderboard</h1>"""
INTRODUCTION_TEXT = """
The Critical Questions Leaderboard is a benchmark that evaluates the capacity of language technology systems to generate critical questions. (See our [paper](https://arxiv.org/abs/2505.11341) for more details.)
The task of Critical Questions Generation consists of generating useful critical questions for a given argumentative text. For this purpose, a dataset of real debate interventions with associated critical questions has been released.
Critical questions are the set of inquiries that should be asked in order to judge whether an argument is acceptable or fallacious. They are designed to unmask the assumptions held by the premises of the argument and to attack its inference.
In the dataset, the argumentative texts are interventions from real debates, which have been annotated with Argumentation Schemes and then associated with a set of critical questions. For every intervention, the speaker, the set of Argumentation Schemes, and the critical questions are provided. The questions have been annotated according to their usefulness for challenging the arguments in each text, with one of three labels: Useful, Unhelpful, or Invalid. The goal of the task is to generate 3 critical questions that are Useful.
Each of these 3 critical questions is evaluated separately, and the resulting scores are then aggregated.
## Data
The Critical Questions Dataset consists of 220 interventions associated with ~5k gold-standard questions. These questions are in turn annotated as Useful, Unhelpful, or Invalid, and serve as references for the evaluation model.
The data can be found in [this dataset](https://huggingface.co/datasets/Blanca/CQs-Gen). The test set, contained in `test.jsonl`, holds 34 of the interventions; the validation set contains the remaining 186, and its reference questions are public.
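As a minimal sketch, the public validation split can be downloaded and inspected as follows (only `test.jsonl` is named above, so the validation file name here is an assumption; check the dataset repo for the exact name):
```python
import json

from huggingface_hub import hf_hub_download

# Download the validation split; "validation.jsonl" is a guessed file name,
# adjust it to whatever the dataset repo actually uses.
path = hf_hub_download(
    repo_id="Blanca/CQs-Gen",
    repo_type="dataset",
    filename="validation.jsonl",
)
with open(path, encoding="utf-8") as f:
    interventions = [json.loads(line) for line in f]

print(len(interventions))  # expected: 186 validation interventions
```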
## Leaderboard
Submissions made by our team are labelled as "CQs-Gen authors".
See below for submission instructions.
"""
SUBMISSION_TEXT = """
## Submissions
Results can be submitted for the test set only. Scores are expressed as the percentage of correct (i.e., Useful) questions for a given split.
Evaluation is done by comparing each newly generated question to the reference questions using gemma-2-9b-it and inheriting the label of the most similar reference. Questions for which no similar reference is found are considered Invalid.
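For example, a test submission contains 34 interventions × 3 questions = 102 questions; if 51 of them inherit the Useful label, the submission scores 50%. Schematically, the label-inheritance step works as sketched below (simplified; the actual implementation, including the gemma-2-9b-it comparison, is in the scorer linked at the end of this section):
```python
def inherit_label(generated_cq, references, most_similar_fn):
    # most_similar_fn stands in for the gemma-2-9b-it comparison; it returns
    # the closest reference question, or None if nothing similar is found.
    match = most_similar_fn(generated_cq, references)
    if match is None:
        return "Invalid"  # no reference found, so the question counts as Invalid
    return match["label"]  # inherit "Useful", "Unhelpful", or "Invalid"
```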
We expect submissions to be JSON-line files with the following format:
```jsonl
{
"CLINTON_1_1": {
"intervention_id": "CLINTON_1_1",
"intervention": "CLINTON: \"The central question in this election is really what kind of country we want to be and what kind of future we 'll build together\nToday is my granddaughter 's second birthday\nI think about this a lot\nwe have to build an economy that works for everyone , not just those at the top\nwe need new jobs , good jobs , with rising incomes\nI want us to invest in you\nI want us to invest in your future\njobs in infrastructure , in advanced manufacturing , innovation and technology , clean , renewable energy , and small business\nmost of the new jobs will come from small business\nWe also have to make the economy fairer\nThat starts with raising the national minimum wage and also guarantee , finally , equal pay for women 's work\nI also want to see more companies do profit-sharing\"",
"dataset": "US2016",
"cqs": [
{
"id": 0,
"cq": "What does the author mean by \"build an economy that works for everyone, not just those at the top\"?"
},
{
"id": 1,
"cq": "What is the author's definition of \"new jobs\" and \"good jobs\"?"
},
{
"id": 2,
"cq": "How will the author's plan to \"make the economy fairer\" benefit the working class?"
}
]
},
...
}
```
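Before uploading, a quick sanity check can catch formatting errors. A minimal sketch, assuming the single-object layout shown above (the local file name is hypothetical):
```python
import json

with open("submission.json", encoding="utf-8") as f:
    submission = json.load(f)

for key, entry in submission.items():
    # Top-level keys must match the nested intervention_id field.
    assert entry["intervention_id"] == key, key
    # The task asks for exactly 3 critical questions per intervention.
    assert len(entry["cqs"]) == 3, key
    for cq in entry["cqs"]:
        assert "id" in cq and "cq" in cq, key

print(len(submission), "interventions look well-formed")
```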
Our scoring function can be found [here](https://huggingface.co/spaces/HiTZ/Critical_Questions_Leaderboard/blob/main/scorer.py).
This leaderboard was built using the [GAIA benchmark leaderboard](https://huggingface.co/spaces/gaia-benchmark/leaderboard) as a base.
"""
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""@misc{figueras2025benchmarkingcriticalquestionsgeneration,
title={Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models},
      author={Blanca Calvo Figueras and Rodrigo Agerri},
year={2025},
eprint={2505.11341},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.11341},
}"""
def format_error(msg):
    # Render an error message as centered red HTML.
    return f"<p style='color: red; font-size: 20px; text-align: center;'>{msg}</p>"

def format_warning(msg):
    # Render a warning message as centered orange HTML.
    return f"<p style='color: orange; font-size: 20px; text-align: center;'>{msg}</p>"

def format_log(msg):
    # Render a success message as centered green HTML.
    return f"<p style='color: green; font-size: 20px; text-align: center;'>{msg}</p>"

def model_hyperlink(link, model_name):
    # Render a model name as a dotted-underline link that opens in a new tab.
    return f'<a target="_blank" href="{link}" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">{model_name}</a>'
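
# Hypothetical usage sketch (not part of the Space's front-end code): these
# helpers return raw HTML strings that the app can render directly.
if __name__ == "__main__":
    # Quick manual check of the HTML helpers; the messages are made up.
    print(format_error("Wrong file format."))
    print(format_warning("Submission already exists."))
    print(format_log("Submission received."))
    print(model_hyperlink("https://huggingface.co/datasets/Blanca/CQs-Gen", "CQs-Gen"))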