In the Counterfactual Logical Modification (CLOMO) task, a model is given an Argument and a Premise 1 that stand in a logical relation R, together with an additional Premise 2 that perturbs R. The model is required to modify the Argument into Argument' such that R holds between Argument' and Premise 2.
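A single CLOMO instance can be represented as follows. This is a minimal sketch: the class, field names, and example texts are illustrative, not the released dataset's actual schema.

```python
from dataclasses import dataclass

# Hypothetical container for one CLOMO instance; field names are
# illustrative and may differ from the released dataset's schema.
@dataclass
class ClomoInstance:
    argument: str    # the original Argument
    premise_1: str   # Premise 1, in relation R with the Argument
    premise_2: str   # Premise 2, which perturbs R
    relation: str    # the logical relation R (one of four categories)

# The four logical relations covered by CLOMO.
RELATIONS = (
    "Necessary Assumption",
    "Sufficient Assumption",
    "Strengthen",
    "Weaken",
)

# Illustrative (made-up) instance, not taken from the dataset.
example = ClomoInstance(
    argument="All swans observed so far are white, so all swans are white.",
    premise_1="A large number of swan observations supports the generalization.",
    premise_2="Black swans have been documented in Australia.",
    relation="Weaken",
)
assert example.relation in RELATIONS
```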
We thus introduce the CLOMO dataset with 1,000 high-quality and challenging questions spanning four logical relations. The data are collected via multi-turn human annotation and verification.
An LLM is given an argument and two premises. The LLM needs to modify the statements in the Argument so that the logical relation R switches from holding in state 1 to holding in state 2.
Four categories of logical restrictions in CLOMO.
You can download the dataset from the GitHub repo.
Statistics of CLOMO.
CLOMO input length statistics by number of tokens. CoT: The chain-of-thought setting. Few: The few-shot setting. Zero: The zero-shot setting.
Examples in CLOMO.
Examples of different prompting setups for CLOMO.
Example few-shot input and output of CLOMO. Logical relation: Necessary Assumption.
Example few-shot input and output of CLOMO. Logical relation: Sufficient Assumption.
Example few-shot input and output of CLOMO. Logical relation: Strengthen.
Example few-shot input and output of CLOMO. Logical relation: Weaken.
Example zero-shot input and output of CLOMO. Logical relation: Necessary Assumption.
Example zero-shot input and output of CLOMO. Logical relation: Sufficient Assumption.
Example zero-shot input and output of CLOMO. Logical relation: Strengthen.
Example zero-shot input and output of CLOMO. Logical relation: Weaken.
Example plain CoT input and output of CLOMO. Logical relation: Necessary Assumption.
Example plain CoT input and output of CLOMO. Logical relation: Sufficient Assumption.
Example plain CoT input and output of CLOMO. Logical relation: Strengthen.
Example plain CoT input and output of CLOMO. Logical relation: Weaken.
Additionally, we introduce a Self-Evaluation Score (SES) for logically consistent generation in CLOMO. SES decomposes the evaluation into several basic discrimination tasks for LLMs, and is shown to be comparable with human evaluation.
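The decomposition idea behind SES can be sketched as follows. This is a minimal sketch under stated assumptions: the `judge` callable stands in for an LLM answering yes/no discrimination questions, and the concrete question set is illustrative, not the paper's exact task list.

```python
from typing import Callable

def self_evaluation_score(
    modified_argument: str,
    premise_2: str,
    relation: str,
    judge: Callable[[str], bool],
) -> float:
    """Sketch of SES: decompose the evaluation of a generated
    Argument' into binary discrimination questions and average
    the judge's verdicts. The questions below are illustrative."""
    questions = [
        # Does the required logical relation hold between Argument'
        # and Premise 2?
        f"Does the relation '{relation}' hold between the argument and "
        f"the premise?\nArgument: {modified_argument}\nPremise: {premise_2}",
        # Is the modified argument itself well-formed?
        f"Is the following argument fluent and coherent?\n"
        f"Argument: {modified_argument}",
    ]
    verdicts = [judge(q) for q in questions]
    return sum(verdicts) / len(verdicts)

# Usage with a trivial stand-in judge that always answers "yes";
# in practice, judge() would query an LLM.
score = self_evaluation_score("...", "...", "Weaken", judge=lambda q: True)
assert score == 1.0
```

In this sketch, each discrimination question is a task an LLM can answer more reliably than scoring the full generation directly, and the average of the binary verdicts yields the final score.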
Decomposed SES evaluation tasks.
@inproceedings{huang2023clomo,
author = {Huang, Yinya and Hong, Ruixin and Zhang, Hongming and Shao, Wei and Yang, Zhicheng and Yu, Dong and Zhang, Changshui and Liang, Xiaodan and Song, Linqi},
title = {{CLOMO}: Counterfactual Logical Modification with Large Language Models},
booktitle = {The 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)},
year = {2024}
}