
CommonGen
Building machines with commonsense to compose realistically plausible sentences is challenging. CommonGen is a constrained text generation task, associated with a benchmark dataset, that explicitly tests machines' ability to perform generative commonsense reasoning. Given a set of common concepts, the task is to generate a coherent sentence describing an everyday scenario using these concepts.
Summary
Introduction
CommonGen is a constrained text generation task, associated with a benchmark dataset, that explicitly tests machines' ability to perform generative commonsense reasoning. Given a set of common concepts, the task is to generate a coherent sentence describing an everyday scenario using these concepts.
CommonGen is challenging because it inherently requires 1) relational reasoning with background commonsense knowledge, and 2) compositional generalization to unseen concept combinations. Our dataset, constructed through a combination of crowdsourcing on Amazon Mechanical Turk (AMT) and existing caption corpora, consists of 30k concept-sets and 50k sentences in total.
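To make the constraint concrete, here is a minimal sketch of how one might check that a generated sentence mentions every concept in a concept set. It is not the benchmark's evaluation method: the function names `covered_concepts` and `uses_all_concepts` are hypothetical, and the prefix match is an assumed, crude stand-in for proper lemmatization.

```python
import re


def covered_concepts(concepts, sentence):
    """Return the concepts that appear in the sentence under a naive prefix match."""
    # Lowercase the sentence and keep alphabetic tokens only.
    tokens = re.findall(r"[a-z]+", sentence.lower())
    # Assumption: a concept counts as used if some token starts with it,
    # so "ski" matches "skiing" and "skier" matches "skiers".
    return {c for c in concepts if any(t.startswith(c.lower()) for t in tokens)}


def uses_all_concepts(concepts, sentence):
    """True if every concept in the set is covered by the sentence."""
    return covered_concepts(concepts, sentence) == set(concepts)
```

The prefix match both under-matches irregular forms (for example "run" and "ran") and over-matches unrelated words (for example "car" and "carpet"), so a lemmatizer would be the more careful choice in practice.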
Dataset Structure
Data Instances
- Size of downloaded dataset files: 1.85 MB
- Size of the generated dataset: 7.21 MB
- Total amount of disk used: 9.06 MB
An example from the 'train' split looks as follows:
{
"concept_set_idx": 0,
"concepts": ["ski", "mountain", "skier"],
"target": "Three skiers are skiing on a snowy mountain."
}
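An instance like this can be handled as a plain Python dictionary. The sketch below collects the target sentences for each concept set, assuming that several rows may share one `concept_set_idx` (one row per reference sentence). That assumption is not stated in this card, the helper name `group_by_concept_set` is hypothetical, and the second record is made up for illustration.

```python
from collections import defaultdict


def group_by_concept_set(records):
    """Map each concept_set_idx to its concepts and all of its target sentences."""
    grouped = defaultdict(lambda: {"concepts": None, "targets": []})
    for record in records:
        entry = grouped[record["concept_set_idx"]]
        entry["concepts"] = list(record["concepts"])
        entry["targets"].append(record["target"])
    return dict(grouped)


records = [
    # Instance taken from the example in this card.
    {"concept_set_idx": 0, "concepts": ["ski", "mountain", "skier"],
     "target": "Three skiers are skiing on a snowy mountain."},
    # Hypothetical second row for the same concept set (illustration only).
    {"concept_set_idx": 0, "concepts": ["ski", "mountain", "skier"],
     "target": "A skier skis down the mountain."},
]
grouped = group_by_concept_set(records)
```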
Data Fields
The data fields are the same among all splits.
- concept_set_idx: an int32 feature.
- concepts: a list of string features.
- target: a string feature.
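The field types listed above can be checked with a small validator. This is a minimal sketch and not part of the dataset's tooling: `validate_record` is a hypothetical helper, and the int32 range check is an assumption about how the `concept_set_idx` type should be enforced on plain Python integers.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def validate_record(record):
    """Return a list of schema problems; an empty list means the record looks valid."""
    problems = []

    idx = record.get("concept_set_idx")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(idx, bool) or not isinstance(idx, int) or not INT32_MIN <= idx <= INT32_MAX:
        problems.append("concept_set_idx must be an integer in the int32 range")

    concepts = record.get("concepts")
    if not isinstance(concepts, list) or not all(isinstance(c, str) for c in concepts):
        problems.append("concepts must be a list of strings")

    if not isinstance(record.get("target"), str):
        problems.append("target must be a string")

    return problems


example = {
    "concept_set_idx": 0,
    "concepts": ["ski", "mountain", "skier"],
    "target": "Three skiers are skiing on a snowy mountain.",
}
```

Calling `validate_record(example)` returns an empty list for the instance shown in this card.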
Data Splits
name | train | validation | test |
---|---|---|---|
default | 67389 | 4018 | 1497 |
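As a quick arithmetic check on the table above, the following sketch sums the split sizes and computes each split's share of the total. The counts are copied from the table; the variable names are illustrative.

```python
# Split sizes copied from the table above.
SPLIT_SIZES = {"train": 67389, "validation": 4018, "test": 1497}

total = sum(SPLIT_SIZES.values())
fractions = {name: size / total for name, size in SPLIT_SIZES.items()}

for name, size in SPLIT_SIZES.items():
    # Print the split name, its size, and its share of all examples.
    print(f"{name:<10} {size:>6} {fractions[name]:6.1%}")
print(f"{'total':<10} {total:>6}")
```

The three splits add up to 72,904 examples, of which the training split accounts for roughly 92%.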
Reference
We would like to acknowledge Lin, Bill Yuchen et al. for creating and maintaining the CommonGen dataset as a valuable resource for the natural language processing and machine learning research community. For more information about the CommonGen dataset and its creators, please visit the CommonGen website.
License
The dataset has been released under the MIT License.
Citation
@inproceedings{lin-etal-2020-commongen,
title = "{C}ommon{G}en: A Constrained Text Generation Challenge for Generative Commonsense Reasoning",
author = "Lin, Bill Yuchen and
Zhou, Wangchunshu and
Shen, Ming and
Zhou, Pei and
Bhagavatula, Chandra and
Choi, Yejin and
Ren, Xiang",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.findings-emnlp.165",
doi = "10.18653/v1/2020.findings-emnlp.165",
pages = "1823--1840"
}