CommonGen

Summary

Introduction
Dataset Structure
Reference
License
Citation

Introduction

CommonGen is a constrained text generation task, paired with a benchmark dataset, that explicitly tests machines' ability to perform generative commonsense reasoning. Given a set of common concepts, the task is to generate a coherent sentence describing an everyday scenario using these concepts.

CommonGen is challenging because it inherently requires 1) relational reasoning using background commonsense knowledge, and 2) compositional generalization ability to work on unseen concept combinations. The dataset, constructed through a combination of crowdsourcing on Amazon Mechanical Turk (AMT) and existing caption corpora, consists of 30k concept-sets and 50k sentences in total.
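
The constraint that a generated sentence should use all of the given concepts can be illustrated with a rough coverage check. The sketch below is not an official CommonGen metric: the function name `covered_concepts` is hypothetical, and the prefix match is a crude stand-in for proper lemmatization (it would, for instance, wrongly count "skin" as a use of "ski").

```python
import re

def covered_concepts(concepts, sentence):
    """Return the concepts matched by some word in the sentence.

    Assumption: a concept counts as used when any lowercase word
    starts with it. This is a naive substitute for lemmatization.
    """
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [
        concept
        for concept in concepts
        if any(token.startswith(concept.lower()) for token in tokens)
    ]

concepts = ["ski", "mountain", "skier"]
sentence = "Three skiers are skiing on a snowy mountain."

hits = covered_concepts(concepts, sentence)
coverage = len(hits) / len(concepts)  # fraction of concepts used
print(hits, coverage)
```

A sentence that misses a concept yields a coverage below 1.0, which flags a violated constraint but says nothing about whether the sentence is coherent or plausible.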

Dataset Structure

Data Instances

Size of downloaded dataset files: 1.85 MB
Size of the generated dataset: 7.21 MB
Total amount of disk used: 9.06 MB

An example from the 'train' split looks as follows:

{
    "concept_set_idx": 0,
    "concepts": ["ski", "mountain", "skier"],
    "target": "Three skiers are skiing on a snowy mountain."
}
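
To make the record layout concrete, the sketch below parses the example above and turns it into an input/output text pair, as one might when preparing data for a sequence-to-sequence model. The helper name `to_text_pair` and the space-joined input format are assumptions made for illustration; they are not taken from the CommonGen authors' code.

```python
import json

# The 'train' example shown above, as a JSON string.
raw = """
{
    "concept_set_idx": 0,
    "concepts": ["ski", "mountain", "skier"],
    "target": "Three skiers are skiing on a snowy mountain."
}
"""

def to_text_pair(record):
    """Hypothetical helper: join the concepts into a single input string
    and pair it with the reference sentence."""
    source = " ".join(record["concepts"])
    return source, record["target"]

record = json.loads(raw)
source, target = to_text_pair(record)
print(source)
print(target)
```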

Data Fields

The data fields are the same across all splits.

concept_set_idx: an int32 feature.
concepts: a list of string features.
target: a string feature.
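
The three fields can be combined to collect every target sentence for a concept set. This card does not state how `concept_set_idx` relates rows to one another, so the sketch below assumes that rows sharing an index describe the same concept set with different target sentences. The function name `group_targets` is hypothetical, and only the first row comes from this card; the other two are invented for illustration.

```python
def group_targets(rows):
    """Map each concept_set_idx to its concepts and all of its targets.

    Assumption: rows with the same concept_set_idx share one concept set.
    """
    grouped = {}
    for row in rows:
        entry = grouped.setdefault(
            row["concept_set_idx"],
            {"concepts": list(row["concepts"]), "targets": []},
        )
        entry["targets"].append(row["target"])
    return grouped

rows = [
    # Row taken from the example in this card.
    {"concept_set_idx": 0, "concepts": ["ski", "mountain", "skier"],
     "target": "Three skiers are skiing on a snowy mountain."},
    # Invented rows, for illustration only; not actual dataset entries.
    {"concept_set_idx": 0, "concepts": ["ski", "mountain", "skier"],
     "target": "A skier skis down the mountain."},
    {"concept_set_idx": 1, "concepts": ["dog", "ball"],
     "target": "A dog chases a ball."},
]

grouped = group_targets(rows)
for idx, entry in grouped.items():
    print(idx, entry["concepts"], len(entry["targets"]))
```

Grouping this way gives one entry per concept set with a list of references, which is a convenient shape for multi-reference evaluation.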

Data Splits

name	train	validation	test
default	67389	4018	1497
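
As a quick sanity check on the table above, the split sizes can be summed and expressed as shares of the whole. The snippet uses only the row counts listed here; the variable names are arbitrary.

```python
# Row counts copied from the Data Splits table.
split_sizes = {"train": 67389, "validation": 4018, "test": 1497}

total = sum(split_sizes.values())
shares = {name: count / total for name, count in split_sizes.items()}

for name, count in split_sizes.items():
    print(f"{name:<10} {count:>6} {shares[name]:.1%}")
print(f"{'total':<10} {total:>6}")
```

The training split accounts for the large majority of rows, with validation and test together making up well under a tenth.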

Reference

We would like to acknowledge Lin, Bill Yuchen et al. for creating and maintaining the CommonGen dataset as a valuable resource for the natural language processing and machine learning research community. For more information about the CommonGen dataset and its creators, please visit the CommonGen website.

License

The dataset has been released under the MIT License.

Citation

@inproceedings{lin-etal-2020-commongen,
    title = "{C}ommon{G}en: A Constrained Text Generation Challenge for Generative Commonsense Reasoning",
    author = "Lin, Bill Yuchen  and
      Zhou, Wangchunshu  and
      Shen, Ming  and
      Zhou, Pei  and
      Bhagavatula, Chandra  and
      Choi, Yejin  and
      Ren, Xiang",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.findings-emnlp.165",
    doi = "10.18653/v1/2020.findings-emnlp.165",
    pages = "1823--1840"
}