PLOD: An Abbreviation Detection Dataset

Summary

Introduction

PLOD is an English-language dataset of abbreviations and their long forms tagged in text. It was collected for research purposes from the PLOS journals' indexing of abbreviations and long forms in the text. The dataset was created to support the Natural Language Processing task of abbreviation detection and covers the scientific domain.

Dataset Structure

Data Instances

A typical data point comprises an ID, a list of tokens present in the text, a list of pos_tags for the corresponding tokens obtained via spaCy, and a list of ner_tags, which are limited to AC for abbreviations (acronyms) and LF for long forms. Both tag lists are integer-encoded and aligned one-to-one with the tokens.

An example from the dataset:

{ 
    'tokens': ['Study', '-', 'specific', 'risk', 'ratios', '(', 'RRs', ')', 'and', 'mean', 'BW', 'differences', 'were', 'calculated', 'using', 'linear', 'and', 'log', '-', 'binomial', 'regression', 'models', 'controlling', 'for', 'confounding', 'using', 'inverse', 'probability', 'of', 'treatment', 'weights', '(', 'IPTW', ')', 'truncated', 'at', 'the', '1st', 'and', '99th', 'percentiles', '.'], 
    'pos_tags': [8, 13, 0, 8, 8, 13, 12, 13, 5, 0, 12, 8, 3, 16, 16, 0, 5, 0, 13, 0, 8, 8, 16, 1, 8, 16, 0, 8, 1, 8, 8, 13, 12, 13, 16, 1, 6, 0, 5, 0, 8, 13], 
    'ner_tags': [0, 0, 0, 3, 4, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 4, 4, 4, 4, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0] 
}

Data Fields

The dataset has the following fields:

tokens: The tokens contained in the text.
pos_tags: The part-of-speech tag for each corresponding token above, obtained from spaCy.
ner_tags: The tags marking abbreviations and long forms, one per token.
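
The alignment between tokens and ner_tags can be illustrated with a minimal sketch that groups tagged tokens into abbreviation and long-form spans. The integer-to-label mapping below is an assumption inferred from the example instance (0 for untagged tokens, 1 for an abbreviation, 3 and 4 for the first and following tokens of a long form); it is not stated in this card and should be checked against the dataset's own label definitions. The names ASSUMED_LABELS and extract_spans are hypothetical helpers, not part of any PLOD release.

```python
# Assumed mapping from tag IDs to labels; inferred from the example instance,
# not documented in this card. Verify against the dataset's label names.
ASSUMED_LABELS = {0: "O", 1: "B-AC", 2: "I-AC", 3: "B-LF", 4: "I-LF"}


def extract_spans(tokens, ner_tags):
    """Group consecutive tagged tokens into (kind, text) spans."""
    if len(tokens) != len(ner_tags):
        raise ValueError("tokens and ner_tags must have the same length")

    spans = []
    current_kind, current_tokens = None, []
    for token, tag_id in zip(tokens, ner_tags):
        label = ASSUMED_LABELS[tag_id]
        if label == "O":
            prefix, kind = "O", None
        else:
            prefix, kind = label.split("-")

        # An I- tag of the same kind extends the span that is already open.
        if prefix == "I" and kind == current_kind:
            current_tokens.append(token)
            continue

        # Any other tag closes the open span (if there is one) ...
        if current_kind is not None:
            spans.append((current_kind, " ".join(current_tokens)))
        # ... and a B- tag (or a stray I- tag) opens a new one.
        current_kind = kind
        current_tokens = [token] if kind is not None else []

    if current_kind is not None:
        spans.append((current_kind, " ".join(current_tokens)))
    return spans


# First twelve tokens of the example instance shown in this card.
tokens = ['Study', '-', 'specific', 'risk', 'ratios', '(', 'RRs', ')',
          'and', 'mean', 'BW', 'differences']
ner_tags = [0, 0, 0, 3, 4, 0, 1, 0, 0, 0, 1, 0]

for kind, text in extract_spans(tokens, ner_tags):
    print(kind, text)
```

Under the assumed mapping, this yields the long form "risk ratios" and the abbreviations "RRs" and "BW". The sketch only groups spans; it makes no attempt to pair an abbreviation with its long form, which the tags alone do not encode.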

Reference

We would like to acknowledge Leonardo Zilio, Hadeel Saadany, et al. for creating and maintaining the PLOD dataset as a valuable resource for the natural language processing and machine learning research community. For more information about the PLOD dataset and its creators, please visit the PLOD website.

License

The dataset has been released under a Creative Commons Attribution-ShareAlike 4.0 International License.

Citation

@InProceedings{zilio-EtAl:2022:LREC,
  author    = {Zilio, Leonardo and Saadany, Hadeel and Sharma, Prashant and Kanojia, Diptesh and Orăsan, Constantin},
  title     = {PLOD: An Abbreviation Detection Dataset for Scientific Documents},
  booktitle      = {Proceedings of the Language Resources and Evaluation Conference},
  month          = {June},
  year           = {2022},
  address        = {Marseille, France},
  publisher      = {European Language Resources Association},
  pages     = {680--688},
  abstract  = {The detection and extraction of abbreviations from unstructured texts can help to improve the performance of Natural Language Processing tasks, such as machine translation and information retrieval. However, in terms of publicly available datasets, there is not enough data for training deep-neural-networks-based models to the point of generalising well over data. This paper presents PLOD, a large-scale dataset for abbreviation detection and extraction that contains 160k+ segments automatically annotated with abbreviations and their long forms. We performed manual validation over a set of instances and a complete automatic validation for this dataset. We then used it to generate several baseline models for detecting abbreviations and long forms. The best models achieved an F1-score of 0.92 for abbreviations and 0.89 for detecting their corresponding long forms. We release this dataset along with our code and all the models publicly at https://github.com/surrey-nlp/PLOD-AbbreviationDetection},
  url       = {https://aclanthology.org/2022.lrec-1.71}
}