Movie Reviews Data

Summary

Introduction
Dataset Structure
Reference
License
Citation

Introduction

This dataset is derived from the Large Movie Review Dataset published by the Association for Computational Linguistics. It provides labeled data for binary sentiment classification, containing movie reviews paired with their corresponding sentiment labels. The dataset is used in the Movie Review Challenge, enabling participants to develop and evaluate sentiment classification models.

Dataset Structure

Download

Download Movie Reviews Data.

Contents

The dataset is distributed as two .zip archives containing the following files when extracted:

train_1.csv:
- 17,484 labeled reviews for model training
train_2.csv:
- An additional 17,484 labeled reviews for model training
test.csv:
- 14,987 unlabeled reviews for prediction submission
- Note:
  - Labels predicted for the reviews in test.csv make up the submission file.
  - Submissions are evaluated and ranked on the Public Leaderboard.
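
Because the labeled reviews are split across two files, it is usually convenient to read them into a single collection before training. The following is a minimal sketch using only the Python standard library. The function name `load_reviews` is hypothetical, and the code assumes the files are UTF-8 encoded and start with a header row.

```python
import csv
from pathlib import Path


def load_reviews(paths):
    """Read one or more review CSV files and return all rows as a list of dicts."""
    rows = []
    for path in paths:
        # newline="" lets the csv module handle line breaks inside quoted fields.
        # UTF-8 is an assumption; the dataset page does not state the encoding.
        with Path(path).open(newline="", encoding="utf-8") as handle:
            rows.extend(csv.DictReader(handle))
    return rows
```

With both archives extracted into the working directory, `load_reviews(["train_1.csv", "train_2.csv"])` would return the combined training rows. The page does not say whether review_index values are unique across the two files, so it is worth checking before using that field as a key.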

An example from the training files (train_1.csv and train_2.csv):

review_index,review,sentiment
0,'The film is so bad',negative

An example from test.csv:

review_index,review
0,'The film is so bad'
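
The sample rows above wrap the review text in single quotes. If the actual files do the same (an assumption; inspect them first), a CSV parser with default settings keeps those quotes as part of the text. The sketch below shows the difference with Python's `csv` module. `SAMPLE` and `parse_rows` are illustrative names, not part of the dataset.

```python
import csv
import io

# The training example from this page, reproduced as an in-memory file.
SAMPLE = "review_index,review,sentiment\n0,'The film is so bad',negative\n"


def parse_rows(text, quotechar='"'):
    """Parse CSV text into a list of dicts using the given quote character."""
    return list(csv.DictReader(io.StringIO(text), quotechar=quotechar))


# Default settings leave the single quotes inside the review field.
with_quotes = parse_rows(SAMPLE)[0]["review"]

# Treating the single quote as the quote character strips them.
without_quotes = parse_rows(SAMPLE, quotechar="'")[0]["review"]
```

This only illustrates the sample rows. Real reviews containing apostrophes may not parse cleanly with `quotechar="'"`, so the setting should be confirmed against the downloaded files.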

Data Fields

The training dataset contains the following fields:

review_index: A unique identifier for each review.
review: The text content of the movie review.
sentiment: The sentiment label assigned to the review, either Positive or Negative.

The test dataset includes the same fields, except it does not contain the sentiment label.
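
The field descriptions above can be turned into a quick sanity check on parsed rows. This is a sketch under stated assumptions: rows are dicts keyed by column name (as produced by `csv.DictReader`), and labels are compared case-insensitively because the capitalization of the label values is not fixed by the description. The names `TRAIN_FIELDS`, `TEST_FIELDS`, `LABELS`, and `find_problems` are hypothetical.

```python
TRAIN_FIELDS = ["review_index", "review", "sentiment"]
TEST_FIELDS = ["review_index", "review"]
LABELS = {"positive", "negative"}


def find_problems(rows, expected_fields):
    """Return a list of messages describing rows that break the expected schema."""
    problems = []
    for position, row in enumerate(rows):
        if set(row) != set(expected_fields):
            problems.append(
                f"row {position}: fields {sorted(row)} do not match {sorted(expected_fields)}"
            )
            continue
        if "sentiment" in expected_fields:
            # Compare in lowercase so "Negative" and "negative" are both accepted.
            label = row["sentiment"].strip().lower()
            if label not in LABELS:
                problems.append(f"row {position}: unexpected sentiment {row['sentiment']!r}")
    return problems
```

An empty result means every row has exactly the expected columns and, for training rows, a recognized label.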

Reference

For more information about the dataset, see the movie review dataset page.

License

The dataset was published by the Association for Computational Linguistics.

Citation

@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}