Datasets
Dataset for Face Anti-Spoofing Challenge
by @AIOZNetwork

Dataset for Spaceship Titanic Challenge
by @AIOZNetwork

Dataset for Movie Reviews Challenge
by @AIOZNetwork

Dataset for Housing Prices, including train and test data.
by @AIOZNetwork

Multimodal dataset containing 1.2M media assets with metadata tags, sourced from licensed content libraries. Suitable for recommendation system training.
Radiology images are an essential part of clinical decision making and population screening, e.g., for cancer. Automated systems could help clinicians cope with large amounts of images by answering questions about the image contents. An emerging area of artificial intelligence, Visual Question Answering (VQA) in the medical domain explores approaches to this form of clinical decision support. Success of such machine learning tools hinges on availability and design of collections composed of medical images augmented with question-answer pairs directed at the content of the image. We introduce VQA-RAD, the first manually constructed dataset where clinicians asked naturally occurring questions about radiology images and provided reference answers. Manual categorization of images and questions provides insight into clinically relevant tasks and the natural language to phrase them.
This dataset is a resource for researchers who want to evaluate visual question answering performance on radiology images.
by @AIOZNetwork

Building machines with commonsense to compose realistically plausible sentences is challenging. CommonGen is a constrained text generation task, associated with a benchmark dataset, that explicitly tests machines for the ability of generative commonsense reasoning. Given a set of common concepts, the task is to generate a coherent sentence describing an everyday scenario using these concepts.
by @AIOZNetwork

The Benchmark of Linguistic Minimal Pairs (BLiMP) is a challenge set for evaluating the linguistic knowledge of language models (LMs) on major grammatical phenomena in English. Evaluations on it find that state-of-the-art models reliably identify morphological contrasts related to agreement, but struggle with some subtle semantic and syntactic phenomena.
by @AIOZNetwork

TAL-SCQ5K are high-quality mathematical competition datasets created by TAL Education Group.
by @AIOZNetwork

To create these datasets, the authors automatically translated the CSQA and CODAH datasets, previously available only in English, into 15 other languages.
by @AIOZNetwork

The DOCCI dataset consists of comprehensive descriptions of 15k images taken specifically to evaluate T2I and I2T models. The descriptions cover many key details in the images.
by @AIOZNetwork

The ARC dataset consists of 7,787 science exam questions drawn from a variety of sources, including science questions provided under license by a research partner affiliated with AI2.
by @AIOZNetwork

This is the repository for the PLOD dataset subset used for coursework (CW) in the NLP module 2023-2024 at the University of Surrey.
by @AIOZNetwork

NIH Chest X-Ray is a large dataset containing chest X-ray images of patients collected by the National Institutes of Health (NIH) of the United States.
by @AIOZNetwork

MNIST is a dataset of handwritten digits used to train and evaluate image classification models.
by @AIOZNetwork
