All Datasets

Dataset July 2026

About bugs in natural 🐞 🐞🐞

Apache-2.0

1K – 10K

Image Classification

by @brian-ai-6899

Smoker Classification Challenge

Dataset for Smoker Classification Challenge

CC-BY-4.0

< 1K

Image Classification

English

by @AIOZAI

License Plate Recognition Challenge

Dataset for License Plate Recognition Challenge

CC0-1.0

1K – 10K

Image-to-Text

Object Detection

English

by @AIOZAI

Spaceship Titanic Prediction Challenge

Dataset for Spaceship Titanic Prediction Challenge

CC-BY-4.0

1K – 10K

Tabular Classification

English

by @AIOZAI

Melanoma Skin Cancer Classification Challenge

Dataset for Melanoma Skin Cancer Classification Challenge

CC0-1.0

1K – 10K

Image Classification

English

by @AIOZAI

Iris Flower Classification Challenge

Dataset for Iris Flower Classification Challenge

CC-BY-4.0

1K – 10K

Tabular Classification

English

by @AIOZAI

Email Spam Classification Challenge

Dataset for Email Spam Classification Challenge

CC0-1.0

1K – 10K

Text Classification

English

by @AIOZAI

Pneumonia Chest X-Ray Classification Challenge

Dataset for Pneumonia Chest X-Ray Classification Challenge

CC-BY-4.0

1K – 10K

Image Classification

English

by @AIOZAI

Pothole Detection Challenge

Data for Pothole Detection Challenge

CC-BY-4.0

< 1K

Object Detection

English

by @AIOZAI

136

Face Anti-Spoofing Challenge

Dataset for Face Anti-Spoofing Challenge

other

100M – 1B

Image Classification

English

by @AIOZAI

131

Movie Reviews Challenge

Dataset for Movie Reviews Challenge

other

10M – 100M

Text Classification

English

by @AIOZAI

226

147

Housing Prices Challenge

Dataset for Housing Prices Challenge

MIT

10K – 100K

Tabular Regression

English

by @AIOZAI

273

154

CodeComplex: Code Complexity Prediction Dataset

CodeComplex consists of 4,500 Java programs submitted to programming competitions by human programmers, with complexity labels annotated by a group of algorithm experts.

Apache-2.0

1K – 10K

Text Generation

code

by @AIOZAI

158

147

DocLayNet

DocLayNet is a human-annotated document layout segmentation dataset containing 80,863 pages from a broad variety of document sources.

cdla-permissive-1.0

10K – 100K

Object Detection

Image Segmentation

English

by @AIOZAI

155

156

SciTLDR

The SciTLDR dataset provides a valuable resource for research and development in the field of summarization.

Apache-2.0

1K – 10K

Summarization

English

by @AIOZAI

163

150

XStoryCloze

The Story Cloze Test is a commonsense reasoning framework for evaluating story understanding, story generation, and script learning. The test requires a system to choose the correct ending to a four-sentence story.

CC-BY-SA-4.0

10K – 100K

English

Russian

Chinese

Spanish

Arabic

Hindi

Indonesian

Telugu

Swahili

Basque

Burmese

by @AIOZAI

173

152

TOFU: Task of Fictitious Unlearning

The TOFU dataset serves as a benchmark for evaluating unlearning performance of large language models on realistic tasks.

MIT

10K – 100K

Question Answering

English

by @AIOZAI

153

145

TextVQA

TextVQA is a dataset to benchmark visual reasoning based on text in images. TextVQA requires models to read and reason about text in images to answer questions about them.

CC-BY-4.0

10K – 100K

Visual Question Answering

English

by @AIOZAI

166

156

XQuAD

This dataset is a great resource for researchers who want to evaluate cross-lingual question answering performance.

CC-BY-SA-4.0

10K – 100K

Question Answering

English

Arabic

German

Greek

Spanish

Hindi

Romanian

Russian

Thai

Turkish

Chinese

Vietnamese

by @AIOZAI

244

155

CommonGen

Building machines with the commonsense to compose realistically plausible sentences is challenging. CommonGen is a constrained text generation task, with an associated benchmark dataset, that explicitly tests machines for generative commonsense reasoning. Given a set of common concepts, the task is to generate a coherent sentence describing an everyday scenario using these concepts.

MIT

10K – 100K

Text2Text Generation

English

by @AIOZAI

240

158
