Start: 23/03/2026
Close: ∞
Email Spam Classification Challenge
Building models to identify spam email
Challenge Rewards: knowledge
Participants: 43
Submissions: 19
Email Spam Classification Dataset
Summary
Introduction
This dataset provides labeled email samples categorized as spam or ham (not spam), serving as the foundation for the Email Spam Classification challenge. The goal is to use these text-based messages to train a machine learning model that can automatically detect spam. With this dataset, participants will explore text preprocessing, feature extraction, and classification methods to build an effective spam-classification model.
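The three stages named above (preprocessing, feature extraction, classification) can be sketched with the standard library alone. This is a minimal illustration, not the challenge's reference solution: it assumes a simple word tokenizer, bag-of-words counts, and multinomial Naive Bayes with add-one smoothing. The class name `NaiveBayesSpamClassifier` and the tiny training set are hypothetical and are not drawn from train.csv.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Preprocessing: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesSpamClassifier:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        # Feature extraction: per-class bag-of-words counts.
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        self.total_words = {
            c: sum(self.word_counts[c].values()) for c in self.classes
        }
        return self

    def predict_one(self, text):
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_class, best_score = None, -math.inf
        for c in self.classes:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[c] / n_docs)
            for tok in tokenize(text):
                if tok not in self.vocab:
                    continue  # skip words never seen during training
                numerator = self.word_counts[c][tok] + 1
                denominator = self.total_words[c] + vocab_size
                score += math.log(numerator / denominator)
            if score > best_score:
                best_class, best_score = c, score
        return best_class

    def predict(self, texts):
        return [self.predict_one(t) for t in texts]


# Made-up examples for illustration only; 1 = spam, 0 = ham.
train_texts = [
    "Win a free prize now, click here",
    "Free money, claim your prize today",
    "Dear friend, how are you?",
    "Are we still meeting for lunch tomorrow?",
]
train_labels = [1, 1, 0, 0]

model = NaiveBayesSpamClassifier().fit(train_texts, train_labels)
predictions = model.predict(["Claim your free prize", "How are you, friend?"])
print(predictions)
```

In practice a library pipeline (for example TF-IDF features with a linear classifier) would likely replace this hand-written model. The sketch only shows how the stages fit together.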
Dataset Structure
Download Email Spam Classification Data.
Files
The dataset consists of two files:
- train.csv: 2250 labeled emails for training machine learning models.
- test.csv: 1311 unlabeled emails for which predictions are submitted.
- Note:
  - Labels predicted for the emails in test.csv make up the submission file.
  - Submissions are evaluated and ranked on the Public Leaderboard.
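As a rough sketch of turning predictions into a submission file: the page does not state the required submission format, so the two-column layout below (`email_index`, `label`) and the helper name `write_submission` are assumptions. Check the challenge's submission instructions before relying on them.

```python
import csv
import io


def write_submission(email_indices, predicted_labels, out_file):
    """Write one (email_index, label) row per test email.

    ASSUMPTION: the column names and layout are hypothetical; the
    challenge may require a different submission format.
    """
    if len(email_indices) != len(predicted_labels):
        raise ValueError("need exactly one prediction per test email")
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["email_index", "label"])
    for idx, label in zip(email_indices, predicted_labels):
        writer.writerow([idx, int(label)])


# Illustrative values only; real indices would come from test.csv.
buffer = io.StringIO()
write_submission([1, 2, 3], [0, 1, 0], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

To write to disk instead of an in-memory buffer, pass a file opened with `open("submission.csv", "w", newline="", encoding="utf-8")`. The file name here is also only a placeholder.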
An example from train.csv:
| email_index | email | label |
|---|---|---|
| 1 | "Dear friend, how are you?" | 0 |
An example from test.csv:
| email_index | email |
|---|---|
| 1 | "Dear friend, how are you?" |
Data Fields
The training dataset contains the following fields:
- email_index: A unique identifier for each email.
- email: The text content of the email.
- label: The class label indicating whether the email is spam or not. 0 represents a non-spam email (ham), and 1 represents a spam email. This is the target variable for the classification task.
Note: The test dataset includes the same fields, except it does not contain the label.
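The fields described above could be read with Python's `csv` module along these lines. This is a sketch under two assumptions: that the CSV header row uses exactly the field names `email_index`, `email`, and `label`, and that the files are comma-delimited. The in-memory sample stands in for train.csv, its second row is made up, and `load_labeled_emails` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

# Stand-in for the contents of train.csv; the second row is invented.
SAMPLE_TRAIN_CSV = '''email_index,email,label
1,"Dear friend, how are you?",0
2,"Win a free prize now!",1
'''


def load_labeled_emails(csv_file):
    """Return (indices, texts, labels) from a file with the training fields."""
    indices, texts, labels = [], [], []
    for row in csv.DictReader(csv_file):
        indices.append(int(row["email_index"]))
        texts.append(row["email"])
        labels.append(int(row["label"]))  # 0 = ham, 1 = spam
    return indices, texts, labels


indices, texts, labels = load_labeled_emails(io.StringIO(SAMPLE_TRAIN_CSV))
label_counts = Counter(labels)  # quick look at the class balance
print(indices, label_counts)
```

For the real file, the same function can be given `open("train.csv", newline="", encoding="utf-8")`. Since test.csv has no label column, it would need a variant that reads only `email_index` and `email`.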
Reference
For more information about the dataset, please visit Kaggle — Email Spam Dataset.
License
This dataset is released under CC0: Public Domain.