Email Spam Classification Dataset

Summary

Introduction
Dataset Structure
Reference
License

Introduction

This dataset provides labeled email samples categorized as spam or ham (not spam) and is the basis for the Email Spam Classification challenge. The goal is to train a machine learning model on these text messages so that it can detect spam automatically. Participants will explore text preprocessing, feature extraction, and classification methods to build an effective spam classifier.
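
The three steps named above (preprocessing, feature extraction, classification) can be sketched with the Python standard library alone. The sketch below uses a bag-of-words multinomial Naive Bayes classifier with Laplace smoothing, chosen purely for illustration; the challenge does not prescribe a model. The function names and the four toy emails are assumptions made up for this example and are not taken from the dataset.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Preprocessing: lowercase the text and keep alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(emails, labels):
    """Feature extraction: count word occurrences per class (0 = ham, 1 = spam)."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter(labels)
    for text, label in zip(emails, labels):
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab


def predict(model, text):
    """Classification: return the label with the higher log-probability score."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in (0, 1):
        # Log prior; assumes both classes occur in the training data.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen during training
            # Laplace (add-one) smoothing avoids log(0) for unseen pairs.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)


# Invented toy data, NOT rows from train.csv.
toy_emails = [
    "Win a free prize now",
    "Claim your free cash prize",
    "Meeting moved to Monday",
    "Dear friend, how are you?",
]
toy_labels = [1, 1, 0, 0]

model = train_naive_bayes(toy_emails, toy_labels)
spam_guess = predict(model, "free prize inside")
ham_guess = predict(model, "how are you on Monday")
```

On the real data the same functions would be trained on the email and label columns of train.csv. A stronger baseline would likely add TF-IDF weighting or a linear model, but that is beyond this sketch.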

Dataset Structure

Download Email Spam Classification Data.

Files

The dataset consists of two files:

train.csv:
- 2250 labeled emails for training machine learning models.
test.csv:
- 1311 unlabeled emails for prediction submission.
- Note:
  - Predict a label for each email in test.csv to generate the submission file.
  - Submissions are evaluated and ranked on the Public Leaderboard.
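
A submission file could be produced along the following lines. The column layout (email_index and label) and the helper name write_submission are assumptions for illustration; this page does not specify the submission format, so check the challenge rules before relying on it.

```python
import csv
import io


def write_submission(predictions, out):
    """Write (email_index, label) pairs as CSV rows under a header line.

    The header names are an assumption, not a documented format.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["email_index", "label"])
    for email_index, label in predictions:
        writer.writerow([email_index, label])


# Hypothetical predictions for two test emails.
buffer = io.StringIO()
write_submission([(1, 0), (2, 1)], buffer)
submission_text = buffer.getvalue()
```

To write to disk instead of an in-memory buffer, pass a file opened with open("submission.csv", "w", newline="") as the second argument; the file name is likewise an assumption.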

An example from train.csv:

email_index	email	label
1	"Dear friend, how are you?"	0

An example from test.csv:

email_index	email
1	"Dear friend, how are you?"
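
Rows like the examples above can be read with Python's csv module. This is a minimal sketch that assumes the files are comma-delimited with a header row and that email text containing commas is wrapped in double quotes; the sample string and the helper name read_rows are illustrative only.

```python
import csv
import io

# In-memory stand-in for train.csv, built from the example row above.
SAMPLE_TRAIN = 'email_index,email,label\n1,"Dear friend, how are you?",0\n'


def read_rows(handle):
    """Return each CSV row as a dict keyed by the header names."""
    return list(csv.DictReader(handle))


rows = read_rows(io.StringIO(SAMPLE_TRAIN))
first = rows[0]
```

For the actual file, the handle would come from open("train.csv", newline="", encoding="utf-8"), where the UTF-8 encoding is an assumption. All values are read as strings, so label still needs converting to an integer.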

Data Fields

The training dataset contains the following fields:

email_index: A unique identifier for each email.
email: The text content of the email.
label: The class label indicating whether the email is spam or not. 0 represents a non-spam email (ham), and 1 represents a spam email. This is the target variable for the classification task.

Note: The test dataset contains the same fields except label, which is the value to be predicted.
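
The field definitions above can be captured in a small typed record so that training and test rows share one representation. The Email class and parse_row helper below are hypothetical names introduced for this sketch. The label is optional because test.csv omits it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Email:
    email_index: int
    email: str
    label: Optional[int] = None  # None for rows from test.csv


def parse_row(row):
    """Convert a dict of strings (one CSV row) into an Email record."""
    raw_label = row.get("label")
    label = None
    if raw_label not in (None, ""):
        label = int(raw_label)
        if label not in (0, 1):
            # Only 0 (ham) and 1 (spam) are defined for this dataset.
            raise ValueError(f"unexpected label: {label}")
    return Email(int(row["email_index"]), row["email"], label)


train_email = parse_row(
    {"email_index": "1", "email": "Dear friend, how are you?", "label": "0"}
)
test_email = parse_row({"email_index": "1", "email": "Dear friend, how are you?"})
```

Rejecting labels other than 0 and 1 at parse time surfaces malformed rows early, before they reach model training.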

Reference

For more information about the dataset, please visit Kaggle — Email Spam Dataset.

License

This dataset is released under CC0: Public Domain.
