SMS Spam Collection

This is a text corpus of over 5,500 English SMS messages, roughly 13% of which are labeled as spam. The file contains one message per line with two columns: the label ("ham" or "spam") and the raw text of the message. Messages labeled "ham" are legitimate, non-spam messages.

Not sure where to begin? Scroll to the bottom to find challenges!

import pandas as pd

# The file has no header row, so pandas assigns integer column names (0, 1)
spam = pd.read_csv("SMSSpamCollection.csv", header=None)
print(spam.shape)
spam.head(100)
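If your copy of the corpus is the original plain-text file (as distributed, e.g., via the UCI Machine Learning Repository) rather than a prepared CSV, the label and message are separated by a tab. A minimal loading sketch, with the file name as in the original distribution:

import pandas as pd

# The raw corpus file has no header row and uses a tab between label and message
spam = pd.read_csv("SMSSpamCollection", sep="\t", header=None, names=["label", "message"])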

Source of dataset: this corpus was created by Tiago A. Almeida and José María Gómez Hidalgo.

Citations:

  • Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011.

  • Gómez Hidalgo, J.M., Almeida, T.A., Yamakami, A. On the Validity of a New SMS Spam Collection. Proceedings of the 11th IEEE International Conference on Machine Learning and Applications (ICMLA'12), Boca Raton, FL, USA, 2012.

  • Almeida, T.A., Gómez Hidalgo, J.M., Silva, T.P. Towards SMS Spam Filtering: Results under a New Dataset. International Journal of Information Security Science (IJISS), 2(1), 1-18, 2013.

Don't know where to start?

Challenges are brief tasks designed to help you practice specific skills:

  • 🗺️ Explore: What are the most common words in spam versus normal messages?
  • 📊 Visualize: Create a word cloud visualizing the most common words in the dataset.
  • 🔎 Analyze: What word is most likely to indicate that a message is spam?

Scenarios are broader questions to help you develop an end-to-end project for your portfolio:

You work for a telecom company that is launching a new messaging app. Unfortunately, the spam filters the company has used in the past are out of date and no longer effective. You have been asked whether you can use the new data they have supplied to accurately distinguish between spam and regular messages. They have also stressed that it is essential that regular messages are rarely, if ever, categorized as spam.

You will need to prepare a report that is accessible to a broad audience. It should outline your motivation, steps, findings, and conclusions.

spam.head()
# Give the two columns descriptive names
spam.columns = ['label', 'message']

spam.head()
import pandas as pd
from collections import Counter
import re
from nltk.corpus import stopwords
import nltk

# Download stopwords if not already downloaded
nltk.download('stopwords')

# Assuming the dataframe `spam` has columns "label" and "message"
# Separate spam and ham messages
spam_messages = spam[spam['label'] == 'spam']['message']
ham_messages = spam[spam['label'] == 'ham']['message']

# Function to preprocess and tokenize messages, removing stopwords
def preprocess_and_tokenize(messages):
    stop_words = set(stopwords.words('english'))
    words = []
    for message in messages:
        # Replace runs of non-word characters with a space and lowercase the text
        message = re.sub(r'\W+', ' ', message).lower()
        # Tokenize and remove stopwords
        words.extend([word for word in message.split() if word not in stop_words])
    return words

# Get words for spam and ham messages
spam_words = preprocess_and_tokenize(spam_messages)
ham_words = preprocess_and_tokenize(ham_messages)

# Get the most common words
spam_word_counts = Counter(spam_words)
ham_word_counts = Counter(ham_words)

# Display the 10 most common words in spam and ham messages
most_common_spam_words = spam_word_counts.most_common(10)
most_common_ham_words = ham_word_counts.most_common(10)

most_common_spam_words, most_common_ham_words
# Find common and unique words between spam and ham word counts
common_words = set(spam_word_counts.keys()).intersection(set(ham_word_counts.keys()))
unique_spam_words = set(spam_word_counts.keys()).difference(set(ham_word_counts.keys()))

# Number of common words
num_common_words = len(common_words)

# Number of unique spam words
num_unique_spam_words = len(unique_spam_words)

num_common_words, num_unique_spam_words
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Generate word cloud for common words
common_words_dict = {word: spam_word_counts[word] for word in common_words}
common_wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(common_words_dict)

# Generate word cloud for unique spam words
unique_spam_words_dict = {word: spam_word_counts[word] for word in unique_spam_words}
unique_spam_wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(unique_spam_words_dict)

# Plot the word clouds
plt.figure(figsize=(15, 8))

# Common words word cloud
plt.subplot(1, 2, 1)
plt.imshow(common_wordcloud, interpolation='bilinear')
plt.title('Common Words')
plt.axis('off')

# Unique spam words word cloud
plt.subplot(1, 2, 2)
plt.imshow(unique_spam_wordcloud, interpolation='bilinear')
plt.title('Unique Spam Words')
plt.axis('off')

plt.show()
# List of unique spam words
unique_spam_words_list = list(unique_spam_words)
unique_spam_words_list[:10]
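The word clouds and unique-word lists above address the first two challenges; the Analyze challenge asks which single word most strongly indicates spam. One simple way to estimate this, reusing the Counter objects built above, is to compare each word's relative frequency in spam versus ham. A minimal sketch; the +1 smoothing and the min_count cutoff are arbitrary choices to keep rare words from dominating, not part of the original analysis:

# Rank words by how much more frequent they are in spam than in ham.
# The +1 smoothing avoids division by zero for words absent from ham;
# the min_count cutoff (an assumption) filters out rare words that score high by chance.
total_spam = sum(spam_word_counts.values())
total_ham = sum(ham_word_counts.values())
min_count = 10

spam_ratio = {
    word: ((count + 1) / total_spam) / ((ham_word_counts[word] + 1) / total_ham)
    for word, count in spam_word_counts.items()
    if count >= min_count
}

# The ten words most indicative of spam under this score
sorted(spam_ratio.items(), key=lambda item: item[1], reverse=True)[:10]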
# Encode the label as an integer: 1 for spam, 0 for ham
spam['label'] = (spam['label'] == 'spam').astype(int)

spam.head()
spam['label'].sum()
# Feature extraction from the 'message' column to identify spam (1) vs ham (0)

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Example features to extract:
# 1. Length of the message
spam['message_length'] = spam['message'].apply(len)

# 2. Number of words in the message
spam['word_count'] = spam['message'].apply(lambda x: len(x.split()))

# 3. Average word length in the message (default to 0 for empty messages to avoid NaN)
spam['avg_word_length'] = spam['message'].apply(
    lambda x: np.mean([len(word) for word in x.split()]) if x.split() else 0
)

# 4. Count of special characters (e.g., '!', '$')
spam['special_char_count'] = spam['message'].apply(lambda x: sum([1 for char in x if char in '!$']))

# 5. Presence of specific keywords (e.g., 'free', 'win', 'offer')
keywords = ['free', 'win', 'offer']
for keyword in keywords:
    spam[f'contains_{keyword}'] = spam['message'].apply(lambda x: 1 if keyword in x.lower() else 0)

# 6. TF-IDF features
tfidf_vectorizer = TfidfVectorizer(max_features=100)  # Limiting to top 100 features for simplicity
tfidf_features = tfidf_vectorizer.fit_transform(spam['message']).toarray()
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
tfidf_df = pd.DataFrame(tfidf_features, columns=[f'tfidf_{name}' for name in tfidf_feature_names])

# Concatenate TF-IDF features with the original dataframe
spam = pd.concat([spam, tfidf_df], axis=1)

# Display the first few rows of the dataframe with new features
spam.head()
# Sanity check: the label column is unchanged after the concatenation
spam['label'].sum()
spam.columns
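The scenario above stresses that regular messages should rarely, if ever, be flagged as spam, which means the classifier needs very high precision on the spam class. A minimal modeling sketch, assuming scikit-learn is available; the model choice (multinomial naive Bayes on TF-IDF features) and the 0.9 decision threshold are illustrative assumptions, not a prescribed solution:

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score, recall_score

# Hold out a test set, stratified so both splits keep the ~13% spam rate
X_train, X_test, y_train, y_test = train_test_split(
    spam['message'], spam['label'], test_size=0.2, random_state=42, stratify=spam['label']
)

# Vectorize the raw text and fit a simple baseline model
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

model = MultinomialNB()
model.fit(X_train_tfidf, y_train)

# Only flag a message as spam when the model is very confident;
# raising the threshold trades spam recall for fewer false positives on ham
threshold = 0.9
spam_probs = model.predict_proba(X_test_tfidf)[:, 1]
predictions = (spam_probs >= threshold).astype(int)

print("Precision (spam):", precision_score(y_test, predictions))
print("Recall (spam):", recall_score(y_test, predictions))

Precision on the spam class directly measures the scenario's requirement: of all messages flagged as spam, how many really were spam. Reporting both precision and recall at the chosen threshold makes the trade-off explicit for the broad audience the report targets.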