SMS Spam Collection

This is a text corpus of over 5,500 English SMS messages, with ~13% labeled as spam. The text file contains one message per line with two columns: the label ("ham" or "spam") and the raw text of the message. Messages labeled "ham" are legitimate, non-spam messages.

Not sure where to begin? Scroll to the bottom to find challenges!

import pandas as pd

# The file has no header row, so name the two columns explicitly
spam = pd.read_csv("SMSSpamCollection.csv", header=None, names=["label", "message"])
print(spam.shape)
spam.head(100)

Source of dataset: this corpus was created by Tiago A. Almeida and José María Gómez Hidalgo.

Citations:

  • Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011.

  • Gómez Hidalgo, J.M., Almeida, T.A., Yamakami, A. On the Validity of a New SMS Spam Collection. Proceedings of the 11th IEEE International Conference on Machine Learning and Applications (ICMLA'12), Boca Raton, FL, USA, 2012.

  • Almeida, T.A., Gómez Hidalgo, J.M., Silva, T.P. Towards SMS Spam Filtering: Results under a New Dataset. International Journal of Information Security Science (IJISS), 2(1), 1-18, 2013.

Don't know where to start?

Challenges are brief tasks designed to help you practice specific skills:

  • 🗺️ Explore: What are the most common words in spam versus normal messages?
  • 📊 Visualize: Create a word cloud visualizing the most common words in the dataset.
  • 🔎 Analyze: What word is most likely to indicate that a message is spam?

Scenarios are broader questions to help you develop an end-to-end project for your portfolio:

You work for a telecom company that is launching a new messaging app. Unfortunately, the spam filters the company has used previously are out of date and no longer effective. They have asked whether you can use new data they have supplied to accurately distinguish between spam and regular messages. They have also stressed that it is essential that regular messages are rarely, if ever, categorized as spam.

You will need to prepare a report that is accessible to a broad audience. It should outline your motivation, steps, findings, and conclusions.
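One possible starting point for this scenario is a bag-of-words model with Multinomial Naive Bayes, evaluated with precision on the spam class, since flagging a regular message as spam is the costliest error here. This is a minimal sketch assuming scikit-learn is available; the tiny in-line sample merely stands in for the company's supplied data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score

# Tiny stand-in sample; replace with the full dataset in practice
messages = [
    "Congratulations! You have won a $1000 gift card. Claim now.",
    "Hey, are we still meeting for lunch tomorrow?",
    "URGENT! Your number was selected for a $500 prize. Call now.",
    "Can you call me back when you get a chance?",
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

# Convert messages to word-count vectors and fit the classifier
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)
model = MultinomialNB()
model.fit(X, labels)

# Precision on the spam class: of the messages flagged as spam,
# how many really were spam? Keeping this high protects ham.
# (Scored on the training sample here purely to illustrate the metric;
# a real report should use a held-out test split.)
predictions = model.predict(X)
print(precision_score(labels, predictions))
```

In practice you would also tune the decision threshold (via `predict_proba`) toward higher precision, accepting that some spam slips through in exchange for almost never misfiling a regular message.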

# Import necessary libraries
import pandas as pd
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from collections import Counter
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Load the dataset
# Assuming the dataset is a CSV file with columns 'message' and 'label' where 'label' is 'spam' or 'ham'
# df = pd.read_csv('messages.csv')
# For the purpose of this example, let's create a sample dataframe
data = {
    'message': [
        'Congratulations! You have won a $1000 Walmart gift card. Go to http://bit.ly/123456 to claim now.',
        'Hey, are we still meeting for lunch tomorrow?',
        'Free entry in 2 a weekly competition to win FA Cup final tickets. Text FA to 12345 to enter.',
        'Can you call me back when you get a chance?',
        'URGENT! Your mobile number has been selected for a $500 prize. Call 09012345678 now.'
    ],
    'label': ['spam', 'ham', 'spam', 'ham', 'spam']
}
df = pd.DataFrame(data)

# Function to preprocess text
def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\b\w{1,2}\b', '', text)  # Remove words of one or two characters
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = ' '.join([word for word in text.split() if word not in ENGLISH_STOP_WORDS])  # Remove stopwords
    return text

# Apply preprocessing
df['cleaned_message'] = df['message'].apply(preprocess_text)

# Separate spam and ham messages
spam_messages = df[df['label'] == 'spam']['cleaned_message']
ham_messages = df[df['label'] == 'ham']['cleaned_message']

# Get the most common words in spam and ham messages
spam_words = ' '.join(spam_messages).split()
ham_words = ' '.join(ham_messages).split()

spam_word_counts = Counter(spam_words)
ham_word_counts = Counter(ham_words)

# Create word clouds
spam_wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(spam_word_counts)
ham_wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(ham_word_counts)

# Plot the word clouds
plt.figure(figsize=(16, 8))

plt.subplot(1, 2, 1)
plt.imshow(spam_wordcloud, interpolation='bilinear')
plt.title('Spam Messages Word Cloud')
plt.axis('off')

plt.subplot(1, 2, 2)
plt.imshow(ham_wordcloud, interpolation='bilinear')
plt.title('Ham Messages Word Cloud')
plt.axis('off')

plt.show()

# Find the word whose spam frequency is highest relative to its ham frequency
# (picking the most common spam word alone would ignore how often it appears in ham)
spam_ratios = {w: c / (ham_word_counts.get(w, 0) + 1) for w, c in spam_word_counts.items()}
spam_indicator_word = max(spam_ratios, key=spam_ratios.get)
spam_indicator_word
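A more robust way to score "spaminess" than comparing raw counts is a smoothed log-odds ratio of each word's frequency in spam versus ham. This sketch uses toy counts in place of the full `spam_word_counts` / `ham_word_counts` from above; the smoothing constant `alpha` is an illustrative choice:

```python
import math
from collections import Counter

# Toy counts standing in for the corpus-wide spam/ham word counts
spam_counts = Counter({"free": 3, "prize": 2, "call": 2})
ham_counts = Counter({"call": 3, "lunch": 2, "tomorrow": 1})

def log_odds(word, spam_counts, ham_counts, alpha=1.0):
    """Smoothed log-odds of a word appearing in spam vs ham.

    Additive smoothing (alpha) keeps words seen in only one class
    from producing division by zero or infinite scores.
    """
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    p_spam = (spam_counts[word] + alpha) / (spam_total + alpha)
    p_ham = (ham_counts[word] + alpha) / (ham_total + alpha)
    return math.log(p_spam / p_ham)

# Rank the whole vocabulary and take the strongest spam indicator
vocab = set(spam_counts) | set(ham_counts)
best = max(vocab, key=lambda w: log_odds(w, spam_counts, ham_counts))
print(best)
```

Words with a large positive score lean toward spam, negative scores lean toward ham, and scores near zero (like "call" here, which appears in both classes) carry little signal.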