Python Bag of Words Model: A Complete Guide
Bag of Words (BoW) is a technique in Natural Language Processing (NLP). It is widely used to transform text into a machine-readable format, specifically numerical values, without considering grammar or word order. Understanding BoW is important for anyone working with text data, and Python provides multiple tools and libraries to implement it effectively.
In this tutorial, we'll dive into BoW, introduce its concepts, cover its uses, and walk through a detailed implementation in Python. By the end of this tutorial, you'll be able to apply the Bag of Words model to real-world problems. If you’re new to NLP, check out our Natural Language Processing in Python skill track to learn more.
What is Bag of Words?
Bag of Words is a technique for extracting features from text data for machine learning tasks such as text classification and sentiment analysis. This matters because machine learning algorithms can’t process raw text directly; the process of converting text into numbers is known as feature extraction or feature encoding.
A Bag of Words model is based on the occurrence of words in a document. The process starts with building the vocabulary of the text and counting how often each word occurs. It is called a bag because the order and structure of the words are not considered, only their occurrence.
The Bag of Words model is different from the Continuous Bag of Words (CBOW) model, which learns dense word embeddings by using surrounding words to predict a target word, capturing semantic relationships between words. CBOW requires training on a large corpus and produces low-dimensional vectors that are valuable for complex NLP applications where word context is important.
| Aspect | BoW | CBOW |
| --- | --- | --- |
| Purpose | Counts occurrences of each word | Predicts a target word based on context |
| Output Type | High-dimensional, sparse vector | Low-dimensional, dense embedding |
| Considers Context | No (ignores word order) | Yes (uses surrounding words) |
| Representation | Sparse frequency vector | Dense vector capturing semantics |
| Complexity | Low (no training required) | High (requires training on a large corpus) |
| Typical Applications | Text classification, sentiment analysis | Word embeddings, NLP tasks needing context |
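To make the contrast concrete, here is a minimal sketch of training CBOW embeddings with the gensim library. This is an illustration under our own assumptions (gensim is not used elsewhere in this tutorial, and a real model would need a much larger corpus); sg=0 selects the CBOW architecture.
from gensim.models import Word2Vec
# Toy tokenized corpus; CBOW normally needs far more text to learn useful embeddings
sentences = [
    ["bag", "of", "words", "ignores", "word", "order"],
    ["cbow", "uses", "surrounding", "words", "to", "predict", "a", "target", "word"],
]
# sg=0 selects the CBOW architecture (sg=1 would be skip-gram)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
# Each word now maps to a dense 50-dimensional vector
print(model.wv["words"].shape)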
Why Use Bag of Words?
Bag of Words is useful in many NLP tasks. Some reasons for its widespread use include:
- Feature extraction: It converts unstructured text data into structured data, which can be used as input to various machine learning algorithms.
- Simplicity and efficiency: BoW is computationally simple to implement, and works well for small to medium-sized text corpora.
- Document similarity: It can be used to calculate the similarity between text documents using techniques such as cosine similarity (see the short sketch after this list).
- Text classification: When combined with techniques like Naive Bayes, BoW is effective for text classification tasks such as spam detection and sentiment analysis.
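As a quick illustration of the document-similarity point above, here is a minimal sketch that builds BoW vectors with Scikit-learn's CountVectorizer (covered in detail later in this tutorial) and compares them with cosine_similarity. The three sentences are made up for the example.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
docs = [
    "Python is fun and powerful.",
    "Python is fun.",
    "Cats sleep all day.",
]
# Turn each document into a BoW vector, then compare the vectors pairwise
X = CountVectorizer().fit_transform(docs)
# 3x3 matrix of similarities; higher values mean more similar documents
print(cosine_similarity(X))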
However, there are also drawbacks, such as not considering semantics, word structure, or word ordering.
Steps to Implement Bag of Words in Python
To create a bag-of-words model, we take all the unique words in a corpus and create a column for each word. The rows represent the sentences. If a certain word exists in a sentence, the cell holds the number of times it appears; if the word doesn’t exist, it's represented by a 0. Each word column represents a single feature.
In the end, we obtain a sparse matrix, that is, a matrix in which most entries are zeros.
Data preprocessing
To create a Bag of Words model in Python, we need to take a few preprocessing steps. These steps include tokenization and removing stopwords.
Tokenization is the process of breaking down a piece of text into smaller units, typically words. You can perform tokenization using NLTK.
Stop words are common words in English, such as "the," "that," and "a," which usually contribute little to the meaning or sentiment of a sentence.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Download stopwords and tokenizer if you haven't already
nltk.download("punkt")
nltk.download("stopwords")
# Example sentence
sentence = "This is an example showing how to remove stop words from a sentence."
# Tokenize the sentence into words
words = word_tokenize(sentence)
# Get the list of stop words in English
stop_words = set(stopwords.words("english"))
# Remove stop words from the sentence
filtered_sentence = [word for word in words if word.lower() not in stop_words]
# Join the words back into a sentence
filtered_sentence = " ".join(filtered_sentence)
print(filtered_sentence)
Output:
example showing remove stop words sentence .
Creating a vocabulary
A vocabulary is a collection of unique words found in a corpus of text. Building a vocabulary involves gathering all the unique words from the corpus and counting their occurrences. This vocabulary is useful for various NLP tasks like language modeling, word embeddings, and text classification.
The code below creates a simple frequency distribution of the words in the corpus, useful for basic NLP tasks such as building a vocabulary or understanding text content:
- The corpus variable holds a few example sentences. In real applications, this would contain larger, more varied text data.
- vocab = defaultdict(int) simplifies word frequency counting by automatically initializing any new word with a count of 0, allowing direct incrementing without membership checks.
- Each sentence is tokenized by converting it to lowercase and extracting words with a regular expression. The pattern \b\w+\b matches words made of alphanumeric characters only, ignoring punctuation and other symbols.
- Each word’s count is updated in the vocab dictionary.
- The vocabulary is sorted by frequency in descending order, making it easy to see the most common words at the top, and is displayed for reference.
# Import the regular expressions module to help with text processing
import re
# Import defaultdict to easily handle word frequency counting
from collections import defaultdict
# Sample corpus of text - a small dataset of sentences to analyze
corpus = [
"Tokenization is the process of breaking text into words.",
"Vocabulary is the collection of unique words.",
"The process of tokenizing is essential in NLP.",
]
# Initialize a defaultdict with integer values to store word frequencies
# defaultdict(int) initializes each new key with a default integer value of 0
vocab = defaultdict(int)
# Loop through each sentence in the corpus to tokenize and normalize
for sentence in corpus:
# Convert the sentence to lowercase to ensure consistency in counting (e.g., 'Tokenization' and 'tokenization' are treated as the same word)
# Use regular expressions to find words composed of alphanumeric characters only
words = re.findall(r"\b\w+\b", sentence.lower())
# For each word found, increment its count in the vocab dictionary
for word in words:
vocab[word] += 1
# Convert the defaultdict vocab to a regular dictionary for easier handling and sorting
# Sort the dictionary by word frequency in descending order and convert it to a new dictionary
sorted_vocab = dict(sorted(vocab.items(), key=lambda x: x[1], reverse=True))
# Display the sorted vocabulary with each word and its frequency count
print("Vocabulary with Frequencies:", sorted_vocab)
Output:
Vocabulary with Frequencies: {'is': 3, 'the': 3, 'of': 3, 'process': 2, 'words': 2, 'tokenization': 1, 'breaking': 1, 'text': 1, 'into': 1, 'vocabulary': 1, 'collection': 1, 'unique': 1, 'tokenizing': 1, 'essential': 1, 'in': 1, 'nlp': 1}
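Because sorted_vocab is ordered by frequency, trimming it is one simple way to cap the vocabulary size by hand (a small optional step, not required for the rest of the tutorial):
# Keep only the 5 most frequent words as a crude way to limit vocabulary size
top_5 = dict(list(sorted_vocab.items())[:5])
print(top_5)  # {'is': 3, 'the': 3, 'of': 3, 'process': 2, 'words': 2}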
Manually building a vocabulary can be time-consuming, especially for large corpora. Scikit-learn's CountVectorizer automates this process and allows for more flexible text processing as we will see later.
Bag of Words Implementation Using Python (From Scratch)
Let’s start with a simple implementation of Bag of Words from scratch in Python. This will help you understand the building blocks and mechanics of how it works under the hood.
Manual implementation
Step 1: Preprocessing the Text Data
We'll start by defining a simple function to process text, including tokenization, lowercasing, and removing punctuation.
from collections import defaultdict
import string
# Sample text data: sentences
corpus = [
"Python is amazing and fun.",
"Python is not just fun but also powerful.",
"Learning Python is fun!",
]
# Function to preprocess text
def preprocess(text):
# Convert to lowercase
text = text.lower()
# Remove punctuation
text = text.translate(str.maketrans("", "", string.punctuation))
# Tokenize: split the text into words
tokens = text.split()
return tokens
# Apply preprocessing to the sample corpus
processed_corpus = [preprocess(sentence) for sentence in corpus]
print(processed_corpus)
Output:
[['python', 'is', 'amazing', 'and', 'fun'], ['python', 'is', 'not', 'just', 'fun', 'but', 'also', 'powerful'], ['learning', 'python', 'is', 'fun']]
Step 2: Build Vocabulary
Now we need to scan through all the documents and build a complete list of unique words: this is our vocabulary.
# Initialize an empty set for the vocabulary
vocabulary = set()
# Build the vocabulary
for sentence in processed_corpus:
vocabulary.update(sentence)
# Convert to a sorted list
vocabulary = sorted(list(vocabulary))
print("Vocabulary:", vocabulary)
Step 3: Calculate Word Frequencies and Vectorize
We'll now calculate the frequency of each word in the vocabulary for every document in the processed corpus.
def create_bow_vector(sentence, vocab):
vector = [0] * len(vocab) # Initialize a vector of zeros
for word in sentence:
if word in vocab:
idx = vocab.index(word) # Find the index of the word in the vocabulary
vector[idx] += 1 # Increment the count at that index
return vector
Applying this function to every sentence in the processed corpus gives us a Bag of Words representation for each document:
# Create BoW vector for each sentence in the processed corpus
bow_vectors = [create_bow_vector(sentence, vocabulary) for sentence in processed_corpus]
print("Bag of Words Vectors:")
for vector in bow_vectors:
print(vector)
Output:
Bag of Words Vectors:
[0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1]
[1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1]
[0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1]
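To make these vectors easier to read, you can pair each count with its vocabulary word (an optional check, not part of the model itself):
# Map each vocabulary word to its count in the first sentence
print(dict(zip(vocabulary, bow_vectors[0])))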
Using Scikit-learn’s CountVectorizer
Building a Bag of Words model manually is good for learning, but for production applications, you will want to use efficient, optimized libraries like Scikit-learn.
The class we use for this is CountVectorizer, imported from the sklearn.feature_extraction.text module. It handles tokenization for us, and one of its useful parameters is max_features, which sets the maximum number of words to keep in the Bag of Words model. Here we leave it at its default of None, meaning all words are kept.
After creating an instance of CountVectorizer, use the .fit_transform() method to build the Bag of Words model. Then use .toarray() to convert the result to a NumPy array that can be fed to a machine learning model.
Once fitted, CountVectorizer has built a vocabulary of feature indices: each word maps to a column of the output matrix (the words are ordered alphabetically), and each cell holds that word's count in the corresponding document.
from sklearn.feature_extraction.text import CountVectorizer
# Original corpus
corpus = [
"Python is amazing and fun.",
"Python is not just fun but also powerful.",
"Learning Python is fun!",
]
# Create a CountVectorizer Object
vectorizer = CountVectorizer()
# Fit and transform the corpus
X = vectorizer.fit_transform(corpus)
# Print the generated vocabulary
print("Vocabulary:", vectorizer.get_feature_names_out())
# Print the Bag-of-Words matrix
print("BoW Representation:")
print(X.toarray())
Output:
Vocabulary: ['also' 'amazing' 'and' 'but' 'fun' 'is' 'just' 'learning' 'not'
'powerful' 'python']
BoW Representation:
[[0 1 1 0 1 1 0 0 0 0 1]
[1 0 0 1 1 1 1 0 1 1 1]
[0 0 0 0 1 1 0 1 0 0 1]]
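If you want to shrink the vocabulary, CountVectorizer also accepts parameters such as max_features and stop_words. Here is a minimal sketch reusing the corpus above (the cap of 5 is an arbitrary choice for illustration):
# Keep at most the 5 most frequent terms and drop English stop words
vectorizer_small = CountVectorizer(max_features=5, stop_words="english")
X_small = vectorizer_small.fit_transform(corpus)
print("Reduced vocabulary:", vectorizer_small.get_feature_names_out())
print(X_small.toarray())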
Example: Applying Bag of Words
Let's now apply the BoW model, using Scikit-learn's CountVectorizer, to a small text corpus of three movie reviews to illustrate the entire process.
Here are the steps that we will take:
- CountVectorizer tokenizes the text, removes punctuation, and lowercases the words automatically.
- .fit_transform(corpus) converts the corpus into a document-term matrix, where each row represents a document and each column represents a word from the vocabulary.
- X_dense is the dense version of that matrix, showing the frequency of each word in each document.
from sklearn.feature_extraction.text import CountVectorizer
# Sample corpus of movie reviews
corpus = [
"I loved the movie, it was fantastic!",
"The movie was okay, but not great.",
"I hated the movie, it was terrible.",
]
# Initialize the CountVectorizer
vectorizer = CountVectorizer()
# Fit and transform the corpus to a document-term matrix
X = vectorizer.fit_transform(corpus)
# Convert the document-term matrix into a dense format (optional for visualization)
X_dense = X.toarray()
# Get the vocabulary (mapping of words to index positions)
vocab = vectorizer.get_feature_names_out()
# Print the vocabulary and document-term matrix
print("Vocabulary:", vocab)
print("Document-Term Matrix:\n", X_dense)
Output:
Vocabulary: ['but' 'fantastic' 'great' 'hated' 'it' 'loved' 'movie' 'not' 'okay' 'terrible' 'the' 'was']
Document-Term Matrix:
[[0 1 0 0 1 1 1 0 0 0 1 1] # First review: "I loved the movie, it was fantastic!"
 [1 0 1 0 0 0 1 1 1 0 1 1] # Second review: "The movie was okay, but not great."
[0 0 0 1 1 0 1 0 0 1 1 1]] # Third review: "I hated the movie, it was terrible."
Here is how we can interpret the above output:
- Each unique word in the corpus is assigned an index, and the words are ordered alphabetically. For example, "but" is at index 0, "fantastic" is at index 1, "movie" is at index 6, and so on.
- Each row in the document matrix represents a movie review, and each column corresponds to a word from the vocabulary. The values in the matrix represent the frequency of each word in that particular document.
- First Review: [0 1 0 0 1 1 1 0 0 0 1 1] indicates that:
- The word "fantastic" appears once (1 at index 1),
- The word "loved" appears once (1 at index 5),
- The word "movie" appears once (1 at index 6),
- The word "it" appears once (1 at index 4),
- And so on.
The BoW vector can be interpreted as follows:
- Each document is a vector of numbers representing word counts. The dimensions of the vector are equal to the size of the vocabulary. In this case, the vocabulary has 12 words, so each review is transformed into a 12-dimensional vector.
- Most words in each row are zeros because not every document contains every word from the vocabulary. Hence, BoW models are often sparse, that is, they have many zeroes.
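You can check this sparsity directly on the matrix returned by CountVectorizer. Here is a small optional check, where X is the sparse document-term matrix from the movie-review example above:
# X.nnz is the number of non-zero entries in the sparse matrix
n_cells = X.shape[0] * X.shape[1]
sparsity = 1 - X.nnz / n_cells
print(f"Sparsity: {sparsity:.1%} of the cells are zero")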
Advantages and Limitations of Bag of Words
Let’s now cover some of the advantages and limitations of the Bag of Words model.
Advantages
- Simple to implement and interpret: The Bag of Words model is one of the most straightforward text representation techniques, making it ideal for beginners. Its simplicity allows for fast implementation without the need for complex preprocessing or specialized models.
- Easy to use for text classification tasks: Bag of Words is well-suited for basic tasks like text classification, sentiment analysis, and spam detection. These tasks often don’t require sophisticated language models, so a BOW representation is sufficient and efficient.
Limitations
- Vocabulary size affects sparsity of representations: The larger the vocabulary, the more sparse and high-dimensional the representation becomes. This sparsity can make it harder for models to learn effectively and requires careful tuning of vocabulary size to avoid excessive computational costs.
- Produces sparse matrices that are computationally expensive: Since each document is represented by the frequency of each word in a potentially large vocabulary, the resulting matrices are often mostly zeros, which can be inefficient to store and process in machine learning pipelines. Sparse matrices consume significant memory and often require specialized tools and libraries for efficient storage and computation, especially with large datasets.
- Loses meaning and context: BOW disregards word order and sentence structure, which results in the loss of grammatical relationships and meaning. This limitation makes it less suitable for tasks where context, nuance, and word order matter, such as translation or sentiment detection in complex sentences.
The following strategies can be used to decrease the size of the vocabulary in the Bag of Words:
- Ignoring case.
- Removing punctuation.
- Removing stop words, that is, common words such as "the" and "a".
- Correcting misspelled words.
- Using stemming techniques to reduce words to their root form (see the short sketch below).
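Here is a minimal sketch of the last point using NLTK's PorterStemmer (the example words are arbitrary):
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
words = ["running", "runs", "ran", "easily", "fairly"]
# Stemming collapses related word forms, which shrinks the vocabulary
print([stemmer.stem(word) for word in words])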
Next Steps: Beyond Bag of Words
One limitation of the Bag of Words model is that it weights words purely by how often they appear, so very frequent but uninformative words can end up dominating the representation.
TF-IDF (Term Frequency-Inverse Document Frequency) is a solution to this problem, as it adjusts the weight of words based on how frequently they appear across all documents.
TF-IDF: An Extension to Bag of Words
Term Frequency (TF) represents the frequency of a term in a document. Inverse Document Frequency (IDF) reduces the impact of commonly occurring words across multiple documents. The TF-IDF score is calculated by multiplying the two metrics.
Consider a document containing 200 words, where the word love appears 5 times. The TF for love is then 5 / 200 = 0.025. Assuming we have one million documents and the word love occurs in one thousand of them, the inverse document frequency (IDF) is log(1,000,000 / 1,000) = 3, using a base-10 logarithm. The TF-IDF weight is the product of these quantities: 0.025 * 3 = 0.075.
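As a quick sanity check, here is that arithmetic as a minimal Python sketch (using a base-10 logarithm, as in the example above; note that Scikit-learn's TfidfVectorizer uses a smoothed, natural-log IDF, so its numbers will differ):
import math
term_count = 5          # occurrences of "love" in the document
doc_length = 200        # total words in the document
n_docs = 1_000_000      # documents in the corpus
docs_with_term = 1_000  # documents containing "love"
tf = term_count / doc_length               # 0.025
idf = math.log10(n_docs / docs_with_term)  # 3.0
print(tf * idf)                            # 0.075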
In Scikit-learn, this is relatively easy to calculate using the TfidfVectorizer class.
from sklearn.feature_extraction.text import TfidfVectorizer
# Sample corpus
corpus = [
"Python is amazing and fun.",
"Python is not just fun but also powerful.",
"Learning Python is fun!",
]
# Create the Tf-idf vectorizer
tfidf_vectorizer = TfidfVectorizer()
# Fit and transform the corpus
X_tfidf = tfidf_vectorizer.fit_transform(corpus)
# Show the Vocabulary
print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())
# Show the TF-IDF Matrix
print("TF-IDF Representation:")
print(X_tfidf.toarray())
Output:
Vocabulary: ['also' 'amazing' 'and' 'but' 'fun' 'is' 'just' 'learning' 'not'
'powerful' 'python']
TF-IDF Representation:
[[0. 0.57292883 0.57292883 0. 0.338381 0.338381
0. 0. 0. 0. 0.338381 ]
[0.40667606 0. 0. 0.40667606 0.24018943 0.24018943
0.40667606 0. 0.40667606 0.40667606 0.24018943]
[0. 0. 0. 0. 0.41285857 0.41285857
0. 0.69903033 0. 0. 0.41285857]]
The TF-IDF matrix implemented above gives you a weighted measure instead of raw frequencies.
While the Bag of Words model has its limitations, especially for larger and more complex datasets, it’s still an essential building block in many NLP applications. Understanding it will assist you when exploring more advanced models like word embeddings and Transformers.
From here, you could experiment with BoW in your projects, including spam detection, sentiment analysis, document clustering, and more.
If you want to go further beyond Bag of Words, you can explore methods like Word2Vec and GloVe, or deep learning models like BERT.
Final Thoughts
The Bag of Words technique is a fundamental technique used in Natural Language Processing. It serves as a simple yet effective way to convert unstructured text into numerical features usable by machine learning algorithms. In this tutorial, we’ve covered:
- What the Bag of Words (BoW) model is.
- The benefits of the Bag of Words model in building machine learning models.
- How to implement the Bag of Words model in Python.
- Advantages and limitations of Bag of Words.
- The theory and motivation behind the Bag of Words model.
- Introducing TF-IDF as an improvement to the traditional Bag of Words approach.
Check out our Natural Language Processing in Python skill track to dive deeper into natural language processing.
