Python Bag of Words Model: A Complete Guide
Bag of Words (BoW) is a technique in Natural Language Processing (NLP). It is widely used to transform text into a machine-readable format, specifically numerical values, without considering grammar or word order. Understanding BoW is important for anyone working with text data, and Python provides multiple tools and libraries to implement it effectively.
In this tutorial, we'll dive into BoW, introduce its concepts, cover its uses, and walk through a detailed implementation in Python. By the end of this tutorial, you'll be able to apply the Bag of Words model to real-world problems. If you’re new to NLP, check out our Natural Language Processing in Python skill track to learn more.
What is Bag of Words?
Bag of Words is a technique for extracting features from text data for machine learning tasks such as text classification and sentiment analysis. This matters because machine learning algorithms can’t process raw text directly; the process of converting text into numbers is known as feature extraction or feature encoding.
A Bag of Words model is based on the occurrence of words in a document. The process starts with building the vocabulary of the text and counting how often each word occurs. It is called a bag because the order and structure of the words are not considered, only their occurrence.
The Bag of Words model is different from the Continuous Bag of Words (CBOW) model, which learns dense word embeddings by using surrounding words to predict a target word, capturing semantic relationships between words. CBOW requires training on a large corpus and produces low-dimensional vectors that are valuable for complex NLP applications where word context is important.
| Aspect | BoW | CBOW |
| --- | --- | --- |
| Purpose | Counts occurrences of each word | Predicts a target word based on context |
| Output Type | High-dimensional, sparse vector | Low-dimensional, dense embedding |
| Considers Context | No (ignores word order) | Yes (uses surrounding words) |
| Representation | Sparse frequency vector | Dense vector capturing semantics |
| Complexity | Low (no training required) | High (requires training on a large corpus) |
| Typical Applications | Text classification, sentiment analysis | Word embeddings, NLP tasks needing context |
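To make the contrast concrete, here is a minimal sketch of training CBOW embeddings with the gensim library. This is an illustration under our own assumptions (gensim is not used elsewhere in this tutorial, and a real model would need a much larger corpus); sg=0 selects the CBOW architecture.
from gensim.models import Word2Vec
# Toy tokenized corpus; CBOW normally needs far more text to learn useful embeddings
sentences = [
    ["bag", "of", "words", "ignores", "word", "order"],
    ["cbow", "uses", "surrounding", "words", "to", "predict", "a", "target", "word"],
]
# sg=0 selects the CBOW architecture (sg=1 would be skip-gram)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
# Each word now maps to a dense 50-dimensional vector
print(model.wv["words"].shape)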
Why Use Bag of Words?
Bag of Words is useful in many NLP tasks. Some reasons for its widespread use include:
- Feature extraction: It converts unstructured text data into structured data, which can be used as input to various machine learning algorithms.
- Simplicity and efficiency: BoW is computationally simple to implement, and works well for small to medium-sized text corpora.
- Document similarity: It can be used to calculate the similarity between text documents using techniques such as cosine similarity (see the short sketch after this list).
- Text classification: When combined with techniques like Naive Bayes, BoW is effective for text classification tasks such as spam detection and sentiment analysis.
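As a quick illustration of the document-similarity point above, here is a minimal sketch that builds BoW vectors with Scikit-learn's CountVectorizer (covered in detail later in this tutorial) and compares them with cosine_similarity. The three sentences are made up for the example.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
docs = [
    "Python is fun and powerful.",
    "Python is fun.",
    "Cats sleep all day.",
]
# Turn each document into a BoW vector, then compare the vectors pairwise
X = CountVectorizer().fit_transform(docs)
# 3x3 matrix of similarities; higher values mean more similar documents
print(cosine_similarity(X))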
However, there are also drawbacks, such as not considering semantics, word structure, or word ordering.
Steps to Implement Bag of Words in Python
To create a bag-of-words model, we take all the unique words in a corpus and create a column for each word. The rows represent the sentences. If a certain word exists in a sentence, the cell holds the number of times it appears; if the word doesn’t exist, it's represented by a 0. Each word column represents a single feature.
In the end, we obtain a sparse matrix, that is, a matrix in which most entries are zeros.
Data preprocessing
To create a Bag of Words model in Python, we need to take a few preprocessing steps. These steps include tokenization and removing stopwords.
Tokenization is the process of breaking down a piece of text into smaller units, typically words. You can perform tokenization using NLTK.
Stop words are common words in English, such as "the," "that," and "a," which usually contribute little to the meaning or sentiment of a sentence.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Download stopwords and tokenizer if you haven't already
nltk.download("punkt")
nltk.download("stopwords")
# Example sentence
sentence = "This is an example showing how to remove stop words from a sentence."
# Tokenize the sentence into words
words = word_tokenize(sentence)
# Get the list of stop words in English
stop_words = set(stopwords.words("english"))
# Remove stop words from the sentence
filtered_sentence = [word for word in words if word.lower() not in stop_words]
# Join the words back into a sentence
filtered_sentence = " ".join(filtered_sentence)
print(filtered_sentence)
Output:
example showing remove stop words sentence .
Creating a vocabulary
A vocabulary is a collection of unique words found in a corpus of text. Building a vocabulary involves gathering all the unique words from the corpus and counting their occurrences. This vocabulary is useful for various NLP tasks like language modeling, word embeddings, and text classification.
The code below creates a simple frequency distribution of the words in the corpus, useful for basic NLP tasks such as building a vocabulary or understanding text content:
- The corpus variable holds a few example sentences. In real applications, this would contain larger, more varied text data.
- vocab = defaultdict(int) simplifies word frequency counting by automatically initializing any new word with a count of 0, allowing direct incrementing without membership checks.
- Each sentence is tokenized by converting it to lowercase and extracting words with a regular expression. The pattern \b\w+\b matches words made of alphanumeric characters only, ignoring punctuation and other symbols.
- Each word’s count is updated in the vocab dictionary.
- The vocabulary is sorted by frequency in descending order, making it easy to see the most common words at the top, and is displayed for reference.
# Import the regular expressions module to help with text processing
import re
# Import defaultdict to easily handle word frequency counting
from collections import defaultdict
# Sample corpus of text - a small dataset of sentences to analyze
corpus = [
"Tokenization is the process of breaking text into words.",
"Vocabulary is the collection of unique words.",
"The process of tokenizing is essential in NLP.",
]
# Initialize a defaultdict with integer values to store word frequencies
# defaultdict(int) initializes each new key with a default integer value of 0
vocab = defaultdict(int)
# Loop through each sentence in the corpus to tokenize and normalize
for sentence in corpus:
# Convert the sentence to lowercase to ensure consistency in counting (e.g., 'Tokenization' and 'tokenization' are treated as the same word)
# Use regular expressions to find words composed of alphanumeric characters only
words = re.findall(r"\b\w+\b", sentence.lower())
# For each word found, increment its count in the vocab dictionary
for word in words:
vocab[word] += 1
# Convert the defaultdict vocab to a regular dictionary for easier handling and sorting
# Sort the dictionary by word frequency in descending order and convert it to a new dictionary
sorted_vocab = dict(sorted(vocab.items(), key=lambda x: x[1], reverse=True))
# Display the sorted vocabulary with each word and its frequency count
print("Vocabulary with Frequencies:", sorted_vocab)
Output:
Vocabulary with Frequencies: {'is': 3, 'the': 3, 'of': 3, 'process': 2, 'words': 2, 'tokenization': 1, 'breaking': 1, 'text': 1, 'into': 1, 'vocabulary': 1, 'collection': 1, 'unique': 1, 'tokenizing': 1, 'essential': 1, 'in': 1, 'nlp': 1}
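Because sorted_vocab is ordered by frequency, trimming it is one simple way to cap the vocabulary size by hand (a small optional step, not required for the rest of the tutorial):
# Keep only the 5 most frequent words as a crude way to limit vocabulary size
top_5 = dict(list(sorted_vocab.items())[:5])
print(top_5)  # {'is': 3, 'the': 3, 'of': 3, 'process': 2, 'words': 2}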
Manually building a vocabulary can be time-consuming, especially for large corpora. Scikit-learn's CountVectorizer automates this process and allows for more flexible text processing as we will see later.
Bag of Words Implementation Using Python (From Scratch)
Let’s start with a simple implementation of Bag of Words from scratch in Python. This will help you understand the building blocks and mechanics of how it works under the hood.
Manual implementation
Step 1: Preprocessing the Text Data
We'll start by defining a simple function to process text, including tokenization, lowercasing, and removing punctuation.
from collections import defaultdict
import string
# Sample text data: sentences
corpus = [
"Python is amazing and fun.",
"Python is not just fun but also powerful.",
"Learning Python is fun!",
]
# Function to preprocess text
def preprocess(text):
# Convert to lowercase
text = text.lower()
# Remove punctuation
text = text.translate(str.maketrans("", "", string.punctuation))
# Tokenize: split the text into words
tokens = text.split()
return tokens
# Apply preprocessing to the sample corpus
processed_corpus = [preprocess(sentence) for sentence in corpus]
print(processed_corpus)
Output:
[['python', 'is', 'amazing', 'and', 'fun'], ['python', 'is', 'not', 'just', 'fun', 'but', 'also', 'powerful'], ['learning', 'python', 'is', 'fun']]
Step 2: Build Vocabulary
Now we need to scan through all the documents and build a complete list of unique words: this is our vocabulary.
# Initialize an empty set for the vocabulary
vocabulary = set()
# Build the vocabulary
for sentence in processed_corpus:
vocabulary.update(sentence)
# Convert to a sorted list
vocabulary = sorted(list(vocabulary))
print("Vocabulary:", vocabulary)
Step 3: Calculate Word Frequencies and Vectorize
We'll now calculate the frequency of each word in the vocabulary for every document in the processed corpus.
def create_bow_vector(sentence, vocab):
vector = [0] * len(vocab) # Initialize a vector of zeros
for word in sentence:
if word in vocab:
idx = vocab.index(word) # Find the index of the word in the vocabulary
vector[idx] += 1 # Increment the count at that index
return vector
Applying this function to every sentence in the processed corpus gives us a Bag of Words representation for each document:
# Create BoW vector for each sentence in the processed corpus
bow_vectors = [create_bow_vector(sentence, vocabulary) for sentence in processed_corpus]
print("Bag of Words Vectors:")
for vector in bow_vectors:
print(vector)
Output:
Bag of Words Vectors:
[0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1]
[1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1]
[0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1]
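To make these vectors easier to read, you can pair each count with its vocabulary word (an optional check, not part of the model itself):
# Map each vocabulary word to its count in the first sentence
print(dict(zip(vocabulary, bow_vectors[0])))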
Using Scikit-learn’s CountVectorizer
Building a Bag of Words model manually is good for learning, but for production applications, you will want to use efficient, optimized libraries like Scikit-learn.
The class we use for this is CountVectorizer, imported from the sklearn.feature_extraction.text module. It handles tokenization for us, and one of its useful parameters is max_features, which sets the maximum number of words to keep in the Bag of Words model. Here we leave it at its default of None, meaning all words are kept.
After creating an instance of CountVectorizer, use the .fit_transform() method to build the Bag of Words model. Then use .toarray() to convert the result to a NumPy array that can be fed to a machine learning model.
Once fitted, CountVectorizer has built a vocabulary of feature indices: each word maps to a column of the output matrix (the words are ordered alphabetically), and each cell holds that word's count in the corresponding document.
from sklearn.feature_extraction.text import CountVectorizer
# Original corpus
corpus = [
"Python is amazing and fun.",
"Python is not just fun but also powerful.",
"Learning Python is fun!",
]
# Create a CountVectorizer Object
vectorizer = CountVectorizer()
# Fit and transform the corpus
X = vectorizer.fit_transform(corpus)
# Print the generated vocabulary
print("Vocabulary:", vectorizer.get_feature_names_out())
# Print the Bag-of-Words matrix
print("BoW Representation:")
print(X.toarray())
Output:
Vocabulary: ['also' 'amazing' 'and' 'but' 'fun' 'is' 'just' 'learning' 'not'
'powerful' 'python']
BoW Representation:
[[0 1 1 0 1 1 0 0 0 0 1]
[1 0 0 1 1 1 1 0 1 1 1]
[0 0 0 0 1 1 0 1 0 0 1]]
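If you want to shrink the vocabulary, CountVectorizer also accepts parameters such as max_features and stop_words. Here is a minimal sketch reusing the corpus above (the cap of 5 is an arbitrary choice for illustration):
# Keep at most the 5 most frequent terms and drop English stop words
vectorizer_small = CountVectorizer(max_features=5, stop_words="english")
X_small = vectorizer_small.fit_transform(corpus)
print("Reduced vocabulary:", vectorizer_small.get_feature_names_out())
print(X_small.toarray())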
Example: Applying Bag of Words
Let's now apply the BoW model, using Scikit-learn's CountVectorizer, to a small text corpus of three movie reviews to illustrate the entire process.
Here are the steps that we will take:
- CountVectorizer tokenizes the text, removes punctuation, and lowercases the words automatically.
- .fit_transform(corpus) converts the corpus into a document-term matrix, where each row represents a document and each column represents a word from the vocabulary.
- X_dense is the dense version of that matrix, showing the frequency of each word in each document.
from sklearn.feature_extraction.text import CountVectorizer
# Sample corpus of movie reviews
corpus = [
"I loved the movie, it was fantastic!",
"The movie was okay, but not great.",
"I hated the movie, it was terrible.",
]
# Initialize the CountVectorizer
vectorizer = CountVectorizer()
# Fit and transform the corpus to a document-term matrix
X = vectorizer.fit_transform(corpus)
# Convert the document-term matrix into a dense format (optional for visualization)
X_dense = X.toarray()
# Get the vocabulary (mapping of words to index positions)
vocab = vectorizer.get_feature_names_out()
# Print the vocabulary and document-term matrix
print("Vocabulary:", vocab)
print("Document-Term Matrix:\n", X_dense)
Output:
Vocabulary: ['but' 'fantastic' 'great' 'hated' 'it' 'loved' 'movie' 'not' 'okay' 'terrible' 'the' 'was']
Document-Term Matrix:
[[0 1 0 0 1 1 1 0 0 0 1 1] # First review: "I loved the movie, it was fantastic!"
 [1 0 1 0 0 0 1 1 1 0 1 1] # Second review: "The movie was okay, but not great."
[0 0 0 1 1 0 1 0 0 1 1 1]] # Third review: "I hated the movie, it was terrible."
Here is how we can interpret the above output:
- Each unique word in the corpus is assigned an index, and the words are ordered alphabetically. For example, "but" is at index 0, "fantastic" is at index 1, "movie" is at index 6, and so on.
- Each row in the document matrix represents a movie review, and each column corresponds to a word from the vocabulary. The values in the matrix represent the frequency of each word in that particular document.
- First Review: [0 1 0 0 1 1 1 0 0 0 1 1] indicates that:
- The word "fantastic" appears once (1 at index 1),
- The word "loved" appears once (1 at index 5),
- The word "movie" appears once (1 at index 6),
- The word "it" appears once (1 at index 4),
- And so on.
The BoW vector can be interpreted as follows:
- Each document is a vector of numbers representing word counts. The dimensions of the vector are equal to the size of the vocabulary. In this case, the vocabulary has 12 words, so each review is transformed into a 12-dimensional vector.
- Most words in each row are zeros because not every document contains every word from the vocabulary. Hence, BoW models are often sparse, that is, they have many zeroes.
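You can check this sparsity directly on the matrix returned by CountVectorizer. Here is a small optional check, where X is the sparse document-term matrix from the movie-review example above:
# X.nnz is the number of non-zero entries in the sparse matrix
n_cells = X.shape[0] * X.shape[1]
sparsity = 1 - X.nnz / n_cells
print(f"Sparsity: {sparsity:.1%} of the cells are zero")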
Advantages and Limitations of Bag of Words
Let’s now cover some of the advantages and limitations of the Bag of Words model.
Advantages
- Simple to implement and interpret: The Bag of Words model is one of the most straightforward text representation techniques, making it ideal for beginners. Its simplicity allows for fast implementation without the need for complex preprocessing or specialized models.
- Easy to use for text classification tasks: Bag of Words is well-suited for basic tasks like text classification, sentiment analysis, and spam detection. These tasks often don’t require sophisticated language models, so a BOW representation is sufficient and efficient.
Limitations
- Vocabulary size affects sparsity of representations: The larger the vocabulary, the more sparse and high-dimensional the representation becomes. This sparsity can make it harder for models to learn effectively and requires careful tuning of vocabulary size to avoid excessive computational costs.
- Produces sparse matrices that are computationally expensive: Since each document is represented by the frequency of each word in a potentially large vocabulary, the resulting matrices are often mostly zeros, which can be inefficient to store and process in machine learning pipelines. Sparse matrices consume significant memory and often require specialized tools and libraries for efficient storage and computation, especially with large datasets.
- Loses meaning and context: BOW disregards word order and sentence structure, which results in the loss of grammatical relationships and meaning. This limitation makes it less suitable for tasks where context, nuance, and word order matter, such as translation or sentiment detection in complex sentences.
The following strategies can be used to decrease the size of the vocabulary in the Bag of Words:
- Ignoring case.
- Removing punctuation.
- Removing stop words, that is, common words such as "the" and "a".
- Correcting misspelled words.
- Using stemming techniques to reduce words to their root form (see the short sketch below).
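Here is a minimal sketch of the last point using NLTK's PorterStemmer (the example words are arbitrary):
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
words = ["running", "runs", "ran", "easily", "fairly"]
# Stemming collapses related word forms, which shrinks the vocabulary
print([stemmer.stem(word) for word in words])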
Next Steps: Beyond Bag of Words
One limitation of the Bag of Words model is that it weights words purely by how often they appear, so very frequent but uninformative words can end up dominating the representation.
TF-IDF (Term Frequency-Inverse Document Frequency) is a solution to this problem, as it adjusts the weight of words based on how frequently they appear across all documents.
TF-IDF: An Extension to Bag of Words
Term Frequency (TF) represents the frequency of a term in a document. Inverse Document Frequency (IDF) reduces the impact of commonly occurring words across multiple documents. The TF-IDF score is calculated by multiplying the two metrics.
Consider a document containing 200 words, where the word love appears 5 times. The TF for love is then 5 / 200 = 0.025. Assuming we have one million documents and the word love occurs in one thousand of them, the inverse document frequency (IDF) is log(1,000,000 / 1,000) = 3, using a base-10 logarithm. The TF-IDF weight is the product of these quantities: 0.025 * 3 = 0.075.
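As a quick sanity check, here is that arithmetic as a minimal Python sketch (using a base-10 logarithm, as in the example above; note that Scikit-learn's TfidfVectorizer uses a smoothed, natural-log IDF, so its numbers will differ):
import math
term_count = 5          # occurrences of "love" in the document
doc_length = 200        # total words in the document
n_docs = 1_000_000      # documents in the corpus
docs_with_term = 1_000  # documents containing "love"
tf = term_count / doc_length               # 0.025
idf = math.log10(n_docs / docs_with_term)  # 3.0
print(tf * idf)                            # 0.075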
In Scikit-learn, this is relatively easy to calculate using the TfidfVectorizer class.
from sklearn.feature_extraction.text import TfidfVectorizer
# Sample corpus
corpus = [
"Python is amazing and fun.",
"Python is not just fun but also powerful.",
"Learning Python is fun!",
]
# Create the Tf-idf vectorizer
tfidf_vectorizer = TfidfVectorizer()
# Fit and transform the corpus
X_tfidf = tfidf_vectorizer.fit_transform(corpus)
# Show the Vocabulary
print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())
# Show the TF-IDF Matrix
print("TF-IDF Representation:")
print(X_tfidf.toarray())
Output:
Vocabulary: ['also' 'amazing' 'and' 'but' 'fun' 'is' 'just' 'learning' 'not'
'powerful' 'python']
TF-IDF Representation:
[[0. 0.57292883 0.57292883 0. 0.338381 0.338381
0. 0. 0. 0. 0.338381 ]
[0.40667606 0. 0. 0.40667606 0.24018943 0.24018943
0.40667606 0. 0.40667606 0.40667606 0.24018943]
[0. 0. 0. 0. 0.41285857 0.41285857
0. 0.69903033 0. 0. 0.41285857]]
The TF-IDF matrix implemented above gives you a weighted measure instead of raw frequencies.
While the Bag of Words model has its limitations, especially for larger and more complex datasets, it’s still an essential building block in many NLP applications. Understanding it will assist you when exploring more advanced models like word embeddings and Transformers.
From here, you could experiment with BoW in your projects, including spam detection, sentiment analysis, document clustering, and more.
If you want to go further beyond Bag of Words, you can explore methods like Word2Vec and GloVe, or deep learning models like BERT.
Final Thoughts
The Bag of Words technique is a fundamental technique used in Natural Language Processing. It serves as a simple yet effective way to convert unstructured text into numerical features usable by machine learning algorithms. In this tutorial, we’ve covered:
- What the Bag of Words (BoW) model is.
- The benefits of the Bag of Words model in building machine learning models.
- How to implement the Bag of Words model in Python.
- Advantages and limitations of Bag of Words.
- The theory and motivation behind the Bag of Words model.
- Introducing TF-IDF as an improvement to the traditional Bag of Words approach.
Check out our Natural Language Processing in Python skill track to dive deeper into natural language processing.
