MongoDB Tutorial: How to Set Up and Query MongoDB Databases
You’ve probably heard of or even worked with relational databases. The row-table format is the most popular and intuitive structure to store information. Unfortunately, you can’t store all the data that comes your way in rows and tables. In fact, so many problems in the real world require non-relational databases. So, are there alternatives?
The answer is YES! There are four types of databases that don’t have rows or tables. They are called NoSQL databases, as you can’t use SQL to query them. They are:
- Key-value databases
- Document databases
- Column family databases
- Graph databases
This article focuses on document databases and how to use a server called MongoDB. But before we jump into the technical details, let’s look at the use cases of document databases. You can check out our separate guide to graph databases for more information on them.
When to Use Document Databases?
One of the main use cases for choosing document databases is when you have data that doesn’t neatly fit into a predefined schema like a table. Many processes or applications in industries store these types of data. Here are some examples:
- Web and mobile apps: User profiles, preferences, content and interactions
- Content management systems: Storing a wider range of media such as text, images, video, GIFs, etc.
- E-commerce platforms: Product catalogs, customer information, order history, inventory, etc.
- Gaming: storing player profiles, leaderboard rankings
- Logging and data collection: large volumes of logs, events, metrics for analysis, etc.
Take a moment to think about how data collected from these industries would fit into tables. For example, e-commerce platforms would have difficulty storing product catalogs into a predefined schema. Different products have different attributes or, worse, different numbers of attributes. Do you need 10 columns to store 10 physical attributes of drones of 100 different brands or just 5–6 to store book information?
Table-based databases can’t help you in such scenarios. By using document databases such as MongoDB, you gain the following benefits:
- No upfront development cost to design a schema
- Documents (data) can vary over time (including the data types, the number of attributes, etc.)
- Related data is usually embedded in the same document, so many queries avoid joins and can run faster
- Intuitive for developers as document databases are mostly large JSON files, which are basically humongous dictionaries for Pythonistas.
- Document DBs scale horizontally, which means they can grow by adding more servers instead of requiring an ever more powerful single machine.
Now, let’s take a look at the core concepts around document databases and MongoDB.
Core Concepts Around MongoDB
I’ve been saying document databases over and over up to this point, but what actually are they? Here are the main concepts:
- Documents: data is stored in objects called documents. In simple terms, documents are similar to JSON key-value objects. A single document is equivalent to a row in a table. If you are a Python programmer, you can think of a document as a dictionary, and you would be OK. Just remember that documents can have nested documents, which is one of the main features of document databases.
- Collections: collections are like tables in relational databases, but instead of containing rows, they contain, you guessed it, documents. Collections are massive data structures that may contain thousands or millions of dictionaries.
- Schema-less: Here is where it gets interesting. A collection can have documents of varying sizes. For example, document 1 can have 10 key-value pairs while document 2 can have 15 (as long as the information is about the same category/topic, like products for the collection to make sense).
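To make these three concepts concrete, here is a minimal sketch in plain Python, with no MongoDB server involved. The two product documents and all of their field names and values are hypothetical, chosen only to illustrate nesting and varying fields.

```python
# Two hypothetical product documents; every name and value is illustrative.
book = {
    "name": "Example Book",
    "category": "books",
    "attributes": {"author": "Jane Doe", "pages": 320},
}
drone = {
    "name": "Example Drone",
    "category": "drones",
    "attributes": {
        "weight_g": 249,
        # A nested document inside another nested document
        "battery": {"capacity_mah": 2250, "cells": 3},
    },
    "tags": ["fpv", "racing"],
}

# A collection behaves much like a list of such dictionaries.
products = [book, drone]

# Documents in one collection do not have to share the same fields.
only_in_drone = set(drone) - set(book)
print(only_in_drone)
```

Both dictionaries could live in the same collection even though only one of them has a tags field.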
Here is a great table summarizing the differences between relational databases and document databases:
Feature | Document Databases (e.g., MongoDB) | Relational Databases (e.g., MySQL, PostgreSQL) |
---|---|---|
Data Structure | Stores data as documents (e.g., JSON, BSON), allowing for flexible, hierarchical structures. | Stores data in tables with rows and columns, following a predefined schema. |
Schema Flexibility | Schema-less: Documents can have varying structures, allowing different fields and data types. | Fixed schema: Requires a predefined schema with specific columns and data types. |
Query Language | Uses MongoDB Query Language (MQL) or similar, which is object-based and more flexible. | Uses SQL (Structured Query Language) for querying structured data. |
Joins | Avoids joins by embedding related data inside documents (denormalization). | Supports complex joins across tables (normalization). |
Performance | Faster reads and writes for unstructured or semi-structured data. Avoids overhead from joins. | Strong performance for structured data, but joins can slow down queries. |
Scalability | Horizontally scalable: Can distribute data across multiple servers using sharding. | Typically vertically scalable: Relies on more powerful hardware, though some support horizontal scaling (e.g., with partitions). |
Transaction Support | Supports multi-document ACID transactions (from MongoDB 4.0 and above), but was initially designed for non-transactional operations. | Full support for ACID-compliant transactions, providing strong consistency and reliability. |
Use Cases | Best for unstructured or semi-structured data like user profiles, logs, catalogs, and flexible data structures. | Ideal for structured data with clear relationships, such as financial records or enterprise resource planning (ERP). |
Data Relationships | Supports embedded data (denormalization), which makes it easy to retrieve related information from a single query. | Relational databases rely on foreign keys to establish relationships across tables (normalization). |
Indexing | Supports indexing but lacks the variety and sophistication of indexing available in relational databases. | Strong indexing capabilities, supporting multiple index types (e.g., B-tree, hash) for better performance optimization. |
Consistency | Provides eventual consistency in distributed setups but also offers strong consistency when needed (via ACID transactions). | Ensures strong consistency in most cases due to ACID transactions and relational integrity. |
Scaling Data Volume | Easily scales to accommodate large amounts of data by adding servers (sharding). | Can scale vertically, but horizontal scaling requires more complex configuration (e.g., partitioning). |
Data Integrity | Data integrity is managed within each document, but managing relationships between documents can be more challenging. | Strong built-in support for data integrity through primary and foreign keys, as well as constraints like UNIQUE and NOT NULL. |
Developer Friendliness | Developer-friendly: Flexible data modeling, works well with modern applications (e.g., JSON, REST APIs). | Rigid data modeling but well-understood by developers familiar with SQL and structured data. |
Let’s actually start working with documents in MongoDB!
MongoDB Setup: Connecting to Data Sources
To query document databases, we need to install the MongoDB server. Here are the platform-specific instructions:
- For Windows, follow the instructions on this link.
- For Debian-based Linux distributions such as Ubuntu, you can install MongoDB from the terminal:
$ sudo apt-get install -y mongodb
Then, inside a virtual environment, install the pymongo and requests libraries. pymongo is the official Python driver for the MongoDB server. We will need the requests library to pull data from an API.
$ pip install pymongo
$ pip install requests
Then, from the terminal, start the MongoDB server with the following command:
$ sudo service mongodb start
Now, we are ready to load some data into a document database. There are two scenarios when doing so:
- You have data locally in appropriate formats like JSON, BSON, YAML, or XML.
- You need to pull the data from external sources, typically APIs.
We will cover both. First, let’s load a local file named drone_races.json into a collection. Here is the snippet to do so:
import json
from pymongo import MongoClient
# Establish connection to MongoDB
client = MongoClient("localhost", 27017)
# Create a database named "drones"
drones = client["drones"]
# Create a collection named "races"
races = drones["races"]
# Load dataset into MongoDB
with open("data/drone_races.json", "r") as file:
    data = json.load(file)
races.insert_many(data)
The two most important objects for us are drones (a database) and races (a collection). Most of the functions and methods will be related to collections. Database objects are mostly used for managing collections.
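The local loading step can also be wrapped in a small helper. The sketch below is one possible shape, not the tutorial's own code: load_json_file is a hypothetical name, and it assumes the file holds either a single JSON object or a list of objects. It relies only on the collection's insert_many method and the inserted_ids attribute of its result.

```python
import json


def load_json_file(collection, path):
    """Insert the documents stored in a JSON file; return how many were inserted."""
    with open(path, "r", encoding="utf-8") as file:
        data = json.load(file)

    # Accept either a single JSON object or a list of objects.
    documents = data if isinstance(data, list) else [data]
    if not documents:
        return 0  # nothing to insert, so skip the call entirely

    result = collection.insert_many(documents)
    return len(result.inserted_ids)
```

With the objects defined earlier, a call would look like load_json_file(races, "data/drone_races.json"). Keep in mind that running an insert twice stores the documents twice, since each run creates new _id values.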
Now, let’s see how to load the same data using an API. I’ve stored the information as an API using a service called Mockaroo. Here is the snippet:
import requests
from pymongo import MongoClient

# Fetch data from the API
api_url = (
    "https://my.api.mockaroo.com/drone_race_matches.json?key=6f5a6b50"
)
response = requests.get(api_url)
if response.status_code == 200:
    data = response.json()  # Get the JSON data from the API
    # Establish a connection to MongoDB
    client = MongoClient()
    # Access or create a specific database
    drones = client["drones"]
    # Access or create a specific collection within the database
    races = drones["races"]
    # Insert the fetched data into the MongoDB collection
    races.insert_many(data)
else:
    print("Failed to fetch data from the API.")
We’ve loaded some data into the races collection of the drones document database, or did we? Let's check by using queries!
Elementary MongoDB Queries
Counting documents in MongoDB
To find out whether a collection contains any data, we need to count the documents inside it. We will use the count_documents method to do so:
>>> races.count_documents({})
9040
Notice the empty dictionary passed to count_documents. That dictionary is called a filter in MongoDB. As we go through the tutorial, we will learn how to fill the dictionary to create different kinds of filters. Right now, we have no filter. The above code is the same as SELECT COUNT(*) FROM table_name in SQL.
We’ve got 9040 documents — yay! Now, let’s look at some data.
Extracting one document in MongoDB
To look at one document with pymongo, we can use the find_one method:
from pprint import pprint
>>> pprint(races.find_one())
{'_id': ObjectId('659d31e9255ec0cf4bab529d'),
 'laps': 3,
 'league': 'F1 Drones',
 'location': {'city': 'Ford',
              'country': 'United Kingdom',
              'date': 'error: invalid date "2024-10-25"',
              'venue': 'Manhattan Seas'},
 'name': 'Honorable',
 'pilots': {'drone': 'DJI3-old',
            'finishing_position': 66,
            'name': 'Kariotta Cow',
            'qualification_time': 27.39,
            'team': 'Sky Crusaders',
            'telemetry': {'altitude': 34.3,
                          'battery_voltage': 12.1,
                          'speed': 68.3,
                          'timestamp': 'error: invalid date '
                                       '"2024-10-25T14:09:26Z"'}},
 'sponsors': ['Fat Shark', 'DJI', 'Etisalat'],
 'weather_conditions': 'snowy'}
Take note of the fields (keys) of this document. It stores information about a single drone race and includes information such as:
- The number of laps
- Weather conditions at the time of the match
- The pilots (not all of them) during the match
- Match sponsors and so on.
The document also has a required _id field, which holds a unique identifier. If you don't supply one, an ObjectId is generated automatically on insert.
Selecting all documents in MongoDB
count_documents always returns a number, but sometimes, we want to look at the data our query matches. To do so, we can use the big brother of find_one, which is find:
from pprint import pprint
for race in races.find():
    pprint(race)
    break

{'_id': ObjectId('659d31e9255ec0cf4bab529d'),
 'laps': 3,
 'league': 'F1 Drones',
 'location': {'city': 'Ford',
              'country': 'United Kingdom',
              'date': 'error: invalid date "2024-10-25"',
              'venue': 'Manhattan Seas'},
 'name': 'Honorable',
 'pilots': {'drone': 'DJI3-old',
            'finishing_position': 66,
            'name': 'Kariotta Cow',
            'qualification_time': 27.39,
            'team': 'Sky Crusaders',
            'telemetry': {'altitude': 34.3,
                          'battery_voltage': 12.1,
                          'speed': 68.3,
                          'timestamp': 'error: invalid date '
                                       '"2024-10-25T14:09:26Z"'}},
 'sponsors': ['Fat Shark', 'DJI', 'Etisalat'],
 'weather_conditions': 'snowy'}
find with an empty query (no arguments) returns a cursor that yields every document one by one, but that's not what we want! We want to perform queries so that we can answer interesting questions about our data. This is where filter documents will prove useful.
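Instead of looping and breaking after the first document, the cursor returned by find can be capped with its limit method. The helper below is a minimal sketch under that assumption: first_n is a hypothetical name, and collection stands for any pymongo collection such as races.

```python
def first_n(collection, n, criteria=None):
    """Return up to n documents matching criteria as a list."""
    # An empty filter matches every document in the collection.
    cursor = collection.find(criteria or {})
    # limit() caps how many documents the cursor will yield.
    return list(cursor.limit(n))
```

A call such as first_n(races, 3) would return the first three documents as a list that is convenient to print or inspect. Note that in MongoDB a limit of 0 means "no limit", so n should be at least 1.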
Selecting based on a condition in MongoDB
Let’s start with the simplest filters — matching documents where some field equals some value. This would be the same as:
SELECT *
FROM table_name
WHERE field = value
Let’s do it in MongoDB:
criteria = {"sponsors": "Fat Shark"}
fat_shark_races = races.count_documents(criteria)
fat_shark_races
6194
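Above, we are choosing the races with "Fat Shark" among their sponsors. The syntax is simply a dictionary that maps the sponsors field to "Fat Shark". Because sponsors is an array, the filter matches every document whose array contains that value.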
MongoDB query language wouldn’t be a language if it didn’t have some common inequality operators. Here is how to use the “less than” operator:
criteria = {"pilots.qualification_time": {"$lt": 10}}
quick_races = races.count_documents(criteria)
quick_races
3061
The above query introduces four new features of the MongoDB query language (MQL):
- You can access sub-fields with dot notation. pilots.qualification_time extracts the nested qualification time within the pilots field.
- Almost all operators in MQL start with a dollar sign.
- Operators are used in a nested document like above.
- $lt is the "less than" operator.
So, the result of this query tells us that there were 3061 matches where one pilot had less than 10 seconds of qualification time. This query was possible with the $lt operator. Here are its brothers and sisters:
- $lte: less than or equal
- $gt: greater than
- $gte: greater than or equal
They have the same syntax as $lt.
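Comparison operators can also be combined on one field to express a range. The sketch below builds such a filter in plain Python. The helper name between and the bounds 10 and 20 are illustrative assumptions, not values taken from the dataset.

```python
def between(field, low, high):
    """Build a filter matching low <= field <= high."""
    # Several operators inside one nested document must all hold for a match.
    return {field: {"$gte": low, "$lte": high}}


criteria = between("pilots.qualification_time", 10, 20)
print(criteria)
```

The resulting dictionary can be passed to races.count_documents(criteria) or races.find(criteria) like any other filter.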
Selecting with logical conditional operators in MongoDB
MQL also includes logical conditional operators such as $and and $or. Let's start with the latter.
We will retrieve races with either the United Kingdom as the location or Etisalat as the sponsor:
criteria = {
    "$or": [
        {"location.country": "United Kingdom"},
        {"sponsors": "Etisalat"},
    ]
}
>>> races.count_documents(criteria)
6223
There are 6223 documents matching our criteria. To use an OR logic for multiple values for the same field, we can use the $in operator.
For example, we can check for bad weather conditions the following way:
criteria = {
    "weather_conditions": {"$in": ["rainy", "snowy", "cloudy"]}
}
>>> races.count_documents(criteria)
5508
This query would have been a pain to write with the $or operator. Now, onto $and.
This time, we want to find the races with Australia as the location AND Fat Shark as the sponsor. Here is how we can do it with $and:
criteria = {
    "$and": [
        {"location.country": "Australia"},
        {"sponsors": "Fat Shark"},
    ]
}
>>> races.count_documents(criteria)
193
But in practice, you will rarely use $and as it can be implemented in a much simpler way:
criteria = {
    "location.country": "Australia",
    "sponsors": "Fat Shark",
}
races.count_documents(criteria)
193
Just add more key-value pairs to the filter document to achieve the AND logical operator.
Finally, there is the $nin operator, which checks for non-membership. For example, we can return all matches that weren't held in the United States, United Kingdom, or Australia:
criteria = {
    "location.country": {
        "$nin": ["United States", "United Kingdom", "Australia"]
    }
}
>>> races.count_documents(criteria)
126
This only leaves the United Arab Emirates as the country, so the above query could actually be written as:
criteria = {"location.country": "United Arab Emirates"}
>>> races.count_documents(criteria)
126
But, you get the idea.
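Earlier in this section, $and was described as rarely needed. One case where it does help is when the same operator key has to appear twice, because a Python dictionary cannot hold two "$or" keys. The sketch below illustrates this with field values taken from the examples above; the variable names are hypothetical.

```python
country_clause = {
    "$or": [
        {"location.country": "Australia"},
        {"location.country": "United Kingdom"},
    ]
}
sponsor_clause = {
    "$or": [
        {"sponsors": "DJI"},
        {"sponsors": "Etisalat"},
    ]
}

# Merging the two clauses into one dictionary silently drops the first,
# because both use the same "$or" key.
merged = {**country_clause, **sponsor_clause}
print(len(merged))

# Wrapping both clauses in "$and" keeps them side by side.
criteria = {"$and": [country_clause, sponsor_clause]}
```

Passing criteria to races.count_documents would count races held in either country that also list either sponsor.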
Querying for null or missing values in MongoDB
Checking for null or missing values is a universal operation in all data analysis tasks. As such, there is an operator for that in MongoDB: $exists. Here are two examples that check whether a certain field exists:
criteria = {"location.district": {"$exists": True}}
>>> races.count_documents(criteria)
0
Hmm, it turns out the district field doesn't exist in any of the documents. The laps field, however, must exist in all documents as it is a key piece of information about races.
criteria = {"laps": {"$exists": True}}
races.count_documents(criteria)
9040
As expected, all documents have the laps field. But what about fields that exist but have a null value? We can check that too:
criteria = {"pilots.finishing_position": None}
races.count_documents(criteria)
0
By using the built-in None object in Python, we can check any field's value for missingness.
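One subtlety worth knowing: in MongoDB, a None (null) filter matches documents where the field is explicitly null as well as documents where the field is absent altogether. The three filters below sketch how the cases can be told apart. The variable names are hypothetical, and the field is simply reused from the example above.

```python
field = "pilots.finishing_position"

# Matches documents where the field is null OR absent.
null_or_missing = {field: None}

# Matches only documents where the field is absent.
missing_only = {field: {"$exists": False}}

# Matches only documents where the field is present with a null value
# (10 is the BSON type number for null).
null_only = {field: {"$type": 10}}
```

Each dictionary can be passed to races.count_documents to see how many documents fall into each case.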
There are some advanced scenarios that require null or existence checks, as well. For example, you may want to check whether certain elements of some massive nested arrays exist.
To do this, we can use array indexing syntax in MQL. For instance, to find the races with only one sponsor, we need to check whether the second element of the sponsors array exists:
# Counting starts with 0 as always
criteria = {"sponsors.1": {"$exists": False}}
races.count_documents(criteria)
2929
And it is as easy as appending the index number of the element to the key. So, in our collection, almost 3000 races were sponsored by only one entity.
This array indexing syntax works for many other operators, not just $exists.
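As a sketch of how the index syntax extends beyond the single-sponsor check, the filters below use it with plain equality and with a small hypothetical helper, has_at_least. The $size operator is shown as an alternative for exact array lengths. Note that checking "sponsors.1" for non-existence would also match documents with an empty or missing sponsors array, whereas $size: 1 matches only arrays with exactly one element.

```python
def has_at_least(field, n):
    """Build a filter for arrays holding at least n elements (n >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Element n - 1 exists only when the array has n or more elements.
    return {f"{field}.{n - 1}": {"$exists": True}}


# Races whose first listed sponsor is DJI
first_sponsor_is_dji = {"sponsors.0": "DJI"}

# Races with three or more sponsors
three_or_more = has_at_least("sponsors", 3)

# Races whose sponsors array has exactly one element
exactly_one = {"sponsors": {"$size": 1}}
```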
Projections (restricting fields)
One last thing we are going to cover in the tutorial is projections. Up until this point, our query results included every single field in each document. This is not ideal when your documents have hundreds of fields. Imagine the eye-sore of outputs when you print them!
So, to choose the fields we want returned, we can use projections. Here is how:
criteria = {"pilots.telemetry.speed": {"$gte": 20}}
projection = {
    "sponsors": 1,
    "location.country": 1,
    "pilots.telemetry.speed": 1,
    "pilots.name": 1,
}
fast_pilots = races.find(criteria, projection)
for pilot in fast_pilots:
    pprint(pilot)
    break
In the above case, we are writing our filter criteria as usual, but this time, we are defining another document with four fields set to 1. If we pass this projection document as the second argument to find or find_one, we will only get the fields set to 1 in the output. (count_documents returns a number, so it does not accept a projection.)
{'_id': ObjectId('659d31e9255ec0cf4bab529d'),
 'location': {'country': 'United Kingdom'},
 'pilots': {'name': 'Kariotta Cow', 'telemetry': {'speed': 68.3}},
 'sponsors': ['Fat Shark', 'DJI', 'Etisalat']}
Even though we chose only four fields, the pesky _id field got squeezed in somehow. To suppress this behavior, set it to 0 in the projection dictionary:
criteria = {"pilots.telemetry.speed": {"$gte": 20}}
projection = {
    "sponsors": 1,
    "location.country": 1,
    "pilots.telemetry.speed": 1,
    "pilots.name": 1,
    "_id": 0,
}
fast_pilots = races.find(criteria, projection)
for pilot in fast_pilots:
    pprint(pilot)
    break

{'location': {'country': 'United Kingdom'},
 'pilots': {'name': 'Kariotta Cow', 'telemetry': {'speed': 68.3}},
 'sponsors': ['Fat Shark', 'DJI', 'Etisalat']}
Now, this is prettier.
Finally, to return all but a few fields, we can set those fields to 0:
projection = {"_id": 0, "league": 0, "pilots": 0}
# Empty criteria for this one
races.find_one({}, projection)
{'name': 'Honorable',
 'location': {'venue': 'Manhattan Seas',
              'city': 'Ford',
              'country': 'United Kingdom',
              'date': 'error: invalid date "2024-10-25"'},
 'sponsors': ['Fat Shark', 'DJI', 'Etisalat'],
 'laps': 3,
 'weather_conditions': 'snowy'}
As you can see, this time, we have all the fields but _id, league, and pilots.
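The two projection styles shown in this section (listing fields to keep, or listing fields to drop) generally cannot be mixed in one projection document, with _id being the exception. The helper below is a minimal sketch that enforces this rule. make_projection and its parameters are hypothetical names and not part of pymongo.

```python
def make_projection(include=None, exclude=None, keep_id=True):
    """Build a projection dict from fields to include OR fields to exclude."""
    if include and exclude:
        raise ValueError("use either include or exclude, not both")

    # 1 keeps a field, 0 drops it.
    projection = {field: 1 for field in include or []}
    projection.update({field: 0 for field in exclude or []})

    # _id is returned by default, so it is only listed when it must be hidden.
    if not keep_id:
        projection["_id"] = 0
    return projection


only_sponsors = make_projection(include=["sponsors"], keep_id=False)
without_pilots = make_projection(exclude=["league", "pilots"], keep_id=False)
print(only_sponsors)
print(without_pilots)
```

Either result can be passed as the second argument to races.find or races.find_one.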
Conclusion
This tutorial doesn’t do justice to the massive size of MongoDB as a database management tool. Today, we only covered read queries (queries to retrieve information), but MongoDB also allows data specialists to insert, update, or delete information in document databases. We’ve also left out a whole class of queries: aggregations.
All these topics are beyond the scope of the article and require additional resources to learn. Why don’t you check these out:
- Introduction to using MongoDB in data science with Python course — a comprehensive course that also covers advanced queries like aggregations
- Intro to MongoDB in Python — Another MongoDB tutorial that covers CRUD (create, read, update and delete) operations with pymongo.
- NoSQL concepts course — if you are interested in NoSQL databases, this course is for you. Apart from document databases, it covers other non-relational databases.
FAQs
How do document databases like MongoDB differ from relational databases?
Document databases, such as MongoDB, store data in documents (often in JSON-like formats), which can contain nested data structures. This differs from relational databases, which store data in rows and tables with a fixed schema. Document databases allow for more flexibility as the schema is dynamic, meaning that each document can have different fields and data types. This makes MongoDB suitable for unstructured or semi-structured data, unlike relational databases, which require a predefined schema.
Why should I use MongoDB instead of a relational database?
MongoDB is beneficial when working with data that doesn't fit neatly into a tabular structure. Use MongoDB if your data has a flexible schema, if you anticipate frequent changes in data structure, or if you need to handle large volumes of unstructured data. It's also a good choice for applications requiring high-speed read and write operations at scale, such as e-commerce, logging, and content management systems.
What programming languages are compatible with MongoDB?
MongoDB offers compatibility with a wide range of programming languages, including Python, Java, JavaScript, Node.js, Go, Ruby, and C#, through official drivers and libraries. The Python library pymongo is commonly used for interacting with MongoDB in data science applications. MongoDB also integrates well with modern frameworks such as Django, Flask, and Express.js.
How does MongoDB handle large-scale data and horizontal scaling?
MongoDB is designed for horizontal scaling through sharding, where data is distributed across multiple servers to manage large-scale data efficiently. As your data grows, MongoDB can distribute the load across multiple machines, allowing for better performance and capacity. This makes MongoDB ideal for big data applications or those experiencing rapid growth in data volume.
Can MongoDB handle complex queries like SQL databases?
Yes, MongoDB can handle complex queries, but its query language (MQL, MongoDB Query Language) is quite different from SQL. MongoDB supports filters, projections, logical operators, and aggregations to perform sophisticated queries, allowing you to retrieve, filter, and transform data. However, unlike SQL databases, MongoDB does not support joins in the same way, as it is designed to denormalize data into flexible document structures.
Is MongoDB suitable for real-time analytics?
MongoDB can be used for real-time analytics, but its performance largely depends on how the data is structured and indexed. Using MongoDB’s powerful indexing and aggregation framework, you can run real-time queries and generate insights efficiently. However, for more complex analytical tasks, you might consider integrating MongoDB with tools like Apache Spark or using its aggregation framework to handle large-scale, real-time processing.
What are the security features in MongoDB?
MongoDB provides several security features, including authentication, authorization (role-based access control), encryption (both in-transit and at-rest), and auditing. MongoDB’s Enterprise Edition offers additional security features like LDAP integration and Kerberos authentication for enterprise-level security. These features help secure sensitive data while maintaining compliance with industry regulations.
Can MongoDB handle ACID transactions?
Yes, MongoDB supports ACID-compliant transactions, particularly from version 4.0 onward. This allows for multi-document transactions, similar to those in relational databases, ensuring atomicity, consistency, isolation, and durability for operations involving multiple documents or collections. This makes MongoDB more suitable for scenarios that require transaction guarantees.
What is the difference between JSON and BSON in MongoDB?
While JSON is a human-readable format commonly used for representing data, BSON (Binary JSON) is MongoDB’s storage format. BSON allows for more efficient storage and retrieval of data and supports additional data types like dates and binary data, which JSON does not natively handle. BSON also adds more metadata, which improves performance during document storage and retrieval.
I am a data science content creator with over 2 years of experience and one of the largest followings on Medium. I like to write detailed articles on AI and ML with a bit of a sarcastic style because you've got to do something to make them a bit less dull. I have produced over 130 articles and a DataCamp course to boot, with another one in the making. My content has been seen by over 5 million pairs of eyes, 20k of whom became followers on both Medium and LinkedIn.