Vector indexing has become a powerful tool for building modern applications. It allows you to perform fast and efficient similarity searches on high-dimensional data, often referred to as vector embeddings. This capability is now seamlessly integrated into MongoDB, enabling developers to build sophisticated features directly within their databases.
This article is a practical guide to setting up and using vector indexing in MongoDB. We'll walk through the process step by step, from creating your first index to running complex queries. You'll also learn best practices and explore a real-world example of building a product recommendation system for an e-commerce store.
What is Vector Indexing?
At its core, vector indexing is a technique for organizing vector embeddings to accelerate similarity searches. Unlike traditional indexes that work on scalar values (like numbers or strings), a vector index is optimized for finding "neighboring" vectors in a multidimensional space.
Vector embeddings are numerical representations of complex data, including text, images, and audio. For example, a product description can be converted into a vector where similar products have vectors that are closer together in space. This proximity allows for powerful applications like:
- Semantic search: Finding products based on the meaning of a query, not just keyword matches.
- Recommendation systems: Suggesting products or content to users based on their preferences.
- Image recognition: Identifying similar images from a large database.
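To make "closeness" concrete, here is a small, self-contained JavaScript sketch of cosine similarity, the most common closeness measure for embeddings. The three-element vectors are made-up toy values, not real model output; real embeddings have hundreds or thousands of dimensions.

```javascript
// Cosine similarity: 1.0 means identical direction, values near 0 mean unrelated.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy 3-dimensional "embeddings" for three products.
const headphones = [0.9, 0.1, 0.0];
const earbuds    = [0.8, 0.2, 0.1];
const wallet     = [0.1, 0.0, 0.9];

console.log(cosineSimilarity(headphones, earbuds)); // high: similar products
console.log(cosineSimilarity(headphones, wallet));  // low: unrelated products
```

The two audio products score much closer to each other than either does to the wallet, which is exactly the property a vector index exploits.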
Setting up vector indexing in MongoDB
To get started, you'll need a MongoDB Atlas cluster, since vector search is delivered as part of the Atlas platform. In this article, we will be using MongoDB Atlas.
Step 1: Create a MongoDB Atlas cluster
Create a new cluster in MongoDB Atlas. The free M0 tier is sufficient for testing purposes.
Step 2: Create a collection and insert data
First, connect to your cluster using MongoDB Shell (mongosh) or your preferred driver. Then, create a new collection called products and insert documents containing a vector field. Each document should represent a product and include an embedding field, which will hold the product's vector representation.
Here is a more complete example with three distinct product documents, each with a randomly generated vector to demonstrate what the data might look like. In a real-world scenario, you would replace these with embeddings generated by an AI model.
db.products.insertMany([
{
"name": "Wireless Noise-Canceling Headphones",
"category": "Electronics",
"description": "Premium over-ear headphones with active noise cancellation and a 30-hour battery life.",
"price": 249.99,
"embedding": [0.012, 0.054, 0.089, -0.031, 0.067, 0.021, -0.045, 0.098, 0.011, -0.076, 0.033, -0.015, -0.087, 0.059, 0.009, 0.042]
},
{
"name": "Smart Coffee Maker",
"category": "Home Goods",
"description": "Programmable coffee machine that can be controlled from your smartphone. Brews up to 12 cups.",
"price": 129.99,
"embedding": [0.056, 0.023, 0.011, 0.078, -0.029, 0.041, 0.087, -0.019, 0.065, 0.002, -0.093, 0.018, 0.057, -0.041, 0.022, 0.064]
},
{
"name": "Ultra-light Backpack",
"category": "Travel",
"description": "Durable and lightweight backpack with multiple compartments, perfect for hiking or daily commutes.",
"price": 79.99,
"embedding": [-0.072, 0.034, 0.019, 0.068, -0.055, 0.081, 0.004, -0.025, 0.039, 0.016, -0.044, 0.062, 0.008, 0.071, -0.013, 0.053]
}
]);
Each document contains a name, category, description, and price, along with the crucial embedding field. The embedding is a fixed-size array of numbers that represents the semantic meaning of the product's description. The size of this array, known as its dimensionality, depends on the model used to create the embeddings. The sample vectors above are truncated to 16 elements to keep the example readable; a real embedding model produces far longer arrays (for example, OpenAI's text-embedding-ada-002 outputs 1536 dimensions), and the index you create must match that length exactly.
This is the data on which we will build our vector index to enable semantic search and product recommendations.
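Dimension mismatches between documents and the index are a common source of silent failures, because mismatched vectors simply aren't returned by vector queries. A small, hypothetical helper (our own convenience function, not a MongoDB API) can guard against this before calling insertMany:

```javascript
// Reject documents whose embedding length doesn't match the index's numDimensions.
function validateEmbeddings(docs, expectedDims) {
  for (const doc of docs) {
    if (!Array.isArray(doc.embedding)) {
      throw new Error(`"${doc.name}" is missing an embedding array`);
    }
    if (doc.embedding.length !== expectedDims) {
      throw new Error(
        `"${doc.name}" has ${doc.embedding.length} dimensions, expected ${expectedDims}`
      );
    }
  }
  return docs; // safe to pass to db.products.insertMany(...)
}
```

You would call it as `db.products.insertMany(validateEmbeddings(docs, 16))`, using the dimensionality of your own model.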
Step 3: Create a vector index
Once you have your data with the embeddings in a collection, the next step is to create a vector index. This index is crucial for enabling fast and efficient similarity searches. The command below uses createSearchIndex to build a vector index on the embedding field.
db.products.createSearchIndex(
"vector_index",
"vectorSearch",
{
"fields": [
{
"type": "vector",
"path": "embedding",
"numDimensions": 16, // must match the length of the embedding arrays above
"similarity": "cosine"
}
]
}
);
Understanding the configuration
The index configuration lives in the fields array of the index definition. This structure is the blueprint for how MongoDB Atlas organizes your vectors for high-speed retrieval.
The core parameters for vector indexing are:
- path: This specifies the exact name of the field in your documents that holds the vector array, which is "embedding" in this example.
- numDimensions: This is the most crucial parameter. It must exactly match the number of elements in the vector array produced by your embedding model (e.g., 128, 768, 1536).
- similarity: This defines the mathematical metric used to calculate the "closeness" between two vectors during a search.
- cosine is typically recommended for semantic search, as it measures the angle between vectors, focusing on semantic direction.
- Other options include euclidean (straight-line distance) and dotProduct.
Optional performance parameters
For advanced performance tuning, MongoDB Atlas allows you to specify optional parameters to optimize storage and memory:
- quantization: This parameter is used to compress your vectors, which significantly reduces the index size and memory footprint.
- scalar or binary can be specified to quantize the vectors, trading a small amount of recall for major gains in storage and speed.
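Atlas performs quantization internally, but the idea behind scalar quantization is easy to sketch. The following JavaScript is our own simplification for illustration, not Atlas's actual implementation: each 8-byte float is mapped to a 1-byte integer, shrinking storage roughly 8x at the cost of a small rounding error.

```javascript
// Scalar quantization sketch: map floats in [min, max] onto int8 values [-128, 127].
function quantizeScalar(vector) {
  const min = Math.min(...vector);
  const max = Math.max(...vector);
  const scale = (max - min) / 255 || 1; // step size; guard against a zero range
  const quantized = vector.map(v => Math.round((v - min) / scale) - 128);
  return { quantized: Int8Array.from(quantized), min, scale };
}

// Approximate reconstruction; each value is within one step of the original.
function dequantize({ quantized, min, scale }) {
  return Array.from(quantized, q => (q + 128) * scale + min);
}
```

Round-tripping a vector through these two functions recovers each component to within one quantization step, which is why recall drops only slightly.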
Performing Vector Search Queries
Once the index is built, you can use the $vectorSearch aggregation stage to perform similarity queries.
Let's find the products most similar to a query vector. This query would typically be generated from a user's search input.
// A sample query vector, generated from a user's search for "wireless audio device"
const queryVector = [0.11, 0.22, 0.33, ...];
db.products.aggregate([
{
"$vectorSearch": {
"index": "vector_index",
"queryVector": queryVector,
"path": "embedding",
"numCandidates": 100, // Number of nearest-neighbor candidates to consider
"limit": 5 // Number of results to return
}
},
{
"$project": {
"name": 1,
"description": 1,
"_id": 0
}
}
]);
- index: The name of the vector index to query. This field is required.
- queryVector: The vector you are searching for.
- path: The name of the field containing the vector embeddings.
- numCandidates: The number of nearest-neighbor candidates to consider during the approximate search. A higher number improves accuracy but increases latency.
- limit: The number of top results to return.
The $project stage is optional but useful for shaping the output and excluding the vector data from the result.
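To avoid retyping the stage for every query, you can assemble it programmatically. This hypothetical helper is our own convenience function, not part of any driver, and it assumes the index is named "vector_index" as in the examples above:

```javascript
// Build a $vectorSearch pipeline with sensible defaults.
// Assumes an index named "vector_index" on the "embedding" field.
function buildVectorSearchPipeline(queryVector, { limit = 5, numCandidates = limit * 20 } = {}) {
  return [
    {
      $vectorSearch: {
        index: "vector_index",
        path: "embedding",
        queryVector,
        numCandidates, // defaults to 20x the limit, a common starting point
        limit
      }
    },
    { $project: { name: 1, description: 1, _id: 0 } }
  ];
}
```

You would then run `db.products.aggregate(buildVectorSearchPipeline(queryVector, { limit: 5 }))`.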
Best Practices for Vector Indexing
When tuning your vector index for optimal performance, you need to balance accuracy and speed. The key is to adjust the index's parameters based on your specific dataset and performance goals.
Choosing the right similarity metric
The choice of metric is fundamental, as it defines how the "similarity" of two vectors is measured. The MongoDB documentation describes each option in detail.
- cosine: Best for semantic similarity, especially with text. It measures the angle between vectors, ignoring their length. This is crucial for models where vector length isn't meaningful.
- euclidean: Measures the straight-line distance between two vectors. Use this if the magnitude (or length) of the vectors is significant to their meaning, such as in certain data types where a larger value signifies a stronger feature.
- dotProduct: Calculates the dot product of two vectors. This metric is often used for recommendation systems. When vector embeddings are normalized to a length of 1, the dot product is functionally equivalent to cosine similarity.
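The three metrics are simple to implement, and the claim that dot product equals cosine similarity for unit-length vectors is easy to verify numerically. A small JavaScript check, using toy vectors rather than real embeddings:

```javascript
function dot(a, b) { return a.reduce((s, x, i) => s + x * b[i], 0); }
function norm(a) { return Math.sqrt(dot(a, a)); }
function normalize(a) { const n = norm(a); return a.map(x => x / n); }
function cosine(a, b) { return dot(a, b) / (norm(a) * norm(b)); }
function euclidean(a, b) {
  return Math.sqrt(a.reduce((s, x, i) => s + (x - b[i]) ** 2, 0));
}

// Normalize two toy vectors to unit length.
const a = normalize([0.8, 0.9, 0.1]);
const b = normalize([0.7, 0.8, 0.2]);

// For unit-length vectors, dot product and cosine similarity coincide.
console.log(dot(a, b));    // same value...
console.log(cosine(a, b)); // ...as this
```

This is why many embedding providers ship pre-normalized vectors: it lets you use the cheaper dotProduct metric without changing results.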
Managing embedding dimensions
The numDimensions parameter is critical and must exactly match the output of your embedding model. A common mistake is to pick a number that doesn't match the model's output, in which case mismatched vectors simply won't be returned by vector queries. A model with a higher number of dimensions can capture more detail, but will result in a larger index and potentially slower queries. Always check your model's documentation to confirm its dimensionality (e.g., OpenAI's text-embedding-ada-002 outputs 1536 dimensions).
Index tuning
The numCandidates parameter is the main knob for balancing speed and accuracy. A good rule of thumb is to start at roughly 10-20 times your limit and increase it until you achieve a satisfactory level of recall. Atlas manages the underlying index algorithm (HNSW) automatically, so beyond numCandidates, quantization (discussed above) is the other main lever for trading accuracy against storage and speed.
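One practical way to pick numCandidates is to measure recall: compare the approximate results against an exact, brute-force top-k on a sample of your data. Here is a sketch of the exact side of that comparison in plain JavaScript (toy data; in practice the second argument to recallAtK would come from a $vectorSearch query):

```javascript
// Exact top-k by brute force, used as ground truth when tuning numCandidates.
function exactTopK(queryVector, docs, k) {
  function cosine(a, b) {
    let dot = 0, na = 0, nb = 0;
    for (let i = 0; i < a.length; i++) {
      dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2;
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  }
  return docs
    .map(d => ({ name: d.name, score: cosine(queryVector, d.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// recall@k = fraction of the exact top-k that the approximate search also found.
function recallAtK(exactNames, annNames) {
  const ann = new Set(annNames);
  return exactNames.filter(n => ann.has(n)).length / exactNames.length;
}
```

Raise numCandidates until recall@k plateaus near 1.0, then stop: further increases only add latency.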
Example use case: Building a simple recommendation system
Let's apply these concepts to a practical example: an e-commerce product recommendation system.
Step 1: Store product data with embeddings
First, we'll store product names, categories, and a vector embedding for each. The embedding can be generated from the product's description or features.
db.products.insertMany([
{
"name": "Smart Speaker",
"category": "Electronics",
"embedding": [0.8, 0.9, 0.1, ...]
},
{
"name": "Wireless Earbuds",
"category": "Electronics",
"embedding": [0.7, 0.8, 0.2, ...]
},
{
"name": "Leather Wallet",
"category": "Fashion",
"embedding": [0.1, 0.2, 0.9, ...]
}
]);
Step 2: Create a vector index on embedding
db.products.createSearchIndex(
"vector_index",
"vectorSearch",
{
"fields": [
{
"type": "vector",
"path": "embedding", // The field containing the vectors
"numDimensions": 128, // Must match your embedding model's output
"similarity": "cosine"
// The index algorithm (HNSW) is managed automatically by Atlas
}
]
}
);
Step 3: Query for similar products
Now, let's find products similar to the Smart Speaker. We'll use the embedding for Smart Speaker as our query vector.
const smartSpeakerEmbedding = [0.8, 0.9, 0.1, ...];
db.products.aggregate([
{
"$vectorSearch": {
"index": "vector_index",
"queryVector": smartSpeakerEmbedding,
"path": "embedding",
"numCandidates": 50,
"limit": 3
}
},
{
"$project": {
"name": 1,
"category": 1,
"_id": 0
}
}
]);
This query will return the top three products whose vector embeddings are closest to the Smart Speaker's embedding, effectively recommending similar items, like Wireless Earbuds.
Conclusion
MongoDB's vector search capabilities unlock new possibilities for building intelligent and personalized applications. By integrating vector indexing directly into the database, MongoDB simplifies the architecture, making it easier to manage data and perform fast similarity searches.
With the power of vector search, you can move beyond simple keyword matching and create features that understand the context and meaning of your data. We encourage you to start experimenting with the MongoDB Atlas free tier and explore how vector indexing can enhance your applications.
FAQs
What is a vector embedding and why is it needed for vector search?
A vector embedding is a numerical representation of data like text or images. It's needed for vector search because it allows the database to measure the semantic similarity between items by calculating the distance between their vectors, going beyond simple keyword matching.
Do I need a specific MongoDB version to use Vector Search?
Vector search is an integrated feature of MongoDB Atlas, the cloud-hosted version of MongoDB, rather than a feature tied to a specific server version. It is not available in the self-managed community or on-premise editions of the database.
Do I need to generate vector embeddings myself?
Yes, you need to generate the vector embeddings for your data before you can store them in MongoDB and create a vector index. MongoDB does not generate these embeddings for you. You will typically use a third-party service or a pre-trained model from a platform like OpenAI, Hugging Face, or Google AI to convert your text, images, or other data into numerical vectors.
Can I combine vector search with other MongoDB queries?
Yes, you can. Vector search is an aggregation stage ($vectorSearch), which means you can use it within a larger MongoDB aggregation pipeline. This allows you to combine vector search with other powerful stages like $match, $project, $limit, and $lookup to create sophisticated queries that perform filtering, sorting, and data enrichment in a single operation. This capability is known as hybrid search and is essential for building real-world applications.
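As a sketch of hybrid search, $vectorSearch also accepts a filter field for pre-filtering during the index scan, which can then be combined with ordinary stages. Note the assumption here: pre-filtering on category requires that field to be indexed as a "filter" field in the vector index definition, which this article's index does not include.

```javascript
// Toy query vector; in practice this comes from an embedding model.
const queryVector = [0.1, 0.2, 0.3];

// Hybrid pipeline: vector similarity pre-filtered by category, then a price filter.
const pipeline = [
  {
    $vectorSearch: {
      index: "vector_index",
      path: "embedding",
      queryVector: queryVector,
      filter: { category: "Electronics" }, // needs a "filter" field in the index
      numCandidates: 100,
      limit: 10
    }
  },
  { $match: { price: { $lt: 200 } } },        // ordinary post-filtering
  { $project: { name: 1, price: 1, _id: 0 } } // shape the output
];
```

Pre-filtering inside $vectorSearch is generally preferable to a trailing $match for selective predicates, because the candidate set is restricted before the nearest-neighbor search runs.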
What is the difference between the IVF and HNSW index types?
Both IVF (Inverted File) and HNSW (Hierarchical Navigable Small World) are algorithms used for approximate nearest neighbor (ANN) search. The main difference lies in how they structure the data. IVF partitions the vectors into clusters and, at query time, searches only the clusters closest to the query, which keeps memory usage low but can miss neighbors near cluster boundaries. HNSW builds a multi-layer proximity graph that is traversed from coarse to fine layers, typically offering better recall and lower latency at the cost of more memory. MongoDB Atlas Vector Search manages the index algorithm for you, so you do not choose between them directly.
Karen is a Data Engineer with a passion for building scalable data platforms. She has experience in infrastructure automation with Terraform and is excited to share her learnings in blog posts and tutorials. Karen is a community builder, and she is passionate about fostering connections among data professionals.