Semantic Search with Pinecone

    In this project you'll explore the world of vector databases and their use as the underlying data storage infrastructure for AI.

    The project will introduce you to one of the most popular vector databases in the industry, Pinecone. You'll:

    • Learn when and how semantic search can be used in products.
    • Learn why semantic search often performs better than traditional keyword search.
    • Learn how semantic search works, by exploring text about bees.
    • Use semantic search on the Stanford Question Answering Dataset to answer questions about Beyoncé, Chopin, and other cultural topics.

    Terminology

    A vector database is a type of database that stores only numeric vectors (unlike SQL databases, which can store many different data types). By focusing on just one data format, vector databases can search quickly across hundreds of billions of records.

    Embedding is the process of converting data types like text to a vector format suitable for storage in a vector database.

    Vector search is when you find records in a vector database that are the best match to a query.

    Semantic refers to the meaning of words.

    Semantic search is when you do vector search on the meaning of text.
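
    To make these definitions concrete, here is a minimal sketch (not part of the project code) that uses made-up 3-dimensional vectors and numpy; real embedding models produce vectors with hundreds or thousands of dimensions.

    # Toy illustration of the terminology above. The vectors are made up;
    # real embeddings are produced by a model and are much longer.
    import numpy as np

    # "Embedding": each piece of text is represented as a numeric vector
    embeddings = {
        "bees guard their queen": np.array([0.9, 0.1, 0.0]),
        "a beehive houses honey bees": np.array([0.8, 0.3, 0.1]),
        "people live in condominiums": np.array([0.1, 0.2, 0.9]),
    }

    def cosine_similarity(a, b):
        """Similarity of two vectors: close to 1.0 = similar, close to 0.0 = unrelated."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # "Vector search": find the stored vector closest to the (embedded) query
    query_vector = np.array([0.85, 0.2, 0.05])  # pretend this is the embedded query text
    best_match = max(embeddings, key=lambda text: cosine_similarity(query_vector, embeddings[text]))
    print(best_match)  # the bee-related sentences score highest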

    Uses of vector search

    Vector search is an incredibly important technology that we all use every single day.

    Vector search is how Amazon knows what you want to buy before you even do, it's how Netflix recommends TV shows and films, and it's how Google serves the most relevant results from the web at search time.

    When searching using natural language (as in the Google example), semantic search can often perform much better than keyword matching (which is how traditional search works).

    Note: orangutans are apes, not monkeys, but not every query from users will be perfectly worded.

    In this example, a traditional search that relies on keyword / term overlap will not perform well, despite the fact that the document is highly relevant to the query. Here we need to search based on meaning, not keywords. It is with these natural language queries, i.e. queries structured in the way we, as human beings, think, that we are able to retrieve the documents relevant to our query.

    Use-cases for this type of search are broad, but a few of the most common we find for semantic search include:

    • Document search: a favorite use-case for organizations, particularly those with poor internal document discovery. Enabling their staff to find the information they need more quickly is a huge optimization for many organizations.
    • Chatbot knowledge training: another very popular use-case with the rise of AI chatbots is the ability to augment chatbots or Large Language Models (LLMs) with external data. We use semantic search to retrieve this data; the process is commonly referred to as Retrieval Augmented Generation (RAG). A rough sketch of this flow appears after this list.
    • Language classification: by placing many classified sequences into a vector DB we are able to classify new sentences more quickly by simply comparing their semantic similarity to existing entries in the vector DB.
    • Agent/chatbot safety: an increasingly popular use-case for semantic search is chatbot safety. It functions similarly to language classification but instead focuses on identifying malicious or unwanted inputs / outputs between users and chatbots.

    These are just a few example use-cases of semantic search; there are many more out there in the world, which you will undoubtedly encounter and be ready to recognize after completing this chapter and gaining the skills and knowledge to build your own semantic search apps.
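
    As a rough illustration of the RAG flow mentioned in the chatbot knowledge training use-case above, the sketch below uses hypothetical embed, index, and llm stand-ins (they are not part of this project's code); it only shows the general shape of retrieval followed by generation.

    # Rough shape of Retrieval Augmented Generation (RAG). `embed`, `index`, and
    # `llm` are hypothetical stand-ins for an embedding model, a vector DB index,
    # and a chat model respectively.
    def answer_with_rag(question, embed, index, llm, top_k=3):
        query_vector = embed(question)                    # embed the user question
        results = index.query(query_vector, top_k=top_k)  # retrieve the most similar records
        context = "\n".join(r["text"] for r in results)   # collect their text
        prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
        return llm(prompt)                                # generate a grounded answer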

    Before you begin

    You'll need to get an OpenAI API key and Pinecone API key. You can refer to getting-started.ipynb for steps on how to store these API keys in Workspace.
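
    If you stored the keys as environment variables in Workspace, you can read them back in Python along these lines (the variable names below are an assumption; use whichever names you chose when saving them):

    import os

    # Assumed environment variable names; adjust to match how you stored the keys
    openai_api_key = os.environ["OPENAI_API_KEY"]
    pinecone_api_key = os.environ["PINECONE_API_KEY"]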

    Task 0: Setup

    Before we start building our chatbot, we need to install some Python libraries. Here's a brief overview of what each library does:

    • openai: This is the official OpenAI Python client. We'll use it to interact with the OpenAI API and generate embeddings for Pinecone.
    • datasets: This library provides a vast array of datasets for machine learning. We'll use it to load our knowledge base for the chatbot.
    • pinecone-client: This is the official Pinecone Python client. We'll use it to interact with the Pinecone vector DB where we will store our semantic search database.

    You can install these libraries using pip like so:
    !pip install -qU \
        openai==0.28.1 \
        datasets==2.14.5 \
        pinecone-client==2.2.4

    Task 1: Semantic Similarity

    We will start by understanding what is actually happening under the hood of Pinecone. As mentioned, we're doing something called "semantic similarity". Semantic similarity is simply comparing the semantic meaning of two chunks of text.

    For example, let's define a list of sentences and compare them based on their "meaning" as we (as humans) understand them.

    Instructions

    Run this code to define a list containing text data.

    sentences = [
        "the hive of bees protect their queen",                         # 0
        "a beehive is an enclosed structure in which honey bees live",  # 1
        "a condominium is an enclosed structure in which people live",  # 2
        "the flying stinging insects guard the matriarch"               # 3
    ]

    How similar are these sentences to humans?

    It's clear to people that sentences 0 and 3 mean the same thing. Depending on the context, we could view 1 and 2 as similar in that they talk about where something lives, and 0, 1, and 3 as similar in that they're all talking about bees.

    How similar are these sentences using keyword matching?

    If we were to compare these using the more traditional approach of keyword matching, we would very quickly run into problems. Sentences 1 and 2 might score well, but the other sentences have little-to-no overlap in keywords, so they would not be identified as similar.
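
    To see this for yourself, here is a small sketch (not part of the project instructions) that scores keyword overlap between the sentences above with a simple Jaccard measure over their words:

    # Keyword (Jaccard) overlap between two sentences: shared words / total unique words
    def keyword_overlap(a, b):
        words_a, words_b = set(a.lower().split()), set(b.lower().split())
        return len(words_a & words_b) / len(words_a | words_b)

    print(round(keyword_overlap(sentences[1], sentences[2]), 2))  # high: many shared words
    print(round(keyword_overlap(sentences[0], sentences[3]), 2))  # near zero, despite matching meaning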

    Let's see how semantic search performs!

    It is for these scenarios that we rely on semantic search. It works by teaching a language model to transform text into meaningful vector embeddings. We call them meaningful because the language model actually learns to map semantically similar sentences to nearby points in vector space (i.e. their embeddings end up close together).

    We can try creating these embeddings using OpenAI's Ada 002 model like so:
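
    The notebook's exact code isn't reproduced here, but with the pinned openai==0.28.1 client a call along the following lines produces one embedding per sentence (reading the API key from an environment variable is an assumption; use however you stored it):

    import os
    import openai

    openai.api_key = os.environ["OPENAI_API_KEY"]  # assumes the key is stored as an env variable

    # Embed all four sentences with OpenAI's Ada 002 embedding model
    response = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=sentences,
    )
    embeddings = [record["embedding"] for record in response["data"]]
    print(len(embeddings), len(embeddings[0]))  # 4 vectors, each with 1536 dimensions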