Using Retrieval-Augmented Generation to Search a Movie Database
Retrieval-augmented generation, or RAG, is a technique that supplies a large language model with additional context at query time, without fine-tuning or retraining. This improves the model's ability to give factual responses, which is a known weakness of a language model used on its own.
The goal of this project is to build a question-answering bot for movie-related questions. To achieve this, we will use RAG to provide factual information to the language model. We will upload movie descriptions to a vector database and use it to search for relevant context for the language model.
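To make the retrieve-then-prompt idea concrete, here is a minimal sketch of the flow in plain Python. It is not the setup used in this project: the word-count "embedding" stands in for a real embedding model, the in-memory list stands in for a vector database, and the two descriptions are invented placeholders. The helper names (embed, cosine_similarity, retrieve, build_prompt) are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents, k=1):
    """Return the k documents most similar to the question."""
    query_vector = embed(question)
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vector, embed(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, context):
    """Combine the retrieved context and the question into a single prompt."""
    joined = "\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


# Invented placeholder descriptions, not rows from the dataset.
documents = [
    "a heist film about a crew that steals secrets from dreams",
    "a romantic comedy about two rival bakers in paris",
]
question = "which film is about dreams"

context = retrieve(question, documents, k=1)
print(build_prompt(question, context))
```

In the real project, the prompt built in the last step would be sent to the language model, and the similarity search would be handled by the vector database rather than a sorted list.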
We will be using the following tools and models:
- OpenAI's gpt-3.5-turbo model for prompt completions
- OpenAI's text-embedding-ada-002 model to create vector embeddings
- Pinecone as the vector database to store the embeddings
- langchain as the tool to interact with OpenAI and Pinecone
The dataset used for this project is sourced from the Kaggle dataset IMDb Movies/Shows with Descriptions.
Maintenance note, May 2024
Since this code-along was released, the Python packages for working with the Pinecone and OpenAI APIs have changed their syntax. The instructions, hints, and code have been updated to use the latest syntax, but the video has not been updated. Consequently, it is now slightly out of sync. Trust the workbook, not the video.
Before you begin
To get started with this project, you'll need a developer account for OpenAI and Pinecone. Follow the steps in the getting-started.ipynb notebook to create an API key and store it in Workspace.
For this project, we will assume that you have already set the OPENAI_API_KEY and PINECONE_API_KEY environment variables.
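Before going further, it can help to confirm that both variables are actually visible to Python. The check below is a small sketch using only the standard library; the helper name missing_keys is hypothetical, and it only tests that the variables are non-empty, not that the keys are valid.

```python
import os

# The two variables this project assumes are already set.
REQUIRED_KEYS = ("OPENAI_API_KEY", "PINECONE_API_KEY")


def missing_keys(env=os.environ, required=REQUIRED_KEYS):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]


missing = missing_keys()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required API keys are set.")
```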
Task 0: Setup
To perform this analysis, we need to install the following packages:
- openai: for interacting with OpenAI.
- pinecone-client: for interacting with Pinecone.
- langchain: a framework for developing with generative AI.
- langchain-openai and langchain-pinecone: Langchain extension modules with functionality for OpenAI and Pinecone.
- tiktoken: a string encoder that generates tokens used by OpenAI. It is useful for estimating the number of tokens used.
Instructions
Run the cell below to install the corresponding packages.
# Install the openai package, locked to version 1.27
!pip install openai==1.27
# Install the pinecone-client package, locked to version 4.0.0
!pip install pinecone-client==4.0.0
# Install the langchain package, locked to version 0.1.19
!pip install langchain==0.1.19
# Install the langchain-openai package, locked to version 0.1.6
!pip install langchain-openai==0.1.6
# Install the langchain-pinecone package, locked to version 0.1.0
!pip install langchain-pinecone==0.1.0
# Install the tiktoken package, locked to version 0.7.0
!pip install tiktoken==0.7.0
# Install the typing_extensions package, locked to version 4.11.0
!pip install typing_extensions==4.11.0
Task 1: Import the Movies Data
We'll start with importing the dataset we mentioned at the top of this project. You have the dataset available as a CSV in your workspace: "IMDB.csv". We need to import the dataset and transform it into a convenient format.
Instructions
- Import the pandas package as pd
- Import "IMDB.csv" into a variable movies_raw.
- Print the head of movies_raw.
# Import pandas as pd
import pandas as pd
# Import IMDB.csv. Assign to movies_raw.
movies_raw = pd.read_csv("IMDB.csv")
# Print the head of movies_raw
movies_raw.head()
Instructions
Apply the following transformations to movies_raw and assign the result to movies.
- Rename primaryTitle to movie_title and Description to movie_description
- Create a column source that contains the identifier of the movie, prefixed by "https://www.imdb.com/title/". The end result should be a working link to the movie. The identifier can be found in the "tconst" column in "IMDB.csv". For example, "https://www.imdb.com/title/tt0102926/".
- Filter out all rows that do not have "movie" as a titleType
- Select the movie_title, movie_description, source and genres columns
- Show the head of movies.
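The steps above can be sketched with pandas as follows. This is one possible approach, not necessarily the workbook's reference solution. The helper name transform_movies and the two-row toy_raw frame are hypothetical and exist only so the example runs on its own; the column names come from the instructions. A trailing slash is appended to the link so that it matches the example URL given above.

```python
import pandas as pd

IMDB_TITLE_URL = "https://www.imdb.com/title/"


def transform_movies(movies_raw: pd.DataFrame) -> pd.DataFrame:
    """Rename columns, build the source link, keep movies only, select columns."""
    movies = (
        movies_raw
        # Rename the title and description columns.
        .rename(columns={"primaryTitle": "movie_title",
                         "Description": "movie_description"})
        # Build a link from the tconst identifier (trailing slash matches the example).
        .assign(source=lambda df: IMDB_TITLE_URL + df["tconst"] + "/")
        # Keep only rows whose titleType is "movie".
        .query('titleType == "movie"')
    )
    return movies[["movie_title", "movie_description", "source", "genres"]]


# Toy stand-in for movies_raw with placeholder values (not real dataset rows).
toy_raw = pd.DataFrame({
    "tconst": ["tt0102926", "tt0000000"],
    "titleType": ["movie", "tvSeries"],
    "primaryTitle": ["Placeholder Movie", "Placeholder Show"],
    "Description": ["A placeholder description.", "Another placeholder description."],
    "genres": ["Drama", "Comedy"],
})

movies = transform_movies(toy_raw)
print(movies.head())
```

In the workbook itself, the same function could be applied to the real data with movies = transform_movies(movies_raw).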