
Using Retrieval-Augmented Generation to Search a Movie Database

Retrieval-augmented generation, or RAG, is a technique for giving a large language model additional context at query time, without fine-tuning or retraining it. Because the model can draw on the retrieved context, it is better able to give factual responses, which is a known weakness of a language model used on its own.

The goal of this project is to build a question-answering bot for movie-related questions. To achieve this, we will use RAG to provide factual information to the language model. We will upload movie descriptions to a vector database and use it to search for relevant context for the language model.
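The retrieve-then-prompt flow described above can be sketched without any external service. This is only a toy illustration: the movie titles and descriptions are made up, word counts stand in for real vector embeddings, and a Python dictionary stands in for the vector database. The function names (embed, retrieve, build_prompt) are hypothetical and are not part of langchain, OpenAI, or Pinecone.

```python
import math
from collections import Counter

# Made-up descriptions standing in for the contents of the vector database.
DOCS = {
    "Space Haul": "a cargo pilot smuggles medicine across a war-torn galaxy",
    "Bake Off Blues": "a retired baker enters a televised pastry competition",
    "Night Ledger": "an accountant uncovers fraud inside a crime syndicate",
}


def embed(text):
    """Toy embedding: word counts. The project uses an embedding model instead."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, k=1):
    """Return the titles of the k descriptions most similar to the question."""
    q = embed(question)
    ranked = sorted(DOCS, key=lambda title: cosine(q, embed(DOCS[title])), reverse=True)
    return ranked[:k]


def build_prompt(question, k=1):
    """Put the retrieved descriptions in front of the question as context."""
    context = "\n".join(f"{title}: {DOCS[title]}" for title in retrieve(question, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


print(build_prompt("Which movie is about a baker in a pastry competition"))
```

In the real project the prompt built this way would be sent to the language model, which answers from the supplied context rather than from memory alone.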

We will be using the following tools and models:

  • OpenAI's gpt-3.5-turbo model for prompt completions
  • OpenAI's text-embedding-ada-002 model to create vector embeddings
  • Pinecone as the vector database to store the embeddings
  • langchain as the tool to interact with OpenAI and Pinecone

The data for this project comes from the Kaggle dataset IMDb Movies/Shows with Descriptions.

Maintenance note, May 2024

Since this code-along was released, the Python packages for working with the Pinecone and OpenAI APIs have changed their syntax. The instructions, hints, and code have been updated to use the latest syntax, but the video has not been updated. Consequently, it is now slightly out of sync. Trust the workbook, not the video.

Before you begin

To get started with this project, you'll need developer accounts for OpenAI and Pinecone. Follow the steps in the getting-started.ipynb notebook to create an API key and store it in Workspace.

For this project, we will assume that you have already set the OPENAI_API_KEY and PINECONE_API_KEY environment variables.
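Before going further, it can help to confirm that both variables are actually visible to Python. The following is a minimal sketch; the helper name missing_keys is hypothetical, and the check only tests that each variable is non-empty, not that the key is valid.

```python
import os

# The two environment variables this project assumes are set.
REQUIRED_KEYS = ("OPENAI_API_KEY", "PINECONE_API_KEY")


def missing_keys(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]


missing = missing_keys()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("Both API keys are set.")
```

The check prints only the variable names, never the key values, so the cell output is safe to share.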

Task 0: Setup

To perform this analysis, we need to install the following packages:

  • openai: for interacting with OpenAI.
  • pinecone-client: for interacting with Pinecone.
  • langchain: a framework for developing with generative AI.
  • langchain-openai and langchain-pinecone: LangChain extension packages with functionality for OpenAI and Pinecone.
  • tiktoken: a string encoder that generates tokens used by OpenAI. It is useful for estimating the number of tokens used.

Instructions

Run the cell below to install the corresponding packages.

# Install the openai package, locked to version 1.27
!pip install openai==1.27

# Install the pinecone-client package, locked to version 4.0.0
!pip install pinecone-client==4.0.0

# Install the langchain package, locked to version 0.1.19
!pip install langchain==0.1.19

# Install the langchain-openai package, locked to version 0.1.6
!pip install langchain-openai==0.1.6

# Install the langchain-pinecone package, locked to version 0.1.0
!pip install langchain-pinecone==0.1.0

# Install the tiktoken package, locked to version 0.7.0
!pip install tiktoken==0.7.0

# Install the typing_extensions package, locked to version 4.11.0
!pip install typing_extensions==4.11.0
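
Since the code in this workbook depends on these specific versions, one way to confirm what actually got installed is to query the package metadata with the standard library. This is a sketch only; the helper names installed_version and report are hypothetical, and the pinned versions are copied from the install commands above.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned by the install cell.
PINNED = {
    "openai": "1.27",
    "pinecone-client": "4.0.0",
    "langchain": "0.1.19",
    "langchain-openai": "0.1.6",
    "langchain-pinecone": "0.1.0",
    "tiktoken": "0.7.0",
    "typing_extensions": "4.11.0",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def report(pins=PINNED):
    """Map each pinned package to the version found in this environment."""
    return {package: installed_version(package) for package in pins}


for package, found in report().items():
    print(f"{package}: pinned {PINNED[package]}, installed {found}")
```

If a package shows None or an unexpected version, rerunning the install cell and restarting the kernel is usually the first thing to try.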

Task 1: Import the Movies Data

We'll start by importing the dataset mentioned at the top of this project. It is available as a CSV file in your workspace: "IMDB.csv". We need to import it and transform it into a convenient format.

Instructions

  • Import the pandas package as pd
  • Import "IMDB.csv" into a variable movies_raw.
  • Print the head of movies_raw.

# Import pandas as pd
import pandas as pd

# Import IMDB.csv. Assign to movies_raw.
movies_raw = pd.read_csv("IMDB.csv")

# Print the head of movies_raw
movies_raw.head()

Instructions

Transform movies_raw and assign the result to movies.

  • Rename primaryTitle to movie_title and Description to movie_description
  • Create a column source that contains the identifier of the movie, prefixed by "https://www.imdb.com/title/". The end result should be a working link to the movie. The identifier can be found in the "tconst" column in "IMDB.csv". For example, "https://www.imdb.com/title/tt0102926/".
  • Keep only the rows whose titleType is "movie"
  • Select the movie_title, movie_description, source and genres columns
  • Show the head of movies.
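
The steps above can be sketched as a single pandas method chain. To keep the example self-contained, it runs on a small made-up DataFrame (movies_raw_demo) that only mimics the column names mentioned in the instructions. The rows are invented, and the real "IMDB.csv" may contain additional columns.

```python
import pandas as pd

# Made-up rows that mimic the columns named in the instructions.
movies_raw_demo = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "titleType": ["movie", "tvSeries", "movie"],
    "primaryTitle": ["Example One", "Example Show", "Example Two"],
    "Description": ["First made-up plot.", "Made-up series plot.", "Second made-up plot."],
    "genres": ["Drama", "Comedy", "Action,Thriller"],
})

movies_demo = (
    movies_raw_demo
    # Rename the title and description columns
    .rename(columns={"primaryTitle": "movie_title", "Description": "movie_description"})
    # Build a link to the movie from its identifier
    .assign(source=lambda df: "https://www.imdb.com/title/" + df["tconst"] + "/")
    # Keep only movies, and only the four columns of interest
    .loc[
        lambda df: df["titleType"] == "movie",
        ["movie_title", "movie_description", "source", "genres"],
    ]
)

# Show the head of the result
movies_demo.head()
```

Applying the same chain to movies_raw instead of movies_raw_demo should give the movies DataFrame the task asks for, provided the column names match.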