Live training | 2023-06-13 | Building AI Applications with LangChain and GPT
You've probably talked to ChatGPT through the web interface, or used the API with the `openai` Python package, and wondered "what if I could teach it about my own data?". Today we're going to build exactly that kind of application using LangChain, a framework for developing applications powered by language models.
In today's session, we'll build a chatbot powered by GPT-3.5 that can answer questions about LangChain, thanks to its knowledge of the LangChain documentation. We'll cover:
- Setting up an OpenAI developer account and integrating it with Workspace
- Installing the LangChain package
- Preparing the data
- Embedding the data using OpenAI's Embeddings API, and estimating the cost of that operation
- Storing the data in a vector database
- Querying the vector database
- Putting together a basic chat application to "talk to the LangChain docs"
Before you begin
Unzip the required data by running the following cell
!test -f contents.zip && unzip contents.zip && rm contents.zip
Create a developer account with OpenAI
- Go to the API signup page.
- Create your account (you'll need to provide your email address and your phone number).
- Go to the API keys page.
- Create a new secret key.
- Take a copy of it. (If you lose it, delete the key and create a new one.)
Add a payment method
OpenAI sometimes provides free credits for the API, but it's not clear whether this applies worldwide or what the conditions are. You may need to add debit/credit card details.
We will use two APIs:
- The Chat API with the `gpt-3.5-turbo` model (cost: $0.002 / 1K tokens)
- The Embeddings API with the Ada v2 model, `text-embedding-ada-002` (cost: $0.0004 / 1K tokens)
In total, the Chat API (used for completions) should cost less than $0.10, and embedding should cost around $0.60 (roughly 1.5 million tokens at $0.0004 / 1K tokens). This notebook provides the embeddings already, so you can skip the embedding step.
- Go to the Payment Methods page.
- Click Add payment method.
- Fill in your card details.
Set up Environment Variables
- In Workspace, click on Environment.
- Click on the "Environment Variables" plus button.
- In the "Name" field, type `OPENAI_API_KEY`. In the "Value" field, paste in your secret key (starting with `sk-`).
- Click "Create", and connect the new integration.
# install langchain (version 0.0.191)
!pip install langchain==0.0.191
# install chromadb
!pip install chromadb
# install tiktoken
!pip install tiktoken
Task 1: Load data
To be able to embed and store data, we need to provide LangChain with Documents. This is easy to achieve in LangChain thanks to Document Loaders. In our case, we're targeting "Read the Docs" documentation, for which there is a dedicated loader: `ReadTheDocsLoader`.
In the folder `rtdocs`, you'll find all the HTML files from the LangChain documentation (https://python.langchain.com/en/latest/index.html).
How we obtained the data
These files were downloaded by executing this Linux command:
wget -r -A.html -P rtdocs https://python.langchain.com/en/latest/
We urge you **NOT** to execute this during the live training, as it will scan and download the full LangChain doc site (~1000 files). This operation may be heavy and could disrupt the site, especially if hundreds of learners run it all at once!
Our first task is to load these HTML files as documents that we can use with LangChain. The `ReadTheDocsLoader` will read the directory containing the HTML files, strip the HTML tags from each file to keep only the text, and return that text as a `Document` object. At the end of this task, we'll have a variable `raw_documents` containing a list of `Document` objects: one `Document` per HTML file.
Note that in this step we won't actually load the documents into a database; we're simply loading them into a list.
Instructions
- Import `ReadTheDocsLoader` from `langchain.document_loaders`
- Create the loader, pointing it at the `rtdocs/python.langchain.com/en/latest` directory and enabling the HTML parser feature with `features='html.parser'`
- Load the data into `raw_documents` by calling `loader.load()`
# Import ReadTheDocsLoader
from langchain.document_loaders import ReadTheDocsLoader
# Create a loader for the `rtdocs/python.langchain.com/en/latest` folder
loader = ReadTheDocsLoader("rtdocs/python.langchain.com/en/latest", features="html.parser")
# Load the data
raw_documents = loader.load()
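To verify the load worked, it's worth peeking at what came back (a quick sketch; the exact document count depends on what's in your `rtdocs` folder):
# Check how many Documents we obtained, and preview the first one
print(f"Loaded {len(raw_documents)} documents")
# Each Document stores the extracted text in .page_content and its source path in .metadata
print(raw_documents[0].metadata)
print(raw_documents[0].page_content[:200])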
Task 2: Slice the documents into smaller chunks
In the previous step, we turned each HTML file into a Document. These files may be very long, and are potentially too large to embed fully. It's also a good practice to avoid embedding large documents:
- long documents often contain several concepts. Retrieval will be easier if each concept is indexed separately;
- retrieved documents will be injected into a prompt, so keeping them short will keep the prompt small(ish).
LangChain has a collection of tools for this: Text Splitters. In our case, we'll be using the most straightforward one: the Recursive Character Text Splitter. It recursively reduces the input by splitting it by paragraphs, then sentences, then words, as needed, until each chunk is small enough.
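To get a feel for its behaviour before applying it to the real documents, here's a small standalone demo (a sketch with deliberately tiny chunk sizes; by default the splitter tries the separators `"\n\n"`, `"\n"`, `" "`, and `""` in order, moving to the next one whenever a piece is still too large):
# Demo: split a short text with a tiny chunk size to make the behaviour visible
from langchain.text_splitter import RecursiveCharacterTextSplitter
demo_splitter = RecursiveCharacterTextSplitter(chunk_size=40, chunk_overlap=10)
text = (
    "LangChain is a framework for developing applications powered by language models.\n\n"
    "It provides document loaders, text splitters, and vector store integrations."
)
for chunk in demo_splitter.split_text(text):
    print(repr(chunk))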
Instructions
- Import the `RecursiveCharacterTextSplitter` from `langchain.text_splitter`
- Create a text splitter configured with `chunk_size=1000` and `chunk_overlap=200`. These values are somewhat arbitrary; you'll need to try different ones to see which best serve your use case
- Split the `raw_documents` and store them as `documents`, using the `.split_documents()` method
# Import RecursiveCharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Create the text splitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200
)
# Split the documents
documents = splitter.split_documents(raw_documents)
# Preview the first chunk
documents[0]
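With the documents chunked, you can now estimate the embedding cost quoted earlier. A sketch using `tiktoken` (it assumes Ada v2's `cl100k_base` encoding and the $0.0004 / 1K tokens price above):
# Count tokens across all chunks and estimate the Ada v2 embedding cost
import tiktoken
encoding = tiktoken.get_encoding("cl100k_base")  # encoding used by text-embedding-ada-002
total_tokens = sum(len(encoding.encode(doc.page_content)) for doc in documents)
estimated_cost = total_tokens / 1000 * 0.0004  # $0.0004 per 1K tokens
print(f"{total_tokens} tokens -> ~${estimated_cost:.2f} to embed")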