Live training | 2023-06-13 | Building AI Applications with LangChain and GPT

You've probably talked to ChatGPT through the web interface, or used the API with the openai Python package, and wondered, "What if I could teach it about my own data?" Today we're going to build such an application using LangChain, a framework for developing applications powered by language models.

In today's session, we'll build a chatbot powered by GPT-3.5 that can answer questions about LangChain, because it will have access to the LangChain documentation. We'll cover:

Getting set up with an OpenAI developer account and integrating it with Workspace
Installing the LangChain package
Preparing the data
Embedding the data using OpenAI's Embedding API, and getting a cost estimate for this operation
Storing the data in a vector database
Querying the vector database
Putting together a basic chat application to "talk to the LangChain docs"

Before you begin

Create a developer account with OpenAI

Go to the API signup page.
Create your account (you'll need to provide your email address and your phone number).

Go to the API keys page.
Create a new secret key.

Take a copy of it. (If you lose it, delete the key and create a new one.)

Add a payment method

OpenAI sometimes provides free credits for the API, but it's not clear if that is worldwide or what the conditions are. You may need to add debit/credit card details.

We will use two APIs:

The Chat API with the gpt-3.5-turbo model (cost: $0.002 / 1K tokens)
The Embedding API with the Ada v2 model (cost: $0.0004 / 1K tokens)

In total, the Chat API (used for completions) should cost less than $0.1 and embedding should cost around $0.6. This notebook provides embeddings already, so you can skip the embedding step.

Go to the Payment Methods page.
Click Add payment method.

Fill in your card details.

Set up a Workspace integration

In Workspace, click on Integrations.

Click on the "Create integration" plus button.

Select an "Environment Variables" integration.

In the "Name" field, type OPENAI_API_KEY. In the "Value" field, paste in your secret key (it starts with sk-).

Click "Create", and connect the new integration.
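
Once the integration is connected, the secret key should be available to the notebook as an environment variable. The following is a minimal sketch of reading it, assuming the integration exposes OPENAI_API_KEY through os.environ. The helper name get_openai_api_key is hypothetical and not part of any library.

```python
import os


def get_openai_api_key(env=os.environ):
    """Return the OpenAI secret key from the environment, or raise if it looks wrong.

    Hypothetical helper: assumes the Workspace integration sets OPENAI_API_KEY.
    """
    key = env.get("OPENAI_API_KEY", "")
    if not key.startswith("sk-"):
        raise RuntimeError(
            "OPENAI_API_KEY is missing or does not start with 'sk-'. "
            "Check that the integration is connected."
        )
    return key


# Usage in the notebook:
# api_key = get_openai_api_key()
```

Failing early with a clear message is usually easier to debug than an authentication error raised later by an API call.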

Task 0: Setup

For the purpose of this training, we'll need to install a few packages:

langchain: The LangChain framework
chromadb: The package we'll use for the vector database
tiktoken: A tokenizer we'll use to count GPT-3 tokens

# install langchain (version 0.0.191)
!pip install langchain==0.0.191
# install chromadb
!pip install chromadb
# install tiktoken
!pip install tiktoken
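
To confirm that the installation worked, one option is to look up the installed versions. This is a sketch that uses only the Python standard library. The helper name installed_versions is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Report the three packages used in this training
for name, ver in installed_versions(["langchain", "chromadb", "tiktoken"]).items():
    print(f"{name}: {ver or 'not installed'}")
```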

Task 1: Load data

To be able to embed and store data, we need to provide LangChain with Documents. This is easy to achieve in LangChain thanks to Document Loaders. In our case, we're targeting documentation hosted on Read the Docs, for which there is a dedicated loader: ReadTheDocsLoader. In the folder rtdocs, you'll find all the HTML files from the LangChain documentation (https://python.langchain.com/en/latest/index.html).

How did we obtain the data

These files were downloaded by executing this Linux command:

wget -r -A.html -P rtdocs https://python.langchain.com/en/latest/

We urge you **NOT** to execute this during the live training, as it will crawl and download the full LangChain documentation site (~1000 files). This operation is heavy and could disrupt the site, especially if hundreds of learners run it at the same time!

Our first task is to load these HTML files as documents that we can use with LangChain, using the ReadTheDocsLoader. It reads each HTML file in the directory, removes the HTML tags to keep only the text, and returns the result as a Document. At the end of this task, we'll have a variable raw_documents containing a list of Document objects: one Document per HTML file.

Note that in this step we won't actually load the documents into a database; we're simply loading them into a list.
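
To make the idea of "one Document per HTML file" concrete, here is a simplified stand-in built on the standard library. It is not the ReadTheDocsLoader implementation: the Document dataclass, html_to_text, and load_html_dir are hypothetical names for illustration, and the metadata field is an assumption.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from pathlib import Path


@dataclass
class Document:
    """Illustrative stand-in: the text of one page plus some metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Strip tags from an HTML string and return the remaining text, one piece per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def load_html_dir(directory):
    """Return one Document per .html file found under the directory."""
    docs = []
    for path in sorted(Path(directory).rglob("*.html")):
        text = html_to_text(path.read_text(encoding="utf-8", errors="ignore"))
        docs.append(Document(page_content=text, metadata={"source": str(path)}))
    return docs
```

The result is a plain Python list, which matches the note above: nothing is stored in a database at this stage.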

Instructions

Import ReadTheDocsLoader from langchain.document_loaders
Create the loader, pointing to the rtdocs/python.langchain.com/en/latest directory and enabling the HTML parser feature with features='html.parser'
Load the data in raw_documents by calling loader.load()

# Import ReadTheDocsLoader
from langchain.document_loaders import ReadTheDocsLoader

# Create a loader for the `rtdocs/python.langchain.com/en/latest` folder
loader = ReadTheDocsLoader("rtdocs/python.langchain.com/en/latest", features="html.parser")

# Load the data
raw_documents = loader.load()

# Print the text of one document to check the result
print(raw_documents[353].page_content)

Task 2: Slice the documents into smaller chunks

In the previous step, we turned each HTML file into a Document. Some of these files are very long, potentially too long to embed in one piece. It is also good practice to avoid embedding large documents, for two reasons:

Long documents often contain several concepts, and retrieval is easier if each concept is indexed separately;
Retrieved documents will be injected into a prompt, so keeping them short keeps the prompt small(ish).

LangChain has a collection of tools to do this: Text Splitters. In our case, we'll use the most straightforward one: the Recursive Character Text Splitter. It recursively reduces the input by splitting it by paragraph, then by sentence, then by word, as needed, until each chunk is small enough.
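
The recursive idea can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name recursive_split and the separator list are assumptions, and the sketch ignores chunk overlap.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters.

    Tries the coarsest separator first (paragraphs) and only falls back to
    finer ones (lines, words, raw characters) for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []

    # No separator left: cut the text at fixed positions as a last resort
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # The piece still fits: keep growing the current chunk
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: split it with a finer separator
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaaa bbbb cccc\n\ndddd eeee", chunk_size=9))
# prints ['aaaa bbbb', 'cccc', 'dddd eeee']
```

The first paragraph is too long for one chunk, so it is split again on spaces, while the second paragraph fits and is kept whole.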

Instructions

Import the RecursiveCharacterTextSplitter from langchain.text_splitter
Create a text splitter configured with chunk_size=1000 and chunk_overlap=200
These values are arbitrary and you'll need to try different ones to see which best serve your use case
Split raw_documents using the .split_documents() method and store the result as documents

# Import RecursiveCharacterTextSplitter


# Create the text splitter


# Split the documents
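
One possible way to complete the cell above, following the instructions, is sketched here. It assumes langchain 0.0.191 is installed as in Task 0. The wrapper function split_raw_documents is a hypothetical name, used only so that the import happens when the function is called.

```python
def split_raw_documents(raw_documents, chunk_size=1000, chunk_overlap=200):
    """Split a list of LangChain Documents into smaller chunks."""
    # Import RecursiveCharacterTextSplitter (requires the langchain package)
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    # Create the text splitter
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )

    # Split the documents
    return splitter.split_documents(raw_documents)


# Usage in the notebook, assuming raw_documents comes from Task 1:
# documents = split_raw_documents(raw_documents)
# print(len(raw_documents), "files ->", len(documents), "chunks")
```

Comparing the number of files with the number of chunks is a quick way to see the effect of different chunk_size and chunk_overlap values.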

Task 3: Count tokens and get a cost estimate of embedding

We're now ready to embed our documents. Before we do so, we'd like to get an idea of how big they are and how much they will cost to embed. To do so, we'll use the tiktoken library (no relation to TikTok, there is no dancing involved). tiktoken encodes strings of text into tokens and decodes them back. In our case, we're mostly interested in how many tokens our documents translate to.

💡 To better understand what a token is to GPT, head to OpenAI's Tokenizer page where you can see how a text translates to tokens.

Prices for the different models can be found on OpenAI's pricing page.

Instructions

Import tiktoken
Create a tokenizer for the text-embedding-ada-002 model using the .encoding_for_model() method
Count tokens in each document using the .encode() method
Calculate the sum of all tokens
Calculate a cost estimate. The text-embedding-ada-002 model costs $0.0004 per 1,000 tokens

# Import tiktoken


# Create an encoder 


# Count tokens in each document


# Calculate the sum of all token counts


# Calculate a cost estimate
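
One possible way to complete the cell above is sketched here, assuming tiktoken is installed as in Task 0 and using the price of $0.0004 per 1,000 tokens stated in the instructions. The function names count_tokens and estimate_embedding_cost are hypothetical.

```python
def count_tokens(texts, model="text-embedding-ada-002"):
    """Return the number of tokens in each text, as counted by tiktoken."""
    # Imported here so the cost arithmetic below also works without tiktoken
    import tiktoken

    # Create an encoder for the embedding model
    encoder = tiktoken.encoding_for_model(model)

    # Count tokens in each text
    return [len(encoder.encode(text)) for text in texts]


def estimate_embedding_cost(token_counts, price_per_1k_tokens=0.0004):
    """Return (total_tokens, estimated_cost_in_dollars)."""
    # Calculate the sum of all token counts
    total_tokens = sum(token_counts)

    # Calculate a cost estimate: the price is quoted per 1,000 tokens
    return total_tokens, total_tokens / 1000 * price_per_1k_tokens


# Usage in the notebook, assuming documents comes from Task 2:
# token_counts = count_tokens([doc.page_content for doc in documents])
# total, cost = estimate_embedding_cost(token_counts)
# print(f"{total} tokens, about ${cost:.2f}")
```

Keeping the price as a parameter makes it easy to update the estimate if the pricing changes.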

