Live training | 2023-06-13 | Building AI Applications with LangChain and GPT

    You've probably talked to ChatGPT through the web interface, or used the API with the openai Python package, and wondered: "What if I could teach it about my own data?" Today we're going to build such an application using LangChain, a framework for developing applications powered by language models.

    In today's session, we'll build a chatbot powered by GPT-3.5 that can answer questions about LangChain, as it will have knowledge of the LangChain documentation. We'll cover:

    • Getting set up with an OpenAI developer account and integrating it with Workspace
    • Installing the LangChain package
    • Preparing the data
    • Embedding the data using OpenAI's Embedding API, and getting a cost estimate for this operation
    • Storing the data in a vector database
    • Querying the vector database
    • Putting together a basic chat application to "talk to the LangChain docs"

    Before you begin

    Create a developer account with OpenAI

    1. Go to the API signup page.

    2. Create your account (you'll need to provide your email address and your phone number).

    3. Go to the API keys page.

    4. Create a new secret key.

    5. Take a copy of it. (If you lose it, delete the key and create a new one.)

    Add a payment method

    OpenAI sometimes provides free credits for the API, but it's not clear if that is worldwide or what the conditions are. You may need to add debit/credit card details.

    We will use 2 APIs:

    • The Chat API with the gpt-3.5-turbo model (cost: $0.002 / 1K tokens)
    • The Embedding API with the Ada v2 model (cost: $0.0004 / 1K tokens)

    In total, the Chat API (used for completions) should cost less than $0.1 and embedding should cost around $0.6. This notebook provides embeddings already, so you can skip the embedding step.

    1. Go to the Payment Methods page.

    2. Click Add payment method.

    3. Fill in your card details.

    Set up a Workspace integration

    1. In Workspace, click on Integrations.
    2. Click on the "Create integration" plus button.
    3. Select an "Environment Variables" integration.
    4. In the "Name" field, type OPENAI_API_KEY. In the "Value" field, paste in your secret key (starting with sk-).
    5. Click "Create", and connect the new integration.
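
    Once the integration is connected, the key should be available to the notebook as an environment variable. Here is a minimal sketch for checking that, assuming the variable name OPENAI_API_KEY from the steps above. The helper name get_openai_api_key is hypothetical and not part of any library.

```python
import os


def get_openai_api_key(env=None):
    """Return the key stored in OPENAI_API_KEY, or raise a readable error."""
    # `env` can be any mapping; it defaults to the real process environment.
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set. Connect the Environment Variables "
            "integration first."
        )
    return key


try:
    key = get_openai_api_key()
    # Show only a short prefix so the secret never appears in cell output.
    print("Key found, starts with:", key[:3])
except RuntimeError as error:
    print(error)
```

    Failing early with a clear message is usually easier to debug than an authentication error raised later by an API call.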

    Task 0: Setup

    For the purpose of this training, we'll need to install a few packages:

    • langchain: The LangChain framework
    • chromadb: The package we'll use for the vector database
    • tiktoken: A tokenizer we'll use to count GPT-3 tokens
    # install langchain (version 0.0.191)
    !pip install langchain==0.0.191
    # install chromadb
    !pip install chromadb
    # install tiktoken
    !pip install tiktoken
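
    After installing, it can help to confirm which versions ended up in the environment, since only langchain is pinned above. This is a small sketch using the standard library; the helper name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the three packages this training relies on.
for name in ("langchain", "chromadb", "tiktoken"):
    print(name, installed_version(name) or "not installed")
```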

    Task 1: Load data

    To embed and store data, we need to provide LangChain with Documents. LangChain makes this easy with Document Loaders. In our case, we're targeting documentation hosted on Read the Docs, for which there is a dedicated loader, ReadTheDocsLoader. In the folder rtdocs, you'll find all the HTML files from the LangChain documentation (

    How did we obtain the data

    These files were downloaded by executing this Linux command:

    wget -r -A.html -P rtdocs

    We urge you **NOT** to execute this during the live training, as it will scan and download the full langchain doc site (~1000 files). This operation may be heavy and could disrupt the site, especially if hundreds of learners do it all at once!

    Our first task is to load these HTML files as documents that we can use with LangChain: we're going to use the ReadTheDocsLoader. It reads the directory containing all the HTML files and, for each file, removes the HTML tags to keep only the text and returns it as a Document. At the end of this task, we'll have a variable raw_documents containing a list of Document objects: one Document per HTML file.

    Note that in this step we won't actually load the documents into a database. We're simply loading them into a list.


    1. Import ReadTheDocsLoader from langchain.document_loaders
    2. Create the loader, pointing to the rtdocs/ directory and enabling the HTML parser feature with features='html.parser'
    3. Load the data in raw_documents by calling loader.load()
    # Import ReadTheDocsLoader
    # Create a loader for the `rtdocs/` folder
    # Load the data
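
    One possible way to complete the three steps above is sketched below. It assumes the pinned langchain version and an rtdocs/ folder next to the notebook. The wrapper function load_raw_documents is a hypothetical convenience, and the import sits inside it so that defining the function does not require langchain to be installed yet.

```python
def load_raw_documents(path="rtdocs/"):
    """Load every HTML file under `path` as a LangChain Document."""
    # Step 1: import the loader (lazily, inside the function).
    from langchain.document_loaders import ReadTheDocsLoader

    # Step 2: create the loader and select the HTML parser.
    loader = ReadTheDocsLoader(path, features="html.parser")

    # Step 3: load the data, one Document per HTML file.
    return loader.load()
```

    Calling raw_documents = load_raw_documents() and then len(raw_documents) gives a quick check that the number of documents matches the number of HTML files.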

    Task 2: Slice the documents into smaller chunks

    In the previous step, we turned each HTML file into a Document. These files may be very long, and are potentially too large to embed fully. It's also good practice to avoid embedding large documents:

    • Long documents often contain several concepts. Retrieval will be easier if each concept is indexed separately.
    • Retrieved documents will be injected into a prompt, so keeping them short will keep the prompt small(ish).

    LangChain has a collection of tools to do this: Text Splitters. In our case, we'll use the simplest and most straightforward one: the Recursive Character Text Splitter. It recursively reduces the input by splitting it into paragraphs, then sentences, then words, as needed, until each chunk is small enough.
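
    The recursive idea can be sketched in plain Python. This is a simplified illustration and not LangChain's actual implementation: the function name recursive_split and the separator list are assumptions chosen to mirror the paragraph, sentence, word order described above, and overlap between chunks is left out.

```python
def recursive_split(text, chunk_size, separators=("\n\n", ". ", " ")):
    """Split `text` into pieces of at most `chunk_size` characters.

    Tries the coarsest separator first (paragraphs), then falls back to
    finer ones (sentences, words) only for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: cut at fixed positions as a last resort.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            # Keep merging small neighbours into the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: recurse with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaa bbb\n\nccc ddd eee", chunk_size=7))
# prints ['aaa bbb', 'ccc ddd', 'eee']
```

    The first paragraph fits in one chunk and is kept whole, while the second is too long and gets split again at word boundaries.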


    1. Import the RecursiveCharacterTextSplitter from langchain.text_splitter
    2. Create a text splitter configured with chunk_size=1000 and chunk_overlap=200
      These values are arbitrary, and you'll need to try different ones to see which best serve your use case.
    3. Split the raw_documents and store them as documents, using the .split_documents() method
    # Import RecursiveCharacterTextSplitter
    # Create the text splitter
    # Split the documents
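
    One possible way to complete these steps is sketched below, assuming the pinned langchain version and the raw_documents list from Task 1. The wrapper function split_into_chunks is hypothetical, and the import sits inside it so the cell can be defined on its own.

```python
def split_into_chunks(raw_documents, chunk_size=1000, chunk_overlap=200):
    """Split Documents into smaller, overlapping Documents."""
    # Step 1: import the splitter (lazily, inside the function).
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    # Step 2: configure it with the chunk size and overlap from the task.
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )

    # Step 3: split every Document; the result is a new list of Documents.
    return text_splitter.split_documents(raw_documents)
```

    With documents = split_into_chunks(raw_documents), comparing len(documents) to len(raw_documents) shows how many chunks each setting produces, which is a cheap way to try different values.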

    Task 3: Count tokens and get a cost estimate of embedding

    We're now ready to embed our documents. Before we do so, we'd like to get an idea of how big they are and how much they will cost to embed. To do so, we'll use the tiktoken library (no relation to TikTok, there is no dancing involved). tiktoken lets us encode strings of text into tokens and decode them back. In our case, we're mostly interested in how many tokens our documents translate to.

    💡 To better understand what a token is to GPT, head to OpenAI's Tokenizer page where you can see how a text translates to tokens.

    Prices for different models can be found on their pricing page.


    1. Import tiktoken
    2. Create a tokenizer for the text-embedding-ada-002 model using the .encoding_for_model() method
    3. Count tokens in each document using the .encode() method
    4. Calculate the sum of all tokens
    5. Calculate a cost estimate. The text-embedding-ada-002 model costs $0.0004 for 1000 tokens
    # Import tiktoken
    # Create an encoder 
    # Count tokens in each document
    # Calculate the sum of all token counts
    # Calculate a cost estimate
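
    The five steps above could look roughly like this. It is a sketch: the function names count_tokens and estimate_cost are hypothetical, the price is the $0.0004 per 1,000 tokens quoted in this notebook, and the 1,500,000-token figure in the example is only an illustrative number chosen to match the "about $0.6" estimate.

```python
def count_tokens(texts, model="text-embedding-ada-002"):
    """Return the number of tokens in each text, as counted by tiktoken."""
    import tiktoken  # lazy import: only needed when counting real documents

    encoder = tiktoken.encoding_for_model(model)
    return [len(encoder.encode(text)) for text in texts]


def estimate_cost(token_counts, price_per_1k_tokens=0.0004):
    """Return (total_tokens, cost_in_dollars) for the given token counts."""
    total_tokens = sum(token_counts)
    return total_tokens, total_tokens / 1000 * price_per_1k_tokens


# Illustrative numbers only: 1.5 million tokens at $0.0004 per 1K tokens.
total, cost = estimate_cost([1_000_000, 500_000])
print(f"{total} tokens, about ${cost:.2f}")
# prints 1500000 tokens, about $0.60
```

    For the real estimate, the token counts would come from count_tokens([doc.page_content for doc in documents]), where documents is the list produced in Task 2.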

    Task 4: Embed the documents and store embeddings in the vector database

    We're now ready to embed our documents. Since embedding costs money, we'll want to save the embeddings into a database. LangChain can take care of all that using a Vector Store.

    There are plenty of vector stores to choose from (see the full list). Today we'll use Chroma, but you could use any other, as they share the same interface in LangChain. Once again, you'll need to try several of them to see which best fits your use case: some vector stores have specific features (like multimodal or multilingual support), so be sure to check them out.

    Chroma is simple to use and can be persisted to disk. If you do not wish to embed the full set of documents yourself, feel free to skip this step and use the provided folder chroma-data-langchain-docs: we've already embedded all the documents and persisted them in this folder.


    1. Import Chroma from langchain.vectorstores
    2. Import OpenAIEmbeddings from langchain.embeddings.openai
    3. Create the embedding function
    4. Create a database from our documents, using Chroma.from_documents(). Pass the documents, embedding function and persist_directory.
      Warning: executing this will embed thousands of documents and will cost about $0.6
    5. Persist the data to disk by calling .persist() on the database
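
    One possible way to complete these steps is sketched below, assuming the pinned langchain version, the documents list from Task 2, and an OPENAI_API_KEY environment variable. The wrapper function build_vector_store is hypothetical. Running it for real calls the paid Embedding API for every document.

```python
def build_vector_store(documents, persist_directory):
    """Embed `documents` with OpenAI and persist them in a Chroma database.

    Warning: this calls the paid Embedding API for every document.
    """
    # Steps 1 and 2: import Chroma and the OpenAI embeddings (lazily).
    from langchain.embeddings.openai import OpenAIEmbeddings
    from langchain.vectorstores import Chroma

    # Step 3: the embedding function reads OPENAI_API_KEY from the environment.
    embedding_function = OpenAIEmbeddings()

    # Step 4: embed the documents and store them in the given directory.
    db = Chroma.from_documents(
        documents,
        embedding_function,
        persist_directory=persist_directory,
    )

    # Step 5: write the embeddings to disk so they can be reused later.
    db.persist()
    return db
```

    When trying this out, pass a new directory name of your own (for example "my-chroma-data", a made-up name) rather than the provided chroma-data-langchain-docs folder, so the pre-computed embeddings stay untouched.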