Code-along | 2024-01-11 | Chat with Documents using GPT and LangChain | Andrea Valenzuela & Josep Ferrer

    Chat with Your Documents Using GPT & LangChain

    Objectives:

    • Learn how to effectively load & store documents using LangChain
• Build a retrieval-augmented generation (RAG) pipeline for querying data
    • Build a question-answering bot that answers questions based on your documents

    You can learn more about the LangChain library in the following links:

    • How to Make Large Language Models Play Nice with Your Software Using LangChain
    • 6 Problems of LLMs That LangChain is Trying to Assess

    Let's start by understanding our main goal:

    First:

• Take a set of PDFs.
• Break them into chunks of text.
• Embed each chunk into a vector representation.
• Store the vectors in a vector database (FAISS, Chroma, Pinecone...).
• Once the vectors are persisted in the database, we can take user queries, embed them, and find the most similar chunk vectors.
• The chunks are ranked according to how relevant they are to the question and are used to contextualize our LLM.

IMPORTANT: The LLM doesn't actually know what the PDFs contain. We provide it with a question and the retrieved context, and take advantage of its natural language abilities to generate an accurate answer.
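To make this concrete, here is a minimal sketch of that final step. It assumes a FAISS vector store db has already been built from our chunks; both db and the chatgpt LLM wrapper are placeholders for objects we create later in this notebook.

# Minimal sketch: retrieve relevant chunks and contextualize the LLM.
# Assumes `db` is a FAISS vector store built from our PDF chunks and
# `chatgpt` is an LLM wrapper -- both are created later in this notebook.
query = "What is the main topic of the documents?"

# Embed the query and find the most similar chunks in the vector store
relevant_chunks = db.similarity_search(query, k=3)

# Stuff the retrieved chunks into the prompt as context for the LLM
context = "\n\n".join(chunk.page_content for chunk in relevant_chunks)
prompt = f"Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

# The LLM answers from the provided context, not from prior knowledge of the PDFs
print(chatgpt(prompt))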

Installation, Imports, and API Keys

    We need to make sure our environment has the following packages:

• Install langchain (we pin version 0.0.184, as later versions cause problems in the DataCamp workspace)
    • Install tiktoken, wikipedia, pypdf, faiss-cpu, pinecone-client.
    !pip install langchain==0.0.184
    !pip install tiktoken
    !pip install wikipedia
    !pip install pypdf
    !pip install faiss-cpu
    !pip install pinecone-client

Before starting, make sure you have available:

    • OpenAI API Key
    • Pinecone API Key and environment.

To get our API keys, we can set them in a .env file and load them into our environment using the load_dotenv() command, or define them directly.

    • To obtain OpenAI API Keys, you can follow the instructions here.
    • To obtain Pinecone API keys, you can follow the instructions here.
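For reference, the .env file in your working directory would contain lines like these (the values below are placeholders, matching the commented examples in the next cell):

# .env -- placeholder values, replace with your own keys
OPENAI_API_KEY=sk-...
PINECONE_API_KEY=34...
PINECONE_ENV_KEY=gcp-starter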
    # Basics
    import os
    import pandas as pd
    import matplotlib.pyplot as plt
    from dotenv import load_dotenv
    
    # LangChain Training
    # LLM
    from langchain.llms import OpenAI
    
    # Document Loader
    from langchain.document_loaders import PyPDFLoader 
    
    # Splitter
    from langchain.text_splitter import RecursiveCharacterTextSplitter 
    
    # Tokenizer
    from transformers import GPT2TokenizerFast  
    
    # Embedding
    from langchain.embeddings import OpenAIEmbeddings 
    
# Vector Database
from langchain.vectorstores import FAISS, Pinecone  # FAISS is local and temporary, Pinecone is cloud-based and persistent
    
    # Chains
    #from langchain.chains.question_answering import load_qa_chain
    #from langchain.chains import ConversationalRetrievalChain
# We can load our keys directly from a .env file
#load_dotenv()

openai_api_key = os.environ["OPENAI_API_KEY"]
pinecone_api_key = os.environ["PINECONE_API_KEY"]
pinecone_env_key = os.environ["PINECONE_ENV_KEY"]
    
    # Alternatively, you can set the API keys as follows:
    #OPENAI_API_KEY   = "sk-"
    #PINECONE_API_KEY = "34..."
    #PINECONE_ENV_KEY = "gcp-starter"

    PART 1: LANGCHAIN BASICS

🎯 Objective: Understand what the LangChain library is and the elements required to build a simple pipeline to query our documents.

    What is LangChain?

    LangChain is a framework for developing applications powered by language models.

    LangChain makes the hardest parts of working with AI models easier in two main ways:

    1. Data-aware - Bring external data, such as your files, other applications, and API data, to your LLMs
2. Agentic - Allow your LLMs to interact with their environment via decision making. Use LLMs to help decide which action to take next.

    Why LangChain?

    1. Components - Abstractions for working with language models, along with a collection of implementations for each abstraction. Components are modular and easy-to-use, whether you are using the rest of the LangChain framework or not

2. Chains - LangChain provides out-of-the-box support for using and customizing 'chains' - a series of actions strung together; a structured assembly of components for accomplishing specific higher-level tasks (see the sketch after this list).

    3. Speed 🚢 - This team ships insanely fast. You'll be up to date with the latest LLM features.

    4. Community 👥 - Wonderful discord and community support, meet ups, hackathons, etc.
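For instance, a minimal chain strings a prompt template and an LLM together into one reusable step. This sketch is purely illustrative and is not part of the pipeline we build later:

from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# A prompt template and an LLM chained into a single reusable component
prompt = PromptTemplate(
    input_variables=["topic"],
    template="Give me three fun facts about {topic}.",
)
chain = LLMChain(llm=OpenAI(temperature=0), prompt=prompt)

print(chain.run("the Eiffel Tower"))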

    Though the usage of LLMs can be straightforward (text-in, text-out), when trying to build complex applications you'll quickly notice friction points.

LangChain helps once you start developing more complicated applications, letting you manage LLMs the way you want.

    LangChain Components

The LangChain library contains multiple elements to ease the process of building complex applications. In this module we will focus mainly on 10 elements:

    To load and process our documents

    • Document Loaders
    • Text Splitters
    • Chat Messages (Optional)

    To talk with our documents using NLP

    • LLM model (GPT, Llama...)
    • Chains
    • Natural Language Retrieval
    • Metadata and Indexes
    • Memory (Optional)

    Both Processes

    • Text Embedding (OpenAI or Open-source models)
    • Vector Stores
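As a preview of the first group (Document Loaders and Text Splitters), here is a minimal sketch of loading a PDF and splitting it into chunks. The file name example.pdf is a placeholder, and the chunk sizes are only illustrative:

from langchain.document_loaders import PyPDFLoader  # already imported above
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load a PDF into a list of Document objects (one per page)
loader = PyPDFLoader("example.pdf")  # placeholder path
pages = loader.load()

# Split the pages into overlapping chunks so each one fits comfortably
# in the LLM's context window
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(pages)

print(f"{len(pages)} pages -> {len(chunks)} chunks")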

    The Model - Large Language Model of our choice

An AI-powered LLM that takes text in and returns text out. If we don't specify a model, LangChain falls back to a default one, but we can explicitly choose the model of our preference.

You can check the list of all available models here

    from langchain.llms import OpenAI
    
# temperature=0 makes the output as deterministic as possible
chatgpt = OpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0,
)
    
    prompt="Please, tell me some funny jokes"
    
    print(chatgpt(prompt))

    Chat Messages

LangChain allows us to segment prompts into three main types (System, Human, AI):

• System - Helpful background context that tells the AI its high-level behavior.
• Human - Messages that represent the user input.
• AI - Messages that show the responses of the AI model; they work as examples for the model.

    For more, see OpenAI's documentation

    from langchain.chat_models import ChatOpenAI
    from langchain.schema import HumanMessage, SystemMessage, AIMessage
    
chatgpt = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0,
)

high_level_behavior = """
                       You are an AI bot that helps people decide where to travel.
                       Always recommend three destinations with a short sentence for each.
                      """
    
    response = chatgpt(
        [
            SystemMessage(content=high_level_behavior),
            AIMessage(content="Hello! I am a traveller assistant, how can I help you?"),
            HumanMessage(content="Where should I travel next?"),
        ]
    )
    
    print(response.content)

    You can also pass more chat history with responses from the AI

response = chatgpt(
    [
        SystemMessage(content=high_level_behavior),
        AIMessage(content="Hello! I am a traveller assistant, how can I help you?"),
        HumanMessage(content="Where should I travel next?"),
        # The assistant's follow-up question belongs in an AIMessage
        AIMessage(content="What do you enjoy doing?"),
        HumanMessage(content="I love going to museums."),
    ]
)
    
    print(response.content)

    Text Embedding Model

When documents or string variables are too long, things get quite complicated.

To be able to process them, we can embed string variables, converting them into vectors (a series of numbers that hold the semantic 'meaning' of your text).

Embeddings are mainly used when comparing different pieces of text or when dealing with huge texts.
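Here is a small sketch of what this looks like in practice, using the OpenAI key loaded earlier; the cosine-similarity computation with numpy is just for illustration:

from langchain.embeddings import OpenAIEmbeddings
import numpy as np

embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

# Convert two pieces of text into vectors that capture their meaning
vec_a = embeddings.embed_query("I love visiting museums.")
vec_b = embeddings.embed_query("Art galleries are my favourite places.")

# Semantically similar texts produce vectors pointing in similar directions
cosine = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(f"Cosine similarity: {cosine:.3f}")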