Code-along | 2024-01-11 | Chat with Documents using GPT and LangChain | Andrea Valenzuela & Josep Ferrer

    Chat with Your Documents Using GPT & LangChain

    Objectives:

    • Learn how to effectively load & store documents using LangChain
• Build a retrieval-augmented generation (RAG) pipeline for querying data
    • Build a question-answering bot that answers questions based on your documents

    You can learn more about the LangChain library in the following links:

    • How to Make Large Language Models Play Nice with Your Software Using LangChain
    • 6 Problems of LLMs That LangChain is Trying to Assess

    Let's start by understanding our main goal:

    First:

• Take a set of PDFs.
• Break them into chunks of text.
• Embed each chunk into a vector representation.
• Store the vectors in a vector database (FAISS, Chroma, Pinecone...).
• Once the vectors are persisted in the database, we can take user queries, embed them, and find the most similar chunk vectors.
• The chunks are ranked according to how relevant they are to the question and are used to contextualize our LLM.

IMPORTANT: The LLM doesn't actually know what the PDFs contain. We provide it with a question and the retrieved context, and take advantage of its natural language abilities to generate an accurate answer.
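To make this concrete, here is a minimal sketch of that final step. It assumes a FAISS vector store db has already been built from our chunks; both db and the chatgpt LLM wrapper are placeholders for objects we create later in this notebook.

# Minimal sketch: retrieve relevant chunks and contextualize the LLM.
# Assumes `db` is a FAISS vector store built from our PDF chunks and
# `chatgpt` is an LLM wrapper -- both are created later in this notebook.
query = "What is the main topic of the documents?"

# Embed the query and find the most similar chunks in the vector store
relevant_chunks = db.similarity_search(query, k=3)

# Stuff the retrieved chunks into the prompt as context for the LLM
context = "\n\n".join(chunk.page_content for chunk in relevant_chunks)
prompt = f"Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

# The LLM answers from the provided context, not from prior knowledge of the PDFs
print(chatgpt(prompt))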

Installation, Imports, and API Keys

    We need to make sure our environment has the following packages:

• Install langchain (we pin version 0.0.184, as later versions cause problems in the DataCamp workspace)
    • Install tiktoken, wikipedia, pypdf, faiss-cpu, pinecone-client.
    !pip install langchain==0.0.184
    !pip install tiktoken
    !pip install wikipedia
    !pip install pypdf
    !pip install faiss-cpu
    !pip install pinecone-client

Before starting, make sure you have available:

    • OpenAI API Key
    • Pinecone API Key and environment.

To get our API keys, we can set them in a .env file and load them into our environment using the load_dotenv() command, or define them directly.

    • To obtain OpenAI API Keys, you can follow the instructions here.
    • To obtain Pinecone API keys, you can follow the instructions here.
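For reference, the .env file in your working directory would contain lines like these (the values below are placeholders, matching the commented examples in the next cell):

# .env -- placeholder values, replace with your own keys
OPENAI_API_KEY=sk-...
PINECONE_API_KEY=34...
PINECONE_ENV_KEY=gcp-starter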
    # Basics
    import os
    import pandas as pd
    import matplotlib.pyplot as plt
    from dotenv import load_dotenv
    
    # LangChain Training
    # LLM
    from langchain.llms import OpenAI
    
    # Document Loader
    from langchain.document_loaders import PyPDFLoader 
    
    # Splitter
    from langchain.text_splitter import RecursiveCharacterTextSplitter 
    
    # Tokenizer
    from transformers import GPT2TokenizerFast  
    
    # Embedding
    from langchain.embeddings import OpenAIEmbeddings 
    
# Vector Database
from langchain.vectorstores import FAISS, Pinecone  # FAISS is local and temporary, Pinecone is cloud-based and persistent
    
    # Chains
    #from langchain.chains.question_answering import load_qa_chain
    #from langchain.chains import ConversationalRetrievalChain
# We can load our keys directly from a .env file
#load_dotenv()

openai_api_key = os.environ["OPENAI_API_KEY"]
pinecone_api_key = os.environ["PINECONE_API_KEY"]
pinecone_env_key = os.environ["PINECONE_ENV_KEY"]
    
    # Alternatively, you can set the API keys as follows:
    #OPENAI_API_KEY   = "sk-"
    #PINECONE_API_KEY = "34..."
    #PINECONE_ENV_KEY = "gcp-starter"

    PART 1: LANGCHAIN BASICS

🎯 Objective: Understand what the LangChain library is and the elements required to build a simple pipeline to query our documents.

    What is LangChain?

    LangChain is a framework for developing applications powered by language models.

    LangChain makes the hardest parts of working with AI models easier in two main ways:

    1. Data-aware - Bring external data, such as your files, other applications, and API data, to your LLMs
2. Agentic - Allow your LLMs to interact with their environment via decision making. Use LLMs to help decide which action to take next.

    Why LangChain?

    1. Components - Abstractions for working with language models, along with a collection of implementations for each abstraction. Components are modular and easy-to-use, whether you are using the rest of the LangChain framework or not

2. Chains - LangChain provides out-of-the-box support for using and customizing 'chains' - a series of actions strung together; a structured assembly of components for accomplishing specific higher-level tasks (see the sketch after this list).

    3. Speed 🚢 - This team ships insanely fast. You'll be up to date with the latest LLM features.

    4. Community 👥 - Wonderful discord and community support, meet ups, hackathons, etc.
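For instance, a minimal chain strings a prompt template and an LLM together into one reusable step. This sketch is purely illustrative and is not part of the pipeline we build later:

from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# A prompt template and an LLM chained into a single reusable component
prompt = PromptTemplate(
    input_variables=["topic"],
    template="Give me three fun facts about {topic}.",
)
chain = LLMChain(llm=OpenAI(temperature=0), prompt=prompt)

print(chain.run("the Eiffel Tower"))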

    Though the usage of LLMs can be straightforward (text-in, text-out), when trying to build complex applications you'll quickly notice friction points.

LangChain helps once you start developing more complicated applications, letting you manage LLMs the way you want.

    LangChain Components

The LangChain library contains multiple elements to ease the process of building complex applications. In this module we will focus mainly on 10 elements:

    To load and process our documents

    • Document Loaders
    • Text Splitters
    • Chat Messages (Optional)

    To talk with our documents using NLP

    • LLM model (GPT, Llama...)
    • Chains
    • Natural Language Retrieval
    • Metadata and Indexes
    • Memory (Optional)

    Both Processes

    • Text Embedding (OpenAI or Open-source models)
    • Vector Stores
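As a preview of the first group (Document Loaders and Text Splitters), here is a minimal sketch of loading a PDF and splitting it into chunks. The file name example.pdf is a placeholder, and the chunk sizes are only illustrative:

from langchain.document_loaders import PyPDFLoader  # already imported above
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load a PDF into a list of Document objects (one per page)
loader = PyPDFLoader("example.pdf")  # placeholder path
pages = loader.load()

# Split the pages into overlapping chunks so each one fits comfortably
# in the LLM's context window
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(pages)

print(f"{len(pages)} pages -> {len(chunks)} chunks")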

    The Model - Large Language Model of our choice

An AI-powered LLM that takes text in and returns text out. If we don't specify a model, LangChain falls back to a default one, but we can explicitly choose the model of our preference.

You can check the list of all available models here

    from langchain.llms import OpenAI
    
# temperature=0 makes the output as deterministic as possible
chatgpt = OpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0,
)
    
    prompt="Please, tell me some funny jokes"
    
    print(chatgpt(prompt))

    Chat Messages

LangChain allows us to segment prompts into three main types (System, Human, AI):

• System - Helpful background context that tells the AI its high-level behavior.
• Human - Messages that represent the user input.
• AI - Messages that show the responses of the AI model; they work as examples for the model.

    For more, see OpenAI's documentation

    from langchain.chat_models import ChatOpenAI
    from langchain.schema import HumanMessage, SystemMessage, AIMessage
    
chatgpt = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0,
)

high_level_behavior = """
                       You are an AI bot that helps people decide where to travel.
                       Always recommend three destinations with a short sentence for each.
                      """
    
    response = chatgpt(
        [
            SystemMessage(content=high_level_behavior),
            AIMessage(content="Hello! I am a traveller assistant, how can I help you?"),
            HumanMessage(content="Where should I travel next?"),
        ]
    )
    
    print(response.content)

    You can also pass more chat history with responses from the AI

response = chatgpt(
    [
        SystemMessage(content=high_level_behavior),
        AIMessage(content="Hello! I am a traveller assistant, how can I help you?"),
        HumanMessage(content="Where should I travel next?"),
        # The assistant's follow-up question belongs in an AIMessage
        AIMessage(content="What do you enjoy doing?"),
        HumanMessage(content="I love going to museums."),
    ]
)
    
    print(response.content)

    Text Embedding Model

When documents or string variables are too long, things get quite complicated.

To be able to process them, we can embed string variables, converting them into vectors (a series of numbers that hold the semantic 'meaning' of your text).

Embeddings are mainly used when comparing different pieces of text or when dealing with huge texts.
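Here is a small sketch of what this looks like in practice, using the OpenAI key loaded earlier; the cosine-similarity computation with numpy is just for illustration:

from langchain.embeddings import OpenAIEmbeddings
import numpy as np

embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

# Convert two pieces of text into vectors that capture their meaning
vec_a = embeddings.embed_query("I love visiting museums.")
vec_b = embeddings.embed_query("Art galleries are my favourite places.")

# Semantically similar texts produce vectors pointing in similar directions
cosine = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(f"Cosine similarity: {cosine:.3f}")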