
Introduction to LangChain for Data Engineering & Data Applications

LangChain is a framework for incorporating AI from large language models (LLMs) into data pipelines and applications. This tutorial provides an overview of what you can do with LangChain, including the problems it solves and examples of data use cases.
Apr 2023 · 11 min read

Large language models (LLMs) like OpenAI's GPT, Google's BERT, and Meta's LLaMA are revolutionizing every industry through their power to generate almost any text you can imagine, from marketing copy to data science code to poetry. While ChatGPT has taken the lion's share of attention through its intuitive chat interface, there are many more opportunities for making use of LLMs by incorporating them into other software.

For example, DataCamp Workspace includes an AI Assistant that lets you write or improve the code and text in your analysis, and DataCamp's interactive courses include an "explain my error" feature to help you understand where you made a mistake. These features are powered by GPT and accessed via the OpenAI API.

LangChain provides a framework on top of several APIs for LLMs. It is designed to make software developers and data engineers more productive when incorporating LLM-based AI into their applications and data pipelines.

This tutorial details the problems that LangChain solves and its main use cases, so you can understand why and where to use it. Before reading, it is helpful to understand the basic idea of an LLM. The How NLP is Changing the Future of Data Science tutorial is a good place to refresh your knowledge.

What problems does LangChain solve?

There are essentially two workflows for interacting with LLMs.

  1. "Chatting" involves writing a prompt, sending it to the AI, and getting a text response back.
  2. "Embedding" involves writing a prompt, sending it to the AI, and getting a numeric array response back.

Both the chatting and embedding workflows have some problems that LangChain tries to address.
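
To make the two workflows concrete, here is a minimal sketch using LangChain's OpenAI wrappers. It assumes an OpenAI API key is set in the OPENAI_AP I_KEY environment variable; other supported providers work similarly.

```python
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings

llm = OpenAI()

# "Chatting": send a text prompt, get a text response back.
response = llm("Suggest three names for a data engineering newsletter.")
print(response)

# "Embedding": send text, get a numeric array back.
embeddings = OpenAIEmbeddings()
vector = embeddings.embed_query("data engineering newsletter")
print(len(vector))  # a list of floats, e.g. 1,536 dimensions
```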

Prompts Are Full of Boilerplate Text

One of the problems with both chatting and embedding is that creating a good prompt involves much more than just defining a task to complete. You also need to describe the AI's personality and writing style and include instructions to encourage factual accuracy. That is, you might want to write a simple prompt describing a task:

Write an outline for a 500-word blog post targeted at teenagers about the health benefits of doing yoga.

However, to get a good response, you need to write something more like:

You are an expert sports scientist. Your writing style is casual but terse.

Write an outline for a 500-word blog post targeted at teenagers about the health benefits of doing yoga.

Only include factually correct information. Explain your reasoning.

Much of this boilerplate copy in the prompt will be the same from one prompt to the next. Ideally, you want to write it once, and have it automatically be included in whichever prompts you want.

LangChain solves the problem of boilerplate text in prompts by providing prompt templates. These combine the useful prompt input (in this case, the text about writing a blog post on yoga) with the boilerplate (the writing style and request for factually correct information).
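
For example, the yoga prompt above can be split into a reusable template and the task-specific text. Here's a rough sketch (the task variable name is just for illustration):

```python
from langchain.prompts import PromptTemplate

# The boilerplate is written once; only the task changes between prompts.
template = """You are an expert sports scientist. Your writing style is casual but terse.

{task}

Only include factually correct information. Explain your reasoning."""

prompt = PromptTemplate(input_variables=["task"], template=template)

print(prompt.format(
    task="Write an outline for a 500-word blog post targeted at teenagers "
         "about the health benefits of doing yoga."
))
```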

Responses Are Unstructured

In chatting workflows, the response from the model is just text. However, when the AI is used inside software, it is often desirable to have a structured output that can then be programmed against. For example, if the goal is to generate a dataset, you'd want the response to be provided in a specific format like CSV or JSON.

Assuming that you can write a prompt that will get the AI to consistently provide a response in a suitable format, you need a way to handle that output. LangChain provides output parser tools for just this purpose.
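
For example, LangChain ships a parser for comma-separated lists. A minimal sketch:

```python
from langchain.output_parsers import CommaSeparatedListOutputParser

parser = CommaSeparatedListOutputParser()

# Format instructions to append to your prompt so the model
# answers in a form the parser can handle.
print(parser.get_format_instructions())

# Convert the model's raw text response into a Python list.
print(parser.parse("yoga, pilates, tai chi"))  # ['yoga', 'pilates', 'tai chi']
```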

It's Hard to Switch Between LLMs

While GPT is wildly successful, there are plenty of other LLMs available, and programming directly against one company's API locks your software into that ecosystem. It's perfectly plausible that after building your AI features on GPT, you realize stronger multilingual capabilities are necessary for your product and want to switch to a model like Polyglot. Or you may want to move from calling a hosted AI service to shipping the model inside your product, which calls for a more compact model like Stability AI's StableLM.

LangChain provides an LLM class, and this abstraction makes it much easier to swap one model for another or even make use of multiple models within your software.
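
For example, swapping a hosted OpenAI model for a Cohere model becomes a one-line change (this sketch assumes you have API keys for both providers):

```python
from langchain.llms import OpenAI, Cohere

llm = OpenAI()    # the rest of the code programs against the generic LLM interface,
# llm = Cohere()  # so switching providers is a one-line change

print(llm("Summarize the health benefits of yoga in one sentence."))
```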

LLMs Have Short Memories

The response of an LLM is generated based on the previous conversation (both the user prompts and its previous responses). However, LLMs have a fairly short memory. Even the state-of-the-art GPT-4 defaults to a context window of 8,000 tokens (approximately 6,000 words).

In a chatting workflow, if the conversation continues beyond the memory limits, the responses from the AI can become inconsistent (since it has no recollection of the start of the conversation). Chatbots are an example where this can be a problem. Ideally, you want the chatbot to recall the entire conversation with the customer so as not to provide contradictory information.

LangChain solves the problem of memory by providing chat message history tools. These allow you to feed previous messages back to the LLM to remind it of what it has been talking about.
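
A minimal sketch of the chat message history tools (the messages are illustrative):

```python
from langchain.memory import ChatMessageHistory

history = ChatMessageHistory()
history.add_user_message("My name is Ada, and I analyze customer churn data.")
history.add_ai_message("Nice to meet you, Ada! How can I help with your churn analysis?")

# The accumulated messages can be passed back to the LLM on the next call,
# reminding it of the conversation so far.
print(history.messages)
```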

It's Hard to Integrate LLM Usage Into Pipelines

When used in data pipelines or in software applications, the AI is often only part of a large piece of functionality. For example, you may wish to retrieve some data from a database, pass it to the LLM, then process the response and feed it into another system.

LangChain provides tooling for pipeline-type workflows through chains and agents. Chains are simple objects that string together several components into a linear pipeline. Agents are more sophisticated: they let you apply business logic to decide how the components should interact. For example, you may want conditional logic that uses the output from one LLM call to decide the next step.
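
Here's a rough sketch of a two-step linear pipeline built with chains (the prompts are illustrative):

```python
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain, SimpleSequentialChain

llm = OpenAI()

# Step 1: draft an outline for a topic.
outline_chain = LLMChain(llm=llm, prompt=PromptTemplate(
    input_variables=["topic"],
    template="Write an outline for a blog post about {topic}.",
))

# Step 2: suggest a title based on the outline from step 1.
title_chain = LLMChain(llm=llm, prompt=PromptTemplate(
    input_variables=["outline"],
    template="Suggest a catchy title for a blog post with this outline:\n{outline}",
))

# The chain strings the components together; each output feeds the next input.
pipeline = SimpleSequentialChain(chains=[outline_chain, title_chain])
print(pipeline.run("the health benefits of doing yoga"))
```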

Passing Data to the LLM is Tricky

The text-based nature of LLMs means that it often isn't entirely clear how to pass data to the model. There are two parts to the problem.

Firstly, you need to store data in a format that lets you control which portions of the dataset will be sent to the LLM. (For rectangular datasets like a DataFrame or SQL table, you typically want to send data one row at a time.)

Secondly, you must determine how to include that data in the prompt. The naive approach is to include the whole dataset in the prompt ("prompt stuffing"), but more sophisticated options are available.

LangChain solves the first part of this with indexes. These provide functionality for importing data from databases, JSON files, pandas DataFrames, CSV files, and other formats and storing them in a format suitable for serving them row-wise into an LLM.
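
For example, loading a CSV file so that each row becomes its own document (the filename is hypothetical):

```python
from langchain.document_loaders import CSVLoader

# CSVLoader produces one document per row, which suits serving data row-wise.
docs = CSVLoader(file_path="customers.csv").load()
print(docs[0].page_content)
```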

LangChain provides index-related chains to solve the second part of the problem, with classes implementing four techniques for passing data to the LLM.

Prompt stuffing inserts the whole dataset into the prompt. It's very simple but only works when you have a small dataset.

Map-reduce splits the data into chunks, calls the LLM with an initial prompt on each chunk (the "map" part), then calls the LLM again with a different prompt that combines the responses from the first round (the "reduce" part). This is appropriate any time you'd think of using a "group by" command.

Refine is an iterative Bayesian-style approach, where you run a prompt on the first chunk of data, then, for each additional chunk, use a prompt asking the LLM to refine its previous result based on the new data. This approach works well if you are trying to get the response to converge on a specific output.

Map-rerank is a variation on map-reduce. The first part of the flow is the same: split the data into chunks and call a prompt on each chunk. The difference is that you ask the LLM to provide a confidence score for its response, so you can rank outputs. This approach works well for recommendation-type tasks where the result is a single "best" answer.
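
These four techniques map onto the chain_type argument of LangChain's index-related chains. A rough sketch using a question-answering chain over documents loaded from a hypothetical CSV file:

```python
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.document_loaders import CSVLoader

docs = CSVLoader(file_path="reviews.csv").load()  # hypothetical file

# chain_type selects the technique: "stuff" (prompt stuffing),
# "map_reduce", "refine", or "map_rerank".
chain = load_qa_chain(OpenAI(), chain_type="map_reduce")
print(chain.run(input_documents=docs, question="What do customers complain about most?"))
```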

Which Programming Languages are Supported by LangChain?

LangChain can be used from JavaScript via the langchain package on npm. This is suitable for embedding AI into web applications.

It can also be used from Python via the langchain package (available on PyPI and conda). This is suitable for incorporating AI into data pipelines or Python-based software.

What are the Main Use Cases of LangChain?

LangChain can be used wherever you might want to use an LLM. Here, we'll cover several examples related to data, explaining which features of LangChain are relevant.

Querying Datasets with Natural Language

One of the most transformative use cases of LLMs for data analysis is the ability to write SQL queries (or the equivalent Python or R code) using natural language. This makes exploratory data analysis accessible to people without coding skills.

There are several variations on the workflow. For small datasets, you can get the LLM to generate results directly. This involves loading the data into an appropriate format using LangChain's document loaders, passing the data to the LLM using index-related chains, and parsing the response using an output parser.

More commonly, you'd pass details of the data structure (such as table names, column names and types, and any details like which columns have missing values) to the LLM and ask it for SQL/Python/R code. Then you'd execute the resulting code. This flow is simpler, as it does not require passing data, though LangChain is still useful because it can modularize the steps.
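
LangChain bundles this schema-to-SQL-to-execution flow into a chain. A rough sketch, assuming a local SQLite database named sales.db (any SQLAlchemy-compatible connection URI works):

```python
from langchain import SQLDatabase, SQLDatabaseChain
from langchain.llms import OpenAI

# The chain inspects the table schemas, asks the LLM to write SQL,
# executes the query, and returns a natural-language answer.
db = SQLDatabase.from_uri("sqlite:///sales.db")
chain = SQLDatabaseChain(llm=OpenAI(temperature=0), database=db, verbose=True)

chain.run("Which product had the highest revenue last month?")
```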

Another variation is to include a second call to the LLM to get it to interpret the results. Workflows like this, involving multiple calls to the LLM, are where LangChain really helps. In this case, you can use the chat message history tools to ensure that the interpretation of the results is consistent with the data structure you provided previously.

Interacting with APIs

For data use cases such as creating a data pipeline, including AI from an LLM is often part of a longer workflow that involves other API calls. For example, you may wish to use an API to retrieve stock data or to interact with a cloud platform.

LangChain's chain and agent features, which let you connect these steps in sequence (and use additional business logic for branching pipelines), are ideal for this use case.
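
For example, an agent can be given generic HTTP tools and decide for itself when to call an external API. A rough sketch (the URL and prompt are illustrative):

```python
from langchain.llms import OpenAI
from langchain.agents import initialize_agent, load_tools

llm = OpenAI(temperature=0)

# "requests_all" gives the agent generic HTTP tools for calling external APIs.
tools = load_tools(["requests_all"], llm=llm)
agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)

agent.run("Fetch https://api.github.com and summarize what the API offers.")
```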

Building a Chatbot

Chatbots are one of the most popular use cases of AI, and generative AI holds a great deal of promise for chatbots that behave more realistically. However, it can be cumbersome to control the personality of the chatbot and get it to remember the context of the conversation.

LangChain's prompt templates give you control of both the chatbot's personality (its tone of voice and its style of communication) and the responses it gives.

Additionally, the message history tools are useful for giving the chatbot a memory longer than the few thousand words that LLMs provide by default, allowing for greater consistency within a conversation or even across multiple conversations.
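
Putting these together, a rough sketch of a chatbot with conversation-length memory:

```python
from langchain.llms import OpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

# ConversationBufferMemory replays the full message history on every call,
# so the bot stays consistent across turns.
chatbot = ConversationChain(llm=OpenAI(), memory=ConversationBufferMemory())

print(chatbot.predict(input="Hi! I'm shopping for my first yoga mat."))
print(chatbot.predict(input="Which of those would you recommend for a beginner?"))
```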

Other uses

This tutorial only scratches the surface of the possibilities. There are many more use cases of LLMs for data professionals. Whether you are interested in creating a personal assistant, summarizing reports, or answering questions about support docs or a knowledge base, LangChain provides a framework for including AI into any data pipeline or data application that you can think of.

Take it to the Next Level

LangChain content is coming soon to DataCamp. In the meantime, you can learn about one of the use cases discussed here by taking the Building Chatbots in Python course.
