
DuckDB Tutorial: Building AI Projects

This tutorial guides you through DuckDB's key features and practical applications, including building tables, performing data analysis, building a RAG application, and using an SQL query engine with an LLM.
Jul 7, 2024 · 13 min read

Recently, DuckDB came out of beta and released its stable version, gaining popularity rapidly as various data frameworks integrate it into their ecosystems. This makes it a prime time to learn DuckDB so you can keep up with the ever-changing world of data and AI.

In this tutorial, we will learn about DuckDB and its key features with code examples. Our primary focus will be on how we can integrate it with current AI frameworks. For that, we will work on two projects. First, we'll build a Retrieval-Augmented Generation (RAG) application using DuckDB as a vector database. Then, we'll use DuckDB as an AI query engine to analyze data using natural language instead of SQL.

What is DuckDB?

DuckDB is a modern, high-performance, in-process analytical database management system (DBMS) designed to support complex analytical queries. It is a relational (table-oriented) DBMS that supports the Structured Query Language (SQL).

DuckDB combines the simplicity and ease of use of SQLite with the high-performance capabilities required for analytical workloads, making it an excellent choice for data scientists and analysts.

Key features

  1. Simple operation: DuckDB is serverless, has no external dependencies, and is embedded within a host process. This makes it easy to install and deploy, requiring only a C++11 compiler for building.
  2. Feature-rich: It supports extensive SQL data management features. DuckDB also offers deep integration with Python and R, making it suitable for data science and interactive data analysis (see the sketch after this list).
  3. Fast analytical queries: DuckDB uses a columnar-vectorized query execution engine optimized for analytics, enabling parallel query processing and efficient handling of large datasets. 
  4. Free and open source: It is released under the permissive MIT License, making it free to use and open-source. 
  5. Portability: With no external dependencies, DuckDB is highly portable and can run on various operating systems (Linux, macOS, Windows) and CPU architectures (x86, ARM). It can even run in web browsers using DuckDB-Wasm.
  6. Extensibility: DuckDB supports a flexible extension mechanism, allowing the addition of new data types, functions, file formats, and SQL syntax. 
  7. Thorough testing: It undergoes intensive testing using Continuous Integration, with a test suite containing millions of queries. This ensures stability and reliability across different platforms and compilers.
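For a quick taste of that Python integration, here is a minimal sketch (the DataFrame and its column names are made up for illustration) that queries a pandas DataFrame in place with plain SQL:

import duckdb
import pandas as pd

# A small pandas DataFrame standing in for any in-memory dataset
df = pd.DataFrame({
    "job": ["admin.", "technician", "admin."],
    "duration": [120, 80, 45],
})

# DuckDB can reference the DataFrame by its Python variable name
duckdb.query("SELECT job, AVG(duration) AS avg_duration FROM df GROUP BY job").df()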

Getting Started with DuckDB

In this section, we will set up DuckDB, load CSV files, perform data analysis, and learn about relations and the query function.

We will start by installing the DuckDB Python package.

pip install duckdb --upgrade

Creating the DuckDB database

To create a persistent database, use the connect function and provide it with the database file name.

import duckdb
con = duckdb.connect("datacamp.duckdb")

This creates a database file in your local directory.
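If you don't need persistence, you can also connect without a file name to get a purely in-memory database; a minimal sketch:

import duckdb

# In-memory database: nothing is written to disk
con_mem = duckdb.connect()  # equivalent to duckdb.connect(":memory:")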


We will load a CSV file and create a "bank" table. The dataset we are using is available on DataLab and is called Bank Marketing. It contains data from direct marketing campaigns (phone calls) run by a Portuguese banking institution.

To load the CSV file, you first create a table using SQL and then call the read_csv() function within the SQL statement. It is that simple.

We will then validate our table by executing the SQL script that shows all of the tables within the database and using the fetchdf function to display the result as a pandas DataFrame. 

Note: We are using DataCamp’s DataLab as a code editor. DataLab is a cloud Jupyter Notebook that you can access for free if you have a DataCamp account. 

con.execute("""
    CREATE TABLE IF NOT EXISTS bank AS 
    SELECT * FROM read_csv('bank-marketing.csv')
""")
con.execute("SHOW ALL TABLES").fetchdf()


Now that we have successfully created our first table, we will run a beginner-level query to analyze the data and display the result as a DataFrame.

con.execute("SELECT * FROM bank WHERE duration < 100 LIMIT 5").fetchdf()


DuckDB is natively integrated into the new DataLab by DataCamp. Learn more about it by reading the blog "DuckDB Makes SQL a First-Class Citizen on DataLab" and using the interactive SQL cell to analyze data.

DuckDB Relations

DuckDB relations are essentially tables that can be queried using the Relational API. This API allows you to chain query operations on data sources such as pandas DataFrames. Instead of writing SQL queries, you analyze the data by chaining together various Python functions.

For example, we will load a CSV file to create the DuckDB relation. To analyze the table, you can chain the filter and limit functions.

bank_duck = duckdb.read_csv("bank-marketing.csv",sep=";")
bank_duck.filter("duration < 100").limit(3).df()


We can also create relations by loading the table from the DuckDB database. 

rel = con.table("bank")
rel.columns
['age',
 'job',
 'marital',
 'education',
 'default',
 'housing',
 'loan',
 'contact',
 'month',
 'day_of_week',
 'duration',
 'campaign',
 'pdays',
 'previous',
 'poutcome',
 'emp.var.rate',
 'cons.price.idx',
 'cons.conf.idx',
 'euribor3m',
 'nr.employed',
 'y']

Let’s write a relation that uses multiple functions to analyze the data. 

rel.filter("duration < 100").project("job,education,loan").order("job").limit(3).df()

We get three rows and three columns, sorted by job and filtered on the duration column.
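The Relational API also supports aggregation. As a rough sketch (assuming the aggregate method with an aggregate expression and a group expression, and illustrative column aliases), we can compute per-job statistics and keep the five most contacted jobs:

(
    rel.aggregate("job, count(*) AS clients, avg(duration) AS avg_duration", "job")
    .order("clients DESC")
    .limit(5)
    .df()
)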


DuckDB Query Function

The DuckDB query function allows SQL queries to be executed within the database, returning results that can be converted into various formats for further analysis.

In the code example, we are running the SQL query to find out the job titles of clients over the age of 30, count the number of clients contacted for each job, and calculate the average duration of the campaign.

Take the SQL Fundamentals skill track to learn how to manage a relational database and execute queries for simple data analysis.

res = duckdb.query("""SELECT 
                            job,
                            COUNT(*) AS total_clients_contacted,
                            AVG(duration) AS avg_campaign_duration
                        FROM 
                            'bank-marketing.csv'
                        WHERE 
                            age > 30
                        GROUP BY 
                            job
                        ORDER BY 
                            total_clients_contacted DESC;""")
res.df()
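Because duckdb.query() returns a relation, the same result can be materialized in formats other than a pandas DataFrame; a quick sketch (Arrow and Polars require the optional pyarrow and polars packages):

res.fetchall()  # list of Python tuples
res.arrow()     # pyarrow.Table
res.pl()        # polars.DataFrame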


We will now close the connection to the database and release any resources associated with that connection, preventing potential memory and file handle leaks.

con.close()
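Alternatively, a with block closes the connection automatically when you leave it; a minimal sketch:

with duckdb.connect("datacamp.duckdb") as con:
    print(con.execute("SELECT count(*) FROM bank").fetchone())
# the connection is closed automatically once the block exits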

If you are facing issues running the above code, please have a look at the Getting Started with DuckDB workspace. 

Building a RAG Application with DuckDB

In the first project, we will learn to build a RAG application with LlamaIndex, using DuckDB as a vector database and retriever.

Setting up

Install all the necessary Python packages that will be used to create and retrieve the index. 

%%capture
%pip install duckdb
%pip install llama-index
%pip install llama-index-vector-stores-duckdb

Import the necessary Python packages and functions.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.duckdb import DuckDBVectorStore
from llama_index.core import StorageContext

from IPython.display import Markdown, display

Setting up GPT-4o and Embedding Model

For the language model, we will use the GPT-4o model through the OpenAI API. To create the large language model (LLM) client, you just have to provide a model name and an API key.

import os
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o",api_key=os.environ["OPENAI_API_KEY"])

Then, we will create the embed model client using the OpenAI text-embedding-3-small model. 

Note: Providing an OpenAI API key is optional if the "OPENAI_API_KEY" environment variable is set in your development environment.

from llama_index.embeddings.openai import OpenAIEmbedding
embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
)

We will make the OpenAI LLM and embedding models global so that all LlamaIndex functions use them by default.

from llama_index.core import Settings

Settings.llm = llm
Settings.embed_model = embed_model

Using DuckDB as a vector database

For our project, we will load the PDF files from the Data folder. These PDFs are DataCamp tutorials saved using the browser's print function.

Provide the folder path to SimpleDirectoryReader and load the data.

documents = SimpleDirectoryReader("Data").load_data()


Then, create a vector store called "blog" using the existing "datacamp.duckdb" database. After that, convert the PDFs' content into embeddings and store them in the vector store.

vector_store = DuckDBVectorStore(
    database_name="datacamp.duckdb",
    table_name="blog",
    persist_dir="./",
    embed_dim=1536,
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

To check whether our vector store was successfully created, we will connect to the database using the DuckDB Python API and run a SQL query to display all the tables in the database.

import duckdb
con = duckdb.connect("datacamp.duckdb")

con.execute("SHOW ALL TABLES").fetchdf()

We have two tables: the "bank" marketing table and the "blog" table, which is our vector store. The "blog" table has an "embedding" column where all the embeddings are stored.
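To peek at the schema of the new table, you can run DuckDB's DESCRIBE statement on the same connection; a quick sketch:

con.execute("DESCRIBE blog").fetchdf()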


Creating a simple RAG application

Convert the index into a query engine, which will first search the vector database for similar documents and then use the retrieved context to generate the response.

To test the RAG query engine, we will ask a question about one of the tutorials.

query_engine = index.as_query_engine()
response = query_engine.query("Who wrote 'GitHub Actions and MakeFile: A Hands-on Introduction'?")
display(Markdown(f"<b>{response}</b>"))

And the answer is correct.

The author of "GitHub Actions and MakeFile: A Hands-on Introduction" is Abid Ali Awan.
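If you want to see which document chunks the retriever used for this answer, you can inspect the response's source nodes; a small sketch (the file_name key comes from SimpleDirectoryReader's default metadata):

for node_with_score in response.source_nodes:
    print(node_with_score.node.metadata.get("file_name"), node_with_score.score)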

Creating a RAG chatbot with memory

Now, let’s create an advanced RAG application that uses the conversation history to generate the response. For that, we have to create a chat memory buffer and then a chat engine with memory, LLM, and vector store retriever. 

from llama_index.core.memory import ChatMemoryBuffer
from llama_index.core.chat_engine import CondensePlusContextChatEngine

memory = ChatMemoryBuffer.from_defaults(token_limit=3900)

chat_engine = CondensePlusContextChatEngine.from_defaults(
    index.as_retriever(),
    memory=memory,
    llm=llm
)

response = chat_engine.chat(
    "What is the easiest way of finetuning the Llama 3 model? Please provide step-by-step instructions."
)

display(Markdown(response.response))

We asked the chat engine how to fine-tune the Llama 3 model, and it used the vector store to give a highly accurate answer.


To check if the memory buffer is working correctly, we will ask a follow-up question.

response = chat_engine.chat(
    "Could you please provide more details about the Post Fine-Tuning Steps?"
)
display(Markdown(response.response))

The chat engine remembered the previous conversation and responded accordingly. 


If you are facing issues running the above code, please have a look at the Building a RAG application with DuckDB workspace. 

Building a DuckDB SQL Query Engine Using an LLM

In the second project, we will use DuckDB as an SQL query engine. This involves integrating the database engine with the GPT-4o model to generate natural language responses to questions about the database. 

Install duckdb-engine to create a database engine using SQLAlchemy.

%pip install duckdb-engine -q

Loading the DuckDB database

We will load the DuckDB database using the create_engine function and then write a simple SQL query to check whether it is successfully loaded. 

from sqlalchemy import create_engine

engine = create_engine("duckdb:///datacamp.duckdb")
with engine.connect() as connection:
    cursor = connection.exec_driver_sql("SELECT * FROM bank LIMIT 3")
    print(cursor.fetchall())

Perfect. Our DuckDB database engine is ready to be used.

[(56, 'housemaid', 'married', 'basic.4y', 'no', 'no', 'no', 'telephone', 'may', 'mon', 261, 1, 999, 0, 'nonexistent', 1.1, 93.994, -36.4, 4.857, 5191.0, 'no'), (57, 'services', 'married', 'high.school', 'unknown', 'no', 'no', 'telephone', 'may', 'mon', 149, 1, 999, 0, 'nonexistent', 1.1, 93.994, -36.4, 4.857, 5191.0, 'no'), (37, 'services', 'married', 'high.school', 'no', 'yes', 'no', 'telephone', 'may', 'mon', 226, 1, 999, 0, 'nonexistent', 1.1, 93.994, -36.4, 4.857, 5191.0, 'no')]

Now, we have to create a database tool using the SQLDatabase class. Provide it with the engine object and the table name.

from llama_index.core import SQLDatabase
sql_database = SQLDatabase(engine, include_tables=["bank"])

Building the SQL query engine

Create the SQL query engine using the NLSQLTableQueryEngine class by providing it with the LlamaIndex SQL database object.

from llama_index.core.query_engine import NLSQLTableQueryEngine

query_engine = NLSQLTableQueryEngine(sql_database)

Ask the query engine a question about the "bank" table in natural language.

response = query_engine.query("Which is the longest running campaign?")
print(response.response)

In response, we get the answer to our query in natural language. This is awesome, don't you think?

The longest running campaign in the database has a duration of 4918 days.

Let's ask a complex question.

response = query_engine.query("Which type of job has the most housing loan?")
print(response.response)

The answer is precise, with additional information.  

The job type with the most housing loans is 'admin.' with 5559 housing loans. This is followed by 'blue-collar' with 4710 housing loans and 'technician' with 3616 housing loans. Other job types with significant housing loans include 'services', 'management', 'retired', 'entrepreneur', and 'self-employed'.

To check what is going on behind the scenes, we will print the response metadata.

print(response.metadata)

As we can see, GPT-4o first generates the SQL query, runs the query to get the result, and uses the result to generate the response. This multi-step process is achieved through two lines of code. 

{'d4ddf03c-337e-4ee6-957a-5fd2cfaa4b1c': {}, 'sql_query': "SELECT job, COUNT(housing) AS housing_loan_count\nFROM bank\nWHERE housing = 'yes'\nGROUP BY job\nORDER BY housing_loan_count DESC;", 'result': [('admin.', 5559), ('blue-collar', 4710), ('technician', 3616), ('services', 2050), ('management', 1490), ('retired', 892), ('entrepreneur', 779), ('self-employed', 740), ('unemployed', 557), ('housemaid', 540), ('student', 471), ('unknown', 172)], 'col_keys': ['job', 'housing_loan_count']}
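Since the metadata dictionary stores the generated SQL under the sql_query key, you can also pull it out directly:

print(response.metadata["sql_query"])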

Dispose of the engine when you are done with the project to release the database connections.

engine.dispose()

If you are facing issues running the above code, please have a look at the DuckDB SQL Query Engine workspace. 

Conclusion

DuckDB is fast, easy to use, and integrates seamlessly with numerous data and AI frameworks. As a data scientist, you will find that it takes only a few minutes to get accustomed to its API and start using it like any other Python package. One of the best features of DuckDB is that it has no dependencies, meaning you can use it virtually anywhere without worrying about hosting or additional setup.

In this tutorial, we have learned about DuckDB and its key features. We have also explored the DuckDB Python API, using it to create a table and perform simple data analysis. The second half of the tutorial covered two projects: one involving a Retrieval-Augmented Generation (RAG) application with DuckDB as a vector database and the other demonstrating DuckDB as an SQL query engine.

Before jumping into using a SQL query engine or integrating a database with AI, you need a basic understanding of SQL and data analysis. You can write the query, but how would you know what question to ask? This is where a basic knowledge of data analysis and SQL comes in. You can gain this knowledge by completing the Associate Data Analyst in SQL career track.


Author
Abid Ali Awan

As a certified data scientist, I am passionate about leveraging cutting-edge technology to create innovative machine learning applications. With a strong background in speech recognition, data analysis and reporting, MLOps, conversational AI, and NLP, I have honed my skills in developing intelligent systems that can make a real impact. In addition to my technical expertise, I am also a skilled communicator with a talent for distilling complex concepts into clear and concise language. As a result, I have become a sought-after blogger on data science, sharing my insights and experiences with a growing community of fellow data professionals. Currently, I am focusing on content creation and editing, working with large language models to develop powerful and engaging content that can help businesses and individuals alike make the most of their data.
