Blog

An Introduction to DuckDB: What is It and Why Should You Use It?

Explore DuckDB, the fast, easy-to-use analytics database for Python and R. Learn its key features, use cases, and how it optimizes data analysis tasks.

Updated Jun 2023 · 7 min read

In the realm of data science and analytics, extracting insights from data often involves a series of steps, typically conducted in Python using libraries like pandas. While powerful, pandas may face performance issues with large datasets and resource-intensive operations.

Aiming for a balance between robust functionality and efficiency, DuckDB emerges as an excellent alternative. As a high-speed, user-friendly analytics database, DuckDB is transforming data processing in Python and R.

This article will explore:

DuckDB's unique features and capabilities
Advantages of DuckDB over traditional data manipulation tools
Practical use cases demonstrating DuckDB's potential
Guidelines for working with DuckDB in Python and R

By the end of this article, you will understand why DuckDB is a tool worth considering for your data analysis tasks, regardless of the data size and complexity of the operations involved.

What is DuckDB?

DuckDB is a free, open-source, embedded, in-process, relational, OnLine Analytical Processing (OLAP) DataBase Management System (DBMS). That’s a lot of words and acronyms, so let’s break it down:

By in-process, we mean the DBMS features are running from within the application you’re trying to access from instead of an external process your application connects to
OLAP means the database is designed for data analysis. There are also OnLine Transaction Processing (OLTP) databases, which are designed to handle large transactional data.

The main difference between OLAP and OLTP is how data is stored. OLTP databases commonly store data by records while ensuring all data associated with a record is stored close together in memory. They are also optimized for reading and writing rows efficiently.

In contrast, OLAP databases are commonly oriented by columns, which means data is organized by columns. All data associated with a column is stored near one another in memory, and the database is optimised for reading and computing on the columns efficiently.

Since DuckDB is an OLAP database, any data stored is organized by columns. Additionally, DuckDB is optimized to perform complex queries on data (e.g., joins, aggregations, groupings, etc.) – check out the SQL Fundamentals skill track to learn more.

If you’re familiar with SQLite, the easiest way to conceptualize DuckDB is as its analytics-focused replica. This plays into why DuckDB is so popular — it leverages the simplicity of SQLite and the functionalities of Snowflake on your local computer. DuckDB fills the need for an embedded database solution for analytical processing.

Key Features of DuckDB

Let’s take a look at some of the key features of DuckDB.

Fast analytical queries

DuckDB is incredibly fast; this may come as a shock to anyone familiar with databases that handle OLAP workloads, like PostgreSQL. Traditionally, OLAP databases tend to have slow response times since they typically process more data.

However, DuckDB runs on a columnar-vectorized query engine, which helps to make efficient use of the CPU cache and speed up response times for analytical query workloads.

Supports SQL and integration with other programming languages

DuckDB enables users to run complex SQL queries and provides APIs for Java, C, C++, and more. It’s also deeply integrated into Python and R, enabling users to conduct efficient interactive data analysis; thus, you can interact with DuckDB from your preferred programming language. There’s also access to extra SQL keywords that make SQL queries easier to write, such as EXCLUDE, REPLACE, and ALL.

Note: DuckDB has no external dependencies, so you don’t have to worry about dependency issues.

Free & open-source

DuckDB is open-sourced and has several active contributors, which means developments and improvements can be implemented fast. It’s also free, although it may not remain free for good.

DuckDB Use Cases

The two most common use cases for DuckDB are interactive data analysis and edge computing.

Let’s learn more about each one in turn.

Interactive data analysis

Data analysis describes the series of steps involved with processing and modeling data. The purpose of data analysis is to discover useful information to inform conclusions and data-driven decisions.

Nowadays, companies employ data scientists and analysts who take on this responsibility using technologies such as Python and R programming languages to conduct their analyses.

DuckDB offers a better alternative to those who prefer to use SQL for their local development than SQLite, which has a tough time managing OLAP workloads due to its lack of functionality for data analytics.

Edge computing

Embeddable databases, like DuckDB, enable users to analyze data on edge. Edge computing is an emerging distributed computing paradigm, which refers to a range of networks and devices being brought to or closer to the location where it’s required.

Processing data closer to where it’s required enables improved response times and bandwidth being preserved.

Getting Started with DuckDB

Your environment determines the steps required to install DuckDB, but the full installation guide can be found in the DuckDB documentation.

For the most part, it’s pretty straightforward. You only require a few lines of code since DuckDB is an embedded solution. It also does not require any server, and it has no external dependencies.

Here’s an example of how we would install DuckDB using Python:

pip install duckdb==0.8.0

And in R:

install.packages("duckdb")

Now, let’s check out some code examples.

Working With DuckDB in Python

Once you’ve installed DuckDB, it’s pretty simple to get started. You simply import duckdb into your environment, then connect to an existing database or create a new one if required.

For example:

import duckdb 
connection = duckdb.connect()

Since no database file was passed as a parameter to the connect() method, a new database will be created.

However, the most straightforward way to get started executing SQL queries with DuckDB is by using the sql() method.

# Source: Basic API usage - https://duckdb.org/docs/api/python/overview.html
import duckdb
duckdb.sql('SELECT 42').show()

Executing this command will run an in-memory database stored globally in the Python module and return a relation, which is essentially a symbolic representation of the query.

Note, the query is not executed until the result is explicitly requested, as follows:

# Source: Execute SQL - https://duckdb.org/docs/guides/python/execute_sql.html
results = duckdb.sql('SELECT 42').fetchall()
print(results)

"""
[(42,)]
"""

We can also use the df() method to convert the data to a Pandas DataFrame:

results_df = duckdb.sql('SELECT 42').df()
print(results) 

"""
    42
0   42
"""

Or you can use one of the many data ingestion methods to read data into memory:

# Source: Python API - https://duckdb.org/docs/api/python/overview.html 
import duckdb
duckdb.read_csv('example.csv') # read a CSV file into a Relation
duckdb.read_parquet('example.parquet')# read a Parquet file into a Relation
duckdb.read_json('example.json') # read a JSON file into a Relation

duckdb.sql('SELECT * FROM "example.csv"')     # directly query a CSV file

Disclaimer: If you use the sql() method without first connecting to an existing database, any data you store in the files will be lost after shutting down the program. You can prevent this by establishing a connection using the connect() method we discussed at the beginning of the section.

Conclusion and Further Resources

DuckDB is an OLAP database used by data professionals, such as data scientists and analysts, to analyze data in a fast and efficient manner. It leverages a SQL query execution engine capable of running complex queries on large datasets.

DuckDB also has integrations with many other programming languages, so it’s possible to access it from your favorite language.

To continue your learning, check out the SQL Basics Cheat Sheet and how DuckDB made SQL a first-class Citizen in DataLab.

Author

Kurtis Pykes

Topics

Data Science

blog

5 Common Data Science Challenges and Effective Solutions

Emerging technologies are changing the data science world, bringing new data science challenges to businesses. Here are 5 data science challenges and solutions.

DataCamp Team

8 min

blog

A Data Science Roadmap for 2024

Do you want to start or grow in the field of data science? This data science roadmap helps you understand and get started in the data science landscape.

Mark Graus

10 min

podcast

Avoiding Burnout for Data Professionals with Jen Fisher, Human Sustainability Leader at Deloitte

Jen and Adel cover Jen’s own personal experience with burnout, the role of a Chief Wellbeing Officer, the impact of work on our overall well-being, the patterns that lead to burnout, the future of human sustainability in the workplace and much more.

Adel Nehme

44 min

podcast

Becoming Remarkable with Guy Kawasaki, Author and Chief Evangelist at Canva

Richie and Guy explore the concept of being remarkable, growth, grit and grace, the importance of experiential learning, imposter syndrome, finding your passion, how to network and find remarkable people, measuring success through benevolent impact and much more.

Richie Cotton

55 min

tutorial

Introduction to DynamoDB: Mastering NoSQL Database with Node.js | A Beginner's Tutorial

Learn to master DynamoDB with Node.js in this beginner's guide. Explore table creation, CRUD operations, and scalability in AWS's NoSQL database.

Gary Alway

11 min

See More See More

An Introduction to DuckDB: What is It and Why Should You Use It?

What is DuckDB?

Key Features of DuckDB

Fast analytical queries

Supports SQL and integration with other programming languages

Free & open-source

DuckDB Use Cases

Interactive data analysis

Edge computing

Getting Started with DuckDB

Working With DuckDB in Python

Conclusion and Further Resources

5 Common Data Science Challenges and Effective Solutions

Top 32 AWS Interview Questions and Answers For 2024

A Data Science Roadmap for 2024

Avoiding Burnout for Data Professionals with Jen Fisher, Human Sustainability Leader at Deloitte

Becoming Remarkable with Guy Kawasaki, Author and Chief Evangelist at Canva

Introduction to DynamoDB: Mastering NoSQL Database with Node.js | A Beginner's Tutorial

5 Common Data Science Challenges and Effective Solutions

Top 32 AWS Interview Questions and Answers For 2024

A Data Science Roadmap for 2024

Avoiding Burnout for Data Professionals with Jen Fisher, Human Sustainability Leader at Deloitte

Becoming Remarkable with Guy Kawasaki, Author and Chief Evangelist at Canva

Introduction to DynamoDB: Mastering NoSQL Database with Node.js | A Beginner's Tutorial