Skip to main content
HomeBlogData Science

An Introduction to DuckDB: What is It and Why Should You Use It?

Explore DuckDB, the fast, easy-to-use analytics database for Python and R. Learn its key features, use cases, and how it optimizes data analysis tasks.
Jun 2023  · 7 min read

In the realm of data science and analytics, extracting insights from data often involves a series of steps, typically conducted in Python using libraries like pandas. While powerful, pandas may face performance issues with large datasets and resource-intensive operations.

Aiming for a balance between robust functionality and efficiency, DuckDB emerges as an excellent alternative. As a high-speed, user-friendly analytics database, DuckDB is transforming data processing in Python and R.

This article will explore:

  • DuckDB's unique features and capabilities
  • Advantages of DuckDB over traditional data manipulation tools
  • Practical use cases demonstrating DuckDB's potential
  • Guidelines for working with DuckDB in Python and R

By the end of this article, you will understand why DuckDB is a tool worth considering for your data analysis tasks, regardless of the data size and complexity of the operations involved.

What is DuckDB?

DuckDB is a free, open-source, embedded, in-process, relational, OnLine Analytical Processing (OLAP) DataBase Management System (DBMS). That’s a lot of words and acronyms, so let’s break it down:

  • By in-process, we mean the DBMS features are running from within the application you’re trying to access from instead of an external process your application connects to
  • OLAP means the database is designed for data analysis. There are also OnLine Transaction Processing (OLTP) databases, which are designed to handle large transactional data.

The main difference between OLAP and OLTP is how data is stored. OLTP databases commonly store data by records while ensuring all data associated with a record is stored close together in memory. They are also optimized for reading and writing rows efficiently.

In contrast, OLAP databases are commonly oriented by columns, which means data is organized by columns. All data associated with a column is stored near one another in memory, and the database is optimised for reading and computing on the columns efficiently.

Since DuckDB is an OLAP database, any data stored is organized by columns. Additionally, DuckDB is optimized to perform complex queries on data (e.g., joins, aggregations, groupings, etc.) – check out the SQL Fundamentals skill track to learn more.

If you’re familiar with SQLite, the easiest way to conceptualize DuckDB is as its analytics-focused replica. This plays into why DuckDB is so popular — it leverages the simplicity of SQLite and the functionalities of Snowflake on your local computer. DuckDB fills the need for an embedded database solution for analytical processing.

Key Features of DuckDB

Let’s take a look at some of the key features of DuckDB.

Fast analytical queries

DuckDB is incredibly fast; this may come as a shock to anyone familiar with databases that handle OLAP workloads, like PostgreSQL. Traditionally, OLAP databases tend to have slow response times since they typically process more data.

However, DuckDB runs on a columnar-vectorized query engine, which helps to make efficient use of the CPU cache and speed up response times for analytical query workloads.

Supports SQL and integration with other programming languages

DuckDB enables users to run complex SQL queries and provides APIs for Java, C, C++, and more. It’s also deeply integrated into Python and R, enabling users to conduct efficient interactive data analysis; thus, you can interact with DuckDB from your preferred programming language. There’s also access to extra SQL keywords that make SQL queries easier to write, such as EXCLUDE, REPLACE, and ALL.

Note: DuckDB has no external dependencies, so you don’t have to worry about dependency issues.

Free & open-source

DuckDB is open-sourced and has several active contributors, which means developments and improvements can be implemented fast. It’s also free, although it may not remain free for good.

DuckDB Use Cases

The two most common use cases for DuckDB are interactive data analysis and edge computing.

Let’s learn more about each one in turn.

Interactive data analysis

Data analysis describes the series of steps involved with processing and modeling data. The purpose of data analysis is to discover useful information to inform conclusions and data-driven decisions.

Nowadays, companies employ data scientists and analysts who take on this responsibility using technologies such as Python and R programming languages to conduct their analyses.

DuckDB offers a better alternative to those who prefer to use SQL for their local development than SQLite, which has a tough time managing OLAP workloads due to its lack of functionality for data analytics.

Edge computing

Embeddable databases, like DuckDB, enable users to analyze data on edge. Edge computing is an emerging distributed computing paradigm, which refers to a range of networks and devices being brought to or closer to the location where it’s required.

Processing data closer to where it’s required enables improved response times and bandwidth being preserved.

Getting Started with DuckDB

Your environment determines the steps required to install DuckDB, but the full installation guide can be found in the DuckDB documentation.

For the most part, it’s pretty straightforward. You only require a few lines of code since DuckDB is an embedded solution. It also does not require any server, and it has no external dependencies.

Here’s an example of how we would install DuckDB using Python:

pip install duckdb==0.8.0

And in R:

install.packages("duckdb")

Now, let’s check out some code examples.

Working With DuckDB in Python

Once you’ve installed DuckDB, it’s pretty simple to get started. You simply import duckdb into your environment, then connect to an existing database or create a new one if required.

For example:

import duckdb 
connection = duckdb.connect()

Since no database file was passed as a parameter to the connect() method, a new database will be created.

However, the most straightforward way to get started executing SQL queries with DuckDB is by using the sql() method.

# Source: Basic API usage - https://duckdb.org/docs/api/python/overview.html
import duckdb
duckdb.sql('SELECT 42').show()

Executing this command will run an in-memory database stored globally in the Python module and return a relation, which is essentially a symbolic representation of the query.

Note, the query is not executed until the result is explicitly requested, as follows:

# Source: Execute SQL - https://duckdb.org/docs/guides/python/execute_sql.html
results = duckdb.sql('SELECT 42').fetchall()
print(results)

"""
[(42,)]
"""

We can also use the df() method to convert the data to a Pandas DataFrame:

results_df = duckdb.sql('SELECT 42').df()
print(results) 

"""
    42
0   42
"""

Or you can use one of the many data ingestion methods to read data into memory:

# Source: Python API - https://duckdb.org/docs/api/python/overview.html 
import duckdb
duckdb.read_csv('example.csv') # read a CSV file into a Relation
duckdb.read_parquet('example.parquet')# read a Parquet file into a Relation
duckdb.read_json('example.json') # read a JSON file into a Relation

duckdb.sql('SELECT * FROM "example.csv"')     # directly query a CSV file

Disclaimer: If you use the sql() method without first connecting to an existing database, any data you store in the files will be lost after shutting down the program. You can prevent this by establishing a connection using the connect() method we discussed at the beginning of the section.

Conclusion and Further Resources

DuckDB is an OLAP database used by data professionals, such as data scientists and analysts, to analyze data in a fast and efficient manner. It leverages a SQL query execution engine capable of running complex queries on large datasets.

DuckDB also has integrations with many other programming languages, so it’s possible to access it from your favorite language.

To continue your learning, check out the SQL Basics Cheat Sheet and how DuckDB made SQL a first-class Citizen on DataCamp Workspace.


Photo of Kurtis Pykes
Author
Kurtis Pykes
Topics
Related

How to Choose The Right Data Science Bootcamp in 2023 (With Examples)

Learn everything about data science bootcamps, including a list of top programs to kickstart your career.
Abid Ali Awan's photo

Abid Ali Awan

10 min

DataCamp Portfolio Challenge: Win $500 Publishing Your Best Work

Win up to $500 by building a free data portfolio with DataCamp Portfolio.
DataCamp Team's photo

DataCamp Team

5 min

Building Diverse Data Teams with Tracy Daniels, Head of Insights and Analytics at Truist

Tracy and Richie discuss the best way to approach DE & I in data teams and the positive outcomes of implementing DEI correctly.
Richie Cotton's photo

Richie Cotton

49 min

Making Better Decisions using Data & AI with Cassie Kozyrkov, Google's First Chief Decision Scientist

Richie speaks to Google's first Chief Decision Scientist and CEO of Data Scientific, Cassie Kozyrkov, covering decision science, data and AI.
Richie Cotton's photo

Richie Cotton

68 min

Markdown Cheat Sheet

Learn everything you need to know about Markdown in this convenient cheat sheet!
Richie Cotton's photo

Richie Cotton

Chroma DB Tutorial: A Step-By-Step Guide

With Chroma DB, you can easily manage text documents, convert text to embeddings, and do similarity searches.
Abid Ali Awan's photo

Abid Ali Awan

10 min

See MoreSee More