
DuckDB for Data Engineers: Speed Up Your Data Pipelines 10x and More

DuckDB is a powerful analytical engine that lives on your laptop. You can use it to speed up data reading and processing and reduce your pipeline runtimes from minutes to seconds. Follow this hands-on guide to learn how.
Sep 21, 2024  · 27 min read

How many times has a manager asked you how long your pipeline takes to run, and you felt embarrassed to answer?

You’re not alone. The convenience of Python has to be paid for somewhere. That place is usually the runtime speed. Pipelines are fast to write but difficult to scale, often taking more and more time as project requirements grow.

But maybe it’s not your code that’s the bottleneck. Maybe there isn’t any additional performance you can squeeze out of pandas. Maybe switching to a different data processing engine can reduce the runtime from hours to minutes.

That’s where DuckDB comes in.

In this article, I’ll tell you exactly what DuckDB is and why it matters to data engineers. You’ll learn how to use DuckDB through hands-on examples and see how much faster it is than Python’s most popular data processing libraries. 

What is DuckDB, and Is It the Next Big Thing for Data Engineers?

DuckDB is an open-source, embedded, in-process, relational OLAP DBMS.

If you prefer plain English, think of it as an analytical, columnar database that runs right inside your process, often entirely in memory. Being an analytical database, it’s optimized for SELECT statements rather than INSERT and UPDATE - essentially SQLite’s analytical counterpart: just as lightweight and embedded, but built for the opposite workload.

DuckDB has been around for about 5 years, but the first “stable” release was only announced in June 2024 - 3 months ago at the time of writing. Don’t let the newness factor fool you, though - DuckDB has been tried and tested by many and is almost always among the recommended tools when speed is critical.

Why choose DuckDB for your data pipelines

If you are a data engineer, here are a couple of concrete benefits of using DuckDB in your data pipelines:

  • It’s fast: Think orders of magnitude faster than Python’s default data frame libraries. In most cases, it’s even faster than libraries optimized for speed (e.g., polars).
  • It’s open-source: The entire source code is available on a public GitHub repo. You can use it, tweak it, and even contribute to the project.
  • It delays the point of switching to the cloud: Cloud computing can get expensive, but the price is often justified when the processing speed is mission-critical. With DuckDB, you can analyze hundreds of millions of rows on your laptop.
  • It’s easy to pick up: It’s easy to get started with DuckDB. You’ll likely have to make minimal changes in your existing data pipelines, as DuckDB integrates nicely with Python’s data processing libraries.
  • It integrates well with cloud storage platforms: For example, you can write a DuckDB SQL query that reads data directly from AWS S3. There’s no need to download the data first.

For these reasons (and many more to come), I think DuckDB is the next big leap for data engineers.


How to Get Started with DuckDB

As I said earlier, DuckDB operates on your laptop. The first order of business is getting it there. 

I’ll show you the installation steps for macOS, but the instructions for Windows and Linux are well-explained and easy to follow.

Step 1: Install DuckDB

If you have a Mac, the easiest way to install DuckDB is with Homebrew. Run the following shell command:

brew install duckdb

You’ll see the following output in a matter of seconds:

Installing DuckDB with Homebrew

Once installed, you can enter the DuckDB shell with the following command:

duckdb

DuckDB shell

From here, the world is your oyster!

You can write and run SQL queries directly from the console. For example, I’ve downloaded the first 6 months’ worth of NYC Taxi Data for 2024. If you’ve done the same, use the following snippet to display the row count of a single Parquet file:

SELECT COUNT(*)
  FROM PARQUET_SCAN("path-to-data.parquet");

Single Parquet file row counts

Yup - that’s almost 20 million rows from a single file, and the dataset used throughout this guide spans six of them - one per month for the first half of 2024.
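
While you’re still in the shell, you can also inspect a file’s schema before reading it in full. Here’s a quick sketch using DuckDB’s DESCRIBE statement (same placeholder path as above):

DESCRIBE
  SELECT *
  FROM PARQUET_SCAN('path-to-data.parquet');

It prints one row per column with its name and data type - handy for checking what you’re dealing with before writing longer queries.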

You probably don’t want to run DuckDB from the terminal, so next, I’ll walk you through the steps of connecting DuckDB with Python.

Step 2: Connect DuckDB to your Python workflow

The goal of this section is to show you how to configure and run DuckDB in the most popular language of data engineering: Python.

In fact, DataCamp offers a complete career track in data engineering in Python.

Python has a specific DuckDB library that you need to install first. Run the following from the terminal, preferably in a virtual environment:

pip install duckdb

Now create a new Python file and paste the following code:

import duckdb
# Function to get the count from a single file
def get_row_count_from_file(conn: duckdb.DuckDBPyConnection, file_path: str) -> int:
    query = f"""
        SELECT COUNT(*)
        FROM PARQUET_SCAN("{file_path}")
    """
    return conn.sql(query).fetchone()[0]
if __name__ == "__main__":
    # In-memory database connection
    conn = duckdb.connect()
    # Path to a parquet file
    file_path = "fhvhv_tripdata_2024-01.parquet"
    # Get the row count
    row_count = get_row_count_from_file(conn=conn, file_path=file_path)
    print(row_count)

Long story short, the code snippet has a function that fetches the row count from a single file, given a valid DuckDB connection and a file path.

You’ll get the following result back almost instantly after running the script:

DuckDB and Python connection

DuckDB and Python connection

That’s it! 
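
One small but useful detail: duckdb.connect() with no arguments gives you an in-memory database that disappears when the script exits. If you want tables to persist between runs, pass a file path instead. Here’s a minimal sketch - the nyc_taxi.duckdb file name is just an example:

import duckdb
# Passing a path creates (or reopens) a database file on disk instead of an in-memory one
conn = duckdb.connect("nyc_taxi.duckdb")
# Anything created here survives across script runs
conn.execute("CREATE TABLE IF NOT EXISTS runs (executed_at TIMESTAMP)")
conn.execute("INSERT INTO runs VALUES (CURRENT_TIMESTAMP)")
print(conn.sql("SELECT COUNT(*) FROM runs").fetchone()[0])
conn.close()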

Up next, I’ll show you the impressive things you can do with Python and DuckDB - blazing fast!

DuckDB in Action: How to Speed up Data Pipelines

For reference, I’ll run the code on a 2024 portion (January - June) of the NYC Taxi Dataset - specifically for high-volume for-hire vehicles. That’s 6 Parquet files, taking approximately 3GB of disk space. As for the hardware, I’m using an M3 Pro MacBook Pro 16” with 12 CPU cores and 36GB of unified memory.

Your mileage may vary, but you should still end up with similar numbers.

Read huge datasets in no time

I’ve converted these 6 files into two additional formats - CSV and JSON. You can see how much disk space each of them takes:

Dataset size comparison

Total: 2.96GB for Parquet, 19.31GB for CSV, and 66.04GB for JSON.

Huge difference! If you take nothing else from today’s article, at least remember always to use the Parquet file format when dealing with huge files. It’ll save you both processing time and disk space.
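
And if you’re currently stuck with CSV files, DuckDB itself is a convenient converter. Here’s a minimal sketch that rewrites a single CSV file as Parquet with a COPY statement - the file names are hypothetical, so point them at one of your own CSVs:

import duckdb
conn = duckdb.connect()
# Read the CSV (schema is auto-detected) and write it back out as Parquet
conn.execute("""
    COPY (SELECT * FROM READ_CSV('nyc-taxi-data-csv/fhvhv_tripdata_2024-01.csv'))
    TO 'fhvhv_tripdata_2024-01.parquet' (FORMAT PARQUET);
""")
conn.close()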

DuckDB has convenient functions that can read multiple files of the same format at once (using a glob pattern). I’ll use them to read all six files, get the row count, and compare the runtimes:

import duckdb
import pandas as pd
from datetime import datetime
def get_row_count_and_measure_time(file_format: str) -> dict:
    # Construct a DuckDB query based on the file_format
    match file_format:
        case "csv":
            query = """
            SELECT COUNT(*)
            FROM READ_CSV("nyc-taxi-data-csv/*.csv")
            """
        case "json":
            query = """
            SELECT COUNT(*)
            FROM READ_JSON("nyc-taxi-data-json/*.json")
            """
        case "parquet":
            query = """
                SELECT COUNT(*)
                FROM READ_PARQUET("nyc-taxi-data/*.parquet")
            """
        case _:
            raise KeyError("Param file_format must be in [csv, json, parquet]")
    # Open the database connection and measure the start time
    time_start = datetime.now()
    conn = duckdb.connect()
    # Get the row count
    row_count = conn.sql(query).fetchone()[0]
    # Close the database connection and measure the finish time
    conn.close()
    time_end = datetime.now()
    return {
        "file_format": file_format,
        "duration_seconds": (time_end - time_start).seconds + ((time_end - time_start).microseconds) / 1000000,
        "row_count": row_count
    }
# Run the function
file_format_results = []
for file_format in ["csv", "json", "parquet"]:
    file_format_results.append(get_row_count_and_measure_time(file_format=file_format))
pd.DataFrame(file_format_results)

The results are in - there’s only one clear winner:

File reading runtime comparison

Thanks to DuckDB, all three formats are read quickly and return the same row count, but Parquet is the clear winner - roughly 600 times faster than CSV and 1,200 times faster than JSON.

For this reason, I’ll only use Parquet for the rest of the article.

Query data with SQL

DuckDB lets you aggregate data through SQL.

My goal for this section is to expand on that idea. To be more precise, I’ll show you how to calculate summary statistics on a monthly level that include information on the number of rides, ride time, ride distance, ride cost, and driver pay—all from 120 million rows of data.

The best part: It’ll only take two seconds!

This is the code I used inside a Python script to aggregate data and print the results:

conn = duckdb.connect()
query = """
    SELECT
        ride_year || '-' || ride_month AS ride_period,
        COUNT(*) AS num_rides,
        ROUND(SUM(trip_time) / 86400, 2) AS ride_time_in_days,
        ROUND(SUM(trip_miles), 2) AS total_miles,
        ROUND(SUM(base_passenger_fare + tolls + bcf + sales_tax + congestion_surcharge + airport_fee + tips), 2) AS total_ride_cost,
        ROUND(SUM(driver_pay), 2) AS total_rider_pay
    FROM (
        SELECT
            DATE_PART('year', pickup_datetime) AS ride_year,
            DATE_PART('month', pickup_datetime) AS ride_month,
            trip_time,
            trip_miles,
            base_passenger_fare,
            tolls,
            bcf,
            sales_tax,
            congestion_surcharge,
            airport_fee,
            tips,
            driver_pay
        FROM PARQUET_SCAN("nyc-taxi-data/*.parquet")
        WHERE 
            ride_year = 2024
        AND ride_month >= 1 
        AND ride_month <= 6
    )
    GROUP BY ride_period
    ORDER BY ride_period
"""
# Aggregate data, print it, and show the data type
results = conn.sql(query)
results.show()
print(type(results))

The output contains aggregation results, variable type for results, and the total runtime:

Monthly summary statistics aggregation

Let it sink in: That’s 3GB of data spanning over 120 million rows, all aggregated in 2 seconds on a laptop!

The only problem is that the result is a DuckDB relation object rather than a regular DataFrame, which limits what you can do with it downstream. Luckily, there’s an easy fix.

Integrate with pandas and DataFrames

The upside of using DuckDB isn’t just that it’s fast, but it also integrates well with your favorite data frames library: pandas.

If you have a temporary result set similar to the one I showed above, you only have to call the .df() method on it to convert it into a pandas DataFrame:

# Convert to Pandas DataFrame
results_df = results.df()
# Print the type and contents
print(type(results_df))
results_df

DuckDB to pandas conversion

Similarly, you can use DuckDB to perform calculations on pandas DataFrames that have already been loaded into memory.

The trick is referencing the variable name after the FROM keyword in a DuckDB SQL query. Here’s an example:

# Pandas DataFrame
pandas_df = pd.read_parquet("nyc-taxi-data/fhvhv_tripdata_2024-01.parquet")
# Run SQL queries through DuckDB
duckdb_res = duckdb.sql("""
    SELECT
        pickup_datetime,
        dropoff_datetime,
        trip_miles,
        trip_time,
        driver_pay
    FROM pandas_df
    WHERE trip_miles >= 300
""").df()
duckdb_res

DuckDB query on an existing pandas DataFrame

To conclude, you can avoid pandas altogether or perform aggregations faster on existing pandas DataFrames.
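
Relying on DuckDB picking up the pandas_df variable by name is convenient, but you can also make the link explicit with the connection’s register() method. Here’s a minimal sketch with a small throwaway DataFrame:

import duckdb
import pandas as pd
conn = duckdb.connect()
# A tiny DataFrame just for illustration
lookup_df = pd.DataFrame({"borough": ["Manhattan", "Brooklyn"], "code": [1, 2]})
# Register it under a name of your choosing, then query it like any table
conn.register("borough_lookup", lookup_df)
print(conn.sql("SELECT * FROM borough_lookup WHERE code = 1").df())
conn.close()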

Up next, I’ll show you a couple of advanced DuckDB features.

Advanced DuckDB for Data Engineers

There’s a lot more to DuckDB than meets the eye.

In this section, I’ll walk you through a couple of advanced DuckDB features that are essential to data engineers.

Extensions

DuckDB allows you to extend its native functionality through core and community extensions. I use extensions all the time to connect to cloud storage systems and databases and work with additional file formats.

You can install extensions through both the Python and DuckDB consoles.

In the code snippet below, I show you how to install the httpfs extension that’s needed to connect to AWS and read S3 data:

import duckdb
conn = duckdb.connect()
conn.execute("""
    INSTALL httpfs;
    LOAD httpfs;
""")

You won’t see any output unless you chain the .df() method onto conn.execute() - if you do, you’ll get back either a “Success” or an “Error” message.
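
If you want to double-check that an extension actually made it in, you can query DuckDB’s duckdb_extensions() table function. A quick, self-contained sketch:

import duckdb
conn = duckdb.connect()
conn.execute("INSTALL httpfs; LOAD httpfs;")
# List the extension along with its installation and load status
status = conn.sql("""
    SELECT extension_name, installed, loaded
    FROM duckdb_extensions()
    WHERE extension_name = 'httpfs'
""").df()
print(status)
conn.close()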

Cloud storage querying

More often than not, your company will grant you access to the data you have to include in your pipelines, and that data is usually stored on a scalable cloud platform such as AWS S3.

DuckDB allows you to connect to S3 (and other platforms) directly.

I already showed you how to install the extension (httpfs), and the only thing left to do is to configure it. AWS requires you to provide information on your region, access key, and secret access key.

Assuming you have all of these, run the following command through Python:

conn.execute("""    CREATE SECRET aws_s3_secret (
        TYPE S3,
        KEY_ID '<your-access-key>',
        SECRET '<your-secret-key>',
        REGION '<your-region>'
    );
""")

My S3 bucket contains two of the files from the New York Taxi dataset:

S3 bucket contents

There’s no need to download the files, as you can scan them directly from S3:

import duckdb
conn = duckdb.connect()
aws_count = conn.execute("""
    SELECT COUNT(*)
    FROM PARQUET_SCAN('s3://<your-bucket-name>/*.parquet');
""").df()
aws_count

DuckDB S3 fetch results

You’re reading this right: It took only 4 seconds to go through ~ 900MB of data sitting in an S3 bucket.

Parallel processing

When it comes to parallelization, DuckDB implements it by default based on row groups: horizontal data partitions you’d typically find in Parquet. A single row group can have a maximum of 122,880 rows. 

Parallelism in DuckDB starts when you’re working with more than 122,880 rows.

DuckDB will automatically launch new threads when this occurs. By default, the number of threads equals the number of CPU cores. But this doesn’t stop you from manually playing around with the number of threads.

In this section, I’ll show you how to go about it and the exact impact a different number of threads has on an identical workload.

To view the current number of threads, run the following:

SELECT current_setting('threads') AS threads;

The current number of threads used by DuckDB

You can get the same results through Python.

You’ll want to run the SET threads = N command to change the number of threads. 
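
For example, checking and then capping the thread count from Python looks like this (the value 4 is arbitrary):

import duckdb
conn = duckdb.connect()
# Read the current thread count
print(conn.sql("SELECT current_setting('threads') AS threads").fetchone()[0])
# Cap DuckDB at 4 threads for this connection
conn.execute("SET threads = 4;")
print(conn.sql("SELECT current_setting('threads') AS threads").fetchone()[0])
conn.close()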

In the following code snippet, I’ll show you how to implement a custom Python function that runs a DuckDB query on a given number of threads, converts it into a pandas DataFrame, and returns the runtime (among other things). The code below the function runs it in a range of [1, 12] threads:

def thread_test(n_threads: int) -> dict:
    # Open the database connection and measure the start time
    time_start = datetime.now()
    conn = duckdb.connect()
    # Set the number of threads
    conn.execute(f"SET threads = {n_threads};")
    query = """
        SELECT
            ride_year || '-' || ride_month AS ride_period,
            COUNT(*) AS num_rides,
            ROUND(SUM(trip_time) / 86400, 2) AS ride_time_in_days,
            ROUND(SUM(trip_miles), 2) AS total_miles,
            ROUND(SUM(base_passenger_fare + tolls + bcf + sales_tax + congestion_surcharge + airport_fee + tips), 2) AS total_ride_cost,
            ROUND(SUM(driver_pay), 2) AS total_rider_pay
        FROM (
            SELECT
                DATE_PART('year', pickup_datetime) AS ride_year,
                DATE_PART('month', pickup_datetime) AS ride_month,
                trip_time,
                trip_miles,
                base_passenger_fare,
                tolls,
                bcf,
                sales_tax,
                congestion_surcharge,
                airport_fee,
                tips,
                driver_pay
            FROM PARQUET_SCAN("nyc-taxi-data/*.parquet")
            WHERE 
                ride_year = 2024
            AND ride_month >= 1 
            AND ride_month <= 6
        )
        GROUP BY ride_period
        ORDER BY ride_period
    """
    # Convert to DataFrame
    res = conn.sql(query).df()
    # Close the database connection and measure the finish time
    conn.close()
    time_end = datetime.now()
    return {
        "num_threads": n_threads,
        "num_rows": len(res),
        "duration_seconds": (time_end - time_start).seconds + ((time_end - time_start).microseconds) / 1000000
    }
thread_results = []
for n_threads in range(1, 13):
    thread_results.append(thread_test(n_threads=n_threads))
pd.DataFrame(thread_results)

You should see an output similar to mine:

DuckDB runtime for different number of threads

In general, the more threads you throw at a computation, the faster it finishes. There may be a point at which the overhead of spawning additional threads outweighs the gains and the overall runtime starts climbing again, but I didn’t hit it in this range.

Performance Comparison: DuckDB vs Traditional Approaches

In this section, I’ll show you how to write a data pipeline from scratch. It’ll be relatively simple: reading data from disk, performing aggregations, and writing the results. My goal is to show how much performance you can gain by switching from pandas to DuckDB. Not only that, but your code will also look cleaner.

This is not a comprehensive introduction to ETL/ELT processes but a high-level overview.

Data pipeline objectives

The data pipeline I’m about to show you accomplishes the following:

  • Extract: Reads multiple Parquet files from disk (around 120 million rows).
  • Transform: Calculates monthly summary statistics such as taxi company revenue (the difference between total ride cost and driver pay), revenue margin, and average revenue per ride. You’ll also see the more common statistics discussed earlier in the article.
  • Load: Saves the monthly statistics locally into a CSV file.

After running the code, you should end up with exactly the same aggregation results from both pipeline implementations (aside from minor rounding differences):

Pipeline results for DuckDB and pandas

Up first, let’s go over the pipeline implementation in pandas.

Code: Python and pandas

No matter what I tried, I couldn’t get pandas to process all 6 Parquet files at once. System warnings like the one below appeared within seconds:

System memory error

It looks like 36GB of memory isn’t enough to process 120 million rows at once. As a result, the Python script was killed:

Killed Python script due to insufficient memory

To mitigate this, I had to process Parquet files sequentially. Here’s the code I used for the data pipeline:

import os
import pandas as pd
def calculate_monthly_stats_per_file(file_path: str) -> pd.DataFrame:
    # Read a single Parquet file
    df = pd.read_parquet(file_path)
    # Extract ride year and month
    df["ride_year"] = df["pickup_datetime"].dt.year
    df["ride_month"] = df["pickup_datetime"].dt.month
    # Remove data points that don't fit in the time period
    df = df[(df["ride_year"] == 2024) & (df["ride_month"] >= 1) & (df["ride_month"] <= 6)]
    # Combine ride year and month
    df["ride_period"] = df["ride_year"].astype(str) + "-" + df["ride_month"].astype(str)
    # Calculate total ride cost
    df["total_ride_cost"] = (
        df["base_passenger_fare"] + df["tolls"] + df["bcf"] +
        df["sales_tax"] + df["congestion_surcharge"] + df["airport_fee"] + df["tips"]
    )
    # Aggregations
    summary = df.groupby("ride_period").agg(
        num_rides=("pickup_datetime", "count"),
        ride_time_in_days=("trip_time", lambda x: round(x.sum() / 86400, 2)),
        total_miles=("trip_miles", "sum"),
        total_ride_cost=("total_ride_cost", "sum"),
        total_rider_pay=("driver_pay", "sum")
    ).reset_index()
    # Additional attributes
    summary["total_miles_in_mil"] = summary["total_miles"] / 1000000
    summary["company_revenue"] = round(summary["total_ride_cost"] - summary["total_rider_pay"], 2)
    summary["company_margin"] = round((1 - (summary["total_rider_pay"] / summary["total_ride_cost"])) * 100, 2).astype(str) + "%"
    summary["avg_company_revenue_per_ride"] = round(summary["company_revenue"] / summary["num_rides"], 2)
    # Remove columns that aren't needed anymore
    summary.drop(["total_miles"], axis=1, inplace=True)
    return summary
def calculate_monthly_stats(file_dir: str) -> pd.DataFrame:
    # Read data from multiple Parquet files
    files = [os.path.join(file_dir, f) for f in os.listdir(file_dir) if f.endswith(".parquet")]
    
    df = pd.DataFrame()
    for file in files:
        print(file)
        file_stats = calculate_monthly_stats_per_file(file_path=file)
        # Check if df is empty
        if df.empty:
            df = file_stats
        else:
            # Concat row-wise
            df = pd.concat([df, file_stats], axis=0)
    # Sort the dataset
    df = df.sort_values(by="ride_period")
    # Change column order
    cols = ["ride_period", "num_rides", "ride_time_in_days", "total_miles_in_mil", "total_ride_cost",
            "total_rider_pay", "company_revenue", "company_margin", "avg_company_revenue_per_ride"]
    return df[cols]
if __name__ == "__main__":
    data_dir = "nyc-taxi-data"
    output_dir = "pipeline_results"
    output_file_name = "results_pandas.csv"
    # Run the pipeline
    monthly_stats = calculate_monthly_stats(file_dir=data_dir)
    # Save to CSV
    monthly_stats.to_csv(f"{output_dir}/{output_file_name}", index=False)

The DuckDB implementation of the pipeline, shown next, runs without any memory issues.

Code: DuckDB

You’re already familiar with the bulk of the DuckDB code.

The only thing that’s new is the outermost SELECT, which calculates a couple of additional statistics. I left everything else unchanged:

import duckdb
import pandas as pd
def calculate_monthly_stats(file_dir: str) -> pd.DataFrame:
    query = f"""
        SELECT
            ride_period,
            num_rides,
            ride_time_in_days,
            total_miles / 1000000 AS total_miles_in_mil,
            total_ride_cost,
            total_rider_pay,
            ROUND(total_ride_cost - total_rider_pay, 2) AS company_revenue,
            ROUND((1 - total_rider_pay / total_ride_cost) * 100, 2) || '%' AS company_margin,
            ROUND((total_ride_cost - total_rider_pay) / num_rides, 2) AS avg_company_revenue_per_ride
        FROM (
            SELECT
                ride_year || '-' || ride_month AS ride_period,
                COUNT(*) AS num_rides,
                ROUND(SUM(trip_time) / 86400, 2) AS ride_time_in_days,
                ROUND(SUM(trip_miles), 2) AS total_miles,
                ROUND(SUM(base_passenger_fare + tolls + bcf + sales_tax + congestion_surcharge + airport_fee + tips), 2) AS total_ride_cost,
                ROUND(SUM(driver_pay), 2) AS total_rider_pay
            FROM (
                SELECT
                    DATE_PART('year', pickup_datetime) AS ride_year,
                    DATE_PART('month', pickup_datetime) AS ride_month,
                    trip_time,
                    trip_miles,
                    base_passenger_fare,
                    tolls,
                    bcf,
                    sales_tax,
                    congestion_surcharge,
                    airport_fee,
                    tips,
                    driver_pay
                FROM PARQUET_SCAN("{file_dir}/*.parquet")
                WHERE 
                    ride_year = 2024
                AND ride_month >= 1 
                AND ride_month <= 6
            )
            GROUP BY ride_period
            ORDER BY ride_period
        )
    """
    conn = duckdb.connect()
    df = conn.sql(query).df()
    conn.close()
    return df
if __name__ == "__main__":
    data_dir = "nyc-taxi-data"
    output_dir = "pipeline_results"
    output_file_name = "results_duckdb.csv"
    # Run the pipeline
    monthly_stats = calculate_monthly_stats(file_dir=data_dir)
    # Save to CSV
    monthly_stats.to_csv(f"{output_dir}/{output_file_name}", index=False)

Up next, I’ll show you the runtime differences.

Performance comparison results

After running both pipelines 5 times and averaging the results, these are the runtimes I got:

DuckDB vs. pandas runtime comparison chart

On average, pandas was 24 times slower than DuckDB when loading and processing around 120 million rows (~3GB), spread over 6 Parquet files.
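
If you want to reproduce the comparison, the measurement itself can be as simple as running each script a few times and averaging the wall-clock time. A minimal sketch, assuming you saved the two implementations under the hypothetical names pipeline_pandas.py and pipeline_duckdb.py:

import subprocess
import time
SCRIPTS = ["pipeline_pandas.py", "pipeline_duckdb.py"]
N_RUNS = 5
for script in SCRIPTS:
    durations = []
    for _ in range(N_RUNS):
        start = time.perf_counter()
        # Run the pipeline script as a separate process and wait for it to finish
        subprocess.run(["python", script], check=True)
        durations.append(time.perf_counter() - start)
    print(f"{script}: {sum(durations) / N_RUNS:.2f} s on average over {N_RUNS} runs")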

The comparison is not entirely fair, since I wasn’t able to process all Parquet files at once with pandas. Nevertheless, the results are still relevant since you’ll be encountering the same difficulties at your day job. 

If one library can’t do what you need, try another one. And the one most likely to finish first is DuckDB.

Best Practices for Using DuckDB in Data Pipelines

Before letting you explore DuckDB further on your own, I want to share a couple of generic and data-engineering-related best practices:

  • Always try DuckDB’s built-in functions first: One of the generic best practices in programming is not to reinvent the wheel, so don’t implement functionality from scratch if DuckDB already ships it. When you do need custom logic, you can bring your own Python functions into DuckDB as user-defined functions, which takes DuckDB’s flexibility to an entirely different level (see the sketch after this list).
  • Optimize file formats first: Just because DuckDB offers significant performance benefits out of the box doesn’t mean you should cut your other optimizations short. You’ve seen how much slower DuckDB is when reading CSV files (up to 600 times) compared to Parquet. The latter will always be faster to read and will take up less disk space. Win-win.
  • Don’t use DuckDB for transactional purposes: DuckDB is an OLAP (Online Analytical Processing) database, which means it’s optimized for SELECT statements. Avoid using it in workflows that rely on frequent INSERT and UPDATE statements. If that’s your case, use SQLite instead.
  • Leverage the native DuckDB support in Jupyter Notebooks: Jupyter Lab/Notebook users can run DuckDB queries directly, without needing the specific Python functions. This is a great way to explore data faster and keep your notebooks tidy.
  • Always remember how DuckDB handles concurrency: You can configure DuckDB so that a single process can both read and write to the database, or so that multiple processes can read from it but none can write. The first option allows data to be cached in RAM for faster analytical queries. In theory, multiple processes can write to the same database file, but to achieve this you would need to implement cross-process mutex locks and open/close the database manually yourself.
  • You can use DuckDB in the cloud: The MotherDuck project is a collaborative data warehouse that brings and extends the power of DuckDB to the cloud. It might be an avenue worth exploring for data engineers.
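
To make the user-defined function point from the first bullet concrete, here’s a minimal sketch of registering a Python function with DuckDB’s create_function API. The format_usd helper is a hypothetical example, not something used in the pipelines above:

import duckdb
from duckdb.typing import DOUBLE, VARCHAR
# Hypothetical helper: formats a dollar amount as a display string
def format_usd(amount: float) -> str:
    return f"${amount:,.2f}"
conn = duckdb.connect()
# Register the Python function so it can be called from SQL
conn.create_function("format_usd", format_usd, [DOUBLE], VARCHAR)
print(conn.sql("SELECT format_usd(1234567.891) AS pretty").fetchone()[0])
conn.close()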

Wrapping Up

To conclude, I think you should try DuckDB if you’re a data engineer in charge of building and optimizing data pipelines.

You have nothing to lose and everything to gain. It’s almost guaranteed to be faster than anything Python’s standard data stack has to offer. It runs blazing fast on your laptop and lets you work through datasets that would otherwise result in memory errors. It can even connect to cloud storage platforms in case you don’t want to, or aren’t allowed to, store data locally.

That being said, DuckDB shouldn’t be the only optimization you bring to the table. You should always strive to optimize the data that’s going into your system. For example, ditching CSV in favor of Parquet will save you both compute time and storage. Another optimization you should consider is to implement the principles of modern data architecture into your workflows and pipelines.

DuckDB is just a tool. But a very powerful one.


FAQs

Is DuckDB free to use?

Yes, DuckDB is an open-source project. You can download it, modify it, and even contribute through GitHub.

How is DuckDB different from SQLite?

DuckDB is optimized for analytical workloads through complex SQL queries (think SELECT statements), while SQLite is better suited for transactional processing (think INSERT and UPDATE statements).

Is DuckDB a NoSQL database?

No, DuckDB is an in-process SQL OLAP (Online Analytical Processing) database management system.

Is DuckDB faster than pandas?

In almost all scenarios, DuckDB will be faster than pandas, often by an order of magnitude (at least). It will also allow you to work with datasets that would raise memory errors in pandas.

Does DuckDB use a special SQL dialect?

No, you’ll feel right at home if you have basic SQL knowledge. DuckDB closely follows the conventions of the PostgreSQL dialect, but even if you’re accustomed to a different database vendor, there’s practically no learning curve.


