Getting Started with Polars GPU Acceleration: 13x Faster Queries

Discover how to use the recently released Polars GPU engine, powered by NVIDIA RAPIDS cuDF, to achieve faster query performance on large datasets.

Sep 19, 2024 · 11 min read

Recently, I had the privilege of getting early access to the Polars GPU engine, powered by NVIDIA RAPIDS cuDF, before its open beta release. This cutting-edge feature has the potential to transform data workflows by accelerating Polars operations up to 13x with NVIDIA GPUs. If you work with large-scale datasets in Python, this is a game-changer you won’t want to miss.

In this blog post, I’ll explain everything you need to know about the new Polars GPU engine and provide a step-by-step guide to help you get started!

Polars: A High-Performance DataFrame Library

At the core of most data science workflows is the DataFrame, a tabular data structure that is both flexible and intuitive for handling structured data. Simply put, everyone in data science has worked with DataFrames.

DataFrames allow easy data manipulation, exploration, and analysis, providing a familiar and consistent interface for data cleaning, filtering, grouping, and transforming data.

However, in the context of big data, traditional DataFrame libraries can lack performance and scalability, which is where Polars enters the scene.

Polars is a fast, efficient DataFrame library that has quickly become a top choice for high-performance data processing. Written from scratch in Rust, Polars is designed to operate close to the hardware, optimizing speed and resource usage without relying on external dependencies.

The Introduction to Polars blog post is an excellent resource for getting started with the library in Python.

Comparing the performance of the most popular DataFrame libraries. Image source: Polars

The benchmark results published by Polars, shown in the image above, show Polars outperforming other libraries such as pandas, Modin, PySpark, and Dask across the benchmarked queries. These results position Polars as one of the fastest options for high-performance data processing.

But how is that even possible? These are some of the features that make Polars blazing fast:

Written in Rust: It uses low-level programming to execute operations, remaining close to the hardware.
I/O flexibility: It offers first-class support for all common data storage layers, whether local, in cloud storage, or connected to databases.
Intuitive API: The Polars API allows you to write queries intuitively. The Polars query optimizer internally determines the most efficient execution plan, minimizing time and memory use.
Out-of-core processing: Polars’ streaming API allows you to process datasets too large to fit into memory.
Parallel execution: Polars automatically utilizes all available CPU cores, dividing workloads across them without requiring additional configuration, maximizing your machine's computational power.
Vectorized query engine: Polars processes queries in a vectorized manner using Apache Arrow, a columnar data format, and SIMD (Single Instruction, Multiple Data) to further optimize CPU usage.

And with the recent open beta announcement of the Polars GPU engine, it’s about to become even faster!


Understanding the NVIDIA RAPIDS cuDF Integration

The demand for faster data processing and analysis has grown exponentially in recent years as datasets have scaled to hundreds of millions or even billions of rows. To address this challenge, the data science field has seen a growing shift from traditional CPU-based processing to GPU-accelerated computing. This is where NVIDIA RAPIDS cuDF comes into play.

NVIDIA RAPIDS is an open-source suite of libraries that enables GPU-accelerated data science and analytics. At its core, RAPIDS is designed to streamline data workflows by utilizing the massive parallelism of NVIDIA GPUs to accelerate tasks that are typically CPU-bound.

Within RAPIDS, cuDF is the library responsible for DataFrame operations. cuDF extends the familiar DataFrame abstraction to GPU memory, allowing integration into data workflows without extensive code changes. The library supports all the key operations data scientists perform on DataFrames: filtering, aggregating, merging, and sorting, all powered by the GPU for speed.

NVIDIA and Polars just announced the open beta release of the Python Polars GPU engine, powered by RAPIDS cuDF. This new feature marks a significant leap forward for high-performance data processing, delivering up to 13x faster workflows for Polars users with NVIDIA GPUs!

High-level design of the Polars API, including a GPU engine. Image source: Polars

With the addition of the GPU engine, Polars users can decide which engine to run their data workloads on. The Polars optimizer supports all engines, dynamically determining which queries can execute on the GPU or CPU.

The Polars GPU Engine: A Game-Changer

So now you have some context on Polars and the NVIDIA RAPIDS cuDF library and how both integrate and are made available through the Python Polars API. But what does this mean for developers and data scientists?

Here’s a summary of what the Polars GPU engine enables:

Integration with Python Polars Lazy API: Users can simply pass engine="gpu" to the collect() operation in Polars’ Lazy API, enabling GPU processing without significant code changes.
Interactive processing of large datasets: The GPU engine is designed to make processing hundreds of millions of rows of data feel interactive. With a single GPU, you can handle massive data volumes that would require more complex and slower solutions.
Optimized for efficiency: The Polars GPU engine fully utilizes the Polars optimizer to ensure execution is as efficient as possible, minimizing processing time and memory usage.
Graceful CPU fallback: In cases where specific queries aren’t supported on the GPU, the engine includes a CPU fallback. This ensures the workflow remains uninterrupted, automatically reverting to CPU processing when needed.

So, what are the results so far? Let’s see.

Polars GPU Engine accelerates data processing up to 13x. Image source: Polars

The benchmark above demonstrates the impressive capabilities of the Polars GPU engine by accelerating queries up to 13x.

However, it's important to note that using Polars on a GPU isn’t always guaranteed to be faster than on a CPU, particularly for simpler queries that aren’t computationally intensive. In such cases, performance is often limited by the speed of reading data from disk, meaning GPU acceleration offers less of an advantage when input/output (I/O) operations are the primary constraint.

That said, the GPU engine provides significant speedups for more complex data operations, making it a valuable tool for teams handling large datasets or queries involving joins, group-bys, and string processing.

The interaction between CPU and GPU

One of the most compelling features of this new integration is its flexibility. As previously mentioned, Polars can run a query on the GPU or the CPU depending on whether the GPU engine supports the operations involved. But how exactly does this transition work?

Polars without GPU acceleration

When running Polars on the CPU, the engine follows a structured and highly optimized pipeline to execute data operations. This process involves roughly the following steps:

DSL (domain-specific language): The Polars engine first creates a structured query outline from your code, similar to a programming language syntax tree. This outline defines the sequence of operations you want to perform, such as filtering, grouping, and aggregating.
IR (intermediate representation): Next, Polars converts the query outline into a detailed execution plan, ensuring that all operations are correct and the data schemas match. This step sets up the foundation for how Polars will process the data.
Optimizer: The optimizer improves the execution plan by reordering operations and removing unnecessary steps. For example, filtering may be moved ahead of joins or aggregations to reduce the size of the dataset earlier in the process. This optimization ensures the plan is as efficient as possible before execution.
Optimized IR: After optimization, the improved execution plan is finalized and ready to be processed by the Polars engine. This plan is the blueprint for how Polars will execute each operation on the data.
Polars in-memory engine (CPU execution): Finally, the Polars in-memory engine executes the optimized plan entirely on the CPU. Despite Polars’ high-performance optimizations, this step may take considerable time if you're working with large datasets, as a CPU has far fewer parallel execution units than a GPU.

Polars with GPU acceleration

When you enable GPU acceleration by passing engine="gpu" to the .collect() method, the process remains essentially the same on the surface but with additional checks and optimizations behind the scenes to determine how much of the workload can be offloaded to the GPU, via cuDF:

DSL (domain-specific language): Like in the CPU mode, Polars starts by creating a structured query outline from your code. This outline is the same whether you’re running on the CPU or GPU, capturing the operations you want to perform.
IR (intermediate representation): Polars converts the outline into an execution plan, checking for correctness and ensuring the data schemas are valid. The IR remains neutral at this stage, as the engine hasn’t decided whether the CPU or GPU will handle the operations.
Optimizer: The optimizer works as it does for CPU execution, reordering operations for maximum efficiency, and removing redundant steps. At this point, the plan is still optimized for general execution.
Optimized IR: The optimized plan is finalized as in CPU execution. However, with GPU acceleration enabled, the engine now checks whether parts of this plan can be offloaded to the GPU.
cuDF callback (GPU check): This is where the magic happens! The cuDF callback determines whether the GPU can handle the execution plan. Certain operations are naturally suited for GPU acceleration, while others may not be. The callback modifies the plan to use the GPU for supported operations.
Polars in-memory engine (GPU and CPU execution): Finally, the Polars in-memory engine executes the plan. The GPU handles queries that can be GPU-accelerated, while queries that the GPU cannot handle fall back to the CPU.

This hybrid execution model—where GPU and CPU are used in tandem—makes Polars highly flexible. You benefit from GPU acceleration without sacrificing compatibility for queries not supported on the GPU. The Polars release blog post explains this in deeper technical detail.
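To build some intuition for the GPU check, here is a deliberately simplified pure-Python model of the decision. It is not the real cuDF callback: the operation names and the GPU_SUPPORTED set are hypothetical, and I assume the simplest possible rule, where a query runs on the GPU only if every step is supported.

```python
# Hypothetical set of operations a GPU backend might support.
GPU_SUPPORTED = {"scan", "filter", "group_by", "agg", "sort", "join"}


def choose_engine(plan: list[str], gpu_requested: bool) -> str:
    """Return "gpu" only if it was requested and every step is supported."""
    if gpu_requested and all(step in GPU_SUPPORTED for step in plan):
        return "gpu"
    # Anything else falls back to the regular CPU engine.
    return "cpu"


print(choose_engine(["scan", "group_by", "agg", "sort"], gpu_requested=True))
print(choose_engine(["scan", "custom_python_udf"], gpu_requested=True))
print(choose_engine(["scan", "filter"], gpu_requested=False))
```

The first call picks the GPU, the second falls back because one step is unsupported, and the third stays on the CPU because the GPU was never requested.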

Getting Started with the Polars GPU Engine

Finally, the most exciting part! In this section, I’ll walk through the steps needed to set up your environment and begin taking advantage of GPU-accelerated data processing.

The most straightforward way to get up and running is to follow along with this Google Colab notebook.

Colab offers free, limited GPU usage, which is particularly useful if you don’t have immediate access to a GPU. If you’re using the provided notebook, it starts from step 2 below, so you can skip the steps before it.

Prerequisites

Before installing the Polars GPU engine, ensure your system meets the requirements for NVIDIA RAPIDS cuDF. You can review the necessary system specifications, including GPU compatibility and driver requirements, on the RAPIDS documentation page.

1. Create a virtual environment (recommended)

To start with the Polars GPU engine, it's best practice to create a virtual environment to isolate your project dependencies.

I use conda for this example but feel free to use your preferred package manager.

In addition to Python, you can include JupyterLab in the environment for interactive data exploration and development, which is especially useful for running small code snippets and analyzing datasets on the fly.

conda create -n polars-gpu -c conda-forge python=3.11 jupyterlab

After creating the virtual environment, activate it using:

conda activate polars-gpu

2. Install the Polars GPU engine

Now, let’s enable GPU acceleration by installing Polars and the GPU engine. This will also set up cuDF and other dependencies required for using NVIDIA GPUs:

pip install -U "polars[gpu]" --extra-index-url=https://pypi.nvidia.com

Once the Polars GPU engine is installed, using Polars on a GPU will feel similar to working with the CPU but with much faster results for complex or data-heavy workflows.

As explained before, the Polars engine automatically optimizes the execution plan, fully utilizing the NVIDIA GPU to speed up operations and minimize memory usage. For queries that do not support GPU acceleration, the Polars GPU engine has graceful CPU fallback, ensuring the workflow continues uninterrupted.
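If you want a quick sanity check after installing, the sketch below looks for the cudf_polars package, which is the module the GPU engine is built on. The helper name is my own, and it only tells you whether the package is importable, not whether a compatible GPU and driver are actually available.

```python
import importlib.util


def gpu_engine_installed() -> bool:
    """True if both polars and the cudf_polars package are importable."""
    return (
        importlib.util.find_spec("polars") is not None
        and importlib.util.find_spec("cudf_polars") is not None
    )


print(gpu_engine_installed())
```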

Let’s put the engine into action with a few simple examples!

Running Some Queries with the Polars GPU Engine

For this demonstration, we'll use a 22GB dataset of simulated financial transactions from Kaggle. NVIDIA hosts the dataset on a Google Cloud Storage (GCS) bucket, ensuring fast download speeds.

To start, download the dataset:

!wget https://storage.googleapis.com/rapidsai/polars-demo/transactions.parquet -O transactions.parquet

Since this dataset is sourced from Kaggle, it's governed by a Kaggle dataset-specific license and terms of use.

Let’s load the dataset using Polars and inspect the schema:

import polars as pl
from polars.testing import assert_frame_equal

transactions = pl.scan_parquet("transactions.parquet")
transactions.collect_schema()

Here’s the output:

Schema([('CUST_ID', String),
        ('START_DATE', Date),
        ('END_DATE', Date),
        ('TRANS_ID', String),
        ('DATE', Date),
        ('YEAR', Int64),
        ('MONTH', Int64),
        ('DAY', Int64),
        ('EXP_TYPE', String),
        ('AMOUNT', Float64)])

Now, let’s calculate the total transaction volume by summing the AMOUNT column. First, let’s try it without GPU acceleration:

transactions.select(pl.col("AMOUNT").sum()).collect()

Output:

AMOUNT
f64
3.6183e9

That’s a high total transaction volume! Let's run the same query on the GPU:

transactions.select(pl.col("AMOUNT").sum()).collect(engine="gpu")

Output:

AMOUNT
f64
3.6183e9

For simple operations like this, the CPU and GPU produce the same result with similar speed, as the query is not computationally intensive enough to benefit much from GPU acceleration. However, the GPU will shine when handling more complex queries.

Now, let’s move on to a more complex query. In this query, we group transactions by customer ID (CUST_ID), sum the total transaction amounts, and then sort the results by the highest spenders.

First, we’ll run it on the CPU:

%%time
res_cpu = (
   transactions
   .group_by("CUST_ID")
   .agg(pl.col("AMOUNT").sum())
   .sort(by="AMOUNT", descending=True)
   .head()
   .collect()
)
res_cpu

Here’s the output:

CPU times: user 4.63 s, sys: 3.75 s, total: 8.38 s
Wall time: 6.04 s
CUST_ID       AMOUNT
str           f64
"CA9UYOQ5DA"  2.0290e6
"CJUK2MTM5Q"  1.8115e6
"CYXX1NBIKL"  1.8082e6
"C6ILEYAYQ9"  1.7961e6
"CCNBC305GI"  1.7274e6

Now, let’s run the same query on the GPU:

%%time
res_gpu = (
   transactions
   .group_by("CUST_ID")
   .agg(pl.col("AMOUNT").sum())
   .sort(by="AMOUNT", descending=True)
   .head()
   .collect(engine="gpu")
)
res_gpu

Output:

CPU times: user 347 ms, sys: 0 ns, total: 347 ms
Wall time: 353 ms
shape: (5, 2)
CUST_ID       AMOUNT
str           f64
"CA9UYOQ5DA"  2.0290e6
"CJUK2MTM5Q"  1.8115e6
"CYXX1NBIKL"  1.8082e6
"C6ILEYAYQ9"  1.7961e6
"CCNBC305GI"  1.7274e6

As we can see, using the GPU for more complex queries offers significant performance gains, reducing execution time from 6.04 seconds on the CPU to just 353 milliseconds on the GPU!

This example demonstrates the powerful performance boost the Polars GPU engine provides for large-scale data operations.

You can find more advanced examples in the accompanying Colab notebook.

Conclusion

The Polars GPU Engine, powered by NVIDIA RAPIDS cuDF, brings impressive speed improvements, with up to 13x faster data processing for complex operations. For large-scale datasets, Polars offers a clear advantage over traditional DataFrame libraries.

Even though not every query will benefit equally from GPU acceleration, Polars remains a powerful tool for anyone working with large datasets. The ease of integration, combined with its hybrid CPU-GPU execution model, makes Polars a strong contender for modern data workflows.




Author

Thalia Barrera
