Introduction
Pandas is an extraordinarily powerful tool in Python's data science ecosystem, offering a rich set of data manipulation and cleaning capabilities. However, while great for small and medium-sized datasets, it can run into performance issues on large datasets, prompting the need for high-performance alternatives.
This comprehensive article introduces some of these alternatives and compares them through benchmarking in terms of data loading time, execution time, memory usage, scalability, and ease of use.
Understanding the Benchmarks
Benchmarks are a point of reference against which software or hardware may be compared for performance evaluation. This reference is relevant in software performance optimization as it allows us to measure the efficiency of different methods, algorithms, or tools. In this context, the key metrics for benchmarking data manipulation libraries include execution time, memory usage, scalability, and ease of use.
Introduction to High-Performance Alternatives
The main alternative Python tools are Polars, Vaex, and Datatable. Before diving into their comparative analysis, let’s have a quick overview of what each one of these tools is.
Polars
Polars can be described by three key features. First, it offers a comprehensive Python API filled with numerous functions to handle Dataframes. Second, it can serve dual purposes - it can be employed as a library for Dataframes or as the underlying query engine for data models. Lastly, by incorporating the secure Arrow2 implementation of the Apache Arrow specification, Polars becomes a highly efficient tool for processing large amounts of data.
You can check out our introduction to Polars tutorial, as well as a comparison between pandas 2.0 and Polars.
Vaex
Vaex provides lazy, out-of-core DataFrames (with a Pandas-like API) for visualizing and exploring big tabular datasets. It can be very efficient because it delays operations until they are actually needed (lazy evaluation), reducing memory usage and computation time.
Datatable
Datatable can be used for performing large data processing (up to 100GB) on a single-node machine at the maximum speed possible. One of the important features of Datatable is its interoperability with Pandas/NumPy/pure Python, which allows users to easily convert to another data-processing framework.
To learn more about pandas, our pandas Tutorial: DataFrames in Python is a great starting point.
Also, the pandas Cheat Sheet for Data Science in Python gives a quick guide to the basics of the Python data analysis library, along with code samples.
Setting up the Benchmarking Environment
This section covers the creation of the benchmarking data and the installation of all the libraries involved in the benchmarking analysis.
Benchmarking data
The benchmarking data set is 5.7 GB, which is large enough for a meaningful comparison. It was built from the diabetes dataset, which is freely available from Kaggle: each of its 768 rows was duplicated 100,000 times, resulting in 76,800,000 rows.
The final benchmarking notebook is available on GitHub.
import pandas as pd

data_URL = "https://raw.githubusercontent.com/keitazoumana/Experimentation-Data/main/diabetes.csv"
original_data = pd.read_csv(data_URL)

# Duplicate each row 100,000 times
benchmarking_df = original_data.loc[original_data.index.repeat(100000)]

# Save the result as a CSV file for future use
file_name = "benchmarking_data.csv"
benchmarking_df.to_csv(file_name, index=False)
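To see what index.repeat is doing, here is a tiny sketch on a two-row stand-in for the diabetes data (illustrative values, not the real dataset):

```python
import pandas as pd

# Tiny stand-in for the diabetes data: 2 rows, duplicated 3 times each
small_df = pd.DataFrame({"Pregnancies": [1, 2], "Glucose": [85, 168]})

# index.repeat(n) repeats each index label n times; .loc then
# materializes one copy of the row per repeated label
duplicated = small_df.loc[small_df.index.repeat(3)]

print(len(duplicated))  # 6 rows: each of the 2 originals appears 3 times
```

With repeat(100000) on 768 rows, the same mechanism yields the 76,800,000-row benchmark file.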
Installation of the libraries
Now, we can proceed with the installation of all the libraries involved in the benchmarking analysis.
From a Jupyter notebook environment, all the libraries can be installed with the pip install command as follows:
%%bash
pip3 install -U polars
pip3 install vaex
pip3 install "git+https://github.com/h2oai/datatable.git"
After successfully running the above code block, we can import the libraries with the import statement, along with the built-in time module used for timing:
import time
import polars as pl
import vaex as vx
import datatable as dt
Furthermore, we use the Plotly Express library for visualization purposes.
import plotly.express as px
Now we are all set to proceed with the benchmarking analysis.
Benchmarking Execution Time
The execution time will be evaluated for the following operations: data loading, data grouping, column sorting, and data offloading. A helper function is implemented for each task; each returns a dictionary with two keys: the library name and the execution time.
The plot_metrics function is then used to plot the results graphically using the Plotly Express library. It takes two parameters: the list of dictionaries returned by the helper functions and the title of the resulting graphic.
def plot_metrics(list_exec_time, graph_title):
    df = pd.DataFrame(list_exec_time)
    fig = px.bar(df, x='library', y='execution_time', title=graph_title)
    fig.show()
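As an aside, the timing pattern repeated in each helper below can be factored into a reusable context manager. A minimal sketch using only the standard library (the timed helper is a hypothetical refactoring, not part of the article's notebook):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(library_name, results):
    """Time the enclosed block and append a result dictionary to results."""
    start = time.perf_counter()  # monotonic and higher resolution than time.time()
    try:
        yield
    finally:
        results.append({"library": library_name,
                        "execution_time": time.perf_counter() - start})

results = []
with timed("pandas", results):
    _ = sum(range(1_000_000))  # stand-in for a real data-loading call

print(results[0]["library"])  # pandas
```

The resulting dictionaries have exactly the shape plot_metrics expects.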
Data loading
The helper function read_csv_with_time has two parameters: the name of the library to use and the name of the file to read.
def read_csv_with_time(library_name, file_name):
    start_time = time.time()
    if library_name.lower() == 'polars':
        df = pl.read_csv(file_name)
    elif library_name.lower() == 'pandas':
        df = pd.read_csv(file_name)
    elif library_name.lower() == 'vaex':
        df = vx.read_csv(file_name)
    elif library_name.lower() == 'datatable':
        df = dt.fread(file_name)
    else:
        raise ValueError("Invalid library name. Must be 'polars', 'pandas', 'vaex', or 'datatable'")
    end_time = time.time()
    final_time = end_time - start_time
    return {"library": library_name, "execution_time": final_time}
The function can be applied using each of the libraries, starting with pandas.
pandas_time = read_csv_with_time('pandas', file_name)
polars_time = read_csv_with_time('polars', file_name)
vaex_time = read_csv_with_time('vaex', file_name)
datatable_time = read_csv_with_time('datatable', file_name)
The resulting dictionaries can be combined as a list and plotted using the following expression.
exec_times = [pandas_time, polars_time,
vaex_time, datatable_time]
df = pd.DataFrame(exec_times)
# Plot bar plot using Plotly Express
fig = px.bar(df, x='library', y='execution_time', title="Execution Time Comparison")
fig.show()
Dictionary format of the execution times for data loading
Graphic format of the execution times for data loading
Based on the graphics, we can conclude that Polars is:
- 4.46 times faster than pandas
- 5.34 times faster than Vaex
- 2 times slower than Datatable
Hence, Datatable has the best performance.
Data grouping
Similarly, the group_data_with_time function tracks the execution time for grouping on the column provided as a parameter. In this case, we are using the Pregnancies column.
def group_data_with_time(library_name, df, column_name='Pregnancies'):
    start_time = time.time()
    if library_name.lower() == 'polars':
        df_grouped = df.groupby(column_name).first()
    elif library_name.lower() == 'vaex':
        df_grouped = df.groupby(column_name)
    elif library_name.lower() == 'pandas':
        df_grouped = df.groupby(column_name)
    elif library_name.lower() == 'datatable':
        df_grouped = df[:, :, dt.by(column_name)]
    else:
        raise ValueError("Invalid library name. Must be 'polars', 'pandas', 'vaex', or 'datatable'")
    end_time = time.time()
    final_time = end_time - start_time
    return {"library": library_name, "execution_time": final_time}
Dictionary format of the execution times for data grouping
Graphic format of the execution times for data grouping
For the data grouping graphic, we can notice that pandas is approximately:
- 899.15 times faster than Polars
- 158.14 times faster than vaex
- 99.12 times faster than Datatable
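One caveat helps explain these striking numbers: in the pandas and Vaex branches above, df.groupby(column) alone does not compute anything; it only builds a lazy GroupBy object, whereas the Polars and Datatable branches materialize the grouped result. A small pandas sketch of this asymmetry:

```python
import pandas as pd

df = pd.DataFrame({"Pregnancies": [0, 1, 1, 2], "Glucose": [85, 89, 137, 78]})

grouped = df.groupby("Pregnancies")          # lazy: no aggregation happens yet
result = df.groupby("Pregnancies").first()   # eager: actually scans the data

print(type(grouped).__name__)  # DataFrameGroupBy
print(len(result))             # 3 groups: 0, 1, 2
```

So the pandas grouping time measured here covers much less work than the Polars one, which is worth keeping in mind when reading the comparison.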
Column Sorting
Using the same approach, the sort_data_with_time function sorts the given column using each library.
def sort_data_with_time(library_name, df, column_name='Pregnancies'):
    start_time = time.time()
    if library_name.lower() == 'polars':
        df_sorted = df.sort(column_name)
    elif library_name.lower() == 'vaex':
        df_sorted = df.sort(column_name)
    elif library_name.lower() == 'datatable':
        df_sorted = df.sort(column_name)
    elif library_name.lower() == 'pandas':
        df_sorted = df.sort_values(column_name)
    else:
        raise ValueError("Invalid library name. Must be 'polars', 'vaex', 'datatable', or 'pandas'")
    end_time = time.time()
    final_time = end_time - start_time
    return {"library": library_name, "execution_time": final_time}
Dictionary format of the execution times for data sorting
Graphic format of the execution times for data sorting
For the column sorting, Pandas seems to have the highest execution time. It is approximately:
- 1.61 times slower than Polars
- 8.54 times slower than Vaex
- 22.87 times slower than Datatable.
Data Offloading
With data offloading, the idea is to convert the original data into a different format, which in this specific scenario is a NumPy array.
def offload_data_with_time(library_name, df):
    start_time = time.time()
    if library_name.lower() == 'polars':
        array = df.to_numpy()
    elif library_name.lower() == 'vaex':
        array = df.to_pandas_df().values
    elif library_name.lower() == 'datatable':
        array = df.to_numpy()
    elif library_name.lower() == 'pandas':
        array = df.to_numpy()
    else:
        raise ValueError("Invalid library name. Must be 'polars', 'vaex', 'datatable', or 'pandas'")
    end_time = time.time()
    final_time = end_time - start_time
    return {"library": library_name, "execution_time": final_time}
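For pandas, for instance, the offloading step boils down to a single to_numpy() call. A tiny sketch on a two-row stand-in (illustrative values only):

```python
import pandas as pd

df = pd.DataFrame({"Pregnancies": [1, 2], "Glucose": [85, 168]})

# to_numpy() returns the DataFrame's values as a NumPy ndarray
array = df.to_numpy()

print(array.shape)  # (2, 2): one row per record, one column per field
```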
Dictionary format of the execution times for data offloading
Graphic format of the execution times for data offloading
Data offloading is the last round of the benchmarking for the execution time. This time, Vaex is the one with the highest execution time. Comparing pandas to these libraries, we can see that its execution time is approximately:
- 1.47 times slower than Polars.
- 2.91 times slower than Datatable.
- 2.62 times faster than Vaex
Benchmarking Memory Usage
Execution time is not the only criterion when comparing different libraries; how efficiently they use memory is also important to consider. This section measures two memory-usage metrics using the tracemalloc library:
- Current memory usage. The amount of memory held at the moment of measurement when using a specific library. It corresponds to the RAM space occupied at that time.
- Peak memory usage. The maximum amount of memory used by the program at any point. The peak is always at least as high as the current usage.
These visualizations help identify which library uses the most memory, which can lead to out-of-memory errors, especially when working with large datasets.
In addition to tracemalloc, the os library is also required.
import tracemalloc as tm
import os
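Before applying it to the benchmark, here is a minimal sketch of what tracemalloc reports: get_traced_memory() returns a (current, peak) tuple in bytes.

```python
import tracemalloc

tracemalloc.start()
blocks = [bytes(1000) for _ in range(1000)]  # allocate roughly 1 MB
del blocks                                   # release it again
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# current is the memory still held now; peak is the high-water mark
# reached during tracing, so peak >= current always holds
print(current <= peak)  # True
```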
Before starting, we create an empty list that will hold the result of each memory usage value.
list_memory_usage = []
For each library, memory usage is estimated with the general syntax below, shown here for the Vaex library. Note that this focuses only on the memory usage of the data offloading task.
tm.start()
vaex_time = offload_data_with_time('vaex', vaex_df)
memory_usage = tm.get_traced_memory()
tm.stop()
list_memory_usage.append({
'library': 'vaex',
'memory_usage': memory_usage
})
The syntax is the same for the remaining libraries, and we get the following instructions, respectively for Polars, pandas, and Datatable:
tm.start()
polars_time = offload_data_with_time('polars', polars_df)
memory_usage = tm.get_traced_memory()
tm.stop()
list_memory_usage.append({
'library': 'polars',
'memory_usage': memory_usage
})
tm.start()
offload_data_with_time('pandas', pandas_df)
# Get the memory usage
memory_usage = tm.get_traced_memory()
tm.stop()
list_memory_usage.append({
'library': 'pandas',
'memory_usage': memory_usage
})
tm.start()
datatable_time = offload_data_with_time('datatable', datatable_df)
memory_usage = tm.get_traced_memory()
tm.stop()
list_memory_usage.append({
'library': 'datatable',
'memory_usage': memory_usage
})
After the execution of all the previous code, we can plot the graphic using the helper function below:
def plot_memory_usage(list_memory_usage, graph_title='Memory Usage by Library'):
    df = pd.DataFrame(list_memory_usage)
    # Separate the memory usage tuple into two columns: current_memory and peak_memory
    df[['current_memory', 'peak_memory']] = pd.DataFrame(df['memory_usage'].tolist(), index=df.index)
    # The memory_usage column is no longer needed
    df = df.drop(columns='memory_usage')
    # Melt the DataFrame to make it suitable for a grouped bar chart
    df_melted = df.melt(id_vars='library', var_name='memory_type', value_name='memory')
    # Create the grouped bar chart (log scale, since values span orders of magnitude)
    fig = px.bar(df_melted, x='library', y='memory', color='memory_type', barmode='group',
                 labels={'memory': 'Memory Usage (bytes)', 'library': 'Library', 'memory_type': 'Memory Type'},
                 title=graph_title)
    fig.update_layout(yaxis_type="log")
    fig.show()
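The melt step deserves a quick illustration, since it is what turns the wide (current_memory, peak_memory) columns into the long format that Plotly Express expects for a grouped bar chart. A toy sketch with made-up numbers (not the benchmark results):

```python
import pandas as pd

wide = pd.DataFrame({
    "library": ["pandas", "polars"],
    "current_memory": [100, 10],
    "peak_memory": [500, 20],
})

# melt turns the two memory columns into (memory_type, memory) pairs,
# producing one row per library/metric combination
long = wide.melt(id_vars="library", var_name="memory_type", value_name="memory")

print(len(long))  # 4 rows: 2 libraries x 2 metrics
print(sorted(long["memory_type"].unique()))
```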
We can notice that:
- Pandas uses approximately 1100 times more current memory than Polars, 7.4 times more than Vaex, and 29.4 times more than Datatable.
- When it comes to peak memory usage, both pandas and Vaex use around 963K times more memory than Polars and 131K times more than Datatable.
Further analysis, such as Dask vs pandas speed, could provide a broader overview.
Pandas Alternatives Comparison Table
Below, we’ve compiled our findings into a comparison table, showing the differences between Polars, Vaex, and Datatable:
| Benchmark Criteria | Polars | Vaex | Datatable |
| --- | --- | --- | --- |
| Data Loading Time | Faster than Vaex, slower than Datatable | Slower than both Polars and Datatable | Fastest among the three |
| Data Grouping Time | Slowest among the three | Faster than Polars, slower than Datatable | Fastest among the three |
| Data Sorting Time | Faster than Vaex, slower than Datatable | Slower than both Polars and Datatable | Fastest among the three |
| Data Offloading Time | Faster than Vaex, slower than Datatable | Slowest among the three | Fastest among the three |
| Current Memory Usage | Uses the least memory | Uses the most memory among the three | Uses more memory than Polars but less than Vaex |
| Peak Memory Usage | Uses the least memory | Uses the most memory among the three | Uses more memory than Polars but less than Vaex |
| Scalability | Scalable with data size | Scalable with data size, but may use more memory as data size increases | Highly scalable with data size |
| Ease of use | Has a complete Python API and is compatible with Apache Arrow | Utilizes lazy evaluation and is compatible with Pandas | Offers interoperability with Pandas/NumPy/pure Python |
Conclusion
Different factors must be considered when analyzing scalability and ease of integration. Our benchmarking highlighted that even though pandas is well-established, user-friendly, and has a gentle learning curve, it does not handle large datasets efficiently.
However, there are different ways to speed up pandas, and our High-Performance Data Manipulation in Python: Pandas 2.0 vs Polars tutorial can help with that.
Vaex's performance varies depending on the task. Ultimately, choosing an alternative to pandas depends on users' specific demands, dataset sizes, and their ability to handle the learning curve and integration effort.

Zoumana is a Data/Machine Learning professional. He loves to build machine learning models that are beneficial to society.