Skip to main content

Top Python Packages For R Users — Become a Bilingual Data Scientist

Take the first step towards becoming a bilingual data scientist by learning about the best libraries the Python language has to offer for die-hard R lovers.
Jul 2022  · 11 min read

The "R vs. Python" language war has a long history. As the defacto programming languages for data science, they attract different supporters from many domains. As a result, practitioners don’t take full advantage of what both languages offer and limit themselves to mastering one tool only. However, learning to use both R and Python as a bilingual data scientist can help you better solve any problems you may encounter in your career.

As Python grows in popularity, more and more R users are switching to Python. So, if you are an R user, consider this article as an overview of Python libraries that you can integrate into your daily workflow. Some have R-like syntax for a smooth switch, while others bring more functionality and speed. 

Top Python packages for R Users

Data Manipulation Libraries

R has a rich ecosystem of data manipulation libraries. Be it dplyr, tidyr, or data.table, R users enjoy a wide variety of tools at their disposal. However, they might consider switching to some Python alternatives for more flexibility, speed, and features.

Pandas

Chief amongst the Python data manipulation packages is the pandas library. It is the premier data manipulation package in the Python data science stack and is used by millions worldwide. It currently has over 20 million weekly downloads, making it one of the most popular Python packages.

pandas download stats

pandas download stats from PyPI Stats website.

It offers such an extensive suite of functions and classes to work with data, that you can still find yourself learning new techniques with pandas despite years of experience. pandas is also a keystone library in the ecosystem, as many other Python libraries mentioned in this article are written so that their functionality aligns with pandas' classes.

Even though it is such an extensive library, it is simple to learn and master. Knowing a few classes and functions allows you to perform complex analyses on any dataset.

Python’s datatable package

If pandas’ syntax looks too unfamiliar to R users, they might feel right at home using the Python datatable package. It is inspired by its R counterpart and written solely to deal with today's massive datasets, and it can read and manipulate gigabyte-sized files in mere seconds.

An everyday use case is to read a large dataset with datatable and convert it to the pandas DataFrame format, which is much faster than reading it purely with pandas. But, as an R user, you don't even have to do that as datatable has almost the same syntax as the data.table package of R.

RapidsAI & cuDF

R users looking to switch to Python can also benefit from improved performance due to widespread GPU support for Python libraries. RapidsAI offers just that opportunity via the cuDF library. cuDF is a dataframe library that lets you manipulate datasets with billions of rows by tapping into the computing power of NVIDIA GPUs. Another advantage of cuDF is that it has a similar syntax to pandas.

Data Visualization Libraries

R’s ggplot2 sets the standard for data visualization packages and is among data science's most widely used libraries. However—for any R user looking to switch to Python for data visualization—there are a lot of worthy alternatives.

Matplotlib

Matplotlib is one of the first libraries people are introduced to when they start learning data science in Python. It is one of the rare libraries that balance complexity and flexibility perfectly. In other words, it is easy for beginners to learn to create great charts while also having all the tools experienced users need to create truly amazing custom plots.

matplotlib download stats

Matplotlib download stats from PyPI Stats website.

Seaborn

The drawback of Matplotlib is that plots require a great deal of customization. However, for easily styled plots, you can check out Seaborn. It is a wrapper API around Matplotlib, making it considerably easier for beginners to create highly aesthetic plots. Seaborn also introduces new plot types and sub-plotting tools that aren't readily available in Matplotlib.

Plotly & Dash

Python also allows you to create interactive data visualizations using a range of interactive data visualization libraries. Most notably is Plotly, which has deep roots in R as well. It is excellent for producing high-quality interactive charts and provides interfaces to customize and create complex plots. 

Moreover, Python’s Dash framework is built on top of Plotly and allows you to create beautiful web apps for hosting dashboards easily. Another Python library that lets you easily create interactive plots is Bokeh, which also allows you to deploy visualizations as web apps easily. 

Plotnine

For anyone looking for a natural ggplot2 alternative in Python, the plotnine package offers an implementation of the grammar of graphics in Python. It has extremely similar syntax and aesthetics as the ggplot2 package and provides a seamless experience for R users looking to start visualizing data in Python immediately. 

Math and Statistical Libraries

Native Python doesn’t come loaded with a host of statistical functions like R. However, its libraries more than make up for this shortcoming.

NumPy

The first one is the mighty NumPy, which carries many other essential Python packages on its shoulders. It is a superb array manipulation library with a rich selection of vectorized math functions. Its speed in matrix manipulation is perhaps only rivaled by Julia, one of the fastest languages in programming history.

NumPy's n-dimensional arrays are the backbone of other important computational libraries like TensorFlow or PyTorch. For this reason, NumPy's download stats are much higher than that of pandas and Matplotlib combined.

Numpy download stats

Numpy download stats from PyPI Stats website.

SciPy

If you can't find a function in NumPy, SciPy is the answer. It has separate sub-modules for various computational applications in math, physics, and statistics.

Its special functions module contains essential speed-optimized mathematical physics functions for researchers. You can solve optimization problems using its optimization module while its integrate and fft modules take care of calculus, and Fourier transforms. Its linalg module contains everything in NumPy's linalg module plus more advanced and niche linear algebra functions. This module has excellent support from BLAS/LAPACK (standard base software libraries for linear algebra), making it even faster than NumPy.

It can also let you process multi-dimensional images efficiently. While NumPy is great for 2-D/3-D images, it can't easily handle higher-order images from fields such as medicine and biology. This is where you can use SciPy's ndimage module.

Statsmodels

While the above packages revolve more around maths than statistics, Python also offers the statsmodels library. It is a vast library with functions and classes that allow you to estimate many statistical models, conduct hypothesis tests, and explore data.

There are entire functions for regression analysis and mature APIs for generalized linear models. Its Time Series Analysis module is especially handy, as it contains specialized functions to perform and visualize time series.

Its other modules can be used for survival and duration analysis, nonparametric methods, and multivariate statistics. And to R users' delight, most of these mentioned modules use R-like syntax in both writing functions and printing their output. For pure statistical analysis, statsmodels is the perfect combination of NumPy, SciPy, and Matplotlib.

Machine learning libraries

This is the area where R users might find the most value in picking up Python. Python’s machine learning libraries are much more holistic than R’s and provide many algorithms and tools for data pre-processing, feature engineering, hyper-parameter tuning, modeling, and more. 

Scikit-learn

The first such framework is scikit-learn. It is the most popular machine learning framework in Python and is used by most machine learning practitioners today. 

State of Data Science and ML Survey 2021

From Kaggle’s State of Data Science and Machine Learning Survey 2021

Scikit-learn has a vast selection of supervised learning algorithms for regression, binary, and multi-class classification. It has dedicated sub-modules for data preprocessing, pipelines, dimensionality reduction, feature engineering, feature selection, missing data imputation, and clustering.

Despite its massive number of classes and functionality, it has a highly intuitive and straightforward API. Scikit-learn does a remarkable job implementing all 20 Python code design patterns in the Zen of Python

XGBoost

If scikit-learn has a drawback, it would be its lack of GPU support. All its algorithms run on CPU, so naturally, people turn to other frameworks when they need more computation speed with GPU-enabled libraries. One such library that R users are already familiar with is XGBoost

With state-of-the-art GPU-powered Gradient Boosted Trees at its core, it can solve supervised learning tasks much faster than other machine learning frameworks with high-performance levels. The above image shows that it is quite popular among the Python community and dominates tabular competitions on Kaggle

LightGBM & Catboost

Two similar libraries are LightGBM and CatBoost. While LightGBM requires significantly fewer memory resources than XGBoost, CatBoost provides speed and accuracy advantages. All three gradient-boosted libraries can implement the Scikit-learn API, making them very easy to use.

Beyond the language wars: R and Python for the Modern Data Scientist

While R and Python enjoy different advantages, framing it as a “language war” is counter-productive for data scientists. In their recent appearance on the DataFramed podcast, authors of "Python and R for the Modern Data Scientist" Rick Scavetta and Boyan Angelov speak about the advantages of being a bilingual data scientist and how modern data teams would benefit from bilingual practitioners.

Throughout the podcast, they discuss how a key drawback of the language wars framing is the rise of monocultures thinking within the R and Python communities. One of the best counter-arguments to monoculture or the "us versus them" mentality is the "hammer and nail" analogy — if all you have is a hammer, everything looks like a nail. 

In other words, the language-first approach limits your creativity and ability to solve problems effectively. By taking a solution-first approach, you start to look at the solution of your problem in terms of concepts rather than sticking to what is rigidly offered in a single language.

Knowing both R and Python enables you to use the best of both worlds appropriately in contexts where one is superior to the other. For example, the authors state how R shines in data visualization with ggplot2 and how mature its reporting ecosystem is with the help of R Markdown and Shiny while highlighting how Python excels at machine learning, APIs, and MLOps.

The authors outline a brilliant case study in the book that shows how you can make both languages communicate with each other. Using tools like reticulate in RStudio, you can call Python scripts and run Python packages within R, allowing you to freely pass objects between the languages.

Becoming a bilingual data scientist has never been easier if you are an R user. For more on learning Python, you can check out our extensive collection of resources.

Introduction to Python

Beginner
4 hours
4,596,587
Master the basics of data analysis with Python in just four hours. This online course will introduce the Python interface and explore popular packages.
See DetailsRight Arrow
Start Course

Extreme Gradient Boosting with XGBoost

Beginner
4 hours
42,159
Learn the fundamentals of gradient boosting and build state-of-the-art machine learning models using XGBoost to solve classification and regression problems.

Introduction to Data Visualization with Matplotlib

Beginner
4 hours
126,581
Learn how to create, customize, and share data visualizations using Matplotlib.
See all coursesRight Arrow
Related

Predicting FIFA World Cup Qatar 2022 Winners

Learn to use Elo ratings to quantify national soccer team performance, and see how the model can be used to predict the winner of FIFA World Cup Qatar 2022.

Arne Warnke

The 23 Top Python Interview Questions & Answers

Essential Python interview questions with examples for job seekers, final-year students, and data professionals.
Abid Ali Awan's photo

Abid Ali Awan

22 min

Plotly Express Cheat Sheet

Plotly is one of the most widely used data visualization packages in Python. Learn more about it in this cheat sheet.
DataCamp Team's photo

DataCamp Team

0 min

Getting started with Python cheat sheet

Python is the most popular programming language in data science. Use this cheat sheet to jumpstart your Python learning journey.
DataCamp Team's photo

DataCamp Team

8 min

Python pandas tutorial: The ultimate guide for beginners

Are you ready to begin your pandas journey? Here’s a step-by-step guide on how to get started. [Updated November 2022]
Vidhi Chugh's photo

Vidhi Chugh

15 min

Python Iterators and Generators Tutorial

Explore the difference between Python Iterators and Generators and learn which are the best to use in various situations.
Kurtis Pykes 's photo

Kurtis Pykes

10 min

See MoreSee More