
Top Python Packages For R Users — Become a Bilingual Data Scientist

Take the first step towards becoming a bilingual data scientist by learning about the best libraries the Python language has to offer for die-hard R lovers.
Jul 2022  · 11 min read

The "R vs. Python" language war has a long history. As the de facto programming languages for data science, each attracts loyal supporters from many domains. As a result, many practitioners master only one tool and miss out on what the other offers. However, learning to use both R and Python as a bilingual data scientist can help you solve a wider range of problems over the course of your career.

As Python grows in popularity, more and more R users are switching to Python. So, if you are an R user, consider this article as an overview of Python libraries that you can integrate into your daily workflow. Some have R-like syntax for a smooth switch, while others bring more functionality and speed. 

Top Python packages for R Users

Data Manipulation Libraries

R has a rich ecosystem of data manipulation libraries. Be it dplyr, tidyr, or data.table, R users enjoy a wide variety of tools at their disposal. However, they might consider switching to some Python alternatives for more flexibility, speed, and features.

Pandas

Chief among the Python data manipulation packages is pandas, the premier data manipulation library in the Python data science stack, used by millions worldwide. At the time of writing, it has over 20 million weekly downloads, making it one of the most popular Python packages.

pandas download stats from PyPI Stats website.

It offers such an extensive suite of functions and classes for working with data that you can still find yourself learning new pandas techniques after years of experience. pandas is also a keystone library in the ecosystem: many of the other Python libraries mentioned in this article are designed to work with pandas' classes.

Even though it is such an extensive library, it is simple to learn and master. Knowing a few classes and functions allows you to perform complex analyses on any dataset.
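As a rough illustration of how a dplyr-style chain might translate, the sketch below filters, groups, and summarizes a small DataFrame. The data and column names are made up for this example.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration.
df = pd.DataFrame(
    {
        "species": ["a", "a", "b", "b", "b"],
        "mass": [1.0, 3.0, 2.0, 4.0, 6.0],
    }
)

# Roughly equivalent to the dplyr chain:
#   df %>% filter(mass > 1) %>% group_by(species) %>%
#     summarise(mean_mass = mean(mass), n = n())
summary = (
    df[df["mass"] > 1]                      # filter rows
    .groupby("species", as_index=False)     # group_by
    .agg(mean_mass=("mass", "mean"), n=("mass", "size"))  # summarise
)
print(summary)
```

Method chaining inside parentheses plays a role similar to the pipe operator in R, which tends to make the transition easier for dplyr users.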

Python’s datatable package

If pandas’ syntax looks too unfamiliar to R users, they might feel right at home using the Python datatable package. It is inspired by its R counterpart and written solely to deal with today's massive datasets, and it can read and manipulate gigabyte-sized files in mere seconds.

An everyday use case is to read a large dataset with datatable and convert it to the pandas DataFrame format, which is much faster than reading it purely with pandas. But, as an R user, you don't even have to do that as datatable has almost the same syntax as the data.table package of R.

RapidsAI & cuDF

R users looking to switch to Python can also benefit from improved performance due to widespread GPU support for Python libraries. RapidsAI offers just that opportunity via the cuDF library. cuDF is a dataframe library that lets you manipulate datasets with billions of rows by tapping into the computing power of NVIDIA GPUs. Another advantage of cuDF is that it has a similar syntax to pandas.

Data Visualization Libraries

R’s ggplot2 sets the standard for data visualization packages and is among the most widely used libraries in data science. However, for any R user looking to switch to Python for data visualization, there are a lot of worthy alternatives.

Matplotlib

Matplotlib is one of the first libraries people are introduced to when they start learning data science in Python. It is one of the rare libraries that balance ease of use and flexibility well. In other words, beginners can quickly learn to create good-looking charts, while experienced users have all the tools they need to build highly customized plots.

Matplotlib download stats from PyPI Stats website.
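To give a feel for the basic workflow, here is a minimal sketch that draws two lines on one set of axes and customizes labels and the legend. The non-interactive Agg backend and the in-memory buffer are choices made for this example so it runs without a display; in a notebook you would normally just show the figure.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, an assumption so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# One Figure holding one Axes; most customization happens through the Axes object.
fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), linestyle="--", label="cos(x)")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.set_title("Two lines on one Axes")
ax.legend()

# Render the figure as PNG bytes in memory instead of writing a file.
buf = io.BytesIO()
fig.savefig(buf, format="png", dpi=100)
```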

Seaborn

The drawback of Matplotlib is that plots require a great deal of customization. However, for easily styled plots, you can check out Seaborn. It is a wrapper API around Matplotlib, making it considerably easier for beginners to create highly aesthetic plots. Seaborn also introduces new plot types and sub-plotting tools that aren't readily available in Matplotlib.

Plotly & Dash

Python also offers a range of libraries for creating interactive data visualizations. Most notable is Plotly, which has deep roots in R as well. It is excellent for producing high-quality interactive charts and provides interfaces to customize and create complex plots.

Moreover, Python’s Dash framework is built on top of Plotly and lets you create polished web apps for hosting dashboards. Another Python library for interactive plots is Bokeh, which also allows you to deploy visualizations as web apps.

Plotnine

For anyone looking for a natural ggplot2 alternative in Python, the plotnine package offers an implementation of the grammar of graphics. Its syntax and aesthetics are very similar to those of ggplot2, so R users can start visualizing data in Python almost immediately.

Math and Statistical Libraries

Native Python doesn’t come loaded with a host of statistical functions like R. However, its libraries more than make up for this shortcoming.

NumPy

The first one is the mighty NumPy, which carries many other essential Python packages on its shoulders. It is a superb array manipulation library with a rich selection of vectorized math functions. Because its core routines are implemented in compiled code, its array operations run far faster than equivalent pure-Python loops.

NumPy's n-dimensional arrays underpin much of the scientific Python stack, and important computational libraries like TensorFlow and PyTorch interoperate closely with them. For this reason, NumPy's download numbers are much higher than those of pandas and Matplotlib combined.

Numpy download stats from PyPI Stats website.
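For R users, the closest mental model is that NumPy arrays behave much like R vectors and matrices. The sketch below, with made-up numbers, shows element-wise arithmetic and a couple of matrix operations alongside their approximate R counterparts.

```python
import numpy as np

# Like R vectors, NumPy arrays apply arithmetic element-wise without explicit loops.
heights_m = np.array([1.60, 1.75, 1.82])
weights_kg = np.array([55.0, 70.0, 90.0])
bmi = weights_kg / heights_m**2

# Matrix product with @ (similar to %*% in R).
m = np.array([[1.0, 2.0], [3.0, 4.0]])
product = m @ m

# Column-wise mean via axis=0 (similar to colMeans() in R).
col_means = m.mean(axis=0)

print(bmi, product, col_means, sep="\n")
```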

SciPy

If you can't find a function in NumPy, SciPy is the answer. It has separate sub-modules for various computational applications in math, physics, and statistics.

Its special module contains speed-optimized functions of mathematical physics for researchers. You can solve optimization problems with its optimize module, while its integrate and fft modules take care of integration and Fourier transforms. Its linalg module contains everything in NumPy's linalg module plus more advanced and niche linear algebra functions. It is always built against BLAS/LAPACK (the standard low-level libraries for linear algebra), so it can be faster than NumPy's equivalent, depending on how NumPy was installed.

It also lets you process multi-dimensional images efficiently. NumPy can store images as arrays of any dimension, but it doesn't provide filtering, interpolation, or morphology routines for them, which matter for higher-order images from fields such as medicine and biology. This is where SciPy's ndimage module comes in.
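As a quick tour of the sub-modules mentioned above, the following sketch minimizes a function, computes a definite integral, and solves a linear system. The functions and matrices are arbitrary examples chosen because their answers are easy to check by hand; R users may recognize the parallels with optimize(), integrate(), and solve().

```python
import numpy as np
from scipy import integrate, linalg, optimize

# Minimize f(x) = (x - 2)^2 + 1; the minimum is at x = 2.
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2 + 1.0)

# Integrate sin(x) from 0 to pi; the exact value is 2.
# quad returns the estimate and an estimate of the absolute error.
area, abs_err = integrate.quad(np.sin, 0.0, np.pi)

# Solve the linear system A @ x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)

print(result.x, area, x)
```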

Statsmodels

While the above packages revolve more around maths than statistics, Python also offers the statsmodels library. It is a vast library with functions and classes that allow you to estimate many statistical models, conduct hypothesis tests, and explore data.

There are dedicated functions for regression analysis and mature APIs for generalized linear models. Its time series analysis module is especially handy, as it contains specialized functions to model and visualize time series.

Its other modules can be used for survival and duration analysis, nonparametric methods, and multivariate statistics. And to R users' delight, most of these modules accept R-like formulas for specifying models and print R-like summaries of the results. For pure statistical analysis, statsmodels is the perfect combination of NumPy, SciPy, and Matplotlib.
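The R-like syntax is easiest to see in the formula interface. The sketch below fits an ordinary least squares model on simulated data (the true intercept of 1 and slope of 2 are assumptions of this example), using a formula string that mirrors lm(y ~ x, data = df) in R.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data for illustration: y = 1 + 2x + noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({"x": x, "y": 1.0 + 2.0 * x + rng.normal(scale=0.5, size=200)})

# The formula string mirrors R's lm(y ~ x, data = df).
model = smf.ols("y ~ x", data=df).fit()

# summary() prints a coefficient table similar in spirit to R's summary(lm(...)).
print(model.summary())
print(model.params)
```

The fitted coefficients are available by name through model.params, with the intercept labeled "Intercept".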

Machine Learning Libraries

This is the area where R users might find the most value in picking up Python. Python’s machine learning libraries are much more holistic than R’s and provide many algorithms and tools for data pre-processing, feature engineering, hyper-parameter tuning, modeling, and more. 

Scikit-learn

The first such framework is scikit-learn. It is the most popular machine learning framework in Python and is used by most machine learning practitioners today. 

From Kaggle’s State of Data Science and Machine Learning Survey 2021

Scikit-learn has a vast selection of supervised learning algorithms for regression, binary, and multi-class classification. It has dedicated sub-modules for data preprocessing, pipelines, dimensionality reduction, feature engineering, feature selection, missing data imputation, and clustering.

Despite its massive number of classes and functions, it has a highly intuitive and straightforward API. Scikit-learn does a remarkable job of following the guiding principles laid out in the Zen of Python.
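To illustrate how consistent the API is, here is a minimal sketch that chains preprocessing and a classifier into a single pipeline and evaluates it on a held-out split. The built-in iris dataset and the choice of logistic regression are assumptions made for this example; any other estimator follows the same fit/predict/score pattern.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small built-in dataset and hold out 25% of it for evaluation.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scaling and modeling are combined so the scaler is fitted on training data only.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Mean accuracy on the held-out data.
accuracy = clf.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```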

XGBoost

If scikit-learn has a drawback, it would be its lack of GPU support. All its algorithms run on the CPU, so people naturally turn to GPU-enabled libraries when they need more speed. One such library that R users are already familiar with is XGBoost.

With GPU-accelerated gradient-boosted trees at its core, it can train models for supervised learning tasks much faster than CPU-only frameworks while maintaining high predictive performance. The Kaggle survey above shows that it is quite popular among the Python community, and it dominates tabular competitions on Kaggle.

LightGBM & CatBoost

Two similar libraries are LightGBM and CatBoost. LightGBM typically requires significantly less memory than XGBoost, while CatBoost handles categorical features natively and can offer speed and accuracy advantages on some datasets. All three gradient boosting libraries offer scikit-learn-compatible APIs, making them very easy to use.

Beyond the language wars: R and Python for the Modern Data Scientist

While R and Python enjoy different advantages, framing it as a “language war” is counter-productive for data scientists. In their recent appearance on the DataFramed podcast, authors of "Python and R for the Modern Data Scientist" Rick Scavetta and Boyan Angelov speak about the advantages of being a bilingual data scientist and how modern data teams would benefit from bilingual practitioners.

Throughout the podcast, they discuss how a key drawback of the language wars framing is the rise of monoculture thinking within the R and Python communities. One of the best counter-arguments to monoculture or the "us versus them" mentality is the "hammer and nail" analogy: if all you have is a hammer, everything looks like a nail.

In other words, the language-first approach limits your creativity and ability to solve problems effectively. By taking a solution-first approach, you start to look at the solution of your problem in terms of concepts rather than sticking to what is rigidly offered in a single language.

Knowing both R and Python enables you to use the best of both worlds appropriately in contexts where one is superior to the other. For example, the authors state how R shines in data visualization with ggplot2 and how mature its reporting ecosystem is with the help of R Markdown and Shiny while highlighting how Python excels at machine learning, APIs, and MLOps.

The authors outline a brilliant case study in the book that shows how you can make both languages communicate with each other. Using tools like reticulate in RStudio, you can call Python scripts and run Python packages within R, allowing you to freely pass objects between the languages.

Becoming a bilingual data scientist has never been easier if you are an R user. For more on learning Python, you can check out our extensive collection of resources.
