Leveraging the best of both Python and R
Data science has become an integral part of almost every industry. From banking to insurance to healthcare, enormous amounts of data are generated every second of the day. It has therefore become imperative to turn this data into actionable insights and to act on them. Many programming languages are available for this work, and among them Python and R appear to be leading the race.
Both Python and R are widely used in the data science world, and both offer a wide variety of tools and functions well suited to data science work. Whereas Python is a general-purpose language used for a variety of applications, R is a programming language and environment for statistical computing and graphics.
An overview of the Python and R ecosystems
Let’s have a look at the various aspects of these languages and what’s good and not so good about them.
Python Programming language
Python is an interpreted, high-level, general-purpose programming language. It was created by Guido van Rossum and first released in 1991. Since then, Python has become extremely popular in many fields, including data science, and today it is among the fastest-growing programming languages in the world.
Some of the reasons for its vast popularity are:
- Object-oriented language
- General Purpose
- Has incredible community support
- Simple and easy to understand and learn
- Has efficient packages like pandas, NumPy and scikit-learn, which make it an excellent choice for machine learning.
However, when it comes to statistical computing, Python lags behind and has fewer specialized packages than its counterpart R.
R Programming Language
R is a language and software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing. It first appeared in August 1993, was released as free software in 1995, and reached version 1.0.0 in 2000. Since then it has been widely used by statisticians and data miners for statistical computing.
Some of the features that make R stand out among other languages are:
- Consists of packages for almost any statistical application one can think of. CRAN currently hosts more than 10k packages.
- Comes equipped with excellent visualization libraries like ggplot2.
- Capable of standalone analyses with built-in packages.
But there is a downside. Performance-wise, R is not the fastest language, and it can be memory-hungry when dealing with large datasets.
There is an excellent infographic on DataCamp that shows how these two programming languages relate to each other. It explores the strengths of R over Python and vice versa, and aims to provide a basic comparison between the two from a data science and statistics perspective. Though the infographic was released in 2015, most of its points remain relevant today.
Here is the link to the infographic.
Using Python & R together
R and Python are excellent tools in their own right, but more often than not they are perceived as rivals. Instead of looking at them this way, we should try to use the strengths of both languages so that we get the best of both worlds.
Most people in the data science community today work with only a single language. A small percentage use both Python and R. Many others are committed to one language but wish they had access to some of the capabilities of the other. For instance, R users sometimes yearn for the object-oriented capabilities that are native to Python, and some Python users long for the full range of statistical distributions available in R.
The figure below shows the results of a survey conducted by RedMonk in the third quarter of 2018. These results are based on the popularity of the languages on Stack Overflow as well as on GitHub, and they clearly show that both R and Python are rated quite high when it comes to data science activities. There is therefore no inherent reason why we cannot work with both of them on the same project. Our ultimate goal should be to do better analytics and derive better insights, and the choice of a programming language should not be a hindrance to that.
How to use both Python & R in a single project?
When it comes to embedding SQL within an R or Python script, we don't bat an eyelid. So why not use the statistical prowess of R along with the programming capabilities of Python in the same way? It can be done, and there are libraries that handle these transitions very well.
There are basically two approaches by which we can use both Python and R side by side in a single project.
R within Python
This means calling R functions within a Python script. Some of the libraries created for this purpose are:
PypeR provides a simple way to access R from Python through pipes. PypeR is also available on the Python Package Index (PyPI), which makes installation convenient. PypeR is especially useful when there is no need for frequent interactive data transfers between Python and R. By running R through a pipe, the Python program gains flexibility in subprocess control, memory control, and portability across popular operating systems, including Windows, GNU/Linux, and macOS.
Conventions for the conversion of Python objects to R objects
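The pipe idea described for PypeR can be illustrated with nothing more than Python's standard library. The sketch below is not PypeR's API. It is a minimal illustration, assuming an `Rscript` executable is available on the PATH, of starting R as a child process and reading back what it prints. The helper names `build_rscript_command` and `run_r` are hypothetical, and this version starts a fresh R process for every call, so no state is kept between calls.

```python
import shutil
import subprocess


def build_rscript_command(r_code, executable="Rscript"):
    """Return the argument list that runs a snippet of R code non-interactively."""
    # `Rscript -e <expr>` evaluates the expression and then exits.
    return [executable, "-e", r_code]


def run_r(r_code, executable="Rscript"):
    """Run R code in a child process and return its standard output as text.

    Returns None when the executable cannot be found on the PATH
    (an assumption of this sketch: R is installed separately).
    """
    if shutil.which(executable) is None:
        return None
    completed = subprocess.run(
        build_rscript_command(r_code, executable),
        capture_output=True,  # collect stdout/stderr through pipes
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError if R exits with an error
    )
    return completed.stdout


if __name__ == "__main__":
    # Ask R for the mean of a small vector and print whatever R writes back.
    output = run_r("cat(mean(c(1, 2, 3, 4)))")
    print(output if output is not None else "Rscript not found on PATH")
```

Passing the command as a list rather than a shell string avoids quoting problems when the R code itself contains quotes or parentheses.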
pyRserve is another library created for this purpose. It uses Rserve as an RPC connection gateway. Through such a connection, variables can be set in R from Python, and R functions can be called remotely. R objects are exposed as instances of Python-implemented classes, with R functions as bound methods of those objects in a number of cases.
The rpy2 library is used more often than the previous two, largely because it is being actively developed. It runs embedded R in a Python process. It creates a framework that can translate Python objects into R objects, pass them into R functions, and convert R output back into Python objects.
One advantage of using R within Python is that we are able to use R’s excellent packages such as ggplot2, tidyr and dplyr easily from Python. As an example, let’s see how we can use ggplot2 for plotting from Python.
- Basic Plot
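A basic plot along these lines might look like the following sketch. It assumes that rpy2, an R installation, and the ggplot2 R package are all present, and it uses R's built-in `mtcars` dataset so that no data has to cross the language boundary. The helper names `build_ggplot_code` and `render_with_rpy2` and the output file name are hypothetical. The only rpy2 call used is `rpy2.robjects.r`, which evaluates a string of R code in the embedded R session.

```python
def build_ggplot_code(data_name, x, y, out_file):
    """Return R source that draws a scatter plot with ggplot2 and saves it."""
    return (
        "library(ggplot2)\n"
        f"p <- ggplot({data_name}, aes(x = {x}, y = {y})) + geom_point()\n"
        f'ggsave("{out_file}", plot = p)\n'
    )


def render_with_rpy2(r_code):
    """Evaluate R code in the embedded R session.

    Returns False when rpy2 cannot be loaded (assumption of this sketch:
    the failure may be an ImportError or an error raised while locating R).
    """
    try:
        import rpy2.robjects as ro  # needs both rpy2 and an R installation
    except Exception:
        return False
    ro.r(r_code)  # run the R source inside the Python process
    return True


if __name__ == "__main__":
    code = build_ggplot_code("mtcars", "wt", "mpg", "basic_plot.png")
    print(code)
    # With rpy2, R and ggplot2 installed, this call would write basic_plot.png:
    # render_with_rpy2(code)
```

Keeping the R source in a plain string makes it easy to test the same snippet directly in an R console before running it through rpy2.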
For an in-depth review and detailed knowledge about the installation and working of rpy2, you may want to have a look at the following resources:
- rpy2’s Official Documentation
- RPy2: Combining the Power of R + Python for Data Science
- Accessing R from Python using RPy2
Python within R
It is also possible to run Python scripts in R by using one of the alternatives below:
rJython implements an interface to Python via Jython. It is intended to let other packages embed Python code alongside R.
rPython is another package that allows R to call Python. It makes it possible to run Python code, make function calls, and assign and retrieve variables from R.
SnakeCharmR is a modern overhauled version of rPython. It is a fork from ‘rPython’ which uses ‘jsonlite’ and has a lot of improvements over rPython.
PythonInR makes accessing Python from within R very easy by providing functions to interact with Python.
The reticulate package provides a comprehensive set of tools for interoperability between Python and R. Of all the above alternatives, this one is the most widely used, largely because it is being actively developed by RStudio.
Reticulate embeds a Python session within the R session, enabling seamless, high-performance interoperability. The package enables you to reticulate Python code into R, creating a new breed of a project that weaves together the two languages.
The reticulate package provides the following facilities:
- Calling Python from R in a variety of ways including R Markdown, sourcing Python scripts, importing Python modules, and using Python interactively within an R session.
- Translation between R and Python objects (for example, between R and Pandas data frames, or between R matrices and NumPy arrays).
- Flexible binding to different versions of Python including virtual environments and Conda environments.
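To make the "sourcing Python scripts" facility concrete, here is a sketch of the Python side only. The file name `stats_helpers.py` and the function `summarize` are hypothetical. The assumption is that, with reticulate installed, an R session could load this file with `source_python("stats_helpers.py")` and then call `summarize()` like an ordinary R function, with reticulate converting the arguments and the returned dictionary.

```python
# stats_helpers.py (hypothetical file name)
#
# Assumed usage from R, not executed here:
#   library(reticulate)
#   source_python("stats_helpers.py")
#   summarize(c(1, 2, 3, 4))
from statistics import mean, stdev


def summarize(values):
    """Return the count, mean and sample standard deviation of a numeric sequence."""
    values = [float(v) for v in values]
    return {
        "n": len(values),
        "mean": mean(values),
        # The sample standard deviation needs at least two observations.
        "sd": stdev(values) if len(values) > 1 else 0.0,
    }


if __name__ == "__main__":
    print(summarize([1, 2, 3, 4]))
```

Because the function uses only the standard library and plain Python types, it can be tested from Python alone before it is ever called from R.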
Some great resources on using the reticulate package are:
- The Documentation is pretty robust and has a lot of examples and use cases to help you get started.
- SNAKES IN A PACKAGE: COMBINING PYTHON AND R WITH RETICULATE
Both R and Python are excellent tools, and each is almost sufficient on its own to carry out data science tasks from scratch. One might not even need to use both of them in a single project. However, knowledge of both can come in handy, especially because it gives us the option of working in a different environment. As the saying goes, the focus should be on the skills and not on the tools. We should therefore be open to learning new tools and languages if they help us solve the problem at hand with ease.
- Interfacing R and Python — Andrew Collier
- From ‘R vs Python’ to ‘R and Python’