Beyond the Language Wars: R & Python for the Modern Data Scientist
In this episode of DataFramed, we speak with Rick Scavetta and Boyan Angelov about their new book, Python and R for the Modern Data Scientist: The Best of Both Worlds, and how it signals the dawn of a new bilingual data science community.
Rick Scavetta is a biologist, workshop trainer, freelance data scientist and co-founder of Scavetta Academy, a company dedicated to helping scientists better understand and visualize their data. Rick's practical, hands-on exposure to a wide variety of datasets has informed him of the many problems scientists face when trying to visualize their data.
Boyan Angelov has a decade of experience in academic and industry environments. He has delivered data science, engineering, and strategy work in the fields of bioinformatics, clinical trials, HRTech, and management consulting.
Adel is a Data Science educator, speaker, and Evangelist at DataCamp where he has released various courses and live training on data analysis, machine learning, and data engineering. He is passionate about spreading data skills and data literacy throughout organizations and the intersection of technology and society. He has an MSc in Data Science and Business Analytics. In his free time, you can find him hanging out with his cat Louis.
Adel Nehme: Hello, this is Adel Nehme from DataCamp, and welcome to DataFramed, a podcast covering all things data and its impact on organizations across the world. One of the most frequently asked questions I receive from folks trying to break into data science is which language to learn first, R or Python? This debate has gone through ebbs and flows for as long as I can remember, and almost every data scientist knows it as the language wars.
Adel Nehme: However, do we really need this us versus them paradigm? That's why I'm excited to have Rick Scavetta and Boyan Angelov on today's podcast. Rick Scavetta and Boyan Angelov are the authors of Python and R for the Modern Data Scientist: The Best of Both Worlds. Rick is a prolific data science educator and founder of Scavetta Academy and is primarily an R user. Boyan is a data strategist and is the author of Elements of Data Strategy and is primarily a Python user.
Adel Nehme: In their new book, they detail the histories of R and Python, what led to the so-called language wars, and chart a variety of use cases of language interoperability that turn the us versus them paradigm on its head. Throughout the episode, we discuss the history of Python and R, what led them to write the book, how Python and R can be interoperable, the advantages of each language, when and where to use each, how beginner data scientists should think about learning programming languages, how experienced data scientists can get to the next level by learning a language they're not necessarily comfortable with, and more.
Rick Scavetta: Well, good morning, Adel. Really nice to have us on the podcast. Thanks a lot for inviting us.
Adel Nehme: I'm excited to discuss your new book, Python and R for the Modern Data Scientist: The Best of Both Worlds and how it signals the dawn of a bilingual data science community. Before we dive into the details, I want to first start off by understanding your motivations for writing the book.
Rick Scavetta: Well, the motivations for the book, I think are kind of a couple of different reasons that we have, and I think Boyan and I had different reasons for why we wanted to write this book. Maybe Boyan, you can start us off.
Boyan Angelov: Yeah, yeah. Thanks for having us. So it's great to be here. I mean, for me personally, my biggest motivation behind this book would be the impression I had about the so-called language wars. I'm sure we'll get a bit deeper into that a bit later. But I mean, it helped me and Rick to say the language war is over, and me personally, I always thought that... I always find it confusing that people get so focused on one tool versus the other while in reality you should always use the best tool for the job.
Boyan Angelov: From my experience working with clients, and I've also worked in very different industries, I have seen that things are moving in a more tool-agnostic direction, where you want to use the best tool for the job and it's not as important what you're actually using. This really motivated me and Rick to write this book.
Rick Scavetta: Yeah. So I mean, the book is really about getting the R users to understand Python and start using Python and the Pythonistas to start appreciating R users and to look at the tools available in R. Like what I mentioned, it's really not just about using the best tool for the job, but really appreciating the resources that are available to us. For me, it's really frustrating when scientists or any kind of technical professionals limit themselves to one way of working or one tool of working where it says, "Okay, everything needs to be in this specific language." Just like data scientists are this large diverse group of people, the tools we have are also diverse, and why shouldn't we take advantage of the wonderful tools and resources that we have available to us?
Rick Scavetta: So that was part of it. I think another reason is that, going back to the whole language wars issue that Boyan mentioned, I was kind of over it, and it kind of bugs me to hear people talk about this us versus them mentality in whatever context it is, and I don't think that helps to build any kind of community. So I'm happy to see that people are kind of getting over this us versus them, Pythonista versus useR, which is better, and this kind of thumbing the nose at one another attitude, which I always found kind of disgusting.
Rick Scavetta: It's also a little bit about getting over that, but also having empathy for the other group. Right? So to understand why do Pythonistas do things in the way that they do, what are they thinking about? What context does it work in? What are the advantages, disadvantages, and then also vice versa, right? So can we really get into the mind of a Pythonista and the mind of a useR and understand kind of where they're coming from so we can then understand how to use the tools better? So there's a little bit about empathy mixed in there as well.
R and Python for Data Science
Adel Nehme: That's awesome, especially how the book acts as a bridge builder between both communities. Can you describe the set of events that led to R and Python becoming the primary data science tools of today and how this translates into what we call the language war between R and Python?
Rick Scavetta: That's something that we talk about right at the very beginning of the book, and it's kind of a little bit unusual that we would begin a book on R and Python by giving this whole history of both of the languages. But I wanted to start the book in that way because I thought that it's important to help people understand the current context and how we got to where we are.
Rick Scavetta: So one of the first questions people ask when they're trying to decide, should I learn R or should I learn Python is, what's the difference between R and Python? You'll see a lot of posts on Stack Overflow or different message boards, Reddit, where people talk about, what are the basic differences between these languages, and why should I use one versus the other? Nobody really kind of comes to a clear consensus or a clear understanding of what the difference is.
Rick Scavetta: So what I tried to do in the first chapter is just outline the history of these languages to give an idea about the different ethos and how things work differently between the two languages. So R was there kind of at the very beginning of scientific computing in academia in the late 70s, early 80s in Bell Laboratories. It was really developed as a programming language for doing statistical analysis. In the book I call it a FUBU language. So FUBU is a street wear clothing company from New York that I used to love when I was a teenager, and it stands for, For Us, By Us, and I liked this For Us, By Us attitude. It's very much for the community by the community kind of ethos.
Rick Scavetta: That's very much what R is, right? It's for statisticians, biostatisticians, and it's just meant to just get statistics done, get data analysis, work with your data, and just get it done. It's a programming language in its own right. But it's first and foremost used for doing data analysis. That really shaped all of what came afterwards in R. Python kind of comes at it from a different direction. Python originated as a generalist programming language to make just entry to programming easier, with a nice syntax and easier access to managing all kinds of different tools and system administration and building applications and web development.
Rick Scavetta: So it had its fingers everywhere as a generalist programming language, and then came data science later on. So Python wasn't data science first and then programming second. It was kind of the other way around. That also kind of affected the ethos of how data science got done in them. So when you see that Python is much more popular, part of that is due to the fact that it began life really as a popular generalist programming language. So it wanted to be everywhere, and it is everywhere. It's not always done in data science context, but it is widespread, and that's why it's a little bit easier to integrate into companies using Python than using R.
Adel Nehme: That's great. The early part of the book really acts as a lay of the land and tries to introduce the R and Python universe for users of the opposing language. I think you both do an excellent job at talking about some of the benefits of each language without necessarily putting the other down. As authors here, can you walk me through the challenge of drawing a fair comparison between both languages that doesn't really extend into promoting a monoculture or taking sides?
Boyan Angelov: Yeah. It wasn't easy, I have to say, to compare those two things, I mean R and Python, because as Rick suggested, I mean, they have such a different origin. I mean, you can really feel it to this day. From the beginning, our initial idea was, yeah, can we make a complete one-to-one scientific way to compare them? Very quickly, we realized, I mean, that's not very fair. I mean, we do go into that direction quite a bit in the book to show how exactly the same thing is done in different ways in both languages, but we tried to be more focused on, what do you want to achieve? It's our deep belief that you do need to first think about the problem and then about the tool and only then you can choose the best one for the job.
Boyan Angelov: This is why, throughout the book, we really try to focus on use cases, different workflows, different data formats and only then decide, okay, maybe here, R does things in a bit of a different way than Python, and what exactly those differences are. So our solution was really to go to look for the practical, for the use cases.
Rick Scavetta: Yeah. I think that's a really good point that we didn't want to try and make a one-to-one translation between R and Python. So we have an appendix where we do talk about, okay, this is a list in R, and that's a similar structure in Python and vice versa. But the point of the book was not to just help people map their knowledge of one language on to another language, but to really appreciate it and really be bilingual. Anybody that's learned a new language, a spoken language as an adult sees that this is kind of difficult. You try to speak a new language using the grammar and even vocabulary of your original language, right?
Rick Scavetta: We have these false friends between many languages that sound very similar to words in your original language, but they're different or the grammar. You want to use your original grammar in a different context, and it just doesn't work. Right? So this one-to-one translation is not really the point. The way to kind of work between them was to really think about being bilingual and to kind of think about not putting stuff into a box, but really thinking about, "Okay, how does each language deal with it, and what's the best scenario, what's the best case for the specific tool that I need to work with here?" Instead of trying to solve every problem with the one language that I have, we can expand our resources and think about, "Okay, well, would this be better? Are there better resources in this language to deal with this specific problem that I'm dealing with?"
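Editor's note: the "false friends" idea can be made concrete with a few lines of Python. This is a minimal sketch for illustration only. The pairings with R in the comments are assumptions chosen for this note, not the mapping from the book's appendix, and the variable names are made up.

```python
# Illustrative only: the R/Python pairings below are assumptions for this
# sketch, not the mapping given in the book's appendix.

# An R named list such as list(name = "iris", rows = 150) holds mixed types
# under names. A Python list looks like the obvious translation, but a dict
# is the closer match.
dataset = {"name": "iris", "rows": 150}

species = ["setosa", "versicolor", "virginica"]

# False friend 1: indexing. R counts from 1, Python counts from 0.
first = species[0]            # in R: species[1]

# False friend 2: ranges. R's 1:2 includes both ends; Python's 0:2 stops
# before index 2.
first_two = species[0:2]      # in R: species[1:2]

# False friend 3: negative indices. In R, species[-1] drops the first
# element; in Python, species[-1] returns the last element.
last = species[-1]
all_but_first = species[1:]   # in R: species[-1]

print(dataset["rows"], first, first_two, last, all_but_first)
```

The same characters, `species[-1]`, are valid in both languages and mean different things, which is why mapping syntax one-to-one is less helpful than understanding how each language approaches the task.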
Rick Scavetta: So one of Boyan's favorite analogies is the hammer and the nail. If all you have is a hammer, everything looks like a nail. I'm not a huge fan of this analogy, but it does kind of work. I will admit it to Boyan. It does work in this context. It's more like if you only have the one tool available to you, then you're going to just be limiting yourself. I kind of think of it more as finite versus infinite thinking, right? If you only have a small set of tools, you're not going to think about things in creative and new ways, right?
Rick Scavetta: So it's not just that you try to treat every problem with the tools that you have, but it's that you're limiting your creativity. You're limiting the way that you think about things. You're limiting the way that you approach things, right? That's the whole problem with monoculture or low diversity environments where we just limit our creativity. So part of that is changing the way people think about how the two languages relate to each other.
Boyan Angelov: Maybe I can take this analogy a bit further. What we often also talk about is when you're bilingual, in real human languages, I heard the expression somewhere that you do not think in words. You think in concepts. It's kind of true. I mean, because you're not limited by the sounds and the meaning of just one language, and it makes you a bit more flexible in how you approach things. One example that I often give is, for example, if I'm working with a very, very junior data scientist, and they'll look at me searching for things on Stack Overflow.
Boyan Angelov: I mean, obviously, everybody searches for things on Stack Overflow. It doesn't matter what your level is. But then they wonder why I'm so fast at those things, and I would explain it's really this idea that I think in concepts. I don't know exactly one-to-one what I'm looking for, but because I've seen it in a different context, I expect this to be present in the documentation. When I see it, I immediately get it. So I think learning how to use R and Python, both of them, being bilingual, allows you to be very, very creative in solving problems. This is my impression from my work as well.
Being Bilingual as a Data Scientist
Adel Nehme: I want to expand here on the benefits of being bilingual as a data scientist. There are so many details that we can cover. But if you had to summarize the top five features or components of the R universe that Pythonistas should know, what are they?
Rick Scavetta: I think R is quite distinct from Python in a number of areas that really do make it stand out. One I think is because it's a FUBU language, because it's really meant for just doing your work, it's much easier to get off the ground running with R. That's not just because there's one very dominant IDE, which many people are using, which is very well supported, but it's also because just the language itself, even if you're not using specific packages, even just using base R, it's easy to get off the ground running.
Rick Scavetta: I can show a complete newbie that's never seen any data analysis written out in a programming language before. I can show them R code, and they can tell me what it's doing. Right? Which is pretty impressive. So I think it's easy to get off the ground running, which is why you see it used in academia a lot. Academics, they don't have the time to learn a whole new programming language. They got enough stuff to do. They've got so much research and stuff to keep on top of. They just want to do their analysis and get on with their lives. So R really facilitates that in a nice way.
Rick Scavetta: I think another thing that is really a massive advantage in R is the whole Markdown ecosystem. There's many packages associated with that. So there's not just R Markdown, but blogdown, bookdown, and many other packages that help with reporting. It sounds like a minor thing, but actually it's pretty massive. So Jupyter notebooks, ever since I discovered them, I have felt that they were frustrating because it's not saved in a flat text format. You couldn't just go in there and edit the flat text file. You had to edit it in an editor that allowed you to work with those specific files, and that for me was always very frustrating.
Rick Scavetta: So just to have this whole ecosystem of reporting and Markdown really well integrated and also with Python inside there is a really nice system. I think another part that kind of goes back again to the whole ethos in R is the community. Many people I realize don't appreciate this if you've never really been in the R community. People talk about it a lot, but it's something that you see at R conferences, at the useR conference. The international R conference just took place a couple weeks ago. There was a woman from Ghana, and she was speaking about diversity and inclusion, and there was a whole panel discussion about diversity and inclusion, which first off is pretty impressive in a programming language tech conference that there's a whole panel discussion only about diversity and inclusion and accessibility in that language, which was massively impressive.
Rick Scavetta: We had people with vision impairment and talking about different racial aspects to accessibility and things like this. So it was a really interesting discussion, and it was very impressive that that was in the useR conference. This woman from Accra, whose name I can't remember at first, she mentioned that she's been in the business for decades, and there is no community like R that is as diverse and as inclusive and as concerned about accessibility as the useR community is, and it really stands out.
Rick Scavetta: Unless you really experience that, you don't really know what that is. Right? I started programming in high school with basic programming languages that you learn at that stage, and then for my work, I started to learn Perl, and then we look at Python. You look at the way people communicate and the way people interact with each other and the useR community is really quite different. I think that's really a massive strength.
Rick Scavetta: Another huge strength is data visualization. So Python is kind of catching up to this, but data visualization in Python was for a long time very frustrating, and ggplot2 really kind of came and dominated R, and that's a very, very good foundation, right? So Paul Murrell just gave a really interesting talk at the useR conference about the grid package and different tools and developments happening inside there. So data visualization as a core functionality has been there since the beginning and is very well supported by the core team at R, and ggplot2 is just a wonderful package, with a massive amount of extensions at this point, which makes it incredibly flexible. Of course, I have my courses on DataCamp on ggplot2. So a plug for me and DataCamp. You can check out my courses there, or you can join one of my online courses if you're a student in Germany.
Rick Scavetta: Then the last thing maybe I would mention is Shiny, in that it kind of goes part and parcel with Markdown in terms of an easy reporting method, right? So just the interactive Markdown documents, having a Shiny runtime is already using Shiny to a brilliant effect in a very, very simple way, right? So you can have interactive documents in a really nice, easy, simple way, which competes with anything that Python has, and you can build upon that using the same syntax and have really nice elegant web apps which are getting more and more supported and to really be professional data products in the room, right?
Boyan Angelov: Yeah. I'm also a big fan of R, right. I mean, obviously, I'll cover some things about Python because we want to make the comparison, right? I really have to say about the data visualization part for somebody who comes a bit from the Python world, for me, the ggplot package, I would say it's more than a package. I mean, it's such an interesting way to do data visualization. It really goes beyond the syntax, and I do think for a Python user to see how a ggplot is constructed, how you layer a plot, the crazy amount of flexibility and how we can change different models, use extensions about it, I think it's extremely important. This goes to the point which I made about a different way of thinking. So I can really suggest that one. Yeah.
Adel Nehme: Definitely agree with you here on Shiny and ggplot2. Yeah. I think even though primarily I'm a Python user, I think ggplot2 offers a very interesting framework for how to think about visualization, and I think Shiny is so far R's biggest killer app, for sure. Similarly, Boyan, given that Rick described the top five features of R for Pythonistas, how would you describe the top five features or components of the Python universe to useRs, with a capital R?
Boyan Angelov: I think exactly thinking about Python, the origins of it and where it is right now, I think for an R user, it will be amazing to see how much stuff is out there. Obviously, that comes with... I mean, it's not only a positive thing. It can be a bit overwhelming to see what kind of packages are available for Python. But the ones which are mostly used, I would say for like 80% of data science work, these are the PyData stack, right? So this is your NumPy, pandas, scikit-learn. I make a big argument in the book why this is such a great thing, how well they integrate with each other, because you can use them really interchangeably. For example, a pandas DataFrame underneath is collections [inaudible] and scikit-learn just works with it. This is something which I always found amazing, that you can expect things to work in the Python packages together, and this is something that I think R users definitely should check out.
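Editor's note: a minimal sketch of the kind of integration described above, assuming pandas with its default NumPy-backed storage and scikit-learn are installed. The toy data and column names are invented for illustration and do not come from the book.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data, invented for this sketch: y is exactly 2*x1 + 3*x2.
df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": [0.0, 1.0, 0.0, 1.0],
})
df["y"] = 2 * df["x1"] + 3 * df["x2"]

# pandas can hand the same values over as a plain NumPy array.
features = df[["x1", "x2"]].to_numpy()
print(type(features), features.shape)

# scikit-learn accepts the DataFrame directly, with no manual conversion.
model = LinearRegression().fit(df[["x1", "x2"]], df["y"])

# Because the toy data is noise-free, the fit should recover the
# coefficients 2 and 3 up to floating-point error.
print(np.round(model.coef_, 6), round(float(model.intercept_), 6))
```

The point of the sketch is only that the three libraries pass data between each other without glue code. It is not a recommended modeling workflow.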
Boyan Angelov: Other packages and tools and frameworks for R users which might be interesting, they're from kind of adjacent data science fields. I mean, I'm a big fan of the idea that the data scientist should be quite comfortable with data engineering as the workflow. I mean, obviously data scientists don't need to know Scala, for example, or to complete pipelines perhaps only by themselves. But they should know some of those things, how they're done. I would personally expect people to know how to build an API, and for those types of things, Python, because of its kind of glue-like nature, has developed so many more packages.
Boyan Angelov: Off the top of my head, just Flask, perhaps BentoML, FastAPI. These are just wonderful tools for you to deploy your models. I think also Python makes it so easy to go into different things. For example, as an R user, you might think, how do I get into IoT? How do I get into... There's a weird database that I need to explore, and maybe Python has a package for that, and this is kind of the big benefit there.
Adel Nehme: Obviously, the package ecosystem is a crucial consideration practitioners make when deciding for one language over another. Can you walk us through which subdomains of data science you think each language excels in, and what are the criteria you choose when evaluating one particular ecosystem over another?
Boyan Angelov: So me and Rick had a discussion about it before we started the writing. Again, it wasn't immediately obvious to us how to make the distinction. Our first idea was to focus on the domains of application, for example, finance, bioinformatics, and things like this. But that clearly didn't feel right to us in a way. It felt a bit like discriminating against other domains and didn't feel like the right way to split the two languages, because in some of those domains, there is a mix. I mean, clearly you might make the argument that in finance, there's a lot of R still there. But this is changing as well, and the balance might be switching in the future.
Boyan Angelov: Our decision there was to split the two languages, the comparison, into two levels. So one that we selected is the data format, and by that we mean the way the data is stored. So in a way, you could say, "Okay, we have spatial data. We have image data. We have text data, and we have time series." Obviously, as we write in the book, there's some others. But these, you can see, "Okay. They correspond to a big proportion of what we see in the real world." How do R and Python connect to those data formats? How easy is it to work with them?
Boyan Angelov: The second level that we chose was the level of workflow. I mentioned data engineering before. Rick mentioned data visualization. So in the book we write, for example, about the machine learning workflow, the exploratory data analysis workflow, the reporting workflow, and we want to make the comparison, which one of those is easier to do in one language or the other. So maybe I can give a specific example for the EDA workflow, what Rick mentioned, and I also supported him in that. ggplot is such an amazing tool for data visualization, and it's so, so nice for plotting in R. It's so easy to make really amazing reports as well.
Boyan Angelov: For both of us, it was very clear that our recommendation would be to do that. While, for example, if you're working with a text data format, it might be stored somewhere in a non-relational database, and as I mentioned before, the glue-like nature of Python makes it much easier to find a weird database connector for that and work with it, and tools like spaCy, for example, more of the natural language processing tools, they are just extremely mature in Python, and this is how we made the distinction.
Rick Scavetta: Yeah. Maybe I would also mention here that in terms of thinking about the workflow and the data formats of when you would use each language, so really on a use case-based scenario and not on a, okay, in this domain, in this field, in this industry, you're going to use this language, but really to think about, okay, what is the kind of data? What is the problem as a workflow? To focus more on that means that we kind of frame the language usage in more of a generic term, right? So the way I always like to think about whether it's data science or being in the laboratory as an actual scientist or a programmer or whatever you've got is you've got this bag of tricks, and the younger you are, the more novice you are, you've got this small bag of little tricks, right?
Rick Scavetta: So also, in math, right, high school students have got a very small bag of tricks, and the more you get deeper into mathematics, you learn really about different tools, and your bag of tricks grows. So when you see something, you have more tools that you can use with it. Your bag of tricks grows. Part of that bag of tricks is just expanding your language use here and really identifying generic scenarios that you can use either one of them. Right? So I think that's more of a way that we kind of try to approach the use cases for when you want to use one language versus another.
Rick Scavetta: Then some of that goes right back to the early days to the very ethos, right? So like the EDA and the data visualization, so stuff that Boyan mentioned. I see that as really being hard-baked right into the foundation of R, and it's something that I mentioned also in the book is that when you look back at the origins in the late '70s and early '80s of S that eventually became R, you saw that the people working on it were very much embedded in the statistics community at the time, and they were publishing books about exploratory data analysis. Right?
Rick Scavetta: So Tukey, I think was the one that coined the term exploratory data analysis with his book called Exploratory Data Analysis. He was not involved in authoring S at that time, but he was part and parcel with the whole cohort at the laboratory. So the originator of exploratory data analysis was in cahoots with the people that were working on S at that time and also thinking about using that visualization as a tool in exploratory data analysis and as a tool in explaining and reporting. That was present all the way right back at the very beginning. So that kind of ethos really led the way and made that. Of course, you can do really good visualization in EDA in Python. That's not the issue. But it's just that really defined a lot of, I think, what came later in R.
How Interoperability Works Between Both Languages
Adel Nehme: I think that was one of the most interesting parts of the book, especially how it connects the strengths of each language to its history. Especially for someone early in their data career like myself, you point out in the book as well that there's a lot of room for interoperability between both languages. So practitioners may not be as locked in to one ecosystem as they might think they are. Do you mind walking us through how interoperability works between both languages?
Boyan Angelov: Yeah. I mean, it's a bit of a growing field and while writing the book, you realize, "Okay, this is still early days in doing such work." Maybe I can talk a bit about the idea why you do such a thing. What are the possible scenarios when you might select a different technology? Maybe Rick, you can go through the specific example in the book because we have a super nice case study at the end of the book which we built. I think when you try to... Why would you want to work in the same environment? I think that's an important question to start with, and then we can get into the how. I think the why comes from, again, the modern teams. I think they started to become bilingual.
Boyan Angelov: You see it in job boards and job descriptions. You start to see that companies, they cannot afford nowadays to say, "Yeah, we are looking for an R person with whatever Shiny experience." They realize they need to hire people who can solve problems, and this makes the community much more diverse in the company as well. Then you start to run into those issues that we also talk about, like what do you do, for example, when you have a machine learning person who is working on the machine learning workflow, and this person comes from a computer science background with high-performance computing, and this person is very proficient in Python, scikit-learn, TensorFlow, PyTorch, and all of those things.
Boyan Angelov: Then in the same team, you have somebody who comes from, let's say a more standard statistical background where let's say in psychology, for example, where they use lots of R to do statistical testing and visualizations, perhaps. You put those people together. What might happen is... I mean, one way to split it from a project manager perspective is you have the person doing the EDA, and then they kind of communicate to the machine learning person, the machine learning engineer, who will build the model in Python.
Boyan Angelov: That can become a challenge because very often you do need to at least run each other's code. You need to also understand how certain things happen, right? Even to take the example further, when you're done with the model, even if you did it in R, how do we deploy it? You go to the data engineering team, and you say, "Okay. Here's my model in R. Go have fun." You'll see that this becomes a challenge. We did a survey on a few tools, which help you with this workflow and how to combine them together. Our selection of choice was reticulate, the package that we use in the book. I think from our perspective, it gives you the great combination between both of them.
Rick Scavetta: I can elaborate on that a bit. Before we get into particulars, I want to go back to a point that Boyan mentioned at the beginning, which was really using and being bilingual in a company in a practical context. One point there that I kind of want to highlight is that there's many ways of doing data science, and data science is not a protected title, and it's not a certified title. It's not a career that has letters after its name that says, "Oh, you are a certified data scientist, and this is exactly what it means." Right?
Rick Scavetta: So data science means a lot of different things. On the ground, it looks very different, depending on the context where you see it. It looks different in academia versus in industry, and it looks different in smaller companies versus larger companies and the way it's integrated and its use, the expectations and the tools that we use. Right? So when we think about data science being deploying the models in industry, that's one aspect, right, and working with the data engineers. That's one aspect. But there's many different ways that we use that.
Rick Scavetta: So what I'm seeing, I'm not involved in hiring, thank God, but I am in contact with people that are hiring, and one thing that you see come up over and over is that people are less concerned about the language and more interested in your skills as an analyst and really understanding analytical thinking as distinct from critical thinking skills, right? So really understanding what those things are and really being able to demonstrate that you have analytical and critical thinking skills, because those things are very difficult to learn, and they're very difficult to teach, right? So I try to have that in my courses, but it's really difficult to kind of really make sure that people kind of get that, and really difficult to teach that.
Rick Scavetta: So if you come to a company being able to demonstrate that you have critical thinking and analytical thinking skills, you can pick up one of the languages. If you only know R, you can pick up Python. I know people here in Berlin that have very, very strong math background, and it's purely R, and they're hired at a startup, a very small startup, and they're given a month, and the first month is learn Python. Your job is literally learn Python, and we're going to give you one month. We'll pay you a salary, and your job is to learn Python because we want you and your skills and your math skills and your analytical skills. The language is irrelevant. We use Python, so we'll give you time to learn it, right?
Rick Scavetta: So in that sense, it's really good to be able to kind of switch back and forth, but it's more about how you think, and do you really understand the problems that are available, and also really understanding the business use case scenario and how your work is related to the needs of the company and the bottom line and the return on investment. So those are really important skills that are really language agnostic, right?
Rick Scavetta: So then you come into a company, and maybe you're given the freedom of which language to use, but you may be working with people that are using a different language. That may not be allowed. You may be encouraged to use one language just because it makes things easier. But if you are able to, you may be able to use R and then somebody has some Python scripts, and that's a scenario that I've seen quite often. You come into a laboratory or a company, and somebody has made a workflow or some protocol written in Python with maybe no documentation, or I've also seen incredibly detailed and complicated documentation because the scripts are so complicated and messy that you need this massive amount of documentation so that people can just make heads or tails of it.
Rick Scavetta: So instead of trying to rewrite all of that Python in R, just use it. This is something that's been going on for decades since the early days of computing, where you just pass documents, and you just call one script from another, right? So you can see that Bash is calling Python scripts, and you're calling Python scripts from R. Those are things that have been happening for decades and for a long time before the days of Python. So just calling scripts and passing documents between one language and another, that's already interoperability, right? So using those skills and then just using the scripts that you have available to you.
Rick Scavetta: Then of course, as Boyan mentioned, the next level is really being bilingual and using them together in an integrated environment. At the moment, the best supported way of doing that is with reticulate. But that's really a specific use case where you really need to be passing objects between the different languages and not just calling a script from one language to the other.
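The script-calling style of interoperability described above, where one program writes a file, calls an existing script, and reads the result back, can be sketched roughly as follows. This is only an illustration under stated assumptions: the child script, its name legacy_step.py, and the column names are hypothetical, and the child is run with the Python interpreter purely so the sketch runs anywhere. In a mixed team the same pattern would typically launch a colleague's R script with Rscript instead.

```python
import csv
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for a colleague's existing script. It reads a CSV,
# adds a "doubled" column, and writes a new CSV. In a mixed team this could
# just as well be an R script launched with Rscript.
CHILD_SCRIPT = '''
import csv, sys
in_path, out_path = sys.argv[1], sys.argv[2]
with open(in_path, newline="") as f:
    rows = list(csv.DictReader(f))
for row in rows:
    row["doubled"] = str(2 * int(row["value"]))
with open(out_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["value", "doubled"])
    writer.writeheader()
    writer.writerows(rows)
'''


def run_pipeline(values):
    """Write values to a CSV, run the child script on it, read the result back."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        script = tmp / "legacy_step.py"  # hypothetical file name
        in_csv = tmp / "input.csv"
        out_csv = tmp / "output.csv"
        script.write_text(CHILD_SCRIPT)

        # The "document" passed between the two programs is a plain CSV file.
        with open(in_csv, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["value"])
            writer.writerows([v] for v in values)

        # check=True raises CalledProcessError if the child exits non-zero.
        subprocess.run(
            [sys.executable, str(script), str(in_csv), str(out_csv)],
            check=True,
        )

        with open(out_csv, newline="") as f:
            return [int(row["doubled"]) for row in csv.DictReader(f)]


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3]))
```

Nothing here passes live objects between languages; only a file crosses the boundary. That is the distinction drawn in the conversation between calling scripts and the tighter integration a tool like reticulate provides.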
When to use both languages?
Adel Nehme: So in the final chapter of the book, you brilliantly laid down a case study, showcasing how R and Python can work together in the real world. Before you described this workflow, what are the real world scenarios where data teams need to use both languages in the same workflow?
Rick Scavetta: I think it goes back again to this concept of just using existing resources, right? You don't need to reinvent the wheel, right? If somebody in the company already has a solution that works very well, and you just need to pass a specific data format to that script, then why not just use it? You can reuse existing resources in an easy way. Then Boyan, you can speak to some specific scenarios or specific examples.
Boyan Angelov: I mean, one thing which happens is what I mentioned before: in Python, we have a ton of packages for everything possible, right? Often as a data scientist, you want to figure out some specific problem, and you almost always have the feeling that somebody else has solved it already. Then let's say you are an R user, and then you see that in Python, you have these super nice packages. There's this very specific thing. Let's take the NLP example. Let's take a nice package like spaCy. You might think, "Yeah, okay, spaCy is such a cool package. I would love to use it, but maybe I need to know Python to do that." You can use, for example, reticulate from within RStudio.
Boyan Angelov: So the nice thing about reticulate, which is also pushed by RStudio, is that it is very well integrated in the IDE. So for example, if you are importing objects, Python objects, you will see them in the environment then. So for example [inaudible] data frame will be available there together with your R data frame, and you can run both Python and R code on it, and you can show your plots. So those tools make it very easy to do such stuff. You might take this weird package from the other language and use it. You can just run it from within R, from within your tool, and take advantage of it without using a wrapper, because a lot of those tools nowadays are ported from one language to the other one. A case in point may be LIME for explainable AI. There's also, I think, Keras, which is also a wrapper.
Boyan Angelov: So there are a few of those, which for example, just call the functions under the hood from the other language. I mean, this is all fine. It allows you to do things. But you have to rely on the developers maintaining those packages and keeping them up to date. While if you use reticulate in RStudio, if you really work how we showcase in the use case, then you will take advantage of the most recent version of the package without worrying, and you can have Python code which somebody can look through as well.
Adel Nehme: So given how data science languages are becoming more and more interoperable, what is your advice to someone who's starting off in the data science space right now? Which language do you think they should learn first? How do they maneuver through becoming bilingual?
Boyan Angelov: I mean, there's a different choice. There was the choice, do you learn the specific language, or do you learn by... Learn by doing is what we hear all the time. So this is a question which both of us I think have received so many times: do you pick R or Python at the beginning, right? You could make a really nice argument for R in some cases, saying, "Okay. Super easy to set up R." You just download it. RStudio is an amazing, amazing IDE. Now we have VS Code for Python as well, so that's getting better. But setting up Python, do you use Anaconda, do you use virtual environments, do we have a system? I think it's not so easy, right? But [inaudible] argument, which is a fair one, I will say that you should not even think about the language. You should think about what you're trying to do.
Boyan Angelov: Even as a beginner, our advice I think would be focus on the problem that you want to solve. Let's say you want to learn data science. Our advice will be pick a problem that you find interesting. Many people will say I'm interested in investing problems. In that case, sure, I mean, some of the data sets are more like time series related. Then we go into what we discussed in the book: take time series data. We have super nice packages in R for how to visualize it, for example. Then you go into R and then learn it by doing. So this would be my advice. Do not worry too much about which language, because they will be converging in some aspects, in some aspects not. You should know both, ideally, in the long run.
Rick Scavetta: Yeah. I would agree with that, that it doesn't really matter at the beginning. But I would also add to that, that I kind of have a different perspective on how you should go about learning data science and why you should be learning data science. A lot of people are motivated by career and by money, and that's fine, right? That's perfectly legit. Maybe I'm a little bit of a naive academic, but I think that it's more of, if you want to enter that field and if you want to learn data science and you're thinking about what's the best way to enter into the field, it really needs to come from a place of motivation that is an authentic curiosity. I think that's something that's hard to fake.
Rick Scavetta: If you don't have that authentic curiosity in the problems and really just a pure fascination for how and why things work the way that they do and really picking apart things and understanding how they work, it's kind of hard to fake that, and it's kind of hard to develop your analytical and your critical thinking skills to do that. So you may have the actual tool set. But if you don't have that feeling for it... People say you have to love data. I don't know. You don't have to love that. Sometimes I really hate it, and it really completely drives me nuts, depending on what kind of data set I get to work on or what I'm looking at. But I really love seeing how things work and picking things apart.
Rick Scavetta: So think about your passions and how those come into play in data science, right? So in that case, the language is less important. It's really more about kind of what your motivations are for learning data science and entering into the field. Then in terms of what language to use, I would say based on your interests and things that have brought you into the field that are guiding your passions, choose a language that your colleagues are using. Right? So I teach mostly young scientists, and I love the problem that I'm faced with because it's kind of opposite to what most young, new data scientists face.
Rick Scavetta: The problem that my students face is that they have a lot of data, and it's really fantastic and very interesting and idiosyncratic, unique, and very expensive data that they've collected themselves through experimentation in the laboratory, and we're talking about everything from immunological data, data on primate behavior, climate data, stuff in oceanography, all kinds of different, interesting data sets that my students are working on, and they need help on how to work with this data. So they're very desperate, and they're hungry, and they're very eager to understand how to use it.
Rick Scavetta: So they just want to know, I need to answer this problem, and I want a tool that's going to allow me to do that, and they want to work in an ecosystem that would have that support. For many of them, it's going to be R. Right? So they have that kind of issue where that's what's guiding their decision. A lot of people that come to the field, they don't have a problem that they need to solve. They just are learning for the sake of learning, which is fine, right? Completely legit. But then they face that issue where they've got a lot of skills, but they don't have any data to actually work on. So they're just skilled people looking for problems to solve, and that's kind of the opposite. I think it's nice if you have a problem that you need to develop the skills to answer that, and it's much more motivating and in tune with what you're looking for. Then you choose a language that you have the most support and which your colleagues are using that helps you answer your problem as easily as possible, basically.
Adel Nehme: So if you want to flip the switch maybe a bit and provide guidance for maybe experienced data scientists looking to pick up the language that they're not used to working with, what's your advice for their learning journey?
Rick Scavetta: Well, there's a really fantastic book that was just published that [inaudible]. It's really fantastic. But I think one thing that I recommend to my students when they're just starting out is take something that you know the answer to, that you know will work, and try and reproduce it in the other language. This is not what I would encourage on a daily basis as a workflow, right? Because it's basically like a one-to-one translation. A one-to-one translation works in terms of really helping you understand and seeing that you got the so-called right answer, but it is not something that you want to do every day as part of your job, right? Nobody's going to employ you to be the one-to-one translator.
Rick Scavetta: But in terms of learning, it's really nice. So for example, I tell students that are really just starting out to take something that you've done in Excel or SPSS or GraphPad Prism or whatever Genome Viewer you're using, blah, blah, blah, and try and reproduce that in R and then see the workflow. Don't worry if it's a big, hot mess. The main priority is that you got the right answer, that you got the answer to the question that you thought you were asking and not that you got the right answer to the wrong question, which is the worst thing ever in data science, 100%, the worst thing. Worse than not knowing what you're doing is getting the right answer to the wrong question.
Rick Scavetta: So that's what I tell my new students, and I would say for the intermediate ones that already are very good in R or Python, I say do the same thing. Take your Python and try and reproduce it in R and see how you can build upon that and see how you can make it a bit more elegant or vice versa, right, and see where the event is, and you can kind of tease apart the waste inside of there.
Boyan Angelov: Yeah. Maybe about the more experienced data scientists, how do they get into the other language? I can say at that level, I think they would have the stamina to go the hard way. I studied biology, right, biochemistry. While that happened, I was still interested in code. The problem was my father bought me a big book, more like a reference book on Visual Basic, and I gave up because of this, obviously. You get this huge reference book on the real nitty-gritty of the language. Yeah. That wasn't a great idea back then.
Boyan Angelov: But actually, I do think if you are super experienced in either tool, you might actually go that way. You really can look at the really practical differences, the one-to-one differences. You might look at this complication in Python. It's super interesting how pandas is also structured, how the pandas syntax is different from [inaudible], for example. You might really go the hardcore route where you really look at the theory, not that much at the data and the practice. Yeah.
Rick Scavetta: So on that point with Visual Basic, actually that was one of those languages that I learned in high school. I learned it, and the motivation for learning it was because we were... Basically, it was, I think like an upper level computer science class in high school, and we were given the task to make a program. The program my team decided to make was a Bubble Bobble. Well, Bubble Bobble. But you know these Bubble Bursts. You point the thing, and there's all these bubbles at the top, and then you match three resonant identifiers. So we made that game in Visual Basic, right? That was highly motivating to work as a team to do that because we saw the output at the end. Right?
Rick Scavetta: So that kind of comes back to this point of you have a problem that you want to solve. Our problem was we need to get credit for that class. So we needed to do this project. But that's the difference in learning of having a clear goal that you want to achieve and the reason why you want to achieve that goal, right? I need to answer this problem with this data for my research, and nobody else in the world is addressing this. So I'm very motivated to be the first person to address this. That was one of the motivations when I was a PhD student. I wanted to look at gene order between different species. At that time, it was really just coming on board that you could look at synteny and homologous genes between the two organisms. The idea was to find where you lacked synteny and where you had orphaned genes.
Rick Scavetta: There's no program that's going to show you orphan genes in different organisms or where you lacked synteny between homologous genes. So I pitched this to my supervisor, and I was really lucky because my supervisor was completely hands-off in a classic German style. I absolutely loved it. He said, "Okay. Well, I guess you're going to need to learn how to program." I was like, "Yeah, I guess so." So I got a book on Perl, and there was an expert in the lab who actually had written a book about Perl, and I got his book, and I started just going through it and learning Perl. Then I realized that the problem was as interesting as I thought it was. But that was a motivating factor for me to learn Perl, and that was an entry into R and then later into Python.
Adel Nehme: That's really great, and I really admire this means and ends approach to learning programming languages. So before we wrap up, I'd love to get your thoughts on the future of programming and data science and where you think it's headed. What do you think are some of the major trends that will affect programming and data science as we know it today?
Boyan Angelov: From my experience, the future is bright. I would say I don't think there's ever been a better time to be a data scientist. Both of those languages are extremely active. There are some amazing things happening every week. The tools are so nice. Your IDEs are nice. There's so much material online where you can learn. There are so many problems to solve. So I do think the future is very bright to be a data scientist. One thing I can argue now is you can see the tools become better, which democratizes data science. It is much, much easier to do things, which comes with some dangers, right? Because if it's easy to do machine learning, you might do bad machine learning.
Boyan Angelov: More people statistically have a chance to do not such great ethical work, for example. So we have to pay attention to that, for sure. But more importantly, the question which sometimes I get asked is, yeah, as a data scientist, do you think that in the future, you will get the job automated? Will we run out of work because we have GitHub Copilot and stuff like this? I would argue perhaps it's possible. In my career so far, I have only seen the field grow, but I will definitely argue for new roles to come up, for example, like mine, which is a data strategist, where skills like communication, empathy to the team, how do you think about business, for example? How do you think about the domain? These general problem-solving, consultative skills will be much more important in the future, perhaps for most jobs, I will say. So it's super important for data scientists to learn those things as well.
Rick Scavetta: I mean, I would agree with those, but I think there are some other clear trends that I think are going to dominate in the future. One is, I mean, there seems to be a trend away from being really strong Python and R programmers. So one of the trends that we see is, as Boyan mentioned, things like AutoML or the Copilot that was just released by GitHub. I think AutoML is still kind of developing, even though it's been around for a while. It still is not really running completely independently. People still have to intervene manually and take a look at the results there. Copilot is basically still in its early phases, and people have already reported on a lot of problems and reasons why we should not get so excited about it.
Rick Scavetta: But the trends are there. Right? Whether those things really flourish into really being standard bearers of how we do data science is yet to be decided. But they could very well potentially be there, and that means that we're getting less and less actual input from programmers, and it's less important that you're really a good Python or R programmer. One other trend for that is actually the Tidyverse in R. One of the reasons why the Tidyverse exists is... There are several reasons, but one of them is to just make the learning curve easier for new entrants into the field. So to just get stuff done. So there's a school of thought that is just teach Tidyverse, just do the Tidyverse, and everything's in Tidyverse.
Rick Scavetta: Part of that is that you want to just show how to analyze a problem, and then that's the entire story, and then you're done with it. These are not people that are going to be career data scientists, and they don't care about how it works under the hood. They don't care about attributes, and they don't care about the ethos of a list and metadata and vectors and how things are put together in R. They just want to get their stuff done. I can see that. But what that means at the same time is that you're having people that are just using a set of commands which are umbrella commands or convenience wrappers that are not really getting really deep into understanding how those things work.
Rick Scavetta: So there are trends which are kind of moving away from a very detailed, fine understanding of Python and R. If that continues, then it means Python and R are actually less important. Right? Aside from that, knowing that you have this automation and these convenience wrapper packages and tools, you also have other packages or other languages that are coming up, like Julia, right?
Rick Scavetta: So the Julia conference I think just wrapped up yesterday. They had fantastic learning tutorials the whole week, really fantastic things on a whole variety of different topics, including a lot of things related to inferential statistics, basic statistics, machine learning, really hardcore data science topics, I think even some stuff in the life sciences. So really interesting stuff is happening with Julia, and who's to say? Julia has been growing for the past years, and maybe it will continue to grow and overtake Python and R. So I think one of the key takeaway messages in the community in looking at the field is that you have to kind of keep on your toes about keeping your skills up to date and really thinking about, okay, what's the thing I need to learn to really keep on board, to make myself relevant for the company, with the positions that I'm really adding value to.
Rick Scavetta: It comes back, I think, to things like critical thinking skills, analytical thinking skills, understanding the business problem. Something that Boyan mentioned as well was the communications and the empathy. Speaking in that regard, I think those things have been talked about a lot in recent years. But I feel they're oftentimes just paid lip service. There's very little real understanding of what that means. Right? What actually are communication skills, and how is that realized on the ground?
Rick Scavetta: So for me, that is in the larger context, really focus on the idea of data design, right? So what is the design of a data product? That's really from start to finish, and part of that is really understanding the use case and understanding the user as a core feature in designing data science solutions, data science products, right? When you think about design, then you start really thinking about the user. You start thinking about people and the design community in all various aspects. They're much more deeply embedded into really understanding, developing a product for a user and understanding the persona and the people that are going to use it.
Rick Scavetta: I think that's something that is not discussed enough in data science. So I think the empathy and the communications will increase, and I would also predict that there's going to be a bit more of a focus on the actual design. So I think in the last year, the Institute for Data Science in Berkeley was able to get funding for a Data Science by Design group within that. They just had this Data Science by Design creator conference.
Rick Scavetta: That I find really, really fascinating and really exciting, because here we see this focus on design. But there's also this focus on the creativity aspects of that, which is wonderful, which is really a nice aspect. Arts and creativity are part of design, but you can be a designer without making art, without making visually stimulating creative objects. So design is really something that runs through a lot of what we're doing, and I think that's something that data scientists have neglected to think about. I think it's something that will increase in popularity coming up.
Call to Action
Adel Nehme: That's all super fascinating, and it's great to see a world where data science is becoming increasingly democratized. Finally, Rick, Boyan, do you have any final call to action before we wrap up today's episode?
Boyan Angelov: From my side, I'll just repeat what I said: it's just never been a better time to be a data scientist. I think it's important to be also a good data scientist, to work on problems which are important to the people around and to the world, and there are so many problems to solve. I think both of us have such rewarding careers. So far, I don't think we regret anything about our field. So I would just invite more people [crosstalk]-
Rick Scavetta: I regret everything. I don't agree with that.
Boyan Angelov: I invite more people to be a part-
Rick Scavetta: If you didn't make choices that you regret, you're not making the right choices.
Boyan Angelov: Okay. So for sure.
Rick Scavetta: I think you should be regretting some parts of the things that-
Boyan Angelov: So for sure. Join us on the journey, become a data scientist, move into the field, and solve the problems that we have. Yeah.
Rick Scavetta: So one thing that maybe I would like to mention before we wrap up is that we do have a website for the book. So first off, get the book. It's really fantastic. But we do have a website for the book, and that comes back to the idea of design. I really do see being bilingual as part of understanding the design of data science products and really understanding your skills as a designer, not just a data scientist. The website is called moderndata.design, not .com, .design because we're fancy. So moderndata.design is the website, and there you can sign up to get updates on the book, and we are preparing a nice little cheat sheet, which is the R Python bilingual dictionary cheat sheet, which you can sign up to receive by entering your email address into the website and keep up to date with any updates on the book.
Adel Nehme: We'll definitely include all these details in the show notes. For whoever visits the site, it's really beautifully designed. With that, thanks a ton, Rick and Boyan for coming on the show.
Boyan Angelov: Thank you.
Rick Scavetta: Yeah. Thanks a lot, Adel. It was really nice talking to you.
Adel Nehme: That's it for today's episode of DataFramed. Thanks for being with us. I really enjoyed Rick and Boyan's perspective on creating a more inclusive data science learning journey that incorporates both R and Python and promotes a needs-based approach to learning. If you enjoyed this podcast, make sure to leave a review on iTunes. Our next episode will be with Noah Gift on creating pragmatic AI solutions. I hope it will be useful for you, and we hope to catch you next time on DataFramed.