Data Science Tool Building
Since 2007, Wes has been developing data analysis software, mostly for use in the Python programming language. His primary objective has been improving user productivity, increasing performance and efficiency, and enhancing data interoperability. He is best known for creating the pandas project and writing the book Python for Data Analysis. Since 2015, he has been focused on the Apache Arrow project. He also contributed to Apache Kudu (incubating) and Apache Parquet (where he is a PMC member). He was the co-founder and CEO of DataPad. He later spent a couple of years leading efforts to bring Python and Hadoop together at Cloudera. In 2018, Wes founded Ursa Labs, a not-for-profit open source development group in partnership with RStudio, and became a Member of The Apache Software Foundation.
Hugo is a data scientist, educator, writer, and podcaster at DataCamp. His main interests are promoting data & AI literacy, helping to spread data skills through organizations and society, and doing amateur stand-up comedy in NYC.
Transcript
Hugo: Hi there, Wes, and welcome to DataFramed.
Wes: Thank you. Thanks for having me.
Hugo: Great pleasure to have you on the show, and I'm really excited to have you here today to talk about open source software development, to talk about your work at Ursa Labs, Apache Arrow, a bunch of other things to do with tool building, but first I'd like to find out a bit about you. Perhaps you could open by telling me what you're known for in the data community?
Wes: Sure, I'm best known for being the original author of the Python Pandas project, which I understand that a lot of people use. So, I started building that as a closed source library a little over ten years ago, and I've been working on a lot of different open source projects for the Python data science world and beyond. I also wrote a book called Python for Data Analysis, which is now in its second edition. I think that's become a pretty ubiquitous resource for people that are getting into the field of data science and are wanting to learn how to use Pandas and get their feet wet with working with data.
Hugo: And, congrats on the second edition. That was released in the past year or so, right?
Wes: Yeah, it was just about a year ago, the end of 2017.
Hugo: How did you get into data science tool building originally? Because, I'm aware that your background wasn't in CS, per se.
Wes: Right, I was a mathematician, so I studied pure math at MIT. I did a little bit of computer science. I had had some exposure to the world of machine learning.
Wes: There were a number of MIT grads that had gone to work there. Some of them were math majors, and they kind of sold me on the idea of getting experience with applied math, and then maybe I would go back to grad school later on. I found in my job there that, rather than doing very much applied math, I was really doing a lot of data munging. I was writing SQL. I was using Excel. Really, I just found that I wasn't as productive and efficient working with the data as I felt I should have been.
Wes: Part of it was, well, I'm just starting out my career. I'm 22 years old. What do I know? But, I looked around, even at people who were a lot more senior to me and a lot more experienced, and it seemed like they weren't very productive either. Obviously, their skill with Excel and Excel keyboard shortcuts was a lot better than mine, but still it seemed like there was just something missing from working with data.
Wes: I started to learn R at the end of 2007, beginning of 2008. At that point in time, the R ecosystem was a lot less mature. It felt like an interesting, valuable language for doing statistics and data analysis, but we also needed to build software. So, I learned a little bit of Python and thought, wow, this is a really easy to use programming language. I had done some Java programming and thought that I just wasn't very good at Java.
Wes: So, I thought maybe I'm just not cut out for building software. But, I decided to have a tinker with building some data manipulation tools in Python. That was March, April, 2008, and just went down the rabbit hole from there. Once I had made myself more productive working with data, I started evangelizing the tools I was building to my colleagues. I kept pulling on one thread and ended up becoming more of a software engineer than a finance or math person.
Why did you choose Python?
Hugo: Yeah, there are a lot of interesting touch points there. For example, your background in pure math and that you were in Connecticut. I was actually working in pure math and ended up doing applied math in a biology lab in New Haven, Connecticut, not in Greenwich, but at that point I started dealing with data a lot as well. That's when I started getting into data science also. It's also interesting that Pandas, when you first developed it, was closed source. But before we get there, you've spoken a bit to why you chose Python. Could you explain a bit more about what was attractive about Python then? Because, of course, a lot of what attracts researchers and data scientists to Python now is the data science stack: Pandas, scikit-learn, NumPy, all of these things. What made you really like it back in the day?
Wes: Yeah, at that point in time, 2007, 2008, in terms of doing statistical computing, Python was not ... Let's think of it as a promising world that had not yet been terraformed. I think there were the nuts and bolts of a really interesting environment. I learned about the IPython project and said, "Okay, here's a really nice interactive shell where you can plot things. It has tab completion and really basic interactive affordances that help out a lot." You had the nuts and bolts for doing all of the analytical computing that you need to do for data manipulation.
Wes: NumPy had its 1.0 release, I think, in 2006 and had become a mature project and the scientific Python world was de-fragmenting itself after the numarray/numeric rift, which had persisted for several years. Travis Oliphant had worked to bring those communities together. Really, I think what attracted me to the language was the accessibility and the fact that it was really very suited for interactive and exploratory computing, that you didn't have to set up an elaborate development environment, an IDE, to be able to get up and running doing some really basic things. Having had experience with Java, I think one of the things that put me off about Java was the elaborateness of the environment that you need to really be productive.
Wes: You really need to set up an IDE. There's all this tooling that you need to do, whereas with Python, you could do some pretty complex things with a few lines of code in a text file, then you run the script. So, that kind of interactive scripting feel of doing exploratory computing was really compelling to me at the time. But, obviously, Python was missing a lot of tools. So, it was a bit daunting to start the process of building some of those tools from scratch.
Hugo: Yeah, and you mention IPython, NumPy, and Travis, and I suppose this was the time where John Hunter was working a lot on matplotlib and working with Fernando to incorporate it with IPython. There was a lot of close collaboration. I suppose this speaks to the idea of community as well. Did you find the scientific Python community something that was also attractive?
Wes: Yeah, I didn't have much interaction with the community until much later. I think the first person ... There's two people that I met from the Python community who were my first point of contact with that world. One person is Eric Jones who is a founder of Enthought, which is the original Python scientific computing company based in Austin, Texas.
Hugo: And, they also run the SciPy conference.
Wes: Yeah, they run SciPy. Enthought was doing a lot of consulting work in New York City with financial firms that were getting big into Python during that era, training, and custom development. I got in touch with Eric some time during 2009 and gave him the very first external demo of Pandas. This was right around the time that we were getting ready to publish the Pandas bits on PyPI and so forth, the first open source version of the project. The second person I met was John Hunter himself, from matplotlib. I met him in Chicago in January, 2010. At that point, I was looking around for how to engage with the Python world, having just open sourced Pandas, and because John was working in finance; he worked for Tradelink up until his death in 2012.
Wes: But he was a quant there, having been a neuroscientist and had been building matplotlib for many years. He took me under his wing. He was my mentor for a couple of years and helped me enter and get involved in the community. I definitely feel that I found it a very warm and very inviting community, very collaborative and collegial. I think I was attracted to that feeling. It didn't seem like a lot of people competing with each other. It was really just a lot of pragmatic software developers looking to build tools that were useful and to help each other succeed.
Hugo: Yeah, and you actually still get the sense of that when you go to SciPy in Austin, Texas every July or every second year. You still get a strong sense of community and people just loving building the tools together.
Wes: Yeah, totally, totally. Obviously, the community has grown much bigger. I think the ratio of project developers, people working on the open source projects, to the users, that ratio has certainly changed a lot in that there are a lot more users now than there are developers. I think the very first SciPy conference was probably the majority of people there were people who were the developers of open source projects. But still, I think it's a great community, and I think that's helping continue to bring people into the ecosystem.
Hugo: Actually, I had Brian Granger on the podcast recently, and we discussed this. Several people are discussing at the moment that we're now entering a phase transition, from having individual users of a lot of open source packages spread across orgs and across the globe to actually having large scale institutional adoption, right? I'm wondering, in terms of Pandas starting off as a project, I'm under the impression it was started as a tool to be used in finance. Is that the case?
Wes: Yeah, it was focused. You can go back and download Pandas 0.1, which was published to PyPI in December, 2009, and see what was in the library. Compared with now, the functionality was a lot more geared towards time series data and the kinds of problems that we were dealing with back at AQR. I wouldn't say that it was necessarily finance specific. It was very general data manipulation. It was a pretty small project back then, but it was just about dealing with tabular data, dealing with messy data, data munging, data alignment, essentially all those really basic wrangling and data integration problems. It wasn't really until 2011, 2012 that I built the project out and created a more comprehensive set of relational algebra facilities. It didn't have complete joins, all the different kinds of basic joins, until late 2011. So, its feature set was certainly skewed by the use cases that we had in front of us back at AQR.
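The data alignment and join features described here can be sketched with a short, hypothetical pandas example (made-up data, not from the episode):

```python
import pandas as pd

# Two series with partially overlapping date indexes, like the
# time-series-focused problems early pandas was built for.
a = pd.Series([1.0, 2.0, 3.0],
              index=pd.to_datetime(["2009-01-01", "2009-01-02", "2009-01-03"]))
b = pd.Series([10.0, 20.0, 30.0],
              index=pd.to_datetime(["2009-01-02", "2009-01-03", "2009-01-04"]))

# Automatic index alignment: the result covers the union of both
# indexes, and dates present in only one series become NaN.
aligned = a + b

# Basic relational joins on tabular data (the kind of functionality
# Wes mentions landing in late 2011).
left = pd.DataFrame({"key": ["x", "y"], "v1": [1, 2]})
right = pd.DataFrame({"key": ["y", "z"], "v2": [3, 4]})
merged = left.merge(right, on="key", how="outer")
```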
How did you get the project off the ground?
Hugo: How did you get the project off the ground? I know that's a relatively ill formed question, but just in terms of hours and people and resources?
Wes: Well, when you smelt metal or forge weapons, you have to get the crucible really, really hot. We open sourced the project at the end of 2009. I think we had deliberated for about six months over whether or not to open source at all, and ultimately the powers that be decided that we would open source Pandas and see what would happen. I gave my very first talk about Pandas, which you can still find online, at PyCon 2010 in Atlanta. The subject of the talk was using Python in quantitative finance, but the project didn't really go anywhere after that. It was hosted on Google Code. GitHub existed, but it was a Ruby thing at that time. I left AQR to go back to grad school. I went to Duke to start a PhD in statistics, statistical science, it's called there. I continued to do a little bit of contract work developing Pandas for AQR.
Wes: I think the catalyst for me was that in early 2011 I started to get contacted by more companies that were exploring using Python for data analysis use cases. They had seen my talk at PyCon and were interested in getting my perspective on statistical computing. I just had this feeling that the ecosystem was facing a sort of existential crisis about whether or not it was going to become truly relevant for doing statistics. It was clear to me that Pandas was promising, but it really had not reached a level of functional completeness or usefulness to be the foundation of a statistical computing ecosystem in Python.
Wes: So, I guess I felt that feeling so strongly that I sort of had an epiphany where it wasn't quite like shouting, "Eureka," and jumping out of the bathtub, but I emailed my advisor and said, "Hey, I would like to take a year off from my PhD and go explore this Python programming stuff, and we'll see how it goes." I had some money saved from my first job, and I moved back to New York into a tiny apartment in the East Village which had mice and stuff. Really, not the best place I've ever lived, but I essentially was like, "I'm just going to work full time on Pandas for a while and build it out and see what happens."
Wes: I think that's when, as soon as I started socializing the functionality of Pandas and filling in feature gaps, implementing joins and fixing some of the internal issues ... Of course, I created other internal problems, but there were definitely some design problems in the early versions of Pandas that got fixed in the summer of 2011. But, as soon as Pandas could read CSV files pretty reliably and could do joins and a lot of the basic stuff that you need to be productive working with multiple data sets, I think that's when it started to catch people's eye, toward the end of 2011, and started to get off the ground.
Wes: So, around the same time, I pitched the idea of a data analysis book in Python to O'Reilly and they agreed to do a book, which, thinking back on it, was a bit risky, because who knows what would have become of it. Pandas was not at all obviously going to be successful back in 2011. They decided to take a bet, so much so that when I asked them later why they didn't put a panda on the cover, they said, "Well, we're saving the panda for something really big." So, it wasn't even clear then that Python and Pandas and everything was going to be a popular thing. It's important to have that kind of perspective.
Hugo: So, when living in the East Village, supporting yourself to build out the package, did you have any inkling that it would achieve the growth and wide scale adoption that it has?
Wes: Not really. Obviously, I had the belief that the Python ecosystem had a lot of potential and that projects like Pandas were necessary to help the language and the community realize the potential. I think there was a lot of computational firepower in the NumPy world and all the tooling, Cython, and tools for interoperability with the native code. I just wanted to help realize that potential, but I didn't really have a sense of where it would go.
Wes: There was also a significant confluence of things that happened, particularly the development of statsmodels and scikit-learn, which brought meaningful analytical functionality to Python. Really, the big thing that made Pandas successful was the fact that it could read CSV files reliably. So, it became a first port of entry for data into Python and for data cleaning and data preparation. If you wanted to do machine learning in scikit-learn, or you wanted to use statsmodels for statistics and econometrics, you needed to clean data first.
Wes: So, using Pandas was the obvious choice for that. But, yeah, it wasn't obvious. I recruited a couple of my former colleagues from AQR who worked with me on Pandas, and we explored starting a company around financial analytics in Python powered by Pandas, but we were focused on building out Pandas as an open source project while we explored that startup idea. Ultimately, we didn't pursue that startup, but it was clear that by mid-2012, that we'd sort of crossed the critical horizon of people being interested in Python as a language for data analysis.
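The "port of entry" workflow described above, reading a messy CSV with pandas and cleaning it before handing the data to a modeling library, might look something like this (a minimal sketch with made-up data):

```python
import io
import pandas as pd

# A small, messy CSV with a missing value, the kind of input pandas'
# CSV reader handles reliably.
raw = io.StringIO("name,score\nalice,1.5\nbob,\ncarol,3.0\n")
df = pd.read_csv(raw)

# Typical cleaning before handing the data to a modeling library
# (scikit-learn, statsmodels, etc.): fill missing values, tidy text.
df["score"] = df["score"].fillna(df["score"].mean())
df["name"] = df["name"].str.title()

# A clean numeric array, ready to pass to a machine learning model.
X = df[["score"]].to_numpy()
```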
Hugo: Since then, you've found certain institutions which have employed you in order to work on Pandas, right?
Wes: I wouldn't say that. Outside of my time at AQR, when I was building Pandas initially, I've never been employed directly to work on Pandas. I started a company called DataPad with Chang She. It was a venture backed company, and we were building a visual analytics product that was powered by Pandas and other Python tools. DataPad was acquired by Cloudera at the end of 2014, so Chang and I landed there. My role at Cloudera was to look holistically at the big data world and figure out how to forge a better path for Python and data science tools in general in the context of the big data world.
Wes: That's the Hadoop ecosystem and Spark and all the technology which was largely Java based, which had been developed since 2006 or 2008. But, I wasn't working on Pandas in particular at that point. I sort of had taken stock of the structural and infrastructural problems that Pandas had. I gave a talk at the end of 2013 at PyData in New York. The title of the talk was "Practical Medium Data Analytics in Python." The subtitle of the talk was "Ten Things I Hate About Pandas."
Hugo: I remember that.
Wes: I had, in the background, this feeling that Pandas was built on a fantastic platform for scientific computing and numerical computing. If you're doing particle physics or HPC work in a national lab with a supercomputer, Python is really great, and that's how the ecosystem developed in the late 90s and early 2000s. But for statistical computing and big data and analytics, the fact that strings and categorical data weren't first class citizens in that world made things a lot harder. Missing data was not a first class citizen. So, there were a lot of problems that had accumulated. At that point, I started to look beyond Pandas as it was implemented then into how we could build technology to advance the whole ecosystem, and beyond the Python world as well.
Apache Arrow
Hugo: I think a through line in this is really encapsulated by a statement you made earlier, which is you wanted to build technologies and tools that are truly relevant for doing statistics or working with data. I know as a tool builder, you're committed to developing human interfaces to data to make individuals more productive. I think that actually provides a really nice segue into a lot of what you're thinking about now, in particular, the Apache Arrow project. I'm wondering if you can tell me about Apache Arrow and how you feel it can facilitate data science work.
Wes: Yeah, I got involved in what became the Apache Arrow project as part of my work at Cloudera. One problem that had plagued me as a Python programmer was the fact that when you arrive at foreign data and foreign systems that you want to plug into, whether those are other kinds of ways of storing data or accessing data or accessing computational systems, that we were in a position of having to build custom data connectors for Python for Pandas or whatever Python library you're using.
Wes: I felt that we were losing a lot of energy to building custom connectors into all of these different things, and this problem isn't unique to Python. If you look at the number of different pair-wise adapters that are available to convert between one data format and another, or to serialize data from one programming language to another, sharing data was something that had caused me a lot of pain. Also, sharing code and algorithms was a big problem.
Wes: So, the way that Pandas is implemented internally, it has its own custom way of representing data, layered on top of NumPy arrays, but we had to essentially implement all of our own algorithms and data access layers from scratch. We had implemented our own CSV reader, our own interfaces to HDF5 files, our own interfaces to JSON data. We have pretty large libraries of code in Pandas for doing in memory analytics, aggregating arrays, performing groupby operations. If you look across other parts of the big data world, you see the same kinds of things implemented in many different ways in many different programming languages. In R you have many of the same things implemented again. So, I was trying to make sense of all of that energy loss to sharing data and sharing code, and thinking about how I could help the data world become a lot less fragmented, and how to make people like me, who are building tools for people, a lot more productive and able to build better and more efficient data processing tools in the future.
Wes: This was just a feeling that I had, so I started to poke around Cloudera and see if other people felt the same way. I was working with folks on the Impala team, people like Marcel Kornacker, who started the Impala project (it has since joined the Apache Software Foundation as Apache Impala), and Todd Lipcon, who started the Apache Kudu project. So, there were a lot of people at Cloudera who essentially agreed with me, and we thought about what kind of technology we could build to help improve interoperability.
Wes: We sort of centered on the problem of representing DataFrames and tabular data. As we looked outside of Cloudera, we saw that there were other groups of developers who concurrently were thinking about the exact same problem, so we bumped into folks from the Apache Drill project, which is a SQL on Hadoop system. They were also thinking about the tabular data interoperability problem. How can we move around tabular data sets and reuse algorithms and code and data without so much conversion and energy loss?
Wes: Very quickly, we got 20, 25 people in the room, representing 12 or 13 open source projects, with a general consensus that we should build some technology to proverbially tie the room together. That became Apache Arrow, but it took all of 2015 to put the project together. Now, how is all this relevant to data science? Well, what the Arrow project provides is a way of representing data in memory that is language agnostic, standardized, and portable. You can think of it as being a language independent DataFrame.
Wes: If you create Arrow based DataFrames in Python, you can share them with any system, whether that's written in C or C++ or JavaScript or Java or Rust or Go. As long as they implement the Arrow columnar format, they can interact with that data without having to convert it or serialize to some kind of intermediate representation like you usually have. The goal of the project, in addition to providing high quality libraries for building data sciences tools and building databases, is also to improve the portability of code and data between languages.
Wes: Outside of the interoperability side of the project, there's also the goal, within the walls of a particular data processing system, to provide a platform of algorithms and tools for memory management and data access that can accelerate large scale data processing. We wanted the Arrow columnar format to support working with much larger quantities of data at the single node scale, particularly data that does not fit into memory.
Hugo: I love this idea of tying the room together, as you put it, because, essentially, it speaks to the idea of breaking down the walls between all these silos that exist as well, right?
Wes: Yeah, yeah. I think if you look across just the data science world, even though functionally we're solving many of the same problems, there's very little collaboration that happens between the communities, whether collaborating at the software design level or at the code level. As a result, people point fingers and accuse each other of reinventing wheels or not wanting to collaborate, but really, if your data is different in memory, there's just no basis for code sharing in most cases. So, the desire to create an open standard for DataFrames ... if you want to share code, it's essential. You have to standardize the representation in RAM or on the GPU, essentially agreeing at the byte or the bit level on what the data looks like once you load it off disk or parse it out of a CSV file. That is the basis of collaboration amongst multiple programming languages, or amongst different data science languages that are ultimately based in C or C++.
Hugo: I remember actually Fernando Perez spoke to this as well in his keynote where you also keynoted, the inaugural JupyterCon, saying, "We welcome so many contributions, but we need to agree on some things, right? These are some things that we've all agreed upon. So, if you're going to contribute, let's build on these particular things."
Wes: Right, yeah. I think the Jupyter project certainly socialized this idea of open standards by developing the kernel protocol: here's the abstract notion of a computational notebook, and here's how, if you want to build a kernel and add a new language to the Jupyter ecosystem, you do it. That certainly has played out beautifully; I think over 40 languages have kernel implementations for Jupyter. But, in general, I think people are appreciating more the value of having open standards that are community developed and that are developed on the basis of consensus, where there's just broad buy in. It's not one developer or one isolated group of people building some technology and then trying to get people to adopt it. I think Jupyter is unique in the sense that it started out in the Python world, but from there they set out with the goal of embracing a much broader community of users and developers. That's played out in really exciting ways.
Hugo: I really like the descriptions you gave and the inspiration behind the Arrow project, in particular the need for interoperability, the importance of these portable DataFrames. I don't want to go too far down the rabbit hole. I can't really help myself, though. I'd like you to speak just a bit more to your thoughts behind the challenge of working in the big data limit. For example, that we have computers and hard drives that can store a lot of stuff, but we don't actually have languages that can interact with ... unless we parallelize it, right?
Wes: Right. A common thing that I've heard over the years is people saying, "Wes, I just want to write Pandas code, but I want it to work with big data." It's a complicated thing, because of the way that a lot of these libraries are designed, the way that Pandas is designed and a lot of libraries that are similar to Pandas. It comes down to the implementation and the computational model: when does computation happen, and what are the semantics of the code that you're writing?
Wes: There's a lot of built in assumptions around the idea that data fits in memory and that when you write A plus B that A plus B is evaluated immediately and materialized in memory. So, if you want to scale out, scale up computing to DataFrame libraries, you essentially have to re-architect around the idea of deferred evaluation and essentially defining a rich enough algebra or intermediate representation of analytical computation where you can actually use a proper query engine or a query planner to execute operations.
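The eager-versus-deferred distinction can be sketched with a toy expression class (entirely hypothetical, just to illustrate the idea): instead of evaluating A plus B immediately and materializing the result in memory, operations build up a tree that a query planner could later optimize and execute.

```python
class Expr:
    """A toy deferred-evaluation node: arithmetic builds a tree
    instead of materializing results immediately."""

    def __init__(self, op, left=None, right=None, value=None):
        self.op, self.left, self.right, self.value = op, left, right, value

    def __add__(self, other):
        return Expr("add", self, other)   # no computation happens here

    def __mul__(self, other):
        return Expr("mul", self, other)   # still just building the tree

    def evaluate(self):
        # A real query engine would optimize the whole tree first,
        # then choose how (and where) to execute it.
        if self.op == "lit":
            return self.value
        a, b = self.left.evaluate(), self.right.evaluate()
        return a + b if self.op == "add" else a * b


def lit(v):
    return Expr("lit", value=v)


# Building the expression is free; work happens only on .evaluate().
expr = (lit(2) + lit(3)) * lit(4)
```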
Wes: So, really, what is needed is to make libraries like Pandas internally more like analytic databases. If you look at all the innovation that has happened in the analytic database world over the last 20 years, of columnar databases and things that have happened in the big data world, very little of that innovation in scalable data processing has made its way into the hands of data scientists.
Wes: So, really, one of my major goals with my involvement in the Arrow project is to provide the basis for collaboration between the database and analytic database world and the data science world, which is just not something that's happened before. Ultimately, the goal is to create an embedded analytic database that is language independent and can be used in Python, can be used in R, that can work with much larger quantities of data.
Wes: But, it's going to take a different approach in terms of the user API because I think this idea of magically retrofitting Pandas or essentially retrofitting Pandas with the ability to work with hundreds of gigabytes of data or terabytes of data ... I hate to say, it's a little bit of a pipe dream. I think it's going to require some breaking changes and some kind of different approaches to the problem. That's not to say that Pandas is going away. Pandas is not going anywhere, and I think certainly is occupying the sweet spot of being the ultimate Swiss army knife for data sets under a few gigabytes.
Pandas 2?
Hugo: So, does this conversation relate to the murmurings we've heard of a potential Pandas 2 in the pipeline?
Wes: Yeah, at the end of 2015, I started a discussion in the Pandas community. Just FYI, I see people out in the community saying, "Wes, thanks so much for Pandas." I have to remind them to go out of their way and thank Jeff Reback, and Joris Van den Bossche, and Phillip Cloud, and Tom Augspurger, and the other Pandas core developers who have really been driving the project forward over the last five years. I haven't been very involved in the day to day development since some time in 2013, but at the end of 2015, I started spending some more time with the Pandas developers who had been building this project; the code base is a little over seven years old. Are there things that we would like to fix? What are we going to do about the performance and memory use and scalability issues?
Wes: I don't know that Dask DataFrame existed at that point. Dask has provided an alternative route to scaling Pandas by using Pandas as is, but essentially re-implementing Pandas operations using a Dask computation graph. But, looking at the single node scale, the in memory side of Pandas, we looked at what we'd like to fix about the Pandas internals. That was what we described as the Pandas 2 initiative. Around that time, we were just getting ready to kick off the Apache Arrow project.
Wes: So, I wouldn't say that we reached a fully baked game plan in terms of how to create a quote/unquote Pandas 2, but I think we reached some consensus that we would like to build an evolved DataFrame library that is a lot simpler in its functionality, shedding some of the baggage of multi-indexes and some of the things in Pandas that can be a bit complex and that also don't lend themselves very well to out of core, larger-than-memory data sets. Instead, it would be something focused on dealing with very large data sets at the single node scale: large, out of core, big data sets on a laptop. We are working on that, and I think the project itself is not going to be called Pandas 2, just to not confuse people.
Wes: The Pandas team all got together in Austin over the summer, and this was one of the topics: we're going to continue to grow and innovate and evolve the current Pandas project as it is right now, but my goal is to grow a parallel kind of companion project, powered by the Apache Arrow ecosystem, that provides a Pandas-like user experience in terms of usability and functionality but is really focused on powering through very large on disk data sets.
Biggest Challenges for Open Source Software Development
Hugo: I'd like to step back a bit and think about open source software development in general. I suppose, spoiler alert, where I want this to go is to talk about one of your latest ventures, Ursa Labs. But, I'm wondering in your mind what the biggest challenges for open source software development are at this point in time?
Wes: Well, we could have a whole podcast just about this topic.
Hugo: And, of course, it depends on the stage of a project.
Wes: The way that I frame the problem when I talk to people is that I think open source projects face funding and sustainability problems of different kinds depending on the stage of the project. So, I think in the early stages of projects, when you're building something new or you're essentially solving a known problem in a different way, it can be hard to get support from other developers or financial support to sponsor individuals to work on the project because it's hard to build consensus around something new.
Wes: There might be even competing approaches to the same problem. So, if we're talking about the kind of funding that can support full time software developers, it can be a lot of money. So, committing a lot of money to support a risky venture into building a new open source project which may or may not become successful can be a tough pill to swallow for a potential financial backer.
Wes: Later on, as projects become more widely adopted, they start becoming foundational; I think the popular term is open source infrastructure. Nadia Eghbal wrote a report with the Ford Foundation called Roads and Bridges about open source infrastructure, and it's about this idea of thinking of open source software as a public good, like roads, bridges, and other public infrastructure that everyone uses. Public infrastructure is great because it's supported by tax dollars, but we don't exactly have an open source tax.
Wes: I could get behind one, but we don't have that same kind of mentality around funding critical open source infrastructure. I think that as projects become really successful and become something people can't live without, they end up facing the classic tragedy of the commons problem: people derive a lot of value from the project, but because everyone uses it, no one wants to foot the bill of supporting and maintaining the software.
Wes: So, whether a project is in an early stage or a late stage, there are different kinds of funding and sustainability challenges. In all cases, particularly as projects become more successful, open source developers end up quite overburdened and at risk of burnout. I know I've experienced burnout many times, and many other open source developers have experienced periods of significant burnout.
Hugo: So, what can listeners who are working or aspiring data scientists or data analysts, or C-level people within organizations, do for open source? What would you like to see them do more of?
Wes: People like me ... I guess I've recently been working on putting myself in a position where I'm able to raise money donated for direct open source development and put it to work. The best way a lot of people can help is by selling the idea of supporting the open source projects that you rely on, either through development work or through direct funding. I think a lot of companies and a lot of developers are passive participants in open source projects.
Wes: So finding a way to contribute, whether it's through money or time, it is difficult because many open source projects, particularly ones that are systems related to infrastructure, they don't necessarily lend themselves to casual, quote/unquote, “casual contributions”. So, if it's your five percent project or your 20% project, it can be hard as an individual to make a meaningful contribution to a project which may have a steep learning curve or just require a lot of intense focus. So, I think for a lot of organizations, the best way to help projects can be to donate money directly.
Ursa Labs Overview
Hugo: So, I think this provides a nice segue into your work at Ursa Labs. I'd love for you to just give us a rundown of Ursa Labs, in particular, how it frames the challenges of open source software development.
Wes: I partnered with Hadley Wickham from the R community and RStudio to found Ursa Labs earlier this year. The raison d'être of Ursa Labs is to build shared infrastructure for data science, in particular, building out the Apache Arrow ecosystem as it relates to data science: making sure that we have high quality, consistent support for all of that new technology in the Python and R worlds and beyond, and improving interoperability for data scientists across those programming languages. The particular logistical detail of Ursa Labs is that we wanted to put together an industry consortium type model where we can raise money from corporations and use that money to hire full time open source developers.
Wes: At the moment, Ursa Labs is being supported by RStudio, by Two Sigma, where I worked right up until the founding of Ursa Labs, and by Nvidia, the makers of graphics cards. We're working actively on bringing in more sponsors to build a larger team of developers. It's really about confronting the challenge that an engineer at a company contributing part time to an open source project may not be nearly as effective as a full time developer.
Wes: So, I want to make sure I'm able to build an organization that is full of outstanding engineers who are working full time on open source software and making sure that we are able to do that in a scalable and sustainable way, and is organized for the benefit of the open source data science world. Anyway, having been through the consulting path and the startup path and working for single companies, I think a consortium type model where it's being funded by multiple organizations and where we're not building a product of some kind, it's kind of a new model for doing open source development, but one that I'm excited to pursue and see how things go.
Hugo: Yeah, I think it's really exciting as well, because it does address a lot of the different challenges. One in particular is a common problem: developers being employed by organizations and given a certain amount of time to work on open source software development, but that time being eaten away because of different incentives within the organization, essentially.
Wes: Yeah. I think there have been a ton of contributions to Pandas and Apache Arrow from developers who work at corporations, and those contributions mean a lot, so we're definitely still looking for companies to collaborate on the roadmap and to work together to build new computational infrastructure for data science. But it's tough when a developer might show up and spend a lot of time for a month or two, then, based on priorities within the company where they work, disappear for six months. That's just the nature of things. The kinds of developers who make big contributions to open source are often more senior or tend to be very important developers in their respective organizations, so they frequently get called in to prioritize closed source or internal projects. That's just the ebb and flow of a corporate environment.
Future of Data Science Tooling
Hugo: So, I've got a relatively general question for you: what does the future of data science tooling look like to you?
Wes: Well, this is speculative, of course, but through my work on the Arrow project, my objective, and what I would like to see happen in data science tooling, is a de-fragmenting of data and code: increased standardization and adoption of open standards like the Arrow columnar format, storage formats like Parquet and ORC, and protocols for messaging like gRPC. I believe that in the future things will be a lot more standardized and a lot less fragmented.
Wes: It's a slightly crazy idea, I don't know how crazy, but I also think that in the future programming languages are going to diminish in importance relative to data itself and common computational libraries. This is kind of a self serving opinion, but I do think the ability to leave data in place and choose the user interface, namely the programming language that best suits your needs in terms of interactivity or software development and so forth, means you can use multiple programming languages to build an application, or pick the programming language you prefer, while utilizing common libraries of algorithms and common query engines for processing that data. We're beginning to see murmurings of this de-fragmentation happening, and I think the Arrow project is helping kick along this process and socialize the idea of what a more de-fragmented, more consistent user experience for data scientists might look like.
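[Editor's note: to make the columnar idea concrete, in a columnar layout like Arrow's each column lives in one contiguous, typed buffer, so an analytic operation scans only the bytes it needs, and the same buffers can in principle be handed between languages without copying. A toy sketch in plain Python using the standard library (illustrative only, not the pyarrow API):

```python
from array import array

# Row-oriented layout: each record is a tuple, so the values of one
# field are scattered across memory
rows = [(1, 10.5), (2, 20.5), (3, 30.5)]

# Column-oriented layout: each field is one contiguous, typed buffer
ids = array("q", [1, 2, 3])              # 64-bit signed integers
values = array("d", [10.5, 20.5, 30.5])  # 64-bit floats

# Summing one column scans only that column's buffer; the row layout
# would drag every field of every record through the cache
print(sum(values))  # 61.5

# The raw buffer can be exposed without copying the data -- this kind
# of zero-copy sharing is what the Arrow format standardizes across
# languages and runtimes
buf = memoryview(ids)
print(buf.nbytes)  # 24 (3 values * 8 bytes each)
```

The real Arrow format adds validity bitmaps for nulls, nested types, and a cross-language specification of the buffer layout, but the contiguous-column principle is the same.]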
Call to Action
Hugo: That's a very exciting future. My last question for you, Wes, is do you have a final call to action for our listeners out there?
Wes: Yeah, I would say my call to action would be to find some meaningful way to contribute to the open source world, whether it's sharing your ideas or sharing your use cases about what parts of the open source stack are working well for you or what parts you think could serve you better. If you are able to contribute to projects, whether through discussions on mailing lists or GitHub or commenting on the roadmap or so forth, that's all very valuable.
Wes: I think a lot of people think that code is the only real way to contribute to open source projects, but actually I spent a lot of my time as not writing code. It's reviewing code and steering discussions about design and roadmap and feature scope. I think the more voices and the more people involved to help build consensus and help prioritize the work that's happening in open source projects helps make healthier and more productive communities. If you do work in an organization that has the ability to donate money to open source projects, I would love to see worldwide corporations effectively tithing a portion of profits to fund open source infrastructure.
Wes: I think if corporations gave a fraction of one percent of their profits to open source projects, the funding and sustainability crisis that we have now would essentially go away. I guess that might be a lot to ask, but I can always hope, and corporations can lead by example. Certainly, if you do donate money to open source projects, you should make a show of it and make sure that other corporations know that you're a good citizen helping support the work of open source developers.
Hugo: I couldn't agree more. Wes, it's been an absolute pleasure having you on the show.
Wes: Thanks, Hugo. It's been fun.