Spreadsheets in Data Science (Transcript)
Introducing Jenny Bryan
Hugo: Hi Jenny, and welcome to DataFramed.
Jenny: Thank you very much for having me.
How did you get into data science?
Hugo: It's such a pleasure to have you on the show and we're here today to talk about spreadsheets in data science, the do's and don'ts, the why's and why nots. Before that, I'd like to find out a bit about you, how you got into the data science community, but even before that, and I know I'm already constructing a complex web of ideas and conversation topics, I'd like you to give us a little teaser of what you're thinking about what you're working on the moment, particularly with respect to using R to interface with spreadsheets.
Jenny: Sure. So yeah, my current main project for our development is doing a reboot of a package I released a couple of years ago called googlesheets. So, I'm doing a reboot of it for two reasons. One is that I have had a major upgrade since then and have better ideas of how to organize things internally and also the google sheets API has had a major version update since then. And so between those two things, it would be very nice to reboot this package. I also spent a week or two recently doing a release of readexcel, so I feel like I'm actually completely in my spreadsheet zone, which is, it's more about the internals of the spreadsheets than the usage, but it definitely feels like where my head is at the moment.
Hugo: And what's the basic idea of this package? I mean, what would the user experience be like or what can they do with it?
Jenny: Basically, we want there to be a function that plays the role of read.table or read.csv, but it's read.sheet or read.excel. And so the idea is to have this blissfully simple experience where you point this function at a local file in the case of an excel spreadsheet or a google sheet, in the case of google sheets, and it just shows up in your workspace as a dataframe. So that's the goal. Googlesheets can write spreadsheets, which we do not attempt to do with read.excel, but I think it's mostly about the reading.
Hugo: Yeah. And this is incredibly important, right, because a lot of people are working in spreadsheets, a lot of people are working in languages such as R. And it's really, I suppose, essentially to facilitate communication and collaborative, reproducible workflows between these communities.
Jenny: Yeah. I think if you're completely honest with yourself, like spreadsheets are kind of like the English of the data world, like everyone speaks it and even if that's not your language of choice, unless you live in a pretty rarefied environment, like you're going to have to be able to cope with data that comes in that format and maybe even emit that format as well. I suppose people who work at companies that are web 2.0 companies and all the data is being collected by professional data engineers from the beginning, like they certainly don't have to deal with this and they are mostly dealing with databases, but the long tail of data analysts and data scientists are working with data that was collected in very, very different ways and spreadsheets just seem to be unavoidable.
What is Jenny known for?
Hugo: And this is something we're going to delve into a lot later: arguments for and against even using spreadsheets, there are people in the community who want to get rid of spreadsheets entirely. So we're going to delve into all of these types of things and see why essentially we need to accept and almost develop a system of best practices around the use of spreadsheets. But before all that, I'd love to talk about your background and also what you're known for in the data science community. So what do people know you for currently?
Jenny: That's a good question. Maybe googlesheets or readxl. I hope I stand for someone who is right at the interface between the R user community and the R developer community so that I do enough package development at a high enough level that I have credibility there, but that I also am very, very in touch with my roots as an applied statistician and data analyst and remember what the pains and the joys of that are and let that inform the packages I work on, or influence the people around me who are working on other packages and really try to be like a diplomat there, to create productive conversations between the people making the tools and the people using the tools
Hugo: And is there a need for diplomacy in this situation?
Jenny: I don't know if it's diplomacy because that makes it sound almost like people are in disagreement, so maybe it's more a translation and pointing out specific things about a user experience that might not occur to you if you aren't a primary data analyst all the time anymore. I mean, ironically, the more deep you get into tool development, like the less time you have to do data analysis yourself. So it's also increasingly easy to get off in lalaland about what you think is important. And this is why even though I'm transitioning away from being a faculty member and in the process of resigning, I think I will continue to do some teaching because that's actually my best current mechanism for constantly reminding myself what it actually feels like to use all of these things in the real world and to use them starting from zero, not having some huge tower of abstraction that you have access to.
Hugo: I liked this idea that you described, to be sitting between sitting at the interface between users and developers and perhaps instead of diplomacy we can think of it in terms of facilitating conversations and acts of translation. And I think, I want to say one of your greatest creations, but that isn't actually what I mean because I have no idea about all the things you've created... But your first package I was introduced to, I think was reprex, which is about ... I'm just about to tell you what your package does, so hopefully I get it right, but for the purposes of the listeners, it helps people to create reproducible examples to put on GitHub issues or StackOverflow or whatever it may be. StackOverflow, right? This is a place where the wall, I'll call it a wall between users and developers can be really tough sometimes, right? So maybe you can speak to this barrier and how reprex will help in the situation.
Jenny: I guess that also reminds me of... I think another thing that maybe people know me from because it was the first thing I was brave enough to start really putting out there was a course I taught at UBC and developed called STAT 545. It started as our department's data analysis course and I took ownership of it about 10 years ago and then I just gradually made it more and more about modern R and data science and data wrangling. And so for the past couple of years I taught that course entirely on GitHub and at the end of the semester I would analyze our GitHub conversations, we used issues as a discussion forum basically. And I saw that I was commenting on 300 to 500 issue threads per semester and those conversations are most effective for everyone involved if there's a lot of code in them either instead of or in addition to prose. And so I developed reprex out of desperation basically, because I constantly wanted to encourage these students to like give me a little piece of code that actually runs-
Hugo: That you can run. Locally run.
Jenny: That I can immediately run, like, it's self contained that shows me your R and drops me right into your personal pain. And so I developed reprex really as a tool for me to stay sane. You can always create a standalone R script or R markdown file and use existing facilities with knittr and the R markdown package to do this, but that was always just enough friction that I didn't do it. And so I was like, it seems like a silly idea for a package, but I'm just going to create a package so that I can put four lines of R code on my clipboard and get it perfectly formatted for the venues where I most often share these things and it encouraged me to talk more through code than sort of hand waving and speculation. And so that was how it got started.
Hugo: That's fantastic. And I think GitHub is an incredibly important place for it or any type of collaborative code writing, issue raising, but I also think StackOverflow where... a common trope of StackOverflow is first time question askers asking a question and a snarky developer or expert coming in and saying, "Hey, why didn't you include reproducible code? Why can't you do this?" And so forth. So reprex actually very much helps dissolve that barrier between users and developers in that case, as well.
Jenny: Yeah, because I'm fairly sympathetic to ... It feels like this person didn't do their homework. But then I also know a lot of people don't even know what that means. So part of what I like also about reprex is that it administers a great deal of that tough love in the privacy of your own home. So it will error if you haven't loaded the packages and it will error if you're using data that doesn't exist or that you know that only exists because of all the code you ran before lunch yesterday. But you're not having to expose all of these missteps publicly. So it really lets you get your ducks in a row before you go to StackOverflow or GitHub.
Hugo: Before stepping out of the house.
Jenny: Yeah, like you could still have an example that was way too complicated and that people might regret, but to the extent that there are just like technical checks as to whether this is a self contained example. It lets you sort all of that out for yourself, which I like.
Hugo: Yeah, and that's incredibly important because when people get shut down as first time users of StackOverflow, there's a high probability they'll never return to ask questions and so we're literally not even letting people in the gate.
Jenny: Yeah. So having this like very unemotional sort of tool that can help you get maybe 60%, 70% there, there's still an art to making a minimal example, but then again, once you can do that, you're often in the situation where you can answer your own question too.
Why do people use spreadsheets?
Hugo: And something you mentioned there was the idea of empathy for early users and I think this really ties into the conversation about empathizing with people who use spreadsheets. So before we get into arguments for and against the use of spreadsheets in the data community, data science, analytics, whatever you want to call it, I want you to tell me why you think so many people use spreadsheets.
Jenny: Yeah. I think there's a couple of reasons, but the first one, it's like everyone has it installed. And I think developers underestimate the difficulty of getting software environment set up and so I think one huge draw of spreadsheets is they just feel and basically are ubiquitous, like even if people don't have Office installed, which is pretty unusual for your sort of average person, like they usually do, you can just go to google drive and create a spreadsheet. So I think the fact that there is essentially no barrier to entry is really important.
Hugo: And that means you can do stuff right away, right?
Jenny: And there's no just sort of black screen of death with a blinking cursor. You like you just like put your cursor in a cell and start doing things and most people can follow their nose and get something accomplished and they don't have these wildly frustrating experiences early on. You could have it later, but early on were like, you just are completely shut down because you have a misplaced comma. Once you start doing fancier things with Excel where you're basically programming with it, you can also get treated to these very frustrating experiences. But I think that ubiquity and the fact that lots of people can follow their nose and do basic tasks makes them incredibly appealing.
Hugo: And I also think though, in the programming community, RStudio has done a good job at for example, abstracting over the shell or terminal. So you can just launch Rstudio and install packages there instead of doing it through using Unix commands or whatever it may be.
Jenny: Yeah, that's true. More things that I've mentioned in the past that are very appealing about spreadsheets are people will put basically but different datasets on their different worksheets, and they might use one worksheet as a place where they put notes. But the spreadsheet still acts as this receptacle that holds the disparate parts of a project together. And if you aren't someone who has a lot of file system disciplines about putting things in folders and or maybe using zip archives to transmit things, that basic like it's file organization that the spreadsheet is doing for you. It's like keeping several tables together, is also useful and the fact that they are reactive so people will create summary cells and charts and pivot tables and those things automatically update when the data updates.
Hugo: Yeah. You don't need to run a script again.
Jenny: Yeah. Like you don't need to rerun a thing. So there's a lot of practical problems that they do provide very simple solutions to.
Hugo: Yeah, and this is enough for a lot of people, right?
Jenny: It is. If you don't analyze data for a living, I think there's a lot of people for whom spreadsheets is probably the right place to be doing this. It doesn't mean that some of their practices can't get better, but I think it's a reasonable place to do data analysis for a lot of types of people and types of jobs.
Why is the data science community anti-spreadsheet?
Hugo: So having established that, why are certain subpopulations of the data community, so anti-spreadsheet and, in particular, a lot of us? People who evangelize and use R and Python on a daily basis?
Jenny: Yeah. So I've tried to figure this out because I also can very easily get into a scornful, mean attitude about this and I think it's because so many of us have had some spreadsheet or multiple spreadsheets sent to us that exhibited so many problematic behaviors that wasted so much of our time that all that negative energy, we store it up and then occasionally just emit it back out either at spreadsheets in general or at people who use spreadsheets. And it makes sense because if you do an update analysis and people provide you this data and spreadsheets, there are lots of things that people do in spreadsheets that on the other end just waste colossal amounts of time on your behalf and it is hard to not get testy about that.
Hugo: Yeah, and I think, correct me if I'm wrong, but you actually have a GitHub repo somewhere, we'll find it and included in the show notes if this is true, but that collects scary Excel stories, right?
Jenny: Yeah. I think I didn’t send you that link, but maybe you found it. Yes, because there's scary technical things about Excel, which honestly are not the ones that get me that excited. Those are about numerical accuracy and properties of it's random number generator, whatever. I am more, I don't know if enthusiastic is the right word, but you know, interested in the stories of where sort of human frailty and spreadsheets combined to create great messes. I gave a talk at user R when it was at Stanford on spreadsheets and leading up to that I ask people to share scary Excel stories with me and I got a ton of really interesting responses, many of which people were like, I will tell you this privately, but I can't tell you in public because our company lost $20 million dollars because of this spreadsheet, and that there were multiple things where the reason that people couldn't share them publicly was that the scale of the loss caused by some regrettable spreadsheet behavior would have been too embarrassing. So those are the stories that I find more interesting because I think it indicates when someone's workflow has outgrown the spreadsheet model. So that's, I guess the downside there. They're ubiquitous, they're easy, but it's really hard to recognize the point at which it is no longer the right tool for the job because on any given day you're going to be more effective if you just keep using it than stopping what you're doing, training everyone up on databases and R and Python. And so I think that, you know, that's an organizational challenge.
Hugo: So then knowing strong arguments for using spreadsheets or why people, a lot of people use them and knowing good reasons for people being anti-spreadsheet, what are the strongest arguments for a tighter integration between data science and spreadsheets as opposed to getting rid of them completely?
Jenny: I think it's just realism.
Hugo: Yeah, totally.
Jenny: I think that's it, in one word. Even if we stopped producing them today, which is completely implausible, we already have too much data that's locked into them to not support that format. I maintain readxl now and I spend as much or more time on the xls side of it as the xlsx, so on the real legacy spreadsheets. When I first started working on it, I was like, "That's insane. Like are they going away?" But they're not going away because first of all we have this huge stockpile of legacy spreadsheets and then there are a lot of legacy tools that are still pumping out weird xls files as we speak. So yeah, I think just pure realism is you have to keep dealing with it.
Hugo: And I suppose also spreadsheets are good for a lot of things that maybe you don't want to do in R or Python, right?
Jenny: Yeah. I mean I think they're very handy for data entry. They're not the ideal tool for that, but for lightweight data entry, they're very handy. I've used some of the CSV editors that aim to wean people off spreadsheets, but you can just tell that Microsoft has lavished engineering hours on the Excel user interface. And even though I want to love other tools for editing rectangular files, I end up back in Excel even myself.
So, they're great for data entry and then, I think they're good for just looking at your data. Of course, if you use RStudio, for example, that has a pretty good dataframe inspector as well, but I think a lot of people just like the tangibility of seeing their data in a rectangle, preferably a wide rectangle right in front of their eyes and they can scroll around in it.
Hugo: Yeah, and you actually, you shared with me recently an article which we’ll link to in the show notes by Matthew Lincoln at the Getty Research Institute, it was called “Best Practices for Using Google Sheets in Your Data Project” and, if I recall correctly, this was a huge project of theirs and maybe still ongoing and it was a database migration in which they had to get a lot of experts involved and they couldn't automate it using a lot of reproducible scripts, but they didn't need to reproduce it as well. It was a one-off thing they needed to do and we always rant about reproducibility, reproducibility, but that isn't always the conversation that we need to be having in practical terms.
Jenny: Yeah. I loved this blog post. I think Matt posted it in response to a tweet by Hadley and I'm actually not even sure exactly what ... I mean you would think I would ask this, but so Hadley asked for good papers on best practices for using Google sheets.
Hugo: Can I just quickly ask when he said for using Google sheets, did he mean Google sheets by Google or the package that you developed?
Jenny: I think he meat Google sheets by Google. That was my assumption. Hugo: But Matthew also uses your package I think.
Jenny: Yeah. I think we were just more interested in the way people use that as a platform, Google Sheets. But yeah, so I think Matt, I have long been suspicious of spreadsheet training that tells people they should keep their spreadsheets in tidy form because... So that would be like, usually that means like tall and skinny.
Hugo: What a data entry job, right?
Jenny: Yeah. I think for data entry and data inspection, wide is usually better and that's why we get paid the big bucks to reshape this stuff in R. So I think you should, you should let people make wide spreadsheets and do the wrangling in a scripting language. So I was really excited to see that that was one of his points that he has a heading called “80% of sheet design should be for the editors and only 20% for the data scientists” as opposed to forcing people to create that beautiful tidy dataframe because it causes a lot of repetition. There're more chances for typos and you know that that gets eliminated if you use some of other practices he talks about. But I thought that was a really good point that we can do JOINs later, we can do reshaping later. So if you apply the principle of do not repeat yourself, try it out. That means to let people make separate sheets and wide sheets and we take care of that stuff on the other side.
Hugo: And could you just remind our listeners what the difference between wide and long is?
Jenny: So wide, I think the classical way I think of it is maybe you have data on five people and you take their blood pressure at six time points, the wide form of that data would have five rows, one per person and probably a column of names or IDs. And then what did I say six columns, one per time point. But then if you want to start making visualizations of that with ggplot2, or if you wanted to take subjects specific averages or fit subject specific models, you'd want that to be in tall form where you have a column of subject IDs, a column of time points in a column of those blood pressure measurements or whatever. And so in R world, if you use tidyverse and tidy R, that's like spread or gather or going back further in time, it's melt and cast. But these are the verbs that move you between those two forms.
Hugo: And, of course, as you say, when using statistical programming languages such as R and you want to look at summary statistics across groups or across people, summarize and groupbys and this type of stuff, you want it long format, whereas for data entry and just “look at your data purposes” you want it in white and I love the idea that, we don't want to necessarily get all these expert editors in such a project as this Getty Research institute's database migration to learn about all this stuff and all the programming so that we can benefit from it. We want them to work in their native environment in wide format, putting the entries in, inspecting it, so that they can do their best job as well, right?
Jenny: Yeah. I mean literally look at the aspect ratio of your screen. Like you're going to be able to put more data in front of your eyeballs if you make it wide. And so when you're doing a task that will be enhanced by having more data in front of your eyeballs, I think you should be allowed to work with wide data and then for someone else or you operating in a different mode, we'll make it tall.
Should data science be rid of spreadsheets?
Hugo: So these are incredibly strong arguments for a tighter integration between data science and spreadsheets. What are the strongest arguments for data science to get rid of spreadsheets completely?
Jenny: I think the biggest problem is that it's so easy to create this tossed salad, they have data and programming logic and formatting and figures, and it's all mixed together and if you need to come back and operate on one of those things or see how changes in the data lead to changes in the summaries because it's just all mixed up together in a way that's incredibly hard to separate, you outgrow this workflow very quickly because it's so hard to control and track evolution and make sure that you're not changing things you don't mean to change... we all, we've all had that dialogue box where you close Excel after quote just looking at something, and it's like, would you like to save changes? And you're like, I don't... I don't think I meant to make any changes. Did I mean to make a change and then you can't remember. So it does feel kind of like the Wild West where it's very hard to control things quite the way you'd like to. So that's where I think it really breaks down.
Hugo: Would version control help with this in some way?
Jenny: Oh, so Greg Wilson, I think this has been like one of his biggest wishlist items for a long time. So he's the founder of Software Carpentry and is now with you at DataCamp. It'd be amazing if we had like truly usable version control for spreadsheets, and I think there are some early efforts at that. It's just also, it's really hard to know how would you inspect the diffs, like what would that even look like? That's just a very, very hard problem that maybe by that time you should be working in a system where your data is separate from your logic, but it is true that Google sheets in particular as being just part of Google drive and Google docs in general, their version control is getting better and better, and I do find it reasonable to look at Google sheets evolution over time. I don't work with it in the same sophisticated way that I work with git histories and commits and diffs, but I have some hope actually the that version control for spreadsheets is part of the answer.
Hugo: Something you spoke to was the fact that we've really got this salad of everything, data, logic, formatting, UI, all of these things. Is there a world in which we could slowly tease these apart and separate them by encouraging, maybe forcing in some ways, best practices. So one example I think would be having spreadsheets in which formulas aren't allowed in cells.
Jenny: Yeah, well, and that's part of the promise of these CSV editors that, in which a whole whack of features have just been deleted. So there's a lot of stuff that has simply been made impossible, which is good. Yeah, so forcing people to keep data separate from logic like one way to do that is to not have formulas. I think another way in our existing spreadsheet applications is to designate by naming certain worksheets our data sheets and other worksheets or places where we do summary statistics or makes figures and use the tools that the spreadsheets provide for protecting different parts of the sheet in different ways and naming different parts of the sheet in different ways. In this new reboot of the googlesheets package, I'm going to make sure that Named Regions are a first class way of reading data out of a sheet, so that it works just as well as specifying a worksheet or a range because I feel like they could be a mighty weapon and fighting this battle that if you could get your spreadsheet-using collaborator to make sure that they're always adding data to a specific Named Region then they can go do whatever they want and other parts of the spreadsheet. But you will always know that you can pull out the raw data cleanly.
Hugo: That's really cool. And of course, there's a whole list of things that we could potentially prohibit. Matt Lincoln once again from the Getty Research Institute in the blog post we've been discussing, mentions a variety of examples like using formatting only to help communicate or highlight data content if that, right. But definitely we don’t want any formatting to carry data.
Jenny: Yes. So conditional formatting is great, but not formatting to convey, imputed versus measured. There's a spreadsheet in my life that spawned my interest in Google sheets where my husband and one of his university friends had a ridiculous and I now believe sort of unsafe, weight loss bet, where they tracked their weight. And we actually now have I think something like seven years of daily weight on these two guys. But recently my husband emailed me and he was like, "How do I get out the data as for whether something's in italics or not?" And I was like, "Why do you even need to know that?" And he's like, well the Italics are where, like someone was on vacation and mostly these two men actually travel with their scales, which is crazy, but sometimes they don't or so the Italics might mean-
Hugo: Do they sell travel scales?
Jenny: No, we travel with the real ones!
Hugo: Okay, that's commitment.
Jenny: It's crazy. So the Italics either indicate fake data or there was some interpolation or using like random scale and hotel or whatever. But yeah, so now they've backed themselves into this corner where they have embedded data in font, which is a big no no.
Hugo: Absolutely. I think one of the basic ways I'd like to think about this and I think that a lot of people do is, if you can't export it to a CSV retaining all the information in the data, there's a challenge there.
Jenny: Yes, for sure. Although it is true, another thing I'm doing in this reboot of googlesheets is one of the ways you can get the data back out is as, like you can just get raw cell data, so it's still arranged in the rectangle that you expect, but all of the columns are list columns and the data available for a cell is just everything the API sends, including the formatting. So if you do find yourself in one of these really regrettable situations, I'm trying to provide as much help as I can to let the data analyst dig around in there and get it out without having to use some sort of special purpose package.
Hugo: That's awesome because once again that speaks to your role as a realist, right?
Jenny: As realist and like I don't think I can build every possible ... I can't build something to fulfill everyone's downstream need. So it's more like, while I'm here and I have all of this information sitting in a little list, let me just create an exit point for this function right now and you know I've dealt with AUTH for you. I've translated human oriented ranges into the language of the API. I've arranged it in a rectangle, but before I do my column typing and column coercion, I could just give it to you right now. And so yeah, I'm just trying to make it more of a platform for people who need to actually get in there and munge the actual cell information. I might as well hand it over since I have it.
Hugo: So before we move on, I just want to focus on something that when your relationship with spreadsheets run so deep, that you said something I've never heard anyone say before. You said, "There's a spreadsheet in my life that..." And I just wanted to say I love that. I love that figure of speech.
Jenny: It's part of our daily life and it gets updated once a day and I think I have a fondness for spreadsheets also because I had an undergraduate degree in economics and German and did management consulting, but then I decided that I really liked doing data analysis and it was because of my deep immersion in spreadsheets in the management consulting world, but I didn't have all the math that I needed to apply to statistics grad school. So in the year or two where I was preparing, I supported myself doing high end Excel temp work. So even though it is not my tool of choice now, I do have some nostalgic fondness for it. It was there for me when I needed a way to earn money.
Hugo: It's an old friend.
Jenny: It is.
Future of Spreadsheets
Hugo: So we've discussed bits and pieces of this in passing, but I'd like to focus on a future in which spreadsheet use is tightly integrated into all of our data science workflows. And I want you to tell me what this future looks like to you. What's involved in this future?
Jenny: I don't really know if this is the future I want to live in, but if we were to do that, I think it would be great to have something that's like Excel, but where you can shut down some of the features that are especially problematic. So, for example, that you can shut down the use of formulas or you could shut down anything that wasn't conditional formatting. I'd love to have some of the featurelessness of some of these CSV editors, but with the really slick user interface that Excel provides.
Something I've talked about with Aaron Berg who works at RStudio, I believe in either Solutions Engineering or Customer Success, is also trying to make it easier to template spreadsheets for people where you create the spreadsheet that you would like to get back and you kind of leave those post it notes with like a yellow arrow and it's like sign here.
This is all you're allowed to do. A few people have opened issues on readxl where they clearly have some workflow that uses the Excel templating facilities and so, I started to engage those people in conversation to figure out like, what do these workflows look like. If you're going to need to deal with spreadsheets, maybe you should. I guess it's a bit like statistics where people complained that they're brought in to analyze the data after the fact, but they should have been involved in experimental design. So if you know that spreadsheets are going to come to you later, it would be better to be involved in their design and the first place and be able to present templates for people to fill in where you've kind of locked things down.
How can spreadsheets be incorporated with R and Python?
Hugo: Yeah, and this actually almost answers my next question, which is how do we develop a culture in which we ensure that spreadsheet use can be incorporated into a workflow where downstream people are using languages such as R and Python?
Jenny: I think it is creating these spreadsheets that have predictable structure. Yeah, I think that's the main thing and then, as you said, making sure that the data you need to get back out is something that can survive that translation. I mean, of course, we can recover the formatting. It's much easier to recover it from a Google sheet than from Excel. So I've never made myself do that with an Excel spreadsheet, although it is possible and some of the other packages do, but it's still quite painful. It'd be better to just design that away.
Hugo: I feel like we've imagined a bright future. It may not be quite the future you and I want in all our idealism.
Jenny: Well it’s one that it keeps us employed.
Hugo: Yeah, absolutely. So how do we get there? How do we get to this future for you and me as a community of tool builders and educators?
Jenny: Well, I guess I have a sarcastic answer to that.
Hugo: Let's have that one first.
Jenny: When agencies like the NSF and NIH, so these are the two main funding agencies in the US. So NSF, National Science Foundation and NIH, National Institutes of Health, make these big splashy announcements of how much money they're going to invest in things. All I can think of is that like, the rounding error on most of those efforts would fund team of professional software engineers to make a really good CSV editor. I mean for both of those agencies, but especially the NIH, Excel has done so much damage to genetic and genomic data because of its obsession with converting things to dates. That would have paid for itself. And so I kind of feel like there are a lot of agencies that are funding work that if they could ever be seen to see like this is like a really fundamental tool that we're missing.
Hugo: I mean there is a great example of gene names being mangled and turned into dates, right? Which I'm sure that either of these institutions would not want to happen or be published.
Jenny: Right, I mean, we should put some of these links in the show notes, but like major genomic databases have non-negligible contamination that is characteristic of data being passed through Excel that probably never should have seen Excel. And so that's an example of where like this is, it does seem like something that it would be in the best interest of certain agencies to fund. I don't really think that's going to happen though so that was a sarcastic answer.
Hugo: So what's the non sarcastic or is there a more sarcastic answer or what's the realistic answer?
Jenny: You could say like hey, I now work at a company that puts nice user interfaces on things, like would I ever internally campaign for RStudio to make some sort of editor of rectangular data? And so far, first of all I don't even know if I would be successful. Probably not, because it's a really big job. I think that's why someone else should do it. But yeah, I haven't done so. But that's some company that is really good at putting user interfaces on things or Microsoft could choose to create a version of Excel where some of these things can be shut off in a configurable. But I don't really think any of these things are going to happen, which is why I think these other practices are more realistic.
Hugo: An open source solution will be great downstream. That doesn't necessarily seem likely at the moment.
Jenny: Yeah, I guess that would give you the ability to find that environment variable that turns off date conversion, then toggle the value to “no”. Yeah. So that's, I suppose this would be a very complicated application. I don't know how many people would actually tinker with it, but yeah, creating a configurable spreadsheet where some of these types of functionality could be configured or shut down would be great. I don't really think that's very realistic though. Maybe Excel’s just too successful doing exactly what it's doing.
Hugo: So what's a realistic solution or future?
Jenny: I think it's just the type of things that we're already doing, which is making it easier to get data in and out of these things and developing best practices for working with it. I think Matt's blog post is a good example. Karl Broman and Kara Woo have a recent paper in this collection of articles that Hadley Wickham and I elicited from people that appeared as a preprint collection and then is also now a special issue in the American Statistician. So they wrote a lot of concrete practices, stuff that you could give to a collaborator and say like, "Hey, next time you make a spreadsheet, think about these things. It's going to make it easier for me." Data Carpentry, which is an offshoot and sibling organization of Software Carpentry also has quite a lot of curriculum around making spreadsheets that you won't hate yourself later when you or someone else needs to import them for analysis in another system. So I mean maybe I'm not aiming high enough, but I tend to think some of these parameters we don't have control over, so it's more about like harm reduction than eliminating the use of illegal drugs that we should just cope with helping people do this more responsibly and then wean themselves off of it.
Hugo: Yeah, so I actually think both you and Hadley in varying contexts have referred to certain aspects of these conversations being like sex education and safe injection sites.
Jenny: Yeah. I think that's probably a bad analogy because there's a lot of sort of value judgment there, but it is this sense that you can't completely control everything and people decide to do what they're going to do and they are responding to a whole lot of different incentives and just try to pick up where we are and make things more productive.
Hugo: There is a question whether encouraging best practices or having software with a subset of features so people actually can't do the things that aren't best practices is the best approach.
Jenny: Yeah, I don't know how frustrating people would find that, but it certainly would be a great mercy to those of us on the receiving end of these spreadsheets.
Hugo: Yeah, exactly. We'll link to all of these things. We'll also link to a repo you have up called sanesheets --
Jenny: That was a bit of a rant that I got baited into, but it was kind of fun.
Hugo: You also have a header on it, which says, "This is a rant about spreadsheets." So you definitely managed expectations really well there.
Jenny: Truth in advertising, for sure.
What is your favorite data science technique?
Hugo: Fantastic. Okay, so we've had this great conversation about spreadsheets, something I'm interested in when interviewing people on the podcast is getting a sense of what they really enjoy doing and of course spreadsheets is one of those things and developing, and being an interface between users and developers, but I'm just wondering generally what's one of your favorite data science techniques or methodologies?
Jenny: I like thinking about workflow a lot, which is I think a discovery I made about myself when I decided to take my teaching very seriously and take teaching STAT 545 very seriously, because a lot of that is not inventing new theorems, it's not necessarily developing packages, although that was a precipitating event for me getting more involved in that. But just like recognizing patterns that recur in data analysis, and giving them a name and like articulating th is the input, this is the output, this is the challenge getting from the input to the output and here is a template for how that can work. I enjoy that very, very much and it's kind of hard to find your place in the world doing that. I feel like I have found that at RStudio and then it leads naturally to helping develop packages, either packages that I own or providing sort of constant drip of input on other packages. Because it shows you what those gaps are and there are repetitive patterns that reappear and people's data analyses and it's really hard to make this move from step three to step four and it comes up a lot. And that also explains why I work a lot on data import. Our life in the tidyverse, especially for data manipulation, is good. There's still a lot of room for improvement, but none of that matters if you can't get your data in in the first place. And this is part of why I was excited, although a little intimidated to start maintaining readxl because I've required students to use a package for bringing data in from Excel in the past. And some of the R packages that do that, understandably rely on some preexisting Java libraries or Perl libraries. But every year when I did this, like 20% to 40% of the students wouldn't be able to get these packages to actually work, because of some fiddliness with that interface. And then it's as if the package doesn't exist. Like if you can't get it installed or it won't import anything, for you it doesn't exist. And so that's why I actually enjoy working on data import packages and I'm happy to write glue code so that you could embed some library that knows how to read xls files and embed it in an R package so that people don't have to figure out how to install it themselves. Because if you can't get the data in at the first place, you can't enjoy all this downstream goodness and it's an area I love working in because I think it's super useful. And then also given how useful it is, there are relatively few people willing to do it because it is fairly unglamorous work, compared to working on machine learning or sort of fancier statistical methods. So it also means that you're allowed to do stuff that's very important and appreciated and there's not like a ton of cutthroat competition doing it. So there's a lot of things I enjoy about it.
Hugo: And I think something you spoke to then, this idea of recognizing patterns in workflows, is essential and also it helps you to decouple various patterns. We actually had a segment based on one of your blog posts once on this podcast, your Project-oriented workflow, which was great because, in this short post, you actually are able to decouple our workflow which, you described as you know, personal tastes and habits from the product, which is the actual project, and that's incredibly helpful for a lot of working data scientists I think.
Jenny: Yeah, a more recent blog post ... actually I haven't written the blog post. I've done a webinar on this, but I really got a little kick about how do you structure workflows when you need to do something for every row in a dataframe? That's an example of a pattern that comes up fairly frequently in real life and is fairly awkward in R due to our columnar orientation. So that's like a perfect example of like nobody needs to go out and prove some theorem but someone needs to really articulate what are the common feature of these problems. A lot of times people think they have this problem and they don't. So like how do you know when you're missing an easy solution? Like, are you overlooking an existing vectorized function? Are you overlooking a groupby summarize operation, there's a list of ways you can get yourself out of this pickle and then, no, there really are some cases where you actually have to do this and then laying out what your options are and again, like outlining the three to five typical complications and how do you work your way around them. But I enjoyed doing that a lot and then it inevitably points to either features you could put into packages or sort of gaps or bugs and existing packages. But so all of it still informs package development, but sometimes it's also just making our existing packages, like getting more mileage out of them by like showing people how to use it.
Call to Action
Hugo: Totally. So to wrap up, I'm wondering if you have a final call to action for all our listeners out there.
Jenny: I think my final call to action is to encourage people to continue to think and work and learn out loud. I feel like I've been using R for a really long time but did so very quietly for a long time and starting to be more public about that and opening up my teaching materials, imperfect though they may be, has overwhelmingly been very positive and it's a little scary to put imperfect work out there. But for the most part you'll get positive engagement with people and it absolutely has accelerated my own learning and why I'm doing what I am doing today, is working up the courage to do things a bit more in the open bit more collaboratively. So I think that would be my call to action is that it's an exciting time with how quickly practices can be transmitted from one person to another. This is so like a weird note to close on, but I spend a lot of time on Twitter probably more than I should and I came across this amazing video of someone whose twitter handles is Breyonce and it's her doing the background dancing of Beyonce’s recent performance at Coachella, I don't even know if that's the right way to say that. It's an example of something that I only see in print.
Hugo: I think it's right.
Jenny: Okay, good. And then it was this great juxtaposition of like the pros doing something and this woman doing an incredible rendition of it like in her living room and it did remind me of a lot of what I feel like I see going on in R. And I'm sure other software communities today where the famous person like the Hadley Wickham or the Mike Bostock or the Wes McKinney does something and, because they are very much working and they're open and working on open source things, other people can immediately access it, see it and mimic that. And it's kind of amazing how quickly people can level up if they are listening and willing to engage in that conversation. And it does take a little bit of courage, no doubt. So that would be my call to action.
Hugo: I think that's a great note to end on. I think that's a wonderful analogy. I'd also like to see Mike Bostock, Hadley and Wes do the backup dancing for Beyonce one day.
Jenny: That is another actual call to action here.
Hugo: Yeah, exactly. And please send me the link to Breyonces tweet because I'll definitely include that in the show notes.
Jenny: I will for sure. It's a great joy to watch.
Hugo: Jenny, it's been an absolute pleasure having you on the show.
Jenny: Thank you very much. It has been an honor.