Max Kuhn is a software engineer at RStudio, where he is currently working on improving R's modeling capabilities. He has a Ph.D. in Biostatistics. Before RStudio, Max was a Senior Director of Nonclinical Statistics at Pfizer Global R&D in Connecticut, and he applied models in the pharmaceutical and diagnostic industries for over 18 years.
Max is the author of numerous R packages for machine learning (caret, C50, Cubist, recipes, and others) and reproducible research. These packages are downloaded over 200K times per month. Max is an Associate Editor for the Journal of Statistical Software. With Kjell Johnson, he wrote the book Applied Predictive Modeling, which won the Ziegel award from the American Statistical Association, recognizing the best book reviewed in Technometrics in 2015. They are currently writing another book, on feature engineering.
Hugo is a data scientist, educator, writer, and podcaster at DataCamp. His main interests are promoting data & AI literacy, helping to spread data skills through organizations and society, and doing amateur stand-up comedy in NYC.
Hugo: Hi there Max, and welcome to DataFramed.
Max: Hi. Happy to be here.
Hugo: Great to have you here. Really, really excited to talk about the role of data science in pharmaceuticals, which you've worked on for a long time. But before we do that, I'd like to find out a bit about you. What are you known for in the data science community?
Max: Well, I've been a working statistician for 18 years. I think the thing that people know me most from is some R Packages for predictive modeling, like caret and a few others. I wrote a book in about 2013 about predictive modeling, and I think people know me from there.
Hugo: What's predictive modeling?
Max: That's the term I use basically to cover machine learning and pattern recognition, and I mostly use it instead of machine learning because the phrase machine learning has been co-opted maybe three times in my career. When I was first in graduate school, it meant old school neural networks, and then five or ten years after that, if you said machine learning, people assumed you meant kernel methods like support vector machines, and now we're back to deep neural networks. For me predictive modeling is a little bit better because it doesn't have those connotations, and it's a direct name that describes exactly what you're trying to do, I think in a more understandable way.
Hugo: So it's about making predictions?
Max: Yeah, basically.
Hugo: Awesome. And of course, I started using caret, I think, six years ago when I was working in applied math research in bio...
Max: Yeah, yeah I remember that.
Hugo: ... to where you are, and one of the things that I found really wonderful about it is how intuitive the API was. Also, your vignettes and your book provide wonderful educational resources. But the other reason I think historically why caret is strong is because it arose out of a real ... you made it because you needed it, right?
Max: Yeah, I worked in pharma, at Pfizer for a while and when I first started there, it was like 2005, the good and bad of working for a huge corporation is you show up on day one, they give you this nice, high-powered laptop and you feel like you're ready to go. But then, you don't have access to anything. It takes 50 help desk tickets to get access to any data, and so I knew I was going to have two to three weeks at least of no productivity, besides going to meetings and stuff. I talked to my boss and said I know I'm going to be doing computational chemistry support and all that and I've been thinking about writing an R package that'll make that a lot easier. And I've got a couple weeks I think before I would get any data and so I'd spent almost all my time outside of meetings just writing the skeleton of that and what it should do. And I've re-written it three times but that was developed to support computational chemistry inside Pfizer.
Hugo: How long ago was that?
Max: That was 2005.
Hugo: And how at that point in time did you know how to write an R package or even what that meant? Or does that question make sense?
Max: It does, but I think if you go back and look at the sources back then, maybe I'm not the best person to talk about that. I'd written S-PLUS packages, so I was using S-PLUS before R was a thing. I had an idea of the basic framework for doing that, and back then writing ... back then. (laughter) I'm old now. Back in my day. Back then, there weren't as many rules and things. When you write an R package now, there are, it feels like to me at least, a thousand things that you might be dinged on if you send it to CRAN, or questions about what's the best way to do this, and it's somewhat complex now. At the time I never felt that it was, so I don't know, I just started throwing things together, like literally throwing things together. I know we used it inside of Pfizer for about a year and a half before I really contemplated sending it to CRAN. Yeah, I even had to change the name. It originally had a different name, and then there was a package, maybe a couple weeks before I sent it to CRAN, that took the name, so I had to change it all and figure out what to call it and all this stuff, so, yeah.
Hugo: And it's called caret, c-a-r-e-t.
Hugo: And what is that? That's an acronym?
Max: It is. My graduate advisor gave me a lot of great pieces of information, like nuggets of wisdom. And one was, come up with what you want to call it and then back fit the acronym. So it stands for classification and regression training, which is a total hack on that acronym.
Hugo: Were you contemplating doing a reboot or another package called carrot but spelled like the vegetable?
Hugo: Do I remember that?
Max: So yeah. This is funny. So-
Hugo: A root vegetable theme as well, right?
Max: Exactly. I made a joke that yeah ... that would be the next package, would be called carrot, spelled like the vegetable, so I could just mess with everybody. And then Hadley said no, we should do something else, and he came up with parsnip, which is a white carrot, very similar to a carrot. And I actually have a package called parsnip right now, which does something similar to the original caret, so it's just become a confusing mess for how I've named these things.
What do your colleagues think you do at Pfizer?
Hugo: That's hilarious. Okay, great. So, I hope to make it back to talk a bit more about caret and open source software package development and CRAN and all these things later, but I want to jump in and find out about your work at Pfizer. But I want to take a slightly different approach. I want to know what your colleagues at Pfizer thought that you did there.
Max: So, I was hired as a nonclinical statistician, and it was the second job I had with that title. What that is, is exactly what it sounds like: it's everything but clinical trials. Clinical trials tend to be these very dogmatic and formalized processes of, in a general sense, data collection and analysis. And nonclinical is almost the exact opposite of that. So, a lot of times when people talk about nonclinical, they're usually talking about the early part of the drug development phase. Somebody comes up with an idea of what we should target in the body; all those activities fall under nonclinical statistics. That would be drug discovery, discovering targets for drugs and things like that. And then, after that, the medicinal chemists start coming up with compounds and chemical matter to try to make that happen. That would be early medicinal chemistry and drug discovery. And then, once you have a drug candidate, which is the drug you think you want to put in people, then you start working on the formulation part of it, the toxicity testing, how it's going to be made, how we characterize its safety and efficacy. And once you think you have all that worked out, then you go to the part where clinical takes over, where you start considering that we're going to put this in people. How are we going to do that? At what dose are we going to do that? All the stuff that happens before it goes into people, or a person, is nonclinical statistics. So it's usually associated with the scientific side of the business, the scientific research part.
Hugo: And you were in the research and development branch of Pfizer, correct?
Max: Oh yeah, yeah. Mm-hmm. And it was always hard to describe to people because they were like, what drug do you work on? I'm like, yeah, I know their compound numbers. I can never remember ... If you worked on one that actually became a drug, we had no part in what those weird names were so I'd be like yeah I worked on 7-8-5. We just called them by the last three numbers, their serial numbers basically. It was that early. It was before it was even a real drug.
How is data science integrated in pharmaceuticals?
Hugo: Cool. So, I think the three take-aways there for me are drug discovery, medicinal chemistry and developing drug candidates. And then, thinking about toxicity and this type of stuff. I'm wondering how data analysis and data science are integrated into this process, in pharmaceuticals.
Max: It's interesting. I've heard, and I've even said ... a lot of people say, well, you know, we were doing data science back then, we just didn't call it that, and I think in a lot of cases that's true, and I'm going to do my best to avoid saying that, but I probably will in two minutes. (laughs) But I've seen a lot of statisticians, so I'm trained as a statistician, my PhD is in Biostat, and I've seen a lot of statisticians say "aren't we the data scientists?" And my response to that is almost uniformly "no". Statisticians, some of us have been functioning in that role for quite a while, but statisticians ... sometimes the stereotypes are true. The stereotype, historically, has been people who sit back, somebody brings them data, they ask a lot of questions about how the data was collected and how it was sampled, and then they generate a P-value and go on to the next thing. And that's not at all what nonclinical statistics was or is. So I kind of view one of the main differences between a data scientist and a statistician as being almost a matter of involvement or intent. A nonclinical statistician tends to get in the mud with the people who generate the data. We're in the labs looking at how they do things, we're involved in the biology and the chemistry. We don't have to be experts in those things, but knowing what's important to those groups of clients ... so it's not very passive at all. It's very proactive, and we were always trying to get ahead of the projects. In fact, the best things we ever did were to try to solve problems that we knew that they had but they never articulated. And we would develop a solution to a problem and then bring it to them and say, hey, I know you haven't really asked us about this, but we've noticed that you have a lot of issues with predicting these types of compounds. So we got some data together and built something for that, and let us know if this is any good or not. To me that's the antithesis of how I grew up with statistics.
The stereotype is: bring us data, we'll judge you, and if we feel like we should bless your data and the analysis, we'll do that. So it was very passive, sort of a "come to the mountain" approach to clients. And you just can't do that in nonclinical. You have to be involved in the science side. You have to be involved in the nitty gritty of how the data are generated. I see that as how data scientists tend to work. They tend to be more involved in the actual question of interest and how the data is collected. We're all writing software to do these things. Sometimes I worry that statisticians just don't do analyses because it's not in their favorite software, and they wouldn't really think to write their own software to do things. That's what I mean by it's a matter of involvement or intent. I just feel like, if you're going to get in the mud with the process that you're working on, you tend to be more of what somebody would call a data scientist, or in my case, a nonclinical statistician.
Hugo: I like this, because this idea of involvement speaks to two very interesting aspects of data science. The first is this idea of domain expertise: how much you need to know about your subject matter, and whether, when you're not quite the expert there and your collaborator is, you still know enough to do your job, is really important. The other thing it speaks to, that I want to pick your brain about, is being involved in experimental design. I know in your work at Pfizer there were two types of things you did ... well, several types of things, but two in particular: working on historical data, where you had no input into the experimental design, and then working on data which you'd been involved in from even setting up the experiments. Is that right?
Max: Yeah, well, the experimental design part was mostly ... I'd worked at a molecular diagnostics company before Pfizer, and we were pretty hardcore about experimental design, so we had a lot of high-throughput systems where we were optimizing assays. An assay is just a fancy word for a laboratory test, and these assays would have 15 or 20 critical components whose concentrations we had to get right. So with that company, we'd design an experiment in the morning, and between ten and two they'd execute it. The data systems were pretty good, so we'd get the data back in a consumable form very quickly. We'd do the response surface design or the factorial design analysis very quickly. We'd discuss it at like four and then plan the validation experiment or the confirmation experiment. So within three days we would have done two or three experimental design iterations of optimizing these things. I gravitated toward problems where, and I don't think it's just my impatience, you get very quick feedback. You design, you think you have a concentration, or a combination of concentrations of these components to the assay, that you think works, and then they test it two hours later, and you know whether you failed miserably or you did very well. That was a really nice aspect of doing that. The chemistry work I did was the same way: you propose that they make this compound based on your model prediction, it takes them two or three days to synthesize it and a day to test it, and you know pretty quickly whether you did a good job or not. And that's very different than: I generated a P-value, I think these two things are different, but we won't know for another year whether that is a real thing or not. I've always liked the idea of getting empirical feedback that says this really worked or this didn't. That was the experimental design part ...
the other bit that you mentioned: before I left Pfizer, we supported a lot of the assay scientists there. So a chemist designs a compound, that gets synthesized so we physically have it, and then it gets run through a battery of anywhere from three or four to a dozen different assays that measure: does it work, what's its permeability, will it get into a cell, what's its lipophilicity, how greasy is it. The chemists often would build models to make predictions of these things, so we had boatloads of historical data on compounds going back forever, because these assays don't really change. And maybe around 2010, I think, maybe it was a little bit before that, we wanted to get into that data set. We could query the summaries of the data, but we didn't have the original replicates and things like that. The assay scientists were a little bit wary of that. We had good relationships with them, but they weren't quite ready to ... I think they were afraid of "what would these guys show with this data?" We kind of left that alone for a while, but then about four or five years later, they came to us and said, we'll give you the keys to the kingdom, look at all this and help us. Because they'll have a chemist run a compound twice in an assay, and if the chemist didn't feel that the data were tight enough, they'd raise a ruckus and say, this assay's not any good, and then the management of the assay group would have to come back and say, "well no, this is actually pretty good, and your results aren't really any more or less consistent than any other compound's, and here's some data." And so then, rather than fighting fires, we got to the point where we did a lot of analysis of their historical data, using Bayesian analysis. The idea would be that, given I have a measured value of 10, let's say 10.5, from this assay, what's the probability that the true value is within certain boundaries, or what's the credible interval, and things like that.
So, you just have tons of historical data which really opened up a lot of new and interesting things we could do with Bayesian analysis to make ... and this is when I started really using Shiny a lot. We had this web portal that had ... it's probably up to two dozen now, but at the time it was maybe 15 different Shiny applications.
What is Shiny?
Hugo: Can you just tell us what Shiny is?
Max: So I work for RStudio, and one of the things we have is Shiny, which is a web server for R. What's great about Shiny is that with a little bit of code, you can come up with a really nice looking website that has controls on it, and behind those controls is R code. If you change a dial a little bit, it can automatically recalculate things behind the scenes and give you a nice interactive visualization to show a histogram or box plot or the data themselves. What we did is we pre-populated all these data sets with all the Bayesian analysis results that anybody might ever ask for. Then we would use Shiny to basically visualize these. So a chemist would come in and say, "oh, when I ran this assay I got an assay value of 10.5", so they'd turn the slider to 10.5, and, we wouldn't call it this, but it would give them the posterior distribution that describes the plausible range and distribution of values it could be. Based on that they'd say, "oh well, based on these thousands of historical data points, I know that this is my measure of uncertainty with this particular compound that I measured." That really helped everybody understand what the baseline conditions of all these assays were. They could tell what was an aberrant value and what wasn't. And it also opened the door up ... so I did a presentation about this, and the link will be in the show notes. The name of it was "Statistical Mediation In Early Discovery", because we found ourselves being the mediators. The consumers of the assays, which were the chemists or biologists, would want it to be the highest quality data that they could get, which is a good idea. But then the producers of the results, the assay scientists, they've got budgets and they're getting thousands of compounds to assay every week. So we tended to be the people in the middle to mediate: like, yeah, this one was a wild data point, or no, this result that you got was really consistent based on all the historical data.
And we basically just automated all the analysis and visualization of that stuff so that it could take ... it was just a known thing that everybody could consult what the real values were. It was a real interesting project statistically and it just seemed to solve a ton of problems so that was one of my favorites.
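The posterior-lookup idea Max describes (given a measured assay value and a pile of historical data, what is the plausible range of the true value?) can be sketched with a simple normal-normal conjugate update. The function names and all numbers below are illustrative assumptions, not the actual Pfizer system, which used richer Bayesian models:

```python
import math

def posterior_of_true_value(measured, historical_mean, historical_sd,
                            assay_noise_sd, n_replicates=1):
    """Normal-normal conjugate update: combine a prior built from
    historical assay data with n replicate measurements of a compound,
    returning the posterior mean and sd of the compound's true value."""
    prior_prec = 1.0 / historical_sd ** 2
    data_prec = n_replicates / assay_noise_sd ** 2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * historical_mean + data_prec * measured)
    return post_mean, math.sqrt(post_var)

def credible_interval(mean, sd, level=0.95):
    """Central credible interval for a normal posterior."""
    z = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}[level]
    return mean - z * sd, mean + z * sd

# A chemist measures 10.5 on an assay whose historical values
# center at 0; the posterior is pulled between prior and data.
post_mean, post_sd = posterior_of_true_value(10.5, 0.0, 5.0, 3.0)
lo, hi = credible_interval(post_mean, post_sd)
```

Note how more replicates tighten the posterior, which is exactly the behavior Max mentions later for compounds measured only once or twice.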
What can data science help solve in pharma?
Hugo: That's cool, and I think what you're really speaking to as well is the role of data science thinking and data scientific approaches in all of these problems. And I'm wondering if you could speak more generally about the biggest challenges that you've seen in pharma that data science can help to solve.
Max: In pharma, especially the part of pharma that I was in, the complexity of the data is just astounding. Chemistry data can be complicated. You have these molecular structures, and we have all this software that can parse them and measure them in different ways, like the size of the molecule or its surface area, and that generates a lot of data. And those data have a lot of interrelationships between the variables that you have to be aware of and deal with. So chemistry has a lot of opportunities and a lot of data, and the complexity is pretty high, but in a way it's nothing compared to the complexity of biological data. Biology is just ... it always astounds me. Once I learned a lot about molecular biology, like what a game of Mouse Trap it is, right? You modify this gene here, that triggers a compensatory reaction in this pathway over there, that then sends a signal to this other pathway, and so just the biological complexity is astounding. Back in the early 2000's, with a lot of microarray technology, we were able to conduct very large scale biology experiments, so with a single sample, you could get 54,000 RNA measurements pretty easily. And with those, we had multiple measurements per gene. We know the sequences. We know all this stuff. Just the complexity of these data was very daunting in a lot of ways. And we had a lot of good tools, Bioconductor and R, to help us with that. If anything, it's only getting worse from the standpoint of complexity; we're able to measure even more than we ever did before, and we're asking more sophisticated questions, so the experimental design might be more complex. Things aren't getting more simplistic, they're just getting more complicated, and they were already pretty complicated to begin with. Having tools to visualize the stuff and make ... good methodology counts, because I've seen a lot of really, really bad methodology.
Having people who can give a defined baseline analysis that you can trust matters, basically because it considers all the aspects of the data analysis that might be lost on a neophyte, or someone who hasn't really thought about this too much.
Hugo: Yeah, that's really important because when you were talking about these types of communication necessities at the end of a data scientific experimental pipeline, who do you need to tell this stuff to? Are they technical in any sense?
Max: Yeah, we were lucky enough ... they moved a lot of biology from Connecticut, where I was, to Boston, but when it was here, we had a pretty good, not committee really, but we always had a statistician, a biologist, and a chemist who had dealt with a lot of these problems before be the initial people they would consult. So you would have a scientist come in and say, "hey, I want to run this big biology experiment and use these microarray technologies." And then you'd sit down with them and say, "okay, what's your experimental question?" And then they'd tell us, and then they'd tell us what they thought the design would be, and we'd say things like, "hey, most of your design doesn't have anything to do with that question." They'd say, "well yeah, there's this other stuff I want to do." We're like, "okay, well, hang on." You want to compare 15 things in this experiment, and you're going to get 40,000 answers per question. Are you ready to get a 40 megabyte Excel file of just P-values and fold changes and gene names? A lot of times we were communicating with people who maybe at first didn't really understand the technology, or at least what they were getting from the technology. We saw a lot of experiments where they had good designs but they were very ... they just bit off more than they could chew. And then it's so overwhelming that the results never really got ... were taken advantage of. You run this experiment, you answer this one question, but there's 15 other questions that you answered, and not even the post-docs have time to work on that stuff. So we were very much about: let's do very focused experiments that answer one experimental question at a time. It was more the intuition or the practical aspects of the experiments, rather than how they isolate the RNA or what P-value correction or whatever they want to do. Those things are all important, but the design ... you can't resurrect a dead experiment because the design was bad.
So that is where we spent a lot of our time and we were mostly communicating this to the bench scientists and the lab heads. I had just top notch scientists to work with when I was in drug discovery so, if they didn't understand it at first, they were usually pretty respectful of saying "okay, explain that to me again," or if they just didn't get it say you know I just have to rely on your judgment. If you think this is a bad idea, then let's not do that, what do you think we should do? That was always, for the most part always a pleasure to work with them.
Hugo: Okay, great. And I'm sure that emerging technologies at that point, such as the ability to build Shiny dashboards, were very helpful in these types of conversations.
Max: Yeah, sort of the joke I made is, we don't need another P-value correction, right? Really early on we came up with the idea that we'd rather give them an interactive volcano plot, which is a plot of the signal versus the statistical significance of something. And with these types of plots, you can very quickly see what the main drivers of the analysis are. We had systems where they could click on it and have the gene name pop up. You would have a cluster of points on this volcano plot that would stand out, and then we could really quickly see: are those measurements from the same gene or the same family? What were they? That was just so much better than giving them detailed statistical analyses or reports or, again, this massive Excel file with all these numbers in it. Just to let them visualize all these things and capture what they're interested in, and give them other visualizations where they could isolate those particular parts of the experiment and say, okay, what was that? Was this an aberration, or did we just discover some new aspect of biology we weren't aware of? By far, that was the most productive part of our work there.
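As a rough sketch of what sits behind a volcano plot: each gene gets an x-coordinate for signal (log2 fold change) and a y-coordinate for significance (-log10 p-value), and the "drivers" Max mentions are the points extreme on both axes. The thresholds, gene names, and function name here are hypothetical illustrations, not any particular pipeline:

```python
import math

def volcano_points(results, fc_cut=1.0, p_cut=0.05):
    """Turn per-gene (fold change, p-value) results into volcano-plot
    coordinates and flag the 'drivers': large effect and small p-value.
    `results` maps gene name -> (fold_change, p_value); fold change is
    a ratio, so the x-axis is log2(fold change)."""
    points = {}
    for gene, (fc, p) in results.items():
        x = math.log2(fc)        # signal
        y = -math.log10(p)       # statistical significance
        driver = abs(x) >= fc_cut and p <= p_cut
        points[gene] = (x, y, driver)
    return points

# Hypothetical results: one strong hit, one unremarkable gene.
pts = volcano_points({"GENE_A": (4.0, 0.001), "GENE_B": (1.1, 0.6)})
```

In the interactive version Max describes, clicking a flagged point would surface the gene name and family; this sketch only produces the coordinates.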
Hugo: So you mentioned how complicated the data is in chemistry and how complex and large the data can be in biology, which are two of the big challenges that data science can help with in pharma. Can you speak to a project or two that you worked on at Pfizer that speaks to these challenges?
Max: Yeah, so the one I mentioned earlier where we had all the historical data, that was a pretty interesting problem. I think if you look at the pdf in the show notes, I think it does a pretty good job of describing it. Other things were ... a lot of times people come to us ... I think statisticians, especially PhD statisticians have gotten this ... I don't think we've done this on purpose but people perceive us as these wizards, right. We live in some dark cave that nobody goes into and people want their analysis blessed and so they come up to the cave and they're like, “Oh would you please look at this? Don't anger the wizard.” And then we come out and look at it and be like, "Yeah, yeah that's just fine.” So there were a lot of situations like that where people would have new methodologies that they have read about or started to develop themselves and wanted to know, like okay am I completely off base? So there's this area of chemistry, of computational chemistry called Cheminformatics and this is where I will say they are the original data scientists. So I just did it.
Max: Yeah, I know, right. I'll regret that later. They are, for the most part, people with chemistry and computer science backgrounds, and their job is to interrogate very large databases of compounds and assay results and come up with solutions and tools to help accelerate the drug discovery process or the medicinal chemistry process. There were so many things that came out of that that we looked at. One aspect is this thing called matched molecular pairs. The idea is, you can break down most compounds into certain groups. They usually call them R groups, like the letter R, but that's old terminology; it has nothing to do with the programming language. What people figured out is, when you have a million compounds, you can find common structures in these compound structures, substructures of them, and if you hold those constant, you can see all the other things that changed. It's kind of like taking the tail of the dog, grabbing it, and letting the dog waggle around a little bit. In doing that, just from looking at these databases, they could find interesting accidental or deliberate transformations that chemists did based on that substructure, and find which of those increased whatever they were looking at. If they want to make a compound more permeable, they would look at: oh well, across the two million compounds we saw, when we saw this substructure, if people combined it with this other substructure, it really increased the permeability. It's this really interesting mathematical way to go back and see ... observing things that were probably unrelated experiments and seeing how they affected this thing that you might be interested in. They would take the output of that, which looked kind of like a heat map, and they would come to us and say, well, what do we do with these? We have all these maps to look at, or these compound structures. How do we organize it?
Do they help us with this aspect? And so it allows us to come in and use what we know, and look at it, let's say, from an experimental design standpoint, and say, well, okay, here's how you can tell a good transformation from a bad transformation. Here's what you can do with these matrices that you get back from the analysis. Things like that I just found infinitely interesting. And the cool thing about the job was that it was like once a month, or something like that, that somebody would bring you something and it just blew your mind. It was very, very hard to say no to these things. One aspect that's not so great about nonclinical statistics, at least in drug discovery, is you tend to have about a 200 to one scientist to statistician ratio. So you're really, really, really outnumbered. And they're not bringing you problems that are t-tests, or an ANOVA; they're bringing you problems that are really, really difficult. That can be a little bit ... every once in a while, you just want a t-test to work on, because you're having to work on these really complex and interesting novel problems, things that might take you six or eight months to figure out. And sometimes you feel like, am I getting anywhere with this? So it's a double-edged sword, because these things are so interesting and cool to work on, but they're really, really hard, so if you've got four of those in your project list, that could kill you just because it's so daunting.
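A toy version of the matched-molecular-pair bookkeeping Max describes might look like the following: hold the common scaffold constant, enumerate every R-group swap seen on that scaffold, and average the property change for each transformation across scaffolds. Real cheminformatics tools operate on actual chemical structures; the tuples and names here are stand-ins for illustration only:

```python
from collections import defaultdict
from statistics import mean

def mmp_transformations(compounds):
    """Matched-molecular-pair sketch. `compounds` is a list of
    (scaffold, r_group, property_value) tuples. For each scaffold,
    every pair of compounds differing only in the R group records a
    property delta for the R1 -> R2 swap; the result maps each
    transformation to its average effect across all scaffolds."""
    by_scaffold = defaultdict(list)
    for scaffold, r, value in compounds:
        by_scaffold[scaffold].append((r, value))

    deltas = defaultdict(list)
    for pairs in by_scaffold.values():
        for r1, v1 in pairs:
            for r2, v2 in pairs:
                if r1 != r2:
                    deltas[(r1, r2)].append(v2 - v1)

    return {t: mean(d) for t, d in deltas.items()}

# Two scaffolds, each seen with an H and an OH variant; the H -> OH
# swap raises the (made-up) property on both.
effects = mmp_transformations([("S1", "H", 1.0), ("S1", "OH", 3.0),
                               ("S2", "H", 2.0), ("S2", "OH", 3.5)])
```

The averaged map plays the role of the heat-map-like output Max mentions: for each candidate transformation, how it tended to move the property of interest.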
Hugo: For sure, and I do think they are interesting. But as you said, the fact that a lot of people approach statisticians and data scientists as wizards of sorts, or oracles of truth extraction, in a lot of respects is dangerous. And I can't remember who said this, it was probably Nate Silver or someone along those lines, but the quote, which I'm going to mangle horribly, is along the lines of: if a data scientist gets a prediction right, they get more credit than they deserve, and if they get the prediction wrong, they get more blame than they deserve, currently.
Max: Yeah, yeah. I would definitely sympathize with that. I believe that's true. So my solution to that is just make millions of predictions. It's safety in numbers.
Hugo: Yeah, absolutely. And I think an emerging trend and paradigm, as you discussed ... you're very interested in Bayesian thinking and Bayesian inference ... is expressing uncertainty along with our predictions as well.
Max: Yeah, absolutely. I began to really use a ton of Bayesian methods for that exact reason, just because a lot of times it's really difficult to put estimates of uncertainty on parameters, or combinations of parameters, using old school mixed models and things like that. We had tons and tons of repeated measures data with different levels of hierarchy. But if you wanted to get the uncertainty of the ratio of two parameters, there are some tricks like Fieller's theorem and things like that you can do, but all that just comes out in the wash in Bayesian analysis. It makes everything so easy to do. Once you've got a model that you like, the things you can do with that model are far and away more interesting than what you could do with their non-Bayesian analogs.
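The point about ratios "coming out in the wash" can be illustrated with a tiny Monte Carlo sketch: once you have posterior draws for two parameters, the credible interval of their ratio is just percentiles of the divided draws, with no Fieller-style machinery. The stand-in normal posteriors and all numbers are assumptions for illustration:

```python
import random

def ratio_credible_interval(mu_a, sd_a, mu_b, sd_b,
                            n=20000, level=0.95, seed=42):
    """Draw paired samples from two stand-in normal posteriors,
    divide them, and read the central credible interval of the
    ratio straight off the sorted draws."""
    rng = random.Random(seed)
    ratios = sorted(rng.gauss(mu_a, sd_a) / rng.gauss(mu_b, sd_b)
                    for _ in range(n))
    alpha = (1.0 - level) / 2.0
    lo = ratios[int(alpha * n)]
    hi = ratios[int((1.0 - alpha) * n) - 1]
    return lo, hi

# Posterior for A centered at 10, for B at 5: the ratio's interval
# straddles 2 and directly carries both parameters' uncertainty.
lo, hi = ratio_credible_interval(10.0, 1.0, 5.0, 0.5)
```

The same trick works for any derived quantity (differences, ratios of ratios, predictions), which is the convenience Max is pointing at.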
Hugo: And I love the idea that you spoke about earlier of coupling Bayesian thinking with Shiny dashboards to make the posterior distribution ... for people who don't know what the posterior is, it's really the probability distribution of the parameter of interest after you've incorporated your model and data and your prior knowledge and all of that. But using a Shiny dashboard to communicate this is, I think, a really beautiful coupling of two great technologies.
Max: Yeah, we spent a lot of time trying to eradicate as much jargon as we could from those pages. We never said posterior. I don't think we even said Bayesian. We would, in as much straight English as we could muster, try to describe what these things were computing, and to the credit of the scientists, they really began to figure that out. There were some situations where it gave results that we thought they were going to think were wrong. There were some assays, like our activity assays, that have extremely bimodal distributions. We had this really big normal distribution that's pretty wide, centered around zero; let's say it's like plus or minus 30, so values could be as low as minus 30 and as high as plus 30. And then the hits, the active compounds, tend to have a much smaller normal distribution that's maybe ... let's say 75-100. So you have this prior distribution that you come up with, with your Bayesian analysis, based on a lot of compounds, and it's extremely bimodal. You give them a result, and let's say the data that they hand you has an activity of 40, so it's probably above the part of the prior that is for the non-hits. The distribution that the Shiny apps would give them would be a really, really bimodal distribution. And it makes sense why that is, because it's sort of in the middle of these two distributions, and until they collect enough replicates, the posterior distribution doesn't get very tight. And we thought that they were going to be like, "I think you computed this wrong," but before we even brought it up, they said, "Oh yeah, this totally makes sense: there's a slightly higher probability that it's dead, and then a 40% probability that it's an active compound. And I can see why that's what it was."
So I was really impressed with people with little to no statistical background just consuming these things, accepting them, and rationalizing what they were doing, without us having to get into some discussion about priors and posteriors and things like that.
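The behavior Max describes, a posterior that stays genuinely bimodal until replicates accumulate, can be reproduced with a simple grid approximation. All the numbers below (mixture weights, standard deviations, assay noise) are hypothetical stand-ins for the assay he's recalling:

```python
import math

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Hypothetical bimodal prior over compound activity: inactive compounds
# centered at 0 (roughly plus or minus 30), hits centered in the 75-100 range.
def prior(theta):
    return 0.5 * normal_pdf(theta, 0, 15) + 0.5 * normal_pdf(theta, 87, 12)

def posterior_on_grid(observations, assay_sd=20):
    grid = list(range(-60, 151))
    # Unnormalized posterior: prior times the likelihood of each replicate.
    weights = []
    for theta in grid:
        lik = 1.0
        for y in observations:
            lik *= normal_pdf(y, theta, assay_sd)
        weights.append(prior(theta) * lik)
    total = sum(weights)
    return grid, [w / total for w in weights]

def post_sd(grid, post):
    mean = sum(t * p for t, p in zip(grid, post))
    return math.sqrt(sum((t - mean) ** 2 * p for t, p in zip(grid, post)))

# A single reading of 40 sits between the two prior modes, so the
# posterior keeps mass on both the "dead" and "active" regions:
grid, post1 = posterior_on_grid([40])
p_active1 = sum(p for theta, p in zip(grid, post1) if theta > 60)
print(f"P(active | one reading of 40) = {p_active1:.2f}")

# Replicates tighten the posterior considerably:
grid, post4 = posterior_on_grid([40, 40, 40, 40])
print(f"posterior sd: n=1 -> {post_sd(grid, post1):.1f}, "
      f"n=4 -> {post_sd(grid, post4):.1f}")
```

With one observation, the probability of being an active compound lands in the ambiguous middle, which is exactly the reading the scientists gave the Shiny output.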
How do new, evolving computational paradigms impact the data community in pharma?
Hugo: Great. So you no longer work in pharma. You work, as you said earlier, at RStudio, as a software engineer. But of course, what you do is impacted a lot by all the work you did in pharma for decades. So I'm wondering how you see the work you do now, and new, evolving computational paradigms such as the tidyverse, impacting the data community in pharma.
Max: Everything we do at RStudio, and especially in the tidyverse group, is really just enabling people to do more, or do things more easily. Think about dplyr and things like that: dplyr and the tidyverse take things that, especially in R, were feasible to do but kludgy. That's a Baltimore word meaning really awkward. And so it makes those things really easy to accomplish. That happens all the time, everywhere. I would say the things that, before I joined RStudio, I thought were the biggest, best tools satisfying that "let's just make it really easy" goal ... Number one was R Markdown. I never was really totally prescriptive about saying if you're going to work in my group, you have to use R. But one of the most important things we did was really try to enable reproducible research. We made sure that when somebody had a project, we knew what data was used, we knew what the analysis code was, and we could tie that directly to a report, because we might not work on that project for five years and then somebody comes to us and says, "Yeah, what did you do for this?" And if we didn't have something like R Markdown, Sweave, and knitr, all those tools that we've had, and that I think have gotten taken for granted at this point, we wouldn't be able to effectively have any record of what we did and how we did it. And it really wouldn't help anybody down the road. Or even I'd work on something, and six months later somebody would say, "Well, what did you do here?", and I've done like 15 projects since then. So R Markdown, and the reproducible research it enabled, was just a godsend. I mentioned Shiny ... I can't overemphasize how much that has made a difference, and for me it's been interesting because at Pfizer, at least, some of the biggest proponents of using Shiny were the clinical pharmacologists, so these are people that are measuring concentrations in patients in clinical trials.
And they, I think, behind nonclinical statistics, were the people who were most into Shiny and how they could utilize it. And they were the very, very, very regulated part of that pipeline. And to see them embracing the computational and visualization aspects of what they do ... that was really unusual, in a good way. So these are the things, as far as RStudio stuff, that I think have made a huge difference. For me, caret was originally designed to make pretty good modeling easier in R, and I think it does that pretty well. It makes a lot of decisions for you and it's pretty high-level code. But my job right now is basically, from a broader standpoint of modeling, to just make it easier to work with models in R. So whether you have a survival analysis model or a time series model or a standard regression model, there's this package called broom that David Robinson wrote, and that exemplifies what I'm getting at. We've all had code that would take the summary object of a linear model, which gave you the output you wanted but not in a format you would ever want it to be in. And then we've all had code that would take that and maybe make it into a data frame and format it, and even the title or column name for these things, depending on what type of model you fit, would be named differently. A lot of us have had code lying around to just make this all work, and basically what the tidyverse does, and especially what I'm trying to do with modeling in the tidyverse, is to extend that more and say, let's just take these things that have been frustrating to work with in the past and make them smooth, and enable people to do these things without banging their head against the wall.
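broom is an R package, but the convention Max is describing, every model type returning its summary in the same rectangular shape with the same column names, is language-agnostic. A toy Python sketch of that idea (the model summaries and tidier functions below are invented for illustration, not broom's actual API):

```python
# Two "fitted models" whose native summaries use different, inconsistent
# names -- exactly the situation a tidy() convention smooths over.
linear_fit = {"Coefficients": {"(Intercept)": 1.2, "x": 0.8},
              "Std. Error": {"(Intercept)": 0.3, "x": 0.1}}
survival_fit = {"coef": {"age": 0.05}, "se(coef)": {"age": 0.02}}

def tidy_linear(fit):
    # Map the linear model's naming onto standard column names.
    return [{"term": t, "estimate": fit["Coefficients"][t],
             "std_error": fit["Std. Error"][t]}
            for t in fit["Coefficients"]]

def tidy_survival(fit):
    # Same standard columns, regardless of the model's native names.
    return [{"term": t, "estimate": fit["coef"][t],
             "std_error": fit["se(coef)"][t]}
            for t in fit["coef"]]

# Downstream code never has to care which model produced the summary:
for row in tidy_linear(linear_fit) + tidy_survival(survival_fit):
    print(row["term"], row["estimate"], row["std_error"])
```

The payoff is that plotting and reporting code can be written once against the `term`/`estimate`/`std_error` columns rather than per model type.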
Hugo: So something I find really interesting in this conversation is a trade-off that occurs in your career, and in those of several colleagues of yours as well. And I'll be specific about this: you started developing software to meet needs that you and your colleagues had, and eventually you ended up at a place like RStudio, where you're developing this software constantly, but you don't necessarily have direct contact with the people on the ground who have all the needs you're trying to serve. So I suppose my question is, now that you're at RStudio and not at Pfizer, how do you stay grounded in the needs of the people using the software you're developing?
Max: Part of me thinks it's not that difficult to do, because when anybody has a question, they ask you. One great thing about the community is ... like, when I was in graduate school or at my first job, I would never have thought to email Jerry Friedman and say, "Yeah, about MARS. How does this work?" The thing about the R community and statistics especially is that there is, generally speaking, a much lower barrier to contacting people that you think can answer your question. So maybe twice a week I'll just get a blanket question like, "Hey, have you ever had data like this, and what did you do with it?" And that helps me a lot, because it helps me understand weak points, things I think we can do better. It helps me understand what people are working on, especially if it's something that I didn't really have a lot of exposure to. That gives me some motivation to learn more about it and to figure it out. If you email me, I can't be a consultant and answer everybody's question, but a lot of times, if there's a particular question that strikes me, I'll spend two or three hours figuring it out and then sometimes give people a really detailed explanation back. I think there's still a lot of opportunity to learn and keep your ear to the ground and figure out what's going on. I do have to admit that probably the biggest adjustment for me working at RStudio is that I don't get data. I'm used to getting a ton of new data, maybe more than I wanted sometimes, that's sometimes really simple and sometimes really complex. I'm constantly on the lookout for more data, especially for teaching and putting in R packages to demonstrate things. That is one thing that's taken me quite a bit of time to get used to ... there are areas that I'd love to do more with, but I just don't have any data to work on sometimes. That's more of a problem than anything else that I have.
What advice do you have for those who want to become a data scientist in the pharmaceutical industry?
Hugo: And I do think your point about people reaching out reflects the current paradigm of communication. Stack Overflow has proved to be a really important place, and you've always been ... I see questions about caret on Stack Overflow, and you're very commonly one of the first people to respond there. We've been talking about budding data scientists. I'm wondering what advice you'd give to a budding data scientist who wants to get started in the pharmaceutical industry.
Max: So I think the first thing I'd say is go into nonclinical statistics. I know I keep talking about it, and the reason is ... there's a couple of reasons, actually. That's where the fun is. Everybody I know who works in clinical ... not everybody, but the majority of them ... nobody would describe it as fascinating. I don't think anybody would ever in a million years describe the statistics that goes on for clinical studies as being as infinitely fascinating as I hear myself and other people describe nonclinical. If you're interested in something that's really going to knock your socks off every day, then that's the place to go. I think more broadly, the thing I'd say is that it's easy to underestimate how much stuff is out there for learning. Look at an example, and again, what year would this be? This would be maybe 2000. I started having to work on microarray experiments and these high-dimensional biology experiments, and I knew a little bit about that stuff, but I had never really had to do an analysis, and I had a very high-profile, this is pre-Pfizer for me, very, very high-profile set of data coming my way. I just started, for example, running Google with the site filter for edu and looking for courses that taught how to do this analysis. There's software where you can really easily download a whole website, and so very, very quickly I was able to get very conversant and very functional in doing a data analysis where I'd never been exposed to the actual data. They had example data there, but the real data I was going to get was months down the road, and again, the infrastructure and overhead for analyzing that data is not insignificant. If I had just waited until I got that data and then figured out what I should do and what software I should use, that would have been a disaster. Between courses that you see in things like DataCamp, and even YouTube videos and Twitter especially, and blogs ...
If you're looking to learn about something, there is no end of resources out there. And the benefit of data science and statistics is that if you want to be a medicinal chemist, you need a lab. If you want to learn how to synthesize compounds, you need some big machinery to actually do that. We're not in that situation, so we can very, very easily download and get access to tons of real information that basically emulates what it would have been like to get that data in the first place and do the analysis. It might be stating the obvious, but a lot of times I'll look for resources on how to do something and I might find one and ... sometimes I'll read things and they're talking to people like they already have a PhD in that subject, and that's no good for me because I can't do anything with that. Originally I would be like, oh, you know, that seems like all I can find, and it's not. Just throw that away and go find something else. There's no reason not to be able to learn about whatever topic you want to learn about in data science or statistics. Sometimes there's high-level theoretical statistical stuff, but arguably, I don't know that anybody doing data analysis needs to be seriously interested in that. Not that it doesn't have utility, but if you're trying to get things done, measure theory is maybe not the answer. I would just say look around as much as you can, especially on Twitter, R-bloggers, and things like that, and at different levels of sophistication and complexity you'll most likely find plenty of resources for what you're trying to do.
Hugo: For sure. And it takes time and patience. Also, when I run workshops, pretty much at the very start I tell everyone that search engines will be their best friends in these types of endeavors, if they can't figure something out.
Max: Yeah, and it sounds stupid, but when I was doing my Google searches, putting in site:edu was a huge help, because it gets rid of most of the hits you would get from people trying to sell you things, right? So if you want to learn about microarrays, you might get the first 50 hits being companies that sell instruments that do microarrays, and so that enabled me to get very, very far pretty quickly.
What is your favorite data science technique?
Hugo: So to wrap up, I'd love to know what you love doing. What's one of your favorite data science techniques or methodologies?
Max: It may be a little higher level than a technique, but I'm just becoming a bigger and bigger fan of Bayes, for certain things. People are going to get mad at me and start sending me hate mail for saying this, but I don't know that it's there for every problem. So Leo Breiman, the guy who invented random forests and was one of the inventors of CART, wrote this really inflammatory paper called "No Bayesians in Foxholes", which is a riff on "there are no atheists in foxholes". What he was trying to get at, his thought at least back when he wrote it, was that the complexity of what we need in machine learning in terms of predictive power will never be ... you'll never be able to do that using Bayesian methods. And I don't know that that's true. I think he was being a little bit of an instigator there. But for the types of questions I'm usually trying to answer, which are not straight-up prediction problems, I'm finding more and more that Bayesian analysis is the best way to go. There are so many advantages, and as a statistician of my vintage ... I was in graduate school when Bayes was kind of looked down upon. It was just starting to become something that people were getting more and more interested in and reviving, right? But I always found it funny that when people would talk about some non-Bayesian model, they would set it up and motivate it from something that sounded suspiciously like a Bayesian model, except they just didn't want to say that word. It was like a bad word. And so I find that this hang-up people have with Bayesian analysis about its complexity, and what about this prior or that prior ... you're doing the same thing in a lot of non-Bayesian models, it's just that you don't have the mechanism to actually evaluate that assumption. I find it to be a very rational and very advantageous way of doing data analysis, especially if you have to do anything around inference or any measure of uncertainty.
A good example: before I left Pfizer, there was a safety group that was trying to predict a particular type of toxicity that's extremely rare but extremely hard to predict. They would get compounds, and they would have to give a prediction back saying yes, this is toxic, or no, this is not toxic, or it's somewhere in between. And this group got a bunch of modeling people together and gave us a long talk about self-driving cars and neural networks: we want to really solve this problem, we think we need some really high-powered stuff to do it, because we've never been able to crack this nut. I like to hear that, right? I'm like, alright, what are we going to do? Then when you talked to them, they had like 15 data points. And at that point, since they wanted measures of uncertainty, since they didn't want to get a prediction back without a measure of what the noise was, that's a Bayesian analysis problem. Not because it's necessarily simplistic, but with that amount of data, you're probably going to have to rely on a prior, or do something that involves more than just the data you have at hand, and factor that into the uncertainty estimates. It just seems like such a good solution to a lot of problems. At the time when I was learning statistics, I feel in hindsight it was a little bit of a disservice that people weren't more excited about or proposing more Bayesian solutions.
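The "15 data points" situation has a clean closed-form illustration: with a normal prior and normally distributed measurements, the posterior is conjugate, and you can see directly how much the prior contributes when n is small versus large. All the numbers here are hypothetical:

```python
import math

def normal_posterior(prior_mean, prior_sd, ybar, obs_sd, n):
    """Conjugate normal update: normal prior, n observations with
    known noise sd obs_sd and sample mean ybar."""
    prior_prec = 1 / prior_sd ** 2
    data_prec = n / obs_sd ** 2
    post_var = 1 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * ybar)
    return post_mean, math.sqrt(post_var)

# Hypothetical toxicity score: historical prior says about 20, data says 35.
m15, s15 = normal_posterior(20, 10, ybar=35, obs_sd=25, n=15)
m1500, s1500 = normal_posterior(20, 10, ybar=35, obs_sd=25, n=1500)

print(f"n=15:   posterior mean {m15:.1f}, sd {s15:.1f}")    # prior still matters
print(f"n=1500: posterior mean {m1500:.1f}, sd {s1500:.1f}")  # data dominates
```

At n = 15 the posterior mean is pulled noticeably toward the prior and the uncertainty stays wide, which is exactly why a prediction from that group's data should come with Bayesian uncertainty attached rather than a bare point estimate.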
Hugo: Yeah. I think it is historical. These things are generational as well. We have the computational power to be doing a lot of Bayesian inference now, but it is generational. My background is in academia, and it's definitely generational in institutions like that. There are so many interesting things in what you just said. One of the most important is that the technique needs to be specified by the question and the data as well. So Bayesian methods work for a lot of stuff, but there are some things they're not so good at. The other thing you mentioned is about it being called Bayesian. I've always joked, kind of half-joked, that part of the reason ... you referred to it as what some people think of as a bad word ... part of the reason it hasn't been so widely adopted historically is that calling it Bayesian inference makes it sound like something niche and out of the ordinary, as opposed to it being just a way of doing statistics and inference.
Max: Yeah, I agree. It's funny, sometimes you can turn this to your advantage. It does have this, not intimidating, but exotic quality to it. Oh, it's Bayesian, right? I just found that funny. So one thing is, when I was at Pfizer, we used Apple laptops a lot, mostly because we needed Unix on our laptops. But we would have to go through this process of justifying to the IT people why you couldn't do it on a ThinkPad. We would just use the words Bayesian and neural network and support vector machine, and people were like, oh, they're doing neural networks, or they're Bayesians, like they didn't know what that was. Even a guy in IT, you would say, well, I'm doing Bayesian analysis, and he'd be like, oh my goodness. There's something inherent or instinctual about that word that makes people think, oh, it's complicated, and I just found that to be really funny, that there are certain words you can throw at people that are just naturally intimidating, and that for some reason is one of them.
Hugo: Max, this has been so much fun and it's been such a pleasure having you on the show.
Max: Oh it's been great. Great talking to you again.