Robust Data Science with Statistical Modeling
Michael Betancourt is the chief research scientist at Symplectomorphic, LLC where he develops theoretical and methodological tools to support practical Bayesian inference. He is also a core developer of Stan, where he implements and tests these tools. In addition to hosting tutorials and workshops on Bayesian inference with Stan he also collaborates on analyses in epidemiology, pharmacology, and physics, amongst others. Before moving into statistics, Michael earned a B.S. from the California Institute of Technology and a Ph.D. from the Massachusetts Institute of Technology, both in physics.
Hugo is a data scientist, educator, writer and podcaster at DataCamp. His main interests are promoting data & AI literacy, helping to spread data skills through organizations and society and doing amateur stand up comedy in NYC.
Transcript
Introducing Michael Betancourt
Hugo: Hi there, Mike, and welcome to DataFramed.
Mike: Thanks for having me, Hugo.
Hugo: It's a real pleasure to have you on the show. I'm really excited to talk to you today about robust data science, statistical modeling, the work you do on Bayesian inference and probabilistic programming with Stan, model building and what the differences are or aren't between statistics and data science. But before we get into all of this, this meaty stuff, I'd like to know a bit about you.
Mike: Go for it.
Hugo: What are you known for in the data science community?
Mike: So I'm probably best known for being a developer at Stan. I'm one of the people who've built a lot of the internal C++, along with many others on the team. I try to be as active as I can on the forums and on social media, trying to get the word out about a lot of the research we're doing and improving Bayesian modeling and Bayesian workflow, as well as some of the co-applications that we're really privileged to be able to participate in.
Hugo: What other things might people know you for? And we'll get into what Stan and Bayesian modeling are in a bit, for those eager listeners out there.
Mike: Yeah, I also try to be pretty present in terms of a lot of introduction material. I certainly remember very vividly trying to learn statistics without a lot of resources available to me. And so now that I've been lucky enough to really be trained right, I try to spend a lot of my tim...
Hugo: That's great. So, in that sense, in your introductory material and educational material, you're really thinking about lowering the barrier to entry for people who want to do statistical modeling and data analysis.
Mike: Yeah, absolutely. I think there's been unfortunately this historical evolution where theoretical statistics and applied statistics have been kept separate. And you see that a lot in statistical literature, where a lot of the academic work is very theoretical. It's very formal and it can have limited application in terms of a lot of applied work that's going on in the physical sciences, social sciences, medicine, etc. Not that there isn't overlap. You just don't see a lot of literature going back and forth. Because I've been lucky enough to have one foot in both sides of that, I try to bridge that gap a little bit.
Hugo: I see.
Mike: So a lot of introductory material just tries to focus on the new users and what they think is just enough to get them started. And I think that runs into problems where it limits how far you can go. Right? If you're just getting the first steps figured out, you don't really know where to go after that. And so one of the things I'm trying to do is take the end point, take the finishing line, which is the ability to really construct a rigorous or a robust analysis, and figure out a path towards that. So how do we take somebody who's just getting started, and train them up to be able to get far enough to build an analysis on their own? That really requires understanding not just where people are coming from, where they're starting, but also where they need to go, right? Where is that final end point? That's where a lot of the ... That's where there's a lot of need in terms of literature and documentation.
How Did You Get Into Data Science?
Hugo: I love that and I couldn't agree more, because meeting people where they are is one thing but, as you say, connecting it to where they need to be so that you don't teach a certain amount of tools or APIs and a bit of math, and then suddenly have them stop half way through and not be able to reach the end point, is essential. And that's something that we're going to delve into a lot later on, especially in terms of thinking about robust data science and robust analyses, as you called it. One thing I'd like to know is how you got involved in data science? And I want to preempt something by saying I'm well aware that you're very passionate about how the term data science may be misused, and we'll get to that part of the conversation, but you prefer to use the term statistical modeling, and we'll kind of delve into that later, but I'd like to know how you got into this area and discipline in the first place.
Mike: Yeah, so I was trained as a physicist in my undergrad and I was doing my PhD in physics, experimental particle physics, colliding stuff and looking for very rare signals. And in that PhD project, I was given this particularly hard problem and one of my post-docs basically said, "Hey, you should learn some machine learning to figure this out." And in the process of doing that, trying to teach myself machine learning, I inadvertently taught myself Bayesian inference and got really excited about it. It really clicked as this very intuitive way of building up an analysis and learning from these complex systems. So in that process of teaching myself statistics, reading as many books as I could, I started developing my own algorithms. In particular, there was this algorithm called Hamiltonian Monte Carlo that was in a few books, particularly Dave MacKay's Information Theory book and Chris Bishop's Pattern Recognition book, both fabulous books, had mentioned this algorithm and it seemed kind of physics-y and fun to play with, so I just jumped on that. And in the process of doing my thesis, I developed some variants, some generalizations, but I really didn't understand why it worked. There was this review article by Radford Neal. It's a very seminal paper. And there were a few points where he kind of just said, "Well, we do this because the physicists say we should do it this way." And I'm over there on the physics side saying, "Wait, who's telling you that? I'm not telling you that. I don't know which of my colleagues are telling you that." And there's this weird evolution where it came out of physics into statistics, but it wasn't really clear what the foundational understanding was. So I made the decision at the end of my PhD I wanted to work on this more. And so I started just cold emailing professors. 
And fortunately, I got in touch with Mark Girolami, who's a professor in the United Kingdom, and he said, "Well, if you can wait a year, maybe we can get you out here for a post-doc." And then that year, I went to New York where I randomly ran into the Stan team, who were starting working on a project of implementing Hamiltonian Monte Carlo themselves, and so we just naturally fit in. So while I was there, I started working on some of the early development of Stan. Then I went to London, where I was very, very fortunate to be paired with some amazing colleagues who basically taught me everything I know about formal statistical theory and probability theory, and really forced me to figure out how to do things right. It was an amazing opportunity, and by the time that was all done, we had this really solid understanding of what Hamiltonian Monte Carlo was all about. And that really set me up to really understand what statistical modeling is all about in a way that I could really start employing it in a powerful manner.
Tool building and generative models
Hugo: That's great. And I want to stop you there for a second because something you've mentioned in passing there is that you actually became a tool builder at some point. You weren't only interested in applying statistical modeling techniques to real world questions, but you started building tools, right?
Mike: Yeah. So I think the problem is, I didn't really understand what modeling was in any formal sense. Right? I was just a physicist. You hack these heuristic models together. It wasn't ever a thing. Whereas the tools, the algorithms, that was always a thing. That was always something to be done. So that's what started me in on this path that I've been on for a while now. And in doing so, I inadvertently learned what modeling was all about, how I could formalize all these heuristics that had been taught in physics labs, in physics research, and turn it into something that was mathematically self-consistent and rules that actually made sense. And that was really exciting, because I could now ... I got my foot in the door with the algorithms, but what really excited me was now I understood what the statistics was all about and I could start building these models and doing some really cool science.
Hugo: That's awesome. So I want to kind of zoom out slightly and talk about what model building actually means to you, because you develop tools that a lot of people in a lot of industries and academia use to build models of the world based on data that's been collected. So I want to know what model building is to you and how it relates to data science.
Mike: Yeah, so model building to me is this very personal story telling. Everybody has their own data, everybody has data that's been collected in very bespoke ways. And model building is a way of building up a story, a mathematical narrative, of that data collection process, of that experimental process. And on one hand, people tend to think of statistics being really mathematical, and that discounts how much it is storytelling. On the other hand, if you're just telling stories without having a mathematical language to write those stories down in, then it's kind of hard to do any formal statistics with it. So I think to me, model building is really just a way of telling a lot of stories about how data could have been collected, but writing those stories down in a mathematical language that sets us up to do a statistical analysis.
Hugo: And I know something you're very interested in is generative models in terms of describing how data is generated in the world, so maybe you can speak to that a bit.
Mike: Yeah, so I think people get a little bit overwhelmed sometimes when they're reading some intro statistics book and it says, "You build a model," or "You write down your model and you're done." And they look at their system and they've ... "I've got these thousands of measurements and I'm looking at tens of thousands of individuals and they have all of these possible parameters to describe how they behave, and how do I somehow turn that into this model?" And one of the ways of getting around that initial overwhelming intimidation is to break down that story. Instead of thinking about modeling everything at once, rather try to model where did the data come from, where did it start. So you have some population of people and from that population, you grab an individual. And then that individual manifests certain behaviors and those behaviors turn into some interaction that manifests as data, and then that data gets selected and it gets written down somewhere. Right? Each of those steps in that sequential story is a little model component that you can build. And so generative modeling is all about, instead of trying to model everything at once, breaking this down and modeling things sequentially to drastically simplify that story building, that narrative building, but at the same time, still allowing you to build something that's bespoke to your problem, that incorporates all of the domain expertise that you have available. I like to, when I'm teaching ... I do courses in Bayesian inference with Stan, and one of the ways I like to describe this is in terms of Legos. What we're trying to teach you is a bunch of modeling techniques, which are these little Lego blocks, but it's up to you to put those blocks together and build a model. We're not going to teach you how to build the model; we're going to give you the building blocks so that you can build the model that's appropriate for your analysis. 
You can build your badass Lego spaceship better than anyone else can.
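To make the "sequential story" concrete, here is a minimal Python sketch of a generative model written as a simulation, one step per stage of the story. Everything specific in it is a hypothetical stand-in rather than anything from the conversation: the website-engagement setting, the Beta(2, 5) population, the 20 visits per person, and the rule that only people with at least one engagement get recorded.

```python
import random


def simulate_dataset(n_individuals, n_visits=20, seed=0):
    """Simulate recorded data by following the story one step at a time."""
    rng = random.Random(seed)
    records = []
    for _ in range(n_individuals):
        # Step 1: grab an individual from the population. Each individual
        # has their own engagement propensity (hypothetical Beta(2, 5)).
        propensity = rng.betavariate(2, 5)

        # Step 2: the individual manifests behavior. Each visit is an
        # independent success/failure with that propensity.
        engaged = sum(rng.random() < propensity for _ in range(n_visits))

        # Step 3: selection. In this made-up story only individuals with
        # at least one engagement are ever written down.
        if engaged > 0:
            records.append(engaged)
    return records


if __name__ == "__main__":
    data = simulate_dataset(1000)
    print(len(data), "individuals recorded out of 1000")
```

Each commented step is one "Lego block": swapping the selection rule or the population distribution changes the story without touching the other pieces.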
Hugo: Yeah, exactly. And if it needs to be a spaceship, right? And I think this is a really nice analogy because the way you put the Lego together, in terms of model building and statistical modeling, will depend on the domain that you're working in as well.
Mike: Yeah, absolutely. And I think one of the things that often times gets overlooked is that when you build a statistical model, it requires equal amounts of mathematical and statistical expertise as domain expertise. Right? I can't walk into a social science department and claim that I know how to do their analyses. I know how the statistics works, but I need them to tell me what the model is. I need to work with them to build up something that makes sense to me mathematically, but also makes sense to them in terms of the underlying social scientific structure. Right?
Model building in different industries
Hugo: And I think this is a wonderful segue into ... The conversation so far has been relatively abstract and I'd like to dive into some particular examples, because you've worked in consulting different industries and academic research groups to use this type of modeling, and in teaching it. So you've seen a plethora of use cases. I'm just wondering what some of the most impactful or telling models, or ones that you've just found interesting that you've been involved in building, are?
Mike: So as a physicist by training and really at heart, the ones that really do excite me the most are the physics models, and I've been fortunate to have developed some collaborations with some colleagues who are doing some cool, really state of the art physics at the moment. And there's just nothing more fun than sitting down and just talking about the story, going through the experiment, and then writing it down. Right? Writing it down in the Stan program and having that whole structure in this really, really cool little story, and then running an analysis on it. So just personally, that's something I've been really excited about. But I think that as we've been able to see the scope of what these tools can do, and we start getting analyses that get into medicine, epidemiology, that's a whole nother story. That's something that's having an immediate impact, not because it's fun and exciting, but because it's literally saving lives, and that's really humbling to be a part of.
Hugo: Can you give us an example of how it's used in epidemiology or public health, for example?
Mike: Yeah. So, one analysis I've been working on recently with some colleagues at Imperial College London is understanding the efficacy of malaria vaccines. And this is a really challenging problem because most people in epidemiology talk about efficacy in terms of “you have malaria or you don't”. But what these vaccines do is they decrease the amount of malaria that you have. It's not this binary problem in which you have no malaria parasites or you have a lot of malaria parasites. There's a lot of intermediary situations. And when you're talking about whether or not a vaccine's going to be effective, you really have to understand how it affects the amount of malaria itself. Because all it takes is one carrier to have a little bit of malaria in them, and then all of a sudden it can re-fuel an epidemic. And this is one of the challenging things with all the vaccines: you really have to understand how they work together, or how new vaccines can work together synergistically to really drop that amount down so that we can't have these epidemic flare ups in the future. When I started this analysis, I really just sat down with my collaborator and asked her, "How was the data collected? What happened? What goes on in the lab?" And we talked about it and we wrote this model, and we went back and we iterated on it, and we had this ... It took a while to really iterate and make sure that it was complex enough to be able to answer the questions we wanted. And at the end, we had this really, really cool analysis. And we had these eight different variants. We had a control with seven different vaccine combinations. But there was this one arm that didn't work, which was really weird. We're validating our model, we're looking at the fit, we're seeing if it's really reasonable, and everything works well except for this one vaccine. 
And that was troublesome and we spent a lot of time trying to figure out what could have been going on, and eventually we just accepted defeat and started wondering whether maybe the data was somehow corrupted. So we went back to the lab and asked, "What's going on with this dataset? Can you tell us if there's anything weird about this particular measurement?" And it turns out that the day that the data was collected, the lab was painted, and there were literally fumes roaming around the lab where the mosquitoes were bred, and it was noted in the lab notebook that the mosquitoes were dying prematurely because of those fumes.
Hugo: Amazing.
Mike: Right? And this is not something that ... It's something that had been recorded, but it's not something that had been well appreciated, that this data was largely corrupt because of these external factors that were going on.
Hugo: That's incredible that it emerged out of the data modeling process, the statistical modeling process, you could go back and discover that this had happened.
Mike: Yeah, right? Because we take what we thought was going on, we built a model out of that, and despite the fact that it's a relatively crude model in the grand scheme of things, it was rich enough that we were sensitive to these kinds of variations. So it tells you, gives you a sense of just how powerful it can be in understanding how these vaccines are working, that we can see an effect as small as the lab was painted and the mosquitoes were going a little bit crazy.
Hugo: And I know something you're really adamant and passionate about is thinking about exactly how the data was generated or what the experiments were. So maybe you could just tell us a bit about the experimental process in this case.
Mike: Yeah. So basically the way these vaccines work is they try to limit the life cycle, or they try to obstruct the life cycle of malaria. So in a mosquito, a mosquito's going to bite a human, and it's going to pick up some blood, and there's going to be some malaria parasites in there. Those parasites are going to sexually reproduce. They turn into eggs. Those eggs then turn into these spores that clog up the salivary glands of the mosquito. When a mosquito feeds, the first thing it does is try to clear out its proboscis so it has enough room to suck up blood, and when it does that, because it has these spores plugged in there ... It's literally this little plug of malaria matter ... it basically shoots the plug's malaria into the human. It's this amazingly well evolved system to propagate malaria as efficiently as possible. So these vaccines either limit how much the malaria parasites are able to reproduce in the human, or they try to limit, they try to disrupt that cycle within the mosquito itself. And so these experiments model that life cycle. We have these mosquitoes, they feed on infected blood, and then those ensembles of mosquitoes are literally dissected. You literally take a few of them, you pull them apart with forceps, you take the stomach out, which is like a little squid, you put it under a microscope, and you count the number of malaria eggs that you see. And then you take the salivary glands and very carefully pull them out by hand, and you count the number of spores that you see. And by building up a model of how many parasites were in the initial blood, how many eggs we saw, how many spores we saw, how much malaria was in the final blood, we're able to build a model of that propagation cycle of how malaria evolves in the mosquito ecosystem. And then we can compare how that works with controls versus different combinations of vaccines and different vaccine doses.
Is hands-on experience important for people building models?
Hugo: That's awesome and not for the faint of heart. I'm wondering: you're a model builder and tool builder, but have you done these experiments yourself? And that leads to another question: is it important for people building models to try to get a bit of hands on experience with how the experiments actually work?
Mike: So I did not collect any data that was used, that’s perhaps for the best, but I was fortunate enough to go in and see the process. So I went in with one of the lab techs, and I was given the opportunity to try to dissect a mosquito. It is real hard. You have these forceps, and you're looking under a microscope, and you're trying to dissect this mosquito. Right? So you have these little micro-tremors in your hands which under the microscope just look like your forceps are going crazy. So the amount of patience and skill they have is remarkable.
Hugo: And is it important for a model builder to have this type of experience, to at least understand that in some way?
Mike: Yeah. Absolutely the model builder has to have a very deep relationship with the people who collected that data. As a statistician, I am a translator. I take someone else's story of how the data was collected and try to translate it into a mathematical language. I'm not creating that story. And having the experience of going in and collecting some of the data or watching people collect the data, seeing it first hand, that just makes the probability that you mistranslate something all the smaller. Right? And I think that really stamps just how much statistics is a collaborative endeavor. This is not something where some people collect data over here and other people analyze it over there. This is a collaborative process where everyone has to be working together to get the best out of the data.
Hugo: And so, these are a number of very interesting use cases you've told us about. I just want to say that also, the tools you build, Stan for example, are used all over the place. So, was it last year that Facebook started using Stan for its product Prophet?
Mike: Yeah. So I'm not sure when, it was sometime in the last few years. They have been developing internal data science tools. And one of the things that happens at these large companies is you have people who use R and people who use Python, and they're using all these different interfaces to do their analyses. And so they were able to use Stan, which is a C++ product, but can be used within Python and R and a bunch of different environments, to centralize a lot of their analysis. So they were able to build this one tool that analyzes time series in a very cool way, that a lot of their teams could use just straight out of the box and just dump right into their analysis pipelines.
Stan and Bayesian Modeling
Hugo: So we've mentioned the word Stan at least 15 times together already. So let's shift there. Tell me about Stan, about the software you develop.
Mike: So Stan, which is not an acronym, it's named after Stanislaw Ulam, who is one of the original mathematicians behind the Monte Carlo method, which then gave birth to Markov chain Monte Carlo, and then Hamiltonian Monte Carlo which is what Stan actually does. Stan is basically a suite of software that's aimed at facilitating statistical modeling, in particular Bayesian modeling, and it's really three components. The first is a modeling language. It's what we call a probabilistic programming language. And it's just like in any kind of code you go and write a program that defines an executable. Well in a probabilistic programming language, you just sit down and write a bunch of code that specifies a probability distribution. And that's your model. In a Bayesian analysis, your model is specified in terms of probability distribution. And once you've specified that in our language, we then take care of the rest. So we have the state-of-the-art automatic differentiation library that allows us to take your model, differentiate it, get all kinds of information about it, and then we plug all that information into a state-of-the-art augmentation of what's called Hamiltonian Monte Carlo, which is able to fit that model in a very, very efficient, robust way. So the idea behind Stan is to try to separate out the responsibilities of analysis, try to make it easier for users to just worry about building their models and we'll try to take care of automating as much of the computations as possible.
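The division of labor Mike describes (a program that specifies a probability distribution, its derivative, and a Hamiltonian Monte Carlo sampler that consumes both) can be sketched in plain Python. This is not Stan's implementation: it is a textbook fixed-step HMC, the gradient is derived by hand where Stan would use automatic differentiation, and the model (a success probability p with a uniform prior, 3 successes in 10 trials) and all function names are assumptions made for illustration.

```python
import math
import random


def sigmoid(theta):
    return 1.0 / (1.0 + math.exp(-theta))


def log_density(theta, successes, trials):
    """Log posterior density, up to a constant, on theta = logit(p).

    Assumes a uniform prior on p. The extra +1 in each exponent is the
    Jacobian of the logit transform, p * (1 - p).
    """
    log_p = -math.log1p(math.exp(-theta))
    log_1mp = -math.log1p(math.exp(theta))
    return (successes + 1) * log_p + (trials - successes + 1) * log_1mp


def grad_log_density(theta, successes, trials):
    """Hand-derived derivative of log_density with respect to theta."""
    return (successes + 1) - (trials + 2) * sigmoid(theta)


def hmc_draws(successes, trials, n_draws=4000, step=0.3, n_leapfrog=10, seed=1):
    """Textbook HMC with a fixed step size and unit mass (no adaptation)."""
    rng = random.Random(seed)
    theta = 0.0
    draws = []
    for _ in range(n_draws):
        momentum = rng.gauss(0.0, 1.0)
        new_theta, new_momentum = theta, momentum

        # Leapfrog integration: half momentum step, alternating full
        # steps, and a final half momentum step.
        new_momentum += 0.5 * step * grad_log_density(new_theta, successes, trials)
        for i in range(n_leapfrog):
            new_theta += step * new_momentum
            scale = step if i < n_leapfrog - 1 else 0.5 * step
            new_momentum += scale * grad_log_density(new_theta, successes, trials)

        # Metropolis correction based on the change in total energy.
        h_old = -log_density(theta, successes, trials) + 0.5 * momentum ** 2
        h_new = -log_density(new_theta, successes, trials) + 0.5 * new_momentum ** 2
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            theta = new_theta
        draws.append(sigmoid(theta))  # report draws on the p scale
    return draws


if __name__ == "__main__":
    p_draws = hmc_draws(successes=3, trials=10)
    print("posterior mean of p is roughly", sum(p_draws) / len(p_draws))
```

The user-facing part is only `log_density`; the rest is machinery that a tool like Stan aims to automate.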
Hugo: So maybe you can tell us a bit more about what Bayesian inference is.
Mike: Bayesian inference is a way of trying to quantify uncertainties using probability theory. So we talked a lot about data. Right? You have some observations and you have some model of how that data was generated. The problem is there's lots of ways that data could have been generated. Right? There's lots of just different variants we could have got, depending on when we collected the data, how we collected the data, who collected the data. And that variation always kind of obfuscates what we can learn. There's always uncertainty in what we can learn about a system, and there's various ways of trying to quantify the uncertainties. So there's frequentist inference, which quantifies it in a very particular way, and then there's Bayesian inference, which tries to quantify it using probability theory. So if you have a model with a bunch of different configurations, the configurations that have very high probability, what we call a posterior probability, are the ones that are most consistent with the data that we saw, and the configurations that have very small probability are less consistent. And you get this really nice quantification of what's consistent and what isn't, and we can use that quantification to report what we learned about this model, or then to make decisions based on what we learned.
Hugo: Could you unpack this with a concrete example?
Mike: Absolutely. So let's say that you have a coin. Or we have some kind of process that results in a success or a failure. And that process has some probability. How likely is it that we get a success? How likely is it that we get a failure?
Hugo: So we can think of this as infected or not, or survived or not.
Mike: Yeah, infected or not, survived or not, somebody clicked on your website or not, whether they engaged or not, you flip a coin, whether it's a head or a tail, you roll a die, whether it's a one through three or four through six. This is a crude model that approximates a lot of different processes with just this one parameter, P. And a priori, we don't know what that parameter is. We can go out there. We can try to generate some data. We can see how many people engage with the website. We can see how many patients survived. We can see how many patients were infected. But that's just one dataset. It doesn't tell us the actual probability of success or engagement or infections. And so we need to take that data, take the number of successes that we observed and turn it into some quantification of what that true probability is.
Hugo: Great. So to break that down even more, very briefly, let's say three out of 10 people are infected. You'd say the probability could ... estimation of the probability would be 0.3 or 3/10, but this is just a sample taken from the underlying process, so you need error bars on that three, or a distribution of possible probabilities in that sense.
Mike: Right, so if you had repeated that measurement, you might have gotten a two, or a four, or sometimes a five. Sometimes zero. And so, how do we take the fact that we could have gotten, we could have measured other things, into account in our analysis? And in Bayesian inference, we do that with two main ingredients. You start with a prior distribution over this parameter that we don't know, over this probability, which quantifies what is reasonable about that parameter based on everything that we know for the analysis. So it's really everything that we ... all the information available to us before we make our measurement. And then we have a likelihood function, which is just a mathematical way of writing down the fact that we have two outcomes and the success is controlled by this probability P. So for those who have taken an introduction to statistics class, this likelihood function would be a Bernoulli likelihood, or a Bernoulli density. And then we put these two things together. So the prior tells us what we knew about our system beforehand. The likelihood tells us in some sense what we learned from the measurement that we made. And together they give us a posterior distribution, that quantifies everything that we know about this probability P, conditioned on the measurement that we've made. So we might start off with a prior distribution that's very diffuse, because we don't know a lot about this probability beforehand. But then when we go and make our measurement, 3/10, that tells us a lot about what the data's trying to tell us. And the more throws we have, the more trials we have, the more informative that likelihood is. And then we end up with this posterior distribution that concentrates, that contracts away from that initial prior to something that's a little bit more narrow. And that narrowing is our learning process. It's a reduction of uncertainty, because of the information that was contained in the data.
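The prior-times-likelihood recipe can be carried out directly for the 3-out-of-10 example. The sketch below is one way to do it, not something prescribed in the conversation: it assumes a flat (uniform) prior as the "very diffuse" starting point, multiplies the ten Bernoulli terms into p^3 (1 - p)^7, and normalizes on a grid of candidate values for P. The function names and grid size are illustrative choices.

```python
def grid_posterior(successes, trials, n_grid=1001):
    """Approximate the posterior over P on an evenly spaced grid."""
    grid = [i / (n_grid - 1) for i in range(n_grid)]

    # Assumed diffuse prior: every value of P equally plausible.
    prior = [1.0 for _ in grid]

    # Product of independent Bernoulli terms: one factor of p per
    # success and one factor of (1 - p) per failure.
    likelihood = [p ** successes * (1.0 - p) ** (trials - successes) for p in grid]

    # Posterior is proportional to prior times likelihood; normalize
    # so the grid weights sum to one.
    unnormalized = [pr * lk for pr, lk in zip(prior, likelihood)]
    total = sum(unnormalized)
    return grid, [u / total for u in unnormalized]


def posterior_mean(grid, weights):
    return sum(p * w for p, w in zip(grid, weights))


if __name__ == "__main__":
    grid, weights = grid_posterior(successes=3, trials=10)
    print("posterior mean of P:", posterior_mean(grid, weights))
```

With a uniform prior this posterior is known in closed form as a Beta(4, 8) distribution, whose mean is 4/12, which gives a quick check on the grid approximation.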
Hugo: Great, and so what you see in the posterior then is that maybe it was 50% likely that the click through rate was between 0.2 and 0.4. Or it was 10% likely that it was between 0.5 and 0.7, or something like that.
Mike: Right, exactly. If you give me any interval of probabilities, I can tell you how much uncertainty ... how much certainty I place on that, how consistent that set of values is.
Hugo: Great, and one of the beautiful things is, you see the more data you get, the narrower the distribution gets, so the more certain you can be of your estimates.
Mike: Right. It's a self-consistent way of building up these inferences, which is a pretty remarkable mathematical feat. That said, we have to be careful, because that narrowing does rely on the assumptions that we put into our model. If you build a bad model, it will narrow to a bad value. It will pull away from what we actually see. And so you have to be cognizant of that. It's a powerful tool within the scope of the assumptions that we make.
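Both points, assigning a probability to any interval and the posterior narrowing as data accumulates, can be checked numerically. The following sketch again assumes a uniform prior and a grid approximation; the second dataset (30 successes in 100 trials) is hypothetical, chosen only to keep the same observed rate with ten times the data.

```python
import math


def grid_posterior(successes, trials, n_grid=2001):
    """Grid approximation to the posterior under an assumed uniform prior."""
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    unnormalized = [p ** successes * (1.0 - p) ** (trials - successes) for p in grid]
    total = sum(unnormalized)
    return grid, [u / total for u in unnormalized]


def interval_probability(grid, weights, low, high):
    """Posterior probability that P lies between low and high."""
    return sum(w for p, w in zip(grid, weights) if low <= p <= high)


def posterior_sd(grid, weights):
    mean = sum(p * w for p, w in zip(grid, weights))
    variance = sum(w * (p - mean) ** 2 for p, w in zip(grid, weights))
    return math.sqrt(variance)


if __name__ == "__main__":
    for successes, trials in [(3, 10), (30, 100)]:
        grid, weights = grid_posterior(successes, trials)
        print(
            f"{successes}/{trials}:",
            "P(0.2 <= p <= 0.4) =", round(interval_probability(grid, weights, 0.2, 0.4), 3),
            "| posterior sd =", round(posterior_sd(grid, weights), 3),
        )
```

The narrowing only reflects the assumed model: if the success/failure story is wrong for the data at hand, the posterior will still contract, just around a misleading value.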
Hugo: Absolutely. And that's similar to saying a bad carpenter will build not great tables.
Mike: Yeah, right. Garbage in, garbage out.
Hugo: There's another objection to this type of modeling that I hear quite often, which is, "How do you choose the prior? Your model is so dependent on the prior." And I'd just like you to demystify that for us a bit.
Mike: In frequentist inference, there's this feeling of, "Okay, I just choose this model, I just choose this likelihood, and then I'm done and I get some answer out." And then you contrast that to Bayesian inference where you have this likelihood and you have this prior distribution, and it's very easy to look at that and say, "Well, there's more stuff you have to do here. Whereas over in the frequentist analysis, I didn't have to do any of that." And unfortunately, that's a misreading of how frequentist statistics works. To do a proper frequentist analysis, you need that likelihood, but you also need a lot of other things. You need loss functions, you need calibration criteria. It's a very sophisticated statistical approach with a lot of inputs. And ultimately it's not that frequentist inference or Bayesian inference is better than anything else. They're different approaches that require different assumptions. And that's one of the things that we have to come to grips with. When we're doing statistics, those statistics, that analysis that we do, is going to depend on our assumptions. There's no correct set of assumptions. My assumptions aren't better than yours. All that matters is that we can communicate those assumptions, we can discuss them, we can agree upon them. And if we agree upon them, then we have to agree what the consequences of those assumptions are.
Hugo: Absolutely. And something that I think is very much in favor of Bayesian inference is that you actually have to make your assumptions explicit, which you can do in a frequentist setting, but a lot of the time it isn't done.
Mike: Yeah, and that's, I think, an unfortunate consequence of a lot of the way that statistics is taught, where you learn these rote methodologies without really paying attention to what assumptions are implicit in them. Whereas Bayesian inference is really all about specifying the model and then putting everything together and getting the posterior. The very elegant workflow of Bayesian inference makes it very, very easy to see what your assumptions are. And a lot of people find that unsettling, but that's a really, really powerful feature. And it really allows you to not only acknowledge that you're making assumptions, but it really helps you understand the consequences of those assumptions. By looking at how those assumptions affect your analysis, both with the data that you collected and with simulated data, you really get a sense of what those assumptions mean in a mathematical setting, which is extraordinarily powerful. In fact, that's where a lot of our research is going these days, is really trying to formalize that procedure and giving users a way of really understanding the consequences of the assumptions that they make.
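One simple way to see how an assumption affects an analysis is to refit the same data under several priors and compare the posteriors. The sketch below is a hypothetical illustration, not a method described in the conversation: it assumes a binomial likelihood with a conjugate Beta prior, and the prior labels and counts are invented.

```python
def beta_binomial_posterior(successes, trials, a, b):
    """Return (mean, sd) of the Beta(a + s, b + n - s) posterior."""
    a_post = a + successes
    b_post = b + trials - successes
    total = a_post + b_post
    mean = a_post / total
    var = a_post * b_post / (total**2 * (total + 1))
    return mean, var**0.5


# Three hypothetical priors that encode different assumptions about the rate.
priors = {
    "flat Beta(1, 1)": (1, 1),
    "skeptical Beta(2, 8)": (2, 8),
    "optimistic Beta(8, 2)": (8, 2),
}

# The same observed proportion (70%) at two made-up sample sizes.
datasets = {"small": (7, 10), "large": (700, 1000)}

for size, (successes, trials) in datasets.items():
    for name, (a, b) in priors.items():
        mean, sd = beta_binomial_posterior(successes, trials, a, b)
        print(f"{size:>5} data, {name:<22}: mean {mean:.3f}, sd {sd:.3f}")
```

With ten observations the three priors give visibly different answers; with a thousand they nearly agree. Writing the priors down is what makes that comparison possible in the first place.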
Robustifying Data Science
Hugo: This is really cool because one of the things we're here to talk about is robust data science and robustifying data science with statistical modeling. And I think we're actually converging on a very important part of developing a robust data science practice, which is making your assumptions explicit.
Mike: Yeah, absolutely. Remember, it goes back to incorporating this idea of domain expertise and statistical expertise. As a statistician, I don't know what good assumptions are in a given field. I might have some biases about them. I may have some intuitions about it, but ultimately it's the domain expert who's going to tell me which assumptions are reasonable or not. But we can't have that discussion unless they can admit what the assumptions are. And building a model forces us to engage on what those assumptions are and whether or not they're relevant. It's really cool. It really does force this out of you. Many practitioners, when you start having these conversations, are very hesitant to admit to the assumptions that they're making. But it's almost ... it's a little addictive. Once you finally get over that hump, you just see these assumptions everywhere and all you want to do is share and discuss, which is a remarkable thing.
The difference between data science and statistical inference
Hugo: So I now want to pivot to the elephant in the room, which is something we've been tiptoeing around. How does data science differ from statistical inference? Because we've been discussing both, right?
Mike: Right. So this might be a little bit controversial, but I actually don't think there's that much of a difference between data science and statistical inference. I think a lot of the seeming separation between the two really has to do with the fact of how statistics is taught. In particular, in academia, statistics has become this very theoretical topic in a lot of cases. There's some very, very powerful exceptions, but in general, most statistical departments do a lot of theory. And so when applied people are trying to develop analyses, there's not a lot for them to use, and so they tend to make stuff up on their own. And I think this is one of the reasons you saw the rise in machine learning, was that there was this niche that really wasn't being served and computer science stepped in and offered a lot of tools. And there's been an evolution of machine learning, and it's become this very powerful field, but it's also separated out a little bit from a lot of applied analyses. It's really focused on what it's good at. And this has then given rise to this new niche where data science has come in. And the problem with this, from my perspective, is that these niches are all statistics. None of this is new stuff. It's all statistical analysis. The niches are not due to what the tasks that need to be done are, rather the niches are consequences of what documentation and teaching and tools aren't being provided. And the danger in all of this is that, that statistical analysis that we have to build, if we want it to be robust, if we want our assumptions to be clear, if we want everything to be mathematically self-consistent, that has to be an integrated analysis. I have to sit down and work with the people who collected the data, work with the domain experts who understand the context of that data, and the consequences of assumptions, to build an analysis that's compatible with all of that. 
And then I have to work with them to report those analyses and visualize the results. And I might even work with stakeholders to help turn those inferences into decisions. But if we start divorcing those steps, if I have to build a model without talking to people who collected the data or the stakeholders who are going to make decisions with it, then that analysis is going to suffer. And if we start having data science as this set of tools that people specialize in and statistics is over here on the side, and they're not talking to each other, that's going to lead to worse analyses.
Hugo: I agree with that.
Mike: Yeah, right? There's a lot of very powerful work being done in data science. There's a lot of very powerful work being done in statistics. And I think all of it would be much more beneficial to both the theory and the applied fields if people were speaking the same language. And I think that language is statistics, not because statistics is better, but just rather because it is that mathematical formalism that really does give us a foundation to build off of.
Hugo: So I agree with a lot of what you've said completely. I want to unpack a few things and I actually want to play devil's advocate as well, in the sense that there are arguments that data science is merely statistics combined with programming skills or hacking skills, and that being a statistician doesn't necessarily involve those hacking skills. And this is something that gave birth to the discipline of data science along with other things. So how do you view that?
Mike: Yeah, so very much in statistics, there is this one kind of compartmentalization. There is the inferential theory you put down, and then there's the rule of how that theory works, and then there's the assumptions that you have to introduce to it. And so, the math is all concerned with what are the rules, how do you have to write things down, what's the language that we use to talk about things. Then there's the question of how do we implement those rules. And then there's the question of what assumptions we want to introduce and how do we develop workflows for introducing those assumptions. And there is an argument there that you could separate those out, but at the same time, if I'm implementing these statistical inferential rules in software, I need to know how those rules work.
Hugo: So then your argument is that programming is actually part of statistics.
Mike: Absolutely.
Hugo: How far do we go though? I want to keep it relatively high level but how far do we go? Is database management then part of statistics? Is scalable computation using Hadoop part of statistics? Putting machine learning algorithms into production or building data products? Would you argue that all of these are part of statistics?
Mike: Yes, so for example, let's talk about database management. Database management involves the possibility of data being corrupted. Database management involves how the data gets organized, collected and selected. That's all part of the measurement process. If I don't know how that's working, then I don't know how to build a model. Now I can approximate that by ignoring a lot of that subtlety, but that limits how much I can learn from that data. I don't have to be a database expert, but I have to be sitting in a room where I can talk to that database expert. We have to be able to communicate to integrate and build this model together. I think something like Hadoop, scalable computation, this is another really important factor. A lot of that scalable computation just isn't really compatible with the underlying rules of probability. These very rigid ways we have to do our inferences, they don't quite mesh with that stuff. And there are certain cases where they do. If you have very simple models, there is a natural way to exploit those computational resources, but if you don't have those simple models, then you can't. And so if you just have somebody who's an expert in Hadoop and tries to throw Hadoop at everything, they're going to be misapplying it in a lot of cases.
Hugo: In that case, I think, this is a matter of semantics in a lot of ways, because I think what I gather you're saying is that anything you may need to do, to work with data and to model, comes under the term statistics. Whereas my argument is that when you start having to do all these other things, such as actually building data products, you become a data scientist, as opposed to a statistician. But your statement is that these are enveloped in statistics because they're involved in the model building process.
Mike: Right. So in my opinion, the fundamental language that we're building all of this off of, is statistics. The computer scientists are implementing statistical theory. The model builders are building models within ... that should be compatible with statistical theory. The people who are collecting and curating data, they're doing that as part of the measurement process, that needs to then be incorporated into the statistical theory. So maybe a better way of saying it is that I think that when these analyses have to be built, you need a product manager. And maybe the best way of saying it is that, that product manager should be a statistician. And so if you want to ... It is semantics, it is very much semantics, but I think that whoever's integrating all that together, whoever's having that communication, needs to be very well versed in statistical theory, to ensure that the resulting analysis is robust and that it will lead to very solid decision making.
Hugo: Absolutely. And I agree with that completely. And I'm pretty much, 75% sold on there being not much of a distinction between data science and statistics, but I'll have to put large error bars on that 75%.
Mike: Large uncertainties.
Hugo: Exactly. Great, so what do you think are the biggest challenges facing our community as a whole, whether we call it statistics or data science, or really the modeling community, moving forward?
Mike: Yeah, I think perhaps the biggest challenge is being open and transparent. Certainly in the news, there's a lot of discussion about data being used in various ways. Even when data is publicly available, there's not a lot of transparency in how the data is used. And so I think to ensure that we're using data responsibly, to ensure that whatever decision making processes we're doing are fair and equitable as much as possible, as much as feasible within the mathematics, we need to be open and transparent. We need to speak the same language so we can all talk to each other and evaluate the assumptions going into these things. And I think one of the biggest challenges is deciding what that language will be. On one hand, it has to be compatible with the mathematics, but on the other hand it has to be accessible to the domain experts. If we're going to build up some analysis, we need social scientists involved, we need scientists involved, we need computer scientists involved. And somehow we have to all speak the same language that allows us to get that done. And that's, I think, going to be a real challenge. And hopefully, tools like probabilistic programming languages will go a long way to filling some of that gap. It probably won't be the final solution, but hopefully it's a stepping stone towards that. And I think a corollary of that is just being really vigilant about the assumptions. Just acknowledging that there's uncertainty and that uncertainty depends on the assumptions that we put into our models. And once we've done that, just being vigilant about understanding the consequences of those assumptions.
Hugo: Absolutely, and particularly as you say with more and more of the public eye on the modeling and data analysis and data science community. We do need to be vigilant and be responsible for the models we build.
Mike: Yeah, right. It's very easy to sit down, and take some data, and plug it into some program that automates an analysis, and just drop it out, and it becomes this just rote commodity thing. But if you're really exploiting statistics to make decisions, there are important consequences to that process, whether you're in medicine or science or industry. There's a certain temptation to just take data, plug it into some black box tool, and just do whatever that tool tells you to. Even if that tool isn't telling you to do something evil, if it's telling you to do something that's not consistent with the mathematics, or is pretty fragile, it can lead to really bad decision making that has consequences on real people. And I think the more data and statistics become integrated into our decision making processes, as potentially powerful as that is, the more we have to be cognizant and responsible to make sure we're doing that right.
Favorite Statistical Techniques/Methodologies
Hugo: So we've discussed a lot about the different type of work you've done, the tools you're building. I'm just wondering what one of your favorite statistical techniques or methodologies is.
Mike: Well, as I mentioned before, I kind of got into statistics via this algorithm called Hamiltonian Monte Carlo, which I'm particularly biased towards. It's a really, really powerful algorithm, but the mathematics behind it is just super cool. I've done a lot of research on it and I'm just still in love. It's one of those relationships where you see the older statistician and his algorithm walking together, and they're still in love.
Hugo: It's a beautiful picture.
Mike: It's amazing. And I've just learned so much about statistics and computation from really trying to understand this algorithm and implementing it as efficiently as possible. But as I said before with Stan, all that stuff is ideally hidden away from users, and that leaves modeling techniques. And I think one of the most powerful, yet underutilized modeling techniques is hierarchical modeling. Hierarchical modeling is a very generic way of trying to incorporate heterogeneity into your analysis. So if you're modeling whether or not somebody's going to get sick or not, everyone's different. Everyone's going to respond to the same illness differently. Our physiologies are just so variable. And if we just assumed everyone was the same, that leads to some pretty significant bias in the results that we get. But if we explicitly model the heterogeneity, if we allow people just to be a little bit different from the average, then we can incorporate a lot of that variation into the analysis in a self-consistent way. And this really ensures that we get very well calibrated uncertainties. We can significantly improve our uncertainties in a lot of cases. It's just really, really powerful and it's just omnipresent in its applicability.
Hugo: Fantastic. And if people wanted to build hierarchical models, they could do this using Stan, right?
Mike: That's literally why Stan was developed. Stan came out of Columbia University. Andrew Gelman was trying to build these hierarchical models in WinBUGS, which is a previous tool that was very revolutionary for its time, but it just wasn't quite up for fitting these hierarchical models. And so they started playing around with these more modern tools and Hamiltonian Monte Carlo, and automatic differentiation, all as a way of fitting these hierarchical models, and we got this wonderful tool kit out of it as a consequence.
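The idea of letting each individual or group be "a little bit different from the average" can be sketched in a few lines of Python. This is a simplified illustration and not how Stan fits hierarchical models: it assumes a normal-normal model in which the within-group scale `sigma` and the between-group scale `tau` are treated as known, and it plugs in a point estimate of the population mean instead of sampling the full posterior. The clinic names and measurements are invented.

```python
import statistics


def partial_pooling(groups, sigma, tau):
    """Shrink each group's mean toward a shared population mean.

    Assumes y_ij ~ Normal(theta_j, sigma) and theta_j ~ Normal(mu, tau),
    with sigma and tau treated as known (an assumption of this sketch).
    """
    means = {g: statistics.fmean(ys) for g, ys in groups.items()}
    # Marginally, each group mean is Normal(mu, sqrt(sigma^2 / n_j + tau^2)),
    # so the population mean is estimated by a precision-weighted average.
    weights = {g: 1.0 / (sigma**2 / len(ys) + tau**2) for g, ys in groups.items()}
    mu_hat = sum(weights[g] * means[g] for g in groups) / sum(weights.values())
    estimates = {}
    for g, ys in groups.items():
        data_prec = len(ys) / sigma**2  # information from the group's own data
        pop_prec = 1.0 / tau**2         # information from the population
        estimates[g] = (data_prec * means[g] + pop_prec * mu_hat) / (data_prec + pop_prec)
    return mu_hat, estimates


# Made-up measurements from three hypothetical clinics of very different sizes.
groups = {
    "clinic_a": [4.1, 5.3, 4.8, 5.0, 4.6, 5.2, 4.9, 5.1],
    "clinic_b": [7.9],
    "clinic_c": [5.5, 6.1, 5.8],
}

mu_hat, estimates = partial_pooling(groups, sigma=1.0, tau=0.5)
print(f"population mean estimate: {mu_hat:.2f}")
for g, ys in groups.items():
    print(f"{g}: raw mean {statistics.fmean(ys):.2f}, partially pooled {estimates[g]:.2f}")
```

The clinic with a single observation is pulled strongly toward the population mean, while the clinic with eight observations barely moves. A full Bayesian fit would also infer `sigma`, `tau`, and the uncertainty in every estimate.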
Hugo: So do you have any final call to action for our listeners out there?
Mike: I think one of the most important things when trying to do robust statistics, isn't so much getting the answer right, it's acknowledging that you might not have gotten the answer right. And so just acknowledging that there's uncertainty, being cognizant of that, recognizing that uncertainty is impacted by assumptions. Once that becomes a priority in your mindset, even if you don't necessarily know how to model that better, or you don't know how to improve your analysis, just that you're aware of that, I think has a very powerful subconscious effect on how you report your results, the words you use, how strong your claims are, and if everyone was just a little bit more aware of uncertainty and presented the results in a little bit more of a careful way, I think it would go a long way towards improving communication between data scientists and statisticians, and data scientists and other data scientists, and all of us in the public.
Hugo: Or the stakeholders involved in all these analyses. So listen to Mike. Acknowledge uncertainty and the impact of the assumptions that you make. Mike, it's been an absolute pleasure having you on the show.
Mike: Thanks so much for the talk, Hugo. It's always, always a blast.
Hugo: It is.