An Interview with Danny Kaplan
June 14th, 2016 in Data Analysis
Danny Kaplan teaches statistics, applied math, and computing at Macalester College in St. Paul, Minnesota. He is active nationally in statistics education reform and the author of "Statistical Modeling: A Fresh Approach." This interview has been edited for clarity and flow.
In which Danny Kaplan discovers statistics
Danny, you're currently engaged in creating a course for DataCamp, Introduction to Statistical Modeling. Perhaps you could tell us a bit about yourself and how you got involved with DataCamp.
My background has been in science, not in math or statistics. I did physics and biomedical engineering, especially in cardiology. I used to work in a physiology department at McGill Medical School, but I came back to the US because I was interested in teaching at a small liberal arts college. It happened that, while I was here for my interview, they asked whether or not I could teach statistics. They said, "We always need someone to teach statistics. Could you teach it?" For no other reason than that it was a job interview, I said, "Sure, that would be fun." I had no experience. I came to Macalester, and I started teaching statistics. I was about a week ahead of my students. I had never heard of a t-test.
I've always enjoyed writing textbooks and my first one was a book on nonlinear dynamics that I wrote with Leon Glass. Then, when I arrived at Macalester, I was teaching stats, but my main emphasis was on computational science. I didn't know anything about statistical software, and when I arrived here in 1996, they gave me a little cardboard box that had JMP on a floppy disk. I had some acquaintance with S+. I asked the college to get a license for S+, but they said they wouldn't, that it was too expensive, and that we already had SPSS, whatever they say. I heard about this little project in New Zealand, and I FTPed to Auckland to get R.
It was in a very early state, but I did it because I could see that it was a real programming language. It has developed hugely since then, but I believe I'm the first person in the US to teach a course using R. I've been doing it for 20 years, since the very early days. I like this computationally-driven pedagogy. It's different from the usual textbook pedagogy. Since I'm interested in pedagogy, and I'm interested in computation, I've tried to stay familiar with the developments in those areas. A lot of it is developments in publishing. I look for new ways to reach out to students. One of the great accidents in my life was my introduction to RStudio. At Macalester, we were the beta site for RStudio. JJ, the principal, and Joe Cheng actually came here before they released the software.
R has really increased my sophistication, and all of the things that RStudio has encouraged, R Markdown, Shiny. No one will tell you that. I'm interested in how these things work, and I was interested in working with JJ on a better platform for teaching. JJ said, "Start with DataCamp, and see how that's working out for you."
When was that?
That was pretty recently. That was this last summer. I downloaded your teach development software, and I played around with it, and got some things to work. I wanted to see how easy or difficult it would be to write questions. I became interested in this, especially because I wanted a way to reach out to students directly. My ideas are a little bit unconventional.
Could you tell me a bit about how you feel about the conventional ways of statistics and how it's being taught?
The first thing to say is that the statistics education community is a great community and they're really interested in doing things in an effective and useful way. If you look at statistics, what's taught in Introductory Statistics, from the perspective of a contemporary researcher, there's nothing in it for you. It's teaching you early principles, which were important historically, but have been supplemented by the ability to compute cheaply. The algebraic focus and the formal probability focus, none of that is actually needed to understand statistics. What you lose, when you do this, is the ability to do statistics of relevant complexity to the kinds of problems that people do. In almost every introductory statistics course, there is at most one explanatory variable. The word "covariate" does not show up in introductory statistics textbooks.
The idea of the model is completely foreign. Although sometimes you will see a model in the title, they don't really talk about modelling, they end up talking about these very, very simple kinds of...We could call them models, function estimations that are like the t-test and simple linear regression, they call it. My view is that the key skill, the key attribute that students need to have to succeed is motivation. It doesn't actually matter what their technical background is. Admittedly, there are some students who are so mathphobic. You're not going to be able to do much with them. Figure out the technology or the techniques for making that interesting topic accessible to students. Don't start with simple techniques because you think they're simple. Start with interesting techniques and useful techniques.
That always involves covariation. To be honest, it always involves causation, causality. It's a disaster in Introductory Statistics what they do about causality. They say, "If you can't do an experiment, don't say anything about causation." Almost always, we need to say something about causation, and in almost all experiments except bench-top lab experiments, as in efficiency, there's non-compliance. There's what they call pollution. What we call an experiment is halfway between an ideal experiment and observation. You need those techniques for doing causal inference in statistics.
I like that a lot because what you're promoting is a more holistic approach to the field of statistics as well. It comes down to realizing that statistics is a fundamental part of the scientific method, which ties into experimental design and data collection all the way to data analysis, and not thinking of statistics as an entity in itself but as part of this unified set of tools and concepts. Your course will be on Statistical Modelling. My question for you is, from the get-go, what is the difference between statistical modelling and statistics?
Modelling is always the process of constructing a representation of something for a specific purpose. In statistics, you don't hear much about what the purpose is. You hear about algorithms. That's how it's taught: "Take the means of these groups and their standard deviations, combine them in a certain formula, and then look up p at the back of the book. Either the p-value is less than 0.05 or it's not." That's really the way it's taught, and it's not surprising then that people have these ideas about p-values, and that this has come to dominate non-statisticians' professional views of statistics.
You need to bring to the table your knowledge about the system that you're studying. Data is not sets of numbers. Data is numbers or whatever but in a context. Always, we're working with data because there is some interest in the context and we know something about the context. In modelling, you can bring that information into play. A prerequisite for bringing it into play is having techniques that let you deal with covariance. So long as you're doing a t-test, there's very little creativity that can be applied. You really can lean much more on domain-specific knowledge, which I think is something people have to see at every level of statistics: that it starts with domain-specific knowledge.
The role of DataCamp in Data Science Education
Now you're in the process of creating a course for DataCamp. What can a DataCamp course offer that a book cannot?
I don't want to count the things, but among the things that it offers are the immediate feedback that you get. In DataCamp, they've got it at a very fine level of granularity, so that as you're answering an exercise, you can get hints, and you can see very soon whether or not you have an answer that approaches the correct thing. They also provide an adjustable level of scaffolding. You can give a problem that says, "Do this," or you can say, "Here's some of the steps in doing this. Fill in the missing step," and elaborations on doing that. That's very helpful for pedagogy, whether it's about the computer or not. Feedback, being able to scaffold, giving people easy problems at first, and then as their familiarity with the technique builds up, you can give them less support, less scaffolding. In my case, I think that being able to use a computer properly is the skill that algebra, and trigonometry, and calculus used to be. All of those are ways of calculating things, and they have some mathematical beauty, undeniably. Most people don't see that beauty, they just see them as a source of aggravation, but with the problems we work with today, the kind of calculation you need to do is not algebraic rebalancing and moving things from one side to the other, inverting things. The computations are things you do on a computer.
Students need to learn about computing. Not all of them will become programmers, just as ninth graders who take Algebra II don't all become mathematicians, but they will at least have an appreciation for what's going on, and they can do much more than they otherwise could if they know something about how computers work. DataCamp is, of course, intimately wed to the computational platform. I know you do things in Python, and you do things in R. I use R, but it provides a way of interacting with students that's about computation. That's hard to read about. It's much more effective if you can read it, and do it, and if you can provide the scaffolding so that people can start simple, and then build up. That's a really promising aspect of these tools that you all are developing.
I find this really interesting, because what we're talking about now is a student, sitting at home, coding, having interactive feedback, and having short videos with instruction interspersed. Can you see this type of platform entering the classroom at any point, and would that be beneficial to the academy, to education at the secondary and tertiary level?
First, you said something about teaching them coding. I don't like the word coding. I think that programs are a way of expressing concepts. It shouldn't be a code, it should be a notation for expressing ideas. R is very good at letting you do that, and with the mosaic package, which I've put together with two colleagues, Randy Pruim and Nick Horton, we really tried to strip out the syntactical complexity in R, and let you get straight into just expressing yourself in a concise way. dplyr is much the same. You're able to be very concise. It's easier to read than SQL, at least for me. ggplot is another one of these, and we're starting to see them, micro languages for data scraping, data cleaning, and all sorts of techniques. We really are moving away from the code, into concise languages for expressing computational ideas.
The classroom. Whether it will be useful in the classroom depends, first, on whether people use it. It can be hard for many instructors, college level instructors, to have students working with effectively another instructor while they're in the classrooms. There's a lot of prestige built around the idea of being in front of the classroom, maybe of lecturing. Not everything has to be a lecture, and there's much less lecturing than there was when I was in college, but you're handing things off to someone else. I don't know the extent to which instructors will accept that.
But instructors can create their own DataCamp courses, as well, with relative ease.
Instructors can. I find that it's very few people who create a course, and, in fact, there's a knack to writing exercises. It's not so many people who have that knack. That would be challenging. I mean, it's a big commitment of time to develop a course. One thing that I like about the platform is that you make things in small chunks. Developing a course is about five hours of student contact. That's at least approachable, you're not asking for 40 hours, which is what you spend in a classroom in a semester. That's, at least, approachable.
Classes exist because AV technology was poor. It was a voice. Presentation technology was poor, it was a blackboard. Maybe it was a slide projector. It was hard for students to get around. They had to commute to school, and so we arranged things in these hour long chunks, because of the constraints imposed by transportation and audio technology, and the like. These are now completely different. Is it good for students to get together, and to be part of a discussion? I think so. That's a very strong mode of pedagogy. Is it good for them to work together on a project? That's great, too. It's ridiculous to size things in semester long chunks. To look at the absurdity of how it's organized, who in their right mind would arrange that all exams happen in the same week? What are you trying to do, drive students crazy? That part's always going to be the discussions amongst students, students working together, a professor watching what a student is doing. I think there is something in the human psychology of dealing face-to-face with someone that matters as well. The idea that students would come to a room with 500 seats, and each one would sit in front of a laptop, and do their electronic courses, I don't see why to do that. Save the classroom time for other purposes.
I agree. I'm speaking more toward a blended teaching approach, where you have traditional classes with this, perhaps mixed in as a module, and being incorporated, in some sense. Is that something you can see being beneficial?
It would be beneficial. Ultimately, we don't know until we try it. It's a little bit shocking to me, but hardly any colleges have policies on work that you do outside of class. You might take a DataCamp course, or a Coursera course for doing something. Colleges don't have processes by which they can accept that as a token of work. The faculties at many schools have indicated that they're not interested in accepting that as a token of work, but just as going back to your room, reading a textbook is a useful form of studying, so is this kind of interactivity that we're getting. It's the textbooks you'd expect at Hogwarts, that can interact with you. How could that not be better?
The role of Statistics in Data Science
Moving slightly away from the technology, to the content of what you're doing, and what DataCamp is interested in. DataCamp is interested in educating people with respect to the basic skill sets that Data Scientists need. I'm wondering what your opinion is on the role of statistics, and the role of statistical modeling in data science as a broad discipline.
This requires some care. What is it that makes data science different from what we had before? It's the scale of the data, both the number of cases, and the number of variables. It's the flow of the data, data streaming in. That's often an important component. It's often the kinds of goals, so a lot of data science is applied to problems like detecting fraud. It's not exactly building a model, but it's looking at residuals, rather than building a model, if you want. Let's just call it large data, is what's happening with data science. The standard error of things is 1 over root N, so if N is 100 times bigger, your standard error is 10 times smaller. If data are large enough, then you don't need to worry about statistical significance. It's going to happen. That's not to say I want to ignore it. There are statistical concepts which, if you ignore them, can be a disaster for you, no matter how much data you have. There are ideas like multicollinearity. There are ideas like interactions. That's a modeling technique, but if you study Hadoop, you're not going to hear about interactions. You're not going to think about that style of modeling. I think statistics has a huge amount to offer. It's necessary for people who are interpreting data.
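The "1 over root N" remark can be checked with a short simulation. The following is a minimal Python sketch, not anything from the interview or from the DataCamp course: the function names `theoretical_se` and `estimated_se`, the normal data, the sample sizes, and the seed are all assumptions chosen for illustration.

```python
import math
import random
import statistics

def theoretical_se(sigma, n):
    """Standard error of a sample mean when the population SD is sigma."""
    return sigma / math.sqrt(n)

def estimated_se(sample):
    """Standard error estimated from the data: sample SD / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

# Hypothetical data: draws from a standard normal, fixed seed for repeatability.
rng = random.Random(0)
small = [rng.gauss(0.0, 1.0) for _ in range(100)]
large = [rng.gauss(0.0, 1.0) for _ in range(10_000)]  # 100 times more cases

theory_ratio = theoretical_se(1.0, 100) / theoretical_se(1.0, 10_000)
observed_ratio = estimated_se(small) / estimated_se(large)

print(f"theoretical ratio of standard errors: {theory_ratio:.1f}")
print(f"ratio estimated from simulated data:  {observed_ratio:.1f}")
```

The theoretical ratio is exactly 10. The ratio estimated from simulated data should land near 10 but will vary with the seed, because each standard deviation is itself an estimate.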
How would your statistical modeling course fit into an introduction to data science course?
A lot of data science is about data wrangling, and cleaning. These are foundations of data science, because you can no longer type in the numbers on your graphing calculator, something that's still done in a lot of college and high school courses. Those are fundamental. Why is it called statistical modeling? A statistical model is a particular kind of mathematical model. It's one where the stuff you're building it out of is shaped by the data. It's not dictated by the data. You can bring other things into the modeling process, and I think a sensible person does, but it's shaped by data. The people doing data wrangling and visualization, they have the data. Now, they have to learn how to describe that data in a rich, and often quantitative, way. That's what you need statistical modeling for.
Statisticians and Data
I'm wondering if there are any other current practitioners of statistics or applications of statistics, or statistical modeling that affect your practice, and if so, who influenced you? Who would they be?
Oh, no. [laughs] I wouldn't be anywhere without this very rich community of people who are doing things in statistics. There are too many to name, but just to name one, Judea Pearl, for example, who has very insightful things to say about how to do the best job you can in thinking about causation, and in relating causation to data, which don't necessarily come from an experiment, or come from an imperfect experiment. That's incredible stuff. There are all sorts of statistical techniques, modeling techniques that are incredible things. Especially now, you're seeing machine learning techniques, a lot of which are very similar to traditional techniques, but explained algorithmically rather than in algebraic formalism. You've seen these machine learning techniques come in. If I were starting from scratch now, and I have something of an opportunity to do this with DataCamp, I'd start by putting modeling in a machine learning context, because conceptually machine learning can simplify the ways of constructing these models and interpreting the models. Cross-validation, to name one.
Even before that, just having a held-out test set, just splitting the data into a training set and a test set?
That's right. This has been around for a long time. For a class I was rereading a great article from Brad Efron in the late '70s. He wrote an article for the SIAM Journal. He was writing about these techniques. The article is called "Thinking the Unthinkable," and it's about doing these crazy, crazy large calculations, which seemed large in 1978 but are pretty small compared to what we do now. But he was talking about cross-validation. Let's see what else we've got here, my heroes. Here he is, this is an old book by John Chambers. He's had a huge influence on statistical computing, just as Hadley Wickham has, in making methods accessible to people, and providing a nice language. Box, Hunter and Hunter is another fantastic reference.
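To make the train/test and cross-validation idea concrete, here is a minimal Python sketch of k-fold cross-validation for a straight-line model. It is an illustration under stated assumptions, not the approach used in the course: the helper names `fit_line` and `k_fold_mse`, the simulated data, and the choice of five folds are all hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def k_fold_mse(xs, ys, k=5, seed=0):
    """Average mean squared prediction error over k held-out folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint test sets
    fold_errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        # Fit on the training cases only, then score on the held-out cases.
        a, b = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        squared = [(ys[i] - (a + b * xs[i])) ** 2 for i in test_idx]
        fold_errors.append(sum(squared) / len(squared))
    return sum(fold_errors) / k

# Hypothetical data: a linear trend plus noise with standard deviation 1.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

cv_mse = k_fold_mse(xs, ys)
print(f"5-fold cross-validated MSE: {cv_mse:.3f}")
```

Because the noise in this simulated example has variance 1, the cross-validated error should come out close to 1. A single training/test split is the special case of fitting once and scoring on one held-out fold.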
On technology, algorithms, pencils and computation
You mentioned thinking about trigonometry, and calculus, and geometry as ways of calculating things. Now, we can do it algorithmically in a lot of respects. Then you mentioned machine learning being an algorithmic way of approaching things. For a lot of machine learning techniques, you can write down the equations analytically in some sense -- not the solutions, of course. For data science purposes, that isn't strictly necessary. Even trying to understand that can be a hindrance to budding data scientists' practice in a lot of respects. In essence, this field of machine learning or predictive analytics is very results-oriented in terms of the model with the most descriptive power wins. In fact, whether its basis is firm or not in terms of being able to write down an equation might not even matter. In that sense, the future may be algorithmic. Is that something you can speak to?
It has always been about technology. The technology might have been moving stones from little bowls to other little bowls, to a counting table, or an abacus. Pencils have been very influential in the mathematical sciences. Arabic numerals were a tremendous thing. It wasn't always so, at least in Europe. Starting about 1600 or so, we developed the modern form of algebraic notation. This is the exclusive way we had to communicate for 200 years. There started to be printed tables even in the late 1600s. That was the exclusive way. It is the prestigious way to communicate.
It has all of the hallmarks of religious ritual. It's got special notation that's obscure to people. It's beautiful notation. I love it. It's obscure to people. There's a kind of priesthood -- people who can make sense of it and people who can't make sense of it. We describe things as having a solution when what we mean is they have a solution using that technology. That's become the Platonic truth for us, is these equations, and this formalism for describing stuff. We now have other effective means of communicating about relationships. Software, of course, is the dominant one for all sorts of things outside of academic math. It's a really good way to communicate ideas. At the same time as you're communicating the idea, you're implementing the calculation. You don't have so much to worry about misinterpretation. You can allow people to reproduce things without having to translate a formula, even the quadratic formula. Depending on the numbers you might want to use the standard one or you might want to use a different one because they're unstable in different areas.
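The quadratic formula remark can be illustrated in a few lines. This is a minimal Python sketch of one well-known rearrangement that avoids subtracting nearly equal numbers. The function names and the test coefficients are assumptions made for illustration, and both functions assume a nonzero leading coefficient and two distinct, nonzero real roots.

```python
import math

def roots_textbook(a, b, c):
    """Standard formula: (-b +/- sqrt(b^2 - 4ac)) / (2a)."""
    d = math.sqrt(b * b - 4 * a * c)
    return (-b + d) / (2 * a), (-b - d) / (2 * a)

def roots_rearranged(a, b, c):
    """Rearranged form: compute the large-magnitude root without
    cancellation, then get the other from the product of roots, c / a."""
    d = math.sqrt(b * b - 4 * a * c)
    q = -0.5 * (b + math.copysign(d, b))  # b and the signed d never cancel
    return c / q, q / a

# Hypothetical coefficients where b*b dwarfs 4*a*c.
# The true roots are very close to -1e-8 and -1e8.
a, b, c = 1.0, 1e8, 1.0

small_textbook = min(roots_textbook(a, b, c), key=abs)
small_rearranged = min(roots_rearranged(a, b, c), key=abs)

print("small root, textbook formula:  ", small_textbook)
print("small root, rearranged formula:", small_rearranged)
```

With these coefficients the textbook form loses most of its significant digits in the small root, because `-b + d` subtracts two nearly equal numbers, while the rearranged form does not. The two forms describe the same mathematics; which one behaves well depends on the numbers.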
Software is just not as prestigious in the academy, but as a matter of practicality and increasing our capabilities it's...and accessibility. It's fantastic. It's fantastic. Newton did not have the idea of an XY graph and of a Cartesian graph. It was Descartes who came up with that, and that's after Newton. It's really hard to read Newton because he doesn't use this technology which we've assimilated into the way we think. We see things in graphical forms, in that XY graphical form. I think we need to learn to see things, or more people need to learn to see things in terms of software.
Algorithms are very important. What would software be without...What is an algorithm? An algorithm is a way of describing a computation. One of the things that people have to learn... It's one of the things I think that's valuable when people study a little bit of technical computing you learn that an algorithm is a way of expressing a computation in terms of computations that you already know how to perform. That it all depends on what you can do before you build something up out of that to make a new algorithm. With linear programming, for example, all sorts of optimization techniques, constrained optimization techniques become accessible computationally. What are the algorithms that everyone should learn? If you're going to be a literate person in technology what are the algorithms that everyone should learn? There are actually quite good lists of the top 10 algorithms or the top 20 algorithms.
The Importance of Statistical Modeling
I've got one last question for you, Danny. An ex-colleague of mine, he's a physicist, and when he meets people at a bar and they ask him, "What do you do?" he says, "Ah, I save the world one equation at a time." My question for you is: of all the skills that you're equipping people with, what do you see as their import, and how would you describe saving the world?
I'm not so grandiose as to think I'm saving the world. I'm helping people deal intelligently with the problems that they will actually face. 30 years ago it was assumed that managers couldn't type, and all of the work in voice recognition was based on the idea that managers would do better dictating than typing. Now almost everyone needs to know Excel or something like that. That's the lingua franca in business, right? As data become more plentiful we need to move to a higher level so that expected skills for people who work in regular jobs, in regular desk-type jobs, are going to be about data. Just like you wouldn't go into an office now and say, "No, I've never used Excel or a word processor," I think there will be a time when you really need to know something about how to deal with the data that's streaming in. But I was hoping you would ask me what do I think my job is. Your physicist friend said, "Saving the world."
I would love to ask you one final question, Danny, and that is what do you think your job is?
My job is bringing math education into the 20th century. I don't mean the 21st century. We're far from that. Into the 20th century, that's my job.
Covariance: A measure of how much two variables change together.
Covariate: A variable that may (or may not) be helpful in predicting the outcome (or target variable) in a study.
Cross validation: A model validation technique that allows you to assess how well a statistical model will generalize to a data set independent of the data set on which the model was trained.
Statistical hypothesis test: A method of statistical inference, in which a hypothesis is tested against experimental data.
t-test (also known as Student's t-test): A particular case of statistical hypothesis testing in which the test statistic follows a particular distribution, the $t$-distribution.