This is a transcript from the DataFramed podcast "Data Science, Past, Present, Future (with Hilary Mason)", which you can find here.
Introducing: Hilary Mason
Hugo: Hi, Hilary, and welcome to DataFramed.
Hilary: Thank you.
Hugo: So great to have you on the show. I'm really excited to be having a conversation today about data science, what it is, where it's come from, the past, present, and future of this emerging discipline. But before we get into that, I'd like to find out a bit about you.
Hilary: Okay, sounds like fun.
Hugo: How did you get into data science?
Hilary: It's a good question, because when I started my career data science didn't exist, and so I started in academic computer science in machine learning a long time ago, about 20 years ago, and realized after some time I actually really like building products and systems that touch real people and real data, and that furthermore, most of the interesting data was not in academia, it was actually in companies that were starting to collect this data on the internet as a side effect of their business operations. And so I ended up moving back to New York City, which is where I grew up, and joining some startups to work on hard algorithmic problems that would open up some product possibilities. And so, data science started to emerge for a variety of reasons about that time, and we're talking about 10 years ago. And so I've been pretty involved with it for a long time, and I think whether we call it data science or machine learning it's really just different perspectives on using data to build interesting applications. And so it's been a while.
Data Science Projects
Hugo: When you say, "Touch real people using real data," we're talking about seeing an effect on the ground, and I presume the startups that you started working for when you came back to New York City, you were working on these types of projects?
Hilary: Yeah, it's really the difference between working on an algorithm that may satisfy some theoretical requirement, or working on a toy data set, and then looking at data actually generated by people, or from human behavior, that can then be used to provide some application or a service. One of the first companies I worked with was actually building models of career progression and career evolution by looking at a few million resumes. This was long before LinkedIn had any of these similar features. We were able to actually see that if you're a software engineer and you want to be a CEO one day, here are the jobs that other people tend to get in the meantime that will take you from point A to point B. Or we would see other things like, if you start as a lawyer, there was a 50% chance you'd be out of law in five years. Whereas, accountants would stay in accounting. There was a 90-something percent chance they'd still be accountants five years later. And that's what I mean by real data. It was giving us this insight into human behavior that was previously not out of reach, but was too expensive to apply to these sorts of fairly trivial problems.
Hugo: This is an example of what you would refer to as a data product?
Hilary: Yes. The product piece was the part where a person could actually come along and ask these questions. But the data science part was getting these resumes, parsing these resumes, building the models, and then actually hooking it up to the product in a useful way.
Hugo: And what other types of projects were you working on when you came to New York?
Hilary: Looking at data extracted from 3-D environments to understand and infer likely intents and actions. So, looking at data from things like World of Warcraft, or Second Life, which was cool at the time I will remind you, to try to figure out what people were attempting to accomplish in those environments. And that also had an interesting realtime aspect to the classification problem. I also ended up working at a company called Bitly, which is a social media analytics company, in 2009. I was there for four years as their Chief Scientist, and that was really at a time when there was not a defined practice of analysis of social media data. There was a nascent computational social science movement and people were really just starting to get their heads around what we could learn about human behavior through this kind of data. It was a very exciting time to be able to play with it and think about the products we might be able to build.
Hugo: Did you work for The City of New York at any point?
Hilary: I was on Mayor Bloomberg's Technology Advisory Committee, so I wouldn't say work, but maybe volunteered is a better word.
Hugo: I recall that Bloomberg was involved in a variety of data science innovations and initiatives.
Hilary: Mayor Bloomberg, when he was our mayor, was very involved in encouraging the technology industry in general. I think he realized that we couldn't rely entirely on finance, as a city, for a healthy economy. And in many ways, that meant using data more effectively, so building the Mayor's Office of Analytics, and using their own data as an internal tool to guide the use of scarce resources, and in many cases, to make city services more efficient to actually save and improve lives. And then also he was responsible for large initiatives like The Cornell Technion Project, which brought Cornell University to the city, and which has just opened its new building on Roosevelt Island. It's pretty exciting to see that actually come to fruition.
Hugo: That's really cool. I recall there was an example of an initiative to do with ambulances and allocation of city resources in that sense?
Hilary: Sure. This is one of my favorite projects to talk about because it is fairly trivial data science work that actually leads to very important impact. This is work that was done when Mike Flowers was the Chief Analytics Officer, and I believe the work was done by Lauren Talbot, who is one of the statisticians on his team. They looked at where ambulances should be sitting in order to be optimally located for the likely distribution of incoming calls, and they found that, of course, they're not actually sitting in those locations. They figured out that what the ambulance drivers really wanted was 24-hour bathrooms, and coffee, and other services so that they could actually be comfortable. They found them those services in those more optimal locations, and actually managed to reduce the ambulance response time fairly significantly. And I love this project for several reasons. One is that it actually shows that data science can have a significant impact in the real world. It's not just, "Are you gonna watch this movie on Netflix?" Or, "Are you gonna buy this thing on Amazon?" It's actually making our city more efficient in saving lives. And then the second part of it is that it's actually about going outside and asking people, "Why are you sitting where you're sitting? What's important to you in getting to an optimal answer to this question?" And taking those human factors into account. And the last bit, of course, is that the math is pretty trivial and well understood, but it's still hugely impactful when it's applied appropriately. This is why I love this example.
Communication in Data Science
Hugo: I think it's incredible. You spoke to the ideas of not such complex math, having a well-formed question, and also communication, the fact that data science doesn't exist in a vacuum. Understanding the problem, and actually, as you said, they found a solution and realized that wasn't happening in the real world, and then went and spoke with ambulance drivers to figure that out. Something I admire a great deal about you is your emphasis on the role of communication in data science. Maybe you could speak to that a bit?
Hilary: I think the best data scientists are people who are pretty empathic, and are able to understand what is important to solve. I mean, the truth is that framing the questions is where the challenge is. Finding the answers is generally a trivial exercise, or an impossible one. When you think about it that way, a really great data scientist can sit down with somebody, understand the thing they're trying to make a decision about, or what they need to know, go away, MacGyver up some analysis with the data that is available, or could be available in the tools they have at hand. They can go back to that person and explain to them what they've learned in a way that lets that other person make a better decision. And when you frame data science work in that kind of context it becomes really about how well you can understand somebody else's domain, and somebody else's needs, and then how well you can do your own work to satisfy those needs. And it's not about whose math is the hardest math, it's really about, "How do I get to a robust problem definition that I can solve, that will actually give someone an insight they didn't otherwise have?"
The Past of Data Science
Hugo: So, we're here to talk about data science as a function of time, where it came from, and the direction in which it's heading. What factors led to the emergence of data science as a discipline?
Hilary: It's not an accident that data science emerged about 10 years ago, because it is a technological artifact in the sense that technology had progressed to the point where the multiple things that a data scientist does could be combined in one professional role. Those things being actually write code and build models. So, there's a technical skill set and tool set that had to be created. The data had to be available, which was also not the case before 10 years ago, or it was very expensive to make that data available. And then you also needed a set of problems and processes and ways of thinking about the world that let you put all of these pieces together, and that's the broader communication and empathy piece. It's not as if this was new work at all, people had been using databases for nearly 100 years for solving business problems. But, it was newly affordable, and newly so easy, that one person could take on everything from the problem formulation through to the analysis, to the visualization, to the communication, to the eventual decision making as well, in a way that it just hadn't been before. And it opened the door to the creation of this new job role of data science as something that is itself distinct.
Hugo: Where did we see it emerge? Which disciplines or fields?
Hilary: Keep in mind, my background is computer science so I have a bit of a bias here, but I do see data science as, it's essentially if computer scientists had stolen a lot of the wisdom of statistics.
Hugo: I like it.
Hilary: And so, it is a blend of computer science, statistics, and then we're also seeing more influence from social sciences now as well.
Hugo: And also, as you said before, communication, journalism, storytelling, all of these as well.
Hilary: Just the kinds of business fluency skills that we expect most professionals to have today.
The Present of Data Science
Hugo: Where is data science now, and what's it capable of?
Hilary: Data science has become a real thing, and it still astounds me that there are potentially thousands of people running around with that job title. It is accepted as a role in an organization, and it's something that if you mention that you do data science at a dinner party you're probably not gonna get people turning around and walking away. People think it's actually an interesting thing to do. I also believe we're starting to see data scientists make large contributions to their organizations. There are certainly still challenges to overcome, but the value of data science from a business point of view is pretty clear at this point. The questions I have really are, "How will the practice of data science be changing over the next five years?", "Will we still be using that title? Or will we all be AI monkeys, or something else? Even though, the fundamental skills will remain the same?" Those things I'm not entirely clear on yet.
The Impact of Data Science
Hugo: And so, we're definitely gonna talk about the future of data science very soon. I want to know which industries, which fields, do you see data science being capable of having the most impact now?
Hilary: Right now, and here I'm speaking through the lens of our work at Cloudera Fast Forward Labs. I mean, we see huge impacts across industries, but some are more mature than others, particularly in finance. They're not necessarily in the places you'd expect. We see large progress being made, and this is largely because these companies have a lot of data already. Like finance has a long history of making data useful, and so there is already a culture of being fairly data driven in place in many of these companies, and they're also very interested in extending those capabilities to new kinds of data. And so, that's a place where we've seen people starting to make unstructured data useful in the ways that only structured data has been useful before, by which, I mean things like text. That's certainly one area. Another area where I see a lot of impact is in the pharmaceutical and healthcare space, where if you can shave a few percentage points off the cost of certain exploration activities, like you have a very clear win, and data is certainly a tool for doing that. We work pretty heavily in media as well. That's things like understanding your audience, helping them find content they'll love, helping them engage with that content, making sure it's shared optimally across different platforms. It's not only one place, but really pretty distributed. When I started the work at Fast Forward Labs about three and a half years ago I thought we'd probably end up working in one industry or maybe two, but that really hasn't been the case. It turns out that the thing that everyone has in common is the data, and the math is the same. Whether we're generating celebrity gossip, reporting on fashion, or we're writing a program to generate language about portfolio performance, it's the same mathematics and same techniques that enable those data applications. And so, we've seen pretty broad use.
Hugo: I like the examples you give. I think they speak to this idea we were discussing earlier that data science existed in certain disciplines before the term came about, so in finance and pharma they've been doing data science, or analogs of data science, for decades, if not longer.
Hilary: Absolutely, and the same in insurance. Though I found that insurance as an industry is only now really picking up on modern data science. It's a pretty exciting time if you work in insurance analytics right now.
Hugo: And I think you also spoke to the idea that the math a lot of the time remains the same. The applications may change depending on industry, but there is an abstraction that the same techniques will apply, and that's something that's deeply integrated into your work at Fast Forward Labs, right?
Hilary: It is. We do our own program of applied research looking at emerging capabilities, and attempting to make them useful to our clients ahead of where they otherwise would be. And so we publish reports, which are a description of what the thing is, and how it works at both a conceptual and a technical level every quarter, along with working software prototypes that demonstrate it applied to a business problem. But, we try to choose problems where people have a fair amount of empathy for it, and so they can look at the prototype and say, "Okay, I understand how this technique can be applied to my work." Just to give you one concrete example, we recently did a report on algorithmic interpretability. These are new algorithmic techniques that you can run on top of black box algorithms, such as neural networks or other deep learning approaches to, at a very high level permute the inputs and look at how the outputs change, and then infer the significant features in the classifications that those black box systems are making. Our business problem demonstration was on a black box model of churn for a telco. This was a real problem we advised one of our customers about, where the interpretability capability was used to be able to see which features of the customer were leading them to churn. Were they paying too much? Were they on an old technology? And you could also change those things and get a model of how the predictions change, which actually enabled new marketing and new customer service actions.
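As a rough illustration of the "permute the inputs and look at how the outputs change" idea described above, here is a minimal Python sketch of permutation-style feature importance. It is not the method from the Fast Forward Labs report: the churn model, the feature names (`monthly_bill`, `old_technology`, `favorite_color`), and the synthetic customers are all hypothetical stand-ins chosen only to make the example runnable.

```python
import random


def black_box_churn_score(customer):
    """Hypothetical stand-in for an opaque churn model (score between 0 and 1)."""
    score = 0.1
    if customer["monthly_bill"] > 80:   # paying too much
        score += 0.5
    if customer["old_technology"]:      # on an old technology
        score += 0.3
    # "favorite_color" is deliberately ignored by this toy model.
    return score


def permutation_importance(model, rows, feature, seed=0):
    """Mean absolute change in the model output when one feature is shuffled across rows."""
    rng = random.Random(seed)
    shuffled_values = [row[feature] for row in rows]
    rng.shuffle(shuffled_values)
    total_change = 0.0
    for row, new_value in zip(rows, shuffled_values):
        perturbed = dict(row)           # copy so the original row is untouched
        perturbed[feature] = new_value
        total_change += abs(model(perturbed) - model(row))
    return total_change / len(rows)


# Synthetic customers (assumed data, for illustration only).
data_rng = random.Random(42)
customers = [
    {
        "monthly_bill": data_rng.uniform(20, 140),
        "old_technology": data_rng.random() < 0.5,
        "favorite_color": data_rng.choice(["red", "green", "blue"]),
    }
    for _ in range(200)
]

importances = {
    feature: permutation_importance(black_box_churn_score, customers, feature)
    for feature in ("monthly_bill", "old_technology", "favorite_color")
}

for feature, value in sorted(importances.items(), key=lambda item: -item[1]):
    print(f"{feature}: {value:.3f}")
```

Features the model relies on show a non-zero average change when shuffled, while a feature the model ignores shows none. The sketch only ever calls the model as a function, which is why the same approach can sit on top of a model whose internals are not visible.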
Hugo: That's awesome. Once again, this takes us back to the idea of being able to communicate technical data science results to stakeholders.
Hilary: Exactly that. The idea is that an executive or an engineer can look at this and say, "I get it, I see how it works, I can now apply it to my problem." Whether that problem might be doing something like bias testing for regulatory compliance using the same mathematical technique, or it might be something like inspecting a prediction model for when and where to spend resources. Whatever it is, you can get a very good intuition for the algorithmic capability, and then figure out how to transfer that into your specific domain and problem set.
Hugo: I think that's a great example. I actually think your colleague, Mike Williams, gave a webinar on this, didn't he?
Hilary: Yes, he did.
Hugo: I tuned into that. That was awesome.
Hilary: Oh, that's very cool. That's up on our blog if you're curious.
Hugo: Oh, great. Well, we'll put that in the show notes definitely.
What Data Science Can't Do
Hugo: So, we've discussed a bunch about what data science is capable of, but we've also heard that data scientist is one of the sexiest jobs, or the sexiest job, of the 21st Century. There's a lot of hype around the term data science, and with such hype I think there is also a healthy skepticism that needs to be invoked. What isn't data science capable of? What can't data science do?
Hilary: That's a really great question. But it also comes from this framing where we assume that the default is that data science can do anything. It's pretty clear that data science can often tell us what to expect, or what might happen, but not why. And so, if you want to understand the why you really have to go talk to people. You have to understand that a lot of that knowledge is in someone's brain, it's from their experience. And this applies to the entire umbrella of data capabilities from analytics up to these more complex and interesting neural network models. You might get a result that you just don't know why it's doing what it's doing, or you might see something in your data that you can't explain. One company, I looked at their analytics, and it was an e-commerce company, and they had an unexplained, but repeated trend of increased orders in March. And this is something that you can see very clearly in the data, I had sufficient data to be able to make a prediction for the following year. You could understand the trend from a mathematical point of view, but you could not explain to anyone why that was happening. And it turned out to take quite a bit of digging to figure out what was going on there. And what the story actually was, is that there was a subset of a few products that had gotten written into some elementary school curricula, and the purchasing decisions for school districts had to be made a year ahead by March, and so what you were seeing there was this bizarre artifact of arbitrary deadlines, and a set of customers they didn't even know they had. And that was not the kind of answer we were ever going to get to just from looking at the graphs.
Hugo: Knowing something about the domain, and actually delving into the results, is essential in this case?
Hilary: Yeah, so data science, the science part will only take you so far.
Hugo: We also see a rise in awareness about such challenges as algorithmic bias, whether it be algorithms encoding societal biases or human biases, or algorithms creating their own biases, as well. Is this something you're actively thinking about?
Hilary: Absolutely. As part of this interpretability research, we hosted a research fellow named Julius Adebayo, who did some fantastic work on bias, extending the work that ProPublica had done on the recidivism data set, where a black box proprietary algorithm was making sentencing recommendations. That's also on our blog. But, yes, this is something that we focus on: not just algorithmic bias, but the ethical implications of the use of every capability. We have a chapter on that in every report we write. That's really to help people to understand the kinds of decisions they have to make in designing around these algorithms, and then also to give them an excuse to have the conversation, and to think about the ways in which something could go awry and impact the people on the other end of the product.
Hugo: What are the biggest concerns for you and Fast Forward Labs when thinking about ethical aspects of data science?
Hilary: I mean, the biggest concern I have is a very basic one, which is that the ethical considerations are rarely a part of the data product design or planning process. It's really a high level concern that the only time I see this routinely considered in the product development process is when it's a regulatory compliance issue as well. And so, if a company has a legal obligation to not discriminate then there certainly is a review, and a lot of thinking about how to validate that there's not discrimination in an algorithmic system. But, if that legal requirement is missing, it is still not a given that they'll even be thinking about bias in the data, or bias in the results, or how that may impact people in the products they eventually release. That's like level zero.
Hugo: Part of our challenge there is also the fact that tech is faster than legislation, right?
Hilary: Yes, and I don't personally think that legislation is necessarily the answer. I think the answer is that we are still developing a practice of what I'll call data product development here. A data product could just be a model in a report, or it could be an internal tool, it doesn't have to be a consumer facing, beautiful application.
Hugo: Or, Google Maps, right?
Hilary: Google Maps being my favorite data product because you don't need to know anything about the data, and the algorithms behind it, which are incredibly technically impressive, in order to use it effectively. But, that aside, that we as a field are still evolving the practice, and you can see this when you look from one company to another. If you're a software engineer, you're probably going to encounter pretty much the same process when you move from company to company. That is not the case with data science, and is certainly not the case with data product development. And so as this practice emerges, I would like to see us as a community consider ethics as a first-class design principle in our work.
Hugo: This, you stated, was level zero?
Hugo: And what's built on top of that?
Hilary: Well, level one is where we can start talking about the specific problems we've already seen emerge, and specific things to watch out for. And I can give you plenty of examples, but I think we're still stuck at that very beginning part.
Data Science Definitions, Deep Learning and AI
Hugo: So, bundled in with all of this, what data science is, what it isn't, there are a lot of buzzwords floating around that are very substantive as well, but I need your help to demystify them. Examples such as deep learning and artificial intelligence are probably the most prevalent that have gained currency and are getting a lot of attention. What terms and language do you think need to be clarified in the data science space? Particularly with regards to what they are actually capable of?
Hilary: I spend a huge amount of my time just clarifying the use of different terms in a given room, because we can't take for granted that when someone says AI they mean the same thing that I would mean when I say it. And the meaning of these words has changed dramatically in the last few years, and I expect will continue to do so. Right now at this moment, in December of 2017, we have seen a huge increase in the popularity of deep learning neural network techniques for very good reason. It's opening up capabilities that were simply impossible five years ago. Things like robust image object recognition, doing video classification, looking at audio in a way that is something that is actually novel. And beyond that, being able to model text and language in a way that is completely novel. So, looking at things like word embeddings, and sentence embeddings. These give us a bunch of new tools for addressing an entirely new and interesting class of problems.
Hugo: You actually have a report on word embeddings that came out recently, is that right?
Hilary: We do. We framed it around summarization. So the report is called Summarization, but it's essentially about word and sentence embeddings in order to do robust extractive summarization of documents. And it was a lot of fun, we have a great prototype for that where you get a Chrome extension you can run on any English language article on the web and see the summarization run in realtime, and play with the different network architectures in order to see the different kinds of summaries that get extracted, which is a lot of fun.
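To make the idea of extractive summarization concrete, here is a minimal Python sketch: each sentence is turned into a vector, the vectors are compared with a vector for the whole document, and the closest sentences are kept in their original order. This is an assumption-laden toy and not the approach from the report. In particular, `sentence_vector` uses plain word counts as a stand-in for learned word or sentence embeddings, and the example article is invented.

```python
import math
import re
from collections import Counter


def sentence_vector(sentence):
    """Toy stand-in for a learned sentence embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def extractive_summary(sentences, k=2):
    """Keep the k sentences most similar to the whole document, in original order."""
    vectors = [sentence_vector(s) for s in sentences]
    # Document vector: the sum of all sentence vectors. Cosine similarity
    # ignores scale, so summing behaves the same as averaging here.
    document = Counter()
    for vector in vectors:
        document.update(vector)
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: cosine(vectors[i], document),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:k])]


article = [
    "Data science uses data to answer questions.",
    "Good data science starts with a clear question about the data.",
    "Lunch was pasta.",
    "Answering a question with data requires clean data.",
]

summary = extractive_summary(article, k=2)
for sentence in summary:
    print(sentence)
```

Swapping `sentence_vector` for a function that returns real embeddings would leave the ranking and selection logic unchanged, which is the part of the technique this sketch is meant to show.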
Hugo: Oh, that's awesome. I know what I'm doing tonight. Sorry, I cut you off. Talking about the capabilities of deep learning, in particular.
Hilary: Deep learning is one of these terms where there's no hard limit for how many layers you need in a network to be deep, so at this point anything that's pretty much a neural network, even if it's one or two layers, is deep learning and it, itself, is one set of techniques under the broad umbrella of machine learning. But, when we think about AI and the way that term has come to be used now... I mean, historically it was the field of research inside of computer science that gave birth to machine learning, and that was because there was such disillusionment with AI from a funding and accomplishment perspective, people were so overly optimistic that researchers essentially had to rebrand. And it also coincided with the use of probabilistic and statistical techniques, so AI fell out of favor as the term of art, and started to show up really more in sci-fi and in movies. But now it's come back, and it's come back largely as a result of the rise of deep learning, and the capabilities there, and the hints we see of more intelligent machines. But, AI itself today is not a technical term, it's largely a marketing term, and it's one that I think shows a little bit more enthusiasm for the capabilities than what may actually be possible given the state of the technology.
Hugo: Yeah, and I think you made an interesting point that AI is also a term that has been in the cultural consciousness from science fiction, so people have a general idea that sentient robots and these types of things are artificial intelligence, and we see headlines play on that. I can't quote any off the top of my head, but you see headlines such as, "Artificial Intelligence Creates Copy of Itself in a Way That even Google Can't Understand," and stuff like that. We need to be careful about that, right?
Hilary: Right. One of the things that I've found most fascinating about the emergence of AI as a term now, is the way it has changed the language that people use to talk about what is still fundamentally a computer program in that it implies this kind of anthropomorphism, and we talk about "the" AI, and I'm putting this in air quotes, as if it's a person, in a way that just changes the expectations people have of its capabilities. And I do think we have to be pretty careful with that sort of language, and the impression that it encourages people to take away.
Hugo: I think part of this anthropomorphic process is that we have had this term ported from science fiction. Because we all know Blade Runner, right? And what replicants are, and these types of questions being posed in that space.
Hilary: Yes, and The Matrix, and we could go on and on.
Hugo: These are a bunch of interesting challenges that I think we're facing as a discipline. Are there any other major challenges that you think currently face the data science community?
Hilary: Do you think that imprecise ethics, no standards of practice, and a lack of consistent vocabulary are not enough challenges for us today?
The Data Science Process
Hugo: I definitely think so. We haven't really delved into the inconsistent practices, though. So, maybe you can speak a bit to that?
Hilary: Yeah, I mean it's really an artifact that data science is still a fairly new professional role, and as such it tends to get shoved in with software engineering, or sometimes it gets shoved in with traditional analytics, and the CFO sort of framing on the world. But, it's not managed using its own process, and so this is a very controversial statement, but if you are going to run an agile, by-the-book process, it is terrible for data science in the sense that it is the ideal process for software engineering. But when you start out with a software project you generally know that the exact thing you want to build is achievable and what you're figuring out in your experience is the methodology by which you will achieve it, but you are not inventing anything new, you're not doing experimentation. In data science, you're doing experimentation. You have a question, you're trying to get to an answer and you don't necessarily know at the beginning if it's going to work. And if you do know, then I'd say what you're doing is a little bit more analytics than it is data science. And so, trying to shove this into the established practice that works very well for engineering does not work for science. I've seen many companies where they end up with a lot of wasted effort and friction because they don't manage data science as its own thing.
Hugo: What are the most important aspects of the data science process that you think need to become more rigorous, or develop a process or methodology on? Whether it be data mining, documenting data lineage, through to the actual development of a product?
Hilary: All of these things are important. You need to know what data you have, where that data comes from, why it looks the way it looks. Did someone make a decision about a database field being a specific length, or a specific type? And if so, why did they make that decision? What might you be losing? All of that is important to being able to do accurate data work. But, from a data science practice point of view, when you set out to do a project you generally start with a question, or something you're trying to understand. I always encourage people to write that down, and to write it down in plain language so that anyone in the business can understand it, not just people who are technical. And that is actually harder than you think it is. I've had plenty of data scientists be like, "Oh, that's easy. I'll have it to you in 10 minutes." And then two days later they're like, "Oh, man. This is actually not so easy."
Hugo: I love that because that actually speaks to the fact that data science answers questions that exist independent of data science, if that makes sense.
Hilary: I'm thinking about it. I think it does.
Hugo: So, we can pose a question in the world before data science exists, and then data science is a tool to answer this question, which we can pose without using the language of data science.
Hilary: Oh, my gosh, you're right. This is beautiful. So, question two, once you have a problem statement is really, "What are the error metrics by which you'll know you have a successful solution to the question you have posed?" And hopefully these are quantitative error metrics, but sometimes they aren't. I have also found as someone managing data science teams, we have a lot of shame when we don't have proper quantitative error metrics, and so I want people to admit before they even start the work that this is how we're gonna know that we've solved the problem, and this is just the way it is. And these are problems like working on search engine algorithms where you can pull together a couple of things that give you some notion of whether your algorithm is better than random. But getting to a true quantitative metric really requires a volume of user data that may or may not be available to you depending on what kind of company and product you're working on. So, you might have to admit that there are not quantitative metrics and that's okay. And then the next thing you need to answer is really, "What is the product or business utility of this work?" So, "Why are we doing this?" I always like to phrase this as, "Assuming we can answer the question successfully, what is the first thing we'll do with it?" And that phrasing is very careful because I find that well run data science practices have multiple uses in mind for pretty much every piece of work. So, everything you do opens up the ability to do something else, or to do something new faster and cheaper than you could have done it before, which also speaks to a set of requirements around practice. If you have a team, they need to be sharing business definitions of the data, and they need to be sharing coding capabilities, and so a lot of things fall out of just having a really nice process around how you frame the problems that you're going to explore, how you make sure that they're worthwhile and impactful.
Because we also have this problem where we as a group tend to get very excited about interesting things that are not necessarily impactful, and you don't want someone vanishing down a rabbit hole for a few months, and coming back with something that's not really useful. And then how you know when to stop spending time on something. One of the big differences between academic computer science work and data science work is that you're in a business, you don't generally have two years to think about one problem. In fact, that's one of my favorite interview questions is, "What's the proper approach to thinking about this problem?" And then, "Okay, that's a one year solution. What's your one week version? And what's your one day version?"
Hugo: Yeah, and the desired metric, or how well your model performs, or whatever it may be, may be a function of how much time you have for the project, right?
Hilary: Absolutely, or what your budget is for testing and all of these things. And so it's really developing a discipline around those aspects, which, if we look at our sister disciplines of software engineering, those exploratory aspects are much less important than they are in data science.
Hugo: In terms of these processes and methodologies being developed and made rigorous, is there a concern that this may happen in silos?
Hilary: That's an interesting question because it does happen in silos. I've seen this done well at many companies, but they all do it differently. And that's something that I think will change over the coming years, and part of it will descend from our tooling. As our tooling gets more advanced it will encode some of this process and practice in it, and we can think about the way GitHub has had a huge influence on software engineering workflows, and pull requests. And I think that's a good analog for where the tools will take us, but we'll also hopefully converge on some common notions of what good work looks like.
The Future of Data Science
Hugo: What time scale do you predict this happening on? I know I'm really asking you to do data science now in making a prediction.
Hilary: I have no data at hand, so what I'm doing is purely intuition. I think it'll happen in the next few years because all of this has happened much faster than I would have predicted. And so, I want to say it's 10 years off, but I actually think it's more like three to four years off.
The Future of a Career in Data Science
Hugo: So, in 2010, nearly eight years ago, you wrote with Chris Wiggins, a piece called "A Taxonomy of Data Science," in which you proposed what you refer to as, "one possible taxonomy of what a data scientist does."
Hilary: That's right.
Hugo: How has what a working data scientist does changed since then? And has anything surprised you about this?
Hilary: Well, if you look at what we wrote, and it was a short essay to just put down in writing a thing that we had not seen put down in writing yet. So, keep in mind, seven years ago data science was not really a viable career option. It was not a common job, the phrase had just really come into use beyond a couple of companies. And what we wrote down is so obvious. If you look at it now, you're like, "What are these people thinking that they have to spend their time to articulate something that's so very clear?" But it was not clear seven years ago, and so I'm glad you reminded me of that because it really does help put the time scales here into perspective that this has only really been a viable practice in itself for seven years. So, maybe I'll have to shorten my prediction and say that perhaps we'll standardize on that process in the next two to three years, not three to four.
Hugo: But, it is still a telling piece because, I mean, it makes it clear how being able to interact with the shell, with the terminal, is incredibly important. I still have aspiring data scientists come and ask me at DataCamp, "Is Shell as necessary as people say it is?" And these types of things. I think, in a world where there's a plethora of all types of tools to know and learn, it does set a good baseline for what people need to be doing.
Hilary: I mean, I think so. But, of course, I have a huge bias. I'm still a huge fan of awk and other bash capabilities. It's not necessary, but it does give you a speed advantage in a bunch of situations.
The Data Science Landscape
Hugo: For sure. So, we've touched upon what the future of data science looks like to you, but I want to ask you a relatively general question, which is, what does the data science landscape look like to you in the coming two, five, and/or 10 years?
Hilary: Oh, that's a fun one. I think two years looks largely like today in the sense that the kinds of problems we're solving won't have changed very much, and our tooling will progress, but it will ease some of the challenges we currently face. And by that, I mean we'll take things like, "Once I've trained a model, how do I deploy it and monitor its quality over time, and deal with retraining?" And that will be something that's covered by standard, hopefully, infrastructure tooling. That won't be a custom set of little nutty scripts that only the person who created the thing can actually manage to run appropriately, which is where we are right now. I think, in two years, we sort of know what that's gonna look like. We'll have a lot better tooling around model deployment and monitoring. We may have more standard tooling around things like A/B testing, and multi-armed bandit testing, and some better experimentation tools. I think within enterprise we'll see better data provenance and data sharing tools. And so, even things like feature catalogs. But this is all stuff that is essentially an engineering problem in that we know it's possible, and we know how to build it, it just sort of hasn't happened yet, right?
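For readers unfamiliar with the multi-armed bandit testing mentioned above, here is a minimal epsilon-greedy sketch in Python. It is an illustration added to the transcript, not something described in the episode: the function name `epsilon_greedy`, the `true_rates` values, and the 10% exploration rate are all assumptions chosen for the example.

```python
import random

def epsilon_greedy(true_rates, steps=5000, epsilon=0.1, seed=0):
    """Simulate showing variants to users and shifting traffic to the best one.

    true_rates holds hypothetical conversion rates that the algorithm
    cannot see directly; it only observes simulated conversions.
    """
    rng = random.Random(seed)
    pulls = [0] * len(true_rates)      # how often each variant was shown
    rewards = [0.0] * len(true_rates)  # conversions observed per variant
    for _ in range(steps):
        if rng.random() < epsilon or 0 in pulls:
            # Explore: pick a variant at random (also used until every
            # variant has been shown at least once).
            arm = rng.randrange(len(true_rates))
        else:
            # Exploit: pick the variant with the best observed rate so far.
            arm = max(range(len(true_rates)), key=lambda i: rewards[i] / pulls[i])
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:
            rewards[arm] += 1.0
    return pulls, rewards

pulls, rewards = epsilon_greedy([0.05, 0.10, 0.30])
```

Unlike a fixed-split A/B test, this kind of loop sends most traffic to whichever variant currently looks best while still reserving a small share for exploration.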
Hilary: Five years is where it starts to get interesting, because that's enough time for a creeping tide of commoditization to come in, and we may see some really interesting capabilities emerge from some of the AutoML work that's being done, which will change perhaps our fundamental approach to machine learning. And so as excited as everyone is today about deep learning, the vast majority of deployed machine learning systems are not using deep learning, and deep learning is by far not the best choice, certainly not for the beginning of those projects where you're just trying to get something in place, and perhaps you even want to know why it works the way it does.
Hugo: Yeah, could you remind me what AutoML is?
Hilary: It's automatic machine learning, and it has a couple of different meanings currently. Again, back to the clarity around vocabulary. So, some companies are using it to mean things like tooling for citizen data scientists. You won't need professional data scientists anymore. I think that particular framing is kind of bullshit, largely because you can give someone a button that will select the right classifier and do the hyperparameter tuning for them, but that doesn't mean that they'll know why it's doing what it's doing at a level where they can actually do useful work.
Hugo: And this speaks to interpretability. Once again, having someone who's able to translate from computational language, from the math to the real world situation, having that interface.
Hilary: Right, and you also have some use of AutoML where you have automated parameter tuning, you have the ability to take trained features from one model and use them in another model without a person having to make a decision about that. And there's some really exciting research work there that we have yet to see broadly impact practice. I think that at the five year time scale, if that is going to pan out, we will see it and it may change the way we think about design of our algorithms entirely. It's pretty exciting stuff to think about using deep learning to design deep learning systems. It's a little bit meta.
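As a rough illustration of the automated parameter tuning mentioned above, here is a minimal random-search sketch in Python. It is an addition to the transcript, not a method described in the episode, and it is far simpler than real AutoML systems. The names `random_search`, `toy_validation_error`, `learning_rate`, and `regularization` are hypothetical, and the toy scoring function stands in for "train a model and measure validation error".

```python
import random

def random_search(score, space, n_trials=50, seed=0):
    """Sample configurations at random and keep the one with the lowest score."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(n_trials):
        # Draw each hyperparameter uniformly from its (low, high) range.
        config = {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        s = score(config)
        if s < best_score:
            best_config, best_score = config, s
    return best_config, best_score

def toy_validation_error(config):
    # Hypothetical stand-in for a real training run: the error is smallest
    # near learning_rate = 0.1 and regularization = 1.0.
    return (config["learning_rate"] - 0.1) ** 2 + (config["regularization"] - 1.0) ** 2

space = {"learning_rate": (0.0, 1.0), "regularization": (0.0, 5.0)}
best_config, best_score = random_search(toy_validation_error, space)
```

The search finds a good configuration without explaining why it is good, which is the gap between pressing the button and understanding the result that the conversation points to.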
Hugo: I think this idea of pre-trained models and transfer learning is gonna be huge, and also in the deep learning space, right?
Hilary: Absolutely. If you're a large organization, you have many data scientists, and they should not all be working in isolation. And so, if one is training on a certain set of data, those things should transfer to other related problems. And so you end up perhaps with a kind of network effect capability that might happen. This is all supposition, because this is the kind of stuff where today we see hints of it working. We don't yet know what the real impact is gonna be, and we certainly don't know how it will change process and practice, and we don't know how it'll change our broader set of tools. There's a ton of exciting stuff in that five year timeframe that we can start to imagine. Let's just say we have a lot of uncertainty in that future.
Hugo: And I presume there's even more uncertainty in your 10 year prediction?
Hilary: Oh, the 10 year predictions are really far out there. Will we even have data science in 10 years? I remember a world where we didn't, and it wouldn't surprise me if the title goes the way of the webmaster, and maybe data science becomes a tool set that we expect every engineer to have, or maybe we expect every business professional to have it. I'm not saying that's what I believe, I do think we'll still need the specialist to sort of build, and tune, and monitor those machines. But, there's a potential set of universes out there where that's the case. I'm personally more excited about, like in 10 years I expect we will make rapid progress in understanding of language and emotion, which will open a whole other set of potential applications and capabilities. And we also haven't talked about hardware at all. But, even today we see systems and infrastructure have to evolve to run across fairly heterogeneous hardware, so you now have CPUs, GPUs. You might have interesting sort of GPUs running on the edge, have stuff running on mobile. I think 10 years from now machine learning will be running everywhere. And we may not even be carrying around our phone bricks, we might be carrying, you know, maybe it's a hairpin, or a little watch that has all of the computational power, and we're using a variety of ambient interfaces. Who even knows? But, all of those things will involve machine learning in some form.
Hugo: As you mentioned, aspects of machine learning such as feature extraction, feature engineering, these things are well on their way to being automated to a certain extent. And whether data science exists as the field it does today, whether it has a new name, processes of machine learning will still be incredibly important. I'm wondering what parts of the data science process do you think are less likely to become automated? What are the most valuable skills for data scientists to be working on?
Hilary: I think personally the ability to frame the best problems to work on is a skill that is underappreciated and unlikely to be automated.
Hugo: That's a great answer.
Hilary: Yeah, I mean it's the hardest part. Generally, the answers are trivial or impossible, but the problem statement is where the majority of the work goes. And knowing what's useful and valuable is still something that's really hard for people to do well; I don't think we'll get to the point where machines can just sort of drop in and do it across all domains. I also think that we in data science tend to neglect some of the human aspects of the products we build. I got into a fun debate a few months ago with someone who was arguing that a business professional who spends the majority of time working at their computer answering email could never be replaced, but a cashier could. And I think it really speaks to the assumption that they were making about the role of the cashier. Do you think the cashier in a store is there to push the buttons? Or do you actually think they're there to be a human presence and interact with someone and say, "How's your day?" And I think we tend not to pay enough attention to the important work of human relationships around the data systems that we build.
A Data Scientist's Advice
Hugo: My final question is, for aspiring data scientists and well-seasoned data scientists alike, do you have a final call to action? Or something you'd like to tell them?
Hilary: That's a lot of pressure to put on one question.
Hugo: I know.
Hilary: The thing I would tell aspiring data scientists, that I think the seasoned ones probably know already, is just to follow what's interesting in the sense that this field didn't exist seven years ago, and here we are today. And so, don't have any hard expectations about what your life is going to look like seven years from now, just follow interesting problems, and interesting people, and technologies, and that's where you're gonna find the really hard, fun, impactful work.
Hugo: Hilary, it's been such a pleasure having you on the show.
Hilary: Oh, this has been really a lot of fun. Thank you so much.