Organizing Data Science Teams (Transcript)
Hugo: Hi Jonathan, and welcome to DataFramed.
Jonathan: Hi, how's it going?
What skills do data scientists need to work on?
Hugo: Really good. Really excited to have you on the show, and before we jump into our conversation, I've got a quick question for you. In terms of the skills that data scientists need on a daily basis, do you think it's more important to be able to develop sophisticated machine learning models or to be able to give a PowerPoint presentation?
Jonathan: Oh. I'm gonna say PowerPoint presentation, but I feel like that's a controversial answer, but I actually feel really strongly about this.
Hugo: Tell me a bit more about that.
Jonathan: When you think about the job of a data scientist, usually what that entails is taking data, understanding it, processing it, really trying to figure out what's going on, and then, once you've figured out what's going on, trying to convince someone else of what they should do, or what's there, or what your feeling is, but there's this really important part that is the convincing part, and knowing how to make a good PowerPoint presentation is really important in a business setting to be able to convey all those thoughts and ideas you've learned.
Having the ability to do that is way more important than being able to use the most advanced machine learning models, right? If you can use a linear regression and you can give a good PowerPoint, that can often be much more powerful than being able to do deep learning recurrent neural networks, but not being able to take what you've learned and convey it to other people.
Hugo: That's great, and I love that you mentioned regression, because this is an example where you can show people relatively straightforwardly, even non-technical people, why the model does what it does. You can explain if you tweak one parameter, why the output that they're interested in changes.
Jonathan: Yeah, and I think that's a thing that is often undervalued. There's kind of a traditional notion in data science that the higher the accuracy, the better, and that that's what matters most, you know? Things like Kaggle competitions really emphasize this: the more accurate you can get, the higher your R-squared, the better things are going. But there's actually a lot more to doing models, right?
There's understanding what are the things that are important within the model. There's the convincing people that your model is good. There's lots of things that go on, and often that stuff is more important than necessarily getting the technically most accurate answer, and so something like a logistic regression, it's really great because it's really easy to understand what's important, it's easy to explain to other people what's going on, and having things not be a black box is often really useful.
Hugo: Absolutely, and I also think ... You mentioned that a data scientist's job is to convince people of the power of their models and to help make business decisions, and another point is not only do you wanna make PowerPoint presentations, but you wanna, for example, you know, have figures that people understand, use colors that people like, as well.
Jonathan: Right. Say you're making a plot, and there's something good that could come out of it and something bad, a bar for opportunity and a bar for risk. If you color the good bar green and the bad bar red, someone looking at that chart is gonna understand what's going on much more quickly than if you made them two different shades of blue. That's really small, but if you think about making good visualizations and good presentations, there are giant sets of these small decisions that together determine whether people quickly understand what you're talking about and accept it, or feel confused and uncertain and don't necessarily wanna listen to what you're saying.
Hugo: The other thing of course is, and I think you've written about it, if I recall correctly, is even a data scientist needs to be able to choose their meeting times correctly. For example, it's probably better to try to convince someone of something at 11AM than directly after lunch.
Jonathan: Yeah, there's that famous study that I'm gonna butcher, but it was around seeing who gets parole, I think, and the prisoners who got their parole hearing right before lunch did a lot worse than the people who got it right after lunch, so just having your meeting at the right time can often be influential in the decision. That's really infuriating when you think about data science, right, because data science is all about getting the best evidence and using the best techniques to try and really understand what's going on, so to find a really cool finding and then have it not be listened to because your meeting was at the wrong time, that's just unfortunate. But there's lots of these sorts of decisions that happen, and being thoughtful of them is really valuable.
How should you feel when data science projects don't work?
Hugo: You mentioned R-squared, and very recently you wrote a post on Medium which actually the first image in it is R-squared equals 0.042, and this is a great post about why data science projects may not work, and how we should feel when they don't work, and I thought maybe you could give us a bit of insight into your feelings about this.
Jonathan: Yeah, so this post was really kind of a culmination of a thing that's been happening a lot in my career, which was I would be given, or I would have some new idea for a project, right? Maybe it's trying to predict which customers are likely to churn at a company. I'd say, "Hey, we can use data science here at this company to try and predict which customers are gonna churn, and we can use the customers' transactions, and we can use when they called the call center, and we can put all these things in, and make a machine learning model, and guess which customers will churn," and I come up with this idea, and people would say, "Okay, great. Go try it," and they'd be all excited, and then I'd go try and do it, and it wouldn't work.
Maybe the data really wasn't quite there, or maybe I would actually have the data, and I make the model, but for whatever reason, it actually didn't work very well. Maybe it did turn out that knowing what transactions happened doesn't really actually tell you much about which customers are likely to churn, and what would happen to me in these situations is I'd get really depressed, and I'd think, "Oh, if I was only a better data scientist, I could have made this work. If I had had a few more techniques, or if I had had a few more projects under my belt so I'd known what to look out for, if only I was better, this thing would have succeeded."
Hugo: In fact, the top highlight on your Medium post is ... So you say, "Each time I feel awful about myself, comma," and then the top highlight is, "with a lingering belief that the project would have worked if only I had been a better data scientist," and what that says to me is a lot of people identify with this, or consider it very important to them.
Jonathan: Yeah, and I think, if you go and sample a hundred random blog posts about data science, I'm pretty sure that at least 95 would be talking about how great data science is, and how it's gonna change the world, and cure cancer, and you can use data to improve your company, and data's the new oil, and all these optimistic things. That creates this environment where the feeling is that data science is easy and it's gonna be very fruitful, but when you get people trying to do these things in practice, the reality is you're doing a new, risky venture, right?
No one's tried predicting churn with transactions before. No one's tried clustering these customers before. You're doing all these things that no one has actually tried before, and so naturally it makes sense that most of the time, it doesn't work. If it did work, people would have probably tried it already. So we have this environment where the hype is that everything is easy and good, and the reality is that it's difficult and hard, and the natural inclination when people don't succeed at problems is they don't wanna talk about it. I feel like most people run into this situation where they try a machine learning model, they try doing some data science, and it doesn't work, but then they can't really talk to anyone else about it, and so you just assume, well, since no one else is talking about how they fail all of the time, it must just be that I'm bad. But I don't think that's true. I think that happens to most of us, and dare I say all of us.
Hugo: As data scientists, we should be absolutely aware that statistically, it will happen to most of us some of the time.
Jonathan: Yeah. It should happen to some of us all the time, and when you see those news articles about those companies doing really cool data science and everything going great, they don't make articles about all the times things didn't work, so there's just a huge selection bias going on.
Hugo: This happens in basic research as well, right, that negative results aren't publishable, for the most part.
Jonathan: Yeah, and I feel like there's a movement to start publishing them more, and just getting more open about what people try that doesn't work, and I think data science is a field that would be especially receptive to that kind of philosophy.
Hugo: In this post, you mention several reasons why you may not be successful on any given task. Two in particular you mention are that the data just doesn't contain a signal, and you give the example that it'd be ridiculous to predict the weather based on rolling dice, right? But you also mention that a signal may exist but your model isn't right, and then you shoot down that particular possibility, really stating that if a signal exists, when you try a bunch of models you'll find it, even if it's weak, so it's usually the case that the data just doesn't contain the signal you're interested in.
Jonathan: Yeah, and that's really from my personal experience, is that if there's some relationship, you know, if it turns out that transactions can help you predict churn, then trying even a linear regression will pick up some sort of correlation, you know? You'll get some sort of success, and then you can try using better techniques, and choosing better features, and you can do a lot of things to improve it, but usually the very simple approach will still work.
In fact, I think this comes from this thing I've been thinking about a lot, which is this notion that machine learning models very rarely will pick up something that a human couldn't detect themselves. What do I mean by that? I mean that if you took customer data, you know, a bunch of history of transactions, and you said, "I'm gonna personally take a guess whether this customer, number 27, will churn or not," and you as a human can do that pretty well, then a machine learning algorithm can probably do it, too, right? Maybe it's that if you haven't made any transactions in a year, that's a sign you're not gonna come back, or maybe if your transactions are getting smaller in value, that's a sign you're leaving. If you can explain, as a human, by looking at a couple of data points, what's happening, then your model will probably work.
Conversely, if you can't, even as a human, if you can't look at the data points and try and predict what's gonna happen, then a machine learning algorithm probably won't either. Let me give you an example from my career. At one point, I was working for a software company, and this software company would put out software, and before the software was released, it would have to go through testing, and so what would happen is people would use the software a lot, and it'd create a lot of in-app telemetry, so it'd create a lot of data around, well, then someone clicked here, and then someone did this, and then the app was a little sluggish, and things like that, and it'd create tons and tons of logs of telemetry.
The idea was, hey, what if we use machine learning and data science so that we didn't need a human to tell if the software passed the test or not? What if we could just look at that telemetry and tell if it will pass? The company I was working for ended up spending two years and millions of dollars trying all these different data science and machine learning techniques, only to realize that it just wasn't gonna work. And intuitively, you could have looked at the very beginning and said, "Okay, here's a terabyte of telemetry data collected from the app. Tell if it's gonna pass the test or not."
Like, the tests were really complex things, right? They're like, the app must feel responsive, and it must have no grammatical errors, and things like that, and things you would never be able to tell just by looking at telemetry. So as a human, if you couldn't look at the telemetry and tell if it would pass or not, there'd have been no way, or it's very unlikely that a machine learning algorithm is gonna suddenly detect a thing that you wouldn't have noticed, either.
Let me bring this all back to why the simple approaches work. It's like, if you with your eye can figure out right away, "Oh, this is what the rules are here. I think this customer is going to churn, or I think this software is going to pass the test," if you can do that, then a simple algorithm can probably pick it up, too.
Hugo: Essentially, then, a more complex algorithm, I think as you say in this post, won't just give you signal where a simple algorithm did not. A more complex algorithm will give you some sort of lift, or it'll turn a good algorithm into a great algorithm, but it won't produce something from nothing.
Jonathan: Right, exactly. It may catch the edge cases, right? Your linear and logistic regressions, they may not be able to notice all the different possible rules and things like that, but they can pick up on the base idea of what's going on, and so if they can't pick up on anything, then it is very unlikely that by putting in a super hard algorithm, you're actually gonna get a huge success.
How did you get into data science?
Hugo: This has been a whirlwind introduction to, I think, a lot of your interests, what you think about. I wanna step back a bit and consider you historically, in some sense. I'd like to know how you got into data science originally, what your trajectory was.
Jonathan: I would say my story starts with me going to college, not knowing what I wanted to study, and after my first semester I realized, "You know, I really like mathematics. I'm gonna get a degree in that." I had no idea what kind of things you could do with a math degree or anything like that. I knew intellectually that people said, "Oh, businesses hire math majors," but I didn't actually know what that was like, or what the job would look like. I didn't really know anything, but I figured, "I like math. It'll all work itself out."
I ended up getting an undergrad degree, and I got a master's degree in mathematics, too, because at some point during my undergrad and master's, I decided, "Oh, I wanna be a professor," and then later, during that undergrad and master's, I realized, "Oh, I hate math research. I don't wanna do that at all." Then I ended up working at a company called Vistaprint, and this was before data science was a term, so at the time, this was a role called business analytics, and I didn't really know what that meant, but it sounded interesting, and so there I ended up doing a bunch of cool stuff around creating forecasting algorithms for sales, and helping them optimize their recommendation engine, and things like that.
I ended up doing that for a while, and then I realized, "Wow, having a degree in math and having all this applied math knowledge isn't that helpful without knowing statistics, and knowing how to work with data, and all those things," so I ended up going back for a doctorate in industrial engineering, and my particular research had to do with how you optimize electric vehicle route networks, so if you're Tesla, where do you put charging stations, and if you have electric buses, where do you have them stop and recharge their batteries?
All these sorts of cool math problems, but not super related to data science. As I was finishing my degree, I started getting into consulting, and I realized there's actually this huge opportunity where lots of these companies have all this data, and they just need help figuring out what to do with it. By that point, data science had become a term, and so at that point I became a data scientist, and did that for quite a few years.
Eventually, that consulting company I was with went under. That was unrelated to me. Then I ended up being a director of analytics at a consulting firm called Lenati, where I started the analytics team from scratch. It started as just me and ended up growing to a team of seven, and then I decided that I wanted to do something new. That team was pretty much running itself, so I became an independent consultant, and now I'm working as Nolis LLC.
What do you do in your consulting work?
Hugo: Great. What do you do generally in your independent consultancy work?
Jonathan: There are lots of companies that are trying to start getting into data science, machine learning, and AI, and if you don't have any of that, trying to start is not easy, so I help companies with that kind of process. Right now I'm helping T-Mobile look into growing their AI: how can you use AI and natural language processing within call centers, and things like that, and really growing that space up there.
Hugo: That sounds really exciting. T-Mobile of course is telecommunications, and we see that data science can have a huge impact in telecommunications. I'm wondering more generally what verticals do you currently see data science having the most impact in or being capable of the most impact?
Jonathan: I'm gonna pivot that question slightly, if you don't mind.
Hugo: No, not at all. I've got no idea where this is gonna go, so I'm excited.
Jonathan: Okay. The question of which verticals are right for data science. I think, in some way, it's more a question of which verticals are furthest along in the data science journey. You imagine tech companies, Google, Microsoft, Amazon, they're pretty far in. They really understand data science, they have big teams, and some of the other verticals, you can imagine they're much earlier. Retail is getting there, healthcare in some ways is getting there, in some ways it's just barely starting. It really depends, and when I think about data science, I really think it actually breaks into three separate fields, and I think for each one of those subfields, the different companies and different verticals can approach them differently. Allow me to talk about those three different fields.
Hugo: I'd love that.
Jonathan: When I think of data science, I think it's really three topics. One topic is business intelligence, and that's really around taking data that the company has and getting it in front of the right people. That can mean taking data and putting it in dashboards, or weekly reports, or even in emails that get sent out every day, but really taking data and getting it to the right place, and business intelligence generally doesn't have that much analysis of the data. As a BI person, your job is not to try and figure out what it all means, your job is to get it to the right people so they can figure out what it means.
The second area of data science is what I call decision science, and I say what I call, but I didn't coin the term, and I don't know who did, and it's killing me, so if someone's listening to this and they know who coined the term decision science, please email me.
Hugo: Get in touch.
Jonathan: Yeah. Decision science is really around taking data and using it to help a company make a decision. For instance, that could be trying to figure out, "Hey, which of our products is the right product ... You know, which of the products should we stop stocking, or we are noticing that this segment of customers is churning. Can you help us figure out why, or even what is the best way to split up our customers so that we market to them differently?" This is really around the creating PowerPoints, and really trying to get people to understand what is happening from what we see in the data.
Then the last field is what I think about as machine learning, and this is the area of data science that's about how we can take data science models and put them continuously into production. For instance, making a website's recommendation engine, or creating the model that chooses what price we're gonna quote you when you look at the website, or deciding each day which customers are gonna get an email. All these things that are continuously running, that's really the machine learning part of data science. Each one of these fields is different, and each one is important, and I think different verticals and industries have progressed differently along each one of them.
Hugo: There are interactions as well, right? For example, machine learning can impact decision science.
Jonathan: Yeah, and they really overlap quite a bit, because you could imagine that the decision science work can influence what you're gonna put in your dashboard, which relates to BI, and the machine learning models can influence what you report out in the decision science, and the decision science folks can really influence whether it even makes sense to put a machine learning model in place in the first place. I think from a skill-sets perspective, each one is different, but they have a fair amount of overlap. When you think about business intelligence, that's much more around storing data in databases, and understanding Tableau, Power BI, and the right visualization approaches.
When it comes to decision science, that's much more around using Python and R to be able to take a big dataset and get some meaning out of it, and then put it into a meaningful report, and then the machine learning component is much more around the software engineering, so getting it so that you create a model, you test that it works, that you deploy that code into a test environment, and you deploy it into production, and so it's by far the closest to software engineering.
I think when you realize that data science has these three distinctions, it makes it a lot more approachable, because I think if you lump them all together, you kind of think, "Wow, I need to understand software engineering, and I need to understand visualization approaches, and I need to understand how to make an effective report. How can I, a lowly human being, ever be able to do all of this and become a data scientist?" But because it's actually much more specialized, it's much easier for any one person to enter this field.
Usually, which part of this field you're best fit for, you'll just naturally fall into. People who really have a good business understanding, and think about people and that sort of component, are often the ones who end up as the decision scientists, and the people who really like software engineering and thinking about, well, what's the right way to store this data in a big cluster, end up doing that kind of work, so it's often naturally sorting.
Hugo: Exactly, and I think that will help a lot of newcomers who, as you say, can find the world of data science incredibly overwhelming, get started.
Jonathan: Yeah, I saw there's this discussion people have been having online around those "This is what a data scientist looks like" infographics that have a hundred things listed that the person knows, and it's like, no, any one data scientist maybe knows four of those, and that's plenty.
What isn't data science capable of?
Hugo: Exactly. The unicorn is rare. Jonathan, we've seen clearly that data science can be highly impactful, but as with any endeavor, I'd like to invoke a healthy skepticism, and I'm wondering, to your mind, what can't data science do, or what isn't data science capable of?
Jonathan: I think that there's this naïve assumption that if you use data in a decision, then that decision will be better. So if you are deciding where to locate a new factory, then having data on every possible bit of information you can know about all the different places will intrinsically cause you to make a better decision, or, for instance, in my last company, we did a lot of loyalty program design, so having all the historic data on a company's customers before designing the new loyalty program will intrinsically get you a better-designed program than not having that information.
There's that assumption that more information, more data, is better, and therefore data science should make every decision better, but often in practice, data can sometimes make decisions harder, or get a less good solution, right? For instance, if the data you have is sufficiently limited, then it could be that using data actually causes you to make a less informed decision.
For instance, if you are designing a new loyalty program for a company, let's say you're designing for Starbucks a new program where you buy a certain number of coffees and get a free coffee. Knowing everything you possibly could about how customers behaved in the old program doesn't actually tell you very much about how they will behave once the new program's available, but if you assume that how they behave in the old program will be just like how they behave in the new program, you may make a decision that is actually worse than if you hadn't thought about data at all.
Hugo: Is this speaking to the idea that an underlying assumption of data science is that the past will be a good predictor of the future?
Jonathan: Yes, exactly: that the past will be a good predictor of the future, or that customers who look similar to other customers will all behave similarly. Actually, if you really step back and think about it, when you're doing a data science analysis you're often making a ton of assumptions, and those assumptions have the possibility of leading you to a bad place, which, if you think about it, can get really dark, because then it's like, why do anything ever? Why should anyone even care about data science? But no, it's still good; you just have to be careful.
Here's just another example, right? Suppose you're doing an analysis for a company, and you find that customers who buy product X end up spending way more after that, so you now have this bit of knowledge that the customers who have bought product X spend more in the future. That piece of information doesn't necessarily tell you that much, right? Because from it you could infer that, A, if we give everyone else product X, they will then buy more, or, B, you could say, "Hey, all those customers who buy product X, they're high-value customers. Maybe we could get rid of product X and they would still be high-value customers," right? Or maybe if you took product X and gave it to the people who didn't buy it, maybe they'd be annoyed because product X doesn't work for them at all, right?
There are all these different things that could come out of the analysis. The finding that the people who have bought product X buy more in the future could be valuable, but it could also be incredibly dangerous if you make an assumption off it that turns out not to be true. That doesn't mean the initial analysis wasn't valuable in the first place; it's just that you have to be very thoughtful about what conclusions you draw from it.
Hugo: Absolutely, and then which business decisions to make afterwards, because that's also what's not clear in this case.
Jonathan: Right, exactly, and I think most of the time, when we use data science to make decisions, it is not necessarily clear what is the exact implication you can pull from it, which is not to say that you shouldn't pull the data at all; it's just that having good data is not sufficient to then make a good decision. Back to the original question of what data science can't do: it often can't tell you the final conclusion you should draw, it can just present evidence.
What are common pitfalls of organizing data science teams?
Hugo: Fantastic. We're here today to talk, among other things, about organizing data science teams and good organizational structure for data science teams, but before we do this, I'd like to know what are the most common pitfalls that you've seen people make when organizing data science teams.
Jonathan: I think there are a couple big areas people struggle with. For one, there's a big question of how should you organize data science within the bigger company? One approach people take is they build what's called a center of excellence, right, where you take all of your data scientists, you put them together in a space, and you say, "Well, the data scientists will talk together, they'll share what they've learned, they'll really grow, and that's the best way to organize our data scientists."
Another way you can do it is you can say, "I'm gonna distribute the data scientists. I'm gonna put data scientists in each part of the company. Finance is gonna get data scientists, and marketing's gonna get data scientists, and the supply chain's gonna get data scientists," and you say, "Well, this is the best way to organize because those data scientists will be really focused on what's important to them, and really be able to help out that particular part of the organization."
These are actually two totally different ways of organizing, and they can have really different results, so thinking about what's the right way to do it for your company is difficult to do, especially because often data science grows organically, and so to be able to coordinate hiring and distribution is just not an easy thing to do.
Hugo: It may change as a function of time and as a function of industry, and the actual business needs of the company in question.
Jonathan: Yeah, it's really not a one size fits all. For some companies, one approach works, and for some it's really about being distributed, and things like that.
Hugo: Are there any other pitfalls that you commonly see?
Jonathan: Yeah. I think similar to that is how do you actually store the knowledge that the data science team generates? People often don't think very hard about how you're gonna store the knowledge that you gain from data science. For instance, if you learn that customers who buy product X spend more in the future, how are you gonna store that so that people remember it, right? How are you gonna keep that knowledge around so that it's not lost the moment the person who did the analysis has left, or the person who is the recipient of that analysis has left, and if you think about it, there's actually a lot of things that have to get stored, right? There's the things that you've learned from the analysis, there's the code to do the analysis, there's the knowledge of how do you actually execute that code to do the analysis?
The best data science teams are the ones that can really handle the ability to have people changes, the ability to have changes in who's working on what, and have all that knowledge be stored, and the worst data science teams are the ones where all that knowledge is stored in one person, and if that one person quits or whatever, then that knowledge is lost. If you think about it, that is hugely expensive for an organization to have someone do lots of different analysis, and help out a company a lot, and then just suddenly lose all of that the moment the person quits.
Hugo: Interesting. In a word, essentially, it's how to take the knowledge gained and store it, but also how to distribute that knowledge.
Jonathan: Exactly, so companies that do this really well, they will make knowledge sharing hubs, right? Places where when you do analysis, you drop it right in, and they will come up with standards for their code, so code always has to be stored in this particular way, and a thing that I really get caught up on a lot is, when you do an analysis, make sure that someone else can run it by pressing a single button.
If you're thinking very practically, if you're writing R code and you have to run this script, and then that creates a dataset that you then run through this script, and then another script, and you have to remember all those steps, no one else is gonna be able to do that, so to be able to enforce a rule that, hey, any time you do an analysis, it has to be that someone can run one script in one place, and it runs everything else correctly. Those things can be really helpful to an organization and a data science team, but it isn't something that everyone always does.
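To make that one-button rule concrete, here's a minimal sketch of a single-entry-point driver script in Python. The stage functions and data are invented for illustration; in practice each stage would be your real loading, cleaning, and analysis code (in R or any other language), but the point is the same: one script encodes the order of the steps so nobody has to remember it.

```python
# run_all.py -- hypothetical single entry point for a multi-step analysis.
# Each stage takes the previous stage's output, so running this one file
# reproduces the whole pipeline in the right order.

def load_data():
    # Stand-in for pulling raw data (e.g., from a database or CSV export).
    return [("2024-01-01", 120), ("2024-01-02", 95), ("2024-01-03", 130)]

def clean_data(rows):
    # Stand-in for the intermediate dataset a second script used to create.
    return [(day, sales) for day, sales in rows if sales > 0]

def analyze(rows):
    # Stand-in for the final analysis step.
    total = sum(sales for _, sales in rows)
    return {"days": len(rows), "total_sales": total}

def main():
    raw = load_data()
    clean = clean_data(raw)
    result = analyze(clean)
    print(result)
    return result

if __name__ == "__main__":
    main()
```

The structure, not the content, is the lesson: a teammate (or you, a week later) runs `python run_all.py` and gets the whole analysis, with no tribal knowledge about which script comes first.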
Hugo: Agreed, and not only may other people not be able to run it, but I, in a week or less, may not be able to run what I've been doing, right?
Jonathan: Yeah, that's how I learned this lesson. It only takes getting burnt by this yourself a couple of times, and also I think 10 years ago, when I started in this field, we didn't have the tools to do it that we do now. In R, for instance, you can have R code that automatically pulls from your database, runs an analysis, and creates a Word doc with all of your output. A single R script can do all of those things, and that just wasn't possible 10 years ago.
What is the ideal organizational structure for a data science team?
Hugo: No, that's right. Having talked about the common pitfalls, in your mind, what is the ideal organizational structure for a data science team?
Jonathan: For me, I've found that what works best is what I think of as distributing the data science within the business, but culturally being centered as a single group. By that I mean if you're gonna have lots of different data scientists, some who have to work on supply chain and some who work on marketing, really have those people embedded in those parts of the organization. Really have them get to know the problems and the people and what's going on there, but that being said, make sure that all the data scientists culturally feel like they're connected together, so have even small things, like team lunches, and quarterly outings where we talk about our career goals, and things like that that get them to feel like they are part of the data science team.
That combination I think really works the best, so distribute the work, and on a day to day basis have people embedded, but make sure you have everyone coming back as a single group culturally to keep people feeling like one core team.
Hugo: Yeah, that's cool, and I'm wondering whether even this idea of storing and distributing knowledge in a knowledge repository, for example, or a wiki, can actually help with this kind of culture of a single group, that they're checking each other's code every now and then, seeing what each other are working on in this kind of a knowledge base.
Jonathan: Yeah, exactly, and if we're culturally in one group, then when that group is meeting once a week or quarter or whatever, when we're meeting together, we'll be talking, and I'll say, "Hey, let me tell you about this analysis I did." They're like, "Oh wow, that would have been great over here," and you build those bridges, and you get things continuing to work together, and that kind of cultural cohesion is really valuable, especially to your point of keeping the knowledge around.
Who to report to?
Hugo: Yeah, and in this particular model, would ... Let me get this right. Would a data scientist on a growth team report directly to the VP of growth, or would they also report to a chief data scientist or something like that?
Jonathan: My heart tells me they would report to the chief data scientist, but dotted line to the other group, but I think in practice it really depends on the actual organization. You said ideal, and I'm kind of punting by saying, "Well, it depends," but I think that kind of exactly who reports to who is much more around the culture of the company you're working in.
Hugo: Is there an inherent challenge here, though, that if you report to someone, but there are dotted lines also, that you get crossed lines, and it can actually get confusing?
Jonathan: Yeah, I think the problem is, because of the field of data science, you're always gonna get dotted lines, right? If you only report just to a data science team, you will never actually do work for people who care, right? You need some sort of dotted line to get direction on what is important to the business, and if you report just to that part of the business, then you will kind of be isolated, and your data science work won't be coordinated with other parts of the company, which can be really problematic. Yeah, I think just more than other fields, like software development, we really, in data science, have to deal with the ambiguity of who reports to who.
Hugo: Yeah, I agree completely. How does your distinction between the different types of data science play into this ideal org structure? To be very specific, the distinction between what we discussed earlier. Business intelligence work, decision science, machine learning.
Jonathan: Yeah. I think those are really three different types of work, and I think data scientists listening to this podcast would likely agree that business intelligence, decision science and machine learning are really quite different, but people outside of this field often may struggle to understand the difference, and many times those are the people doing the hiring, and, just to the point we were talking about, giving those dotted lines of work, so it's very natural that a person who feels like they're best at decision science may be asked to do BI work, or a person who is really thoughtful about machine learning may suddenly find themselves doing decision science work.
This cross-work happens all the time, but the more you can kind of, as a team, try and enforce a structure that keeps that work separate, the better, because I've really seen teams struggle where people are doing kinds of work that they don't like, which ends up causing them to be really dissatisfied, which causes the team's productivity to go down, and everyone's just unhappy, and that's just not a good way to run a team.
Hugo: Yeah, so I think something worth circling around is the fact that data science teams and data science individuals can actually sometimes be stuck between a rock and a hard place, in the sense that we do have a lot of dotted lines everywhere, and in all honesty, a lot of the time you'll have engineers on one side that you're waiting on stuff to be implemented, and marketeers and BizOps on the other, for example. I'm just wondering what are practical ways of dealing with this unique position for data scientists and data science teams?
Jonathan: Yeah, and I really love that question, because it gets to the thing ... Just as we were talking about, and just as I was saying at the beginning, to me, it feels like data science is much less about understanding complex mathematical approaches or working with giant, massive, big datasets, and it's much more about how do you get people to work together and make decisions using data, and yeah, on a day-to-day basis, as a data scientist, you're gonna be given questions from marketing, and BizOps, and you're gonna be given questions from the engineering teams, and you're gonna have people not wanting to do things you really wish they would do, and trying to manage all of that is difficult, but it's very much a part of the job, and the more, as a field, we think bigger than just what is the next, newest technique, and more about how do we handle these kinds of situations, I think the better we will be.
The question is “what is the actual practical way of dealing with this?”, so I have two answers to that. One is kind of a cop out, and the kind of a cop out answer is hire people who are good at the stuff in the first place. When it comes to hiring, if you are the person on the data science team who does hiring, really try and hire people who you think would be effective at communicating with people outside of data science, because if you're given the choice to hire someone who knows the best techniques or knows how to communicate, the person with the communication skills is gonna really be the person who helps your team out, because just as we were talking about at the beginning of this podcast, data science is about getting people to make decisions effectively, and so a core component of that is communication.
That's the kind of cop out answer, because most of us don't actually have the ability to choose who's hired, so a more practical answer. For me, I've found that the most effective thing is really just trusting that everyone you're working with is trying to do what they believe is in the group's best interest. When the engineering people who you're working with are telling you they can't do a thing, it's not because they're malicious or don't like you or don't respect you, it's because they are trying to do what they think is best, and the same with marketing, and the business teams, and so the more you trust other people to make good decisions, I find the better things go.
Another way of wording this is have empathy for people outside of data science. Really have empathy and understand what are the struggles of the business person? What is that poor engineering team struggling with that they are having difficulty getting you the data? The more empathy you can employ in a situation, the easier these sorts of decisions end up being.
Hugo: I think what you referred to as the cop out answer actually plays into that a great deal, because if you hire people who are good at working with other people, you can trust everybody as much as possible.
Jonathan: Yeah, and oftentimes, when you hire people who are good at communicating and good at more than just purely getting an optimal answer to a textbook math problem, you're gonna get more different perspectives, you're gonna get more viewpoints, you'll have a more diverse team, and it just ends up being much more successful than hiring around who has the best technical skills.
Hugo: Yeah, and actually, this discussion reminded me of several points in ... You've got a fantastic series of posts on Medium, a four-part series on your hiring process and how to hire data scientists. Listeners, definitely check out all of Jonathan's hiring data scientists posts. The reason I mention this now is that you have a section on the take home challenge, which is a real business problem that you give people during the interview process, and you encourage, almost actually force them, I think, to email someone on your team at some point, to even see how they frame interacting with collaborators in that type of workspace.
Jonathan: Yeah, because, you know, when we're interviewing, you really wanna understand how is the person going to do at the job you are hiring them for, and oftentimes that job involves communicating with others. As part of the case studies we would give out at Lenati, I had written it so that the person taking it had to ... They needed a data dictionary to be able to do almost anything, and I didn't give them a data dictionary, and I said in the instructions, "If you want a data dictionary, please just email me," and so they almost universally would, and that email was very informative on how do they think about handling communication, and did they make their request clear, and so that was very helpful, and that email was very much like what they would have to do on the job.
Model or Idea?
Hugo: In a previous discussion we had, talking about the distinctions between decision scientists and data scientists, you raised an interesting point, which is you asked the question “is your product, as a data scientist or a decision scientist, a model, or is it an idea?” I'm wondering if you could unpack that slightly, because I think that's a wonderful question that perhaps we need to think about more.
Jonathan: Yeah. It's very easy as a data scientist to think what you're building and what you're doing is making models, so for instance, maybe you're making a segmentation model, or maybe you're making a customer lifetime value model, or whatever. You're making models, and for some people, that is their job. Really, it's just literally make a model, put it in production, and call it a day. Those are the people I've referred to being in machine learning, but for most of us, the job is much more around delivering an actual idea, right?
It's not just creating a survival model and survival curves, it's really saying, "Okay, I have learned that these are the indicators that a customer is not likely to survive, and so I need to convey the idea of hey, these are the signs that someone's not gonna survive," and by survive, I don't mean die, I mean stop being a customer. "These are the signs someone's not gonna be a customer, and so when you're thinking about doing marketing, you should try and market around these particular ideas." When you pivot from thinking about delivering a model to an idea, then a lot of things come into more focus, right? Just like we were talking about before, you know, how do you give the right presentation to convey that idea? How do you get people convinced that you are a trustworthy person, that your ideas are sound? There are lots of things that come into convincing someone of something, and it's more than just a single model that does that.
Generalists vs. Specialists
Hugo: We've spoken a lot about the different types of data science work that can be done, and you spoke to the fact that it's important to recognize whether people really are interested and adept and skilled in business intelligence, decision science, machine learning. However, when you were director of insights and analytics at Lenati, you made it very clear that you were hiring generalist data scientists, not specialists, and I'd like you to speak a bit more to the role of generalists and specialists in data science as a whole today, and how you see this evolving in the future.
Jonathan: To the point earlier about there being business intelligence work, decision science work, and machine learning work, I think it is very reasonable, acceptable and wise to focus on one of those three areas. For me, most of my work has been in the decision science space. I've done a fair amount in BI and machine learning, but most of my work is in decision science, and so for each person, having a specialization is generally good. But, just as we were discussing earlier, what will happen is oftentimes, for whatever reason, your company doesn't have that much, or your team doesn't have that much of that type of work to do right now, you know?
It's like, "Oh, we don't have any decision science work to do, but we really could use someone to make this visualization," and so having the ability to switch between one or the other, it's not something you have to be able to do, but you should at least want to try it, and for me, a lot of my career growth has just come from a situation where someone asked me to do something that I didn't know how to do before, and I just said, "Well, I'll try," and I learned it.
When I hire, I really look for people who are generalists because I think it's important that they feel comfortable, that if they were given something new, they would be able to learn it. When you think about running a team, right, you could have a team where everyone's literally a generalist, and technically anyone can do anything, right? If we have BI work, anyone can do it. If we have decision science work, anyone can do it. In some ways, it's very good for running a team, because if anyone's available to do work, they should be able to do it, and as an employee, it's very good to be able to work in that space, because it means that if something new happens, you can feel comfortable trying to do it.
Generalization is good, but the downside of that is that people like doing the stuff they like doing, right? Some people really like making dashboards, and some people really like building machine learning models and coding them so they can run in production, and some people like giving presentations, and so if you can get your team to specialize, then you get to give people what they like to do, and then if you like people, you should like to give them what they like to do. By allowing for more specialization, you can then make your team happier, but then the downside of specialization is that, say you have someone on your team who likes making the presentations, yeah? Someone who likes loading the data. You have all these different specialist roles.
Well, now anytime you need to do one task, you have to have, like, five people involved, right? You have to have the data loader, and then the person who does the analysis, and then the visualizer, and then the presenter, and each time you have a new person involved, getting knowledge from one person to the other is hard, it takes time, that person loses context, and so that can make your team run less efficiently.
It's like, I was constantly in the struggle to try and decide, well, how much should I allow people to specialize versus how much should I say, "Everyone here technically could do everything, so I will give someone a piece of work even if they're not the best at it, and they will learn how to do it." At the end of the day, I really landed on, "I'm gonna do that generalist approach of I'm gonna assume anyone can do everything, and while some people are better at things than others, I'm not gonna feel uncomfortable giving something to someone a little bit out of their comfort zone," but I'm not sure that's the right approach, and so if a listener has some evidence that, "No, here's a situation where our team really highly specialized and it worked extremely well," I would love to hear about it.
Hugo: Yeah, as would I, and look, Jonathan, embedded in that narrative I think is one of the most important points for people kind of getting into data science, which is really to demonstrate the willingness and ability to learn new things on the fly almost constantly, right?
Jonathan: Yeah, and in fact, let me tell you a story from my very first job at Vistaprint. I was a month out of school, out of a master's. First real job, and I was working at the company, and they had this situation where there was a bug on the website, and they lost a fair amount of money, and they only realized there was a bug because the marketing dashboard went down a bit, and so marketing noticed, and then eventually they found the bug.
A marketing executive came to me and he said, "You're on the forecasting team." The bug was on a Tuesday, and he said, "You're on the forecasting team. Can you pull the last five Tuesdays' worth of data, so that we can analyze Tuesdays and see that this issue occurred," and so at the time, I realized, hey, this is actually a deeply complex mathematical problem of if you have a sales forecast, and every day you have a certain amount of sales come in, how can you know when sales are so much lower than baseline that it's a problem? In fact, how can you even know what the baseline is? There's weekly seasonality, there's yearly seasonality, our sales are going up each day because our company is growing, so there's all these different factors, and how can you tell, okay, this is so bad, we should do something about it.
I realized that problem was way out of my comfort zone, no one's ever tackled it before, and all the executive wants is these five Tuesdays, so do I give him the five Tuesdays or do I make this problem way more complicated and complain that we're not doing it the right way? As you can imagine, I did the latter, and I complained a lot, and eventually, this ultimately led to me running a team where we built this tool that would actually, every day, analyze all these different components, and try and figure out, okay, is today a day that we're considering sufficiently bad that we need to do something?
To do that, I had to learn a ton about statistical quality control, I had to learn a lot about different forecasting methods, and I had to really get out of my comfort zone, but because I did that and because I had learned how to learn, I really succeeded in that environment, and I think that's what makes the best data scientists, is the people who have learned how to learn, which is not easy, but if you can learn how to do it, it's very valuable.
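The core of the idea behind that tool can be sketched in a few lines, in the spirit of the statistical quality control Jonathan mentions. This toy version (all numbers and names invented for illustration) handles only weekly seasonality, by comparing each day's sales to a day-of-week baseline and flagging days that fall below a control limit; a real system would also model trend and yearly seasonality, as he describes.

```python
# Toy sales-anomaly check: is today's number so far below the usual
# value for this day of the week that we should investigate?
from statistics import mean, stdev

def is_anomalous(history, today_sales, weekday, n_sigma=3.0):
    """history: list of (weekday, sales) pairs from past days.

    Returns True when today's sales fall more than n_sigma standard
    deviations below the mean of past days with the same weekday,
    a simple lower control limit in the quality-control sense.
    """
    same_day = [sales for day, sales in history if day == weekday]
    if len(same_day) < 2:
        return False  # not enough data to estimate a baseline
    baseline, spread = mean(same_day), stdev(same_day)
    return today_sales < baseline - n_sigma * spread
```

So with five past Tuesdays hovering around 100, a Tuesday at 40 gets flagged while a Tuesday at 99 does not, which is roughly the "analyze the last five Tuesdays" request, except with an explicit baseline and an explicit threshold instead of eyeballing.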
Future of Data Science
Hugo: Agreed. We've talked a lot about what data science looks like today. What does the future of data science look like to you?
Jonathan: That's a good question. I see a couple of different things happening. I think more and more companies are going to get business intelligence people and decision science people, right? It's just going to be a common narrative for executives, and business people, that they're gonna wanna see what the data looks like at any time, which means they need BI to help support that, and they're gonna then look at that data and be like, "So what should I infer from this? What's important and what sort of decisions should I make?" And you need a decision scientist to do that, so I think those groups are really gonna flourish.
I think the machine learning work is gonna kind of diverge, where a lot of the things that used to be really common data science work, like building a churn model or segmenting customers, I think that stuff's gonna really start to be commoditized, and I think you can already start to see that happening, where it's getting to the point where it makes sense for a company to just buy an off-the-shelf customer lifetime value model rather than building their own, because building your own means you have to maintain it, your team may not have a customer lifetime value model expert, and if you buy one off the shelf, that model is trained on other people's data, too, so it should, in theory, work better.
I think a lot of that's gonna be commoditized, but I think also there's gonna continue to be companies who really just continue to do ML work in terms of really trying to do cutting edge natural language processing, and AI, so I think those jobs are gonna continue to exist, but I think it's gonna become more and more special applications as opposed to just general, "Hey, I'm at a retail company, and I'm the retail company's machine learning engineer."
I think, on a related note, something I've noticed a lot in the last year or so is I've seen a real divergence in that I've seen half the people I know saying, "Data science is such an easy field to get a job in," and the other half saying, "Data science is an impossible field to get a job in," and the difference is the people in the first group are all senior people who have worked in data science, and companies are begging to get experienced data scientists on their teams.
But there are lots and lots and lots of people who are doing data science bootcamps, getting data science degrees, reading about data science and wanting to get into it, and so there's a large group of people who are less experienced, and I think companies are not hiring those people at the rate that they're being generated, and I suspect that that trend's gonna continue to get worse, where people are gonna continue to want senior data scientists even more than they do today, but the wave of people trying to get into data science is gonna keep growing, and so it's gonna be more difficult for a new person to get in.
Call to Action
Hugo: As a final question, I'm wondering if you have a final call to action to our listeners out there.
Jonathan: Yeah, I would say I've got two calls to action, which is maybe breaking the rules, because you're only supposed to have one, so two. One, if you are interested in having me consult for you or your organization around data science, in particular around how you grow and expand your team, hiring, or potentially coming in with new approaches, please reach out to me. If you're not in a position where you're looking to hire a consultant, but you still find me mildly entertaining, you should check me out on Twitter, and my Twitter account is @skyetetra, and that's Skye with an E at the end, so S-K-Y-E T-E-T-R-A.
Hugo: We'll also link to you on Twitter in the show notes. All right, Jonathan, thank you so much for coming on the show, and the wonderful chat.
Jonathan: Thank you. It's been a lot of fun.