
From Predictions to Decisions

Dan Becker deep dives into the intersection of decision sciences and machine learning, how data teams can go from experimentation and deployment to providing value at scale for organizations, and more!

May 2021

Guest
Dan Becker

Dan Becker is one of the top thought leaders on machine learning and AI. He's the Chief Generative AI Architect at Straive, building AI solutions for content technology. He also runs the Build Great AI consultancy, and was previously VP of ML Development Tools at DataRobot. Dan is also a successful Kaggle competitor, the author of "Automated Machine Learning for Business", and the instructor of the DataCamp course "Introduction to Deep Learning in Python".


Host

Adel Nehme

Adel is a Data Science educator, speaker, and Evangelist at DataCamp where he has released various courses and live training on data analysis, machine learning, and data engineering. He is passionate about spreading data skills and data literacy throughout organizations and the intersection of technology and society. He has an MSc in Data Science and Business Analytics. In his free time, you can find him hanging out with his cat Louis.

Transcript

Adel Nehme: Hello. This is Adel Nehme from DataCamp and welcome to DataFramed, a podcast covering all things data and its impact on organizations across the world. One of the most exciting things we've seen over the past few years is the maturation and operationalization of data science and machine learning in organizations everywhere. More and more, machine learning is jumping from notebooks into business processes. However, as machine learning starts driving more and more predictions around complex business processes, finding a scalable way to turn predictions into decisions that maximize value for organizations will be the next frontier in going from operationalization to full-blown usefulness.

Adel Nehme: This is why I'm excited to have Dan Becker in today's episode. Dan became a professional data scientist in 2012 when he finished second in a Kaggle competition with a $3 million grand prize. Since then, he's consulted for six companies in the Fortune 100, served as product director in the early days of DataRobot, and worked at Google. Dan has also made open-source contributions to Keras and TensorFlow, and is the founder of Kaggle Learn. Most recently, he founded Decision.ai, which aims to solve the challenge of bridging decision sciences with machine learning.

Adel Nehme: In this episode, Dan and I talk about his background, his journey reaching the top of a Kaggle competition, machine learning in business as opposed to Kaggle, concept drift, the friction between developing something that is useful versus shiny, the importance of empathy when aligning business objectives with data science, and finally, the growing intersection of decision sciences and machine learning and how Decision.ai can help data teams align machine learning with business outcomes. If you want to check out previous episodes of the podcast and show notes, make sure to go to www.datacamp.com/community/podcast.

Adel Nehme: Dan, I'm so glad to have you on the show. I am really excited to speak to you about your journey into data science and machine learning and your new venture at Decision.ai. But before we dive in, can you tell me about how you got into data science and the various roles you held in your career so far?

Dan Becker: Yeah, in some ways, I've gotten into data science twice. So when I first graduated from college, I got a job with a startup. This was a long time ago, when a lot of people were selling things on eBay. And we had a service where we used neural networks, which were not popular then and not very good then, to help retailers figure out when to post things to eBay and how to optimize their listings. Both because of what the state of the art was then, and also because of how little we knew, or how far off from the state of the art our little startup was, that was a total and complete failure. And after six or nine months of doing that, I decided, all right, machine learning is never going to catch on. The words sound cool, the promise sounds cool, but it's just not going to work. And I did conventional statistics for a while, got a PhD in economics, and moved away from machine learning. For a time, I was an economist for the US government.

Dan Becker: And then about 10 years ago, someone told me about a Kaggle competition. And I figured, yeah, machine learning doesn't work, but I know a lot about other ways to work with data. And so I entered this Kaggle competition. And I was not dead last, but I was almost dead last. And I was so shocked to be so far behind the curve of what other people were doing that every night, and all the time on the weekends, I would focus on improving, just getting better at building models. And I eventually got up to second place out of over 1,300 teams in that competition.

Dan Becker: From there, a consulting company hired me, both to help them with some of their projects and, even more than that, to train their consultants in the techniques that I'd been using in that competition. I did that consulting for several years. For any of your listeners, I think that consulting is such a great way to get exposed to a lot of different data science use cases. And that has served me well for a long time. After that, I worked for DataRobot as an early employee, in charge of product there for three years or so. That was also really enjoyable. DataRobot is an awesome company, awesome product. And then I left DataRobot to work at Kaggle and Google. I was there for three or four years. And then most recently, I started Decision AI.

Kaggle competitions

Adel Nehme: That is so awesome. And it's always really nice to learn from someone who has been involved in data science and machine learning for this long and has this panoramic view over the field. I want to start off with the second time you got into data science. As someone who spent tons of time working on Kaggle competitions, can you walk me through the process of growth and learning that led you to have these awesome results?

Dan Becker: Yeah, it's very iterative. In some ways, I think most machine learning algorithms start with something that's really bad. So a neural network starts with a random set of weights and it makes predictions, and it's really bad, but then you've got a mechanism to improve. And so it learns from the data as it runs, before you deploy it, and it improves the weights so that it gets more and more accurate. And you could say the same thing for another algorithm, gradient boosting. It starts out with a base prediction that's very inaccurate, very bad, but there is an algorithm for improvement. And I think the same is true for how you improve as a data scientist. I started at these Kaggle competitions and my first submission was very, very bad.

Dan Becker: One of the great things about Kaggle, or at least the competitions, is that you can make a submission and then you have a feedback mechanism. Every day, you can upload a new model or a new set of predictions and say, all right, did this improve? Did it not improve? What do I learn from the feedback of trying it and seeing whether it improved or not? And I just started at something that I was really quite bad at, but every day I tried to improve a little bit. And every day I moved up in one particular competition a little bit and learned a lot along the way.

Dan Becker: The other thing that I did, which I think most people could really do, is learning from the community. So if you're on Kaggle, that means being on the message boards, but it could also be going to meetups. And don't just watch, try and talk to people. I find conversations about data science so fun. It's way more fun than working alone. And I've just learned a ton from it. And a lot of it is the stuff that's hard to find in a textbook, or even hard to find in great courses like what you guys do. There's something about having a back and forth, a give and take, where you talk to someone, that I think is so valuable.

Advice for Aspiring Data Scientists

Adel Nehme: Yeah. I completely agree. Even from my experience as a data educator working with a lot of data scientists, the conversations I've had while developing courses, say, trying to understand how data science is implemented in the field and what I can learn from them, I found that really contributed to my growth and my data skills as well. You've mentioned that working on and scoring high in Kaggle competitions led you to other job opportunities within data science, and this was a dynamic that really held previously. Do you think that this is a general dynamic that still holds today? And what advice do you have for aspiring data scientists looking to break into the field?

Dan Becker: Yeah, I think that many people will probably be disappointed by this advice: competing in Kaggle competitions is not a silver bullet. If you are in the top 1%, or probably in many cases even the top fraction of 1%, it's a great credential, and it will open a ton of doors for you. But tautologically, 99% of people will not be in the top 1%. And so for most people, that thing that I did, which is now even harder to do than ever because Kaggle has gotten more competitive, is not the ideal strategy to get jobs in data science. Again, I think of starting with a lot of conversations: find a community of other learners, Kaggle happens to be a nice one, but there are others. Find a community of data scientists, try and learn from them.

Dan Becker: And then the most important thing is to build a portfolio of interesting work. So I've hired probably about 10 people directly who reported to me. And then I've probably been involved in the interviewing or hiring of maybe 100 other people, which is hundreds of interviews over the course of time. And the thing that is really most interesting to me is when someone has a project. It doesn't need to be related to my work. It could be that they love rugby or they love movies or they love whatever, and they've done a project which is just interesting and they're willing to talk about it. I find that pulls me in, and it helps someone stand out in a pile of resumes. There's such a crowd of people getting into the field. And I think if you can build a portfolio, make it interesting, and put it on GitHub, the first thing I see is going to be a README, and if you make that README easy to read with nice graphics in it, that's going to pull someone in and really stand out.

Dan Becker: And then if you say, my goal is not just to get a first job, but I want a great career, I think that a lot of people spend too much time doing textbook-style learning upfront and focus too little on how do I get my first job? Because once you get your first job, you can keep learning outside of work, but on top of that, you're going to be spending eight hours a day, or however many hours a day, working on data science and getting feedback from colleagues who are more experienced. And I just think that you should think of how do I get great over the long term while getting a job, getting hands-on experience for a lot of my time, and getting paid for it early on. And the key to doing that, as I said, is to build a great portfolio of things that you find interesting.

Adel Nehme: Oftentimes, I receive questions on how to break into data science as well. And I think one of the most important things, one of the lowest hanging fruits, is just getting into it: doing a project, hosting it on Medium or GitHub, writing a blog post about it. And then, as you said, really immersing yourself in the work and learning that way. Now I want to talk as well about your subsequent experiences after Kaggle. So you've worked on a lot of Kaggle competitions and then you transitioned to the consulting world, and afterwards to DataRobot and Google. Can you provide an overview of how different working on and applying machine learning is in a business context compared to a Kaggle competition?

Dan Becker: Yeah, it's incredibly different. The number one thing that matters in a Kaggle competition is to make your model incredibly accurate. The number one most important thing that matters in most business settings is to scope the problem so that you're building a model that is addressing the right problem, so that it is something where, when we get the output of that model, we know how to use that to make business decisions. And so scoping and deciding how the model will be used is the single most important thing. And then you can make your model more or less accurate within that, but it just doesn't matter as much.

Dan Becker: And that's part of why I was so excited about DataRobot as a product. Instead of fiddling with hyperparameters, which can make your model more accurate, or fiddling with how we do feature engineering, an AutoML tool says, all right, you're going to press a button, and however you've set up the data, we're going to make your model as accurate as possible. And that allows you to focus on, okay, where is this model going to fit into our decision-making process? Or how do we prioritize what is important for us to work on versus less important to work on?

Dan Becker: The thing that I always found, and this was true in Kaggle competitions and it was true for us at DataRobot, is that data scientists are pulled between things that are cool and things that are useful. And a lot of times, the pull is toward the cool new algorithm. And I think my job at DataRobot has been how do we focus on things that are useful rather than cool? And you see the same thing with Kaggle competitions. I think the competitions are super fun, and so they're cool, and you see some really cutting-edge techniques.

Dan Becker: And yet, it's really important to stay focused on what's useful. We've got some big picture business goals, we've got to get all the vaccines to people in the right places at the right time as efficiently as possible, how do we do that? Making your model go from a log loss of 0.3 to 0.29 is not the problem; it is how do we even decide what to model in order to make machine learning useful for that problem? And then the last thing, which it would be remiss in 2021 not to mention, is that one of the big differences in a business context versus Kaggle is that in real-world settings you really need to worry a lot about train/test drift. That is, the world changing out from under you.

Concept Drift

Adel Nehme: Breaking down all of the different elements you've discussed, scoping an AI project, making sure that it's aligned to the business objective, and figuring out how to operationalize it within a business process is also super, super important. And you've mentioned concept drift here. Can you talk about some of the remedies for concept drift? And for those who may not be familiar with it, can you also define it?

Dan Becker: Yeah. So concept drift is just this phenomenon where the so-called concepts that are captured in your data, your historical data, because all data is historical, are not going to carry forward into the future. The relationships between things that you have historically will change over time. The number one outcome of that is that you can have a model which was very accurate on your training data, maybe even on your validation data or your cross-validation, and then you deploy it in the real world and it's much less accurate than you expected. Seeing that to some extent is probably the norm rather than the exception, but there are exceptions. If you are modeling a physical process and there's nothing human about it, then in those cases you see less of it. And for processes with a lot of human interaction, how much aluminum will someone need? Well, that depends on economic changes. So those things tend to have more concept drift. It has a lot of other names; train/test drift is another name for it.

Dan Becker: So, the things that you can do to address it. Probably the simplest thing you can do, though it is not perfect, is when you validate your model and ask how accurate will this model be, use out-of-time validation. So if this were happening today, I'd say I'm going to take training data from 2018 and 2019, and then I'm going to use validation data from 2020. That way, if you are working on a process where this concept drift is going to occur, you're very likely to see it, because you'll see a big drop in model accuracy when you go from the training data, what I used to build the model, to the validation data, since the validation data is just from a different period. That doesn't exactly tell you how to solve it. And you'll see that this is pretty common with concept drift: there are a lot of ways to detect it and fewer options to solve it. But that's probably the first line of defense, and it's something that's super easy to do.
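To make the out-of-time split concrete, here is a minimal sketch in Python. It assumes a pandas DataFrame loaded from a hypothetical sales.csv with a date column, a target column, and numeric feature columns; the file name, column names, and model choice are illustrative, not from the episode.

```python
# Minimal sketch of out-of-time validation: train on 2018-2019, validate on 2020.
# File name, column names, and the model are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

df = pd.read_csv("sales.csv", parse_dates=["date"])

train = df[df["date"] < "2020-01-01"]   # older period used to fit the model
valid = df[df["date"] >= "2020-01-01"]  # newer period held out for validation

features = [c for c in df.columns if c not in ("date", "target")]

model = GradientBoostingRegressor()
model.fit(train[features], train["target"])

# A large gap between these two numbers is the drop Dan describes:
# a hint that the relationships in the data are drifting over time.
print("train MAE:", mean_absolute_error(train["target"], model.predict(train[features])))
print("valid MAE:", mean_absolute_error(valid["target"], model.predict(valid[features])))
```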

Dan Becker: The second thing, which is harder to do, is better, but also not perfect, and you should do both of these, is model monitoring. There are starting to be more tools in the MLOps space for this. And what you're doing is you say, every time a prediction comes in, I will store some data about that prediction I made, and then on a regular basis I'll ask, how similar or different is the data that I've predicted on since I deployed that model from the data that was used to build it? That frequently comes back to something called covariate shift, which I'll get to. Ideally, you'd like to just say how accurate is the model: has the model become more or less accurate after I deployed it than it was when I trained it? Typically, that's pretty hard to do, because when you make a prediction, you don't have the true value, the actual value, to compare against.

Dan Becker: So an example would be, if we are making loans and we want to know is someone going to repay the loan. We've built a model on historical data of who repaid their loans. Now we make predictions as people come in to our API. We say this person is 5% likely to default, someone else is 20% likely to default. We don't actually know the default outcome at that time. And so all we have are the features that we predicted on. If the population of people you're making loans to now looks unlike the population you had in the past, that is covariate shift, meaning the features you're using for prediction are changing. Monitoring for that is the way of detecting concept drift that is most common today. Again, it's not perfect. You could have covariate shift and your model is actually staying pretty accurate, or you could have no covariate shift and yet your model is becoming less accurate. But it's something that is easier to implement and is done today.
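As a rough illustration of that kind of monitoring, here is a sketch that compares the numeric features logged at prediction time against the training data, using a two-sample Kolmogorov-Smirnov test as one simple drift signal. The file names, the per-feature test, and the 0.05 threshold are all assumptions for illustration; dedicated MLOps tools do this more carefully.

```python
# Sketch of a covariate-shift check: compare features seen since deployment
# against the features the model was trained on, one numeric column at a time.
import pandas as pd
from scipy.stats import ks_2samp

train_X = pd.read_csv("training_features.csv")    # features used to build the model
recent_X = pd.read_csv("logged_predictions.csv")  # features logged at prediction time

for col in train_X.columns:
    stat, p_value = ks_2samp(train_X[col], recent_X[col])
    if p_value < 0.05:  # illustrative threshold, not a universal rule
        print(f"possible drift in {col}: KS statistic {stat:.3f}, p-value {p_value:.4f}")
```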

Dan Becker: Even beyond that, there are a lot of things you can do, which are much less common, to adjust your predictions. So let me give you a couple of examples. These are going to seem so subjective that a lot of people push back on them when they first hear them, but I'll tell you why I think they're still important.

Dan Becker: So, say you have a hypothesis that sales for your product in the next year are going to be 20% higher than they were historically, in ways that are not driven by changes in the underlying features. You make a prediction and you should just multiply it by 1.2. You should take that subjective belief you have about your domain and incorporate it into how you make predictions. You can use your conventional machine learning model, but again, just adjust the prediction after you get the output of the model. Some people will say that seems so subjective, and they're a little uncomfortable about it. But the flip side is, if you take that prediction and you don't make that adjustment, you're implicitly assuming that something is not going to change, that the world is not changing. And if you truly believe that the world is changing and you have that domain knowledge, I think it's a mistake not to apply it, even if the way you apply it is very subjective and imperfect.

Dan Becker: We could even do a step better than that. There are ways to make more surgical adjustments, where you say, I think that for a certain population, or for a certain variable, the effect is going to be 50% larger than it was historically. And so I want to take the impact of that variable historically and make some adjustment after I get my prediction, so that the impact of that variable is 50% larger or 50% smaller. There are ways to use machine learning explainability techniques, such as SHAP values, to figure out what the impact of a given variable was, and then make a post-hoc, after-prediction adjustment to it.
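Here is a rough sketch of that kind of surgical adjustment using the shap library, assuming a fitted tree-based regression model and a pandas feature matrix X; the feature name "price" and the 50% scaling are hypothetical. For classifiers, the adjustment would have to happen in the model's margin or log-odds space rather than directly on probabilities.

```python
# Sketch of a post-hoc adjustment: scale one feature's estimated contribution
# to each prediction by a subjective factor, using SHAP values.
import shap

explainer = shap.TreeExplainer(model)         # model: a fitted tree-based regressor
shap_values = explainer.shap_values(X)        # rows = predictions, columns = features

feature_idx = list(X.columns).index("price")  # "price" is a hypothetical feature name
scale = 1.5                                   # belief: this variable's effect is now 50% larger

raw_preds = model.predict(X)
adjusted_preds = raw_preds + (scale - 1.0) * shap_values[:, feature_idx]
```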

Dan Becker: These techniques are not very widespread, but we see the world changing so quickly that I know a lot of people who are taking machine learning models out of production, because they say the world's changing too quickly and we don't have a way to address this. And so I think it's going to be really important for us to figure out how to take our knowledge, or our beliefs about how the world's changing that aren't captured as data yet, and make these types of adjustments.

Adel Nehme: Yeah, I think what you're speaking about here is a very fascinating change that we're seeing in the world of machine learning, where we're seeing much more emphasis on not just producing and operationalizing models, but also integrating human knowledge into machine learning models, explaining different machine learning models, and finally, monitoring models in production. And it's going to be very interesting to see how machine learning operations and data operations as a field evolve within the next few years.

Adel Nehme: Now, circling back to your experience at DataRobot, you've mentioned that there is a friction data scientists often encounter between building something that is cool and something that is useful. I think a lot of data teams often fall into that trap of building something cool, kind of like a resume-driven development agenda. Can you speak about that friction, how you experienced it, and how you went about balancing these two elements?

Dan Becker: Yeah. I mean, you see this tension everywhere, and I think the key, and it's going to come up in a lot of different places, is just empathy: understanding that someone is going to use this model for something, typically one of your colleagues, and you need to see the results of your work from their point of view. And so, if you say, hey, I used LightGBM, they truly don't care. They instead have some decision that needs to get made, and you need to figure out how do I plug into this so that I am, again, creating work that is valuable to them? And that's purely a matter of seeing things through their point of view.

Best Practices to Integrate Empathy

Adel Nehme: Do you think that data scientists can hone this skill over time, that skill of building empathy? And what do you think are some of the best practices that data teams can adopt to really integrate empathy into the process of building data science solutions?

Dan Becker: Yeah. In some sense, I think it is hard not to get this empathy over time. So, if you and I worked together all the time for a year, or even just went out to a bar post-COVID every Thursday for a year, over time I'd notice, actually, the times I talked about basketball, you didn't seem that interested, and the times I talked about something else, it seems like you were really interested. And over time, I would get a sense for what you care about and what things you don't care about, and our conversations would flow to the things that are at that intersection of what we both care about.

Dan Becker: And so I think that if you talk to the stakeholders, the people who really rely on your data science models, it would be hard to talk to them on a regular basis and not get a sense for how they think about these problems and what's useful or interesting to them and what's not. I suppose the real question is, can you do it intentionally, or can you accelerate the pace at which you become empathetic? And I think it's probably a matter of just asking for feedback and saying, hey, here's what I did, how could that have been better? And people will tell you, actually, I didn't understand when you talked about X, or the thing that was most exciting to me was Y.

Dan Becker: And if you ask for feedback and accept feedback, you're inevitably going to get a better sense for how others see the world, and you're going to be able to tailor how you present to them to how they see the world. And once you tailor how you present to them, that in turn will drive the type of work that you do, the type of algorithms you use, or where you focus your attention while you're programming, so that you're focusing your effort on the things that are going to have the biggest impact on the people you work with.

Adel Nehme: 100%, I completely agree. Further touching upon this friction, one thing that I've seen you speak about publicly is the different stages of evolution of data science and machine learning. And I've seen you speak about how machine learning historically was all about experimentation, and now it's about deployment and that we need to go into usefulness. Can you provide your insight into your thinking here?

Dan Becker: Yeah. I think that people who are getting into the field now will maybe be surprised that there was ever a period where companies would just hire you to experiment, and your model didn't need to get deployed, it didn't need to be useful, and that was okay. And yet, when I was consulting seven or eight years ago, it was absolutely the case that people would hire us to build models so that they could be exposed to machine learning, because they knew so little about it and they just wanted to get used to some of the underlying ideas. That was no longer the case even probably three or four years ago. There came a time when they said, if we have you work and this doesn't get put into any real-world process, that is not okay and we'll be upset about it.

Dan Becker: And so three or four years ago, they said, all right, it needs to get deployed. And there was a period of time when, if you got your models deployed, that was great. Getting them deployed actually is a skill both on the engineering side and on the interpersonal dynamics side. And I think it's only in the last year or so when companies have said, yeah, we've been deploying your work, and yet, we're still dissatisfied because we find it's not making an impact on the things that our CFO cares about.

Dan Becker: And so, if I go back a year ago and you said, think of all the data scientists you know who have ever been laid off, I would have said, I've never heard of that. I've never seen that. Data scientists don't get laid off. And now I know a lot of people who've been laid off in the last year. So some of that was the economy, but I think even more of it was companies saying, well, we invested a lot of money in data scientists for years, and if we look at our profit and loss statement, or any of the metrics that we care about as a business, we don't see the impact. And so now there is a demand, more than ever, for your work to show up in something that the rest of the business cares about, and in metrics not of accuracy but of profit and loss or customer retention. That's a change that I think the data science world is still adapting to. We're not quite there yet in having the mindset that we're going to make sure everything we do is really useful.

Adel Nehme: That's very fascinating, especially when you think about the risks: not being able, as a data scientist or as the field of data science, to prove the usefulness of data science within the organization creates a lack of trust from business stakeholders, as well as disinterest and disinvestment in data science. And I think this marks a great segue to discuss your most recent venture, which is Decision.ai. Can you provide an overview of Decision.ai's mission?

Dan Becker: Yeah. So like I said, you've got all these machine learning models being widely used, and if you look at the output of a machine learning model in whatever framework, TensorFlow or scikit-learn or XGBoost, the API is always .predict. And the key insight, which I think a lot of data scientists, or a lot of people using machine learning, haven't accepted yet, is that predictions don't matter. The only thing that matters, the thing that affects the world, is your decisions. And so you get a prediction, and now there's a step beyond that: how do we translate that prediction into an action or a decision? That's frequently called a decision rule, and we are helping companies make better decisions with the outputs of their machine learning models, or helping them come up with better decision rules.

Dan Becker: So let me give you an example. Say you're the data science department for a chain of grocery stores. One of the most important things that you can do is figure out how much of each of the items you sell you need to order from your wholesalers at each point in time. To make it more concrete: in a given week, you're figuring out how many mangoes to order. You build a predictive model, a machine learning model, and it says you're going to sell a thousand mangoes between now and the next time you can get a mango delivery. All right. So how many should you actually order? This is the decision rule. Many people would say, all right, well, if we think we're going to sell a thousand, then we'll buy a thousand. But machine learning models aren't perfect. None of them have an R-squared of one, and those that do have an R-squared of one, we know, have some other problems.

Dan Becker: And so, if you order a thousand, then roughly half the time you're going to run out of mangoes: a bunch of customers will show up and say, I want to buy some mangoes, and they won't find any. And other times, you'll have extra mangoes. When you and I go to grocery stores, it's pretty rare to see that they're out of stock of produce. So why is that? Well, they buy extra stock to make sure that they don't run out, and part of that is they think that running out is particularly bad. So what are the considerations that you'd want to think through? You'd want to think through the shelf life of mangoes. Can I buy extras now and just sell them next week? Next week, I'll buy some more mangoes, but I'll keep those in the back room, and those will fill in when I sell out.

Dan Becker: So you want to think about what's the shelf life of mangoes, you want to think about what's the cost of having mangoes rot versus what is the cost of running out and disappointing customers? You want to think about what's the likelihood of wholesale prices going up next week or going down next week? Because it may be that if I think wholesale prices are going to go up, maybe I buy extra this week to avoid the future price increase.

Dan Becker: So there are all sorts of external dynamics beyond here's my prediction for how many mangoes will sell. And the thing that we are trying to do is allow a data scientist to mathematically encode all of this broader context, so that they can rigorously optimize their decisions for big-picture goals. When we make our mango purchase, we want to figure out what's going to make us money in the long run, or keep our customers happy in the long run, rather than being narrowly focused on making a prediction for how many mangoes we're going to sell.
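As a toy illustration of encoding that context, here is a small simulation in plain NumPy (not Decision AI's actual tooling): it treats the 1,000-mango forecast as uncertain, attaches made-up costs to spoilage and to disappointing customers, and picks the order quantity that maximizes expected profit. Every number is invented for the example.

```python
# Toy sketch of turning a 1,000-mango demand prediction into an order decision.
import numpy as np

rng = np.random.default_rng(0)

forecast = 1000                                               # model's point prediction of demand
demand = rng.normal(forecast, 150, size=10_000).clip(min=0)   # simulated demand scenarios

margin_per_sale = 1.0    # profit on each mango sold
spoilage_cost = 0.6      # cost of each unsold mango that rots
stockout_cost = 2.0      # goodwill cost of each disappointed customer

def expected_profit(order_qty):
    sold = np.minimum(order_qty, demand)
    leftover = order_qty - sold
    shortfall = demand - sold
    return np.mean(margin_per_sale * sold
                   - spoilage_cost * leftover
                   - stockout_cost * shortfall)

# Search a range of candidate order quantities for the best expected profit.
candidates = np.arange(800, 1601, 10)
best = max(candidates, key=expected_profit)
print("order", best, "mangoes; expected profit", round(expected_profit(best), 1))
```

Because disappointing a customer costs more here than letting a mango rot, the best order lands above the raw 1,000-mango forecast, which is exactly the "buy extra stock" behavior described above.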

Dan Becker: Typically, data scientists today don't have any rigorous way to build the decision rules or the broader structure around them. I'll be honest, doing this is harder. But it's going to lead to much, much better results. I think the benefit of going from guessing or using a rule of thumb for decision rules, which is what most data scientists do today, to rigorously optimizing them, is a win that is going to be as large as the win when we went from analyzing data without machine learning models to having machine learning models.

Dan Becker: So I think the win is going to be huge. It is harder, but I was an early adopter of deep learning, and I was a contributor to deep learning frameworks even before TensorFlow existed; there was one called Pylearn2. For the early adopters of deep learning, yeah, it was a little harder than the machine learning techniques we'd used before that. It got easier because there was better tooling, but the payoff for early adopters was huge. And I think the opportunity is now quickly accelerating into a need for data scientists to be really rigorous about how they use their machine learning models and what these decision rules are.

Adel Nehme: Yeah, and I think this solution you're speaking of sits at the intersection of the problems that we previously discussed: how to go from experimentation to production, and from production to usefulness as well. And I think the idea of combining human knowledge and human assumptions about the business with your machine learning models, and bringing in these external factors, is super useful. So it's really about combining human knowledge, simulation, and machine learning to further align machine learning results with business decisions. Can you walk us through an example of a project you've worked on?

Dan Becker: Sure. So let me walk through a different example. I talked about mangoes, but I want you to see the breadth of how these can be used. So there is a hotel chain. The number one most important problem for a hotel chain is how they set prices, and hotels can update their prices on a daily basis. So for a given location, they have a machine learning model. The model they have today says, for the hotel nights that are 90 days out, if we set our price at, let's say, $100, we'll sell five bookings; at some other price, we'll sell six; at some other price, we'll sell seven. Now, which of those is better? Since the rooms that we don't sell today can be sold tomorrow, and the rooms that aren't sold today or tomorrow can be sold the next day, there is a dynamic optimization problem here.

Dan Becker: And the way that you would approach this problem in Decision AI is, one, you would build your machine learning model for how many hotel nights we can sell at any given price. You bring that into our software; you can build it in XGBoost or TensorFlow or scikit-learn or any of these tools. And then you write out some domain knowledge equations, which are typically pretty straightforward, that just describe the other structure of how this problem works. So for instance, you would say, the revenue that we accumulate from sales on any given day is the price that we set on that day times the number of rooms that we sell on that day. Okay, that's a pretty straightforward equation. You'd have another equation or formula that says, the number of rooms that we have left over to sell is whatever we had available to sell the previous day minus what we actually sold the previous day. You'll have a couple of other equations for how these things evolve over time, but they're all equations that someone who works in this field will know very well.

Dan Becker: And we can now start on today and say, for any given price, how many rooms do we sell at that price? What's our profit or revenue from that today? We might calculate some other things today too. Then we use that to calculate the starting spot that we're in tomorrow. How many rooms do we have left over? Maybe what is our competitor's price tomorrow, which is going to depend on what we did today because of competitive dynamics? And you may have a machine learning model that predicts the competitor's price tomorrow as a function of what we did today. So at that point, you'll have the state of the world tomorrow, and you'll take your decision rule and say, all right, what happens tomorrow if we use this way of setting prices? And we just propagate that forward until either we run out of rooms or it's the night that we're selling rooms for. And then we can add together the revenue from selling on day one, day two, day three, and that gives you total revenue.

Dan Becker: So now we have this simulation environment. People who have done dynamic optimization or operations research before will see parts of this that feel familiar; we're now combining that with machine learning. And we've got a simulation environment where we can try out different price-setting functions or different price-setting rules to see what happens over this long period of time, in terms of not just the things that we typically predict, like how many rooms we sell on any given night, but what our cumulative revenue is over this whole period of time. Which comes back to the question people keep asking: what does your CFO care about? Things like revenue, things that the rest of the business cares about. And we can now rigorously optimize for revenue over a long period of time.
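Here is a stripped-down sketch of that simulation loop. The `bookings_at_price` function stands in for the machine learning demand model, the inventory and revenue updates play the role of the hand-written domain knowledge equations, and the numbers and two candidate pricing rules are invented for illustration; this is not Decision AI's actual API.

```python
# Toy simulation: propagate rooms left and revenue forward day by day,
# then compare candidate pricing rules by total revenue over the horizon.

def bookings_at_price(price, days_until_stay):
    # Stand-in for the ML demand model: cheaper rooms and closer dates sell more.
    return max(0.0, 12 - 0.08 * price - 0.05 * days_until_stay)

def simulate_revenue(pricing_rule, rooms=100, horizon=90):
    revenue = 0.0
    for day in range(horizon):
        days_until_stay = horizon - day
        price = pricing_rule(rooms, days_until_stay)
        sold = min(rooms, bookings_at_price(price, days_until_stay))
        revenue += price * sold   # revenue today = price set today * rooms sold today
        rooms -= sold             # rooms left tomorrow = rooms today - rooms sold today
        if rooms <= 0:
            break
    return revenue

# Two candidate decision rules for setting today's price.
flat_rule = lambda rooms_left, days_left: 100.0
dynamic_rule = lambda rooms_left, days_left: 60.0 + 0.5 * rooms_left - 0.3 * days_left

for name, rule in [("flat $100", flat_rule), ("dynamic", dynamic_rule)]:
    print(name, "total revenue:", round(simulate_revenue(rule)))
```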

Adel Nehme: Apart from the hotel industry and some of the use cases you've worked on, can you name some of the other use cases that you've worked on at Decision.ai?

Dan Becker: Yeah. The two that we've been closest to are financial fraud and inventory management. So machine learning is very, very widespread in identifying transactions that are likely to be fraudulent, but it's a classification model. So what is the output of that? You run a transaction through a conventional machine learning model, and it says this transaction is 7% likely to be fraud. Okay. Well, what do you do with that? Do you reject it? Do you accept it? Frequently, there is some third path of investigating it. And which of those you do should depend on a lot of things. Do you think this is a high-value customer who you're concerned about upsetting? Is this the beginning of what's likely to be a long trend of fraud that you want to nip in the bud? So how do we take that prediction of the likelihood of fraud and make sure that we're using it optimally?

Dan Becker: So we've worked in financial fraud, we've worked in inventory management. That's going back to that mango example. All right, we think we're going to sell a 1000 units of some product, what do we actually buy from our wholesaler? What do we ship from our warehouse or distribution center to our retailer? So inventory management is another one. I would say I'm especially excited about supply chain management applications. Supply chain management is a quite broad field, but how do we make sure that in our manufacturing process, we have all the ingredients we need, all the pieces we need? How are things shipped from one place to the next? After they're manufactured, how do we get them to a distribution center or warehouse? And all of that is stuff that, in addition to inventory management is stuff I'm very excited about.

Understanding Business Context

Adel Nehme: Yeah. I completely agree with you about supply chain management, because any marginal improvement to a supply chain network that is highly complex and highly optimized will give so much additional value to the business. And given the importance of embedding business knowledge and decision rules into machine learning models to arrive at usefulness, how important do you think it is for data scientists to truly understand the business context around the models that they are building?

Dan Becker: Yeah. I mean, some of this goes back to businesses having lost interest in pure experimentation. Now it is absolutely necessary for you to deliver results that are not log-loss-style results, but rather: because of what I worked on last month, here's how our business is better off. To do a good job of that, you need to be interested in and knowledgeable about the broader business context. And so I think just to stay employed, you need to be interested in and knowledgeable about how your business works. And then if you go, all right, that's what it takes to be employed, what does it take to be a future data science leader?

Dan Becker: The future data science leaders unquestionably will be those who can take the work of a group of data scientists and then speak to executives across the business. And how do you do that? It is really having that intersection of knowledge of the business, ability to speak well and knowledge of data science. For a data scientist, if they say, hey, I want to be a director of data science or I want to be a chief data scientist, it is absolutely going to be crucial that they understand the rest of the business well. But even if they say, hey, I just want to survive in this business, they're going to have to understand it reasonably well.

Communicating With Executives

Adel Nehme: Even more crucial, because data science is now permeating a lot of different business functions as well. So what do you think are some of the common mistakes data teams make when they communicate with executives?

Dan Becker: Yeah, I think the main mistake is that they focus too much on data science primitives rather than the business. So, let's say I buy a new computer. If there was not someone who could speak well to me as a purchaser of computers, you can imagine the person who was a hardware engineer might say, really, the great thing about this computer is that I worked on the bus between the CPU and the RAM, and we got the resistance from six micro-ohms down to five micro-ohms. If they said that to me, I would say, I truly don't care. Actually, I don't even know what that means. And I made that up; maybe that pathway's resistance isn't even measured in anything like micro-ohms, maybe the thing I said is gobbledygook. But I would certainly say, I don't care. I'd probably say, I don't understand.

Dan Becker: A step better than that would be if they said, well, we decreased the resistance between the RAM and the CPU, that little channel, and so it went from transferring 600 megabytes per second to 1.2 gigabytes per second. Okay, that seems faster. It's starting to be something that I could imagine caring about. If I were to put in a ton of effort, maybe I could translate that into the impact it has on the work that I do.

Dan Becker: A step better than that is if they released benchmarks and said, for your machine learning workflows, the sort of thing that would have taken you four hours of computation to finish is now going to take you 2.5 hours. All right, now I understand it. I'm ready to say, hey, this is worth buying or is not worth buying. But really, they need to speak in terms of the things that I care about.

Dan Becker: And data scientists, especially when they first start, frequently fall into the pitfall of being that hardware engineer who speaks to me, or by this analogy to the executive, in terms of micro-ohms. They say, hey, we ran LightGBM with a learning rate of 0.05, and tomorrow I'm going to try it with 0.025. Truly, no one cares; even other data scientists don't care about that. Yes, you might need to run both experiments to figure out which is more accurate, but that is truly talking in micro-ohms. And for that matter, if you said, hey, I tried LightGBM, but tomorrow I'm going to experiment with XGBoost or scikit-learn's gradient boosting implementation, junior data scientists may be curious which of those is better, but we really shouldn't care about that.

Dan Becker: And so the thing that you should communicate is: all right, I got this accuracy; here's the impact it would have if this were the accuracy of the model that we deployed for our business. Even back-of-the-envelope calculations, here is how many more customers we would disappoint, or here's how many customers we'd lose because of this, are useful there. You can say, hey, this is the work I did today, I'm going to try a different algorithm and see if I can improve that tomorrow, and I think I can, or I think I can't. And "a different algorithm" is the right level of granularity, because the rest of your business just doesn't care whether you're using random forests or deep learning or LightGBM.
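As a concrete example of the back-of-the-envelope framing suggested here, this tiny calculation translates a model improvement into business terms; every number in it is hypothetical, purely to illustrate the style.

```python
# Back-of-the-envelope translation of a model improvement into business terms.
# All numbers are hypothetical, purely to illustrate the framing.
monthly_transactions = 200_000
fraud_rate = 0.01                     # assume 1% of transactions are fraudulent
avg_loss_per_missed_fraud = 80        # assumed average dollar loss per missed case

old_recall, new_recall = 0.80, 0.85   # share of fraud the old vs. new model catches

missed_old = monthly_transactions * fraud_rate * (1 - old_recall)
missed_new = monthly_transactions * fraud_rate * (1 - new_recall)

print(f"fraud cases missed per month: {missed_old:.0f} -> {missed_new:.0f}")
print(f"estimated monthly savings: ${(missed_old - missed_new) * avg_loss_per_missed_fraud:,.0f}")
```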

Dan Becker: And so I think the key is to always come back to empathy: what are the things that they care about, and how do I talk about that in their language and at their level of abstraction? We absolutely aim to facilitate that by allowing you to build out that structure around your models in Decision AI, so that instead of saying, here's the change in our accuracy, or the change in our R-squared, for predicting demand for mangoes or hotel rooms or whatever else, you can say, if we use this model with this decision rule, here's the impact that's going to have on our revenue, or here's the impact that's going to have on the amount of storage space that we need for extra mangoes. Now you're starting to talk in the language of the business. And a lot of data scientists, especially junior data scientists, are so excited about telling people all they know about learning rates or gradient boosting that they forget what the right level of abstraction is, or what other people care about.

Adel Nehme: It is so exciting to see that the future of data tooling is also trying to solve that problem of how to frame and present data science projects and data science results. Now, Dan, before we wrap up, are you working on any projects outside of Decision.ai?

Dan Becker: Yeah. I just enjoy fooling around with data. There was a while where I was using Decision AI, and how we made some marketing decisions, as an excuse to do this, and I was looking at the popularity of different data science tools and Python packages. But the thing that I'm most interested in right now is some projects related to climate. I'm trying to figure out the ideal way to connect them to Decision AI, but I'm looking at some publicly available day-by-day weather data for different locations in an S3 bucket, and looking at the differential effects of climate change in different locations.

Dan Becker: And for people who don't follow this field very closely, it's been very differential in different parts of the globe. Some places are getting wetter, others are getting drier; some have warmed in the last two decades, others have cooled. So it's very heterogeneous. Not surprisingly, that's going to have lots of effects on the world. I've just been looking at that data and I'm quite curious about it, building a dashboard to interactively play with it, but always trying to think about how I take the work that we're doing with Decision AI and connect it to climate, which I think is probably the big challenge for us over the next couple of decades. So that's my hobby project of the moment.

Call to Action

Adel Nehme: That's awesome, Dan. And it's awesome to see you working on some of the biggest problems of our time as well. We'll make sure to include details in the show notes. Finally, Dan, do you have a final call to action before we wrap up?

Dan Becker: Yeah. The thing that I've found, and I'm talking to data scientists all the time, is this: for people who are just learning, find something that is interesting to you, interesting in a real-world sense, and do a project to learn about that from data, do a project to work on that, even if it's not perfect. If you're further along in your data science journey, maybe working as a data scientist, the thing that I've found is that most people who are using machine learning today, or have pondered using machine learning, have said, if we were to build a model, here's how we would use it. And probably about 80% of the people that I talk to realize, after we have a conversation, that they can actually do a lot better than they thought, because you slice out so much of the context in order to use conventional, supervised machine learning models.

Dan Becker: So here's, I think, the challenge to your listeners: if you've got a problem where you're using your machine learning models to make decisions, drop me a line at [email protected]. We'd love to hear about it. And I have a hunch that if we chat, even for a few minutes or a few emails back and forth, you'll realize there are better ways to turn your predictions into decisions. We've also got free accounts for Decision AI, so we encourage people to come to our website and fiddle around with the product and see if there's a way they can do a better job of optimizing decisions, given the data science knowledge and tools they already have.

Adel Nehme: That is very exciting, Dan. And we'll make sure to include how people can reach you as well in the show notes. Now with that in mind, thank you, Dan, so much for joining us today and for sharing your insights.

Dan Becker: Thanks so much for having me.

Adel Nehme: That's it for this episode of DataFramed. Thanks for being with us. I really enjoyed this conversation with Dan and how he frames the importance of aligning machine learning with value. I'm excited to see where the growing intersection of decision sciences and machine learning goes next and what that means for data teams across the world. If you enjoyed this episode, make sure to leave a review on iTunes. Our next episode will be with Sergey Fogelson, Head of Data Science at Viacom. In it, we talk about the evolution of the data tooling stack, his best practices leading data teams, the importance of democratizing data, and more. I hope it will be useful for you, and we hope to see you next time on DataFramed.
