Preventing Fraud in eCommerce with Data Science
Adel Nehme, the host of DataFramed, the DataCamp podcast, recently interviewed Elad Cohen, VP of Data Science and Research at Riskified.
Adel Nehme: Hello, this is Adel Nehme from DataCamp, and welcome to DataFramed, a podcast covering all things data, and its impact on organizations across the world. I think something we can all relate to during 2020 and the confinement it brought upon us last year is the rise of eCommerce, and just how digitized retail has become. eCommerce is known for its data science use cases, with market basket analysis, recommendation engines, marketing optimization techniques, fraud detection, and more. This is why I'm excited to have Elad Cohen on the show. Elad is the VP of data science at Riskified, a startup specialized in machine learning based fraud detection solutions for eCommerce. Elad is responsible for the ongoing improvements of the machine learning algorithms that power Riskified. He has over 12 years of experience managing data science and analytics teams across research domains. He holds a master's and a bachelor's of science in applied physics from Bar-Ilan University, and an executive MBA from Tel Aviv University.
Adel Nehme: In this episode, Elad and I talk about the state of eCommerce and the plethora of data science use cases therein, with a deep dive into fraud detection and the best practices Riskified learned when creating bleeding edge fraud solutions. We also talked about his best practices for leading and organizing data teams, such as balancing long-term research with short-term goals, keeping business stakeholders engaged, how the data team could own an organization's data literacy, and the future skillset data scientists need to hone in order to be successful. If you enjoy today's conversation with Elad, make sure to also register for our upcoming webinar with him, where he'll dive into much more detail on the ins and outs of scaling fraud detection solutions and other eCommerce use cases in data science. We made sure to include the registration link in the show notes. If you want to check out previous episodes of the podcast and show notes, make sure to go to datacamp.com/community/podcast.
Adel Nehme: Elad, it's great to have you on the show.
Elad Cohen: Well, thank you very much first off for having me here. I'm really excited to be here.
Adel Nehme: I'm really excited to talk to you about data science and eCommerce, your experience at Riskified fighting fraud with data science, and more. But before, can you give us a brief introduction about your background and how you got into the data space?
Elad Cohen: Sure, Adel. So my background started with a bachelor's in physics. I continued to a master's and was actually working with lasers, which was really interesting, and really cool. From there, I went to the military, focusing on the field of signal processing and statistics and algorithms. And I remember at one point when I was working with lasers, I thought, I consider myself a laser guy. And then afterwards in the military, I had the same kind of thinking where I was like okay, maybe I'm a signal processing guy.
Elad Cohen: And when I thought about it a lot harder, I came to the realization that you know what, the domain isn't actually the most important thing. What is connecting these is my love for data and for research, and for being able to look into what's happening and understand it. And that was back in late 2013, when I finished my military service, data science was really starting to come to life and I decided to shift there. So I had already had some background in a bit of machine learning and statistics, but really jumped in with a lot of online courses with R and more machine learning algorithms. And actually, I learned a lot on DataCamp. So it was a great process.
State of eCommerce Today
Adel Nehme: That's great. So before deep diving into the data science applications in eCommerce, I want to first start off by asking you to share your thoughts on the state of eCommerce today. The past year has definitely forced many organizations to start thinking digital when selling products, and consumers to buy products using digital channels. How do you view the state of eCommerce today, and what are the forces that make it such an interesting space for data science use cases?
Elad Cohen: So eCommerce is definitely in a super interesting space right now. So you think about last year, COVID catapulted us forward by several years. You have scenarios where people physically can't get to shops, you've got quarantines and lockdowns, and they have to continue buying whatever they want. And all of these brick and mortar companies realize that they have to go digital. They have to start selling in eCommerce. Everyone joined the show. And on top of that, you're seeing all these sorts of new flows. So things like buy online pick up at curb, BOPAC. All these scenarios where you're trying to make the customer experience as safe as possible for customers. And we saw huge transitions.
Elad Cohen: You've also got, at the same time, lots of new shoppers. So people who are maybe a little bit older, or aren't as tech savvy, who were not as accustomed to buying with eCommerce before. They had to make this transition and they had to start buying online as well. And this has happened across lots of different verticals.
Elad Cohen: So we saw across really all the different industries, a significant uptick in the purchase patterns, as well as changes in shopping behavior. So things that may have been risky or safer, all of a sudden are changing. Lots of people are buying home fitness goods and sweatpants, and lots of food and whatnot.
Elad Cohen: I think the most important thing to take from this is that it wasn't a temporary situation. What happened was that the customers who may not have been buying with eCommerce as much beforehand understand that okay, there's actually a really significant advantage in the customer experience that you can get when you're shopping online versus when you're shopping in a store. And eCommerce is here to stay, and it's here to stay in a very, very big way. We really fast forwarded a few years during this experience. And today when you're looking at it, you can see a situation where customer experience is phenomenal, and the competition is getting really, really fierce.
Elad Cohen: So the companies that are doing this as best as possible are trying to make it as frictionless as possible. They want to have the best possible customer experience and give the customer everything they want. They want to go omni-channel and allow you to start scanning with your mobile and then finish the purchase with your computer and ship to a completely different location, and so on. And data science plays a huge role in all of this.
Elad Cohen: Being able to connect all of that data, understanding who you're working with and who you're talking to so that the customer can switch between devices. They can switch between channels. You're still able to understand who is this entity and who's this person that you're working with. So for them, it becomes a completely frictionless, almost magical experience. That's all due to data science and how we can help.
Adel Nehme: It's really exciting. And I'm really excited to talk to you about different use cases of data science and eCommerce. eCommerce tends to be an umbrella term that sometimes comprises multiple different activities. So when we talk about data science use cases in eCommerce, there could be a host of different use cases in different domains. Can you walk us through some of the use cases and which ones you think are the most interesting for organizations in this space?
Elad Cohen: Sure. So if you think about it from the eCommerce perspective, okay, you're a business that you want to sell and give the best experience possible. So at the end of the day, you want to make sure that your customers are able to find the products that they want. They're going to have a really easy time to purchase them, and have an overall great experience. So there's a few use cases that are going to fall within this.
Elad Cohen: First, probably one of the more basic examples is you need to have a good recommendation system. You want to know your customers, understand what their buying patterns were, what the things are that they're interested in, and what they like. And based off of that, give them good recommendations that will actually captivate them. And they're going to be interested in buying those products from you and not from a competitor.
Elad Cohen: Some additional use cases that are really important for example are pricing optimization. So I'm sure that you've already seen cases where if you don't buy something for a while, you're going to start getting emails with more discounts. And this is extremely personalized and tailored. On the one hand, you have this capability of giving out discounts to certain customers so that you'll increase their likelihood to buy. On the other hand, if you give too many discounts, then you aren't going to be as profitable as you could have been.
Elad Cohen: So the way that merchants are able to optimize this and to really maximize their revenue is extremely important. They need a lot of information about who it is they're working with in order to do that effectively.
Elad Cohen: A third really important use case is marketing optimization. So when you are using different campaigns, either email campaigns, or through social media and so on, you always have this exploration-exploitation trade-off. On the one hand, you want to try different messages and you want to try to see what's going to create the best impact. But you don't want to wait too long. Because if you found something that's going to be useful, you want to start exploiting it. And a lot of the messages, they might be transient. There may be a holiday right now. So you've got two, three days coming up in order to make the most out of it. You can't wait for the experiment to go through.
Elad Cohen: So utilizing things like multi-armed bandits or Thompson sampling, those are extremely interesting use cases in order to maximize the effectiveness of your marketing as well.
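To make the Thompson sampling idea mentioned above concrete, here is a minimal Python sketch for choosing between marketing messages. It is an illustration only, not a description of any system discussed in the interview: the function name, the simulated conversion rates, and the Beta(1, 1) prior are all assumptions made for the example.

```python
import random


def thompson_sampling(true_rates, n_rounds, seed=0):
    """Simulate Beta-Bernoulli Thompson sampling over marketing messages.

    true_rates are hypothetical conversion rates used only to simulate
    customer responses; a real campaign would observe them instead.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms  # conversions observed per message
    failures = [0] * n_arms   # non-conversions observed per message

    for _ in range(n_rounds):
        # Draw one plausible conversion rate per message from its
        # Beta posterior (Beta(1, 1) prior plus observed counts).
        draws = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(n_arms)
        ]
        # Send the message whose sampled rate is highest.
        arm = max(range(n_arms), key=lambda a: draws[a])
        # Simulate whether the recipient converts.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1

    pulls = [s + f for s, f in zip(successes, failures)]
    return successes, failures, pulls


successes, failures, pulls = thompson_sampling([0.02, 0.05, 0.30], n_rounds=5000)
print(pulls)  # most sends should go to the third message
```

Because each message is chosen in proportion to how likely it is to be the best one, traffic shifts toward the winner while the experiment is still running, which is the property that matters when a campaign window lasts only a few days.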
Fraud in eCommerce
Adel Nehme: And one of the use cases that Riskified specializes in is specifically in tackling fraud in eCommerce. Can you walk us through how fraud works and why this is such an important use case to tackle, and how does Riskified solve this problem?
Elad Cohen: Excellent. So the sad truth today is that a lot of the credit card information is out there on the dark web. Fraudsters are very sophisticated. They actually specialize just like we do. So you have hackers who are going and pulling this information. They go to the dark web, they sell it to other fraudsters who then buy that information. And they're going to go ahead and create purchases.
Elad Cohen: Now if you ask the average person when you buy online and there was fraudulent activity, who is liable for that? And you might think, well credit card company, obviously, right? Well, that's not the case. When you're making a purchase where the credit card was not present, then the liability is on the merchant. So if you go to the supermarket and you actually show your credit card, sure, if that was stolen, fraudulent, the credit card company will pay. But if you're doing this online, then the merchant is the one that's liable for that.
Elad Cohen: And this creates a really big problem. Because if you go back a few years, almost all the systems were legacy based and rules-based, and merchants were really conservative. Because when you have this fraud happening, there's a process called a chargeback. And this chargeback, whoever is the victim is calling up their credit card company. They're saying, "My information was stolen. I didn't make this transaction." And the merchant pays back the full amount. So they might have relatively small margins, but when they lose, they lose all the transaction, and whatever they already sold has already been shipped off to whoever.
Elad Cohen: So what Riskified does is we use machine learning algorithms based on our network of data across all of the retailers that we work with. And we provide a real-time decision for merchants. We'll either say they should approve, or they should decline a transaction. And we'll offer a chargeback guarantee because we have that kind of confidence in our models. So that means that if we got it wrong, we're going to pay for the entire order. And we're only going to be charging a fee if we approve. So we are highly incentivized to approve as many orders as possible. Otherwise, we don't get paid. At the end of the day, because we have really accurate machine learning models and we have a lot of people working on this problem and this very large network effect, we're able to get more accurate than what a single merchant could do on their own. Which means that we're both able to reduce their operating costs around fraud, how much they would be paying for chargebacks versus how much they would be paying if it weren't for us. But on the other hand, we're also able to enable them to approve more orders, which means that they can actually increase their revenue top line by a significant amount.
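The incentive described above (a fee only on approved orders, and a full refund of the order on a wrong approval) can be written as a small piece of arithmetic. The sketch below is a simplified illustration under stated assumptions: the fee rate, the fraud probabilities, and the function names are hypothetical, and operating costs are ignored. It is not Riskified's actual pricing or decision logic.

```python
def expected_margin(order_amount, fee_rate, p_fraud):
    """Expected profit for a guarantor from approving one order.

    fee_rate and p_fraud are hypothetical inputs: the fee charged as a
    fraction of the order, and the model's estimated fraud probability.
    """
    fee_income = fee_rate * order_amount
    # If the order turns out to be fraud, the full amount is paid back.
    expected_chargeback_cost = p_fraud * order_amount
    return fee_income - expected_chargeback_cost


def should_approve(order_amount, fee_rate, p_fraud):
    """Approve only when the expected margin is positive.

    A declined order earns nothing, so the break-even point in this
    simplified model is p_fraud == fee_rate.
    """
    return expected_margin(order_amount, fee_rate, p_fraud) > 0


print(expected_margin(100.0, 0.05, 0.02))
print(should_approve(100.0, 0.05, 0.10))
```

In this toy model every approved legitimate order earns the fee and every declined order earns nothing, which is why a more accurate fraud probability translates directly into more approvals.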
Elad Cohen: If you look at some of the legacy systems, they in many cases are going to be super conservative, to the point where specific countries could be blacklisted. If you want to buy from a country that might have notoriously high fraud, they could just blacklist the entire country, and you just can't buy from there. They're also very inflexible and very slow to adapt. So if there are changes like with COVID where some things that historically would have been risky are now safe, you've also got the same problem. We saw this in COVID where all of a sudden, you have these last minute flights. People are trying to get out of specific countries where the pandemic is starting to become a huge problem. They know that there's going to be a lockdown. And last minute flights historically, super risky. Now all of a sudden it's a legitimate pattern. So you need to be able to adapt to those kinds of things. And when you have a rules-based system, it just won't work.
Designing Fraud Detection Solutions
Adel Nehme: That's great. And I'm sure as someone who is working full-time at scaling and improving machine learning based systems for combating fraud, you've developed a plethora of best practices and tricks up your sleeve to create these state-of-the-art predictive models on fraud use cases. While I won't ask you to divulge all these tricks, can you walk us through some of the best practices that you learned when designing fraud detection solutions?
Elad Cohen: So if you're going to be building a fraud detection solution from scratch ... and I will warn you, I highly recommend not to go down that path. You know where you start, you don't know when you're going to end. There are a few important things to consider.
Elad Cohen: So one of the most important ones is your ability to link to historic orders. Let's say I'm a merchant, I'm a data scientist working in a merchant. And Adel, you're my customer. And I've seen that you've made these past purchases, and they all came back as legitimate. Okay. I've got a good history around things that you've done. I know where you ship, your billing address, maybe IP ranges that you come from and so on. So I can always link to that and understand is there enough history behind this customer? And does it make sense with a new order that's coming in?
Elad Cohen: If I see all of the information is suddenly changing in this new order, but it's your account, that's going to trigger some questions around that. So you definitely need a very, very rich history with very accurate labels so you know which orders were good and which orders were bad, and have features that are able to understand that information well and pick up on what's important with that [inaudible].
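As a rough illustration of linking a new order to a customer's history, the following Python sketch derives a few features from past orders labeled as legitimate. The field names (`shipping_address`, `ip_prefix`, `label`) and the feature names are hypothetical; a production system would use far more signals and much fuzzier matching than exact equality.

```python
def history_features(new_order, past_orders):
    """Compare a new order with the account's past legitimate orders.

    Each order is a dict with hypothetical keys: 'shipping_address',
    'ip_prefix', and (for past orders) 'label' in {'legit', 'fraud'}.
    """
    legit = [o for o in past_orders if o["label"] == "legit"]
    known_shipping = {o["shipping_address"] for o in legit}
    known_ip_prefixes = {o["ip_prefix"] for o in legit}
    return {
        # How much trusted history exists for this account.
        "n_legit_orders": len(legit),
        # Whether the new order matches details seen on trusted orders.
        "shipping_seen_before": new_order["shipping_address"] in known_shipping,
        "ip_prefix_seen_before": new_order["ip_prefix"] in known_ip_prefixes,
    }


past = [
    {"shipping_address": "1 Main St", "ip_prefix": "203.0.113", "label": "legit"},
    {"shipping_address": "1 Main St", "ip_prefix": "203.0.113", "label": "legit"},
]
new = {"shipping_address": "99 Other Rd", "ip_prefix": "198.51.100"}
print(history_features(new, past))
```

An order where the account has a long legitimate history but none of the new details match is exactly the "everything suddenly changed" pattern described above.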
Elad Cohen: Further on, when we're talking about some of these features, you need strong domain expertise around how do fraudsters actually operate. If you look at the raw data points between an order today, can you tell fraud from not fraud? And this is something that takes time.
Elad Cohen: So I'll give you an example. At Riskified, we actually have an internal fraud academy. And before someone is going to be looking at orders, they go through 140 hours of training to understand what fraudulent orders look like. It's a lot. I've passed through only some of it. And I can tell you there's a ton of information there.
Elad Cohen: Once you have that domain expertise, you can start understanding how a risk expert would actually look at a specific order, and what the thought pattern behind this is. Now, you can take that thought pattern, and you can actually turn it into very good, very accurate features. Then you're going to be able to help your model. And it's going to have a much better opportunity to recreate that same accuracy.
Elad Cohen: Next, I'd say you need a lot of data points. And it's from all over. You're going to be looking for things and information around the device that it's coming from, anything that you can get supplied. You want enrichment from additional data sources. So for example you know with the IP, you'll want information like what is the geolocation of this IP? It's not going to be super accurate, but at least you want to understand that it's the right proximity, and it makes sense, and it's not from a completely different country. You want to look at things like is there a proxy or not?
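One way to turn the IP geolocation check into a feature is to measure how far the IP's approximate location is from the shipping address. The sketch below uses the standard haversine formula; the coordinates are assumed to come from some geolocation and geocoding services that are not specified here, and the 500 km threshold and feature names are arbitrary assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def ip_distance_features(ip_geo, shipping_geo, max_plausible_km=500.0):
    """Build distance features from two (lat, lon) tuples.

    max_plausible_km is a hypothetical threshold; IP geolocation is
    coarse, so only large mismatches are flagged.
    """
    distance = haversine_km(*ip_geo, *shipping_geo)
    return {
        "ip_to_shipping_km": distance,
        "ip_far_from_shipping": distance > max_plausible_km,
    }


paris, london, new_york = (48.8566, 2.3522), (51.5074, -0.1278), (40.7128, -74.0060)
print(ip_distance_features(paris, london))
print(ip_distance_features(paris, new_york))
```

Keeping both the raw distance and the boolean flag lets a model learn its own cut-off while still giving analysts a readable signal.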
Elad Cohen: There's dozens, if not hundreds of data points that you're going to want to look at from anything that might be helpful. And there's three specific data science challenges in the domain that you're going to have to try to overcome.
Elad Cohen: So the first is partial labeling. If you look historically, any time you approved an order, you're going to get feedback. You're going to understand whether it was a chargeback and it was fraud or not. Even here I'll say that that's not even 100% accurate. Because we know that historically there can be cases where there was fraud, but someone just didn't pick up on it in their credit card information afterwards, so they didn't file a chargeback. So the absence of a chargeback does not necessarily mean the order was legitimate.
Elad Cohen: There's also the other case as well, where someone actually did make an order, but they called their credit card company and say, "It wasn't me," which we call friendly fraud. And that's another situation. So you've got that part of the problem in regards to the label.
Elad Cohen: The really big issue is what happens if you declined an order. So if you declined an order, you can't naively just assume that it was fraud. Because if you assume that it was fraud, the next time you retrain your model, what ends up happening is that you're basically going to be in a reinforced state where you're teaching it the same things from last time. So you're never able to correct based off of the model you had earlier. So you have to be very, very suspicious around the declined orders: were they really fraud, or did I get it wrong?
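The partial-labeling point can be sketched in a few lines of Python. The key choice illustrated here is that declined orders are marked as unknown rather than as fraud, so a retrained model does not simply relearn the previous model's decisions. The field names are hypothetical, and how to actually recover information about declined orders is outside what the interview describes.

```python
def training_label(order):
    """Return 1 (fraud), 0 (legitimate) or None (unknown) for one order.

    'decision' and 'chargeback' are hypothetical field names.
    """
    if order["decision"] == "declined":
        # No outcome was ever observed, so do not assume fraud.
        return None
    # Approved orders have feedback, although it is noisy: missed fraud
    # and friendly fraud both make chargebacks an imperfect label.
    return 1 if order["chargeback"] else 0


orders = [
    {"id": 1, "decision": "approved", "chargeback": False},
    {"id": 2, "decision": "approved", "chargeback": True},
    {"id": 3, "decision": "declined", "chargeback": False},
]
labeled = [(o["id"], training_label(o)) for o in orders
           if training_label(o) is not None]
print(labeled)
```

Order 3 is left out of the labeled set instead of being counted as fraud, which is the naive assumption the passage above warns against.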
Elad Cohen: The second big problem in the fraud space is that you've got a long feedback loop. Now this is all relative. If you're working in an insurance company and there's an insurance claim on, I don't know, a house mortgage 10 years later, then you probably have it a little bit worse than we do. But if you look at [inaudible], someone clicks on the button basically within a few seconds. And you know if your experiment is working instantaneously, you've got a lot of traffic and so on.
Elad Cohen: If you think about when was the last time you actually looked at your credit card statement, not everybody does it all the time. Sometimes, they don't do it every month. They might check it every two or three months or whatnot. And we see that it can take a couple of months, three months, even longer. The credit card company will actually in many cases allow you to file a chargeback even up to, I think it's nine or 12 months after the fact. So the maturation time is long. If you've got a problem in your model and you're not able to pick up on it quickly, by the time you're going to start getting back those chargebacks, it can be too late basically.
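A common way to handle a long feedback loop is to treat an approved order as legitimate only once enough time has passed for a chargeback to arrive. The following Python sketch shows that idea; the 90-day window and the function name are assumptions for illustration, not figures from the interview, where the quoted range runs from a couple of months up to nine or 12 months.

```python
from datetime import date, timedelta


def mature_label(order_date, chargeback_received, today, maturation_days=90):
    """Label an approved order, respecting the chargeback maturation time.

    maturation_days is a hypothetical window; a longer window gives
    cleaner labels but older training data.
    """
    if chargeback_received:
        return 1  # a chargeback is known as soon as it arrives
    if today - order_date >= timedelta(days=maturation_days):
        return 0  # enough time has passed without a chargeback
    return None   # too recent: the outcome is not yet known


today = date(2021, 6, 1)
print(mature_label(date(2021, 1, 1), False, today))
print(mature_label(date(2021, 5, 20), False, today))
print(mature_label(date(2021, 5, 20), True, today))
```

Recent orders without a chargeback stay unlabeled instead of being counted as legitimate, which avoids understating fraud in the newest data.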
Elad Cohen: The third problem, which is maybe a little more standard. And I think a lot of data scientists are going to be able to relate to this is the imbalance you have here. So obviously, luckily, fraud is not as common as legitimate orders. It's much more rare. So you're going to have to work around that and be able to either upsample, or downsample, or think about additional tricks into how you're going to be able to do this so that your training set is as effective as possible.
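As a minimal illustration of the downsampling option mentioned above, the Python sketch below keeps every fraud order and a random fraction of legitimate ones, attaching a weight so that the original class balance can still be accounted for during training. The field names and the keep fraction are hypothetical, and this is only one of several ways to handle imbalance.

```python
import random


def downsample_legit(orders, keep_fraction, seed=0):
    """Keep all fraud orders and a random fraction of legitimate ones.

    Each kept legitimate order gets weight 1 / keep_fraction so that
    weighted counts still approximate the original distribution.
    """
    rng = random.Random(seed)
    sampled = []
    for order in orders:
        if order["is_fraud"]:
            sampled.append({**order, "weight": 1.0})
        elif rng.random() < keep_fraction:
            sampled.append({**order, "weight": 1.0 / keep_fraction})
    return sampled


# Hypothetical data set: 1,000 orders, 1% of them fraud.
orders = [{"id": i, "is_fraud": i % 100 == 0} for i in range(1000)]
train = downsample_legit(orders, keep_fraction=0.1)
n_fraud = sum(o["is_fraud"] for o in train)
print(len(train), n_fraud)
```

The weights matter if the model's output is later interpreted as a probability, since training on an unweighted downsampled set would inflate the predicted fraud rate.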
Elad Cohen: And just going back to what I opened up with, if you're thinking about doing this, it's a Pandora's box. This is something that you can probably do an okay job if you're working on it for a few weeks or a few months. But if you want to get really, really good, and we're talking about scales of merchants that are selling in the billions of dollars or more every year where every 10 basis points of approvals is going to be significant in their top line, you want to do a really, really good job. And that is going to take really, not even man years, it's man decades or more.
Adel Nehme: This is fascinating, specifically when you talk about the machine learning challenges in the fraud space. In a similar vein, what do you think are the main challenges when designing and deploying these AI systems at scale, and how do you go about alleviating them?
Elad Cohen: So I think there's some challenges that you're going to have to overcome whenever you're working at a large scale, and you've got a big organization of a lot of people who are trying to tackle a similar problem. You need to start standardizing everything. So that means that all of the tests are going to be standard. Anytime you're doing a comparison, you're doing a champion challenger between whatever you have right now working in production versus your alternative model. Everyone has to be using the exact same tests. It has to be agreed upon. And then you can actually compare things and know pretty well what's going on.
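The standardization point can be illustrated with a small Python sketch in which the champion and the challenger are scored by exactly the same evaluation function, on the same holdout set, at the same threshold. The metrics, the threshold, and the toy scores are assumptions chosen for the example; the interview does not specify which tests are actually used.

```python
def evaluate(scores, is_fraud, threshold):
    """Shared evaluation: approve orders whose fraud score is below threshold.

    Returns the approval rate and the chargeback rate among approvals.
    """
    approved = [fraud for s, fraud in zip(scores, is_fraud) if s < threshold]
    approval_rate = len(approved) / len(scores)
    chargeback_rate = sum(approved) / len(approved) if approved else 0.0
    return {"approval_rate": approval_rate, "chargeback_rate": chargeback_rate}


# Hypothetical holdout labels (1 = fraud) and model scores.
holdout_labels = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
champion_scores = [0.1, 0.2, 0.6, 0.3, 0.9, 0.2, 0.7, 0.4, 0.1, 0.3]
challenger_scores = [0.1, 0.2, 0.3, 0.3, 0.9, 0.2, 0.4, 0.8, 0.1, 0.3]
THRESHOLD = 0.5

champion = evaluate(champion_scores, holdout_labels, THRESHOLD)
challenger = evaluate(challenger_scores, holdout_labels, THRESHOLD)
print("champion:", champion)
print("challenger:", challenger)
```

Because both models pass through one agreed function, any difference in the reported numbers comes from the models themselves rather than from how each data scientist chose to measure them.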
Elad Cohen: You've got to decouple the work between different data scientists so they can work independently. If you think about ... I love to do Kaggle projects. You take on a competition, you've got your dataset, and you can do anything you want. But if you took a Kaggle competition and you let 20 people work on it, it's not going to be really effective if you let each of the 20 people just say, "Have at it and let's see what happens." Okay. So one of them might be doing a little bit better than somebody else, but the work of the other 19 is basically wasted.
Elad Cohen: So you've got to think about ways where you're going to be able to split the work effectively. Sometimes we do this by domain. We can say for specific teams, "Okay, you're going to take our gradient boosting algorithm. And you're going to try to optimize the algorithm." In other cases, we're going to say okay, these teams are going to be focused on feature engineering for specific families of features. And if the families of features are orthogonal enough, then they're going to be able to make their own independent contributions and incremental contributions without stepping on someone else's toes.
Elad Cohen: Now if you go basically back to the theory, someone might say, "Wait, you can't really do anything that's going to be suboptimal in specific places because everything is related. And you change the algorithm, maybe you have to optimize and change your feature engineering and your sampling methodology." It's true. It's true, but this is the constraint we live in. So at the end of the day, you have to try to find how to do that in some manner. So we try to create our teams in ways that they each have areas of responsibility and authority, and they can work there as autonomously as they want. They can create their own roadmaps. But they know that these are the guard rails. This is the area that they're going to be trying to optimize.
Elad Cohen: And then the last challenge, which I think a lot of people are also going to be able to connect to is this is a mission critical real time system. So think about it as a merchant. You've gone through so much pain to send out your marketing campaigns, to get these shoppers on. They've gone through the cart. They've clicked everything. They're clicking on the checkout button. And now that data is going to Riskified. There's no way you want it not to work. Okay? It has to be up all the time, no matter what. This is a super robust system. And it has to work in very, very fast timescales as well. So that means we got to collaborate very, very closely with engineering.
Elad Cohen: In some cases, that means we might have a really cool model that performs very, very well, but it's going to be hard from a technical perspective to put it into production today. Sometimes, we can definitely work with product and with our engineering, and we'll see that the value is big enough that we are going to have to change part of our tech stack or find different ways in which we can implement it. In other cases, you might show value. But the value is just not going to necessarily be enough in order to want to implement that. And that is something that people need to comprehend. It's not just that I've been able to improve the model. It's that the improvement is big enough to justify additional tech debt or the incremental investment that we're going to be doing on the tech side.
Do you think data scientists should learn certain engineering skills?
Adel Nehme: I want to zero in on one challenge you talked about here. So you mentioned the need to work really closely with engineering, especially in order to operationalize and deploy some of these data science models into mission critical systems. There's also a lot of different schools of thought in data science today, where one argues data scientists should be specialists and work purely on data related problems, and others arguing that data scientists should learn engineering skills and should learn how to deploy and maintain models and production as well. Do you think data scientists should learn these engineering skills? And do you think that that will make the difference between a successful and not a successful data scientist in the future?
Elad Cohen: So that's such a great question. I think that it really depends on the company you're working with and the ecosystem that you have. So it's going to be really hard for me to say there's a one-size-fits-all. If you work in a culture where the engineering has created magnificent infrastructure for data scientists to focus on data science, and they can go ahead and implement things in production, and they might not necessarily need very, very heavy engineering skills, that's one way to go at it. And that can be great.
Elad Cohen: I personally believe, and this probably also has a lot to do with where we are at Riskified today, that data scientists do need some engineering skills. We have to be able to talk in the same language. We have to be able to work with similar tools. So we don't want to get to a situation where we're basically saying, "Okay, this is an object that represents my model. Go ahead and take it from here." Or I don't want to get to a point where I am engineering a feature, and I write pseudocode, "And here's the logic behind it. Go ahead and implement it some way," and just be throwing it over the wall.
Elad Cohen: So we do want to have some engineering skills. What I am more troubled and concerned about is scenarios where engineering is a practice on its own and data science is a practice on its own. And if companies are at the point where ... and it's a slippery slope, you can very easily get there. You do some of the research, and now you're working on implementing it in production, and now you maintain it, and you own it. And the maintenance is starting to take up so much of your time. You can get to a point where the majority of the data scientist's time is only around how do I implement what I do in production? And at the end of the day, data scientists should be doing research. They should be looking at the data. They should be improving the models. That is what is really their specialty and what's different from an ML engineer or data engineer.
Elad Cohen: So in my mind, they should have these skills. They should be able to do it. But I think it needs to be a situation where there are other people who can take on more of the burden, then they're able to create that separation more effectively.
Elad Cohen: Now again, that's very much specific to us. If we were in a much, much smaller situation with maybe one or two data scientists working in a smaller startup, then everybody's going to have to do everything. And yeah, you're definitely going to need engineering skills. If you ask me long-term for people listening in right now, if you don't have engineering skills, it's really going to limit where you're going to be able to work. So you need to have the fundamentals. But on the other hand, if you have those fundamentals, then there's always areas within data science where you can continue to improve your craft.
Adel Nehme: Now speaking of challenges, you know data science and machine learning are the heart of Riskified's business model. And I'm sure with that comes many challenges. So I'd like to segue more into the challenges you face when leading data projects and data teams. And I'm always excited to discuss these since they are use case or industry agnostic challenges, and any data team can learn from your challenges and your best practices. One thing that I've seen you write about is the common pitfalls data teams face with any data science project. Do you mind walking us through these pitfalls and how you go about solving them?
Elad Cohen: Sure. So I've worked in a few different places in the past, and I have to say actually at Riskified, we're lucky enough where the main product is machine learning models. That's really the core of what we're doing. So we're in a relatively good situation in regards to these pitfalls.
Elad Cohen: But some of the things that I've seen, first is bad data. If you are starting up a data science practice right now, or you're going to work on data science in a field that may not have been using machine learning before, the quality of data can be horrendous. And in many cases, there might be an executive stakeholder who's going to try to bring in data scientists to work on this. You're going to spend so much time just cleaning up the data, that you're not really going to be able to show business value for a long time.
Elad Cohen: If your stakeholders understand that, that can be okay. In other cases, they may expect instantaneous magic and, "Why are you still cleaning this data after six months?" That can definitely create a lot of problems.
Elad Cohen: So if you're going to be starting, I would highly recommend that someone who's experienced in the field actually start digging into the data and doing exploration and say, "Is there something to work with right now?" Or alternatively, you can create a backlog of, "These are the things that we need to do in order to clean the data." And only then should you for example, hire someone full-time or bring someone into this project. But we've got a backlog before we can do that.
Elad Cohen: Another of the pitfalls I've seen is lack of buy-in and trust. And again, this can happen many times where you have someone at the top. Someone who says, "Okay, I want to bring in data science here. I want to change it." And then you might have a middle manager or someone you're working with who just basically received this from the top. They don't necessarily have the buy-in.
Elad Cohen: In many cases, when they work with data scientists, they can see this as a threat. And I can completely relate to that. So if you're thinking about someone who's a domain expert in some field, let's say you're working in operations, or fulfillment, or marketing. And you're an analyst, and maybe you're doing forecasting models in Excel and you're creating whatever. It's not necessarily working well enough at its scale. And all of a sudden you have this data scientist come in and start asking you questions about how you go about your business and the data you use.
Elad Cohen: In many cases, the immediate response can be, "We don't have a problem," or, "Why are we even working on this?" Or creating criteria that are ridiculously hard for any data scientist to meet. And they want you to come up with a solution that is way better than anything that they can do right now. And they're the domain experts and they're doing this manually. So they'll usually have an advantage, at least on any MVP model you can create.
Elad Cohen: Here you have to work really, really closely with those stakeholders. You have to make them understand that you're not coming here to replace them. And usually, that's the case. I mean, in many cases, what data scientists are going to do is enable other people to do better at their job. Maybe you're going to take some of the more manual work and make that better. And now they can use those models to do something better.
Elad Cohen: There's other cases where there's stakeholders who are engaged, they're interested, but they might not necessarily know how they're going to be using machine learning. So it resembles a little bit of what I said in the first one where they'll bring somebody in. And basically, they'll just want them to bring the magic as soon as possible.
Elad Cohen: And here again, you really need to work with the stakeholders to formalize what's the problem that they're interested in? What is going to give them the business value they want? In a lot of cases I've been in, the stakeholders may just want you to start exploring. Start looking at the data, try to create something. And it can be problematic. You start looking in that data, you'll start bringing some kind of nuggets, some insights. You might have something that's going to be helpful. And it just goes on and on. There's an education process where you have to work with your stakeholders and tell them, "Guys, let's work on these KPIs. Let's understand what is going to be important for you. What do you want to see from the model?" So I understand you're engaged, but you need to know how to utilize our time.
Elad Cohen: And finally, the fourth pitfall I've seen is that stakeholders change their minds. So business is very dynamic. You might be working with one stakeholder today. They may change. It's going to be challenging to make sure that whenever you're working with the stakeholders, they understand the value of what you're doing. They understand the research process as well. The buy-in is longer term, and they're committed to this process. It's not a situation where, especially if you're coming into something new, a new field, new data types and whatnot, you're going to start giving immediate value. It'll take time.
Elad Cohen: And I think if we try to look at all these different pitfalls, one of the most important things I would tell any data scientist anywhere is there is no such thing as overcommunication. If you think that you are meeting with your stakeholders, product, whoever it may be, too much, basically I would guarantee that if you talk to them and ask them questions that you probably think that for sure they know, not so sure. Talk to your product team. Ask whether they really understand some of the fundamentals behind the KPIs or the research that you're doing, or what kind of risk this entails and whether it's likely to succeed or not, how you see things. There's a lot of these fundamental questions here. And time and time again, I see cases where data scientists naturally like working with the data. They may not see the value behind keeping the stakeholders up-to-date all the time and discussing things, especially when they don't see these big milestones. But you can't wait for the big milestone. You have to continue having those conversations.
Adel Nehme: That's great. And I want to zero in on the human side of this. So you mentioned one of the common pitfalls being a lack of stakeholder buy-in or lack of trust. And I'd like to zero in on that. There's a lot more discussion nowadays around data storytelling, the importance of presenting insight and how to maximize value when bridging the gap between business stakeholders, and data scientists, and analysts. Do you have any best practices you can share there about how to maximize the value there and to bridge that gap?
Elad Cohen: Yeah. So there's a few things that you need to take into consideration. I'd say one of the most important ones is to try to have a good, honest reading of the level of engagement that you have with your stakeholders. So you want to look for things like are they asking you questions? Are they actually interested in what you're doing? Are they really interested in this, or are they coming up to sync meetings just because their boss is there, and they have to be there?
Elad Cohen: I've seen situations where, like one of the pitfalls we mentioned earlier, the senior management decided to go on with a project. Some of the people who were actually the domain experts may have felt threatened. They may not have felt that it's as necessary as possible. They arrive to the meetings, they don't necessarily ask many questions. They are, you could say, a bit passive aggressive. And the data scientist is almost trying to prove to them that the model is better than what they're doing. But at the end, it's not necessarily a rational conversation. People are really good at deflecting, even when they see proof. And it can be hard. It's a cultural issue. So you want to be sure that they're engaged as soon as possible, and work with them.
Elad Cohen: Another really important thing is to ask a lot of what ifs. So before you go ahead, and create the model, and present the final results, way, way, way back before, even before you started working extensively into this. You should start asking them, "Let's say, imagine hypothetically, tomorrow morning I had the model ready, and it reached 90% precision, 50% recall, whatever metrics you want. It worked within 100 millisecond latency. Would you use it? How would you use it? Is it good enough?"
Elad Cohen: And then you start seeing the thought process. If they start asking lots of additional questions or they start giving you more excuses around this. The worst thing you want to hear is, "Let's see. Create the model and then we'll talk about it." You want to avoid that at all costs. You basically want to try to get to a commitment saying, "These are the KPIs. This is what I need to try to reach. But if I can do this and I work within all of the constraints that you've given me for the problem, you're actually going to be using it. Because if you're not committed to using this, then really, I need to question whether I want to put in the work to try to research it."
Elad Cohen: And keep doing this. It's not just the one-off. As you go along, you and your stakeholder may understand that okay, there's more constraints than you originally understood. Maybe it's not just the latency. Maybe it's important what's the memory footprint of the model, what technology it's going to be running on, or whatnot. So keep asking these questions, and you'll get a much better understanding of what you need to do in order to actually make this be valuable for the business.
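The "what if" commitment described above can be written down as an explicit checklist. The following is only a minimal sketch: the `Contract` class, the function names, and all numbers are illustrative assumptions rather than anything Riskified is known to use. It computes precision and recall from confusion counts and reports which agreed terms a candidate model still misses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    """Targets agreed with the stakeholder before research starts (hypothetical)."""
    min_precision: float
    min_recall: float
    max_latency_ms: float


def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def unmet_terms(contract: Contract, precision: float, recall: float,
                latency_ms: float) -> list[str]:
    """Return the names of the agreed terms the model does not yet meet."""
    unmet = []
    if precision < contract.min_precision:
        unmet.append("precision")
    if recall < contract.min_recall:
        unmet.append("recall")
    if latency_ms > contract.max_latency_ms:
        unmet.append("latency")
    return unmet


# Hypothetical targets echoing the example in the conversation:
# 90% precision, 50% recall, within 100 milliseconds.
contract = Contract(min_precision=0.90, min_recall=0.50, max_latency_ms=100.0)

# Made-up confusion counts from an imagined validation set.
precision, recall = precision_recall(tp=90, fp=10, fn=90)
print(unmet_terms(contract, precision, recall, latency_ms=80.0))
```

An empty list means every agreed term is met. New constraints that surface later, such as memory footprint, would become additional fields on the same contract.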
Elad Cohen: And finally, you want to make the stakeholder and the domain expert part of this process. So we talked earlier about the fact that you have to update them and keep them in the loop. You want to ask them for their feedback as well. When you're working with a domain expert who's not a data scientist, they have a large body of knowledge. You don't want to get to a scenario where you're setting up a one-hour meeting. You're trying to debrief them on everything they know. And then basically say, "Okay, thank you very much. From here, this is now a machine learning problem, and I'm going to run ahead." You want them to feel part of the process, and be involved, and get their feedback. They're going to be super useful every time you see quirks in the data. And you're probably going to be seeing all these anomalies and little problems in the data. And you want to ask them, is this real? Is this an outlier? Should we be removing this?
Elad Cohen: The more they're bought in, and the more they feel that they were part of this process, and it's their model too. They were really contributing. Then they're going to be wanting to use this model, and it's not going to feel like a competition.
Adel Nehme: This is wonderful. And you mentioned here in your response around the element of culture as well as relaying back to the other human element of this, which is stakeholders who are engaged, but don't necessarily know how to leverage these machine learning models in their day-to-day tasks. So how important do you view the role of organization-wide data skills and data literacy when creating that buy-in, and that culture, and where business stakeholders feel like they own the data solutions that are built for them?
Elad Cohen: So data literacy is super critical. It is extremely challenging. And personally, it can be very, very frustrating when you're trying to have a conversation with stakeholders who aren't data literate. In many companies, that is the situation, and you have to understand it. And it's very important that you do have a good understanding of which stakeholders are more data literate, and which are not. And I would go so far as to say that if you're a data scientist and you're going to be working with stakeholders who aren't data literate, the first thing you should do, and you should for your own success, take this as your personal responsibility, is to educate your stakeholders. Make them data literate. It doesn't have to be extreme, but they really need to understand what is the meaning behind the KPIs? How is this going to work? Where does it make more sense? And in some cases, they need to understand the algorithm to an extent where how explainable is it going to be? Is this going to be a black box, or are you going to be able to actually have knobs where you can tweak it on your own or change it?
Elad Cohen: Once you have that data literacy, like I mentioned before, you want to have basically an informal contract. If you're able to reach X precision, Y recall, whatever latency, or your [inaudible] is going to be underneath some threshold. Now, the stakeholder is going to go ahead and integrate this. And it can't be vague. It has to be very, very specific.
Elad Cohen: You also should work with the stakeholder to understand what's the baseline right now. So I've seen this as well where in many cases, you're coming in and you're trying to develop a solution that might've been done manually before. And the stakeholder might set this very, very extravagant threshold. And you ask them, "Okay, so how are we doing right now?" And they actually have no idea. "So is this 100% improvement, or a 10X improvement, or 100X improvement? I don't know what this threshold really is." So it's also very, very important to try to work with them and understand what's the baseline that I'm competing against, so I understand that this actually makes sense.
How does Riskified choose to prioritize its research goals?
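The baseline question above comes down to simple arithmetic, sketched here under stated assumptions: `improvement_factor` is a hypothetical helper, and the figures are invented for illustration. It refuses to evaluate a target until the current baseline has actually been measured.

```python
from typing import Optional


def improvement_factor(baseline: Optional[float], target: float) -> float:
    """How many times better than today's measured baseline the target is."""
    if baseline is None:
        raise ValueError("Measure the current baseline before agreeing on a target.")
    if baseline <= 0:
        raise ValueError("Baseline must be positive to compute a ratio.")
    return target / baseline


# Made-up example: the manual process catches 40 of every 100 cases,
# and the stakeholder asks for 50 of every 100.
factor = improvement_factor(baseline=40.0, target=50.0)
print(f"Target is {factor:.2f}x the baseline ({(factor - 1) * 100:.0f}% improvement)")
```

Expressing the threshold as a multiple of the baseline makes it easier to tell a modest improvement from a 10X or 100X request.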
Adel Nehme: Another important aspect of leading data science at Riskified that I've seen you write about and speak about is balancing between longer term research goals and short-term product goals. And this is something that I'm pretty sure many data science teams also have to balance. Can you walk us through that balancing act and how Riskified chooses to prioritize its research goals?
Elad Cohen: So this is one of those cases where it's going to depend on the scale of the organization, the organization you're working with. So at Riskified, we work pretty big. We've got about 40 data scientists right now, working many of them on the chargeback guarantee and a lot of other different products. And when you get to that kind of scale, you start thinking about this more statistically. It's almost like a portfolio. You can imagine, "Okay, I'm going to have 20 different research projects. How do I want to allocate them?" Very, very different than if you've got one or two, and it really is going to depend if you are successful in this one. You don't want to take a lot of long shots that aren't going to pan out.
Elad Cohen: Here, you want to have a trade-off between some of the long-term bigger bets. Typically, those might have the potential for a huge payout. So your expected value might be higher, but you've also got a very, very large standard deviation.
Elad Cohen: If you can run a few of them, your expected value as I mentioned earlier may be higher. So it makes sense. But you don't want to overdo it. You definitely don't want a situation where you're trying out a few of these, maybe none of them work out, and all of a sudden you didn't bring enough value for a few months. You also got lots of shorter term projects that the business is more interested in, and you always want to be delivering value consistently.
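The portfolio framing above can be made concrete with a small calculation. This is a rough sketch under simplifying assumptions that are not from the conversation: each project either pays off fully or not at all, projects are independent, and every probability and payoff below is invented.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    name: str
    p_success: float  # estimated probability the project pays off
    payoff: float     # value delivered if it does (arbitrary units)


def expected_value(projects: list[Project]) -> float:
    """Sum of p * payoff over the portfolio."""
    return sum(p.p_success * p.payoff for p in projects)


def std_dev(projects: list[Project]) -> float:
    """Standard deviation of total payoff, assuming independent projects."""
    variance = sum(p.p_success * (1 - p.p_success) * p.payoff ** 2 for p in projects)
    return math.sqrt(variance)


def p_all_fail(projects: list[Project]) -> float:
    """Probability that none of the projects pays off."""
    return math.prod(1 - p.p_success for p in projects)


# Invented numbers: three long shots versus five short-term improvements.
moonshots = [Project(f"moonshot_{i}", p_success=0.2, payoff=10.0) for i in range(3)]
short_term = [Project(f"short_{i}", p_success=0.9, payoff=1.0) for i in range(5)]

for label, group in [("moonshots", moonshots), ("short-term", short_term)]:
    print(f"{label}: EV={expected_value(group):.2f}, "
          f"std={std_dev(group):.2f}, P(all fail)={p_all_fail(group):.3f}")
```

With these made-up inputs the long shots have the higher expected value but also a much larger spread and a real chance of delivering nothing, which is the trade-off being described.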
Elad Cohen: So we definitely want to continue working on shorter term stuff. And there's always going to be more lower hanging fruit. You also don't want low-hanging fruit as a strategy. If you continue doing that, then over time, you're basically going to be drying up the well. You're not going to have the next big thing that you can build upon. And once you typically do that, then you can continue optimizing it more and more. And it's going to be generating more lower hanging fruit.
Elad Cohen: And we try to think about what is going to be the portfolio that will bring us to the best place maybe six, 12 months down the road. So that means that we're mostly going to be doing some short-term projects, or a lot of short-term projects that are going to be improving the current features that we have in the models. And we carefully choose a handful of moonshots that could take several months, hoping that one or two of them are going to be successful.
Elad Cohen: And this is a very delicate balancing act that you always want to do closely with your stakeholders. And we work very closely with our product. So they understand how are things looking. They give us a lot of input, and insight, and try to get to something that we feel is going to give us the best possible outcome.
Elad Cohen: I will say that try as you might, when you're trying to decide on a research project before you've initiated it, there aren't a lot of really good ways in order to understand it. The best two heuristics that I can come up with, one is the Delphi method, which is basically where you try to take experts who have good intuition. This could be your senior data scientists, maybe some of the team leads or anyone who's already done this before. And you want to ask each of them, "How valuable do you think this direction could be?" You let them do it independently. Okay. You compare the results. You talk out the differences, if the differences are very, very big. And you have another couple of rounds like that. And at the end, you're going to have a more or less group consensus using all those experts to try and understand, "Okay, the way we understand this, this is going to be hugely impactful or not so much." You can do that process and try to get a better estimation using more people. Again, built on intuition. But this intuition is based off of a lot of expertise and experience that they've built over time.
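The bookkeeping side of the Delphi rounds described above might look like the sketch below. The discussion between rounds is a human step that code cannot replace; the 1-to-10 scoring scale, the tolerance, and the expert names are assumptions made for illustration.

```python
from statistics import median


def summarize_round(estimates: dict[str, float]) -> tuple[float, float]:
    """Return (median, spread) for one round of independent estimates."""
    values = list(estimates.values())
    return median(values), max(values) - min(values)


def needs_another_round(estimates: dict[str, float], tolerance: float) -> bool:
    """Another round is needed while the experts still disagree widely."""
    _, spread = summarize_round(estimates)
    return spread > tolerance


# Invented scores (1 = low value, 10 = high value) for one research direction.
# Each expert scores independently; differences are discussed between rounds.
rounds = [
    {"expert_a": 3, "expert_b": 8, "expert_c": 5},
    {"expert_a": 5, "expert_b": 6, "expert_c": 5},
]

for i, estimates in enumerate(rounds, start=1):
    mid, spread = summarize_round(estimates)
    print(f"round {i}: median={mid}, spread={spread}")
    if not needs_another_round(estimates, tolerance=2):
        break
```

The median is used here because it is less sensitive to a single extreme estimate than the mean; that choice is a design assumption, not part of the method as described.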
Elad Cohen: And then the second thing that you can do instead of just using expertise is looking at historic returns from specific areas. So I mentioned earlier we can break down and we decouple where we're doing different research projects into families of features or different areas within the algorithm and how we're tackling it. If I see that over the last year or so, and obviously not one project, but I look at a lot of projects that we have in one domain, and I see that they gave fantastic returns relative to what I did in other domains. It's definitely not going to be a guaranteed win for the next project, but there's a higher likelihood that there's still more potential to be utilized there. So it will help us and say okay, probably a little more promising, and we're going to be putting more resources there. But definitely a very challenging area to try to work around, and prioritizing research goals is one of the more challenging things that we face.
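The historic-returns heuristic above amounts to grouping past projects by area and comparing average outcomes. The sketch below assumes each finished project has been given a single numeric "gain"; the area names and figures are placeholders, not real Riskified data.

```python
from collections import defaultdict
from statistics import mean


def average_return_by_area(projects: list[tuple[str, float]]) -> dict[str, float]:
    """Average measured gain per research area across past projects."""
    by_area: dict[str, list[float]] = defaultdict(list)
    for area, gain in projects:
        by_area[area].append(gain)
    return {area: mean(gains) for area, gains in by_area.items()}


def rank_areas(projects: list[tuple[str, float]]) -> list[str]:
    """Areas ordered from highest to lowest average historic return."""
    averages = average_return_by_area(projects)
    return sorted(averages, key=averages.get, reverse=True)


# Placeholder history: (area, measured gain in arbitrary units).
past_projects = [
    ("area_a", 2.0),
    ("area_b", 1.0),
    ("area_a", 4.0),
    ("area_b", 0.5),
    ("area_c", 1.5),
]

print(average_return_by_area(past_projects))
print(rank_areas(past_projects))
```

A high rank only suggests where more potential may remain. As noted in the conversation, it is not a guaranteed win for the next project, and an area with a single past project says very little.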
Call to Action
Adel Nehme: Thank you so much for this insight, Elad. Finally Elad, before we let you go, do you have any call to action for listeners before we wrap up?
Elad Cohen: So if you're listening to this, you're familiar with DataCamp. And I've got to say, you really need to never stop learning. AutoML is real. It may not be as big as you think, but there's a famous quote where people tend to over predict what will be coming in the next year. And they under predict how things are going to change 10 years down the road. So AutoML may not change everything in the next year or two. But five, 10 years from now, a lot of the problems that data scientists work on are likely going to be automated. And you have to continue to improve your skillset and become an expert in what you do so that you're always going to be ahead of the curve, so that the value that you can bring is more than fit predict.
Elad Cohen: Secondly, I'd say focus on bringing the most value to the business. At the end of the day, that's what you're getting paid for. If you aren't able to bring value, it happens sometimes. Maybe you were on a project that didn't work out. It's really important to decouple good research from the outcome. Sometimes, great research didn't work. But you still need to try to focus on that, because that is what people will be looking at over time to see what kind of contributions are you able to bring.
Elad Cohen: So while working on the latest technologies and the latest generation of state-of-the-art algorithms can be really exciting and can teach you a lot, always think about the value.
Elad Cohen: And then I'd say the last one is if you can, especially if you're in a management position and this is possible, you have to choose your projects well and wisely. Data scientists that I'm familiar with, especially the really good ones, they will just love to dive into a problem. And they're going to start exploring the data. And they'll create these models and continue to improve them. And it becomes addictive. Can I get better features? Can I change the model and the hyperparameters, and improve it over time?
Elad Cohen: Like I mentioned earlier, if you're going to go into a new domain and it's a really big problem, you should carefully consider whether you want to buy instead of trying to build something in-house. If you're going to be creating something where maybe you can get a good enough solution, that's fine. If the bar that's being set is very, very high, there's situations where you have to make the sometimes more difficult consideration and say, "Okay, this is super interesting. I love this project. I think I could do a good job here, but I'm going to think about my time as a really precious resource." Because it is, there's not enough data scientists. "And I'm going to go ahead and look for a buy solution."
Elad Cohen: I can tell you for ourselves, we have a lot of data scientists, but there are data science problems where we're also looking outwards. Things like monitoring some of our models. We look for areas where these are components that aren't specific to the domain. And if we can get an outside solution that's going to save a lot of our time, then we're going to do that. So you have to pick your projects well when you're able to do that. And think of what are the core competencies within your company where you're going to be able to bring the best possible improvement, and not try to solve every possible problem that could be solved with data science. Because there's just so many of them.
Adel Nehme: Indeed. Thank you so much for the time today, Elad. I really appreciate your insights.
Elad Cohen: Thanks Adel. Thank you very much, and thank you for all the listeners who joined us today.
Adel Nehme: That's it for today's episode of DataFramed. Thanks for being with us. I really enjoyed Elad's insights on the state of data science in eCommerce and his best practices leading data teams. If you enjoyed this podcast, make sure to leave a review on iTunes. Our next episode will be with Sudaman Thoppan Mohanchandralal, regional chief data analytics officer at Allianz Benelux on the importance of building data cultures. I hope it'll be useful for you, and I hope to catch you next time on DataFramed.