The Past and Present of Data Science
Sergey is the Vice President of Data Science and Modeling at Viacom. He began his career as an academic at Dartmouth College, where he researched the neural bases of visual category learning. Since leaving academia, Sergey has worked as a data scientist in digital advertising, cybersecurity, finance, and media. He is heavily involved in the NYC-area teaching community and has taught courses at various bootcamps, and has been a volunteer teacher in computer science through TEALSK12. When Sergey is not working or teaching, he is probably hiking.
Adel is a Data Science educator, speaker, and Evangelist at DataCamp where he has released various courses and live training on data analysis, machine learning, and data engineering. He is passionate about spreading data skills and data literacy throughout organizations and the intersection of technology and society. He has an MSc in Data Science and Business Analytics. In his free time, you can find him hanging out with his cat Louis.
Transcript
Adel Nehme: Hello, this is Adel Nehme from DataCamp, and welcome to DataFramed, a podcast covering all things data and its impact on organizations across the world.
One thing we're looking forward to covering in more detail on the podcast is not only the latest insights on how data science is impacting organizations today but how the field has evolved and is evolving towards democratizing data science for all. This is why I'm excited to have Sergey Fogelson on for today's episode.
Adel Nehme: Sergey began his career as an academic at Dartmouth College in Hanover, New Hampshire, where he researched the neural bases of visual category learning and obtained his PhD in Cognitive Neuroscience. After leaving academia, Sergey got into the rapidly growing startup scene in the New York City metro area, where he has worked as a data scientist in digital advertising, cybersecurity, finance, and media. Currently, he's the vice president of data science and modeling at Viacom CBS, where he leads a team of data scientists and analysts that work on a variety of awesome use cases.
Adel Nehme: In this episode, Sergey and I discuss his background, how data science has evolved since he got into the field, the major challenges he thinks data teams and professionals face today, his best practices for gaining buy in from business executives on data projects, his best practices when democratizing data science in the organization, and more.
Adel Nehme: If you want to check out previous episodes of the podcast and show notes, make sure to go to...
Adel Nehme: Sergey, I'm really excited to have you on the show. I've been excited to have this chat on the state of data science, your experiences leading data teams, and democratizing data science. But beforehand, can you please give our listeners a background on how you got into data science?
Sergey Fogelson: Sure, would love to. Thank you for having me, Adel, I'm really excited also to speak with you about all of this stuff. So my academic background is in AI and cognitive neuroscience. I got my graduate degree in cognitive neuroscience applying ML algorithms to functional neuroimaging data. So basically what this means is put people into large scanners, record their brain activity, and then try to decode what's actually happening in their brains using machine learning algorithms.
Sergey Fogelson: And what I knew was, probably about halfway through my PhD, I knew I didn't really want to stay in academia and I knew I wanted to work on interesting data related or data intensive problems. And when I was at that point in my PhD, so this is around 2010 through 2011, I heard about this thing that people were talking about called big data. They didn't really have a term for data science at this moment in time, and so I just knew that there was this field where you could use, still in its infancy, but you could use the same kinds of algorithms that I was using for neuroimaging work, but applied to real world data sets. So data sets in advertising, in finance, in quantitative analysis, all over the place. And so basically, I started looking into this stuff, started reading about it and in my last year I really made a hard push to try to get into the industry and I wound up being able to land a job in the world outside of academia and haven't really looked back since.
Leading Data Science at Viacom
Adel Nehme: So what were some of the earlier data science projects that you worked on and how has that shaped your path leading data science at Viacom?
Sergey Fogelson: I would like to think that I've had pretty varied experiences, but maybe not. I think they're reasonably eclectic. So I started, the very beginning of my career, I worked for a digital advertising startup and there, of the two big problems I worked on, one was a classification problem. And I still think it's a pretty relevant problem. I don't think this problem has really been solved yet. And it's the idea of taking IP addresses and trying to understand what kind of a place that IP address represents. So for example, is this an IP address associated with a home? Is this IP address associated with an airport or a Starbucks, or some other business? Is it an educational IP address, et cetera? So there is some metadata associated with that information, but it's not 100% accurate.
Sergey Fogelson: So what you can do is you can take signals that are coming out of that IP address to make probabilistic inferences about whether you think it's a home or not. And that was really important for the work that we were doing because the way that we were building the main product that this company was selling, it's called a device graph, basically it tells you whether any two devices belong to the same household or not. Being able to do that and being able to build those links across devices was really critical to understand whether something is a home or not, whether that device is living or is being seen within a home-based environment or not. So that was one of the first projects I had.
Sergey Fogelson: I also worked a little bit on this graph building problem that I just quickly mentioned earlier. So the idea here is again you're trying to figure out whether two distinct devices, two phones, an iPhone and a tablet, or an Android phone and a smart TV for example, whether they belong to the same person or to the same household or not. And again, this relies on some network analysis techniques and then on really thinking about how to be able to do this at very, very large scale. So back then, again this was like 2014, there really weren't a lot of large scale data analysis frameworks. Spark was basically at 0.1 or something. It was a completely new project. So it was just very difficult and interesting to tackle these kinds of problems that involved working with data at scale. So that was my first foray into data science.
Sergey Fogelson: And then after that, I moved into cyber security, so I worked at a cyber security startup for a little while. And there the most important problem that I tackled was really what we called hack prediction. So the idea was given a company's cyber security footprint, so the number of IP addresses that they have exposed to the public Internet, if you can snoop, for example, and see what kinds of software they're running on computers on those IP addresses. So on servers or on personal computers, et cetera, you can actually see if the software's all up to date. We know that if that software's not up to date it can be hacked in various different ways. The idea is you take all of these kinds of signals and then you assign a probability score to what the likelihood is that this company's going to get hacked within six months or within a year or within two or three years. So we called that hack prediction.
Sergey Fogelson: And then I moved and worked for a small data consultancy where we actually worked for a large investment bank and we worked on what's called an automated account reconciliation problem. So this is not particularly attractive from a data analysis perspective, but it's actually super critical from a back office perspective. The idea here is you have two distinct accounting systems, they occasionally do not line up with each other. They need to be what's called reconciled and you need to basically assign a likelihood that they actually need to be manually reconciled or they can be basically dealt with by other downstream automated systems. So this is almost like a health check that happens at one point in this massive reconciliation process that happens every day within I would say every major investment bank in the world where you're trying to basically make sure that your books line up at the end of the day.
Sergey Fogelson: And this was something that had been done by thousands of people. When we first started this project, there were over a thousand people that were actually hired explicitly to do this, to manually check all of these records. And so what we did was we basically took years and years worth of their manual checks and just put a machine learning algorithm on top of that. We built this ensemble model and you could say look, given this metadata associated with these trades, what's the likelihood that they need to be reconciled manually or basically surfaced up to a manual reconciler, versus just pass it through the system.
Sergey Fogelson: Anyway, long story short, using machine learning on past human performance actually worked surprisingly well. We wound up being able to automate basically 90% of the reconciliation process in this way. So only the most difficult to reconcile records wound up being actually validated by human beings, which meant that those people that were hired to do this can now do other more meaningful, more impactful stuff. So I think that was an overall win.
Major Changes in the Industry
Adel Nehme: Yeah. I think all of these projects that you've engaged in are super useful in the sense that it gives you this breadth of experience in this data space. And this is one thing that I really want to pick your brains on, is really reflecting on how data science has evolved over the past decade or so. For example, you mentioned that some of the problems you were working on, Spark was still on 0.1, 0.2, there weren't really these mature data analysis frameworks. I'd love to pick your thoughts on what are some of the major changes that you've seen occur over the past decade, and how do you really see it playing out within data science teams today?
Sergey Fogelson: I think that's a really interesting thing to talk about. When I first started, again, so I started in 2013, everything was on Hadoop. So basically my first work I was working in Pig jobs, then I was working in Hive, and then I was also working using a framework that came out of Twitter called Scalding. So it was basically Hadoop but using Scala, so writing Scala jobs. So Hadoop basically only now exists, from the way that I see it in the industry, as a legacy system. People are not going out and saying, "If I'm going to build a new state-of-the-art data architecture I'm going to use Hadoop." I don't know anybody or any company that actually makes that a flashy thing that they describe.
Sergey Fogelson: Now, what do we have? Spark is basically completely built out and has almost completely taken over data science from a data processing ETL and even ML kind of perspective. So there's not really any need to think about data from a MapReduce perspective. And in fact, I haven't touched a MapReduce job in probably I want to say three and a half to four years or something.
Sergey Fogelson: So before you had most of your data in flat files. So again, when I was working at the digital advertising startup, we basically had petabytes of data in flat files that were either compressed CSVs or back then what was this revolutionary new data format called Parquet, which now again is super standard across the industry. But nowadays it's actually so cheap. It's still reasonably expensive if you have really, really massive data sets, but it's so cheap to put terabytes of data into structured data warehouses that you can actually query data sets that are on the scale of tens or maybe even hundreds of terabytes.
Sergey Fogelson: I haven't heard of anybody having petabyte scaled data warehouses, at least within my industry because we don't really have petabyte scaled data yet. But I assume that there's probably somebody in finance or especially in web scale companies that are probably dealing with petabyte scaled data warehouses, where you can run basically structured SQL queries that will give you results within at most five minutes or something, which was just completely unheard of eight years ago when I started.
Sergey Fogelson: So that's I think the first really big difference I think, that people have basically moved ... There's still people working with flat files, especially with unstructured data. You can't just put unstructured data into a relational database, into a SQL database. So if you have text or you have images, I think it's very hard to still work with those in a more structured environment. You can definitely put the metadata associated with them into a database, but I don't think you can actually put the raw data itself into a database. So there is a place still for unstructured data. But for the most part, if your data is tabular in structure, it's going to be in a massive data warehouse. It's not really going to be in flat files anymore, unless it's completely unprocessed, completely raw and you're just putting it there for legacy purposes or for safekeeping purposes.
Sergey Fogelson: The next big thing that I think has happened, and this is really the actual revolutionary thing I think, the most revolutionary thing is really ... But again, it's not the sexy stuff. It's the orchestration, data pipelining frameworks for actually being able to automate data jobs on some periodic basis. So we're talking about frameworks like Luigi, frameworks like Airflow. There are obviously other ones. Airflow is what I use now fairly regularly or folks on my team use fairly regularly. But that whole idea of Cron for data science. There were people that were basically using Cron in legacy Linux or Unix based frameworks, for data pipelining, for ETL processes. But they really hadn't come into their own at that moment in time, they were still in their infancy. Now, basically everyone has some kind of a data pipelining framework that they use across some kinds of jobs within their data teams.
Sergey Fogelson: That I think is really, really important. That's really the stuff that allows you to increase the velocity of your data workflows. Instead of having to figure out how to automate that stuff, you can basically just build it once, forget about it. It's scheduled, it's run. There's alerting, there's error reporting, all of that stuff is baked in. You have a front end, all of that stuff. It's just really, really incredibly valuable. Okay, so that's the gut works or skunk works revolution that has happened in data science.
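The "Cron for data science" idea above can be sketched in a few lines. This is a minimal illustration, not the API of Airflow or Luigi: the task names, the `PIPELINE` dictionary, and the `run_pipeline` helper are all hypothetical, and real frameworks add scheduling, retries, and a front end on top of this core idea of running tasks in dependency order and reporting failures.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract": set(),
    "clean": {"extract"},
    "build_features": {"clean"},
    "train_model": {"build_features"},
    "report": {"clean"},
}


def run_pipeline(dag, tasks):
    """Run each task after its dependencies and collect errors instead of crashing."""
    completed, failed = [], {}
    for name in TopologicalSorter(dag).static_order():
        # Skip a task if anything directly upstream of it failed or was skipped.
        if any(dep in failed for dep in dag[name]):
            failed[name] = "skipped: upstream failure"
            continue
        try:
            tasks[name]()
            completed.append(name)
        except Exception as exc:  # stand-in for the alerting a real framework provides
            failed[name] = repr(exc)
    return completed, failed


def _noop():
    pass


def _broken():
    raise RuntimeError("bad input file")


# Simulate one failing step to show how the failure propagates downstream.
tasks = {name: _noop for name in PIPELINE}
tasks["clean"] = _broken
completed, failed = run_pipeline(PIPELINE, tasks)
```

In this run only `extract` completes. The failure in `clean` is recorded and every downstream task is marked as skipped, which is roughly the error reporting behaviour described above.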
Sergey Fogelson: Then on the more ML standards side of things, there's the fact that we now have way more machine learning frameworks and the vast majority of common machine learning algorithms do not have to be written from scratch. So again, I'll come back to my first gig working in digital advertising. One thing we had to do was basically do a connected components analysis. So after you built your graph, you need to understand what the size is of all of your basically mini connected clusters, your households. And the problem is, when you're dealing with a graph that contains hundreds of millions or billions of edges, that was a non-trivial thing to actually be able to compute.
Sergey Fogelson: So I remember we actually had to write our own connected components algorithm on top of Scala. So it was like you wrote a Scalding job that created this connected components graph and that was basically probably three months of work or something doing that stuff. Now, you take a network analysis package, an off the shelf package, and it's just baked in and it can handle graphs with hundreds of millions of edges and hundreds of millions of nodes without a problem.
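For readers unfamiliar with the computation, here is a single-machine sketch of a connected components analysis using union-find. It is not the Scalding job described above, and the device names are hypothetical. Off-the-shelf graph packages provide equivalent routines that scale much further.

```python
from collections import defaultdict


def connected_components(edges):
    """Group nodes into connected components with union-find."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        # Path halving keeps the trees shallow.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


# Hypothetical device pairs that were judged to share a household.
edges = [
    ("iphone_1", "tablet_1"),
    ("tablet_1", "smart_tv_1"),
    ("android_2", "smart_tv_2"),
]
households = connected_components(edges)
sizes = sorted(len(h) for h in households)
```

Each resulting set is one "mini connected cluster", and `sizes` holds the household sizes mentioned in the discussion.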
Sergey Fogelson: Other things. So again when I was starting out, there was really only one implementation of gradient boosting that I remember seeing. And then all of a sudden, basically everyone has a gradient boosting implementation and most gradient boosted implementation frameworks now have a GPU option and they're really, really fast and they're really, really robust. The point I'm trying to make here is that before you had machine learning algorithms that you basically had to implement from scratch or they were really, really difficult to get them up and running at scale. Now, it's not only that you don't have to implement them from scratch. You have implementations in various languages that are very, very fast. There are robust communities across each of these ML frameworks, and in many cases, there are GPU versions of things.
Sergey Fogelson: So now, they're way, way faster than they used to be. So again, you just have open source frameworks, there's lots of them. They're much, much faster than they used to be and they're much more extensible than they used to be. So I think there's been a crazy revolution there. But again, I don't know that that is nearly as important as I think the skunk work stuff that I talked about earlier.
Sergey Fogelson: The next things I think are visualization and explainability. So when I started, again, there were very, very limited explainability methods for non-linear algorithms. So for linear algorithms, a common thing you can do is you just look at regression coefficients and that gives you basically the full story. When it comes to non-linear methods, it's much, much harder to say what the actual impact of a given feature is on a specific prediction.
Sergey Fogelson: But that's changing, right? So now we have some very powerful explainability methods. I can think of SHAP or Shapley value explanations. We have LIME, it's more of a local-based method for explanations for visual problems, so problems in image recognition. And these algorithms really have foundationally allowed practitioners to quickly see where signal is in your feature space and where it isn't. Ultimately, this really has significantly accelerated how quickly you can iterate on creating new features, feature engineering, in ways that we really couldn't do in the non-linear algorithm space. So again, LIME, SHAP, neither of these things existed when I started.
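To make the Shapley value idea concrete, here is a brute-force sketch that attributes one prediction of a small non-linear model to its features. It is not the SHAP library's API. The `model`, `instance`, and `baseline` are hypothetical, "missing" features are simulated by substituting baseline values (one common convention among several), and the cost grows factorially with the number of features, which is why practical tools rely on approximations.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values: average marginal contributions over all feature orderings."""
    n = len(instance)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = instance[i]  # reveal feature i on top of the baseline
            value = predict(current)
            totals[i] += value - previous
            previous = value
    return [t / len(orderings) for t in totals]


# Hypothetical non-linear model: features 0 and 1 interact, feature 2 is ignored.
def model(x):
    return 2.0 * x[0] + x[0] * x[1] + 0.0 * x[2]


instance = [1.0, 3.0, 5.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, instance, baseline)
```

The attributions sum to the difference between the prediction for `instance` and the prediction for `baseline`, and the ignored feature receives zero, which is the kind of "where is the signal" reading described above.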
Sergey Fogelson: As an aside for that, but I think something that is much bigger now, at least within the past I would say year to two years is this notion of feature stores. So the idea here is that you can actually create basically databases or tables that contain the latest versions of the features that you're using for your machine learning algorithms and you can quickly update them, you can quickly source them, instead of having to recreate them in some process. You can almost think of them as tables that you regularly update with the latest versions of whatever features you're using to feed downstream models.
Sergey Fogelson: And what's cool about feature stores is that you can use them across multiple different model spaces. If you have a new data scientist that comes in and you're like, "Hey, you're going to be working on this churn model," you can immediately point them to a place where all of the latest, greatest features exist for that churn model, or from other models that have been built in the past. And so this person can be very quickly brought up to speed with what seems to be working in the problem space that you're attempting to tackle. That again, feature stores, were not something that existed when I started.
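The "tables that you regularly update" description of a feature store can be sketched as a toy in-memory class. The `FeatureStore` class, the entity IDs, and the feature names are all hypothetical; production feature stores sit on real databases and handle versioning, backfills, and serving latency, none of which is modelled here.

```python
from datetime import datetime, timezone


class FeatureStore:
    """Toy in-memory store keeping the latest value of each feature per entity."""

    def __init__(self):
        self._rows = {}  # entity_id -> {feature_name: (value, updated_at)}

    def write(self, entity_id, features, updated_at=None):
        updated_at = updated_at or datetime.now(timezone.utc)
        row = self._rows.setdefault(entity_id, {})
        for name, value in features.items():
            # Keep only the most recent version of each feature.
            if name not in row or row[name][1] <= updated_at:
                row[name] = (value, updated_at)

    def read(self, entity_id, feature_names):
        row = self._rows.get(entity_id, {})
        return {
            name: (row[name][0] if name in row else None)
            for name in feature_names
        }


store = FeatureStore()
store.write(
    "user_42",
    {"days_since_login": 12, "plan": "basic"},
    datetime(2021, 1, 1, tzinfo=timezone.utc),
)
store.write(
    "user_42",
    {"days_since_login": 3},
    datetime(2021, 2, 1, tzinfo=timezone.utc),
)

# Two different models read from the same store.
churn_inputs = store.read("user_42", ["days_since_login", "plan"])
upsell_inputs = store.read("user_42", ["plan", "lifetime_spend"])
```

The churn model and the upsell model both pull the latest values from one place, so a new team member only needs to know which feature names to request.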
Adel Nehme: Thank you so much for this really interesting rundown of all of that perspective that you've seen evolve in data science over the years. And I think really a broad theme that you're talking about here is really the move from experimentation to operationalization and productionization, right? And I definitely agree with you as well on the skunk works revolution that you talked about, about how orchestration platforms and really the move to centralized data warehousing, at least for tabular data, has really changed the game for data scientists and their ability to provide value quickly and to scale their work. I think this is a testament to how much the tooling stack for data scientists has evolved over the past years. And I'd love your insights then on the flip side on where you think there is still room for improvement and where do you think that the data tooling stack is headed.
Sergey Fogelson: Yeah. I think really the largest place for improvement isn't really on the side of greater, better, faster machine learning algorithms. I think at this point, yes you can always have a better neural network model that captures 0.1% more performance for this specific task. I think that's always going to be the case and I think there's always a place for that, but I think it's the operationalization bit that still has the most legs to really grow and mature across all aspects of data science. I mean, I talked about feature stores and these orchestration frameworks and pipelining frameworks, and they work, but in many respects they're still fairly brittle. The automated model performance monitoring still isn't really where I think it would be great for it to get to.
Sergey Fogelson: So another place where I think there could still be some significant improvement, or maybe I just haven't seen the right product, is seamlessly updating different aspects of the ML pipeline. So if you think of all aspects of the ML pipeline as being as modular as possible, so for example, everything from data loading, pre processing, feature generation, just the standard when you think of the JPEG of the typical data science pipeline, all of those parts, you treat them as if they're individual, completely isolated boxes that can just be popped in and popped out, but they really can't.
Sergey Fogelson: The current issues are that if you want to change or add a new data source to some process or tweak a data source for some process, it's really not seamless. It's not as simple as point to this end point and then your model will just magically understand how this data is structured, what needs to be done to convert it, et cetera. You basically have to touch every single box in that JPEG that you have for your typical data pipeline and change it in some specific way. So it's really not completely modular in the way that we think of pure modularity. There are good things to that, there are bad things to that, but I think there's still some improvements that can be done in that perspective, and then in general, more around this modularity stuff but really more on the real time side of things.
Sergey Fogelson: So right now, you can get a lot of stuff done for batch machine learning models. So what I mean by that is you basically have a machine learning model that you build at some point in time, A, then it runs for some specified amount of time. And while it's running, you're getting new data coming into the system. And after that time has passed, you basically recreate a new model that you then reinsert at some pre specified time, hopefully at a time when it's not system critical that the model is performing at a 100% capacity. So you basically take it out and you put a new one in. So this is what I think of as batch model building in data science.
Sergey Fogelson: The realtime stuff is much trickier, right? So there is work, there are algorithms that do realtime updating, so you can do realtime gradient descent updates, you can do some realtime updates on coefficients in linear models, whatever. But what I'm talking about is actually what if, for example, you could in realtime add in a new feature or take out a feature that isn't performing. Currently, the way you have to do that is you have to do as if it's a batch process. You basically have to turn off that old machine learning model, create a new machine learning model where you remove the specific feature or add a new feature or do whatever it is that you're doing and then put that in. So obviously you can try to make that as seamless as possible and make it seem as though it was the same model, but really it's not.
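The "realtime gradient descent updates" mentioned above can be sketched as a per-observation update for a linear model. The `sgd_update` function, the learning rate, and the synthetic data stream are assumptions made for illustration, not a description of any particular production system.

```python
import random


def sgd_update(weights, bias, features, target, learning_rate=0.05):
    """One stochastic gradient descent step for squared error on a single observation."""
    prediction = bias + sum(w * x for w, x in zip(weights, features))
    error = prediction - target
    new_weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias


# Hypothetical noise-free stream generated by y = 2*x1 - 1*x2 + 0.5.
rng = random.Random(0)
weights, bias = [0.0, 0.0], 0.0
for _ in range(5000):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    y = 2.0 * x[0] - 1.0 * x[1] + 0.5
    # The model is updated as each observation arrives, with no batch retrain.
    weights, bias = sgd_update(weights, bias, x, y)
```

The coefficients adapt continuously as data arrives, but the length of `weights` is fixed. Adding or dropping a feature changes the shape of the model, which is why that kind of change still ends up being handled like a batch swap.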
Sergey Fogelson: And so ultimately I think this idea of being able to in realtime modify machine learning algorithms in certain use cases might be very, very business critical for certain businesses. Thankfully, that's not nearly as business critical in many of the arenas that my team tackles data science projects, but I can see that being an issue. So I think basically, a long story short, I think it's this combination of the pipelining and orchestration stuff still has lots of legs, and then realtime model updating or realtime changes both across the pipeline and at the very end where we're actually talking about the model itself, I think there's really lots of improvement that can be done there.
Biggest Challenges Affecting Data Science Teams
Adel Nehme: And with this evolution in mind, by covering the relatively technical limitations that are still present within data teams today, what do you think are some of the other biggest challenges affecting data science teams?
Sergey Fogelson: So I think there are two kind of overarching, non-technical but still very important challenges that need to be tackled. So I think the first one is just a lack of consistent industry wide processes for things and best practices. What would be really great is if there was a data science related, I want to say like a field guide or something where we know these are the things that work across the board or work 90% of the time in these kinds of problems. I think right now what's happening or has happened, that information almost certainly exists in some very distributed, disparate, kind of in the ether on the Internet across random blog posts, and across random maybe as tidbits in the documentation in certain frameworks and stuff. So you find those golden nuggets and you'll be like, "Hey, these people are saying that this works here and these people also said that this same thing works here. Maybe this is just a generally good thing to do."
Sergey Fogelson: So one obvious thing that you could say about that is standardizing your data is generally a good thing to do and you know that because you've heard people say that and you've seen that it's worked well, but there's almost certainly a whole host of other kind of processes or best practices or what have you that don't just involve data pre processing. There might be things around this orchestration and pipelining. There might be things about what's the best practice for serving up predictions. How should they be done? I don't know, there's just so many things in that way that you really can only get right now via exposure to those problems and exposure to those industry leaders that have actually done those specific things.
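As a concrete instance of the standardization practice mentioned above, here is a minimal z-score sketch using only the standard library. The `standardize` helper and the sample values are hypothetical, and it uses the population standard deviation, which is one common choice.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit variance (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no signal; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


ages = [22, 35, 47, 58, 63]  # hypothetical raw feature values
z = standardize(ages)
```

In practice the mean and standard deviation would be computed on training data only and then reused on new data, so that the model sees a consistent scale.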
Sergey Fogelson: But really that's not the way that you grow the overall industry. What you need to do is you need to disseminate that information and it really needs to be captured in some way, almost like in a Wikipedia for data science where you know exactly the way that those things are done. Granted, I sort of understand why that hasn't happened. Ultimately, there's this belief that if that information is scarce it makes what data scientists bring to the table as inherently more valuable. And I think to a certain extent in a very narrow, I want to keep being relevant in my job for the next however many years way yes that probably makes sense. But I think in the larger scope, like scale of I want to make all of data science be more productive, I don't think it makes sense.
Sergey Fogelson: And I think that if you want to democratize data science or get more people to be interested in this stuff or be impactful and to grow, you really need to disseminate this information. And I know that it's going to happen eventually, but I think ultimately that this kind of consistent industry wide creation of something like a best practices Wiki or something like that I think would be really, really important. So that's the first thing. That's kind of like a data science across the board critique I would say about where the challenges are.
Sergey Fogelson: I think from a different perspective, the other really, really big challenge is really buy in by senior executives within legacy companies. I think that, look, if you are a company that was built during the original heyday of the web, so I would say basically if you were built from, I don't know, '95 until 2010, you're going to have data scientists, you're going to have a commitment to data and insights and whatever because you had to have that to survive, especially when the going was really rough post the dotcom bubble of the early 2000s. You had to be principled and committed to data and insights from data in order to survive.
Sergey Fogelson: So from those kinds of companies, I don't think there's any issue with technical buy in. So if you wind up working for a Google or a Facebook or an Airbnb or an Amazon or whatever, that's all solved. I don't want to say it's all solved there, but they know that data is important and they understand what the scope of the technical challenges around data management and processing, all that stuff. They understand it exists, they understand it's super important. I don't think that that's really fully happened in the same way at legacy companies.
Sergey Fogelson: So companies that were founded before 1995, or that weren't founded with the Internet in mind as their primary engine for creating value, for those companies, they really still amongst senior leadership, I do not think that they fully understand all of the aspects of what it means to be a fully data driven, data science driven company. I think there are still lots of places where you have people making decisions based on their gut, based on intuition, based on industry knowledge. I think all of those things are super important. I'm not saying do not trust your gut, I'm not saying do not use your intuition, I'm not saying do not use your judgment that you've created and honed over however many years of being a senior leader at these companies. What I am saying is you need to understand that in order to continue to thrive and survive, you need to start using data much more significantly in order to derive maximum value for your business. And I do not think that they are doing that as much as they should be at these legacy companies.
Sergey Fogelson: I'm not just talking about where I work now or where I've worked in the past, it's just a general thing where when you start asking other data scientists or practitioners or data leaders at other organizations and you ask them what's your biggest challenge, 9 times out of 10, they're going to tell you, "Look, I understand where the investments should lie, I understand what I need to be doing to make these things happen, but ultimately I need buy in. I need people that are in the most senior leadership, the C suite executives, to not only say every quarter we care about data science during their board meetings, but to actively actually say, 'We are going to make investments in data warehousing. We are going to make investments in this monitoring. We're going to make investments in third party data that we're going to purchase because we understand that the more data that we have that's relevant to our business processes, the more profitable or the more successful that we will be as a business.'"
Sergey Fogelson: So I think ultimately it's the fact that we have basically two cohorts, or at least at a minimum two different kinds of companies that are operating within industry today. You have legacy companies that are saying that they're committed to data science, but still have not really made the full plunge and then the companies that account for the vast majority in terms of what we would think of as growth and success and value creation, whether it's from a stock market perspective or whatever. And we're talking about basically Amazon and the post Amazon companies that have been created since then that are full in on data, full in on technical buy in and have as a result created, unlocked whatever, trillions in value for themselves and for the economy as a whole.
Pitfalls When Trying To Gain Buy In
Adel Nehme: I think there's so much to unpack in both of these points, but I think the second point that you mentioned of the lack of committed buy in by senior executives especially in relatively legacy industries where the organizations have not really used the Internet as their primary source of value creation, especially when you think about the Amazons and the Airbnbs and the Ubers of the world. I think that problem, the data culture problem to a certain extent is one of the biggest problems affecting data science today, so I would love to expand on that one. As a data science executive, I'm sure that you've had tons of experience with gaining buy in from stakeholders, non-technical ones. Can you walk us through some of the pitfalls data science teams or data science team leaders often encounter when trying to gain buy in?
Sergey Fogelson: Yeah. So I think there're a couple that I've noticed anyway. So the first I think is one that I think in general people do, where they're trying to seem helpful, but it actually ultimately can significantly hinder their long-term success within the company. It's the wizard claim, where people ask you, "What can you do for us?" And you say, "Well, everything. I can literally do everything. I can answer every question. I can achieve operational success anywhere you put me." And so yes, that's probably the case over a long enough timeline.
Sergey Fogelson: So if I had infinite time and infinite resources I can solve any question, but that's not what a senior executive is looking for. What they're looking for is they're going, "Look, I have these specific things I care about and I want concrete improvements on these." Basically when you over promise and under deliver ... So my first axiom of being a data science manager is you always under promise and over deliver, you never do the opposite. So you say you can do a little, but you wind up doing double that amount, so that people immediately come away impressed.
Sergey Fogelson: And that fosters that buy in because as soon as you way over deliver for a given project people ask, "Can you improve this one little thing?" And you go, "Sure. Oh, and by the way, I also did this, this, and this." That's great. What you don't want to do is go, "Oh yeah, I can do that thing, but also I can do this other thing. Oh, and I can do this other thing. Oh, and I can do this other thing on top of that." And then when you have your next check in with that senior executive, you've basically gotten 10% of the way across all of those things and you have nothing actually concrete to show them.
Sergey Fogelson: So basically, over promising and under delivering, which I think is a general problem for any of what I would call an innovation type group in an organization, I think that will doom your ability to get that longer term buy in because now they can't trust your word. You say one thing, that you can do X, Y, Z, A, B, C, and then you wind up only being able to do a third of X, a half of Y, and a tenth of Z, and by the way, none of them are actually a full letter. And so now what the hell can I do with that, right? So in general, doing that, so over promising and under delivering, is the best way for you to sink your ability to get technical buy in. So definitely don't do that.
Sergey Fogelson: But a second thing is a part of this, but it's really in the actual execution aspect and this is what people talk about, the perfect being the enemy of the good. Or, I like to think of the perfect being the enemy of the good enough or reasonable, let's just use that, where you basically just say look, this works well enough to where it's better than chance. If you have any kind of improvement over what came before, it's good enough, let's immediately start putting the gut works around this, so this can be a repeated process, it can be productionized, et cetera.
Sergey Fogelson: In general, and you hear this a lot, there're like two things you hear about data scientists. Data scientists always say, "80% of my job is data cleaning." And then the other thing that they say is when it comes to actually building a model, the first 20% of the time you get 80% of the way of the performance and then the remaining 80% of the time, when you're building the model, you get the remaining 20% of the performance. So what does that actually mean?
Sergey Fogelson: If you think about that in terms of actual time, it means that if you want to get a reasonably good model and you spend the first month doing that, if you want to get a 5%, 10%, 15% boost, it's going to take you months and months and months of additional work. So what the hell does that mean? What that means is those months and months of additional work, where you got marginal improvements in the overall quality of the model, they've taken precedence over actually productionizing the model.
Sergey Fogelson: What should have happened is as soon as you got to something that was better than nothing or better than what came before, no matter how limited that improvement was, you should immediately start building out the pipelines, the reporting, the monitoring, automating the ETL processes, automating all of that other stuff to actually get it to a place where it's actually a data product. And so I think that's the second really salient point for where if you're always just talking when you're meeting with senior executives and you're going, "We improved our model. We improved our model by 5%," and then you have another weekly meeting ...
Sergey Fogelson: Well, let's say that you're not going to get a weekly meeting with a senior executive, let's say it's a monthly check in. And so month one you go, "We got it to 80%." And then month two you go, "We got to 85%." And then month three you go, "We got to 87%." And then month four you go, "We got to 89%." The executive at this point's going to be like, "What are these people doing? 80% three months ago was way better in terms of potential revenue increases or decrease in losses, is way better than 89% now without any actual data product built and no actual business critical things driving it." So that I think is really, really, really, really, critical.
Sergey Fogelson: If you want to get solid technical buy in from a senior leader, you really, really have to, as soon as you meet criterion, where again criterion here has to be super low, where you literally just say it's better than what we had before, whatever what we had before was, immediately build something on top of it. Build the non-ML parts or non-core ML parts of that product such that you can immediately show that this thing works, we can deploy it today. It's not going to be great, but it's going to be better than what we had and that means that you're saving money.
Sergey Fogelson: And that's where as soon as the senior leader's like, "This is great, we didn't have anything before, now this thing works," that's where they're immediately going to be like, "Okay, this is really awesome. How can we get more?" And then at that point you can be like, "Okay. Well, you see we did this. We can do so much more, but you have to understand here are all of the obstacles that we're facing." And so you basically use a quick win to then drive the more lasting, the more difficult, the longer term change in that organization.
Best Practices To Ensure Organizational Buy In
Adel Nehme: Yeah. Getting a quick win is so essential to getting, one, a data culture enthusiastic, to bringing up an enthusiastic data culture within an organization, but also to get organizational buy in because you need to be able to provide value fast, otherwise there's going to be questions around the value of data science in general. So on the flip side, what are some of the best practices you've found that can ease an alignment there and ensure organizational buy in around data projects?
Sergey Fogelson: So it's basically like take everything I said and do the opposite of that with a little bit of other stuff thrown in. So one, I think the most critical business related aspect of being a data scientist is operationalizing and converting a statement by a senior leader into something that's measurable. So this person says, "I want to reduce X by Y percent." Or they say, "I want to increase revenue on this thing by this amount." You have to say, "Okay, increase revenue by this amount, what is tied to revenue? How can we measure that tie-in to revenue and what aspect of that generation process is the least efficient currently and how can we use ML to improve that efficiency in some way?"
Sergey Fogelson: So basically the idea is you have to operationalize whatever it is that that senior executive asked for, to convert it into something that can be measured, that can be converted into tables and bits, et cetera, and then do that as early as possible. And, make sure the senior executive is aware of what those criteria are and agrees that they make sense in their context. So for example, if I go back to this original question where the senior executive said, "We want to increase our revenue in this specific field by 10%." So let's just talk about churn. We want to increase our revenue for this specific product by 10%. And then you look at the product and you go, "Okay, well one way you can do that is by growing your subscriber base. Another way you can do that is by limiting churn."
Sergey Fogelson: So you go back to the person, you go, "Okay, you said you want to increase revenue by this much, so why don't we think of a project where we actually increase the subscriber base?" So you tell the executive that and they go, "No, no, no, no, that's not going to work, we've already saturated the market. There's no more new subscribers that we can tackle and our acquisition costs are going to be too high." So if you get to that point quickly, you can immediately say, "Okay, we're not going to start immediately going down the rabbit hole of what can we do to acquire new customers, so let's look at the other thing. Okay, what about churn? What if we reduce churn by this amount?" And then the person goes, "Okay, yes." The executive goes, "Yes, exactly. So the way that I think we should increase our revenue is by limiting churn." And so now immediately you're like, "Okay, great, I understand that it's a churn issue, so let me get to tackling that."
Sergey Fogelson: In general, I think this is what's really important: making explicit things that are implicit in what the executive is saying, drawing them out, I think is really important for at least starting on the right foot, because otherwise you might assume that they're saying one thing, but they're saying something totally different and you really have to bring it out of them. So anyway, that's the first piece, this idea of operationalizing exactly what your success criteria are as early as possible. Basically as soon as you have that first meeting, you make sure you understand exactly what it is that they are interested in tackling, and anything that is vague is made explicit as quickly as possible because that means you can immediately get started.
Sergey Fogelson: Now, once you've operationalized those criteria, what your criteria for success were, the next thing, and this is kind of obvious but you would be shocked at sometimes how difficult this is, is getting access to the data that you need as early as possible in a project's lifetime. Basically, as soon as you have that first meeting, you need to get access to this data. And it doesn't actually mean you need the full database, you just need something, you need a sample, you need just anything that shows what the actual real data looks like. It doesn't have to be, like I said, access to a production data warehouse, it doesn't have to be all of the flat files that have ever existed, just something, because ultimately that's the only way you can really measure the true amount of effort that is going to be necessary for this project to actually become viable.
Sergey Fogelson: And the reason I say this is because the only way ... And this is the only way that I've ever seen anything work in an organization. I don't think it's because of anything nefarious, it's just because people assume they know things that they don't. It's that you don't know the state of that data. Never, ever, ever trust anyone's claims about cleanliness, data structure, data frequency, just any assumptions that they have or what they say they think they know about the data. Until you've done basic EDA on it, so exploratory data analysis on it, you don't know anything about the data. Basically it could be anything.
Sergey Fogelson: They could tell you it's in a pristine state and then you get it and 90% of the columns have 50% missing values. One column might actually have combined several different data formats. I've seen plenty of cases where a column that's supposed to be a date is actually both a timestamp and a month and a day, and then sometimes it's just a time mixed in. There's just so many things that you have to check and see and the only way you can do that is by getting access to a snippet of what you're going to be working on as early as possible, because the earlier you do that, the better you can understand what your true timelines are going to be.
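Editor's sketch: the two checks Sergey describes here (share of missing values per column, and a single column mixing several date formats) can be run on even a tiny extract. The sample rows, column names, and the list of formats below are hypothetical and only illustrate the idea; a real check would probe whatever formats your own data might contain.

```python
import re
from collections import Counter

# Hypothetical extract; in practice this would be a small sample of the real data.
sample = [
    {"user_id": "u1", "signup": "2021-03-04 10:15:00", "plan": "basic"},
    {"user_id": "u2", "signup": "03/05", "plan": ""},
    {"user_id": "u3", "signup": "2021-03-06", "plan": None},
    {"user_id": "u4", "signup": "", "plan": "pro"},
]

# Candidate formats to probe for (an assumption; extend for your own data).
DATE_PATTERNS = {
    "timestamp": r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}",
    "date": r"\d{4}-\d{2}-\d{2}",
    "month/day": r"\d{2}/\d{2}",
}


def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or str(value).strip() == ""


def missing_share(rows):
    """Return the fraction of missing values for every column."""
    columns = sorted({key for row in rows for key in row})
    return {
        col: sum(is_missing(row.get(col)) for row in rows) / len(rows)
        for col in columns
    }


def date_format_counts(rows, column):
    """Count how many non-missing values match each candidate format."""
    counts = Counter()
    for row in rows:
        value = row.get(column)
        if is_missing(value):
            continue
        for name, pattern in DATE_PATTERNS.items():
            if re.fullmatch(pattern, value):
                counts[name] += 1
                break
        else:
            counts["unrecognized"] += 1
    return counts


print(missing_share(sample))
print(date_format_counts(sample, "signup"))
```

More than one key in the second result means the column mixes formats, which is exactly the kind of surprise that changes a timeline.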
Sergey Fogelson: Then the next part, I think this is just the inverse of what I had said earlier. So I had said the perfect becoming the enemy of the good enough. What I'm saying here is look, as soon as you have something that's at some reasonable criterion, stop focusing on the core ML parts of your product and immediately allocate as many resources as you can to actually standing this thing up, because although it's fun to get incremental, there's a little dopamine rush, there's a little dopamine trip that happens any time you get a slight performance boost in your model, it's not the stuff that's actually going to get your project across the finish line.
Sergey Fogelson: What you need to do is you need to be building out the non-core ML parts of your project in order for it to succeed within the timelines that you've told people that they're going to have. So anyway. So the monitoring, the visualization, the pipelining stuff is the stuff that you need to build to get to the finish line. So start building it as soon as you possibly can because that's ultimately what's going to provide true value to both your business and to your senior stakeholders.
Sergey Fogelson: And then lastly, this is just a general sourcing and allocation and estimation thing that I've learned to do after being burned a couple of times. Basically you have your internal estimate of how long you think a given project will take, just double that. And then give that as your actual timeline to your senior stakeholder. So if you think that something will take you a month and a half to do, internally, tell the senior executive that it's going to be three months. And that way, when you do deliver it in two and a half months or in two months, it's seen as a huge improvement and everyone's really, really happy. Again, this is the idea of you always want to under promise and over deliver. Doing that is the way that you get senior executives to buy in effectively and consistently on the projects that you begin.
Adel Nehme: I think this is really solid advice and especially when applied to organizations who are still maturing their data science competencies. I would assume, for example, access to data is not a major problem at the Ubers and Airbnbs of the world, but this is quite the ubiquitous problem throughout the industry, and yeah I think this is super useful. Now with that in mind, I would like to segue to data democratization, because I think really an important aspect of data science is not only producing data products, but really equipping the rest of the organization with the ability to work with data themselves. How do you view the importance of democratizing data for data science teams as a strategic imperative?
Sergey Fogelson: I think the question almost answers itself. We know that data's really important, data's what's going to unlock the largest amount of business value when applied correctly. So the more people in your organization that aren't pure data scientists or data engineers or data analysts that you can get to have access to that data and to start thinking about it, the more successful that you're going to become. So I think it's absolutely critical to provide people outside of data science organizations the tools to be able to do what I like to call fishing for themselves.
Sergey Fogelson: So if you can, if you do not have an expert background in a scripting programming language and statistics or in SQL or something, it would be really, really great if you could still get to even half of the kinds of questions that you want to be able to answer but you currently can't ask because you just don't have the necessary skills. And ultimately having data scientists do these things is a bad use of resources. Getting people in marketing or in, I don't know, in product or in some other part of your organization answers to questions that they should be able to get themselves if the data was in a reasonably structured enough or was placed somewhere where it could be easily accessed by non-technical people, it would just impact them so, so much more.
Sergey Fogelson: Like I said, data scientists are expensive and they spend the vast majority of their time organizing and cleaning data and much less of that time actually mining it. So if you think about it, if you can get even a small fraction of your organization, but more than currently are like this, to be able to either ... I think my dream would be if everyone could use SQL in the same way that they use Excel. If they could even get to that point, the entire organization would benefit just so immensely. The organization's abilities to tackle questions quickly I would bet would grow by an order of magnitude. So I think data democratization is super, super important.
Adel Nehme: Yeah, 100%. And even for example internally at DataCamp, we have a centralized data warehouse that on top of it you can have Metabase or some form of connection SQL database. And most people at DataCamp know how to use SQL and that has really enabled everyone to answer their data questions. If you're on the sales team, you want to see who's the account that has the highest sales, you can check that out immediately. If you're on the marketing team, you want to optimize spend somehow, all of this analysis is really done immediately through a Metabase.
Democratizing Data Science
Adel Nehme: Like any organization, there are so many things that we can do to become more data mature, but it has really changed how we interact with data in that sense. Well, this is highly use case or industry relevant. What are some of the low hanging fruit that you find data science teams can quickly implement today to further democratize data science and to, as you said, give people the ability to fish for themselves?
Sergey Fogelson: I think the quickest things that they can do are two things. One is you provide an aggregated data view at the level that business analysts would typically see data, and surface it up to people so they can plug away at it in a dashboard like environment, so something like a Tableau. I think that would probably be the first thing. Basically you figure out some reasonable level of data granularity, you surface up a table at that data granularity, you provide updates to that table. Basically it could be a materialized view. So a materialized view that's constantly being updated with fresh data at a set aggregation level, and then you surface it up either in a dashboard.
Sergey Fogelson: Or really, I think the other way is something like what you just talked about, like a Metabase. I know there are other tools that provide effectively something like an Excel-like connector on top of the data warehouse. So if you can even do something like that, I think that would be a very easy quick win. I think if you can get something in a place where people in the organization are reasonably comfortable with something similar to that, so providing something like an Excel-like product, but that actually connects to a data warehouse that has, like I said, terabytes of data, I think that would be one very, very quick way to unlock that value.
Sergey Fogelson: Now unfortunately, you will have to do some kind of socialization around the dos and don'ts for that. So I imagine if you have a data warehouse and people are using SQL, that's well and good, but you have to let them know if they're looking at the top X in Y, so top sales person or the top account across the organization, or who's generated the most sales over the past month or whatever it is, that's great. But for example, if they're trying to get larger segments of the data, knowing that there are certain things you shouldn't do because it could break your database is really important. So for example, doing a SELECT * without a limit is the basic kind of pitfall that they should avoid when performing these kinds of queries. And I assume that those same kind of things would happen even if you had an Excel-like connector.
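Editor's sketch: a minimal illustration of the "top X in Y" query and the bounded preview mentioned above, using Python's built-in sqlite3 module as a stand-in for a real warehouse. The `sales` table, its columns, and the helper names are hypothetical.

```python
import sqlite3

# In-memory database standing in for a warehouse; table and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (account TEXT, rep TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("acme", "ana", 120.0),
        ("acme", "bo", 80.0),
        ("globex", "ana", 150.0),
        ("initech", "cy", 40.0),
    ],
)


def top_accounts(connection, n):
    """Aggregate in the database and return only the top n accounts."""
    query = """
        SELECT account, SUM(amount) AS total
        FROM sales
        GROUP BY account
        ORDER BY total DESC
        LIMIT ?
    """
    return connection.execute(query, (n,)).fetchall()


def preview(connection, n=5):
    """Peek at raw rows with an explicit LIMIT instead of an unbounded SELECT *."""
    return connection.execute("SELECT * FROM sales LIMIT ?", (n,)).fetchall()


print(top_accounts(conn, 2))
print(preview(conn, 3))
```

Both helpers push the aggregation and the row cap into the query itself, so only a handful of rows ever leave the database.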
Sergey Fogelson: But I think those are the two things you can do. One, provide a reasonably aggregated view that can then be accessed by people. And then once you have that reasonably aggregated view, have something like an Excel-like connector that people can connect. I've never used Metabase, but now I'm very interested in what this is, so I'm going to have to check that out. Thank you for that, Adel. But I know there are other tools like that, that provide basically a query layer on top of your data warehouses. So I would suggest that as well.
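Editor's sketch: one way to picture the "reasonably aggregated view" is a summary table that is rebuilt on a schedule from the raw data. SQLite has no materialized views, so this example emulates one by recreating a table; the `raw_sales` and `daily_sales_summary` names and the chosen granularity (date and region) are assumptions for illustration. A real warehouse would typically offer its own materialized view or scheduled-query feature.

```python
import sqlite3

# Hypothetical raw table; a dashboard or query layer would read only the summary.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_sales (sale_date TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?, ?)",
    [
        ("2021-03-01", "east", 10.0),
        ("2021-03-01", "east", 15.0),
        ("2021-03-01", "west", 7.0),
        ("2021-03-02", "east", 5.0),
    ],
)


def refresh_daily_summary(connection):
    """Rebuild the aggregated table from scratch (stand-in for a materialized view refresh)."""
    connection.execute("DROP TABLE IF EXISTS daily_sales_summary")
    connection.execute(
        """
        CREATE TABLE daily_sales_summary AS
        SELECT sale_date, region, COUNT(*) AS orders, SUM(amount) AS revenue
        FROM raw_sales
        GROUP BY sale_date, region
        """
    )
    connection.commit()


def read_summary(connection):
    """What a non-technical user's tool would query."""
    return connection.execute(
        "SELECT sale_date, region, orders, revenue "
        "FROM daily_sales_summary ORDER BY sale_date, region"
    ).fetchall()


refresh_daily_summary(conn)
print(read_summary(conn))
```

Calling `refresh_daily_summary` again after new rows land in `raw_sales` brings the summary up to date, which is the role a scheduled refresh would play.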
Adel Nehme: Yeah, exactly. And in your experience as well, what do you think are the obstacles standing in the way of enabling really mature or robust data democratization and what do you think are some of the tactics that can alleviate them?
Sergey Fogelson: I think this is going to be my plug for DataCamp here in this podcast. I think the first thing that I would say is look, there's fundamentally a skills disconnect between what's necessary in the modern database company and what people actually possess. So I think the first thing that people should learn is just learn SQL. Take some courses on DataCamp and learn SQL. If you don't want to use DataCamp, go somewhere else and learn SQL, but learn some SQL. I think that is absolutely the best way for a person to level up their data abilities nowadays, if you can learn SQL.
Sergey Fogelson: It's older than Excel, more powerful than Excel, at least from a data munching perspective, maybe not from a bunch of other perspectives. But I think that's really important. I think anybody that is comfortable in Excel should learn to become comfortable in a SQL environment. And you think about a business analyst, they live and breathe Excel. They know it really well. I've seen people do stuff in Excel that I was just like, "What is this? This is not Excel. This is like some weird, abstract ... You have scripts in here where a cell has literally enough text inside of it to fill an entire page." It's like, "Dude, you're literally creating a Mario clone inside of this Excel spreadsheet. What are you doing?" If you can create Tetris inside of an Excel window, you'll learn to use SQL. You are very, very good at manipulating data. It's a little bit different way to think, but you should totally be able to do that.
Sergey Fogelson: And in general, that's kind of the way that you're going to have to mature just technically yourself and just your entire data org as a whole because data now lives in the cloud. The time of people passing around Excel spreadsheets and saving them to some J drive somewhere, that's going to keep existing, but that's not where your golden records are going to live. If you as an organization are still living in a world where everything, all of your master data sits across 300 or 500 different Excel spreadsheets, maybe that's sustainable in the near term, but that's just not going to be sustainable over the longer term.
Sergey Fogelson: That stuff's going to be in a cloud data warehouse, it's going to be in a cloud database of some sort. In order to be able to access it, in order to be able to perform any kind of a non-trivial query on it, any kind of a non-trivial aggregation on it, you're going to need to use SQL, so you might as well learn. So I think that's just absolutely important. I think you can do that in lots of places. I know that DataCamp has some excellent SQL courses. So those of you out there that want to learn more, you should totally check that stuff out. I didn't write any of them, so this is not a plug for any of my courses. But anyway. So that's I think the first thing I would say.
Sergey Fogelson: The next is I think it's more of a description of the entire data enterprise as a whole and it's really this lack of understanding or an appreciation of how the way in which whatever your source data is, wherever it's coming from, those signals, how they're collected or stored impacts how quickly or easily a given question can be answered. So I think this is more of a problem again for senior executives, where they just assume we have this data, so why can't you just answer this question? We have the data, we have every possible interaction that's ever happened on our website or on our app or on our service, whatever it is, why can't you just tell me how many Xs are in Y?
Sergey Fogelson: And the reason that I can't tell you that is because of the way that, one, the data's stored, or two, the way that the data was collected. And so having an appreciation that if for example the way that we store sessions is separate from the way that we store users, means it's very difficult to figure out how many unique users were on your platform over the past month. It means that if senior executives don't appreciate that, if they don't understand that the way this data's actually sourced makes it so that what you think is a simple question to answer is actually kind of a difficult question to answer, it's not nearly as trivial as you thought.
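Editor's sketch: a toy version of the sessions-versus-users problem, again with sqlite3. The schema is hypothetical: sessions are keyed by a device identifier, and a separate `device_owners` table maps some devices to users. Counting "unique users" then needs a join, and any device without a mapping cannot be attributed at all.

```python
import sqlite3

# Hypothetical schema: sessions know only a device, not a user.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sessions (session_id TEXT, device_id TEXT, started_on TEXT);
    CREATE TABLE device_owners (device_id TEXT, user_id TEXT);
    INSERT INTO sessions VALUES
        ('s1', 'd1', '2021-03-01'),
        ('s2', 'd2', '2021-03-02'),
        ('s3', 'd1', '2021-03-05'),
        ('s4', 'd3', '2021-03-09');
    INSERT INTO device_owners VALUES
        ('d1', 'u1'),
        ('d2', 'u1');
    """
)


def naive_unique_count(connection, start, end):
    """Counts devices, which overstates users when one person has several devices."""
    row = connection.execute(
        "SELECT COUNT(DISTINCT device_id) FROM sessions "
        "WHERE started_on BETWEEN ? AND ?",
        (start, end),
    ).fetchone()
    return row[0]


def resolved_unique_count(connection, start, end):
    """Return (known distinct users, devices that could not be tied to any user)."""
    return connection.execute(
        """
        SELECT COUNT(DISTINCT o.user_id),
               COUNT(DISTINCT CASE WHEN o.user_id IS NULL THEN s.device_id END)
        FROM sessions AS s
        LEFT JOIN device_owners AS o ON o.device_id = s.device_id
        WHERE s.started_on BETWEEN ? AND ?
        """,
        (start, end),
    ).fetchone()


print(naive_unique_count(conn, "2021-03-01", "2021-03-31"))
print(resolved_unique_count(conn, "2021-03-01", "2021-03-31"))
```

The gap between the two numbers, plus the unattributed devices, is the part of the answer that depends on how the data was collected and stored rather than on the query.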
Sergey Fogelson: So basically this getting people. And I've definitely had to do some of this myself and it's paid dividends because they then can push back on others and be like, "Why the hell was this built like this? Why did we not do it this way?" And that immediately, I think it just makes the accountability for how data is processed, cleaned, how it's stored a lot more visible, a lot more transparent. And when you identify why there are these kinds of breaks in the system or gaps in the system, it immediately makes it so that everyone has to talk to each other a lot more. And the more that they talk, the more ultimately, hopefully the issues that they're having are going to be solved.
Sergey Fogelson: So I think that's really the second obstacle to this data democratization. It's really this notion of being able to understand how the data was sourced and being able to effectively evangelize how the data is structured immediately lets people understand and know two things. One is they have more of an appreciation for how difficult some things are, but they now also have a much better understanding that if something takes a while, it means there's probably some significant gaps in the way that things are happening. So I know this isn't really a point on democratization, but I think it's more of a point on democratization of the challenges around data as opposed to data access, if you will.
Call to Action
Adel Nehme: I think democratizing data access and providing extreme context around how data is sourced and how it impacts a given function is also super important in the formula of data democratization. Sergey, it was a huge pleasure chatting. Before we let you go, do you have any other call-to-action to make?
Sergey Fogelson: One, I wanted to say thank you very much, Adel. This was really, really lovely. This was a great way for me to relive my entire data journey up to this point, so this was cool. I think really the only call-to-action I really have is I think everyone should practice a little bit more humility when it comes to data science. This stuff is hard, and I think that most people, or your companies, or whoever it is that you're working with, have the best interests in mind. I don't think everyone is out there nefariously trying to ruin data projects. So I think practicing some humility when it comes to both practicing data science and also evangelizing for data science is really, really important.
Sergey Fogelson: And I think as part of that, when you're humble, it means that you always have room for growth, you have more opportunities to learn. I think ultimately learning is the way that you make the largest impact within any organization that you work for, but it also makes for a much more meaningful life. I've enjoyed learning about new techniques, new frameworks and all of that stuff and I think as long as you keep learning as a data scientist or really just across your entire life, I think you'll find that your work and the things that you do are going to become a lot more meaningful. So that's my hope for everyone, stay humble and stay learning.
Adel Nehme: That's awesome. I would highly recommend that you check out Extreme Gradient Boosting on DataCamp taught by Sergey Fogelson. And with that in mind, thank you so much, Sergey.
Sergey Fogelson: Thank you. Thank you again, Adel.
Adel Nehme: That's it for today's episode of DataFramed. Thanks for being with us. I really enjoyed Sergey's insights on how data science has evolved over the years and what is still remaining to really scale the impact of data science across organizations and industries at large. If you enjoyed this episode, make sure to leave a review on iTunes. Our next episode will be with Barr Moses, CEO of Monte Carlo Data on the data quality challenges data teams face and the importance of data observability and reliability. I hope it will be useful for you and we hope to catch you next time on DataFramed.