
The Full Stack Data Scientist with Savin Goyal, Co-Founder & CTO at Outerbounds

Richie and Savin explore the definition of production in data science, steps to move from internal projects to production, the lifecycle of a machine learning project and much more.
Jul 11, 2024

Guest
Savin Goyal

Savin Goyal is the Co-Founder & CTO at Outerbounds. In addition to his work at Outerbounds, Savin is the creator of the open source machine learning management platform Metaflow. Previously Savin has worked as a Software Engineer at Netflix and LinkedIn.


Host
Richie Cotton

Richie helps individuals and organizations get better at using data and AI. He's been a data scientist since before it was called data science, and has written two books and created many DataCamp courses on the subject. He is a host of the DataFramed podcast, and runs DataCamp's webinar program.

Key Quotes

Production is a spectrum. It can mean many different things to different individuals in the same organization.

What I'm quite excited by these days is that maturity in conversation is coming back, in the sense that people are understanding that GenAI plus predictive AI is a thing. It's not as if one strategy or one approach is going to completely upend the other. At the end of the day, you need to build machine learning systems.

Key Takeaways

1

Before moving a model to production, evaluate its worthiness through extensive testing and ensure it adds measurable business value.

2

Strive to develop or hire full-stack data scientists who can manage the entire machine learning pipeline from data wrangling to deployment.

3

Leverage both generative AI and predictive AI to build comprehensive machine learning systems that address diverse business needs and improve user experiences.


Transcript

Richie Cotton: Welcome to DataFramed. This is Richie. The role of the data scientist is changing. Some organizations are splitting the role into more narrowly focused jobs, and others are going the opposite direction and making the role broader. It's this latter approach, known as the full stack data scientist, that we're talking about today.

The name is derived from the idea of a full stack software engineer, and that's appropriate as the extra duties often include software engineering tasks. In particular, one of the functions of a full stack data scientist is to take machine learning models and get them into production inside software.

Teaching us about the new role and the intricacies of getting machine learning into production is Savin Goyal, the Chief Technology Officer and Co-Founder of Outerbounds. He's the creator of the open source machine learning management platform Metaflow. And before Outerbounds, Savin was also a software engineer at Netflix and LinkedIn.

I'm excited to hear what he has to say.

Hi, Savin. Great to have you on the show.

Savin Goyal: Yeah. Thanks, Richie. Thanks for having me.

Richie Cotton: Excellent. So, today we're talking about data science in production. But what does that actually mean to be in production?

Savin Goyal: It's in the eyes of the beholder at the end of the day. Production is a spectrum. It can mean many different things to different individuals in the same organization. Depending on the maturity of your project, your definition of production could be quite different.

I remember back at Netflix: of course, the Netflix recommendation system is perhaps one of the most famous machine learning systems out there. And you can imagine, I think everybody would agree that the definition of shipping a model to production there means that you're able to run an A/B test with that model against live user traffic, and that is a good definition of production.

But there were a lot of other data-informed decisions that the organization was also making. And I'm pretty sure that's the same thing across many other companies too, where the output of the model informs a business strategy. So in that way, even an offline analysis of a model, where you end up generating a Google Doc or writing a memo for consumption by, let's say, the business team, can go a long way.

I'm pretty sure a lot of companies in the consumer space, for example Spotify, I believe, recently raised their prices or are about to, and there would have been a lot of statistical analysis, and very likely machine learning, that went into figuring out what kind of pricing strategy or pricing changes across geographies would work well. And that's yet another example of putting models into production where no microservices are involved, no high-scale, low-latency inferencing is involved, but it is still as critical to the business as any other machine learning use case.

Richie Cotton: So it's really not just about putting some data science in a consumer-facing app. It could just be, well, this is going to change prices or something like that. I like that. Can you talk me through what steps need to happen to go from "this is an internal project" to "this is going to be something customer-facing or otherwise in production"?

Savin Goyal: I think at the end of the day, the first question always is: is this model even worthy of being in production? I think there's sort of a popular statistic, I don't even know if it is accurate, that says X number of models never make their way into production, and that X number is obscenely high.

There may well be some truth to it, but I think one thing that's most important to understand is that in data science there is an expectation that many of your models will never make their way into production because you run some tests, you figure out that, yes, this model isn't really worthy of promoting to production or continuing any further development on top of it.

And that's okay. That's the nature of data science. What's most important is: are you able to move from one model to the next version of the same model very quickly and effectively or not? That's basically my viewpoint when it comes to figuring out how to promote a model or what production actually means.

Now, to your original question of what steps are involved: if you think about the life cycle of a machine learning project, it starts out with data. You have a certain hypothesis, you look at the data, you train a bunch of models, and then you figure out in what capacity you would want to consume these models.

And if it's a consumer setting where you can run decently powered A/B tests, then quite often that is a good next step. But that may or may not be feasible. There are times when you just cannot run a test, you're not in a situation to do that, and then you can be a little bit more creative in terms of running quasi-experiments.

But when that's also not feasible, then we usually see people invest a lot more heavily in the explainability of their outcomes. So, for example, one of the more popular machine learning applications at Netflix is this suite of models called Crystal Ball that allows Netflix to predict the dollar valuation of any piece of content so that they can be a lot more disciplined in terms of how they construct their portfolio. Creating a counterfactual in that particular scenario is going to be very tricky, and running an A/B test can be quite difficult as well. In those scenarios, you rely on the explainability of the outcomes of your model, and then you rely on your business judgment at the end of the day. That's yet another mechanism for figuring out whether the model is worthy of being deployed or not.

Richie Cotton: Okay, so really a lot of it is about just making sure that you have something that's going to add value before you put it in a place where customers are going to see it. Before we get into the details of how you go about doing all this I'd like to hear some success stories. So have you come across any cases where companies have invested in putting their data code into production, and then they've seen some sort of benefit?

Savin Goyal: Yeah, I think Netflix has always been the poster child for machine learning adoption, and this was maybe a decade ago or so when they first spoke publicly about the impact of the recommendation system on their bottom line. Even at that point, I think there was a paper at one of the more prominent machine learning conferences where we basically came to the conclusion that the Netflix recommendation system was helpful to the tune of more than a billion dollars to the company's top line.

And that's really impactful. Now, if we move forward a decade, where the biggest impact has been made is the long tail of machine learning applications. A lot of organizations, of course, if you're, let's say, new to machine learning, end up focusing on their tentpole use cases.

So, for example, for Netflix, it would be their recommendation systems. For Facebook or Google, it would be their advertising or search systems. And that's great. That's where you'll derive the largest value. But as you mature, there are a lot of internal business processes and a lot of other consumer interfaces that can be made a lot better with machine learning.

And the aggregate effect of that usually ends up surpassing the tentpole use case. That is something that we have seen quite often, and not only at Netflix. Of course, we have a popular open source project called Metaflow that helps organizations build machine learning models and deploy them in production as well.

So we've been able to see the impact across other companies too. There are companies like 23andMe who use machine learning as well as data science to predict traits and outcomes using genetic data. And that has been quite remarkable in terms of just doing first-rate medical research.

We have companies in a similar space, like Medtronic, who've been focused on surgical AI. That's yet another area. If you just look at drug discovery or automated surgeries, the amount of value that can be generated and the actual real-world impact on people's lives is really great.

Another example that I came across during the pandemic was this company called Zipline. They build automated drones that are responsible for the bulk of drug supplies, including COVID vaccines, in sub-Saharan Africa. And their drones are powered through machine learning as well.

At the end of the day, of course, we look at the big tech sector and how interesting consumer applications are made; that's what always gets the limelight. But when you look at the real-world impact that is actually shaping people's day-to-day, I think that's been a lot more rewarding and enriching to see, even from the sidelines at times.

Richie Cotton: Absolutely. So some very different examples there. I suppose in the Netflix example, where you're getting a better movie recommendation, it's a small benefit to a lot of people because it's scaled so much, whereas at the other extreme, the idea of an AI surgeon is helping one person a lot.

So you've got very different use cases there. Alright, so let's get into a bit of depth on some of the steps that are involved before you deploy your code. What sort of tests do you need to perform on your model so that you know it's ready to go live?

Savin Goyal: I think it depends, as with anything else. I mean, of course, nowadays, if, let's say, the Netflix recommendation system starts serving crappy recommendations, then it'll become a social meme. So in many ways, the amount of due diligence needed to push out a new machine learning model for recommendation systems for popular services like Netflix, Spotify, and whatnot is considerably higher.

But nobody is really going to lose their life, right? It's not a life-altering thing if you push out a bad ML model. Versus if you're looking at, let's say, the healthcare space or the fintech space, then any kind of bias or any kind of issue with the model can have life-altering outcomes.

So the bar really goes up significantly. And fortunately or unfortunately, that varies from industry to industry, as well as with the application that you're actually using it for. So there isn't a one-size-fits-all answer. The most important thing always is that, as a data scientist or as a data professional, do you have a full understanding, a full view, of how exactly a model is actually going to be used? And are you able to control and observe those outcomes and really iterate from that?

We've seen one of the primary failure mechanisms is where, if putting a model into production or doing data science involves many, many individuals in an organization on one single project, then many of these important concerns become nobody's problem, because there's no one single individual who has complete control over the end-to-end process. And that's been one of the biggest pushes that we have been advocating for.

How do you essentially make your data scientist a full stack data scientist, or an end-to-end data scientist, so that they have full control around the model, the kind of data that they are building these models on, when these models are actually getting rolled out, and what the impact of these models is on real-world outcomes and not just statistical measures?

Richie Cotton: So you're saying one of the biggest problems then in terms of maybe, like, quality control is going to be the fact that there's lots of different people doing different stages of the workflow. So in order to eliminate that, you have one person who is responsible, from end to end.

This is going to be a full stack data scientist opportunity. Do you want to tell me more about this role and what it would involve?

Savin Goyal: So you can imagine, if we harken back to the 70s or the 80s, and this might even be true today in many places, shipping software of any kind was really expensive. So of course you had your development team composed of software engineers, then you had a testing team composed of QA engineers, then you had a different group of people who were focused on shipping the software, these release managers.

And then you had your database admins and application architects and SREs in the mix as well. And that raised the cost of doing anything in software engineering. Of course, there are plenty of areas where having specialized roles makes a lot more sense, and definitely when you get to a certain level of scale, it's a lot more important to make sure that you have multiple pairs of eyes.

But if you have five or seven of these different roles involved in every single project, then doing even the most simple thing becomes a lot more expensive and a lot more time consuming. And that's been one of the big promises of the modern DevOps movement, as well as the advent of the cloud: how can you have one single software engineer who can run the entire gamut? Of course, there isn't an expectation that overnight they're also significantly skilled at building front end as well as back end and being an expert around CI/CD. But the tooling has evolved and matured to where you can have an expectation that a small software engineering team can return outsized returns.

And we are basically hoping for the same thing on the data science side of the house as well: how do you ensure that it's not only these tentpole use cases that are reserved for machine learning, and that a single data scientist or a small team of data scientists can really deliver outsized business impact by running the entire life cycle of a data science project, right?

So imagine all the way from: how do you figure out what data an organization has? Is that data available in the right format? Does that data have the right quality? Can I access that data in an interactive manner so that I can play around with it and really understand the possibilities that data has encoded?

Then at the same time, do I bring in sufficient business skills to the table? Do I understand the business perspective that my organization operates in, so that I can marry my data science skills with this business perspective and start to figure out the best ways for me to optimize this specific business problem?

And what we have seen historically is that data scientists come from quantitative disciplines that may not be software engineering related. And that's where one of the first bottlenecks presents itself: if, let's say, you're playing around with a lot of data, or if you're in an organization where everything needs to happen within a certain governance boundary of either your data warehouse or a specific cloud.

And how do you interface with that cloud? How do you interface with all of the engineering and business architecture complexity? It's one of the big areas where people see declines in productivity, and that then necessitates specialized roles to come in. You oftentimes end up in a situation where things fall through the cracks, or certain important concerns, like how do you figure out if the right thing is happening, unfortunately become nobody's problem, because there's no one single individual who has a complete end-to-end perspective.

So that was one of the big reasons why we ended up creating Metaflow back at Netflix, which is like an open source machine learning platform. And it's geared towards ensuring that a data scientist can become this full stack end to end data scientist, and they can control more of their destiny.

And that should, in theory, then allow them to iterate on their machine learning models on their data science projects a lot more quickly. And in that scenario, you're then also able to gradually move the interface between data science teams and other software engineering teams.

If you look back a decade ago, when I started my career, the data scientist role used to be limited to prototyping on a laptop. And then there would be teams of engineers who would take that prototype, scale it out, and then deploy it, and off you go, right? And the scenario there was that, of course, as you scale out your model training on a lot of data, many of the statistical guarantees that a data scientist was looking out for may no longer hold true.

And the software engineering team doesn't really understand the intention behind anything that a data scientist was trying to do. The data scientist has no idea what got shipped into production. And it was sort of like anybody's guess if even the right thing was happening in production. And now we are sort of getting into a point where a data scientist should be able to, let's say, expose an API to their work.

Maybe that API is a model. Maybe that API is an actual REST endpoint that you can call into. Maybe that API is a memo that has been written for consumption by other teams. But at least that provides a lot more control and a lot more visibility to a data science team shipping off their work.

Richie Cotton: Lots to cover there, and we'll certainly get into Metaflow later. But it sounds like with this idea of a full stack data scientist, they need data science skills, they need software engineering skills, they need business skills. This seems like it's going to be quite difficult to hire for, and I'm wondering whether there's a trade-off between having a large number of more specialized individuals versus having this generalist who can do everything.

Do you want to tell me how a data team is going to be composed then? I presume you wouldn't want all full stack data scientists or all more specialized individuals. Are you going to mix the two? How does it work?

Savin Goyal: I think the answer is better tooling at the end of the day. Of course, if the expectation is that we are able to , find somebody who is really great at engineering, really great at data science, really great at business sense. That's probably a unicorn. You may be able to find a few, but definitely that's not a strategy that can scale out.

Now, two out of three is something that would be desirable. Finding somebody who is equally good at data science and at understanding the business intimately is, I think, in many ways the minimum bar for hiring an exceptional data scientist. But paired with great tooling, you can definitely ensure that, at the level of abstraction they are working at, at least the accidental complexities of the cloud are taken care of for them. And then you can expect that particular data scientist to take care of the business complexities and the data science complexities.

So that's basically where I see many data science teams moving towards as well. I think that's one of the big reasons why many companies have also invested in their internal ML platform teams, where the entire prerogative of these teams is to provide a set of tools internally, a point of leverage, so that every single data scientist is insulated from the harsh reality of engineering. But the interesting dynamic there is this:

How do you build tools that are really good at navigating around this fact that there is some amount of complexity that a data scientist would want to take care of. And then there is some amount of complexity that they would want the tool to take care of. And how do you sort of thread that balance?

It's always an interesting question.

Richie Cotton: I definitely want to talk about tools in a bit, but just to press you on this idea of teams. So if someone says, as a data scientist, I want to become a full stack data scientist, what needs to happen to take that extra step?

Savin Goyal: I think the question there is: if you're not this full stack data scientist, then what kind of data scientist are you? My colleague here at Outerbounds, his favorite term is a laptop data scientist: somebody who is very adept with, let's say, the Pythonic ecosystem, or just everything that's available on the machine learning side of the house, and is able to understand the characteristics of the data and get their work done.

So there's that aspect. And then on the other side of the house, you need to figure out as an organization, how do you actually ship value through data science, right? And then there's the gulf of complexity that you need to cross in between. And one way is that, yes, you become equipped at handling that gulf of complexity all by yourself.

An example here would be: let's say you want to train many different machine learning models using GPUs, and you're constantly iterating on different hypotheses. That GPU farm very likely is not going to be your laptop.

It's going to be something in the cloud. Let's assume it's Kubernetes, the most popular compute orchestrator out there. Now, on one side, I can have an expectation that maybe my data scientist understands the nitty-gritty of the Kubernetes ecosystem and how to run these Kubernetes pods reliably, manage them, and monitor them.

And when things go wrong, knows his or her way around debugging these failures. But it's a very complicated landscape. And unfortunately, if it was only Kubernetes that people had to worry about, life would still be easy. Then you also have to worry that, okay, I have my data that needs to come from somewhere; it needs to go somewhere else. With Kubernetes, how do I think about that?

I'm constantly experimenting, so reproducibility can be a big problem. If, let's say, my colleague is running into a failure and I'm supposed to help them out, but I cannot reproduce or replicate that same failure, what are my odds of even being capable of helping them out one bit?

The complexity very quickly multiplies, and there are now multiple tools in the space that help in this specific area. So making themselves well apprised of the latest and greatest tooling would be one step. I think, as practicing software engineers, it falls on us as well to really understand where the world is headed.

What are the new paradigms around engineering? And I think it's the same thing for most data scientists: they also understand that for them to stay relevant, they need to equip themselves with, if not necessarily the details of every single thing out there, at least an understanding of the layers of abstraction that are available on top of these building blocks that can help them get their work done.

Richie Cotton: So, really, just make sure that you're on top of the latest tools; that's going to stand you in good stead for improving your skills. Okay, so let's talk about Metaflow, since you're the creator of it. To begin with, can you tell me, what does Metaflow do?

Savin Goyal: Metaflow is an open source ML platform. To put it succinctly, it helps you train and deploy ML models, and it's targeted towards building ML systems. In this conversation, we have spoken a lot about data science and machine learning per se, but at the end of the day, an organization is trying to build a system, and a machine learning model is only a part of that system.

So how do you get to building these systems, which can oftentimes be complex? They may cross team boundaries. They may interface with significant engineering infrastructure. How do you ensure that a data scientist or a team of data scientists is capable of doing all of that? That is basically what Metaflow strives to do.
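To make that concrete, here is a minimal sketch of what a Metaflow workflow looks like in Python: a flow is an ordinary class whose steps are chained together and whose artifacts are tracked automatically. The flow name and the toy data below are illustrative, not from the conversation.

```python
from metaflow import FlowSpec, step


class ChurnModelFlow(FlowSpec):
    """A toy training flow: load data, fit something, report the result."""

    @step
    def start(self):
        # In a real project this would read from a warehouse or object store.
        self.rows = [[0.2, 1], [0.7, 0], [0.5, 1]]
        self.next(self.train)

    @step
    def train(self):
        # Any Python ML code can live here. Attributes assigned to self
        # become versioned artifacts that Metaflow persists between steps.
        self.model = {"threshold": sum(r[0] for r in self.rows) / len(self.rows)}
        self.next(self.end)

    @step
    def end(self):
        print("trained model:", self.model)


if __name__ == "__main__":
    ChurnModelFlow()
```

Running `python churn_model_flow.py run` executes the steps locally; the same flow can later be pushed to the cloud or a scheduler without being rewritten, which is the "system, not just a model" idea Savin describes.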

Richie Cotton: Okay, so it's just to help you take the sort of steps into getting your code into production. Now, I know there are just dozens and dozens of MLOps tools, so can you talk me through how Metaflow fits into this larger ecosystem of tools?

Savin Goyal: Yeah, so I can walk you through what even prompted us to start the project in the first place. Of course, many of these tools have taken on a life of their own, and they cater to different markets or different use cases. What Metaflow is targeted towards is a practicing data scientist.

So it is not a low-code, no-code solution. It is a solution targeted towards a data scientist who understands Python or R really well and also brings that data science understanding to the table. So we are not in the business of teaching people how to do data science. Metaflow is a tool that enables people to do data science well.

So that's the big thing here. We started working on Metaflow back in 2017, so it's now close to seven years. And we were at a spot back at Netflix where Netflix was looking into investing in machine learning across the entire life cycle of their business, right?

So not only how do you do recommendations well, but how do you construct a portfolio of content that is going to drive your subscription base higher and higher? How do you figure out what is the best content that's available? How do you leverage economies of scale in either licensing or producing that content?

How do you take these bits and stream them to people's TVs and mobile devices so that they have an amazing streaming experience? How do you fight fraud? How do you take care of pricing challenges? If you start thinking about all the places where you can start investing from a machine learning standpoint at a company like Netflix, it's really, in many ways, like being a kid in a candy land.

And the other interesting aspect was that while Netflix is usually lumped into this cohort of FAANG companies, and there's a connotation with FAANG scale, I think Netflix is a lot closer to your average company that is on the public cloud. It's just that there's a whole bunch of different interesting problems that Netflix is trying to solve that adds to that complexity.

So we were getting to a spot where the solutions, the tools that we had built for our recommendation systems, had served us really well. They were predominantly built on top of the JVM stack that was really popular in the early-to-mid 2010s.

And now we were coming to a spot where the number of people who were excellent at engineering and data science and business was very small. And of course, if you have to start investing in many different areas of data science, you have to pick and choose your battles.

And we said, okay, of course we can't really skimp on hiring the very best data science talent that's available. But then somebody has to come in and really paper over that gulf of engineering complexity, and the goal for us was: how do we realize this dream of making our data scientists full stack data scientists?

How do we provide them solutions where they can get all of their work done on their laptop, but we can bring the cloud to their laptop? So you can imagine: can I provide hundreds of thousands of CPU cores and hundreds of thousands of GPUs and petabytes of RAM to your laptop, so that you don't have to become a cloud engineer to scale out your machine learning projects?
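As a rough illustration of "bringing the cloud to the laptop", a Metaflow step can declare the hardware it needs and be dispatched to a cluster instead of running on the local machine. The numbers and the flow name below are hypothetical, and the exact backend (Kubernetes, AWS Batch, and so on) depends on how your deployment is configured.

```python
from metaflow import FlowSpec, resources, step


class BigTrainingFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.train)

    # Request hardware the laptop doesn't have; honored when the run is
    # dispatched to a cloud backend, e.g.:
    #   python big_training_flow.py run --with kubernetes
    @resources(cpu=16, memory=64000, gpu=1)
    @step
    def train(self):
        # Heavy training code would go here.
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    BigTrainingFlow()
```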

How do we ensure that people can take a graduated approach? Because you can imagine, not every single project will be a humongous-scale machine learning project, but at the same time, every single machine learning project will go much better if there is some amount of discipline baked in, right?

I think there have been plenty of times when we have run into the issue where something works on my laptop but does not work on my colleague's laptop, or I'm able to install a version of PyTorch today but not able to install the same version of PyTorch two days later, because a transitive dependency has changed and something is subtly off, and I'm trying to figure out what went wrong. So it's a barrage of small little problems, as well as some rather nefarious problems, particularly on the compute and the data side, that you have to start worrying about as a data scientist.

And our goal was: can we ensure that they don't necessarily have to worry about that? Can they just squarely focus their efforts and energy on wrangling the data science complexity, because that's what their expertise is in? If, let's say, Netflix as an organization is expecting them to spend much more time wrangling nefarious issues like "how do I move data from my data warehouse to my GPU instance so that my GPU cycles are not wasted", that's not a thing that a data scientist should be focused on very early in a data science project, because at that particular point you don't even know whether the approach that you are taking for your machine learning model is even worth it.

But if you take an aggregate view, if you have hundreds of data scientists all running GPUs suboptimally, then that expense really adds up to a non-trivial number that, as a platform engineering team, I do indeed have to care about. But if you can codify all the best practices and provide a user experience that is a lot more human-centric, that works with the data scientist, that a data scientist doesn't have to fight against, then it becomes a lot easier, where by default all the right things happen on the engineering side of the house.

The data scientists' freedom of choice in terms of how they want to navigate the web of data science complexity is preserved. And the organization then gets to benefit both from cost optimization, because ML can become expensive at times if you're not careful about it, and from making sure that you're able to innovate quite actively and quite quickly.

Richie Cotton: One thing you've mentioned is the idea of working on a laptop. There's been a huge trend in the last decade or more of everything going into the cloud. So the idea of working on a laptop sometimes, but also having access to this large-scale compute that's in the cloud, indicates some kind of hybrid computing.

Is that the approach you'd push for, or is there a reversal of the move-to-the-cloud trend?

Savin Goyal: You know, there are many benefits of being in the cloud. For example, if, let's say, all your data is in the cloud, you don't want any of that data to ever leave the cloud; that's one big reason, purely from a security standpoint, why everything would want to happen in the cloud. But your laptop can still be the interface to that cloud, right?

So from that point of view, you might still be accessing all your resources through the laptop, but the code that you're writing might actually be running entirely in the cloud, the data may never actually show up on your laptop, and everything might happen through your IDE or through your browser.

So that's one universe. And then the other universe is that, at times, the problem that you're trying to solve may not require very steep computational resources or managing a lot of data. The data could fit on your laptop. It may not be super sensitive.

You could use something like scikit learn or, many other popular frameworks to build your machine learning models. I mean, the number of things that my Macbook can do these days, I mean, it's just beyond imagination. But then what you still sort of need at that particular moment is still some discipline, right?

 You still want to figure out how you're going to catalog your experiments. You still want to figure out what is the best mechanism for you to ensure reproducibility so that, you know, you're able to understand how your models are behaving, or if you need to course correct, then you're able to do that easily.

Many times, one definition of productionizing your model can be this: whatever work you have done on your laptop, at the end of the day you're going to shut your laptop and go back home, but maybe you want to run that model training process every single night, every single week, or whenever new data shows up.

How do you push that into the cloud? It could be your on-prem infrastructure as well, right? But basically, how do you take something that is running on your laptop and reliably run it elsewhere? That can be one definition of productionizing your machine learning model in a variety of projects.
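For that "run it every night without my laptop" definition of production, here is a hedged sketch of how it can look with Metaflow: the flow is decorated with a schedule and then deployed to whatever production orchestrator is configured (Argo Workflows or AWS Step Functions are common choices); the flow name and the details will vary with your setup.

```python
from metaflow import FlowSpec, schedule, step


@schedule(daily=True)  # trigger every night once deployed to the orchestrator
class NightlyRetrainFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.train)

    @step
    def train(self):
        # Retrain on whatever fresh data has landed since the last run.
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    NightlyRetrainFlow()
```

Deploying is typically a one-liner such as `python nightly_retrain_flow.py argo-workflows create` (or `step-functions create`, depending on the orchestrator), so the code a data scientist ran locally is the same code that runs on the schedule.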

And that can be a big activity where a lot of organizations may sink a lot of resources: the data scientist was able to prototype something on their laptop, but now just this process of converting their wares into something that can be run outside their laptop can be an activity that is measured in months or quarters.

And for us, that was another goal: can we take all the work that a data scientist is doing, whether that work is in the cloud or on the laptop, and make it available in a format such that it can be run anywhere else almost immediately? So you don't have to worry about this process of, okay, I had something that was running, but now I need to go back to my manager and ask for another month so that I can come back, address all of my pain points and issues, and refactor my code so that it's now worthy to be put in production.

Instead, you can just flip the script, such that the infrastructure basically allows you to do all the right things from the outset.

Richie Cotton: Certainly, anything involving package management, like environments, and just having things not working from one machine to another, can be incredibly frustrating if you've got to deal with it manually. So the idea of reproducibility is one important aspect of getting things into production. Another thing seems to be scalability.

So, once you go into production, you've maybe got your model being accessed by millions of people. How do you make sure that your code is going to be scalable?

Savin Goyal: I think in many cases, especially on the consumer side of the house, it might not be feasible to understand what kind of scalability requirements you're gunning for in the first place, right? In many cases it is indeed possible, especially if, let's say, this model is a subsequent version of a previous model. But if you're on a net new project, especially on the consumer side of the house, with the virality loops involved, the kind of scale that you may run into may be quite unpredictable.

I mean, at the end of the day, it all boils down to project management and software engineering skills. If you are deploying a model, let's say in this case a recommendation system, because that's one area that I'm very familiar with: if you think through users' tastes and preferences, whether you're on Spotify or on Netflix, there isn't a lot of brand-new content coming in very quickly, and your tastes are not really changing very quickly either.

And you already know what your entire user base looks like and what their preferences are. So you can pre-compute those recommendations and then just serve them from a database. You're not doing any kind of live model inferencing, and that has amazing scalability benefits: a very simple, straightforward approach.
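A minimal sketch of that precompute-and-serve pattern, with a hypothetical schema and a stand-in scoring function: the expensive model calls happen offline in a batch job, and the request path is just a key-value lookup with a sensible default.

```python
import sqlite3


def batch_precompute(user_ids, score_fn, conn):
    """Offline batch job: score every known user and store the results."""
    conn.execute("CREATE TABLE IF NOT EXISTS recs (user_id TEXT PRIMARY KEY, items TEXT)")
    for user_id in user_ids:
        items = score_fn(user_id)  # the expensive model call happens here, offline
        conn.execute("INSERT OR REPLACE INTO recs VALUES (?, ?)", (user_id, ",".join(items)))
    conn.commit()


def serve(user_id, conn, default=("trending-1", "trending-2")):
    """Request path: a cheap lookup, with a fallback list for unknown users."""
    row = conn.execute("SELECT items FROM recs WHERE user_id = ?", (user_id,)).fetchone()
    return row[0].split(",") if row else list(default)


conn = sqlite3.connect(":memory:")
batch_precompute(["alice", "bob"], lambda u: [f"{u}-pick-1", f"{u}-pick-2"], conn)
print(serve("alice", conn))    # precomputed recommendations
print(serve("charlie", conn))  # unknown user falls back to the default list
```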

But of course, it may or may not work in many use cases. So my recommendation for folks is: if you already understand what scalability metrics you're trying to achieve, then there's always an architecture that's possible; of course, the amount of expenditure that you're willing to incur in that project is also a big input to that.

But don't overthink it, don't prematurely optimize it. There are plenty of hacks and approaches that people can take to at least buy more time before you really understand what scalability benchmarks you need to go for next.

I think there's a similar scalability hurdle present on the model training side at times. And that's, in many ways, a silent killer for many organizations. Usually the deployed model comes first; it's generating business value.

So you want to really make sure that that works well. And there's a lot of light that's shown on those use cases. And then you have, let's say some models that you're training that would be consumed directly in production. And people are also sort of ensuring that yes, you have, let's say all sorts of alerts and observability that sort of so that those models are actually generated on time.

 reliable results. But the third sort of bucket where you have teams of data scientists who are actively experimenting that can be a big cost vector as well. And what we have seen many times is that the overall cloud costs. for experimentation can be orders of magnitude higher than your cost that a deployed model is incurring at times, because you may have , hundreds of models that you're experimenting with, but only a few models or maybe tens of models that are deployed in production and The unfortunate reality is that with experimental models, it's also very hard to sort of , ensure that from a cost efficiency standpoint you have complete awareness of how you want to actually demise that model training.

Or, if you want to scale out that model training, what kind of engineering effort would be needed. One interesting example that comes to mind: I was working with a data scientist, and, you can debate whether this was a good reason or not, but they were trying to predict what content is going to be popular in any given geography at any given moment, so that the content distribution network, the CDN infrastructure, can be seeded with the right kind of content. So imagine you're a company like Spotify and you know that a certain kind of music is going to be popular in a certain geography at certain hours.

Or if you're a company like Netflix releasing a brand new show and you've done a significant amount of marketing in, let's say, Australia, then you would want to make sure that people have an amazing experience streaming that content and you don't run into rebuffering loops or anything of that sort, right?

And they wanted to build these models for every single neighborhood in the world. They decided to build 60,000 models in one shot, and each of those models required a container with a GPU. Now this is an immense amount of compute that you're running, right? And ahead of time, you don't even know what the ROI of this entire effort is going to be.

And if, let's say, it's left just to a data scientist, it can be a significant engineering challenge, right? So then how do you run this much compute without being a professional cloud engineer? Even for a single professional cloud engineer, this can oftentimes be a bridge too far.

And with Metaflow, they were able to run that compute very seamlessly. At the end of the day, they also got a nice bill of "here's how much money you have spent". And of course, when you start spending that much money, some eyebrows are oftentimes raised.

And so of course people wanted to understand whether the spend was worth it, and in this case, obviously, the amount of capital that this company was able to save in terms of their CDN optimization was well worth the expense of running 60,000 GPUs fully engaged for multiple days.

Many times it may not be and just like, you know, having that perspective at times as well that, okay, you may want to scale, but is that scale actually linked to your business outcomes or not can be a lot more important too.

Richie Cotton: Yeah, I can see how you'd certainly want to speak to some other people before you fire up 60,000 GPUs and run them for a few days. Probably best to get that business alignment first. So I'm getting the big theme of: don't do calculations that you don't need to do, and make sure you have metrics around how performant you need things to be.

We've talked a little bit about reproducibility and about scalability. The other aspect of things in production seems to be robustness, because as soon as people start to use this, they're going to give you unexpected inputs and things will behave in weird ways. Do you have any tips for how to make your data programs, models, whatever, more robust?

Savin Goyal: Yeah, of course, we have to think through robustness through the entire ML infrastructure stack in many ways. So there's the question of, okay, does your model actually encapsulate the behavior that it is trying to predict? Have you taken into account things like seasonality and all of that? I put all of those concerns on the data science side of the house, things that a data scientist needs to worry about, versus, as a tool builder, what I personally like to focus more on: is the underlying infrastructure robust enough?

Because many times, if your infrastructure gives way and you're unable to, let's say, generate a fresher version of the model, then your model performance is going to take a hit, and that will have a direct hit on the business KPIs as well. And then the question becomes: how do you get to a point where your infrastructure is robust?

But more importantly, in the age of the cloud, you can't promise 100 percent uptime for any piece of infrastructure. You can increase your robustness rates, but as you pointed out, dependency management can be a big issue. Today you are able to install PyTorch; tomorrow you may not be able to install the same version the exact same way. And what do you do at that particular point?
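One way tooling can blunt that "installs fine today, breaks tomorrow" problem is to pin dependencies per step so each run resolves the same isolated environment. A hedged sketch using Metaflow's conda integration follows; the versions are illustrative, and the flow has to be run with the conda environment enabled (for example `python pinned_deps_flow.py --environment=conda run`).

```python
from metaflow import FlowSpec, conda, step


class PinnedDepsFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.train)

    # The step executes inside an isolated environment built from these pins,
    # so a drifting transitive dependency on the laptop can't change the result.
    @conda(libraries={"pytorch": "1.13.1"}, python="3.9.16")
    @step
    def train(self):
        import torch  # resolved from the pinned environment, not the system install
        self.torch_version = torch.__version__
        self.next(self.end)

    @step
    def end(self):
        print("trained with torch", self.torch_version)


if __name__ == "__main__":
    PinnedDepsFlow()
```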

And the big question then becomes: how do you quickly recover from these errors? How do you quickly recover from these failure mechanisms? Let's say you have a training pipeline to train a model, but that training pipeline depends on yet another upstream pipeline that is generating some embeddings.

Now, if that embeddings pipeline is failing for whatever reason, of course your downstream model training pipeline cannot execute, and there may be other processes that are dependent on it as well. So it becomes very imperative to figure out what is the quickest way to diagnose what went wrong with your embeddings pipeline and how you recover from that failure so that your subsequent pipelines can start executing.

That is oftentimes one of the areas that can be quite underinvested in an organization: doing machine learning is so difficult at times, and involves so many different moving pieces, that the focus is on getting the happy path working, and not necessarily on what happens when that happy path is not quite happy, when failures happen. How do you recover from that? And the complexity arises from the fact that so many different things can fail. I mean, a lot of people these days try to find cheaper GPU resources.

So they end up going to a cloud provider that might not be one of the more prominent hyperscalers, and then they unfortunately realize that the machine they are buying was advertised to have four GPUs attached to it but only has three working GPU drivers.

And that's why certain things are slow or failing, and you then have to figure out how to recover from that, right? Or your data changed for whatever reason, and now in the middle of the night you have to wake up and step through your work, and even replicating what failed can at times be really tricky.

And then figure out that, okay, this was the actual change in the data that was the cause of the trouble, and either patch your pipeline or wake up the person responsible for the upstream data pipeline so that they can fix the error. And that becomes one of the bigger themes: as an organization, how do you recover from these errors?

Your MTTR, how do you lower that? That was one area that we focused on quite a lot. And I think that also pairs with this notion of reproducibility: yes, you want to reproduce the good behavior of the model so that you have more trust in it, but you also need to be able to reproduce the failure patterns somewhat reliably. Of course, there's a lot of stochasticity involved in failures as well, but at least for a certain class of failures, if you're able to reproduce them reliably, then you also stand a shot at being able to fix them quickly and move on.
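To make the embeddings example concrete, here is a hedged sketch of how a downstream flow can survive an upstream hiccup with Metaflow: a flaky read is retried a few times, and if it still fails the error is captured as an artifact so a later step can degrade gracefully instead of paging someone at 3am. The helper functions are hypothetical stand-ins.

```python
from metaflow import FlowSpec, catch, retry, step


def fetch_embeddings():
    # Hypothetical stand-in for reading the upstream embeddings artifact.
    return {"user_1": [0.1, 0.2]}


def load_previous_embeddings():
    # Hypothetical stand-in for the last known-good embeddings.
    return {"user_1": [0.0, 0.0]}


class EmbeddingsConsumerFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.load_embeddings)

    @retry(times=3)           # transient failures: just try again
    @catch(var="load_error")  # persistent failures: record them and keep going
    @step
    def load_embeddings(self):
        self.embeddings = fetch_embeddings()
        self.next(self.train)

    @step
    def train(self):
        if getattr(self, "load_error", None):
            # Fall back to the last known-good embeddings instead of failing hard.
            self.embeddings = load_previous_embeddings()
        # ... train the downstream model on self.embeddings ...
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    EmbeddingsConsumerFlow()
```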

Richie Cotton: I know being on call is like a standard feature of being a software engineer. Waking up at like 3am to try and fix some data pipeline or even worse, having to wake your colleagues up as well to say, okay, can you help me debug this at 3am? That seems like something I wouldn't want to do on a regular basis.

So, can you talk me through what sort of processes you should put in place to make sure that you don't have regular failures? How can you improve that reliability?

Savin Goyal: The thing is, at the end of the day, one simple strategy here is: if you have a pipeline that is running, let's say, every week, and it is super critical, then you may want to run that pipeline every day but only consume the output every week. That way, if it fails in between, it's not something you need to wake up in the middle of the night for; you have an entire business day, or an entire week, or half a week on average, to actually fix the issue before it becomes a burning issue. So playing with that frequency arbitrage on the training side is almost always useful.

Of course, it comes with extra costs as well, and many times that cost may well be worth it, because if it is not super urgent, then why even wake up in the middle of the night; you can address it during business hours. On the model inferencing side, many, many techniques exist from the software engineering standpoint, but the one thing always is: have sensible defaults.

So if, for whatever reason, your machine learning system is down, you can always fall back on heuristics or certain rules so that the end customer experience isn't impacted. That's, you know, one of the more common failure mechanisms that I end up seeing, unfortunately: because a machine learning system is down, there is immediate customer impact or something is broken.

Many times it's unavoidable, but the systems can be designed in such a way that the user may get a subpar experience, but at least the critical functionality is not entirely impaired.
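A minimal sketch of that "sensible default" idea on the serving path, with a hypothetical model client and popularity heuristic: if the model call errors out or times out, the user still gets a ranked list, just a less personalized one.

```python
def rank_items(items, user, model_client, timeout_s=0.2):
    """Return items in the best order we can manage right now."""
    try:
        # Happy path: the ML service personalizes the ranking for this user.
        return model_client.rank(items, user, timeout=timeout_s)
    except Exception:
        # Degraded path: a simple heuristic (most popular first) keeps the
        # page functional even when the ML system is down or slow.
        return sorted(items, key=lambda item: item.get("popularity", 0), reverse=True)
```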

Richie Cotton: Okay, so you want to kind of fail gracefully and just, maybe cut off one bit that's not working and have everything else work. All right, nice. Before we wrap up, what are you most excited about in the world of MLOps at the moment?

Savin Goyal: A couple of years ago, especially when ChatGPT came out, there was this conversation going on on Twitter, or X, that, you know, is this the death of data science? Will data scientists as a job function even exist? And what I'm quite excited by these days is that that maturity in the conversation is coming back, in the sense that now people are understanding that GenAI plus predictive AI is a thing.

It's not as if one strategy or one approach is going to completely upend the other. At the end of the day, you need to build machine learning systems. These machine learning systems may be a combination of many different models, which might be built in a variety of different ways. So even if you're building a recommendation system, your recommendations could be coming from, let's say, a deep learning model, but somebody still needs to convince the end user that those recommendations are indeed worth their time.

And that sort of compelling narrative could be derived from an LLM and the album art could again be sort of some gen AI image model that can you know, convince you that yes, that particular content or that particular song is definitely sort of worth your attention.

So I think now people are really sort of warming up to the idea. That it's like multiple ML models all working in cohesion that actually , drive a strong consumer experience or are able to enable an organization to optimize their internal business processes.

And that's always exciting from a machine learning systems point of view: how do you then tackle this increased diversity, or increased complexity, in any system?

Richie Cotton: So, having really complex systems of lots and lots of different models all working in harmony. Yeah, that maybe sounds like not step one if you're trying to start putting things in production, but maybe that's a very good end goal to work towards. All right, super.

Do you have any final advice for organizations wanting to start getting their machine learning in production?

Savin Goyal: Yeah. So, of course, machine learning at the end of the day is not a silver bullet; it's experimentation at the end of the day. And you may end up investing a lot in data science and not really see any results for a really long period of time. What really matters is ensuring that you have a plan before you start investing in ML.

You have the right expectations and the right time horizon. And, more importantly, you either have the right support structure around your data scientists, which could be in the form of ensuring that you have people who understand infrastructure as well as data science really well and are able to work well with one another.

Or in the absence of that, you're able to invest in great tooling from the onset so that your data scientists at the end of the day are able to experiment a lot more effectively because one sure way of failing in machine learning is by not being able to experiment enough.

If your data scientist is only able to ship one version of a model in a week, a month, or a quarter, whatever that time horizon is, that may just not be enough. If you're able to ensure that their iteration loops are measured in minutes, or maybe hours, then that is a good way of ensuring that the quality of your machine learning model continues to go up, eventually to a point where it may beat certain predefined rules or heuristics. And that's when you start reaping the benefits of your investment in machine learning.

Richie Cotton: Yeah, that sounds like great advice. Thank you very much for your time.

Savin Goyal: Yeah. Thank you. Thanks for having me. 
