Scaling Machine Learning Adoption: A Pragmatic Approach
Adel Nehme, the host of DataFramed, the DataCamp podcast, recently interviewed Noah Gift, founder of Pragmatic AI Labs and prolific author, about operationalizing machine learning in organizations and his new book, Practical MLOps.
Adel Nehme: Hello, this is Adel Nehme from DataCamp. And welcome to DataFramed, a podcast covering all things data and its impact on organizations across the world. Over the past year, the AI space has been animated with the transition from experimentation to operationalization. More and more, new schools of thought around MLOps, AIOps, and DataOps are emerging, and data scientists are increasingly leveraging new technologies to ship models faster.
Adel Nehme: One can say that this is a pragmatic philosophy towards AI development. And this is why I'm so excited to have Noah Gift on today's podcast. Noah Gift is the Founder of Pragmatic AI Labs and lectures on cloud computing at top universities globally, including the Duke and Northwestern Graduate Data Science Programs. He designs graduate machine learning, MLOps, AI, and data science courses, and consults on Machine Learning and Cloud Architecture for AWS. And he's a massive advocate for AWS Machine Learning and putting machine learning models into production. Noah has authored several books, including Practical MLOps, Pragmatic AI, Python for DevOps, and Cloud Computing for Data Analysis. He has created content around AWS for top course providers, including Udacity, O'Reilly, Pearson, and DataCamp. You can find many AWS examples from Noah by following him on LinkedIn.
Adel Nehme: Throughout the episode, we discuss his background, his philosophy around Pragmatic AI, the differences between data science in academia and in the real world, how data scientists can become more action-oriented by creating solutions that solve real-world problems, the importance of DevOps, his most recent book, Practical MLOps, how data science can be compared to Brazilian Jiu-Jitsu, what data scientists should learn to scale the amount of value they deliver, his thoughts on AutoML and automation, and more. Also, we'd absolutely love your feedback on how we can make DataFramed a better show for you and which guests you think we should bring on the show. I left the survey link in the episode description. Please make sure to fill it out as I greatly appreciate it.
Adel Nehme: Noah, it's great to have you on the show. You're someone who has a very prolific career in the data space and is a fountain of knowledge when it comes to creating and operationalizing machine learning models. I'm excited to discuss with you your thoughts on machine learning, MLOps and your most recent book on it with O'Reilly. But before we get started, can you give us a brief background about how you got into the data space?
Noah Gift: Yeah, my background in getting into data was a pretty long winded way. And part of it is that I studied nutritional science in college because of my interest in sports in college. I did football, basketball in high school. So I've always been interested in sports. And nutritional science was to me like a fun way to learn more about myself. And at the time, we didn't know what the word data science was, but in a way, actually my opinion is nutritional science is one of the easiest ways to get started with data science because it's your own body. And we actually did data science on our body in school.
Noah Gift: And I had a really fun teacher who let us do many different experiments in class. There's a class I took called experimental nutritional, I'm sorry, experimental nutrition. And what we did was actually centrifuged our own blood. So we would pull our own blood out, centrifuge it, figure out what the LDL, HDL level was. And that alone was actually a very cool data science moment because the intuition you would have was that the very healthy people in class... Most people in nutritional science are extremely healthy, but the most healthy of the healthy who eat carrots and celery all the time and exercise two hours a day are going to have awesome HDL/LDL ratios.
Noah Gift: And it turned out that there were some outliers and one girl in class was actually one of the ones that I was very shocked with. I was friends with her and she was just exceptionally nutritious. She was always doing the best things. But genetically, there was a history of heart disease in her family. And so we were able to diagnose that from those experiments. And then there's other experiments we did with megadosing vitamin C and then we measured the vitamin C output and turns out that basically vitamins are BS. So I learned some of the cool stuff that was very data science-oriented in nutritional science. And I think it's actually one of the more interesting maybe bachelor's degrees that someone could get if you're interested in some of the things that you can later use for, let's say a master's in data science. So that was what got me started.
Noah Gift: And then I worked at Caltech for a while. That also really helped me in thinking more scientifically. And I met a lot of people that were into Python and I just accidentally picked it up. And in fact, actually one of the people I worked with who's now in the news is Dr. Koonin who has written a book that's actually very polarizing called Unsettled. And it's, I don't know if anyone's ever heard of it, but basically what's so shocking about the book to people is that he's saying that climate science is unsettled, and that's probably one of the most taboo things you could possibly say. And the gut reaction probably when you hear something like this is you say, "Oh, of course. You're a bad person and how dare you even question science and everything."
Noah Gift: But it turns out that I actually worked for him for three years and I got to know him very well. And I would say the least likely person I've ever met in my entire life to be a troll, not interested in science, not very thoughtful, like incredibly intelligent person. He was friends with Richard Feynman, all this stuff. So I think that's another interesting data point for data science was getting around these kinds of people who even have the guts to say something that maybe is very, very unpopular. And I think that's another aspect of data science that I guess maybe stuck with me a little bit.
Noah Gift: And then later I worked in the film industry for a bit. And then when I was in the Bay... I finally moved to the Bay Area. I got more and more serious about probability and statistics. And so one of the only ways I could actually get more serious about that was to get an MBA while I was working full-time. And I took every probability and statistics class that I could. And I actually did a lot of Python programming in my MBA program, linear optimization and things like that. And it turns out that a lot of the stuff that I did is very similar to what I teach today, which is basically analytics and probability. And they didn't offer it in 2010.
Noah Gift: And then since then I've gotten more and more interested in machine learning and data science. And when I was a CTO of a startup, we had multiple probability models that we would use to detect who to do business with and so on and so on. And so currently I teach at the Duke Data Science Program, the MIDS Program. Also, I teach in the Engineering Program at Duke. I teach a course called Operationalizing Artificial Intelligence and Machine Learning. And I also teach Data Science at Northwestern. So I've moved more into teaching and doing consulting in the last several years.
Practical and Pragmatic Approach for Data Scientists
Adel Nehme: That's great. And I'm very excited to discuss all of this with you. I think one thing you're very known for, and this is something that you can see in a lot of your courses and your books, is a pragmatic and practical approach to data science and machine learning that creates impact. This is something that stands in the title of one of your books, Pragmatic AI. Do you mind walking us through why you think a practical and pragmatic approach is so important for data scientists to adopt? And how can data scientists be led down a non-pragmatic road when it comes to data work?
Noah Gift: Yeah, I think there's a... Let's riff a little bit on that book by Dr. Koonin because I think it's a great example. So his point in the book is that there's several different points that he makes. One of them is that it's difficult to make predictions, which I would hope any data scientists would say that's true. And one of the things that he brings up is that the damage that's already been done to the environment is baked in for the next, let's say, 50 years. And I think that's forget about whether you believe him or not. And I think I really don't know enough about it to say how severe climate change is. I really just am ignorant. So don't listen to me on that.
Noah Gift: But the part that I think is interesting is if it's true that the damage is baked in and we really, for the next 50 years have to live with what happened, then I think this is where the pragmatic approach comes in is we can't argue about, well, we should do this or we should do that about something that in the next 50 years, we have to address with actual adaptation. And I think that's where things could get interesting where what could we do to operationalize climate science?
Noah Gift: Let's say if you're the most liberal, most progressive climate person. And I would probably put myself into that category a little bit. I have electric vehicles, solar, battery, smart home, blah, blah, blah, blah. But the part that I think is interesting beyond all that stuff is, if we can't change for the next 50 years what's happening, how do you operationalize the effects of climate change? You can't just pretend that it won't happen. And so this is where I think the pragmatic approach comes in is well, what can we do?
Noah Gift: And then it really gives you constraints when there's no way to really get around something that's happening. Like if sea levels are going to rise, well, what do you do? You can't just say, "Well, it's somebody else's fault and I'm mad at them." Well, what do we do? Do we apply technologies that can maybe address coastal areas with, I don't know, houses that go up and down or something like a desk? Can we figure out ways where we can live better with climates? Maybe we have more solar powered homes with default material? So basically, all homes are basically solar material. And then because the temperature's high, it doesn't really matter because you can automatically cool it based on the new energy. So that concept of there's a constraint that is immovable, and then you must do something with that constraint, I think is really the pragmatic approach.
Noah Gift: So then now let's get into data science and machine learning. If you say, I must get something into production that helps my company or improves the experience of a person that's been diagnosed with breast cancer, for example. If you move the constraint to, whatever I do must improve their health outcome, that's pragmatic. I think the academic approach is thinking more about the why. Why do people get cancer? Or what can we do to get this tool to be better or whatever?
Noah Gift: It's not that those things don't need to occur. In fact, I would say we're way over optimized for the why and the research and everything, but what we have very little of is people with a sense of urgency that, look, there's people dying, the world is in trouble, we have misinformation problems, those are urgent, urgent problems, and they don't necessarily require hand-waving. What they do require is actually specific actions that improve an outcome. And that's what I would call pragmatic AI.
Bridging the Gap Between Academia and Industry
Adel Nehme: I couldn't agree more, and it's super important to adopt a pragmatic AI approach to solve these real-world problems as an industry and as a species. You hinted at this throughout your discussion, but I think oftentimes there is a disconnect between what data scientists learn in academia or do on Kaggle versus what is expected of them in the wild once they join an organization. Where do you think this disconnect shines the most? And why do you think it's a struggle for academic institutions to bridge the gap between academia and industry?
Noah Gift: Yeah, I think that's a great question, and I think it's not one thing, but you could definitely make a checklist. And I think let's just start anywhere. One of the things that first comes to mind is that most people that teach at the university level have never had a real job. By real job I mean an industry experience. So that is a problem. I think that that's probably one of the factors of why things have not been implemented into production.
Noah Gift: Now, I don't necessarily want to throw stones at people who are researchers because I think it's a very important job. I think that maybe those people shouldn't be the only people teaching data science and machine learning. So I think that's one thing: there's a mismatch if you've never had a job. Imagine you go to your local hardware store, and the person that owns the hardware store is really a business person, and they know about buying lumber and selling lumber and they literally have never built a bookshelf, and you say, "Hey, how do I build a bookshelf?" They could theoretically probably tell you how to build a bookshelf, but the person that works there that builds bookshelves all day long, that's a carpenter. That's who you want: the carpenter, to help you. Even though the person who owns the hardware store is very prestigious and wealthy and has high status, that doesn't necessarily mean that you want that person in your house installing the bookshelf. And I think that's one of the issues.
Noah Gift: Another issue is that the tools that are necessary to teach data science are very different than the tools that are industry tools. So a good example is, I'm working with Duke right now on analyzing Whale Sound Data. And the data set I believe is around like 30 terabytes. And so you can't just be like, "Oh yeah, come on, let's put it on my laptop. Let's just play around with that a little bit." It's like that doesn't fit. And then also you can't be like, "Hey, import pandas as pd. And let's make a data frame with this 30 terabytes." It doesn't work.
Noah Gift: So it's easy to show trivial tools when you're teaching data science, but then the real world is way more messy. And then it turns out that, oh, now I got to use a different tool like maybe I have to use Databricks or I have to use SageMaker, and I have to do all these pre-processing and spin up clusters. And that's not necessarily something you just, in the first day of class, you're like, "Okay, come on everybody, let's spin up 20 machines. And then we're going to set up a Databricks and then we're going to analyze 30 terabytes of data." That's not like a... You flip that out of your notebook and start teaching it. And so I think that's probably one of the other aspects of it.
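To make the point about data that does not fit on a laptop concrete, here is a minimal sketch of processing a dataset in fixed-size chunks instead of loading it all at once. It is not how the whale sound project works; the function names, the `duration_s` column, and the tiny in-memory sample are all hypothetical, and a real 30-terabyte workload would more likely run on a cluster tool such as the ones mentioned above.

```python
import csv
import io
from typing import Iterable, Iterator


def read_in_chunks(rows: Iterable, chunk_size: int) -> Iterator[list]:
    """Yield lists of at most chunk_size rows so only one chunk is in memory."""
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        # Emit the final, possibly shorter, chunk.
        yield chunk


def streaming_mean(rows: Iterable, column: str, chunk_size: int = 1000) -> float:
    """Compute the mean of one column without holding the whole dataset."""
    total = 0.0
    count = 0
    for chunk in read_in_chunks(rows, chunk_size):
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("no rows to aggregate")
    return total / count


# Hypothetical stand-in for a large file; real data would be streamed
# from disk or object storage rather than from a string.
sample = io.StringIO("clip_id,duration_s\na,2.0\nb,4.0\nc,6.0\n")
mean_duration = streaming_mean(csv.DictReader(sample), "duration_s", chunk_size=2)
print(mean_duration)
```

The design choice is that the running total and count are the only state kept between chunks, so memory use depends on `chunk_size` rather than on the size of the dataset.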
Adel Nehme: Given that a lot of data scientists are exiting academia and entering the industry, what do you think are considerations data teams should make when designing and scoping machine learning or data science projects that are often not taught or discussed in academia, but are learned on the job?
Noah Gift: Yeah, I think the realism of building a software system is really not taught. And I think in particular DevOps. And what DevOps is, is the ability to automatically test your code and improve the quality of your code and deploy your code. And so I think that's something that a company would be wise to do with anybody that's hired, is make sure that they understand the fundamentals of DevOps. So can they test their code? Can they lint their code? Do they know what a build system is? Can they do automatic continuous integration? And then can they deploy, let's say a Flask app into production somewhere? And I think that alone is really important and that might address 80% of the issues that people have.
Adel Nehme: That's great. Then you mentioned here continuous integration and DevOps. What are some of the tools you tell students to learn in order to bridge that gap between traditional data science education and DevOps?
Noah Gift: Yeah, with Python in particular, the tools that I would recommend would be GitHub Actions, which I think is probably the easiest way to get started with continuous integration. And then another thing that people could use is pytest for testing and some linting tool, either Pylint or Flake8. And just those would, I think, go a long way to making someone successful in doing continuous integration and continuous deployment.
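As an illustration of the testing habit described here, the following is a minimal sketch of a small function together with two pytest-style tests. The `normalize` function and its tests are hypothetical examples, not taken from the book or the interview.

```python
def normalize(values):
    """Scale a list of numbers into the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        # Avoid dividing by zero when every value is identical.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def test_normalize_scales_to_unit_range():
    assert normalize([10, 20, 30]) == [0.0, 0.5, 1.0]


def test_normalize_handles_constant_input():
    assert normalize([7, 7, 7]) == [0.0, 0.0, 0.0]
```

Saved in a file whose name starts with `test_`, these functions are discovered and run by the `pytest` command, and the same file can be checked with `pylint` or `flake8`. A continuous integration service such as GitHub Actions would then run those commands on every push.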
Adel Nehme: Another angle to this when working in industry is about bridging the gap of value. And ultimately the data team is producing a model that is solving business problems, and these solutions will be owned by a functional leader in the business units. What do you think are some of the pitfalls data scientists fall into when working with business units that they're not aware of before joining an organization?
Noah Gift: Yeah, I think one of the things that I see a lot is, and I learned the solution to this initially when I was in graduate school, I had a really wise statistics professor that told me that when you're presenting something to an executive, you don't want to necessarily show them a bunch of code. You want to have a summary that's a paragraph that says here's the problem I'm solving and here's my recommendation.
Noah Gift: And I think that's probably one of the big things that data scientists don't do is very quickly state what the problem is and how they're solving it. Because then the business person who may not know what code is or may not have knowledge of symbolic notation or whatever, they can just go, "Oh, you're trying to solve this. And then this is what you recommend. Well, we don't want that. Give me another problem statement and a solution." And I think the quicker you can get to the problem statement and how you solve it, that will at least let you work on the right thing because the tragedy would be to work nine months on something that's really complex and it turns out that they didn't want that to begin with. And I think that disconnect is the most important feedback loop.
Noah Gift: So that's one. And then the second is that I think because the feedback loops have been so long, by the time you come up with a solution, it may not even matter anymore. And so another approach that I think many data scientists would be wise to do would be to embrace tools that do a lot of the automation for you. And I think AutoML in particular is one of those ones where I'm just really confused, just how I'm confused about people who don't want to get vaccinated. It's like, "What?" It's like make polio great again. It's like, "What?"
Noah Gift: It's the same thing when people don't want to use AutoML. It's like, "What? Why would you not want to use a tool that initially helps you solve the problem?" It doesn't mean that that's going to be the ultimate solution, but initially, why wouldn't you use some automation? And I think that's one of the places that data scientists could really get much more productive use of their time, is that they embrace as much automation as possible. And again, get to the end of the feedback loop, which is presenting the results to the business leaders because it could be that you're working on the wrong thing. So who cares if you hand-coded all the hyperparameters yourself if they don't even want what you built?
Adel Nehme: Having a tool agnostic approach in data science is so important, and I'm excited to table our discussion around AutoML at the end of the show. Continuing on your book, Pragmatic AI, a big focus is leveraging cloud computing technology to deliver high impact data science. There's a variety of cloud platforms data scientists can learn like AWS, Azure, Google Cloud, but you're a big proponent of starting off with AWS. Can you walk us through why you view AWS as the better cloud platform and how does it help data teams?
Noah Gift: Yeah, I think part of it's just from a data science perspective, they have the largest market share. And that's probably the number one reason I think AWS is probably the safest choice for most people to go with. And then likewise, like well, why would you use Python? Well, one of the reasons to use Python is that most people know it. And same with AWS, if most people know AWS, it's easy to hire people.
Noah Gift: So I think in general, and this is me as an engineering manager or CTO for many, many years, is that one thing that people don't really take seriously enough is that the best tool isn't necessarily the one you should use. And so for example, I used the language Erlang for a lot of my career because in theory it does actually have some very interesting characteristics about high concurrency. In practice nobody knows it. And so you're dead in the water.
Noah Gift: So it's still the same thing with language like Julia or an exotic cloud platform like let's say Google Cloud or Ali Cloud is, it's not that they're bad. In fact, in some cases, they may have better technology for certain aspects like BigQuery, for example, I think is an incredible tool. But now we get to the implementation details and does the majority of people that you hired, do they know it? And I think that's probably the reason why AWS is the best choice initially. Now, I think a close second, though, is the Microsoft Cloud. And I think there's a lot to like about Microsoft and they're really challenging. So I would say, again, my recommendation would be, from the data perspective, just literally looking at the numbers of who you could hire, AWS first, but Microsoft is probably not a bad choice either.
Main Pitfalls Organizations Face When Operationalizing Machine Learning Models
Adel Nehme: Definitely. Microsoft is making a lot of gains in the cloud space, especially given the recent work with OpenAI and GPT-3. I think this marks a great segue to discuss your most recent book Practical MLOps with O'Reilly. Before diving into the details, do you mind highlighting why you wrote the book and what are the main pitfalls you find organizations face when operationalizing machine learning models?
Noah Gift: I've run data science teams, multiple data science teams, where I've hired all the data scientists from scratch. And in addition to that, I've also as a consultant hired and built data science teams. And one thing that I saw was that really we hired some incredibly talented people, but the constraints were set up in the wrong way where, and I would blame myself partially for this, if you don't tell someone what they're building or what they're working on, then how could you even measure whether it's successful? Again, going back to COVID or something, it's like if we don't have data that says who's vaccinated, who is not, how many people have it, you need to have a dashboard that tells what's good.
Noah Gift: And I think that's one of the biggest problems with data science is that nobody even knows what's good. I think we're getting closer now, but if you just say, "Hey, we have all these PhDs in our company and we're doing data science," that means nothing. And I think that's one of the reasons I wrote the book is getting into the details about what is good? What should you do? And how could you measure whether someone is successful or not doing data science? That's one.
Noah Gift: The second is that I was at something called O'Reilly's Foo Camp, which is this event that you get invited to sometimes if you wrote a book for O'Reilly. It's a cool geek-out session. And I was talking to Tim O'Reilly who owns O'Reilly and Mike Loukides who is another person who's been with O'Reilly for a long time, and we had a discussion about why is it machine learning isn't getting results? Why can't we be 10 times faster?
Noah Gift: And really one of the things that we came up with was looking at things like, when there's a pandemic like COVID-19, or there's things like breast cancer or climate change or whatever where there's some real big problems that could be addressed by machine learning, why don't we just have a sense of urgency and solve those problems? Why can't we be 10 times faster? And I think that's really something that was a driver for the book was, let me see if I can write a book that makes the case for machine learning being 10 times faster, data science being 10 times faster where instead of focusing only on the technique, we're focused on the results. And the result is more important than the technique.
Adel Nehme: And why do you think machine learning is not 10 times faster today? What are some of the reasons behind it?
Noah Gift: I think people are too precious with the technique and not focused enough on the actual outcome. And I share this story in the book, in the final chapter, where accidentally I got into Brazilian Jiu-Jitsu. I wasn't looking to get into martial arts. I just happened to work somewhere where there's all these pro UFC fighters training, and I became friends with them. And then I started training with them and I spent years and years doing grappling, but more of like a real world grappling where you're training towards being a pro fighter.
Noah Gift: And one of the things I discovered was, if you went to somewhere that was more academic, and that more academic grappling is when you're in a gi. And so you're in a uniform and there's certain techniques that only work when you're in the uniform. And that's very similar to the university. There's certain things that only work in the university because it's constrained. You have a very constrained model. And then when I would train with people who were experts who were better than me, they would try a technique that always works with the gi uniform and then when they're out of it, it didn't work anymore. And someone would try to get my arm in a lock, which again, if you're in the uniform, you're in the academic setting, it would always work. But I've even been able to pull out and free my arm from Olympic judo people, people who have medaled in judo, because they know a certain constraint.
Noah Gift: And so I would say the same thing with machine learning. One of the reasons why people I think have had so many problems is, they're looking at things only from an academic perspective and a technique perspective. And really the technique is interesting and it's important in some sense, but ultimately it's like does it work? And I think that's the part of machine learning that we need to be more focused on is, forget that you had a better technique or didn't have the better technique or whatever is, did your company use your machine learning model? Did it improve things? And how quickly did you implement that solution? And I think that's really the sense of urgency on how to work on the right problem at the right time.
Subdomains of MLOps
Adel Nehme: That's awesome. And I really like the comparison between Brazilian Jiu-Jitsu and data science and machine learning. So the book does a great job of introducing a lot of modern tools machine learning practitioners need to work with to start operationalizing machine learning models. Do you mind walking us through the different subdomains of MLOps and what are the most important tools practitioners should learn today?
Noah Gift: Yeah, so I think one of the first places to start, which is probably the weakest place for most data scientists, and I know this because I teach a lot of data scientists is DevOps. I think you must have a mastery of DevOps. So you should know how to do continuous integration, continuous delivery. Fortunately, it's not that hard. You can probably learn the majority of DevOps in a day or two. And so I think that's the fundamental.
Noah Gift: Now another step as well is having some mechanism to manage your data. And data governance is a very large topic, but in general, in the cloud, it means that your data is most likely in some form of a data lake. And so you have it in object storage with S3, for example. Once you have your data in the cloud and it's in something like object storage, you have the ability to have infinite capacity of Disk I/O or CPU and then you can use tools like SageMaker or Spark or Athena or some other big data technology.
Noah Gift: So I think that's probably the second big one is, you must have it in an environment where you can use modern tools. And then in terms of the more of the machine learning implementation, I think a big question that the people need to have is, again, initially when you're building a prototype, how much of the work do you need to do yourself with tuning hyperparameters? Should you really be training the first version of your model 100% yourself? For example, if you have X and you have Y, you have two columns in a data set, do you really need to hand-pick each one of these algorithms? It's a reductive problem. Why not just click on it and have the machine do it for you?
Noah Gift: Later as you get more and more sophisticated, maybe you don't want to have the machine do everything for you, but I think, again, embracing the high level tools from the beginning with machine learning. And then the final part of machine learning would be where are you deploying this model? Are you bespoke building some complicated system, or are you using a tool like let's say SageMaker that has automatic ability to scale the model up and down, version the models, do A/B testing? I think reinventing the wheel is a very bad idea for people. You should just use the best tool that's available. And again, a good example, one potentially would be something like SageMaker.
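The idea of letting the machine pick a first model can be sketched in a few lines. This is a toy illustration only, assuming two hand-written candidate models and a tiny made-up dataset; it is not how SageMaker or any AutoML product works internally, and all names in it are hypothetical.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Baseline: always predict the average of the training targets."""
    avg = mean(ys)
    return lambda x: avg


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    var = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / var
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


def mse(model, xs, ys):
    """Mean squared error of a fitted model on held-out data."""
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))


def auto_select(candidates, train, valid):
    """Fit every candidate and return the name with the lowest validation error."""
    scores = {}
    for name, fit in candidates.items():
        model = fit(*train)
        scores[name] = mse(model, *valid)
    best = min(scores, key=scores.get)
    return best, scores


# Made-up X and Y columns, split into training and validation parts.
train = ([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
valid = ([5, 6], [10.1, 11.9])
best_name, scores = auto_select({"mean": fit_mean, "line": fit_line}, train, valid)
print(best_name, scores)
```

The loop over candidates is the part that automation replaces: instead of hand-picking an algorithm, every candidate is fitted and scored on held-out data, and the person only reviews the winner.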
Adel Nehme: In the final chapters of the book, you highlight some of the key critical challenges data teams face in MLOps. Key amongst them are ethical and unintended consequences, lack of operational excellence, and a hyperfocus on small details as opposed to big picture thinking. Can you share your thoughts here? How do you view the future of MLOps maybe alleviating some of these challenges?
Noah Gift: Yeah, I'm just going to address them one by one. So I think the ethics part is easy to get hand-wavy and self-righteous about. And I think most people are offended when you're self-righteous. It's just not a good technique to get someone to adopt your point of view. But that doesn't mean we can't talk about ethics and misinformation and some of the impacts. And I think one way to address it would be, if you're a talented person, be careful about who you work for. And I think that's one of the ways that you address the problem is, if you know that you have the skills to do incredible things, I personally would recommend you work for something that's helping make the world a better place.
Noah Gift: And I think one of the ways you could figure this out is with information, I could categorize things in three levels. You could have A, which is harmful information. So again, telling people that the world is flat or that they shouldn't get vaccinated because it's got a microchip. That's harmful disinformation or misinformation. You're really hurting the world. And then there's B, which is neutral. Like let's say you're making Disney movies or something like, okay, it's entertainment, but that's neutral.
Noah Gift: And then there's the category C, which is you're actively working on things that make the world a better place like helping cure cancer, or providing housing for low income people, or educating people or whatever. And I think that's really one of the ways to address this is just don't work for companies that do misinformation and disinformation. Eventually they'll crumble because no talented person will work for them. So I think that's one of the ways you can address it without being self-righteous. And hopefully people will do that.
Noah Gift: Now, the other aspect of it is focusing on the small details versus the big details. Yeah, I think this, again, goes to the technique versus the implementation: it doesn't mean that small details are not helpful, but when you're first getting started, you have to realize that urgency is very important. A good example would be if you look at World War II and Britain in particular. There was a period of time when they were under siege by Nazi Germany, and they really didn't know if they were going to be completely destroyed or not. And so they have a window of, let's say, 30 days where they have to figure out how to shore up their defenses and turn off the lights so they don't get bombed.
Noah Gift: Should you be working on things that don't solve that problem? That's a really good constraint: we will literally be eliminated by a terrible war unless we do X, Y, Z, which is turn off the lights, or whatever the things are that are going to prevent that invasion. So I think it's the same thing. Not that you need to be worried about Nazis invading you, but you focus on, okay, what are the only things that I can do that will get this model into production? It doesn't mean those other things are not important, but how do I focus just on the things that immediately help me? That's a good heuristic that you can use to be more productive.
Adel Nehme: And what tactics do you think data teams can use here to adopt this heuristic, operationalize it, and formalize it?
Noah Gift: I think one of the things would be to look at maybe a KPI. And one KPI could be, what is the frequency of models going into production? And I think that's probably... I know this trick from software engineering in particular is a reasonable one, which is, how often do you deploy your software? I've worked for companies where they deployed once a week, and I've worked for companies where we deployed 20, 30 times a day. The ones where we deployed 20, 30 times a day, we actually had a very, very healthy culture where things were constantly being improved. The ones where we only deployed software once a week, that's typically a sign there's something wrong.
Noah Gift: And I would say the same thing: if you can't deploy tons of different models or experiments every day, and it doesn't necessarily mean 100% production, but let's say quasi production where there's people in your company that are using those models, if you can't do that on a daily basis, I think that could be one of the things that you could measure and fix. So let's measure that as a KPI. Why aren't we producing multiple models per day into production? And let's fix it. Let's figure out the root cause and use the Five Whys technique. Why can't we deploy models into production? Well, every time we deploy a model to production, it blows up. Okay, well, why does it blow up? Well, we don't have continuous delivery. Why don't we have continuous delivery? You break it down until you get to the bottom. It's like, "Oh, actually it turns out that there's really no rational reason why we don't deploy models into production, and we can easily fix this. Let's go ahead and do it."
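As a rough illustration of the deployment-frequency KPI described above, here is a minimal Python sketch. The deployment log, the function names `deployments_per_day` and `days_below_target`, and the target of one deployment per day are all hypothetical choices made for this example; they do not come from the book or from any particular tool.

```python
from collections import Counter
from datetime import date, datetime, timedelta


def deployments_per_day(timestamps):
    """Count model deployments per calendar day from a list of datetimes."""
    return Counter(ts.date() for ts in timestamps)


def days_below_target(timestamps, start, end, target=1):
    """Return the days in [start, end] with fewer than `target` deployments."""
    counts = deployments_per_day(timestamps)
    n_days = (end - start).days + 1
    days = [start + timedelta(days=i) for i in range(n_days)]
    return [d for d in days if counts.get(d, 0) < target]


# Hypothetical deployment log: two deployments on Sept 1, none on Sept 2,
# one on Sept 3.
deploy_log = [
    datetime(2021, 9, 1, 9, 30),
    datetime(2021, 9, 1, 15, 0),
    datetime(2021, 9, 3, 11, 15),
]

missed = days_below_target(deploy_log, date(2021, 9, 1), date(2021, 9, 3))
print(missed)  # the days that missed the (assumed) one-per-day target
```

The days that fall below the target are the natural starting points for a Five Whys conversation about what blocked deployment on that day.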
Adel Nehme: And to play devil's advocate, do you think over-optimizing for a KPI like the number of models deployed can lead a data team to deploy models without necessarily having ethical considerations before they deploy them?
Noah Gift: I think any KPI needs to be taken with a grain of salt. Data scientists especially should be aware of this: you can't just over-optimize and overfit on one feature, because then you're basically making a bad prediction that doesn't conform to the real world. So I think this is just one of many data points. Initially, though, I think it's probably a good one to start with because it will flush out some of the excuses, like, "Wait a second, we only deploy a model once a year. Let's solve the ethical problems after we can deploy a model at least once a day."
Noah Gift: Once we've deployed the model once a day, and believe me, I'm a huge fan of raising ethical concerns about this, but let's first make sure we can do anything. Then once we can do something, let's dig in, and maybe then we weigh things out, and maybe there's another KPI that's equally important that says what's the impact. Let's say you're a credit card company and all of a sudden minority groups don't get credit card offers anymore for the next 60 days. Yeah, that's really bad. You better fix that. And so you could look at that as a KPI.
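One way to picture the impact KPI from the credit card example is to compare offer rates per group before and after a model goes live. The sketch below is only an illustration under stated assumptions: the record format, the group labels, the function names `offer_rates` and `flag_rate_drops`, and the 0.2 threshold are all hypothetical, and a real fairness review would need far more care than a single rate comparison.

```python
from collections import defaultdict


def offer_rates(records):
    """records: iterable of (group, got_offer) pairs; returns {group: rate}."""
    totals = defaultdict(int)
    offers = defaultdict(int)
    for group, got_offer in records:
        totals[group] += 1
        offers[group] += int(got_offer)
    return {g: offers[g] / totals[g] for g in totals}


def flag_rate_drops(before, after, max_drop=0.2):
    """Return groups whose offer rate fell by more than max_drop (absolute)."""
    rates_before = offer_rates(before)
    rates_after = offer_rates(after)
    return sorted(
        g for g in rates_before
        if rates_before[g] - rates_after.get(g, 0.0) > max_drop
    )


# Hypothetical records: group "B" stops receiving offers after deployment.
before = [("A", True), ("A", True), ("A", False),
          ("B", True), ("B", True), ("B", False)]
after = [("A", True), ("A", True), ("A", False),
         ("B", False), ("B", False), ("B", False)]

flagged = flag_rate_drops(before, after)
print(flagged)  # groups whose offer rate dropped past the assumed threshold
```

Tracked alongside deployment frequency, a check like this gives the team a second number to watch so that shipping faster does not hide a harmful change.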
Advice for Up and Coming Practitioners
Adel Nehme: That's a smart one. I think there's a great opportunity to integrate risk management best practices when deploying machine learning models at scale. Pivoting to the future, given the nature of the data science and machine learning space, what advice would you give up-and-coming practitioners to keep their skills competitive and remain hyperfocused on creating impact?
Noah Gift: Yeah, I think one thing is that I wrote an article a couple of years ago that said I don't think the data science job title will exist by 2029. And it was actually a fairly polarizing article. And I think I've actually been proven right. But my point here is not that I think data science is bad or that you shouldn't do data science. In fact, I think it's actually incredibly helpful, but you can't just be a data scientist. In my opinion, just data science is not enough. I think you're using that as part of what you do.
Noah Gift: And let's go again back to mixed martial arts. So to say, "I am a grappler," that's true and that's helpful, but if you're fighting, you're using striking, wrestling, Jiu-Jitsu, Muay Thai. You have all these things that you're doing. So I think for data scientists, if you only focus on the fact that you do data science, it's a little too narrow. But if you can say I'm a biologist who does data science, or I'm a machine learning engineer who does data science, or I'm a journalist who does data science, I think that's a better direction for your career because you can't be pinned down, like, "Oh well, you just do data science." No. Data science is one of the things I'm an expert in and that's why I'm so effective at journalism, for example.
Adel Nehme: I think this touches upon a career philosophy you have as well. What do you think about the trade-off between being a specialist and a generalist, and the importance of being multidisciplinary?
Noah Gift: Yeah, I don't know if there are that many examples of specialists where the risk-reward ratio pays off. Maybe if we look at the Olympics, that's a great example of a specialist, but does that actually pay off? For most people, it probably does not pay off to be a specialist at that level. But I think if you look at most of the things that happen in the world, having a multidisciplinary approach is really the best way to go, and it allows you to be flexible about what happens in the future versus only being focused on one particular thing.
Noah Gift: So I think in particular with data science, it's changing so quickly that it is better to have more of an opportunistic or greedy approach to it. For example, in the next five years, I could see clean energy becoming really a big deal. It used to be more of a super hardcore progressive agenda, and people were turned off by it because of that, but it actually is not fake anymore. It's a real thing. And in fact, my hunch is that it is going to be pervasive in, let's say, five years. And so if you're only focused on data science, maybe you're going to lose out on that because you're leaving that up for grabs. Another one is healthcare. Healthcare is potentially going to undergo rapid changes. So if you're more interested in having the flexibility to jump into anything, I think you're more suited to being on the cutting edge.
Adel Nehme: I wholeheartedly agree. And I think there's an aspect of personal reward that only being a multidisciplinary person provides. Now, we've mentioned AutoML quite a bit throughout our discussion today. Given the evolving tooling stack in data science, AutoML tools and NLP innovations like GPT-3 and GitHub Copilot are definitely shaping up to be disruptors of a lot of the data science workflow. What do you think are the major trends in machine learning and data science that will disrupt how we think about data work, and that may back up the idea of data science roles not necessarily being the same in 2029?
Noah Gift: Yeah, I would say, just to be clear: I think data science as a job title may not exist, and we'll see. Or, to be clear, let's do the 80/20: say 80% of data science job titles, in my opinion, probably won't exist by 2029. But what I do see with all the automated tools here is that either you embrace the automation or you fight it, but the people that embrace it are going to be wildly successful because you're able to get results more quickly. And you're focusing on things that matter versus reimplementing things poorly that a machine can do better. And I think that's really what we're going to see with a lot of domains: the people that can embrace whatever happens and use the best tool for the job are going to be wildly successful.
Noah Gift: And so I think that's one of the big takeaways: humans, when they use advanced technology, are massively productive. One of the simplest examples would be a drill. If you've ever taken a screw and screwed it into wood yourself, if you're extremely strong and you have a very strong grip, yeah, you might be able to just screw it into a board or something, but why would you do that? You could just take a drill and it takes literally under a second to drive a two inch screw into a board. It's the same thing. How is it even related whether a carpenter is a worse carpenter or a better carpenter because they used a drill? It's a non sequitur. So I think the same thing applies to data science: the best data scientists are going to use the best tools because they're going to get more output, they're going to get things that they build. And that's really the future.
Adel Nehme: What do you see the data science workflow looking like, given rising automation? Do you think that even data preparation and cleaning will be automated?
Noah Gift: I think some of the cleaning will be automated. I think it may be more like the self-driving car thing where... And again, we don't know what's going to happen. I'm a little skeptical that in the next five years we'll get self-driving cars. I think we'll get better and better assistive technology. I think it's the same thing with something like data cleaning. That's a pretty complex topic. I think we could have some really good assistive technologies. And again, why would you not use these assistive technologies? I'm not 100% sure you can automatically do all of it, but you can do a lot of it.
Noah Gift: And I would say one way to think about AutoML is that in a way it's like a false flag. People look at AutoML and they're like, "AutoML." It's like, "Wait a second." That's just one of the things. It would be like saying a drill is the only thing to think about when you're building a bookshelf. You could have a drill, you could have a saw, you could have a level. The machine learning hyperparameter tuning and the automatic training is one aspect, but I would call the whole thing KaizenML, which is basically that every single aspect of what you're improving is touched. The business problem is touched. The software engineering problem is touched with DevOps. The data engineering aspect is touched with these automatic cleaning tools.
Noah Gift: And I think all of that together, KaizenML or MLOps, those are the things that I think we're going to see a lot of changes on. So I think getting too caught up in just the word AutoML, again, is really a distraction because it's one of many things that's automated. And if you get hung up just on the fact that it's AutoML, you're losing focus on the fact that the whole thing should be automated, every single thing. And if it's not automated, it's broken.
Adel Nehme: So what do you think then will be the differentiator of high impact data scientists once a lot of data work is automated? What are the marks of a great data science portfolio?
Noah Gift: That's a great question. I would say reimplementing things that can be done by automated tools is the worst portfolio. So let's start there. So if you show how you did a bunch of hyperparameter tuning, I wouldn't work on that. Now, I think a more interesting portfolio would be one on very exotic topics. Here's one that's a personal topic that I think is interesting: homelessness. And I think we still don't have comprehensive data on this, but imagine building a really comprehensive analysis with specific recommendations and conclusions around homelessness, like, for example, X percentage came from here, X percentage have this. That would be a great example of thought leadership around data. So any kind of thought leadership that's bespoke and creative, I think, is a tremendous data science project.
Noah Gift: So that's one, because you can't really... A human versus a machine doesn't even make sense in that context. So don't compete with the machines at things they are better at than you. Compete with the machines at things that they can't do, which is add context to a problem. The other one, I think, would be showing that you have very strong software engineering skills, which in a way is a great way to compete against other data scientists. Because if, let's say, 80% of current data scientists in 2021 don't do software engineering and you do, you're going to stick out.
Noah Gift: And so I think that would be a great way: show that you can do continuous integration and continuous delivery, that you have maybe a website that you built that automatically does deployment. That's something I cover in the Coursera course that I built with Duke: doing a Hugo website that automatically deploys. I think anything you can do that shows you have strong software engineering fundamentals would also be a way to stick out.
Call to Action
Adel Nehme: I think what's exciting here as well is that AutoML can act as a tool to lower the barrier to creating thought leadership. Finally, Noah, before we wrap up, do you have any final call to action?
Noah Gift: Yeah, I would say, just in terms of data science, there are so many intelligent people that I've come across. I've probably interacted with thousands of people in teaching or books or seminars. My advice to you would be to get as many skills as you can, but then apply those skills to things that are unambiguously good for the world. And you can get paid very well for those. I think it's easy to work for, maybe, a company that is unambiguously bad, but it's just as easy to work for a company that's unambiguously good. And if you're talented, you owe it to the world to really focus on things that help humanity, not destroy it.
Adel Nehme: That's very inspiring. Thank you so much, Noah, for coming on the podcast.
Noah Gift: Happy to be here.
Adel Nehme: That's it for today's episode of DataFramed. Thanks for being with us. I really enjoyed Noah's insights on how to scale value with AI by using a pragmatic philosophy towards AI development. If you enjoy this podcast, make sure to please leave a review on iTunes. Our next episode will be with Syafri Bahar, VP of Data Science at Gojek on the Data Science Powering Gojek. I hope it'll be useful for you. And we hope to catch you next time on DataFramed.