Paige Bailey is a senior Cloud Developer Advocate at Microsoft focused on machine learning and artificial intelligence. Prior to working at Microsoft, Paige was a data scientist and predictive modeler in the energy industry (specializations: drilling and completions optimization; subsurface characterization).
Paige has over a decade of experience with Python, as well as five years of experience with R and distributed programming with Spark, and uses all of the above at work every day. She is on the committees for SciPy, JupyterCon, and ML4ALL, is an edX instructor for Python, and is currently writing an introductory book on machine learning.
You can find Paige on Twitter (and pretty much everywhere else on the internet!) at @DynamicWebPaige.
Hugo is a data scientist, educator, writer and podcaster at DataCamp. His main interests are promoting data & AI literacy, helping to spread data skills through organizations and society, and doing amateur stand-up comedy in NYC.
Transcript
Hugo: Hi Paige, and welcome to DataFramed.
Paige: Hi Hugo, I'm happy to be here.
Hugo: I'm really happy to have you here. And we're here today to talk about Data Science in the cloud, and what a week for it, right?
Paige: Yeah.
Hugo: Maybe you can tell us why this is such a great week to have this conversation.
Paige: Okay. I am so pumped to talk about AI and Machine Learning and Data Science in the cloud this week, because Build was Monday, Tuesday, Wednesday. Build is Microsoft's primary developer conference of the year. And it's huge: it usually has around 16,000 people in person and about half a million watching the keynotes online. So it's crazy big. And then Tuesday, Wednesday, Thursday, happening almost simultaneously, was Google's developer conference, I/O, and it also has a lot of people attending in person and a lot of people watching online. And both had tons and tons of announcements about AI, about Data Science, about Machine Learning. It felt like both conferences were focused entirely on AI and Machine Learning.
How Did You Get Into Data Science?
Hugo: Incredible. And I can't wait to get into talking about all the very recent developments. But to create a bit of suspense, we're not going to do that just yet. I want to find out a bit about you first. So how did you get into Data Science originally?
Paige: Cool. So I started off with Data Science before I think it was called Data Science. As an undergrad I studied Geophysics and I was fortunate enough to get ...
Hugo: That's wild. So what then happened in your professional life that led you to be in your current position at Microsoft?
Paige: The plan was ... My undergrad research university, Rice, was very focused on getting everybody it could into grad school. That was the expectation. The plan was always to go to grad school, but after those first two internships I did a third internship with a company called Chevron, which is very focused on oil and gas. They're one of the major oil companies in the world, and the pay was quite good, not going to lie, especially as a college kid who was doing an internship. And the project that they gave me was also very interesting. They asked me to create a culture database for some of the oil and gas projects that they were working on. Culture is GIS shapefiles and other associated people-ish data that you need to have in an oil and gas project in order to make sure that you're drilling in the right places, that you actually own the leases, to determine when leases expire, that sort of thing. So that was the first project, building this culture database, and then when I finished that after like three weeks and they still had multiple weeks of me being there, I started working more on the three-dimensional data and on doing a lot of other research-focused work in the deepwater space in the Gulf of Mexico. And that was just fascinating to me: the size of the data sets, and how expensive the data sets were, and the realization that nowhere else has this kind of data, that if you wanted to do any sort of three-dimensional subsurface work, the oil industry was kind of the only place you could do it. That sold me real quick. So they gave me a job offer after that summer, I was delighted to say yes, and that was kind of the long and short of it. Also, my mom got very sick my senior year of college, and as the person who needed to take care of her, being able to support myself and support my mom and also do something really, really cool at work, it just sounded like the best place to be.
Hugo: Yeah. And how long were you there for?
Paige: Yes. I was at Chevron for four and a half years. So I was there from 2013 up until 2018.
Hugo: And then you made the move to Microsoft?
Paige: Yes. I transitioned from Chevron, where I was working as a data scientist, to Microsoft. And the progression through Chevron started off with doing earth science application support, along with small-scale plugin creation and database work and GIS work, to eventually being a data scientist, because most of the plugins that I was making were data analysis plugins, and most of the real-time data processing and real-time drilling visualizations that I was making were data science visualizations. So it was just kind of, "Oh, hey, that's what she's doing anyway. Let's just call her that now."
What do your colleagues at Microsoft think that you do?
Hugo: Exactly. So now you're a Senior Cloud Developer Advocate at Microsoft and I want to explore what ... You know, there's a lot of meat in there. There's the cloud, there's developer and there's also advocacy work. And I want to explore what that means, but first through what kind of popular impressions there are of what you do. So what do your colleagues at Microsoft think that you do?
Paige: Cool. So Senior Cloud Developer Advocate, you're right, that is a completely ... Whenever I first saw the job title, I was like, "What the heck is this?" It sounds cool, but what actually would I be doing? And my colleagues at Microsoft, it depends probably on what division they're in, because I guarantee you a lot of folks haven't heard of it either. But the idea is that we have this team of people who are ingrained within various communities, so it might be JavaScript, it might be Go, it might be the security world, it might be Python or R. And the folks who are active in those communities, who contribute to open source libraries, who speak at a lot of conferences, who have been doing that work for their day job as an engineer: the idea is to get those people into Microsoft and to still keep engaging with the communities, still keep doing engineering, so making pull requests to the projects that we're currently working on, to the products that we have on Azure. But also to make sure that whatever's not working for the communities that they're a part of, and enthusiastic and passionate about, gets fixed. So if I'm a Python and R person, if I'm using, say, a Deep Learning virtual machine on Azure and I notice that something's not straightforward for a Python developer, I'll take that feedback and I'll give it to the product team, and then we'll build a roadmap to incorporate that change into the product. Or if somebody at a conference tells me Cosmos DB sounds really, really cool, but they would love to see a tutorial that's focused on it as a graph database, then I would take that feedback back as well and probably end up helping to write that tutorial or that quick start. So does that make sense?
What does the cloud mean to you?
Hugo: That makes absolute sense. So you've mentioned Microsoft Azure several times. I want to kind of move into this space of thinking about cloud computing platforms with particular reference to Data Science. And I know Azure can be used for a lot of other things as well. But I'm really wondering what Data Science means to you, what the cloud means to you, and where they intersect, because there are multiple possible definitions of both of these things.
Paige: Cloud computing is a huge revolution in the computing space, and it's also probably going to be one of the most transformative technologies that any of us experience in our lifetime. And it's mostly because, suddenly, you really do have the ability to democratize any sort of computational power. And what does that mean? That explanation is very inspiring and vague and hand-wavy, but the cloud, just think of it as a collection of servers sitting somewhere that you can leverage as needed, spin up and spin down as needed, with a lot of additional software-focused and security-focused tools that sit on top of it. Say I am a grad student who needs to analyze terabytes' worth of biology data. Historically, if you were in grad school and you needed to use some high-powered computer, it might be that you had to stay up until 2 am to use it, because that's the slot that you got as a brand new first-year grad student; if you needed to use a computer that had a certain threshold for GPU or graphics card or something, then you just had to wait your turn. With the cloud, suddenly you can provision anything that you need. So you can provision additional storage space, you can provision one of the most powerful computers out there, you could automatically deploy a web application, like a single-page web app, as something called a serverless function, where you only pay for the amount of time that people are actually looking at the webpage. Or you might trigger a serverless function for your website where you only pay for the compute time when somebody pings a REST API. It's amazing the stuff that you can suddenly do. But the thing that excites me the most about the cloud is that it's the very, very first time that we're actually able to do Deep Learning in a very serious way.
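To make the pay-per-invocation idea concrete, here is a minimal sketch of an HTTP-triggered serverless function, assuming Azure Functions' Python programming model; the route name, query parameter, and response are purely illustrative.

```python
import azure.functions as func

app = func.FunctionApp()

# An HTTP-triggered function: you are billed only for the compute time
# consumed while handling each request, not for an always-on server.
@app.route(route="hello", auth_level=func.AuthLevel.ANONYMOUS)
def hello(req: func.HttpRequest) -> func.HttpResponse:
    name = req.params.get("name", "world")  # illustrative query parameter
    return func.HttpResponse(f"Hello, {name}!")
```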
Hugo: And that's one of the great things. I think that you've motivated cloud-based computing in a lot of ways. And one of the things underlying what you're saying, I think, is that when our algorithms need to scale, or our data sets need to scale, or we've got kind of one-off projects, we don't need to be continually buying more hardware or changing hardware setups or anything along those lines. I think deep learning is a great example.
Paige: Absolutely. Those are excellent, excellent points. And coming from the oil industry, which is kind of outdated, especially in the IT space ... even requesting a database, when I worked at Chevron, would often take six months to go through all the bureaucratic paperwork. So if you wanted a new machine, you had to schedule it like six months in advance: what the specs should be, how much hard disk space you would need, all of these things. And then you actually get the machine six months later, when it turns out your requirements could have changed; suddenly that graphics card is no longer the best graphics card on the market. The requirements were constantly changing, so now potentially you couldn't run that application and you would have to order a new graphics card. It was just awful. It was an awful, awful user experience.
Hugo: Oh yeah. And I love the fact that when doing stuff on the cloud and collaborating with other people, you don't necessarily need to be worrying about requirements files or virtual environments, and making sure that what happens on my laptop or my computer can run on theirs as well, because we can all verify that we're using exactly the same environment.
Paige: Absolutely. It's being able to concentrate on the thing that you're supposed to be concentrating on, which is actually doing the Data Science work, and kind of abstracting away all of the other, you know, non-Data-Science-focused requirements. You don't need to worry about how much disk space you have on your machine, you don't need to worry about package management necessarily if you're using a Deep Learning VM. You don't have to worry about drivers, which were always the biggest headache. Drivers for your graphics card were always so atrocious, but now you spin up a VM, you use it however you need, and then you spin it down when you're done, and it's all for the price of a cup of coffee.
What's your definition of Data Science?
Hugo: Awesome. That's really cool. So we've really got a nice working definition of the cloud. What's your working definition of Data Science? And I hesitate to ask this because putting it into words can sometimes be a bit trite, but I think everyone approaches it from a different perspective. So I'd love yours.
Paige: Absolutely. So Data Science, Machine Learning, and Deep Learning: I have different definitions for all three, and I think that depending on who you talk to, they would probably have drastically different opinions than I do. Data Science I usually think of as heavy on statistics, heavy on visualizing, cleaning, and understanding data sets. So not necessarily predictive modeling, because a lot of people are still focused on descriptive statistics: taking historic data, visualizing it, and understanding patterns and relationships. And I do think that still counts as Data Science.
Hugo: I love this, especially because you're a Machine Learning and AI advocate, to hear you say that Data Science does not necessarily always involve predictive analytics.
Paige: Absolutely. And another thing is that you can derive business value in so many ways, and a lot of companies don't really have any insights into what data they have at all. It's not in a centralized location, it's usually of very poor quality, it's often housed in, like, 20 bazillion spreadsheets. And if it's valuable to them to answer a question that isn't necessarily a predictive modeling question, and that does require a lot of rigorous analysis, then I would definitely count that as Data Science. And I usually think that Data Science requires some sort of Python or R. Hadley Wickham has this great quote that you can't really do Data Science in a GUI, and I am a firm believer in that. So I know that a lot of data analysis can be done in something like Excel, and I think that's wonderful, but for Data Science I usually think it requires knowing a little bit of Python, a little bit of R, and a little bit of SQL.
Hugo: And does that speak to the fact that something scientific needs to be reproducible, for example?
Paige: Reproducibility, and then also usually working with a variety of data sources. So the data engineering aspects usually require some sort of programming, unless you want to have a very painful life of merging data sets in Excel spreadsheets. But science is all about reproducibility. If you can't be empirical about the analysis you're doing, then I don't think you can call it a science.
A brief history of Data Science in the cloud
Hugo: So before we get to the current and future developments with respect to Data Science in the cloud, would you give me a brief history of Data Science in the cloud?
Paige: Sure. So a brief history of Data Science in the cloud. I'm probably not the best person to give this, but I think that everyone would probably agree that it started with Google: being able to take massive data sets, understand them, and then also apply distributed processing techniques to that data. Over time we've transitioned from using things like Hadoop to using more and more Spark, if you have familiarity with that, and we've gone from using CPU machines to GPUs; now we have things like FPGAs and also custom silicon for algorithms, like TPUs. So there's a branch of processors called custom ASICs, and TPUs, Tensor Processing Units, would be one example of that. The transition has gone from "Oh wow, we sure do have a lot of data. Wouldn't it be great to understand it?" to just building more and more tools to enable that understanding. And I think that one of the biggest step changes has been the open sourcing of tools that are exceptionally powerful for Deep Learning and large-scale data analysis. So again, like Apache Spark for spinning up clusters of machines and doing either data processing in general, or, with things like MLlib, distributed Machine Learning, or, with Spark Deep Learning, distributed TensorFlow or anything else. I think that open source tooling for Data Science is one of the best advantages that we have as a discipline.
Hugo: Okay, great. So I'm itching to find out about the new developments that you've discovered or been involved in this week. But you mentioned open source. I'm really interested in the future of a trade-off between open source software and productization of Data Science products and for example Machine Learning products. So I'm wondering if you can speak to how you think that will evolve in the future?
Paige: Yeah, so a trend that we're seeing often is taking open source tools. So again, things like Apache Spark, things like maybe scikit-learn, or Kubernetes, which isn't necessarily related to Data Science, but it kinda is, because if you build a Machine Learning model, you should probably want to scale it at some point, and the way that you're going to scale it is probably going to be with containers, and then once you've got containers, with Kubernetes. But the thing that we see over and over again is that we have these great open source tools, and then everybody realizes, "Man, it sure is hard to get all these open source tools to play nicely together." So you end up with companies like Cloudera that help businesses make sense of Hadoop ecosystems and all of the packages associated with those. You get companies like Databricks that at first were all about Spark, and now it's a whole suite of other tooling that just makes it so that if you're a data scientist and you want to do either Machine Learning or data analysis, all you have to concentrate on is that, and not necessarily making sure that your Spark cluster is working the way it's supposed to and rescaling it yourself. It just has auto-scaling and wonderful notebooks. But I think the trend is going to be that we're still going to have open source tools, and we're going to see more and more companies spring up to help you make sense of those open source tools and make them production ready. And I also think that, unique to Deep Learning and Machine Learning, the value is in the data. So having algorithms is great. Open sourcing algorithms is great; you suddenly get these wonderful, huge communities of folks to work on projects. But it doesn't do you any good to have the algorithms if you don't have data to apply them to, and if you don't have specific business problems. So I would love to see more open source tooling around data engineering. I think that the Tidyverse tools are phenomenal and they're beautiful: they're intuitive to developers, they have consistent naming conventions, they solve very specific tasks, and they're also easily composable and extensible. I love the Tidyverse, and I would especially love to see similar tooling in the Python community. I mean, I know we have pandas, but I personally still love doing data engineering in R; I think it's much more intuitive. But I would love to see similar things for Go and for a number of other languages.
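For readers who haven't seen it, the composable, pipeline-style data engineering Paige praises in the Tidyverse has a rough analogue in pandas method chaining; here is a small sketch, with the file and column names entirely made up for illustration.

```python
import pandas as pd

# Hypothetical well-log data; the file and column names are made up.
df = pd.read_csv("well_logs.csv")

summary = (
    df
    .dropna(subset=["depth_m", "porosity"])          # clean
    .query("0 < porosity < 1")                       # drop bad readings
    .assign(depth_km=lambda d: d["depth_m"] / 1000)  # derive a column
    .groupby("well_id", as_index=False)["porosity"]  # summarize per well
    .mean()
)
print(summary.head())
```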
The newest developments in Cloud Computing
Hugo: Sure. So I can't wait anymore Paige. I want you to tell me about the newest developments in Cloud Computing that you've discovered this week, and in particular ones that you think will be impactful in the Data Science ecosystem.
Paige: Cool beans. Okay, awesome. So do you want to hear first about the Google stuff or about the Microsoft stuff?
Hugo: It's up to you.
Paige: Okay. I'll go with Google first. So Google announced a number of things. First off, they rebranded their research division from Google Research to Google AI, and that's not to say that they're not going to do computer science-specific research anymore, but it's being very, very transparent that they're focused extremely hard on developing AI tooling, developing new AI products. So that's very-
Hugo: Huge.
Paige: Yeah. They also had that amazing demo where a computer calls and schedules an appointment and uses 'ums' and 'uhs' and sounds very much like a human; that was Google Duplex. They also have a new feature for Google Maps where you have extended insights into locations around you. So Microsoft has a similar product called Location Insights, but the integration that's been done with Maps is that you can pull down Street View onto your phone, so as you're walking along it's showing you directions, almost like an augmented reality situation, because your camera's pointed at the street and it looks exactly like the street on your camera. You can point your camera at various buildings around you and it does automatic detection of what you're looking at. There was another great announcement around Photos, so automatically being able to recolor images. There's another announcement about helping compose emails. And a ton of announcements, not necessarily at I/O specifically, but at the TensorFlow Developer Summit recently: they announced Swift for TensorFlow, for the Swift programming language, and TensorFlow.js for JavaScript. And they were even working on Node bindings so that you can leverage the GPU in your laptop and do Deep Learning in the browser. There is lots of great research around. One of Google's flagship projects is doing diabetic retinopathy imaging, or diabetic retinopathy diagnosis rather. This is a disease that if you catch it in time, if you catch it early, you can prevent it, and it's very easy to prevent. But if you don't catch it in time, then the person actually goes blind. And the only thing that you need in order to diagnose the illness is a picture of the eye. So Google was able to build an algorithm that detected, better than doctors even, whether or not a person had diabetic retinopathy. And that was included in a new product that received FDA approval. So this is, I think, one of the first algorithms that has received FDA approval for use in a medical device. And I think that's going to transform the number of algorithms that we see in healthcare devices, especially for things like medical imaging. So all of that is just massively cool stuff.
Hugo: And the first set of developments you mentioned are really focused on user experience. The second set I find really interesting because they kind of speak to how developers can use what Google's working on. And I'm just wondering, this last example, the diabetic retinopathy: is this a product that working data scientists can get involved in, or is it closed source and proprietary, or-
Paige: Everything's open source. That's what I love so much about TensorFlow, and about, well, you know... so I can learn as well. But TensorFlow... all of the projects that I just mentioned are open source. I think even the dataset, the diabetic retinopathy dataset, and there's also a lung x-ray dataset. But if you want to get involved with either product, or either project rather, all of it is freely available online.
Hugo: That's fantastic. Just out of interest, why do you think Google open sources all of these things? Because in the end they are a business, right?
Paige: So again, at least to me, it points back to the data, right? Having all of these powerful algorithms is wonderful, but it doesn't mean anything if you don't have data. And also the first cohort of tooling that I mentioned: all of these additional capabilities for Photos, all of this additional augmented reality stuff for the Maps components, all of this additional composability for emails. If you're using those tools, then you're kind of feeding the data machine, right? Even though these are wonderful platforms and they're incredibly useful for all of the folks who use them, you're still feeding in data, which helps make the algorithms better, which ultimately improves Google's tools, some of which they do sell. And the other thing, too: if you're doing data analysis, Deep Learning, Machine Learning in a cloud environment, the two most expensive things are compute, so what resources are you using in order to actually carry out the work that you're doing, and then also storage for the data, right? So if you open source the algorithms and you show people, "Wow, isn't this awesome? Look at all this cool stuff you can do. Isn't this really, really interesting?" Then if that person wants to turn around and do it, they have to pay for the data that they have stored, and they also have to pay for the compute that they used in order to analyze that data. So it's kind of like giving you free leather seats if you buy a car, right?
Hugo: That's a great analogy.
Paige: Personally, I think it's incredibly wise to open source this software, to build a great ecosystem with quick starts and tutorials, and to also build straightforward API interfaces and kind of delightful tooling, so that folks will actually want to use the products, and want to use them really happily.
Hugo: Having API products is something that I'm really very interested in, because I think, for all the good that online Machine Learning competitions, for example, provide us, they do convince a lot of beginners that the output of a Machine Learning model is a CSV file.
Paige: Nooooooooooo.
Hugo: Right? But that's what they do, right? And this is not the case now and this isn't the case going into the future clearly.
Paige: Not at all. But I feel your pain. So often I've worked with data scientists and it's either, "Okay, well, the result of this analysis is going to be this static document that I crank out for my boss, and it looks very similar to a traditional business report," or it's a CSV file with a series of IDs and then whatever you've classified them as. And to any data scientist who does that now: don't do it anymore. That's not okay, and it's not respectful to any of the software engineers that you work with. The ideal situation is that you have something like a container, where you have your algorithm, and it doesn't matter if it's a protobuf file, so like a .pb, or a .py file if you're doing something with scikit-learn, or a .R file if you're doing something with caret or any of the other R Machine Learning packages. But then you also have something like a schema.json to define inputs and outputs. So: I expect to get a JPEG less than four megabytes in size, and I will output a classification and a confidence level, or something of that nature. And then also some sort of script to initialize and run your model. But the idea is you package it up into a container, and that way any software developer in your organization can ping it the same way they would a REST API. That's making something that can fit into an application. It doesn't matter if you have the best algorithm in the world, the most accurate, insightful classifier ever: if you can't fit it into a business process, then it doesn't matter. You might as well not have created it at all. It's the same as doing research work as a scientist. If you can't communicate what you've just done, then that research might as well never have happened. Communication, and being able to integrate your algorithm into existing business applications, existing processes, that's what we should all be trying to do.
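Here is a minimal sketch of the pattern Paige describes: a scoring script with a declared input/output contract, packaged so any developer can ping it over HTTP like a REST API. Flask, the file names, and the schema are illustrative assumptions, not a specific Azure product.

```python
# score.py -- minimal model-serving sketch; all names are illustrative.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load a trained model once at startup (e.g. a pickled scikit-learn estimator).
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# The kind of contract a schema.json might declare:
#   input:  {"features": [float, ...]}
#   output: {"prediction": str, "confidence": float}

@app.route("/score", methods=["POST"])
def score():
    payload = request.get_json()
    features = [payload["features"]]  # one row of inputs
    prediction = model.predict(features)[0]
    confidence = max(model.predict_proba(features)[0])
    return jsonify({"prediction": str(prediction),
                    "confidence": float(confidence)})

if __name__ == "__main__":
    # Inside a container this would be exposed on a known port.
    app.run(host="0.0.0.0", port=5000)
```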
Hugo: I couldn't agree more. And this really speaks to productionizing whatever you're working on right?
Paige: Yes. And also, to your point before, being able to scale it. You might start off with a thousand customers, but if you're doing your job right, then eventually you'll have a million. And what does that mean in terms of changing your computational workloads? Do you need to use Spark now, and if you are using Spark, do you need to spin up clusters of machines that are CPU-enabled, or do you need GPU-enabled machines? And then also, how do you deal with the data that you have in storage, and do you want to incorporate streaming data, and all of these other things? The entire concept behind having an end-to-end Machine Learning life cycle, and how you would go about retraining models over time and doing checks to verify that the data coming in is what you expect to see and is consistent with the first iteration of the model, all of that is incredibly needed, and I don't think enough institutions are doing it yet. There's this great paper called TFX: TensorFlow Extended that was released about halfway through 2017, and it is amazing. I re-read this paper at least once every three weeks; it is just phenomenal, and it goes through all of the steps that you should be doing as a Machine Learning engineer to make sure that your model is doing what you hope it's supposed to be doing and is production ready.
Hugo: Well, this actually speaks to my next question, which was really stimulated by everything you were saying, and we'll include a link to the TFX paper in the show notes for the listeners out there. But my question is: if you have data scientists and Machine Learning engineers building these models, you have data engineers working on the data storage, databases, whatever it may be, and you have software engineers on the other side using the output of these productionized Machine Learning models, whose job is it to make sure that the models keep doing what you think they're doing?
Paige: I personally think that it has to be a constant collaboration between the Machine Learning engineer, the software engineer, and the DevOps practitioner. Because you have to have a DevOps mindset. You have to understand: what do I need to be logging in terms of my model's output? How often should I double-check to make sure that the output makes sense and is still at the accuracy that I want and that I need? And then also, what phase gates should I build into my automated model retraining steps if that system breaks down? So what do I mean by that? I mean, say you build a model, you're really happy with it, it gives you 85% accuracy or 90% accuracy or whatever. You deploy it out to the world, and it continues to work great for the first two months, and then that third month, suddenly your accuracy drops below whatever threshold the business had as a requirement. So suddenly it's at, like, 80% accuracy. Then there should be a flag in your DevOps process that notices that change and then automatically triggers: okay, a data scientist needs to look at the model, retune hyperparameters, incorporate additional data. Or maybe something happened to change the consumer base, or something of that nature. But whatever the reason, data scientists need to re-evaluate the model and re-architect it.
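A toy sketch of the phase gate Paige describes: compare live accuracy against the business threshold and raise a flag when it degrades. The threshold value and the alerting hook are illustrative assumptions.

```python
ACCURACY_THRESHOLD = 0.85  # illustrative business requirement

def check_model_health(y_true, y_pred, alert):
    """Compare live accuracy to the threshold and flag the model if it drops.

    `alert` is whatever hook your DevOps process uses -- paging a data
    scientist, opening a ticket, or triggering automated retraining.
    """
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    if accuracy < ACCURACY_THRESHOLD:
        alert(f"Model accuracy dropped to {accuracy:.1%}; re-evaluation needed")
    return accuracy

# Example: check_model_health(labels, predictions, alert=print)
```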
Hugo: And this challenge is known as model drift, right?
Paige: Yes, model drift. But the DevOps for Data Science space, I think it's going to be huge. And again, there aren't a whole bunch of people who are currently focused on that area. Right now it seems like everything on the Internet is all about, "Oh hey, let's build an algorithm. Let's just figure out how to do that." But nobody ever really focuses as much on the data engineering aspect, so getting data into the state it needs to be in to get into the model, or on what you do with a model once it's been created. And that's what I love about DataCamp too: you guys have so many great courses on the data engineering side as well as the algorithm building. Like I said, it's going to be much, much more important, as Machine Learning models are deployed out into the wild, to make sure that they're kept up to date and that they're incorporated into applications in a responsible manner. And that's going to take DevOps practices.
Hugo: Listeners out there: please listen to this. If you want to be ahead of the curve and think about very necessary aspects of the work we all do, listen to Paige and think about DevOps for Data Science. Now, this idea of productionizing Machine Learning models and having APIs will, I think, dovetail very nicely into a few of the announcements from Microsoft Build, because I am aware that one thing that's happening that's really exciting is the Cognitive Services APIs, right?
Paige: Yes. So Cognitive Services are very cool, and a lot of folks have similar services, but what they are is basically taking a Deep Learning model that's been trained against millions of images, millions of tagged images, or lots and lots of videos, lots and lots of clean speech. So taking lots of data, taking very high-end machines, training up Deep Learning models over the course of weeks, and then deploying them as REST APIs that anyone can call. So what does that mean? The companies that have most of the data, right, the Googles, the Amazons, the Facebooks, the Microsofts: Microsoft, Google, and Amazon have decided to expose those models as REST APIs, which means you don't have to retrain anything, really. You don't have to have millions of images or millions of tagged pieces of data on your own side in order to build a model. All you have to do is write less than 10 lines of code in whatever language you want, ping the REST API, and suddenly you get back, for the Speech to Text API, the spoken words as text, in English or in any other language that you choose. For the Vision API, you get back a plain-text description of the image, as well as a whole bunch of different tags of things that the API thinks it sees in that image, with confidence levels for each. For the Face API, you get back emotions, you get back estimated age, estimated gender, you get back specific locations, like eye-right-left or eye-right-top, eye-right-bottom, whatever, lots and lots of different locations that you can use for various things in your app. And you don't need to know Machine Learning or Deep Learning at all in order to leverage those APIs.
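As a rough illustration, calling a hosted vision model really is just an authenticated HTTP POST. This sketch follows the general shape of the Azure Computer Vision analyze endpoint from around this era; treat the region, key, image URL, and exact response fields as assumptions to check against the current docs.

```python
import requests

# Placeholders -- substitute your own region, key, and image URL.
endpoint = "https://westus.api.cognitive.microsoft.com/vision/v2.0/analyze"
headers = {"Ocp-Apim-Subscription-Key": "<your-subscription-key>"}
params = {"visualFeatures": "Description,Tags"}
body = {"url": "https://example.com/some-image.jpg"}

resp = requests.post(endpoint, headers=headers, params=params, json=body)
analysis = resp.json()

# A plain-text caption plus tags with confidence levels, as Paige describes.
print(analysis["description"]["captions"][0]["text"])
for tag in analysis["tags"]:
    print(tag["name"], tag["confidence"])
```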
Hugo: That is awesome and I think incredibly powerful and very exciting, but there is a dark side to publicly exposing trained models with APIs, right?
Paige: There is totally a dark side. If you expose these APIs to folks who don't necessarily understand Machine Learning or Deep Learning as a process, it might be very confusing to them why sometimes it works and sometimes it doesn't, right? And also there's the inherent bias of any dataset that's used to train the model itself. So let me start with that first one. A friend was recently looking at one of our Cognitive Services APIs. This is one of the custom ones, so you're able to upload specific photos and have it trained on those photos. For example, if you have a corpus of employees, you could train the model on your employees, and suddenly it would be able to say, "Oh, I see Paige Bailey, confidence level 75 percent." She did not know that the 'not' aspects, the 'not' classifications, are just as important as the 'are' classifications. So you could upload five pictures of me and try to train against that, but if you don't have any examples of things that aren't me, it would just kind of naturally assume that any person who looked vaguely like me was me. And then there were also a lot of results given with precision and recall, and she didn't quite know what those meant. There was a sort of misunderstanding about the kinds of pictures that you should upload: for representative training samples, ideally you would want to have lots of different pictures of my face, in various lightings, from various angles, and that would give you the best classification results, but that's never explicitly stated in any API. And then also, it's never going to be 100 percent accurate at classifying, right? It's always going to be some confidence level; it's never going to be 100%. So helping people understand that uncertainty, I think, is going to be an incredible responsibility for everybody who has these developer-facing APIs, which are very powerful and incredibly useful and do a great job of democratizing AI. But if you don't let people know about the dangers of using them, then it could get real bad real quick.
Hugo: And those are problems associated with people who may not have enough context and enough expertise in Machine Learning, Deep Learning, AI. But there's another problem that a lot of research is going into at the moment, which is that if you have bad actors who do have a lot of experience, they can actually, for example, extract sensitive information about the data that was used to build the model, if they're clever enough about it. Right?
Paige: Absolutely. That's something called Adversarial Machine Learning, and being able to kind of reverse engineer the model, or reverse engineer the data, is really, really fascinating. There's a research group out of Google behind a project called CleverHans that does amazing work in this space. But it gets very scary. You can have something called a single-pixel attack, and that is a real term, where if you have a classifier that is supposed to be examining an image and then making some sort of classification on it, like this is a dog, this is a cat, that's a llama: if you introduce a single red pixel, or a single pixel of any color in a very specific location, you can suddenly go from accurately classifying with a high confidence level to classifying incorrectly with an even higher confidence level. So that's a single-pixel attack. You can also just introduce random noise, like 0.07% random noise, and get back a completely inaccurate classification. Because the images look pretty much the same to a human; the random-noise one looks almost exactly the same. If it was a picture of a panda, it would still look like a panda, but the classifier is just so overfit that it doesn't know any better. It's very scary.
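The noise attack Paige describes is commonly implemented as the Fast Gradient Sign Method (FGSM), the technique behind the famous panda example. Here is a minimal sketch in TensorFlow 2, assuming you already have a trained Keras classifier and a one-hot label; the epsilon value is illustrative.

```python
import tensorflow as tf

def fgsm_perturb(model, image, label, epsilon=0.007):
    """Fast Gradient Sign Method: nudge every pixel a tiny step in the
    direction that most increases the model's loss. The result looks
    unchanged to a human but can flip the classification."""
    loss_fn = tf.keras.losses.CategoricalCrossentropy()
    image = tf.convert_to_tensor(image)
    with tf.GradientTape() as tape:
        tape.watch(image)
        prediction = model(image)
        loss = loss_fn(label, prediction)
    gradient = tape.gradient(loss, image)            # sensitivity per pixel
    adversarial = image + epsilon * tf.sign(gradient)
    return tf.clip_by_value(adversarial, 0.0, 1.0)   # keep valid pixel range
```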
Hugo: It is. And we'll include a few links in the show notes with respect to these poisoning attacks and also extraction attacks. There are a lot of smart people working on these types of challenges; they have actually been able to show that these types of attacks are very hard to defend against as well. But we'll include all of that in the show notes. We discussed the Cognitive Services APIs; what else happened at Microsoft Build that you found fun and really exciting?

Paige: Cool. If you're a C# developer, there's a new thing called ML.NET. So some of the more vanilla, scikit-learn-style Machine Learning algorithms you can now use as a .NET developer. That was one announcement. There are also FPGAs; FPGAs are incredibly fast at inference, which means that the predictive capacity is much, much faster than a TPU. We showed an example of using TensorFlow and classifying chips for a company called Jabil. There was also an announcement about something called Cognitive Search, which I thought was incredibly cool. Basically, you just throw in all of the data that you have in various stores in Azure: you might have some CSV files or PDF documents in blob storage, you might have some databases, and Cognitive Search just kind of goes through, extracts n-grams, so it extracts keywords, and it automatically builds a knowledge graph for you. So what does that mean? It means that, say you have PDFs: it goes through and extracts all of the text from those documents, but it also looks at embedded images within the PDFs and, using the Cognitive Services that I was mentioning before, it gives a plain-text description and also a whole bunch of tags for things that it thinks it sees in the image, and then automatically links them all up together. So the example that we showed in the keynote was the NBA: they just kind of put all of their players' photos in it, they put in all of these PDF documents called 'Game Notes' on player performance, and suddenly you were able to find linkages: LeBron James likes wearing Nike basketball shoes. And that was because it noticed LeBron in an image, like it was able to detect him specifically, and also Nike shoes specifically. And that was just rock-hard awesome.
Hugo: So we're thinking about correlations and patterns in that sense, which is really cool, to have that type of recognition.
Paige: Yeah. And it helps you as a data scientist to also ask a lot more interesting questions. So if you see these patterns and relationships, you can start thinking, "Oh, well, now I certainly have data that I didn't have before." Now I've got text that was extracted from PDFs, but then also counts of how many instances of seeing LeBron and Nike shoes in the same image. And yeah, it gets very interesting very quickly. And I think we also announced model logging in Azure Machine Learning, so being able to track model performance over time and then also being able to deploy it as a containerized instance. You can package everything up in containers in a completely open source way, but Azure model management also allows you to ping it as a serverless REST API call. So that would reduce the amount of money that you would need to pay in order to leverage that model at will. It's kind of like creating your own very specialized Cognitive Service, just as a data scientist working for a company. So I love the idea of having a model marketplace, where data scientists, from wherever they happen to be, build a model on certain datasets, they deploy it as this REST API, this containerized instance, and then people can ping it and pay accordingly. I think that would be such a cool thing.
Hugo: Yeah. I couldn't agree more and I think this also speaks to the emergence of more and more Transfer Learning occurring in the Machine Learning space.
Paige: Yes. And if you haven't taken a look at it yet, take a look at TensorFlow Hub, which was also announced during the TensorFlow Developer Summit. It's an open source portal for sharing datasets, but mostly trained models and also model components, so people can kind of Lego-brick and architect their own model by using bits and pieces of other models.
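A minimal sketch of that Lego-brick reuse, using the tensorflow_hub library's later KerasLayer interface to drop a published MobileNet feature extractor into a new classifier; the module URL is a real published one, but the five-class head is an illustrative assumption.

```python
import tensorflow as tf
import tensorflow_hub as hub

# Reuse a pre-trained image feature extractor as a frozen building block
# and train only a small classification head on top of it.
feature_extractor = hub.KerasLayer(
    "https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/feature_vector/4",
    input_shape=(224, 224, 3),
    trainable=False,
)

model = tf.keras.Sequential([
    feature_extractor,
    tf.keras.layers.Dense(5, activation="softmax"),  # e.g. 5 custom classes
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```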
How can beginners get started with Cloud Based Data Science?
Hugo: That's awesome. So I've got time for one more question, and what I'd like to know is: for beginners, people who have done Data Science locally, or on DataCamp, or wherever it may be, what can they do to get started with Cloud Based Data Science?
Paige: Man, that is a great question. I would highly recommend looking at the documentation on Azure, on Google for GCP, and on AWS; whichever cloud provider you want to use, they'll have documentation and quick starts and tutorials. I don't know of a course that focuses specifically on distributed Machine Learning, but I would love to have one.
Hugo: You should come and teach one at DataCamp sometime.
Paige: I would love to, if y'all would like. And the packages that I would probably recommend most are MLlib with Spark for distributed Machine Learning, and that would be with CPUs; I think it's just CPUs, but I'm not sure, so don't quote me on that. And then also Spark Deep Learning for the Deep Learning approaches, and that does support GPUs. Those packages are incredibly useful for distributed Machine Learning at scale. And then for Machine Learning on the cloud in general, you could probably use the same tools that you love and adore already: scikit-learn, caret, and TensorFlow. It'd just be that you would be accessing different kinds of data; you would use them in the same way, but the connections would probably be a little bit less intuitive. And then there's being able to productionize it: once it's created, how would you be able to deploy it?
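For a sense of what that looks like in practice, here is a minimal sketch of distributed training with Spark's MLlib, using the DataFrame-based pyspark.ml API; the data path and column names are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# On a managed cloud cluster (Databricks, HDInsight, Dataproc, ...) the
# session is usually provided for you; locally this starts a small one.
spark = SparkSession.builder.appName("distributed-ml-sketch").getOrCreate()

# Placeholder path and columns -- substitute your own dataset.
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# MLlib expects the inputs combined into a single features vector column.
assembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features")
train = assembler.transform(df)

# Training runs across the cluster's executors automatically.
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
print(model.coefficients)
```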
Hugo: Fantastic. Paige, this has been such an absolute pleasure. Thank you for coming on the show.
Paige: It's been awesome. Thank you so much for having me. And I wish you were at PyCon this year.
Hugo: Yeah. Me too. I'm in Australia of course, as you know. But PyCon is one of my favorite conferences, because in my daily work I kind of get stuck using Python, really the scientific Python stack, and I forget how incredibly broad the community and the uses of Python are. So that's one of the things I always encourage people to do when they go to PyCon. A lot of the value, I think, is just finding out what everyone else is doing with this kind of incredible language, right?
Paige: Yeah. It's everything from web development to, like, sysadmin tasks. It's everything.
Hugo: Well, thanks once again, Paige. It was an absolute pleasure.
Paige: Thank you.