How Data Science and Machine Learning are Shaping Digital Advertising
Discover the role of data science in the online advertising world, the predictability of humans, how Claudia's team builds real time bidding algorithms and detects bots online, along with the ethical implications of all of these evolving concepts.
As the Chief Scientist at Dstillery, Claudia Perlich leads the machine learning efforts that help target consumers and derive insights for marketers. With more than 50 published scientific articles and numerous awards, she is a widely acclaimed expert on big data and machine learning applications, and an active speaker at data science and marketing conferences. Prior to joining Dstillery in 2010, Claudia worked at IBM’s Watson Research Center, focusing on data analytics and machine learning. She holds a PhD in Information Systems from New York University (where she continues to teach at the Stern School of Business), and an MA in Computer Science from the University of Colorado.
Hugo is a data scientist, educator, writer and podcaster at DataCamp. His main interests are promoting data & AI literacy, helping to spread data skills through organizations and society and doing amateur stand up comedy in NYC.
Hugo: Hi Claudia, and welcome to DataFramed.
Claudia: Well, thank you so much for having me.
Hugo: It's such a pleasure to have you on the show. I'm really excited today to be talking about how data science and machine learning are shaping and reshaping digital advertising. Before we get there, I'd like to find out a bit about you.
Hugo: What are you known for in the data science community?
Claudia: I have a couple of different hobbies even within data science. I think what most people may know me for, and that’s almost where my “fame” started, is that I used to participate in a lot of data mining competitions. You may recall the Netflix prize where you could make a million dollars if you were substantially better at recommending movies than their existing algorithm. Much earlier than that, in the geek world of machine learning, there have been competitions where people try to build the most accurate model on a data set that was provided by the organizer. I've been participating in those for quite a while. Then I won three in a row between 2007 and 2009: one on breast cancer prediction, another on churn prediction from telecommunication, and also one on the Netflix data set, although we didn't get the million dollars on that one. That's a little bit of my initial claim to fame that really helped me be perceived as a kind of hard-core part of the machine learning community.
Hugo: Great, and so I suppose one of the most famous platforms where this can happen now is Kaggle?
Claudia: Kaggle is basically the next gen, when it became much more mainstream as machine learning and big data picked up. They provided a very nice interface where not just conference organizers, but in general nonprofit organizations and companies, all have a very easy way of interfacing with a huge community of thousands of people who have fun building these models.
How Did You Get Into Data Science?
Hugo: This speaks to some of how you got into data science. I want to probe a bit more into what there is in your background that led you to data science. What type of skills did you develop, or what jobs did you have? What did you study that led you to data science?
Claudia: I grew up in East Germany and other than knowing that I was good in math, I didn't really have any convictions and no clear idea what I wanted to do with my life. My dad took me aside and said, "Look, they will need computers everywhere. Why don't you study computer science? If you're good in math, you should be fine with that." That was really how I picked computer science as my first choice for my undergrad. His words were even more prophetic if you think about what happened to data science, which is now really extremely hot and has all kinds of different application areas. I migrated into data science in '95 as an exchange student at CU Boulder when I took my first class on artificial intelligence and artificial neural networks. I just loved the fact that you could learn so much about the world in all of these different fields by looking at it through the lens of data. Aside from this, there were the very measurable challenges of building the best possible models. That world just appealed to me from the first time I dipped my toe into data.
Hugo: I think this is actually a nice lens through which to view the emergence of data science, from your background, right? Because you were good at math, you studied computer science, and then you moved into model building and predictive analytics, and these I suppose are three of the things which people associate most with whatever data science is these days.
Claudia: The challenge with data science, and I would argue even today with artificial intelligence, is that it really depends on the background of the person you're talking to. I think overall you are right with your characterization; generally there is a sense that you also want statistics as well as some domain knowledge in the mix here. I personally have found that the fact that for my PhD I moved from computer science into a business school here at NYU really shaped my focus away from the purely algorithmic towards being much more interested in what kind of problems you can solve with these tools and algorithms. I think this is really where the birth of data science, in its kind of broad application, originates. It's in using data in combination with algorithms to solve some very specific problems.
Hugo: In that way, it's actually domain centric and question centric?
Claudia: It should definitely start with a good question. You can't just jump into a data set and dig around there and hope to find gold, as it sometimes is put. You need first to really understand what the things are that you want to do. Is there any decision you can make? Because if there's nothing you can do, then you don't need to waste your time on looking around in data. I really like to start with the constraints of the problem and then be led down the path towards what data is most appropriate and what algorithm can help me solve it.
Data Science and Digital Advertising: Dstillery
Hugo: This is great, because I really wanted to talk about your work in digital advertising and your work as chief scientist at Dstillery. Perhaps you can speak to the types of questions that you use data science to solve in that work?
Claudia: Before we get into this, the story of my life really progresses from getting a PhD in information systems at a business school. Deciding that rather than taking an academic career path, I really wanted to focus on these applications brought me to IBM Watson, where I stayed for six years. Then, I was lured into the world of digital advertising really by almost the promise of a golden land. It's like a big sandbox to play in. Digital advertising has an incredible data footprint where you can experiment and really push these algorithms to the limit. It's a huge experimental field where you can understand how well you can predict human behavior, which is typically somewhat limited, but a lot better than random. You can also see, in this conversation about correlation versus causation, when these methods can help you understand causality and when they can't. For me the excitement really was in not just the sheer amount of data, but also the ability to try these things: to put these algorithms to the test and see how they perform in the real world.
The Who, What, Why and How of Digital Advertising
Hugo: Great. Can you give me the elevator pitch on digital advertising?
Claudia: I don't think I can give you the elevator pitch of digital advertising. I can tell you what we do and what I've really enjoyed doing for the last eight years. You may be familiar with the rise of the programmatic advertising world. What that means is that ads are now being sold in real time auctions. Every time you interact with a digital device, whether this is reading news stories on the web or using an app, you will be exposed to many different ads. These ads really were bought in real time, as the page loaded, in one of these auctions. What Dstillery specifically started out with was the promise of being able to pick extremely selectively the right person and moment, by using predictive modeling that informed the automated bidding when such a good opportunity showed up and then adjusting bid prices for that. This is the core promise of programmatic advertising, where you see everybody all day long and you can choose with very high precision when to interact with a customer. Now around this core problem there are a lot of interesting, quirky, fun other things that can keep a data scientist excited. We have, for instance, problems around fraud. These show up the moment you have an open market where people can buy and sell. All of a sudden in that environment you run into scenarios where there is not a real person sitting there who may or may not see the ad, but in fact a bot that was written for the sheer purpose of selling ads. That was a very interesting discovery back in 2012, when our models looked really, really good at predicting certain outcomes. The thing is, usually people are not that predictable. When we tried to understand what was going on, it turned out the performance was too good to be true, because bots really are deterministic. They're easy to predict, and that was one example of these side problems that we're having.
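The idea of a predictive model informing an automated bid can be illustrated with a minimal sketch. This is not Dstillery's actual bidding logic, which the interview does not describe; it assumes a simple expected-value rule, and the function name `bid_price`, the skip threshold, and the bid cap are all hypothetical.

```python
def bid_price(p_conversion, value_per_conversion, min_probability=0.001, max_bid=5.0):
    """Return a bid in dollars for one auction, or 0.0 to skip it.

    p_conversion: model's predicted probability that this impression
        leads to the outcome the marketer cares about (assumed input).
    value_per_conversion: what one such outcome is worth to the marketer.
    min_probability, max_bid: hypothetical guardrails, not from the source.
    """
    if p_conversion < min_probability:
        # Be selective: most opportunities are not worth bidding on.
        return 0.0
    # Bid the expected value of the impression, capped for safety.
    return min(p_conversion * value_per_conversion, max_bid)


low_interest_bid = bid_price(0.0005, 100.0)   # below threshold, no bid
typical_bid = bid_price(0.01, 100.0)          # expected value of about $1
capped_bid = bid_price(0.5, 100.0)            # expected value $50, capped
```

A suspiciously high predicted probability, like the last case, is the kind of "too good to be true" signal Claudia describes as pointing to bots rather than people.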
On the other end of the spectrum, we're already spending all of this effort of machine learning and AI on huge data footprints, just for bidding. Then, the clients come back and say, "This is really amazing, you have this great performance. What did you do?" Shrugging and saying, "Well, we built a predictive model in, I don't know, 500,000 dimensions," wasn't the right answer, because they really wanted to understand what we may have found out about their potential future customers. Increasingly now we are looking into translating back what this artificial intelligence found in these vastly different behavioral patterns, to be able to not just choose the right ad, but to answer more strategic questions such as, "Why do my customers actually buy my brand? What's their perspective on the value proposition of the product?" Some of this you find encoded in these models that are very good at predicting. Now, increasingly we work with augmented reality to just give this information back to brands to help them understand what their customers are really doing.
Clicking Behavior: Bots, Human Behavior and Predictability
Hugo: There we have a question posed in non-data science terms. I mean, well, a job to do: to build algorithms that will predict whether people will click or not in order to make real time bids in these real time auctions, but then another non-data science question emerges, which is why are our customers doing this? It's really a translation in both directions, which I think is incredibly interesting. It really speaks to this idea that data science doesn't exist in a vacuum, as you stated initially; that it responds to real world questions, and that we as data scientists also need to translate our results to non-data science people, whether they be customers or managers or people we're consulting.
Claudia: One of the most interesting quirks to this is, as you said, we're predicting people clicking. We've actually learned the hard way that it's a really bad idea to use powerful machine learning to learn when people click on ads. The reason is, people occasionally click on ads because they're interested in the product, but much more often it's an accident when you're trying to either close it or just change the window. The fact that it's an accident doesn't mean that it’s random. These algorithms are good enough to find out how to predict the accidents. It turns out it's much easier to predict accidents, because they're typically contextual, in that people aren't paying attention. For instance, we see very, very high click through rates on the flashlight app, because there people fumble in the dark and there's a very good chance that you accidentally hit the ad. Other scenarios might be just people with very bad fine motor skills, or eyesight problems, who tend to be more prone to clicking, which definitely doesn't mean that they're truly interested in the product. This is the interaction between really smart technology and optimization metrics that were kind of okay for a very long time. Now we have to educate back, saying, "This is not a good idea to combine these two things, because the technology's actually too powerful. We need to think a lot harder about what we're optimizing for, because we may end up doing the exact wrong thing."
Hugo: I suppose what's telling is that in all these cases the things that are easiest to predict are the things you don't want to predict. Whether it be bots clicking, or people using the flashlight app. I think I saw you give a talk once in which you had another very telling example of wanting to advertise airport related stuff, or travel related stuff to people at airports. I'm gonna get this wrong. I'd love for you to tell me that example again.
Claudia: You've remembered correctly. It's really a progression of what I just talked about. If you really listen to me, I'm saying, "We have bots that go and visit websites, then we have accidental clicks. What should we optimize towards?" One thing I tried to do is say, "Well, nobody accidentally goes to a physical location. What we could try to do is predict whether people will go to a store or an interesting location." I did two different experiments. One was predicting who would go to a car dealership. That one actually was very successful. It was really interesting to see the market research, the different brands that consumers or prospective buyers look at before they then choose and go to that Mercedes car dealership. Since it worked so well, we turned to this other group that everybody always wants to reach, which is the frequent traveler. Presumably because they have a lot of money, probably a bad conscience, and need to bring gifts home, it's always a great audience to reach. Where would you expect to find them if not in airports? It turns out that the people who are much easier to find in airports are all the people who work there, from the baggage handlers to the people at the check-in. They spend their whole day on their digital devices, they're there every day, and they have very typical patterns of behavior. Again, by not thinking very clearly about what else could explain what you're looking for, we found mostly employees of JFK rather than the elusive frequent traveler.
Hugo: You've hinted at this, but part of your job is to predict human behavior. You've hinted that humans aren't so predictable, but how predictable are they, we?
Claudia: How predictable are we? Well, I was really fascinated by ... this has nothing to do with advertising ... you may follow these competitions that AIs are now engaging in, starting with the chess games back in the ‘90s, or recently we had Jeopardy that Watson won or [the Google AI that won] Go. None of them really has that much to do with predicting human behavior. More so with strategy. What is fascinating is that apparently now we have algorithms that can predict when people try really, really hard to be non-predictable. Apparently we finally have the world's best poker players using algorithms that analyze faces, and apparently we give away a lot more than we think. That's kind of a fun side story.
Hugo: The poker face-
Claudia: The poker face-
Hugo: ... isn’t really a real thing.
Claudia: No, not to the machine. It may work with other people but the machine can still see right through it. In our daily activity I think a lot about us is very predictable, at least in my case. I mean you can very quickly figure out what my daily habits are. The other thing that is very predictable is all kinds of consideration activities: things that require gathering information and take a couple of, I don't know, days, weeks, months to come to a conclusion. For instance, buying a new refrigerator. For most people that's a serious thing and you spend some time thinking about it. You will leave plenty of digital traces of that activity that help marketers identify, "Oh yeah, these people are in the market for that product." When you look at the more spur of the moment activities, in those cases you might be able to identify that this person is even prone to buying an apple on short notice while walking over Union Square Market, because the person likes apples. Can you predict that at this particular moment the person will feel like buying an apple? Probably not. There is a huge range, and you've probably heard about one of the studies done on Facebook data that was really concerning from a privacy perspective. This is not about predicting future behavior, but what researchers showed is that there are a lot of parts of our personality and behaviors that we may not want to make public, but that can be very easily inferred. For instance, sexual orientation and political perspectives are really easy for a machine learning algorithm to infer just given your day to day activity.
Hugo: That isn't something necessarily that will help Dstillery do their job though, right? Or is it?
Claudia: This is typically not what we're interested in, in the sense that we are looking to optimize very specific metrics set by the marketer, such as the number of new customers signing up for a service on their website. For that, I'm not really interested in any of these concerns about who you really are and things like sexual orientation. It's also perfectly anonymized in the sense that I don't know any personally identifiable information about you, and even your browsing and digital activity can basically be hashed and obscured. It doesn't mean anything. This being said, when you now turn it around and go back to the client and they want to know, "What did you find?", I think at that point the boundary becomes really interesting: "Where do I start infringing on the audience's privacy when I share certain correlations with character traits?" One of the concerning examples, for instance, comes from our frequent traveler. We did find flight attendant sites, but we also found gay dating sites. Now, is this something that I should be seeing, is this something I should have such easy access to? Often when I tell this story, people in the audience feel somewhat ill at ease about the fact that this is even something that can be that easily revealed.
Hugo: This is something relatively new in the technological landscape. Presumably legislation hasn't caught up with these types of challenges yet from a societal level?
Claudia: We are really struggling right now with finding new ways of possibly self-constraining how we interact with these technologies, and with where the line is.
Hugo: What type of data do you have access to that helps you predict human behavior at Dstillery?
Claudia: The data sources in digital advertising are really coming from many, many different places, and different layers have very different access rights. You really see specialization. Obviously Facebook knows everything you do on Facebook and will provide versions of that data to their advertisers. We have access through these real time auctions. Basically every auction that happens, and we're talking about 100 billion events every day. 100 billion times we are being told that this particular device is right now looking at this particular content. You have this constant stream that ultimately then gets assembled into an activity history of a very granular nature, like the URL of the news that you read, for instance, along with location information if the particular request came from your mobile device. If you're just standing on the corner and you're bored, and you're playing, I don't know, Candy Crush 15 and there's an ad there, your phone just told me that you're standing there, unless you're very diligent about switching off the GPS. So one of the primary sources is actually the environment itself through which ads are being sold. In addition to that you also have many data vendors who are providing additional information that they have collected in a similarly granular form.
Hugo: Do you have access, for example, to how much time people spend on websites, even cursor activity, if they have other apps open, any of this type of stuff?
Claudia: The details of your web activity typically remain behind the scenes. What you do with your cursor really requires an integration in your browser, which is far beyond anything that is broadly available in the advertising environment. Now, sometimes the ad itself could have technology that for instance tracks how long it is viewed, and whether or not you went over it with the cursor. Something similar happens in the mobile space when you use your digital devices, with what are called SDKs. It's almost like the operating system; it's the fundamental software that underlies most of the apps that people develop. They themselves might collect data about what you do with the app, but also about other apps that are installed. There's an ongoing attempt from, for instance, Apple to ensure that apps stick to the rules and only look at their own data and only share their own data. There's a lot that comes directly from these deeply integrated parts of the software stack that is providing you apps.
Hugo: How about tracking pixels? I don't know much about what these are, but I hear they're being used more and more.
Claudia: To explain the notion of a tracking pixel, one needs to understand the rules of engagement when you're using the web. Most people know about cookies, but they don't really know what they do. A cookie really is just a very small file that just sits on your computer and does absolutely nothing. What is interesting about it is that I can read it and write to it if your computer requests content from one of my servers. If you go to a website and it looks up content from the New York Times, then the New York Times computer can look at the New York Times cookie it has saved on your computer. The origin of this was, for instance, e-commerce: you could save the content of your basket and you didn't lose everything all the time. It was a way of temporarily storing some information. Today, the only thing people store in there is basically your ID. We would give you a 20 digit random number that is sitting in a cookie on your computer. Now, how do we get to the cookie, and what does the pixel have to do with it? If you go to the New York Times, I would not know about it. In order for me to know about it, there has to be a request to my computer, which is the Dstillery machinery. In order for this to happen we put what is called a pixel on the New York Times with, of course, their cooperation. They have to put the pixel on the New York Times. When you now read the New York Times there could be a request from the New York Times page to me, which now allows me to look at my own cookie on your computer. All this really serves exactly one purpose: to give you a persistent identity, so every time I see you, I know it's the same you that it used to be. The cookie itself is entirely passive. It just sits on your computer, and only when I get access to it through one of these integrations can I add another piece of information to the history that I have collected about you.
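The mechanism described here can be sketched as a small simulation. This is only an illustration under stated assumptions: there is no real HTTP involved, the class `PixelServer` and its method are hypothetical names, and the only detail taken from the interview is that the cookie holds a 20 digit random ID.

```python
import secrets
import string
from collections import defaultdict


class PixelServer:
    """Toy stand-in for the ad tech company's server (hypothetical)."""

    def __init__(self):
        # Maps cookie ID -> list of pages on which the pixel fired.
        self.history = defaultdict(list)

    def handle_pixel_request(self, cookie_id, page_url):
        """Simulate the browser requesting the pixel from this server.

        cookie_id is whatever the browser already stores for this
        server's domain, or None on the first visit.
        """
        if cookie_id is None:
            # First contact: issue a random 20 digit ID to store in the cookie.
            cookie_id = "".join(secrets.choice(string.digits) for _ in range(20))
        # The cookie itself is passive; the history lives on the server.
        self.history[cookie_id].append(page_url)
        return cookie_id


server = PixelServer()
browser_cookie = None  # the browser starts with no cookie for this server
for url in ["https://example.com/news/story-1", "https://example.com/news/story-2"]:
    # Each page embeds the pixel, so loading it triggers a request to the server.
    browser_cookie = server.handle_pixel_request(browser_cookie, url)

visits = server.history[browser_cookie]
```

After the loop, `visits` holds both page URLs under one ID, which is the persistent identity the passage refers to.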
Hugo: Right. That was a great explanation of both cookies and tracking pixels. With tracking pixels have you found that they will generally give you far better performance for your models?
Claudia: We have as I explained earlier, this one huge data stream that comes directly from the bidding environment where ads are being bought and sold. Of course that only comes from websites that rely on advertising in order to monetize. You wouldn't see that on the more upscale brands that don't want any ads on their sites. In those cases, when we're talking about kind of the non-ad monetized world that's where these pixel based integrations help us gather information and they are by far more valuable for most of the predictive tasks, because they really express intent and economic purchase behavior, which ultimately helps us predict more intent and purchase behavior. We have found that while the volume of that data source is much lower, it is by far more valuable than the big stream directly.
Big Data, Social Research and Data Anonymization
Hugo: Good to know. We've talked about the masses of data that businesses such as Dstillery have, with respect to how a lot of people are interacting with online environments, with the online world, which is taking up more and more of our daily lives. What are ways in which these masses of data can be used for social research?
Claudia: There have been really great pieces of social research recently. I think especially around the rise of fake news and how people interact with information and propagate information, which leads ultimately to these information bubbles that are being enhanced by the AI itself that's trying to predict what you may want to read about. There are a number of researchers that do incredibly important work, because it goes beyond just understanding the social nature of our modern generations. It really comes to the fundamental questions of how we now progress with democracy moving forward if we no longer have a remote hope for objective information when things are shared algorithmically. [...], I also really recommend the work that's done at The Data and Society Group here out of New York City. Then you also have various of these pieces coming from Google and Microsoft. You see a much increased need for this understanding, and now we also have much more access to what people actually do and to understanding how these processes work.
Hugo: Is there a way for companies and businesses such as Dstillery to anonymize their data and share it with researchers to gain a better understanding of people and society and the model landscape?
Claudia: We specifically here at Dstillery have not taken that step. We have on occasion invited researchers to join us for a period of time to work on our premises on very specific research questions. That goes back to the somewhat sensitive nature of the data that we are collecting, and to some of the legal constraints that have been imposed on us as well. Usually this type of research is done in collaboration between academia and industry through internship programs or visiting scholars, and this model seems to be more promising, because often it's not just about handing over the data. You need to understand so much more about the data collection and the possible biases in what is visible to us and what isn't in order to derive correct scientific conclusions from it. You really have to immerse yourself in this world to truly understand it.
Hugo: Consumers have a relatively complex relationship with advertising. What value can data science in the online advertising space add to the consumer's experience?
Claudia: My overarching sense is that by helping valuable content to monetize, we are part of the ecosystem that allows publishers to be ultimately somewhat independent. With the decline of subscription, many publishers, and even blogs where anybody can express themselves, have to rely on advertising as a primary source of income. I'm not necessarily convinced that we are truly providing much needed information. I think having a possibly less disruptive experience, with advertising being part of the fabric so that it fits both my interests and the topic of the site where it's being displayed, is an acceptable compromise. What I'm seeing with a lot of concern is that advertising has increasingly focused on viewability as a metric, meaning for instance that advertisers only want to pay for viewable ads. This initially makes sense from the perspective of the advertiser, but it then puts the publisher in a really difficult position. As a result you have these absolutely terrible experiences as the user, where the ad is following you around on the page and there's no way to get rid of it. Then of course you have high click through rates, because every time you try to close it something happens. Eventually I think that really only fosters the installation of ad blockers, which becomes a bigger concern to publishers that are trying to provide independent and free content to readers, if large groups of readers install ad blockers.
Hugo: That's a great point because I think there is a certain balancing that we're talking about whereby as a publisher you want to get your stuff out there as much as possible, but you don't necessarily want to spam people at least to the point of annoying them enough for them to take that type of action.
Claudia: Exactly. Yes.
Data Ethics and Data Scientists' Responsibility
Hugo: We're going down the path of discussing a few ethical implications of your work and data science in general. I'd like to go down this path a bit further. You hinted to the idea of biases in data and algorithmic bias. I was wondering if you could speak to some more challenges involved in these areas today?
Claudia: With the vast deployment of automated systems, there has been an increased number of concerns on the ethical side about the implications that these algorithms may have. The simplest or earliest one I think you could refer to is the information bubble, where the algorithm isn't necessarily biased, it's just really good at figuring out what you like to hear. As a result, when it comes to more important things like political information, if you only hear the side of the story that you like to hear, not only does it reinforce your position, but it gives you the delusion of being absolutely right and certain about it. You no longer have to question yourself or seek dialogue with other opinions. I think this is one of the early concerns, and it has nothing to do with the technology even being biased, but with the way our brain processes information and its interaction with a preselection that appeases us very much.
Hugo: This is even known now in popular culture as an “echo chamber”, right?
Claudia: Exactly. This is another term that we have for that. Now the next generation of concerns that were brought forth is with respect to uses in areas such as predictive policing, or even something as simple as job recommendations on various job sites. With that comes the concern that, for better or worse, our society has certain biases. Our behavior is not up to the overall standard that we want it to be. As a result, suppose you now train models on behavioral data where, for instance, you have never hired a woman for this position. Therefore you have no data of a woman ever being successful. Therefore none of the candidates that the algorithm will find will be female. The concern is that we could somewhat accidentally propagate, and potentially even increase, biases that existed in the data that was used to build a model, which now behaves exactly as we used to and not necessarily true to our ideal.
Hugo: This is an example, as you state, of algorithms encoding already existent societal or human biases, which is something we need to be very cognizant of moving forward. I know something else that you’re interested in, though, is the ability for algorithms to create their own biases, which may not be even existing in the data. I'd love to hear your thoughts on that.
Claudia: We touched on this earlier when we talked about bots, and clicks, and even the people working in airports. One of the things I understood is that ultimately when you build a predictive model, it does exactly this: it finds the easiest thing to explain, wherever there is the most signal or the most information. That becomes a problem when different groups of your population have more or less signal, more or less information. The example that I like to bring forth for people to consider: if, for instance, a group of people has a consistently lower usage of technology, I have fewer data points on each person. I would be much less likely to target that person, either with advertising or with a job offer, simply because the model can never quite be sure that this is the right choice to make, and there are other, easier things to predict. Look, for instance, at jobs. If for some reason it is easier to predict success for one gender than the other, although both are equally likely to succeed, then if you simply use your algorithm to rank candidates, you can easily see very strong majorities of the same gender in the top ten candidates presented, although originally 50% of the people who succeeded in that role were actually male, so there was a balanced male-female representation. That's the concern I have with a lot of the conversations today around making sure that your training set is unbiased. It is not enough to ensure that your training set is what I call first-order unbiased, meaning you have the exact representation that you want. You still have to take responsibility for the actions taken on the predictions, because the predictions can be biased again. Then you, as the user or as the platform, have to make a choice to present an equalized outcome again, for instance by picking the top N candidates from both genders.
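The effect Claudia describes can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the two groups "A" and "B", the `signal` parameter standing in for how much data a model has on each group, and the toy scoring rule are hypothetical and are not Dstillery's method. Both groups succeed exactly half the time, but the scores for group "A" are more confident, so a plain ranking favors it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    group: str        # hypothetical group label, "A" or "B"
    succeeded: bool   # ground truth, not visible to the ranking step
    score: float      # toy stand-in for a model's predicted success probability


def make_group(group: str, signal: float, n: int = 100) -> list[Candidate]:
    """Build n candidates, exactly half of whom succeed.

    `signal` is an assumed proxy for how much information the model has
    about the group: scores move at most `signal` away from 0.5.
    """
    candidates = []
    for i in range(n):
        succeeded = i % 2 == 0            # 50% success rate in every group
        spread = signal * (i + 1) / n     # deterministic, in (0, signal]
        score = 0.5 + spread if succeeded else 0.5 - spread
        candidates.append(Candidate(group, succeeded, score))
    return candidates


def top_k(candidates, k):
    """Plain ranking: the k highest-scoring candidates overall."""
    return sorted(candidates, key=lambda c: c.score, reverse=True)[:k]


def top_k_balanced(candidates, k):
    """Equalized outcome: take an equal share of the top slots per group."""
    groups = sorted({c.group for c in candidates})
    per_group = k // len(groups)
    picked = []
    for g in groups:
        picked += top_k([c for c in candidates if c.group == g], per_group)
    return picked


# Group A has a strong signal, group B a weak one; success rates are equal.
pool = make_group("A", signal=0.45) + make_group("B", signal=0.10)
naive = top_k(pool, 10)
balanced = top_k_balanced(pool, 10)

print("naive top 10:   ", [c.group for c in naive])
print("balanced top 10:", [c.group for c in balanced])
```

In this toy setup the plain top ten comes entirely from group "A", even though the underlying data is balanced, while the balanced selection takes five candidates from each group. Splitting the slots equally is only one possible policy; which notion of an equalized outcome is appropriate remains a choice for the person or platform acting on the predictions.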
Hugo: You're aware of this and clearly trying to do these types of things in your work, but do you think enough people are aware of this? If not, is educating them part of our job?
Claudia: I had an interesting experience. I went and gave a keynote at Predictive Analytics World this fall in New York City and I spoke exactly about this. After my presentation the general chair walked up and asked the audience, "Well, how many of you knew that this happened?" I think intuitively most data scientists are kind of aware of it, but in this audience I would say maybe out of the 200 people we had 10, 15 hands going up. The rest of them may have an inkling but possibly not fully thought all the way to the implication of what that means and even a bigger challenge, "What now to do about it?" I still like to give that same talk, although I have been giving it for at least one, one and a half years now, because I do find it very important that as a community we understand the implications of our work and that it's not enough to delegate it even to legal restrictions or things like de-biasing data sets. We still need to take responsibility for the usage of this technology.
Hugo: In terms of responsibility, what is the role of data scientists to think about data ethics particularly in a world where we're reaching a point where I mean advertisers may know us better than ourselves?
Claudia: I wouldn't quite go that far. At least I don't think we need to worry about that specifically.
Hugo: Okay, I was just thinking of the anecdotal example: if someone has displayed interest in sports cars, maybe you advertise flashy cars to them, but if they display an interest in sports cars and your algorithm knows that they may be in debt and also have a history of alcohol abuse, these types of things, what type of ethical considerations need to be in place to help in this type of situation?
Claudia: First off, I think it is important to have an honest and open conversation about it. What I’ve perceived is that you basically have two different groups here: people who do data science for a living, and very rightly concerned citizens, often with insufficient depth of understanding of what is even controllable or can be known about these algorithms. In some sense I am the best police for data science, because I am the one closest to building these algorithms and observing these things. A lot of the examples I talk about, whether this is the case of JFK or even clicks and bots, happen behind the scenes, and it's really my curiosity and diligence that finds those things. I would like us to have a more open discourse about what we expect from this technology and what the right level of comparison is. What I want to talk about here is not exactly the direction that you're going with some of these abuse cases. More so, I mean the failures of machine learning and AI: when there is an accident involving a self-driving car, or when mislabeled pictures show up that could possibly be offensive. What is the expectation? My sense is that society feels that this technology has to be perfect. I think this is where the disconnect in the conversation is, because when you are doing this for a living you do understand that ultimately these systems can be a lot better and can do a lot of good. For instance, diagnosing rare diseases that the doctor you happened to go to in some rural area has never encountered before. Will that system be perfect? Almost surely not. The answer to how we as a society engage with that, in my opinion, has to be one of realistic expectations and a sense of collaboration between machine and human, with a shared responsibility for the action that we ultimately choose to take based on the recommendations that we get.
What if I do observe specific cases that hint, for instance, that some people I observe in the advertising environment are suicidal? Is there something I should do? Do I need to point out to the brand that this might be something to consider, that they somehow have responsibility for? I'm not sure, but I feel I would at least like to be able to speak up without being pushed into the corner of violating privacy, because privacy doesn't make these things go away, they just become invisible.
Hugo: Exactly. You're speaking to openness and transparency, which I think is incredibly important. I also think it's heartening, and it helps, that there are people such as yourself who on the one hand work as data scientists in such businesses, but are also communicators and explainers who take it upon themselves to go out and speak about these types of issues in public forums. That is very welcome and necessary, I think.
Claudia: Thank you, I appreciate that.
What Will The Future Bring For Data Science?
Hugo: We've discussed a lot about the modern data science landscape. What does the future of data science look like to you?
Claudia: Now you're asking me to really predict the future.
Claudia: On the one hand, I think the appreciation for all the upside potential that data has will continue. I don't think this is a fluke. I'm really excited that, even though Big Data as a hype is coming to its end, society, and also institutions and firms, have an increased sense that they should be more data based or data driven in their decisions. I think that's very important. I think it's also very important, as these systems exist, that we as a society become more data literate, because recommender systems are not going to go away and we need to understand that we are living in these kinds of filter bubbles or echo chambers that were mentioned before. What does it mean for data science itself? First off, I'm not worried about automating ourselves away. I mean, we are automating ourselves all the time, but I think the demand for human skill and supervision of data science systems will only rise. Technology really cannot make up for good human intuition and the crucial role it can play in exactly the concerns we have expressed: the things that go wrong, when we can trust the machinery, and how we should interact with it. The tooling is incredible today if you compare it to 20 years ago, and I think we will see more of that tooling being broadly available through either cloud providers or many other open access tools. I do believe the current excitement about deep learning will come to a realization that it's not the answer to every problem. Deep learning is very good for very specific types of problems, and they are really around areas that have a lot of signal. We're talking about vision, where you have very clear rules of the physical world that can be exploited. We're talking about language, where systems are still getting better at translation and automatic conversion of audio to text. Obviously we have also seen this in reinforcement learning, which is where all of these games, Go and so on, come from.
There will be a lot of space for good old kind of solid statistics just on bigger data and simple models. I'm quite optimistic for the field with the understanding that these different tools will find their different places.
Favorite Data Science Techniques/Methodologies
Hugo: Speaking of solid statistics, what's one of your favorite techniques or methodologies for data science? Not necessarily favorite, something that you just enjoy implementing or doing?
Claudia: I'm very old fashioned in the sense that I don't trust myself looking at graphs. Graphs are great if I want to tell stories. If I want to tell the story about people fumbling in the dark, then it's very nice to illuminate these things with information about click rates. I really like to look at data almost running over my screen. That's probably me just being really weird. That's okay, that wasn't what you were asking. I have somewhat ironically taken almost the opposite development to the field. I started out doing artificial neural networks back in '95. Then I downgraded, if you will, to decision trees in 2004 for my dissertation. Today I really value the simplicity and elegance and also transparency that you can get from linear models like logistic regression, or even just simple indexing that you would probably refer to as a form of Naïve Bayes, because it's so much easier to look under the hood and understand what might be going on there. It really has become my go-to tool over the last, I would say, 10 to 15 years. In fact, I won all of my data mining competitions using some form of a logistic model.
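What "looking under the hood" of a logistic model can mean is easy to show in a minimal sketch. The data and the feature names `saw_ad` and `is_weekend` below are hypothetical and chosen only for illustration; the fitting loop is plain gradient descent written out by hand, not any particular library's implementation and not the models Claudia used in competitions.

```python
import math

# Hypothetical toy data: (saw_ad, is_weekend) -> converted.
# By construction saw_ad is informative (3 of 4 convert with it, 1 of 4
# without it), while is_weekend carries no signal at all.
rows = [
    ((1, 0), 1), ((1, 1), 1), ((1, 0), 1), ((1, 1), 0),
    ((0, 0), 0), ((0, 1), 0), ((0, 0), 0), ((0, 1), 1),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, steps=5000, lr=0.5):
    """Fit bias and weights by batch gradient descent on the log loss."""
    n_features = len(rows[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(steps):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in rows:
            # Prediction error for this row: predicted probability minus label.
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            grad_b += error
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
        # Step against the average gradient.
        bias -= lr * grad_b / len(rows)
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
    return bias, weights


bias, (w_saw_ad, w_is_weekend) = fit_logistic(rows)
print(f"bias={bias:.2f}  saw_ad={w_saw_ad:.2f}  is_weekend={w_is_weekend:.2f}")
```

On this data the weight on `saw_ad` settles at roughly 2.2 and the weight on `is_weekend` at roughly zero, so the model's reasoning can be read directly from two numbers: one feature raises the log-odds of conversion, the other does nothing. That direct readability is the transparency being praised here.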
Hugo: Firstly, I love the idea of you just watching data stream across multiple screens. Secondly, I think your passion for interpretable models, for decision trees, for linear models, where you can actually communicate what certain things mean, also speaks to what we were discussing before, your role as a communicator. You can take the output of one of these models and speak about its results to a data science manager, or someone in HR, or someone in the advertising space who isn't technical, right?
Claudia: This is exactly why I gravitate towards it, I think, because initially I was picking out all the fancy stuff. I've come to realize that if you want to impact the world, it doesn't matter what you find exciting. What matters is what you can get other people excited about. Depending on the sophistication level, it's often a really great idea to have the worst possible model. That's the nearest neighbor. The nearest neighbor is awful. It almost never has really good performance compared to some of the more sophisticated methods, because it doesn't really learn anything. You just find something that's similar, but it's very difficult to know what similar means. It has one huge advantage: that's exactly how people think. That's the reason why in advertising we talk about look-a-like models. Now, what we build is not look-a-like models, but it lets people understand, "Yeah, we find other consumers who look like your consumers." That makes sense, they can relate, and they can embrace the technology and start giving it at least a try. Then, after a couple of iterations, you can swap out that awful nearest neighbor and give them a really good predictive model, and they will be very happy moving forward.
Hugo: I suppose it's about establishing trust as well, in that sense. You may have a model that performs better but if nobody has any idea what it's doing they don't know why they should have faith in it or trust it.
Claudia: Trust is definitely very, very important here, and the other part is simply getting them involved, because that's what you can do with nearest neighbors. You can say, "Yup, here are the five most similar other cases." Then the person can say, "Nah, that one doesn't count because that was completely different." You say, "Okay, let's delete it," and then just work with the four. It has this nice communication where they feel that they have become part of something, and at least that was the case in one of the projects at IBM. Trust was a component, but it was also that they were taken seriously and were part of the process, and we learned when our models actually had no data and we would have to build something entirely different for those cases where the customer just knew that this was not appropriate.
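The "five most similar cases, drop one, work with the four" interaction can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the IBM project's code: the `Case` record, the Euclidean distance on two made-up features, the averaging of outcomes, and all of the data are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    case_id: str
    features: tuple[float, ...]   # hypothetical numeric description of a case
    outcome: float                # hypothetical observed result for that case


def nearest_cases(query, cases, k=5):
    """Return the k historical cases closest to the query (Euclidean distance)."""
    return sorted(cases, key=lambda c: math.dist(query, c.features))[:k]


def estimate(neighbors, vetoed=frozenset()):
    """Average the outcomes of the neighbors an expert has not vetoed."""
    kept = [c for c in neighbors if c.case_id not in vetoed]
    if not kept:
        raise ValueError("every neighbor was vetoed; no estimate possible")
    return sum(c.outcome for c in kept) / len(kept)


history = [
    Case("c1", (1.0, 0.0), 10.0),
    Case("c2", (0.0, 2.0), 20.0),
    Case("c3", (3.0, 0.0), 30.0),
    Case("c4", (0.0, 4.0), 40.0),
    Case("c5", (5.0, 0.0), 100.0),
    Case("c6", (6.0, 6.0), 60.0),
    Case("c7", (0.0, 9.0), 70.0),
]

query = (0.0, 0.0)
neighbors = nearest_cases(query, history)            # shown to the domain expert
before = estimate(neighbors)                         # uses all five neighbors
after = estimate(neighbors, vetoed={"c5"})           # expert says c5 "doesn't count"

print([c.case_id for c in neighbors], before, after)
```

Here the five nearest cases are c1 through c5, the estimate from all five is 40.0, and it drops to 25.0 once the expert removes c5. The point of the design is that the expert's objection changes the result in a way they can see and verify, which is much harder to offer with a model whose predictions cannot be traced back to individual cases.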
Advice For Aspiring and Working Data Scientists
Hugo: My final question is, do you have a final call to action for our listeners who are aspiring and working data scientists alike?
Claudia: My sense is, number one, just keep your curiosity and your skepticism. I mean, have fun with what you do, but don't take yourself too seriously, and definitely not your model. Have some appreciation for finding out why something went wrong; that's much more fun and interesting than finding out that something went right. As a philosophy moving forward, be cautious with the things that you build, and I think that plays into being responsible when you hand them over and being clear about where you think the limitations are. First and foremost, just keep your excitement for it, because that will keep you sharp and able to identify these things.
Hugo: Claudia, thank you so much for coming on the show. It has been such a great pleasure chatting with you.
Claudia: Thank you so much for having me.