Introducing Katharine Jarmul
Hugo: Hi there Katharine, and welcome to DataFramed.
Katharine: Hi Hugo. Thanks for having me.
Why so many emails?
Hugo: It's such a pleasure to have you on the show. Before we dive into everything we're talking about, I just want to let you know, I've been receiving far too many emails the past couple of weeks.
Katharine: Oh no.
Hugo: Do you have any indication why?
Katharine: Oh. Okay, yes, yes, yes, emails that ask you for consent. This is consensual email at its best. It's emails asking you if you want to receive more emails.
Hugo: Why is this happening? What's been happening recently that means my inbox is really full?
Katharine: Yes, yes. Probably, if you deal with data, you've already heard of GDPR, or the General Data Protection Regulation, but it went into effect on May 25th, and everybody got a lot of emails just talking about privacy. It was really fantastic. It felt like finally, consensual data collection, we're having these conversations, but yes, I think most people just deleted them all, which is for better or worse maybe what they all expected when they were sent as a mad rush on the final day.
Hugo: For sure, and it seems like a lot of it's opt in, as well, isn't it?
Katharine: Yes. The idea behind GDPR is that it's essentially, as I said, this consensual-driven, so consent-driven in the sense that you have a right to say, "Okay, I'm fine with you using my data in these ways," or, "I'm fine with you collecting my data in these ways," or, "I'm fine with you reaching out to me in these ways."
I think that's a really great step. I think consent for everything is a really cool concept, and consensual data collection is something that I think we are hopefully starting to realize is really the center of doing ethical data science. I think GDPR is a nice step towards that. It gives a lot of rights to European residents.
I wonder how it will also affect data collection for the rest of the world. It appears that people have different approaches for this, as in some people are essentially creating EU only versions of their platform. For example, the USA Today EU site and a few of the other papers and publications published an EU only package. What I am hoping is that it also allows for a little bit more consensual data collection of people even outside of the EU.
How did you get into data science?
Hugo: Great. I think this provides a really nice teaser of the conversation we'll be getting into with respect to data security, privacy, the GDPR, whether it's enough, what we might see outside the EU. Before all of that, I want to find out a bit about you. Maybe you could start and tell us how you got into data science initially.
Katharine: Yes. It was definitely by accident, as with I guess I would say a lot of people in my era of computing, which essentially was that I was a data journalist. I was working at The Washington Post. I had had some history and background in some computer science and statistics, but I didn't enter directly after school and then I found myself in data journalism.
After that, I got recruited to work at a few startups doing some sort of larger scale data collection and data engineering, essentially back in the initial Hadoop days. From there, I went into some ops and security roles, automating deployment and leading teams on those types of things, and then fell back into data, doing some data wrangling after my book with Jacqueline Kazil, Data Wrangling with Python, was published. Since then, I have been focused more on natural language processing and machine learning and lately been thinking a lot about data privacy and data security.
I guess after 10 years in this business, you start to think about the intersections of things you care about and for me, that was definitely an important intersection that I think ties in a lot of the passions I have and experience I have in data science as a whole.
Hugo: Where has that led you now? What are you working on currently?
Katharine: Yes, so I'm currently building a new startup called KI Protect. It's KI after künstliche Intelligenz, which is essentially the German translation of AI. Our idea and our goal, our solution really is to bring about a data science compatible data security layer.
The idea is from my experience and from our experience, Andreas Dewes and I, we have seen that the security community and the data science community are not necessarily overlapping in meaningful ways right now. We're trying to think about how we can bring more data security and data privacy concepts to the data science community that makes them really easy to use, really I would say consumer-friendly in a sense of being easily integratable into systems that you might use to process your data, like Apache Kafka and Spark and so forth, and to make it so that you don't have to have data privacy or data security as a core concept of your data science team. You can just do normal data science and you can use our service to help you enforce privacy and security.
Hugo: Great. It sounds like you're essentially trying to help people keep doing as they're doing, and you're filling this particular gap for them with respect to data security and data privacy.
Katharine: Yes, that's the goal is to make it the plugin for data security or data privacy. Of course, this is a delicate and complex topic. We're exploring what we can guarantee and what integrations make sense for different types of companies. We don't have a full product spread available yet, but this is something that we're actively experimenting with, researching and working on. We're fairly confident that we can come up with a few different methods that allow you to use simple APIs for pseudonymization and anonymization of your data sets.
Fundamentals of NLP in Python
Hugo: Awesome. I want to now get into data security and data privacy, but before that, you mentioned your love for NLP. I just want to let everyone know that you've also got a great DataCamp course on fundamentals of NLP in Python, which I had the great pleasure of working on with you.
Katharine: Yes. It was super fun. I love all the great feedback and so forth from folks. If you're starting to get into natural language processing or you're curious what it's all about, I can definitely recommend taking that, as well as the follow up courses, which allow for some fun experimentation with some of the common and best libraries in natural language processing.
What are the challenges currently facing data security and data privacy?
Hugo: Exactly. Let's jump in. What are the biggest challenges currently facing data security and data privacy in your mind?
Katharine: I think one thing that I've noticed over time is the core competency of most data scientists is not necessarily focused on security and privacy. Now we're starting to see more overlap, perhaps with, for example, the Apple differential privacy team and the Google Brain research that has been focused on security and machine learning, but the average person who has studied statistics or machine learning and who's doing this in the field, they don't necessarily have a strong background in computer security or in data security or info sec, as we might call it, right? I don't see this as a fault of theirs. It's nothing lacking, right? They have a lot of their own specialized training, but the unfortunate circumstance of that is that a lot of the way that we manage and handle data is not necessarily the most secure way and it definitely doesn't always take the ideas of privacy or even user sensitivity in the sense of do I actually need access to full user data.
It doesn't really take these into account very often, and therefore as data scientists, we have access to potentially millions of people's personal data, their messages, their emails or chats, their purchase history. We have access to all of these things, and my question is do we actually need access to all of these to do our job properly? I think this is perhaps a big oversight in terms of how we've built up data management and data science and BI platforms that we use today.
Hugo: You spoke to the lack of focus or knowledge with respect to data security. Do you think this is related to the lack of focus on building ethical systems in general for data science?
Katharine: Well, yes. One of the conversations I've found myself having recently in light of the GDPR is that it's been really painful for people to implement consensual data collection in their data science. Why is that? It's because the software is not designed with the user in mind, right? The software is maybe designed with the end user, the internal team in mind, but it's often not designed with the actual customer or the client in mind.
If we had software that was slightly more driven by the clients' desires or demands, like this kind of touches upon design thinking, then it should be cognizant that when we collect user data, that we have marked when they consented, that we have marked what is the provenance of the data, that we have marked how was the data processed and all of these things.
The fact that data provenance has been more of an aspect of research than actually an aspect of every type of data collection software that you can imagine, this is really problematic, because we have accumulated all this data and for some larger corporations, sometimes they have purchased data or they have aggregated data from data marketplaces and so forth. This means that they now have all of this data, some of which was given directly and consensually, and some of it which was just collected by purchasing power or by buying another company and so forth.
This is a nightmare of course when it comes to GDPR and you have to figure out and sort out what data was given by whom and under what circumstances, but why might we have this problem in the first place? Why can't we just have perhaps data marketplaces where consumers directly sell their data, if they're going to do that, or also why isn't data provenance, essentially where does this data come from and when does it expire, how long is it good for, why aren't these a normal part of how we do data management from the beginning?
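The provenance questions raised here (where the data came from, when consent was given, how it was processed, when it expires) could be stored with each record. The following is only a minimal sketch of that idea; the `ProvenanceRecord` class, its field names, and the one-year expiry are hypothetical choices for illustration, not something described in the conversation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ProvenanceRecord:
    """Hypothetical metadata kept alongside one piece of user data."""
    user_id: str
    source: str                           # e.g. "signup_form", "purchased"
    collected_at: datetime
    consent_given_at: Optional[datetime]  # None means no recorded consent
    expires_at: datetime
    processing_steps: List[str] = field(default_factory=list)

    def is_usable(self, now: datetime) -> bool:
        """Usable only if consent was recorded and the data has not expired."""
        return self.consent_given_at is not None and now < self.expires_at


collected = datetime(2018, 5, 25)
one_year = timedelta(days=365)  # assumed retention period, purely illustrative

# Data given directly and consensually by the user.
direct = ProvenanceRecord("u1", "signup_form", collected, collected,
                          collected + one_year)
direct.processing_steps.append("normalized_email")

# Data bought from a marketplace with no record of consent.
purchased = ProvenanceRecord("u2", "purchased", collected, None,
                             collected + one_year)

now = datetime(2018, 6, 1)
print(direct.is_usable(now))     # True
print(purchased.is_usable(now))  # False
```

With metadata like this attached at collection time, sorting out "what data was given by whom and under what circumstances" becomes a query rather than an archaeology project.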
How do data scientists react to privacy concerns?
Hugo: I'm interested in how you feel the average, if this is even a well-formed question, how the average data scientist responds to this type of legislation being passed? If you can't speak to the average, maybe you could give a variety of responses that you think are paradigms of how the community's responding.
Katharine: Yes. I guess I would say that I have a feeling people are inherently good and want to build ethical systems. This is the viewpoint that I'm coming from, and I think that a lot of people are like, "Okay, this is painful, but I want to be able to do the right thing. I want to be able to do ethical data science. What does this mean? How might I have to change the ways that I currently process data?"
I think it's sparking a lot of conversations that are thinking, "Okay, well, perhaps in the past we haven't done this very well. How might we start again or how might we better do this in the future?" I do think that there are some people that are just like, "I see it as a nuisance," and there's been this big rash of software and other platform vendors that are simply saying, "Oh, well, we're not going to sell to EU residents anymore."
I see this as terrifying, because why would I want to use a service that can't guarantee that they're going to ask me if they can use my data? This I think shows that essentially, I would argue, that there's a big divide between those that see privacy as a burden and those that see privacy as maybe something that we can strive for, that we need to think about and perhaps change the systems and processes that we use in the meantime.
Hugo: How do you think data scientists generally feel about the idea of sacrificing some model performance for having more ethical models?
Katharine: Yes. I think that this is difficult. I've spoken on the topic of ethical machine learning a few times now. A few times, the reaction was very negative, and people were like, "Well, I don't really see why this is my problem." I think that unfortunately, there is some of that idea, like, "Well, if black folks are treated differently by cops, why should I have to essentially change the distribution of my data set to compensate for this?" They say, "Well, the data's there and that's what the data says, and so I'm just going to build exactly what the data says."
I would say that that is a choice and an action in and of itself, and if you're making that choice and action, you're essentially automating inequalities and you're automating biases, societal biases. When you choose to do that, you're making a statement, and I would say that the statement is that you say that those biases are valid. You say that it's valid that people are treated differently based on their skin color from police or that women earn less than men. This is something that you're validating if you just say, "Well, that's what the underlying statistics of my sample say, so that's what I'm going to do."
I've definitely had those conversations numerous times, and then I've also had conversations with people, "Oh wow, this is really cool. This makes sense. It's so nice to know that there's quite a lot of different performance metrics you can use to analyze the ethics or the treatment of different groups from your model."
I think that there's also new energy specifically around FAT/ML and everything that's happening in academia around finding real ways to build ethical models that don't necessarily sacrifice much performance at all.
Hugo: Yes, and I think something you mentioned there is that some people have responded, "It's not my job to think about these things." One thing that data science doesn't have as a profession yet is standards of practice, codes of conduct necessarily. If we think back what's happening historically in other professions, in ancient Greece, the Hippocratic Oath was developed to deal with these types of things for people practicing medicine, right?
Katharine: Yes, yes, yes. I think that if you're building some system that maybe controls some IoT factory device where no humans are affected at all by what you're doing or if you're making some sort of academic model, yes, okay, maybe your impact is very small, but when we're building these systems that interact with humans and now quite a lot that interact directly with we would say the consumer or a person, and affects maybe what the person sees, what they click on, what they think about, what price they pay, and then of course, the massive systems like finance and justice and so forth, this is the impact. We have a growing footprint of the things that data science touches and affects, and because of this, I think that we need to start thinking about if we don't have a Hippocratic Oath, what do we have?
Hugo: I do think there are so many, increasingly more and more such examples emerging. I think one of the ones that I've mentioned a few times on the podcast is judges using the output of a black box model that tells recidivism rate for incarcerated people using the output of that model as input for the parole hearing, right?
Katharine: Yes, yes.
Hugo: Actually, Cathy O'Neil's book, Weapons of Math Destruction, which I recommend to everyone who wants to think about these things, and I actually probably recommend it more to people who don't want to think about these things to check out that book.
Katharine: Yes. There's a new one also called Automating Inequality, which is quite good. I can recommend that one, as well.
Hugo: Yes. We'll link to those in the show notes. I'd love to hear your take on GDPR, and we've moved around, but I'd love to know exactly what it is and what it means for civilians, for users, to start off with.
Katharine: Yes. It means that you have a lot more rights than ever before. If you're a European resident, definitely. If you're another person, then at least perhaps some of these rights essentially, it's like trickle down economics of rights, in that I hope that you have taken some time.
Some of the cool things about GDPR that you may or may not know about is you have the right to delete your data, so you have the right to request that a company delete all of your data. For data science, of course, we're starting to think about this and be like, "What does this mean? What does it mean for my models? What does it mean for my training sets and so forth?" This is definitely something to start thinking about and discussing with your team. How do we create processes that adequately delete or remove a user's data?
There's another right to know how your data is used and how your data is processed, and also to opt out of that processing if you want to. This is, again, something we need to think about as data scientists, how we build our pipelines, how we treat data, and how we allow people to opt in and out of probably certain tasks and jobs that we run on data sets over time. You can think of this almost as a nice flag in a database or as something that you store in a separate queryable database that allows you to say, "Okay, this person has opted in or out of processing."
One of my favorite ones is actually the right of data portability, and this is the ability to port your data from your current service, whatever it might be, to another service and the idea is that the data has to be transmitted in a machine readable way.
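The "flag in a database" idea can be sketched with a small consent table that a pipeline consults before running a job. This is a minimal illustration only; the table names, the `purpose` column, and the rule that a missing row counts as "not opted in" are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
# Hypothetical consent table: one row per user and processing purpose.
conn.execute(
    "CREATE TABLE processing_consent "
    "(user_id INTEGER, purpose TEXT, opted_in INTEGER)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "a@example.com"), (2, "b@example.com"), (3, "c@example.com")],
)
conn.executemany(
    "INSERT INTO processing_consent VALUES (?, ?, ?)",
    [(1, "recommendations", 1), (2, "recommendations", 0)],
)


def consenting_user_ids(conn, purpose):
    """Return ids of users who explicitly opted in to the given purpose.

    Users without a consent row (user 3 here) are excluded by the join,
    so the absence of a record is treated as no consent.
    """
    rows = conn.execute(
        "SELECT u.id FROM users u "
        "JOIN processing_consent c ON c.user_id = u.id "
        "WHERE c.purpose = ? AND c.opted_in = 1 "
        "ORDER BY u.id",
        (purpose,),
    ).fetchall()
    return [row[0] for row in rows]


print(consenting_user_ids(conn, "recommendations"))  # [1]
```

A training or batch job would then select only the returned ids, so an opt-out takes effect the next time the job runs.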
This is also this idea that you have your data perhaps for some app that you use. You would like to try to use a new and different app, and you want to make a request to port your data to that different app, so this again for data science means that you need to create outbound and inbound workflows or streams or something like this that allow people to transmit their data.
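As a rough sketch of what outbound and inbound portability workflows might look like, the example below exports a user's data as JSON and reads it back on the receiving side. The payload layout, the `schema_version` field, and the function names are hypothetical; GDPR asks for a machine-readable format but does not prescribe this one.

```python
import json


def export_user_data(user_id, profile, activity):
    """Outbound: serialize one user's data to a machine-readable document."""
    payload = {
        "schema_version": 1,  # assumed versioning scheme for illustration
        "user_id": user_id,
        "profile": profile,
        "activity": activity,
    }
    return json.dumps(payload, sort_keys=True, indent=2)


def import_user_data(document):
    """Inbound: parse a document produced by another service's export."""
    payload = json.loads(document)
    if payload.get("schema_version") != 1:
        raise ValueError("unsupported export format")
    return payload


exported = export_user_data(
    "u1",
    {"name": "Ada"},
    [{"type": "post", "text": "hello"}],
)
restored = import_user_data(exported)
print(restored["profile"]["name"])  # Ada
```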
I think that this, the data portability, is a real boon also to startups in general because it's this idea that it's kind of like phone number portability, right? It used to be that once you had a phone number and everybody knew it, you were stuck with your service provider until you really wanted to take the big jump and tell everybody you have a new phone number.
I think with data, we've seen these entrenched leaders of data science and data collection essentially. They've been there for now decades, essentially. They've had the advantage of the data that they sit upon, and with data portability, this will hopefully start to shake some things up and create some more competition, because the idea that I can take my data with me and move it to another provider is pretty powerful I think, and also something that I think is a long time coming.
Who owns the data?
Hugo: Yes, me too. I think this is definitely a step in the right direction. I want to pick your brain in a minute about whether you think this is enough or what next steps would look like, but there's a term that both you and I and the GDPR and everything around it uses constantly. The term is your data. Now, when I use Facebook or I use Twitter or I use whatever, what is mine? What do I own in that relationship and in that service?
Katharine: Yes, yes. This is actually a subject of scholarly debate I would say right now, and we're going to have to wait and see exactly how the regulators put this into effect. Now, I'm not a lawyer by any means, but from some of the law articles I've read around this, the intention of the working group that created that article was that it not simply be just the signup form. Their working group notes specifically state that this should be any interactions and data that the user provides the company.
We can think of this as, well, maybe that even goes down to your click stream of data. Maybe that even goes down to every post you have viewed. It probably won't be enforced like that, but we need to think about when we're collecting all of this extra data, when we're collecting and tracking users, what does this mean in terms of the users that have said, "No, please, I don't want to be a part of this"?
How can we respect things like “Do Not Track” and how can we make very clear and evident what we are using data for and have the user want to opt into that? "Hey, if you provide me this, not that I'm going to give you more targeted ads, but I'm going to be able to offer you this extra feature," or something like this.
I think it makes us start thinking about data not just as something that we can use however we wish without really asking about it and ask for every single permission on the phone or track people across devices and all these things, that maybe we should ask first and maybe we should think what data we actually really need and provide a compelling product that means that people want us to use their data.
Hugo: Could this force a bunch of companies to change business models, in essence? I suppose there's the old trope “if you're not paying for the product, you are the product”, right? You have literally got companies that are trying to take as much as possible because of just the value, or let's say the assumed value of the data.
Katharine: Yes, this is also this assumed value, right? One common thing I hear when I go to data conferences and I'm hanging out in the data science tracks or so forth is I hear people say, "Just collect all the data and we'll just save it in case we need it." Some companies have been doing this for decades. They essentially have data from the early 2000s and so forth on users still, and you're sitting there wondering, "What are you actually going to use this data for, and how much of this data do you need?"
Now, I think for somebody that does ad retargeting or something like this, of course, this is the bread and butter, but for the average website or the average app, how much do you think that people would be willing to pay to not be tracked, to not be targeted? Maybe you should start offering, similar to some of the products that have launched last week, a targeting-free or an advertising-free experience.
I'm hoping that the consumer models also start to change around this. Of course, I have no idea what this will mean in the market five years from now or 10 years from now, particularly because most of the offerings so far that have been targeting-free or ad-free are primarily targeted at EU residents and even sometimes not available to US residents.
What does GDPR mean for organizations and data scientists?
Hugo: We've seen and heard what the GDPR will look like and what it will mean for civilians. What about on the other side of the equation? What does it mean for organizations and for working data scientists?
Katharine: Yes, so it means a lot more documentation and a lot more understanding and sharing of exactly how data is processed, where it comes from, so this idea of tracking data provenance, and what it is used for. I think this is fantastic because I have been, I don't think alone, but feeling like I'm sitting here screaming into the void about documentation, testing, version control, like normal software practices for data science.
I think that this is finally the moment where clearing the technical debt that a lot of data science teams have accumulated over time of not versioning their models or not having reproducible systems, not having deterministic trainings and so forth, that this will hopefully be a turning point where we can get rid of some of this technical debt, we can properly document systems that we're using, can have of course everything under version control and automated testing.
All of this is going to benefit you because when you document all of this and you share it, you're essentially fulfilling quite a lot of your duties within GDPR, which is this ability for people to opt out of that processing, so having a process that allows data to be marked as opt out, and then also documenting exactly what processing is used, who are downstream consumers of that data, and where does the data originate from, and under what consent was it given.
I think that this covers quite a lot of what GDPR requirements are for data scientists. The only thing really that's left out is deletion of old data or anonymization of old data, which I think is going to spark hopefully a conversation around how do we expire data or how do we treat data that is old or from previous consent contracts or was purchased and we're not sure exactly how it was collected and under what circumstances.
I think that this idea, if you're in doubt, if you don't know where the data comes from, if you've gone through and you've documented all your systems and nobody has any recollection of where a particular data set or series of data comes from, then you should either delete it or you should go through methods to anonymize it if it's personal data at all. I think that this is essentially a spring cleaning for data science, both in terms of our processing and our data sets.
Hugo: I want to come back to this idea of data anonymization. First, I'd like to know what's been the general response to the GDPR from organizations?
Katharine: Of course, I'm based in Germany, and so of course, the opinion here, from a consumer standpoint and I think from the media standpoint, has been very much that this is a good step. From the businesses here, I think that it has been costly both here and I think everywhere to enforce, to bring yourself within compliance before the due date.
Now, I must remind everyone that everybody had two years to prepare for this, so it was not a surprise that it was going into effect. I think for a lot of folks, unfortunately, this has been costly. I'm hoping that the standards that have now been put in place were not a rush job and perhaps have created better processing that actually allows for this type of compliance in the long-term.
I think that there's also been a boon within Germany and Europe of startups thinking about these problems and starting to offer things, for example, like myself and Andreas with KI Protect, starting to think about what does GDPR mean in the long run, so in the long-term, how do we guarantee better security and privacy and make this just a commonplace thing, not a compliance thing?
Hugo: Am I right in thinking that this doesn't only apply to data from people in the EU, but to data from users that is processed in the EU?
Katharine: I don't know all of the specifics around this, but I can say a winner of GDPR is European data centers, and this is because there's a provision within there that talks about moving data outside of the EU. If data originates from the EU and you want to go process it outside of the EU, you need to explicitly tell people and they need to opt in saying that it's okay for their data to be processed outside of the EU, from what I understand.
There has definitely been a little bit of a pickup in the data center action here, and of course quite a lot of the large companies that process most of their data, let's say in AWS and so forth, this means that finally, I have some instances in AWS Frankfurt, it was always hard to get the GPUs and other things available, and now we're starting to see some parity, which is nice.
I think this is something to think about is when we're moving data all around and we're moving it to different locations in the cloud and so forth, these are real computers in real data centers somewhere, and this means that we also need to think about what implications that has A, within the security of our data, but also B, within compliance.
Hugo: Yes, and I think I was reading a number of tech companies that are processing data and have offices in Ireland for tax reasons, among other things, they may be moving their data processing out in order to perhaps not have to comply with GDPR for the time being.
Katharine: Yes, that makes sense. Yes, of course, in Dublin, there's some very large offices for Apple and Amazon as I understand it and Google and so forth. That's usually the EU-based tech hub, essentially, for the large corporations. Yes, I think this is probably changing the dynamics there and perhaps also changing the dynamics for a lot of the data processing that happens in Luxembourg, as well.
Hugo: What happens if companies don't comply?
Katharine: I think the process goes something like this. You're contacted. You're supposed to have a Data Protection Officer, that's essentially the named person to handle any types of requests and compliance issues. I think that first, you get some sort of warning and they ask you to become compliant. You have some short period to respond to that, and if not, then you can get a fine of up to 20 million euros or 4% of global annual revenue, whichever is higher. It's not a small fine. It's meant to hurt. For this reason, a lot of people have been wondering, "Well, will they go after small companies, small businesses where this might essentially bankrupt them?" Of course, we will have to wait and see how the regulators plan on enforcing this.
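To make the size of the fine concrete, here is a small worked calculation. It assumes the upper tier set out in GDPR Article 83(5), where the cap is the higher of 20 million euros or 4% of total worldwide annual turnover; the function name and the example revenue figures are made up for illustration, and actual fines are set case by case below this ceiling.

```python
def max_gdpr_fine_eur(annual_global_revenue_eur: int) -> int:
    """Upper bound of the top-tier fine, in whole euros.

    Assumes the cap is the greater of a flat 20 million euros
    and 4% of worldwide annual revenue.
    """
    flat_cap = 20_000_000
    revenue_cap = annual_global_revenue_eur * 4 // 100
    return max(flat_cap, revenue_cap)


# A company with 100 million euros in revenue: 4% is only 4 million,
# so the flat 20 million cap applies.
print(max_gdpr_fine_eur(100_000_000))    # 20000000

# A company with 2 billion euros in revenue: 4% is 80 million.
print(max_gdpr_fine_eur(2_000_000_000))  # 80000000
```

The first case shows why small businesses worry: the flat cap can be a large multiple of what 4% of their revenue would be.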
Hugo: This Data Protection Officer, or DPO I think, they're also responsible for if there are any data breaches, right, informing the people affected and whatever the governing body is within even 72 hours or something like that, and personally, not just via a press statement?
Katharine: Yes, yes. There needs to be information sent out to potentially any affected users, as well as of course to the regulation authorities for any data breaches. I believe that this also covers data processor breaches. This is where it comes into effect where if let's say you're reselling data or you're moving data to partners and your partner has a breach, then this is also your responsibility to essentially, they should inform you and then you need to inform the end users.
This hopefully avoids some instances like Equifax and so forth in terms of you can't just sit on the fact that there's a security breach for two or three months and sell your stocks or whatever you want to and eventually like, "Oh yeah, yeah, you may be a victim of identity theft," or something like that.
What does the data privacy landscape currently look like?
Hugo: Can you give me the rundown of what data privacy looks like currently, just the current landscape of how everyone seems to think about it?
Katharine: Yes. Currently, data privacy, as we've been thinking about this at KI Protect, we've been of course investigating where people are coming at this from a variety of markets and so forth. I think currently I see data privacy either as pay to play, essentially, so it's often an add-on that you can buy for large scale enterprise based systems where you say, "Oh yes, okay, I have all these other things and I'd like you to implement this privacy layer here." This is for a lot of the enterprise databases and so forth, something that they've been working on for some time, which is great. I think that that's fantastic that that's available.
Other than that, it's primarily focused on compliance only solutions. This is this idea of you have HIPAA or you have financial compliance regulations and so forth, and these are focused around, "Okay, we are a database or we are a data processor that only focuses making sure that your hospital data or your bank data or something is treated in a compliant way." This is mainly for these data storage database solutions, which again is fantastic, but what does it mean if you actually want to use your own database, and then you would like to use the data in a compliant way?
I think that there's been some interesting startups within this space that are trying to perhaps allow you to use a special connector to your database that does something similar to differential privacy, not quite differential privacy because this is of course very difficult, but similar to differential privacy, or that employs k-anonymity or that employs something else like this.
There's a few companies in the space of essentially trying to be the query layer and then using your data sources below that and then providing some sort of guarantees, whether it be differential privacy without necessarily a long-term privacy budget, or whether it be k-anonymity or whether it be pseudonymization.
It's a lot of these extra add-ons. Other than that, I think most of the privacy conversation has been really led by academia: Cynthia Dwork's research on differential privacy and its implications also within machine learning, as well as some of the great research that Nicolas Papernot and some of the Google Brain security researchers have been working on. These have been I think amazing contributions, but it's research perhaps implemented at Google or with Cynthia Dwork's work with Microsoft and so forth. As far as being available to data scientists at my own startup or something, this has really not been available in a real way.
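Editorial aside: the k-anonymity guarantee mentioned above can be illustrated with a small check. This is a minimal sketch, not how any of the products discussed work. The function name `k_anonymity`, the column names, and the toy records are all made up for illustration. The idea is that a table is k-anonymous if every combination of quasi-identifier values (here, a generalized ZIP code and an age band) is shared by at least k rows.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows that share the same
    quasi-identifier values. The table is k-anonymous for that k."""
    groups = Counter(
        tuple(row[column] for column in quasi_identifiers) for row in rows
    )
    return min(groups.values()) if groups else 0


# Hypothetical, already-generalized records (toy data, not from the interview).
records = [
    {"zip": "130**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "130**", "age": "20-29", "diagnosis": "asthma"},
    {"zip": "148**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "148**", "age": "30-39", "diagnosis": "diabetes"},
    {"zip": "148**", "age": "30-39", "diagnosis": "flu"},
]

k = k_anonymity(records, ["zip", "age"])
print(k)  # smallest group has 2 rows, so the table is 2-anonymous
```

Treating more columns as quasi-identifiers shrinks the groups, which is one reason this kind of guarantee is hard to keep as data sets grow wider.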
Hugo: Right. You've mentioned, or we've discussed a variety of techniques, such as anonymization, pseudonymization, differential privacy, k-anonymity, and we'll link to a bunch of references in the show notes that people can check out with respect to the nuts and bolts of these. My real question is, can we really anonymize data?
Katharine: Yes. This is of course of much debate, right? The gold standard is, of course, differential privacy. The idea of differential privacy is that it's a fairly simple equation when you actually read it. It's the idea that I would not know that you yourself as an individual were a part of any data set based on the queries or the data that I see from that data set, that there would be within a very small epsilon the ability to determine the probability that you are a part of that data set or not.
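Editorial aside: the "fairly simple equation" referred to here is usually written as below. The symbols are the standard textbook notation and do not come from the interview: $M$ is a randomized mechanism (for example, a query that adds noise), $D$ and $D'$ are any two data sets that differ in one individual's record, $S$ is any set of possible outputs, and $\varepsilon$ is the privacy parameter.

```latex
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,]
\qquad \text{for all neighboring } D, D' \text{ and all } S .
```

When $\varepsilon$ is small, $e^{\varepsilon}$ is close to 1, so the output distribution is almost the same whether or not any one person's record is included.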
This is, of course, a really elegant theory, and I highly recommend reading Dwork's work on this, but in terms of actually implementing it in the way that we use data science today, most of the ways that we guarantee differential privacy is using what is often referred to as a privacy budget.
This budget essentially tracks how much information, we can think of it almost as information gain theory, right? How much information about any individual was essentially gained by the other person via the query or via the data that they accessed? Once the privacy budget reaches a certain level, then we say, "Okay, then there can be no more queries that might reveal more information about this individual."
This is difficult, because, in practice, we often have changing data sets. The data set that I can guarantee privacy on today and the data set I can guarantee privacy on tomorrow, this is ever changing. We're gathering more data. As time goes by, we might have more information that we garner and connect about a particular individual. The more that we do this, of course, the less that we can guarantee privacy.
The second thing is that to keep the privacy budget, let's say indefinitely, this would mean that eventually our data would not be able to be utilized, because we would eventually hit the limits of our privacy budget, and unless that privacy budget is reset for some reason, then that person or that analyst or that data scientist cannot query any information that might be related to that individual.
What we see in differential privacy that's been implemented, for example, by the Apple Differential Privacy team or with some of the work that Google has been doing, this is normally a privacy budget within a limited time period, so resetting every day or resetting every few days or something like this.
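Editorial aside: the privacy budget described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the implementation used by Apple, Google, or KI Protect. It assumes counting queries (where one person changes the answer by at most 1), Laplace noise, and simple additive accounting of epsilon. The names `PrivateCounter`, `BudgetExhausted`, `noisy_count`, and `reset` are hypothetical.

```python
import random


class BudgetExhausted(Exception):
    """Raised when a query would exceed the remaining privacy budget."""


class PrivateCounter:
    """Answers counting queries with Laplace noise and tracks spent epsilon."""

    def __init__(self, data, total_epsilon, seed=None):
        self._data = list(data)
        self.total_epsilon = total_epsilon
        self.spent = 0.0
        self._rng = random.Random(seed)

    def noisy_count(self, predicate, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Simple additive accounting: each query spends its own epsilon.
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise BudgetExhausted("no privacy budget left for this query")
        self.spent += epsilon
        true_count = sum(1 for item in self._data if predicate(item))
        # A count changes by at most 1 per person, so the noise scale is 1/epsilon.
        scale = 1.0 / epsilon
        # The difference of two exponential draws is Laplace-distributed.
        noise = self._rng.expovariate(1.0 / scale) - self._rng.expovariate(1.0 / scale)
        return true_count + noise

    def reset(self):
        """Models the periodic (for example, daily) reset described above."""
        self.spent = 0.0


ages = [23, 35, 41, 29, 52, 38]
counter = PrivateCounter(ages, total_epsilon=1.0, seed=0)
first = counter.noisy_count(lambda age: age > 30, epsilon=0.5)
second = counter.noisy_count(lambda age: age < 40, epsilon=0.5)
try:
    counter.noisy_count(lambda age: age > 50, epsilon=0.5)
    third_query_allowed = True
except BudgetExhausted:
    third_query_allowed = False  # the budget of 1.0 is already spent
```

Note that resetting the budget lets queries continue, but the information released across reset periods still accumulates, so the guarantee over the long run is weaker than the per-period budget suggests.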
Hugo: When we think about anonymized data within any organization, and in particular, from civilians who are users of these products, I think one really important question is, how do we know about how our data is being used as users? My question for you is, how technical and how educated do civilians and users on the ground need to be to understand what's happening with their data?
Katharine: Yes. This is interesting, and something that Andreas and I have been thinking about, doing a series of articles and so forth that explain how privacy works and how deanonymization really works at a large scale. I think the average data scientist, they've heard about the Netflix prize. They know about the New York City taxi data, in the sense that with an informed adversary or an adversary with access to potentially some open data, this is quite easy to deanonymize when we're dealing with large scale data sets. If I were to ask my mom, let's say, or my sister, "Hey, do you know if you upload that extra thing to Facebook and then if any of your Facebook data leaks, do you know that the ability for somebody to deanonymize you is essentially guaranteed?" I think maybe not. I don't think we're there in terms of the public conversation.
I do think that breaches like, I forget the name, but the running application that recently had a data ... They released an open data set. Strava, I think it was called? Their open data set essentially leaked information about so called private US military bases or secret US military bases.
Hugo: The fitness tracker, which was mostly used by American citizens, and then you could see in key locations in the Middle East and African nations? Yes.
Katharine: Yes, yes.
Hugo: You could see the military compounds on the map.
Katharine: Yes. This is what happens when we aggregate data, and this is especially a danger for people that are releasing public data, right? You can even think of it this way: if you're selling data or sending it to a partner or something, then when we aggregate this data, even if we say we so-called anonymized it, the data in aggregate can also release secrets. This may not be secrets about an individual anymore, but this may be some sort of secret about the group that uses your application.
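Editorial aside: the kind of deanonymization described in this part of the conversation, where an adversary with access to some open data re-identifies records in a released data set, is often called a linkage attack. The sketch below is a toy illustration only. The function `link_records`, the column names, and every record are invented, and real attacks on large data sets are more involved. It joins a released table that has no names against a public table that does, using shared attributes.

```python
def link_records(released, public, keys):
    """Map each released record id to a name when exactly one public
    record shares the same values for the given keys."""
    index = {}
    for person in public:
        signature = tuple(person[key] for key in keys)
        index.setdefault(signature, []).append(person["name"])

    matches = {}
    for row in released:
        candidates = index.get(tuple(row[key] for key in keys), [])
        if len(candidates) == 1:  # a unique match means re-identification
            matches[row["record_id"]] = candidates[0]
    return matches


# Made-up "anonymized" release: names removed, other attributes kept.
released = [
    {"record_id": 1, "zip": "10001", "birth_year": 1980, "gender": "F", "rating": 5},
    {"record_id": 2, "zip": "10002", "birth_year": 1975, "gender": "M", "rating": 2},
    {"record_id": 3, "zip": "10003", "birth_year": 1990, "gender": "F", "rating": 4},
]

# Made-up public data the adversary already has.
public = [
    {"name": "Alice Example", "zip": "10001", "birth_year": 1980, "gender": "F"},
    {"name": "Bob Example", "zip": "10002", "birth_year": 1975, "gender": "M"},
    {"name": "Carol Example", "zip": "10003", "birth_year": 1990, "gender": "F"},
    {"name": "Dana Example", "zip": "10003", "birth_year": 1990, "gender": "F"},
]

reidentified = link_records(released, public, ["zip", "birth_year", "gender"])
print(reidentified)  # {1: 'Alice Example', 2: 'Bob Example'}
```

Record 3 stays ambiguous only because two people in the public table share its attributes, which is the intuition behind the k-anonymity idea discussed earlier in the episode.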
Is GDPR enough?
Hugo: Right. It saddens me to say this, but we're coming to the end of this conversation. Something I mentioned at the start is that GDPR I think, as you said as well, is very necessary and timely. My question for you is, is it enough and what do we need to be doing or what would you like to see in the future to make further steps in this direction?
Katharine: Yes. GDPR by no means guarantees anonymization. This I think might be something that we should really push for within the data science and machine learning community: how can we solve this very difficult problem, or how can we at least make some inroads to this problem, so that when there's a security breach or when there's some issue or when somebody gets their laptop stolen and oops, they had a bunch of customer data or other sensitive data on it, when these things happen, we usually can stop them at the source, right? Maybe we don't necessarily need to always use complete personal data to build a model. Maybe we can start thinking about how to privatize our data in a way before we start the data science process. Again, this is definitely something we're thinking about and working on at KI Protect, but this is something that I really hope overall as a field we can push forward.
It has some interesting implications as well with ethics. There's a great paper, again, the primary author was Cynthia Dwork, comparing this idea of differential privacy to also the same basis of ethics, in the sense that if you do not know my race or if you do not know my gender or my age or something like this, you have the potential to build a fairer model. I think that these have interesting overlaps and implications for our industry, and I really hope that we start to think about them as a wholesale solution, not just as a, "Oh, compliance only means I have to do this much." This is what I'm hopeful for and something that I look forward to seeing more in research and also chatting more with my peers and so forth.
Hugo: Yes, I like that, because it sounds like, in a certain way, mindful data science in the sense that you just don't take all the data you have and throw a model at it and see what comes out, right?
Katharine: Yes. You think about the implications of any data that you share or expose, both to your team internally and to anyone externally, and you ask, essentially, "Would I want somebody to do this with my data?" It's the golden rule of data science.
Call to Action
Hugo: Yes, great. Do you have a final call to action for our listeners out there?
Katharine: Yes, sure. You can check out all of the work that we're working on, and we're looking for feedback still with KI Protect. If you want to reach out, we're at KIProtect.com. Also, just if you're working within this space, if you're thinking about these problems, keep at it. You're not alone, and also, feel free to reach out. I think that we need to create a really vocal community within data science that says these are important, these are essential, and that this is not only for researchers, although I'm a really big fan of what the research community has been doing. This is also something that practitioners care about, and we want to be able to implement what we're seeing in research and the advances that we're seeing in terms of potentially guaranteeing privacy-preserving machine learning. We want to see this within the greater community and within the tools and open source projects that we love and use.
Hugo: Katharine, it's been such a pleasure having you on the show.
Katharine: Thanks so much, Hugo. I really appreciate it.