Data Science at the BBC
Hugo speaks with Gabriel Straub, the Head of Data Science and Architecture at the BBC, to discuss data science and machine learning at the BBC and much more!
Gabriel is the Head of Data Science and Architecture at the BBC where his role is to help make the organization more data informed and to make it easier for product teams to build data and machine learning powered products. He is an Honorary Senior Research Associate at UCL where his research interests focus on the application of data science to the retail and media industries. He also advises start-ups and VCs on data and machine learning strategies.
He was previously the Data Director at notonthehighstreet.com and Head of Data Science at Tesco. His teams have worked on a diverse range of problems from search engines, recommendation engines, pricing optimization, to vehicle routing problems and store space optimization.
Gabriel has an MA (mathematics) from Cambridge and an MBA from London Business School.
Hugo is a data scientist, educator, writer and podcaster at DataCamp. His main interests are promoting data & AI literacy, helping to spread data skills through organizations and society and doing amateur stand up comedy in NYC.
Hugo: Hi there Gabriel, and welcome to DataFramed.
Gabriel: Hello, thanks a lot Hugo for having me.
Hugo: Such a pleasure to have you on the show. And we're here today to talk about your work as head of Data Science and Architecture at the BBC, how you're thinking about democratizing and spreading machine learning through the organization, how you think about data products, machine learning as a service, and content recommendation. All of these incredibly exciting and modern things, but before we get into all of that I'd like to find out a bit about you. As we all know, the BBC is a huge organization, and I'm sure there are a lot of opinions about what the head of data science's job actually is. So I'd like to know what you do, but before that I'd really like to know what your colleagues say or think that you do.
Gabriel: That's actually a really good question. I think as in any large organization, and the BBC has about 20,000 people that work there, there's quite a different understanding of machine learning and data science. So probably if you ask some people they would not know at all what it means, some people would probably assume that it's something related to understanding audiences, so the kind of stuff that we potentially might want to call analytics, and then some people who have a bit more of a detailed understanding would probably tell you that a lot of the work that we do is around building recommendation engines and other kind of algorithms that help improve the audience experience. And actually, one of the big challenges...
Hugo: That's great. Now, I love this idea of the hopes and the hypes, because we're also talking about constraints and what ML can do, machine learning can do, and what it can't do, and what data is good for and what it isn't, because there's so much hype around this space that I think a lot of people think data and AI are capable of anything, right? But it's about using it wisely and mindfully, right?
Gabriel: Yeah, definitely. There's a lot of this concern, especially in the machine learning community, of being hired as a data science savior, as the guy who comes in and is expected to save the business, when the business isn't actually ready and hasn't set in place the right kind of engineering and machine learning basics that you, or data basics, sorry, that you might need in order to make that happen. So I totally agree with you, it's around being actually quite knowledgeable. So for me, a lot of my job I would consider more data product. So being aware of what you can do, and what actually the problem is that you're trying to solve, and trying to figure out how do you bring that together? So the possible with the needed, if that makes sense.
What do you do at the BBC?
Hugo: Yeah, that makes perfect sense. Well now, the second to final point was around building recommendation engines and other algorithms to help improve audience experience, and it strikes me that that's probably one of the closest, besides trying to educate around ML, but maybe you can just give a few words about what you actually do at the BBC.
Gabriel: Yeah, so in a way I talk about wearing two hats, and on one side my job is to try and help the organization get a bit better, a bit more consistent in terms of how we tackle data and machine learning problems. So this is when I am wearing my architecture hat, because we've been around since 1922, so almost 100 years, and we've invented a lot of the broadcasting technology. We've been quite instrumental in inventing radio, TV, etc., and the way that we were really good at inventing these new things is we almost built separate organizations, and that means that we have a lot of data in lots of different places. Unfortunately today our audiences really want to have access in one place, and they don't really care whether something's called radio, or TV, or any of that stuff that we set up in our organization to run smoothly. But because we've invented all of these different things, we have quite a siloed data approach.
Gabriel: So part of my job is to try and address some of that by developing consistent approaches to storing data, surfacing data, and using machine learning on top of it. But also part of my job is to run a team that is called Data Lab that actually builds machine learning algorithms that provide better audience experiences, and generally that falls into two areas. So recommendations, as you mentioned, so how do we bring the right piece of content, or the right service in front of the audience, given their interest and their context? And the second thing is what we call enrichment, which is all about how do you actually find a piece of content? Coming back to this whole thing that we've been around for 100 years. 100 years ago people didn't think about tagging content in such a way that would later be surfaced through recommendation engines. So that means that we have a whole bunch of content that is badly described, and there is now a big question around what do you do with that?
Hugo: Interesting. So is enrichment related in that sense to discoverability?
Gabriel: In general we talk about metadata in that space, which technically isn't quite the right terminology, but it's how do we find the right descriptive data so that it can be surfaced through be it search engines, recommendation engines, or any other process that you might want to find content in.
Hugo: So you mention that the BBC's been around since 1922, and this is something that's incredibly interesting to me about your role, and the team that you've built out and are building out, that a lot of the time data science is synonymous with tech, and a lot of people think of the tech stack used in tech companies and online companies, but in this case, and a lot of your history actually in data science, is bringing data analytical and data science tools to companies that have existed before tech, and predate tech by a long shot. So I thought we could use that idea as a springboard to discuss how you actually got into data science originally, and your career trajectory up until now.
Gabriel: One of the interesting things though is you could argue maybe the BBC has always been a tech company, right? So we start as bringing together radio, which back then was high tech, and then TV, we were at the forefront again of that. So in a way we've always been tech, it's just that tech has shifted significantly over the last hundred years, so we have a slightly different legacy compared to maybe a pure online based organization.
Hugo: I love it.
Gabriel: In a way, I almost fell into data science coincidentally. So I have a mathematics background, so I understand that side of things, I then went off and worked as a management consultant for a couple of years, and came back to the UK to do an MBA, and I joined Tesco as head of data science, or back then actually I had a fairly long title along the lines of Head of advanced algorithms and forecasting load optimization for general merchandise, something along those lines.
Hugo: What a mouthful.
Gabriel: Exactly. So the longer the title, the more difficult it is to know what you're doing. But the idea was that they had already, Tesco actually was one of those other companies that you don't really think about as a tech company, but actually in the nineties had developed two amazing pieces of technology. So in '96 they created this thing called Clubcard, which was before the time of Google, they were dealing with big data. So basically they kept track of all the purchasing that you did with this Clubcard and in return would give you one penny off for every pound that you spent. So they already were working with data innovation back in those days, and the second thing they were really, really good at was forecasting load optimization. So one of the reasons why Tesco had higher margins than a lot of other retailers was because they were really good at managing stock in their grocery business, and therefore had very high availability and really low waste.
Gabriel: So I came in after my MBA in order to help them build a similar capability, but related to the general merchandise business, and general merchandise, while you can inherit some of the stuff you've learned from forecasting how many cans of tinned beans you need to buy, it's quite a different supply chain. So in a lot of your grocery business your stuff comes from a warehouse that's maybe one or two days away, it's fresh, it doesn't have long lead times, there's quite a lot of turnover, therefore your stock actually comes into the business and disappears quite quickly. For the general merchandise, at least for a place like Tesco, your lead time is very long, because it gets produced in China, so you're certainly not talking about a week long lead time, but a six month lead time, your sales rates are a lot lower, etc. So there was a bunch of new challenges that we had to resolve.
Gabriel: So my job was really to be a translator, someone who could speak enough of the business to understand what this was all about and then try and translate it into maths, and someone who could understand enough maths to make sure that we could hire and build the right team, and then translate some of the concerns and constraints from maths back into the business.
Hugo: And when was that?
Gabriel: So that was in 2012.
Hugo: That's quite prescient in a lot of ways, because what we do see now, we actually see the emergence of a role called data translator coming out now, which serves that purpose in a lot of respects.
Gabriel: Yeah. So I now call it more of a product role, because in general the product person is the person who tries to understand what is feasible technically, and what the customer wants, and tries and brings those two things together. But yeah, 2012, there was no such thing as data science, at least not in the UK. That was slowly coming across the pond just in that year. So actually we called our team commercial science, because we felt it was all about the science that helps it be commercially more successful, which was also quite on purpose that we didn't really want to focus on the data, we wanted to focus on the impact that we could create.
Career up to the BBC?
Hugo: So what happened in your career then to take you to the BBC?
Gabriel: I was at Tesco for a couple of years, I'd built up a larger team there, by the end we were looking at anything from classical operations research type questions. So how do you do vehicle routing from your online deliveries, or how do you optimize a fulfillment center? We were doing things that we were kind of in the trade space, so beyond forecasting of demand we were also worrying about how do you optimally price a product, what's the right range to have online, etc. So after Tesco I joined Not On The High Street as a data director, and there my job was really to try and understand how do we get a bit on top of the data we have, how do we build a data democracy? So one of the KPIs that I was quite keen on is the percentage of our colleagues who'd actually used data on a weekly basis as part of their job, and then also how do you slowly introduce some slightly more sophisticated machine learning in order to automate some of the decision making and just create a better audience experience.
Hugo: So I like this idea of data democracy, and trying to spread data use throughout organizations as widely as possible. What does that mean though for someone to use data? Like someone in a marketing role, or a sales role, do they need to be able to code, or is working with a GUI enough? What are we talking about?
Gabriel: So that was a lot of the things we were trying to figure out. I think there's a bit of this data literacy, so how do you teach people the right understanding to ask the right questions, because actually conversion rate is not actually the same if two people use it. So you could have a conversion rate that's based on top of views, or you could have a conversion rate that's based on audiences. So you have to have a certain understanding around why would you choose one over the other. So there was a bit of that just giving the right skills, and then a lot of the other stuff that my team was working on was trying to provide the right tooling that would make it as simple as possible, and in my view there's still a certain amount of SQL that's quite useful in that space, but there are now great tools out there that actually allow you to almost create that, not quite dashboards, but pre-computed queries where then people can just put in parameters, and they can play around with it, and they can really learn some of the SQL.
Gabriel: So we tried to also teach people SQL, we tried to teach people a bit more how to use Excel as well, and then just making sure that they knew where the dashboards were. I think the most important thing for me though, was knowing how to ask the right questions, and knowing what questions could be answered with data, and actually answering the questions with the data, rather than trying to use data in order to confirm opinions that they already had. So it was mostly, to be honest, a cultural thing. More than anything else it was around getting people to not think that data is on the other side to creativity, but actually data and creativity work hand in hand.
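Editor's note: the conversion rate point above can be made concrete with a small sketch. It assumes a hypothetical log of page views (the `PageView` record and the sample data are invented for illustration and do not come from the interview) and shows how the same log yields two different "conversion rates" depending on whether the denominator is views or distinct visitors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageView:
    """Hypothetical record: one page view by one visitor."""
    visitor_id: str
    purchased: bool  # True if this page view led to a purchase


def conversion_rate_per_view(views: list[PageView]) -> float:
    """Purchases divided by the total number of page views."""
    if not views:
        return 0.0
    return sum(v.purchased for v in views) / len(views)


def conversion_rate_per_visitor(views: list[PageView]) -> float:
    """Visitors with at least one purchase divided by distinct visitors."""
    visitors = {v.visitor_id for v in views}
    if not visitors:
        return 0.0
    buyers = {v.visitor_id for v in views if v.purchased}
    return len(buyers) / len(visitors)


# Invented sample data: one visitor browses three times and buys once,
# another visitor looks once and leaves.
views = [
    PageView("alice", False),
    PageView("alice", False),
    PageView("alice", True),
    PageView("bob", False),
]

print(conversion_rate_per_view(views))     # 1 purchase / 4 views = 0.25
print(conversion_rate_per_visitor(views))  # 1 buyer / 2 visitors = 0.5
```

Both numbers are legitimate; which one to report depends on the question being asked, which is the data literacy point being made here.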
Hugo: For sure, and that's something we think a lot about here at DataCamp, of course, as well. We're trying to spread the use of data tools and data techniques through organizations, not only for data scientists or analysts, or this type of stuff, but for managers, people at C-level, trying to figure out how much data they need to be able to speak and know about in order to do their jobs as well as possible, essentially.
Gabriel: I think this old world where there was maybe a team that was responsible for data probably doesn't work anymore. I think everyone has to have a certain amount of data literacy now. It's not acceptable for anyone in the organization to say that they can't write, right? Writing is just one of those basic skills, and my assumption is that maths and some sort of minimal data literacy, and potentially even programming is going to be one of those things that will be just a basic skill that will be required.
Hugo: I think that's a bright vision for the future. I want to jump in and talk about your work at the BBC in particular, but I just want to preface that by saying I love this idea of the BBC being a very serious tech, and tech driven, and tech forward company from 1922. I also really like that you mentioned, although Tesco for example and retail and grocery and these types of things aren't historically tech per se, but they are actually huge innovators in the data space. As you said, and as we know, loyalty cards are a great example of seeing what people buy, segmenting them, and making recommendations to them based on what they've bought previously, right?
Gabriel: Yeah, as you said, right? So that was '96, that was before Google was created, and if you can imagine Tesco has thousands of stores, those stores have quite a few tills, lots of people going through these tills. It wasn't back then real time analytics, but there was still a lot of data that was created through all of these purchases, and Dunnhumby, which was the organization that Tesco then bought, who was dealing with all of this, they were able back in '96 to analyze all of that data in order to provide you with coupons, and it was worth enough money for Tesco to provide you with one pence on every pound, and that might not seem like a lot, but it's one percentage point of margin that you're giving up in retail, which is a low margin business.
Hugo: No, that's a huge amount.
Gabriel: It's massive, right? So in the good times, now Tesco probably has a margin of somewhere between 3% and 4%, or maybe slightly higher, but definitely below 5% these days. So actually you can imagine, in a way, how brave it was as a decision to say, "Okay, it's worth us giving this to customers, because we believe that gathering that data is worth it." And similarly, the BBC went online and created BBC News in 1997. That was at the beginning where there were probably still quite a lot of people who were thinking that the internet is probably just a passing phase, and it's probably going to disappear. There's quite a lot of innovation that actually happens in the digital space in the companies that you might not consider as being your native tech companies.
Hugo: Yeah. I actually remember as a teenager in the late nineties being quite surprised at all the progressive work the BBC, and I think The New York Times were doing at the time as well, those are the two that caught my attention, at least at the end of high school.
Gabriel: Yeah. I think the BBC also created the BBC Micro, right? Which was one of those things that actually introduced lots of people into programming, and I think it is easy to forget that there were big companies that were at the top of their tech game before Google, Facebook, and all of these friends these days. And I think these big companies are still there, they're still surviving, and they're still innovating in new space, they're just maybe getting a bit less of the press.
Aspects of the BBC
Hugo: Yeah, I think so. So, let's dive in and talk about just what aspects of the BBC, business, content, otherwise, that you think data analytics and data science can have the biggest impact on?
Gabriel: When I think about this kind of stuff, and you look at it from a strategic perspective, what I'm really, really interested in is, the way that I find it's quite useful in this way, is to think about it from a value chain perspective, because for me data science and analytics and all of that stuff is all about decision making, it's all about decision support and making sure that you scale good decision making. That's where I think it's really, really, really powerful, and our value chain, simplistically, is around planning, commissioning, producing, scheduling, and in serving that kind of content, and then there's a whole bunch of operations that underlie all of that stuff. But probably the areas that are the most obvious ones, or the most exciting ones at the moment is, the obvious one is around how do we get the right content in front of audiences in the right way?
Gabriel: So I talked about recommendation engines earlier, and what makes this particularly interesting with the BBC is that we have audio content, we have video content, we have text content, we have things like weather, which is probably text, but actually really is data, we have interactive games for the younger people, we have recipes, we have pictures. So we have a whole bunch of stuff, so in a way we're basically Netflix, and Spotify, and CNN, and a weather channel combined. And that makes it significantly more challenging to bring the right content in front of the right users at the right time.
Hugo: And what type of approach do you use, or how do you think about this problem?
Gabriel: So at the moment a lot of our approach is still quite in the breaking it down into areas. So we've had iPlayer since 2007, so it was one of the earlier video on demand services, and iPlayer has a recommendation engine that currently is very much focused on showing you more iPlayer content. Similarly, we have a product called Sounds, which is our audio product, and again there what we show you inside is more audio products. We're now trying to figure out how do we crack that, and actually it's not necessarily only an algorithmic problem, but it's also a product problem. So if you are in the space where you're watching videos, when does it actually make sense for us to provide you some audio? When does it make sense for us to provide you some text? So some of the stuff that we'll probably start doing first is more understanding what kind of content you've consumed over all of our product portfolio, and then using content you've consumed somewhere else in order to provide you with more relevant content on that thing.
Gabriel: So, for example, if you've read a lot about science and technology, then maybe that gives us a hint that you might be interested in Planet Earth, or that kind of documentary style stuff when you're on iPlayer, so maybe it gives us the opportunity to recommend that to you, even if you haven't consumed any of those types of content.
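Editor's note: the cross-product idea described above can be sketched very simply. This is not the BBC's recommender; it is a minimal illustration that assumes every item carries a list of topic labels, builds a topic profile from what someone read in one product, and uses it to rank items in another product. All data, field names and the "Cooking Show" title are hypothetical.

```python
from collections import Counter


def topic_profile(consumed: list[dict]) -> Counter:
    """Count how often each topic appears in a user's history, across all products."""
    profile = Counter()
    for item in consumed:
        profile.update(item["topics"])
    return profile


def rank_candidates(profile: Counter, candidates: list[dict]) -> list[str]:
    """Rank candidate titles by summed topic counts; drop items with no overlap."""
    scored = [(sum(profile[t] for t in c["topics"]), c["title"]) for c in candidates]
    scored = [(score, title) for score, title in scored if score > 0]
    # Highest score first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored]


# Invented history: the user has only read news articles, never watched video.
history = [
    {"product": "news", "topics": ["science", "technology"]},
    {"product": "news", "topics": ["science", "nature"]},
]

# Invented video catalogue with topic labels.
iplayer_catalogue = [
    {"title": "Planet Earth", "topics": ["nature", "science"]},
    {"title": "Cooking Show", "topics": ["food"]},
]

print(rank_candidates(topic_profile(history), iplayer_catalogue))
# ['Planet Earth']
```

The sketch only works if content in both products is described with a shared vocabulary, which is why the content metadata discussed later in the conversation matters so much.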
Hugo: Something you said there I want to zoom in on, is this idea about making recommendations across different products, because if I recall correctly, historically a lot of the products, as with a lot of orgs, but at the BBC, have been siloed, right? So you even have all the data in a variety of different places.
Gabriel: Yeah. So, it depends a bit on what you mean by data. We're quite lucky that we went on a journey a couple of years ago to try and at least bring a lot of our audience data together, so that at least helps us to understand what audiences are interacting with. Now we have a bit more work to do in order to bring content data together, because actually there's not much point in me knowing that you've watched a clip with a certain ID if I don't know what the clip is about. So there's still a bit of work there, but yeah, you're exactly right, and that comes back to a bit of that history of having been here for 100 years and actually always building it up separately, and actually from the fact that we are still very heavily a linear broadcaster. So our TV channels, our radio channels are still what produce the most amount of our audience engagement, and they have a very different way of thinking about data than you would have in an online channel.
Hugo: And it also speaks to a point that you mentioned in passing earlier, that before we can even solve these types of challenges, we do need to get all the data, there's a big data engineering challenge that happens before we can even solve these problems, right?
Gabriel: Yeah. Sometimes we don't even have the data. So, articles is one of those examples. We use tags in the articles, we developed a system in 2012 to help us with sports. So actually we were once upon a time, we probably still are, one of the biggest users of linked data. So in 2012 we developed these ideas around how would you be able to track medals in sports across different people, so we knew which kind of personalities were related to which sports, and to which country and stuff like that, so we could actually give you interesting new ways of navigating our content set. We use something quite similar in the news world, but in the news world it's a bit less clear what is a sensible tag for people, because they don't really understand why that tag actually creates value.
Gabriel: So our most common tags are 'UK' and 'politics', and these tags are probably not descriptive enough to really drive personalized recommendations, because just because you care about the UK, you probably don't care about a lot of the UK stuff, and similarly with politics actually. You might care about UK politics, but a lot less about politics in South America, and finding the right processes to bring the right granularity of data into our systems is one of the things that we're working on at the moment as well, to try and look into that.
Hugo: Interesting. That's a big challenge. Actually, that reminded me, you gave a talk, which we'll link to in the show notes, in which I recall you mentioned a related challenge, which is different uses of terms and tags within different parts of the organization. I think the example you gave is Manchester City, right? So if a sports broadcaster speaks about Manchester City, they might be talking about a team, whereas somebody else might be talking about the actual city. So that tag might mean very different things in different contexts.
Gabriel: Yeah, and I think the other example that I tend to talk about is 'pirates', because pirates has at least three meanings. So they can be nice pirates that you have in children's programs, and then you have software pirates, and then you have the Somali pirates that kill people.
Gabriel: And it actually becomes quite a problematic one if you confuse them, because you definitely don't want to show a kid the Somali pirates just because they've just been consuming a TV children's program.
Hugo: No, we've worked very hard as a society to convince children that pirates are great, essentially.
Gabriel: Yeah, and recent history has shown us that actually maybe, depending where you are, you might disagree with that statement.
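Editor's note: the 'pirates' and 'Manchester City' examples come down to the difference between a tag's label and the concept it refers to. The sketch below is a hypothetical illustration, not the BBC's tagging system: the `Tag` structure, the concept identifiers and the content titles are all invented. It shows why matching on the label alone mixes up unrelated content, while matching on a unique concept identifier does not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    label: str       # human-readable text, possibly ambiguous
    concept_id: str  # hypothetical unique identifier for the intended meaning


CHILDREN_PIRATES = Tag("pirates", "concept/fictional-pirates")
SOFTWARE_PIRATES = Tag("pirates", "concept/software-piracy")
SOMALI_PIRATES = Tag("pirates", "concept/maritime-piracy")

# Invented content items, each with a set of tags.
content = {
    "Children's pirate adventure": {CHILDREN_PIRATES},
    "Feature on illegal downloads": {SOFTWARE_PIRATES},
    "Report on ship hijackings": {SOMALI_PIRATES},
}


def related_by_label(tag: Tag, items: dict[str, set[Tag]]) -> list[str]:
    """Match on the text label only: ambiguous labels pull in unrelated content."""
    return sorted(t for t, tags in items.items() if any(x.label == tag.label for x in tags))


def related_by_concept(tag: Tag, items: dict[str, set[Tag]]) -> list[str]:
    """Match on the full tag, including its concept identifier."""
    return sorted(t for t, tags in items.items() if tag in tags)


print(related_by_label(CHILDREN_PIRATES, content))
# All three items, including the report on ship hijackings.
print(related_by_concept(CHILDREN_PIRATES, content))
# Only the children's programme.
```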
Hugo: How do you think about tagging in general, all this historical data? I mean, how do you think about labeling it? That seems like a huge task and a huge challenge.
Gabriel: So I think there's going to be probably two approaches that we're going to use. One is there will be a certain amount of manual tagging, probably for the newer content where we just need to get better about it, where we need to create more consistent approaches, processes and tags, and obviously this is another area where machine learning can be quite exciting. So already our R&D team has been working on this for a while, so we have pretty decent tools that allow us to do topic extraction, or entity extraction out of text. These have been very heavily trained on news, so we'll have to see how do they work for something like drama, or maybe articles that are a bit, or topics that are maybe not so update-y but more entertain-y or informing.
Gabriel: We've also worked on things like facial detection, where we have a fairly good system, and particularly for British politicians and stuff like that, that maybe the big commercial products care less about. And there's a whole bunch of other suite in there that our R&D department has been working on, and is now working on making available to the rest of the business. That is quite exciting, because it means that we can then try and find ways of getting a lot more data out of all of our archives. And coming back to the fact that we've been producing content for almost 100 years, we actually have quite a lot of archival content, and most of the commercial stuff as well is quite expensive. So if you were going to run any of the existing machine learning as a service tools across that, probably that would not be affordable for us. So it's great that we have something to start with that was built internally for our use cases and our needs.
Hugo: So in terms of making recommendations as well, a big challenge in the recommendation space these days is considering filter bubbles and echo chambers. So maybe you can speak to how you think about that. I have a potentially related question, so I'm going to throw two questions at you at the same time, feel free to answer them in any way you want. Whether you have humans in the loop with respect to these types of recommendations as well, like some sort of human editorial role?
Gabriel: Yeah. So I think, first of all, echo chambers for me are less issues of machine learning than issues of business models. Now, that doesn't mean that the data scientists aren't responsible for it, but the data scientists are basically optimizing for something that they're being asked to optimize for by their business, right? So you tend to have echo chambers or filter bubbles particularly in places that are related to the attention economy. Where the product itself has an incentive to keep you engaged or on the platform for as long as possible, so they can show you as many ads as possible. This is not quite our business model. Our business model is that we are funded through TV license, we would like you to have a positive opinion of the BBC, and we believe that the more you interact with us, the better it is, but we don't have to drive you quite that hard because we don't serve you any advertising. So that's one of the things.
Gabriel: The other thing is we do have humans in the loop. So for us, again, we've been around for 100 years, this whole thing, also where we sit as a public broadcaster, we have to be impartial, we have to be objective. So actually, we've already had to deal with this thing around how do you make sure that people get multiple sides of a story? We've had to deal with this for 100 years, and we've had very good editorial guidelines and processes in place to help us with this. The way that we deal with this is that we actually have an editorial person that works very, very closely with our team.
Gabriel: So as we build these algorithms, we will show the outputs to her, and have discussions with her around what do you think happens if we do this, how do these results look for you? In practice what that means is, if we notice that your horizon is narrowing down too much, we stop recommending you purely, so only 50% of the content going forward will be recommended out of the algorithm, and then the 50% of the rest will be, again, a cold start, where we've curated certain topics and brands that we think are relevant to the audience that we're trying to hit. So it's something that we're very conscious about.
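Editor's note: the 50/50 mix described above can be illustrated with a short sketch. The interview does not say how a narrowing horizon is detected or how the two lists are combined, so everything here is an assumption: the function name, the `algorithmic_share` parameter and the sample item names are hypothetical, and only the mixing step is shown.

```python
def blend_recommendations(
    algorithmic: list[str],
    curated: list[str],
    n: int = 10,
    algorithmic_share: float = 0.5,
) -> list[str]:
    """Fill part of a slate from the algorithm and the rest from a curated list.

    Assumes both input lists are already ordered best-first.
    """
    n_algo = int(n * algorithmic_share)  # slots given to the algorithm (rounded down)
    picked = algorithmic[:n_algo]        # slicing returns a new list
    seen = set(picked)
    for item in curated:
        if len(picked) >= n:
            break
        if item not in seen:             # avoid showing the same item twice
            picked.append(item)
            seen.add(item)
    return picked


# Invented example: a slate of four items, half personalised and half curated.
algorithmic = ["a1", "a2", "a3", "a4"]
curated = ["c1", "a1", "c2", "c3"]
print(blend_recommendations(algorithmic, curated, n=4))
# ['a1', 'a2', 'c1', 'c2']
```

Lowering `algorithmic_share` gives the editorially curated list more of the slate, which is one simple way to widen what a person sees.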
Hugo: Yeah, there's a lot in there. One thing that came to mind is something you've spoken about before, which is the fact that this is great for your viewers and audience, but it's also essential for you, perhaps even in a legal sense, because unlike Facebook, for example, and other players in the attention economy, as you say, you both produce content and distribute it, so I presume it's a legal question as well?
Gabriel: Yeah. So we are definitely liable for the content we produce, and in particular there's two areas where this becomes challenging. So in the product that my team has just been working on, and that was actually released two days ago, so we're very excited about that.
Gabriel: We basically, we've created a short form video product that takes clips from the BBC and shows them to you. Now, a lot of the clips generally in the BBC are embedded within text, and that allows in the text to balance some of the views or clarify some of the views. Now, because we're just pulling out videos and there's no oversight around ability to explain something, there's the potential that these can be felt like they're being taken out of context, and that provides problems for us.
Gabriel: The second challenge that we have, and this is probably more relevant to the platform conversation is, as a media organization we can be responsible for contempt of court, and the example we came across not too long ago was, actually if someone is, for example, denying that he sexually harassed someone, and we have a video of that person denying this thing, giving that denial, if we then underneath that video show related content, and we have a good content to content recommendation, then there's a chance that we will show other content related to sexual harassment, and some of these people might be proven sexual harassers, or something that actually where the court decided that they were guilty, and that could be considered a position on the guilt of that person from our side, and that is contempt of court. So we need to find the right ways of how to manage that, and that's quite challenging at the moment because a lot of our processes, as an organization, were based around us always really driving what the viewers will see.
Gabriel: Now, in a one to one relationship, as driven by recommendation engines, that is no longer possible. So we're trying to figure out what exactly is the best way of doing that, which again, is the reason that we have an editorial person working with us very closely to make sure that we are on the right side of the law, but also the right side of our editorial guidelines, and the right side of the public service remit that we have as an organization.
Hugo: And it also seems that having a human editor in the loop is a great way to position data science not as a new discipline and arm of the organization trying to take it over, but as embedding itself in the organization that will incorporate the traditions and history and culture of the organization within it, having data as one input to what the organization does.
Gabriel: Yeah. So I'm a strong believer that data science is there in order to augment an organization. So I think there's certain decisions that data science can automate, and generally these are the decisions that most people don't really want to do because they're quite repetitive, but data science is really powerful because, it's mostly powerful if it can automate the stuff that then frees up people to actually do the stuff that is actually more interesting. So the creative decisions, the more strategic stuff that an algorithm is going to take quite a while to be able to properly support.
What challenges do you face in incorporating data science at the BBC?
Hugo: So in that case, what challenges are involved in incorporating data science into the decision function at organizations such as the BBC?
Gabriel: Yeah, so we talked a bit about that just in our previous question. So, as we're a media organization, and a media organization that is quite often in the spotlight, we are sometimes a bit, we're quite nervous around how stuff can go wrong, and I think with data science it's a lot less predictable what results you will get going forward, because you cannot observe all of the possible ways of how content could be placed in front of an audience, just because everyone will see something slightly different. And we also don't really yet understand how machine learning will be interacting with our editorial heritage, and I think finding that balance where actually machine learning supports editorial and editorial moves away from... At the moment basically our editorial process is what I would call one of micro decisions.
Gabriel: So our editorial teams will decide on which videos go where, which text we're using to put in front of audience, what exactly the title is. That doesn't really scale very well to one-to-one relationships. So we need to find the right ways of how we move this from these micro interventions to macro interventions, where instead we will work quite closely with our editorial teams to develop rules that guide what algorithms can and can't do, and make sure that it's still within the heritage. So a lot about that is around how do we provide people with enough reassurance that this new world that is a bit more algorithmically driven does not go too far away from the BBC's public remit and our editorial heritage, and our editorial, stuff that really is what makes us the organization that we are.
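As an illustration of the "macro intervention" idea Gabriel describes, here is a minimal sketch of editorial rules applied as a filter over a recommender's ranked output. It also mirrors the contempt of court example from earlier in the conversation. Everything in it is an assumption made for illustration: the `Clip` fields, the flag names `open_allegation` and `proven_offence`, and the rule itself are hypothetical and are not taken from any BBC system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    # Hypothetical metadata; real editorial metadata would be far richer.
    clip_id: str
    topics: frozenset
    flags: frozenset = frozenset()


def no_proven_offences_beside_open_allegation(seed, candidate):
    """Hypothetical rule: if the seed clip concerns an unresolved allegation,
    do not place clips about proven offences on the same topic next to it."""
    if "open_allegation" not in seed.flags:
        return True
    shares_topic = bool(seed.topics & candidate.topics)
    return not ("proven_offence" in candidate.flags and shares_topic)


# Rules are written once with editorial staff, then applied to every request.
EDITORIAL_RULES = [no_proven_offences_beside_open_allegation]


def apply_editorial_rules(seed, ranked_candidates, rules=EDITORIAL_RULES):
    """Keep the recommender's ordering, dropping candidates that break any rule."""
    return [c for c in ranked_candidates if all(rule(seed, c) for rule in rules)]


seed = Clip("clip_a", frozenset({"harassment"}), frozenset({"open_allegation"}))
ranked = [
    Clip("clip_b", frozenset({"harassment"}), frozenset({"proven_offence"})),
    Clip("clip_c", frozenset({"harassment"})),
    Clip("clip_d", frozenset({"fraud"}), frozenset({"proven_offence"})),
]
allowed = [c.clip_id for c in apply_editorial_rules(seed, ranked)]
print(allowed)
```

The design choice here is that editors own the list of rules rather than each individual placement, so the same constraints hold however many personalized variations the algorithm produces.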
Hugo: Yeah, and that, once again, speaks to incorporating data tools into the organization, as opposed to the other way around.
Gabriel: Yes, exactly.
Hugo: So, I know that you're interested in what you refer to as applying machine learning in a sensible way, and I'm just wondering what this evokes for you, or what sensible ML means to you?
Gabriel: Yeah. So we talk about responsible machine learning a lot of the time, because of some of the stuff that you've mentioned earlier, like filter bubbles. I think as a public service organization we have a bit of a concern that machine learning might not always be used for the benefit of the individuals, and therefore we want to make sure that at least when we build stuff it really provides the users with the true agency. So there's quite interesting research where people say that they want to own the data that they create, and they feel that it's up to them, but actually they don't really understand how that data is used by big tech organizations. And for us also this responsible ML also means that it's really about it being ...
Gabriel: So if there's a machine learning algorithm, that more and more will decide on what kind of content you get access to, let's make it specific here and talk about news, we need to make sure that there is no commercial or political agenda behind that, because obviously news massively shapes opinions, opinion shapes elections, and we've seen quite a lot over the last couple of years how if you're not careful, and you don't really understand what's happening there, you can get yourself into quite a bad place as a country. So really making sure that people find the information they can trust is really important for us, and this is the independent stuff.
Gabriel: Impartiality is also really key for us. So the BBC has been built as a public service organization, because in 1922, or actually in the twenties, there was this feeling that radio was just too powerful a technology, and it shouldn't be owned by commercial interests at all. It was probably also too powerful a technology to be owned by the state, which is why it is separate from the state. And there's a strong feeling that maybe machine learning can become a similar technology to that, and we have to be very careful that we do not use existing biases that might be in the data to reinforce some sort of negative loop. And finally, it's about universality, so how do you make sure that the benefits of machine learning are for everyone? And we don't end up in a world where you can't really afford any of the stuff that I might advertise to you, therefore I'm not going to provide you with any content.
Hugo: There's a lot of stuff in there that I'd like to touch upon. When you were talking about independence, you mentioned this idea of the vitality of it, how vital it is that people can actually trust the recommendations, and the algorithms that people are providing. It seems like in general we are going in the opposite direction in some ways, I mean we had things this year like GDPR, which helped to a certain extent around consent of use of data and right to delete, among other things, but a lot of the time I think people, including myself, don't even know what data is collected, how it's collected, why it's collected, what it's being used for, who it's being shared with, and these seem like huge challenges, right?
Gabriel: Yeah, I would agree with that, and I think it comes back to me ... ML is not necessarily the thing that creates these problems, ML is the thing that exaggerates those problems, or exacerbates those problems. The problems are probably more around the business models that are basically reinforcing some of this ML behavior, but I think us as organizations, we will need to put a lot of effort over the coming years to clean up our act and be a lot more transparent around what we do, really be clear about what our algorithms, what data we use, and how they work, but also give people the ability to opt out of it if they are feeling uncomfortable with it. I think it's all going to be about providing people with real agency again, because we run the risk otherwise that we destroy this technology that actually has a lot of opportunity, and to be fair is the only thing that probably will solve a lot of the problems that we have going forward.
Hugo: Right. Speaking of GDPR, was GDPR compliance a pretty serious issue for you at the BBC?
Gabriel: We did spend a fair amount of effort and time to make sure that we would become GDPR compliant, because we do collect information about people, we have sign in, so there's quite a few areas, obviously as a large organization as well, you have a lot of information about colleagues, etc. So GDPR in my view actually hits a lot more organizations than organizations realize, and any big organization needs to be doubly careful in this space.
Hugo: Absolutely. I want to start thinking about data products, machine learning as a service, your thoughts about how we can spread machine learning knowledge and practice in general. So I suppose as a bouncing board for this, maybe I can say you're involved heavily in developing broader data science and machine learning architectures, in particular to make sure that best practices are adopted, for example. I'm just wondering what this involves and how you think about this?
Gabriel: So one of the challenges that we have across the organization, as I mentioned before, we have for example a great R&D team that will build some stuff, and then we have lots of product teams that could probably use the stuff that's being built by the R&D team. But it's not necessarily an easy transfer of the technology from out of R&D into the actual product teams. So one of the things that I'm quite interested in at the moment is what I would call ML as a platform, or ML as a service, which is how do we make developing and deploying machine learning models at BBC scale as simple as watching TV?
Gabriel: So what is the process that we need to put in place, what are the platforms that we need to put in place that make it quite simple to create a model, bake a model, and then hand over into a system where it will run and scale? And ideally you do all of this while being aware what other teams are doing, so that you can build on the shoulders of giants, that you're aware of the results that have not worked somewhere, so you don't try these things, and what that enables as well is that you can embed certain metrics, for example, at the end of a test. So we can embed some of the things that we're really keen on, in terms of our responsible machine learning, and make sure that those tests are passed before anyone can put anything into production.
Gabriel: So it gives us the ability to also encourage the teams to be aligned with some of the thinking that we have in the space around responsible machine learning.
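The idea of tests that must pass before anything goes into production can be sketched as a simple release gate. This is only an illustration under stated assumptions: the metric names (`topic_diversity`, `audience_group_exposure_gap`), the thresholds, and the `release_gate` function are all hypothetical and do not describe the BBC's actual platform.

```python
def release_gate(metrics, checks):
    """Return (approved, failures) for a candidate model's evaluation metrics.

    `checks` maps a metric name to ("min", limit) or ("max", limit).
    A metric that was never reported counts as a failure.
    """
    failures = []
    for name, (kind, limit) in checks.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: metric not reported")
        elif kind == "min" and value < limit:
            failures.append(f"{name}: {value} is below the minimum {limit}")
        elif kind == "max" and value > limit:
            failures.append(f"{name}: {value} is above the maximum {limit}")
    return not failures, failures


# Hypothetical checks a platform team might require of every model.
RESPONSIBLE_ML_CHECKS = {
    "topic_diversity": ("min", 0.40),
    "audience_group_exposure_gap": ("max", 0.10),
}

# Hypothetical evaluation results: strong engagement, but too little variety.
candidate_metrics = {
    "click_through_rate": 0.12,
    "topic_diversity": 0.31,
    "audience_group_exposure_gap": 0.04,
}

approved, failures = release_gate(candidate_metrics, RESPONSIBLE_ML_CHECKS)
print(approved, failures)
```

Because the gate lives in the shared platform rather than in each team's code, a model that scores well on engagement alone still cannot be deployed until the shared checks pass.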
Hugo: So in this sense, with machine learning as a platform, and machine learning as a service, do you envision a future in which people across many teams and parts of the organization can use GUIs to build and deploy models, as opposed to writing code?
Gabriel: Well, that's a good question. I'm not sure it will go quite that far, but I definitely envision a world where actually I believe that there will be more smart people not sitting in my team than sitting in my team, and we need to find ways of allowing those people to contribute to the work that we're trying to do. So I definitely envision a world where they can start with something that isn't from scratch, and where there is enough there that helps them make sure that they follow a proper data science process, and that it makes it much easier for them because they don't have to worry about infrastructure and all of the other stuff that otherwise takes a lot of your time. They don't need to worry about data integration and all of that stuff. Whether it goes all the way over GUI, I'm less sure, I'm not sure. That all depends on how far we can go on this journey.
Hugo: And I'm just wondering, I clearly think about this a great deal, but is there a danger involved in this that essentially if people are building machine learning models, most of the time they'll be building mathematical models, and if they don't actually understand the math behind the models they're building, is that dangerous in some way?
Gabriel: I think it is, and this is why it's important that you embed a certain amount of process around it, and certain scores that you run at the end. And I see this a lot when actually interviewing people for some of our roles, is you can notice certain people who just apply technologies, or algorithms, or methods, and they don't really understand the assumptions behind it, and at some point those assumptions break down and you get unexpected results. I think it's still something where, for a lot of people, this is just a question of training, right?
Gabriel: So there's a question around making sure that they, and I don't think they need to understand all the algebra or whatever arithmetic that sits behind some of the models. They need to understand where stuff can become dangerous, and what the assumptions are that go into the model, and what the assumptions are that come out of the model. And I think that's something that you can teach people, and I think that's much more valuable to teach that kind of stuff than to teach them how to set up the right infrastructure, etc., which can be automated a bit more, or put into this platform.
Hugo: Right, and it's interesting that you mentioned hiring and that process. I suppose a question that our listeners would be very interested in is, when you hire for your team, what do you look for, and what type of people would best do the work that you need to do?
Gabriel: I think we're looking for people who are very curious, who are obsessed to a certain extent with finding better solutions, but people who understand that actually it's all about solving the problem, it's not about applying the coolest, newest, or whatever algorithm, and that's why I was talking a bit about these assumptions. So do they understand what the limitations are of their technique, and do they only move onto the more complicated technique if the limitations are deal breaking? Or do they start with the newest thing just because that's what everyone's talking about? So that, in a way, very pragmatic approach to data science, which is always around, actually I'm not here to create a cool algorithm, I'm here to solve a business problem, and I therefore need to understand what parts of that business problem really matter, and therefore decide on the right technique, where my assumptions that feed into the algorithm more or less align with the assumptions I have underneath the business problem.
Hugo: In that sense, it's a really practical game.
Gabriel: I think so. I think actually we're just starting an apprenticeship in data science as well next year, and I actually really like this idea of data science being an apprenticeship. I think you do need to have a certain amount of minimal understanding of programming and mathematics, etc. but there is nothing that will compensate for your ability to just work from models and learn the hard way, right? Like I'm sure for your career, you will have done a bunch of stuff where you think, "This is feeling a little bit too good. This is converging too quick." Or something like that, and then you realize that you've leaked from your target set data into your training set, or something like that, and you will never, unless you've experienced that a couple of times, you will not be aware that that kind of stuff can happen.
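To make the "this is feeling a little bit too good" moment concrete, here is a small sketch of one common form of leakage: choosing features using rows that are later used for evaluation. It is an illustration only, not something described in the interview. The data is pure noise by construction, the classifier is a deliberately simple linear scorer, and all function names are hypothetical.

```python
import random


def top_k_features(rows, labels, k):
    """Indices of the k columns whose values co-vary most strongly with the labels."""
    mean_label = sum(labels) / len(labels)
    n_cols = len(rows[0])
    scores = [
        abs(sum(row[j] * (y - mean_label) for row, y in zip(rows, labels)))
        for j in range(n_cols)
    ]
    return sorted(range(n_cols), key=scores.__getitem__)[-k:]


def holdout_accuracy(train_rows, train_labels, test_rows, test_labels, cols):
    """Fit a simple linear scorer on the training rows and score the held-out rows."""
    weights = [sum(row[j] * y for row, y in zip(train_rows, train_labels)) for j in cols]
    correct = 0
    for row, y in zip(test_rows, test_labels):
        score = sum(w * row[j] for w, j in zip(weights, cols))
        correct += (score > 0) == (y > 0)
    return correct / len(test_rows)


def run_trial(rng, n=40, p=500, k=10):
    # Features are random noise and labels are random: there is nothing to learn.
    rows = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    labels = [1.0] * (n // 2) + [-1.0] * (n // 2)
    rng.shuffle(labels)
    half = n // 2
    train_rows, test_rows = rows[:half], rows[half:]
    train_labels, test_labels = labels[:half], labels[half:]

    leaky_cols = top_k_features(rows, labels, k)               # sees held-out rows
    honest_cols = top_k_features(train_rows, train_labels, k)  # training rows only

    leaky = holdout_accuracy(train_rows, train_labels, test_rows, test_labels, leaky_cols)
    honest = holdout_accuracy(train_rows, train_labels, test_rows, test_labels, honest_cols)
    return leaky, honest


rng = random.Random(0)
trials = [run_trial(rng) for _ in range(20)]
leaky_acc = sum(t[0] for t in trials) / len(trials)
honest_acc = sum(t[1] for t in trials) / len(trials)
print(f"leaky selection:  {leaky_acc:.2f}")
print(f"honest selection: {honest_acc:.2f}")
```

Since the labels are random, an honest evaluation should hover around chance. The leaky variant looks clearly better only because information from the held-out rows shaped the model, which is the kind of too-good result that experience teaches you to distrust.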
Hugo: I love that. Particularly as data science is a discipline where career paths aren't necessarily clear, and the role of junior data scientist is something that we're seeing a bit more of now, but in all honesty, it's in a woeful state in terms of people being able to enter from a certain level, the industry as a whole.
Gabriel: Yeah, it's an interesting question as well. I think the other challenge you have a bit is that a junior data scientist is not the same as a junior data scientist in a different place. There's not even a consistent definition of data scientist, right? In certain places data scientists are product analysts, in certain places data scientists are research scientists. So it's very confusing for anyone out there who's trying to break into this to try and understand where should they start, because it's not very clear what would be expected if you purely look at the job title.
Aspiring Data Scientists
Hugo: For sure. And we'll actually link to your careers page and the apprenticeship when that goes live in the show notes, so our listeners can check that out. A question that I hear a lot from aspiring data scientists is whether they should go to grad school, or start as a data analyst and learn a bunch of stuff on the job, and I was wondering if you had any advice around that question?
Gabriel: That's a really good question. I think I've had great data scientists coming from both directions. I think it really depends on what kind of data scientist you want to be, right? So there's this concept of the research data scientist, and the applied data scientist. And the research data scientists are there to build new algorithms that then the applied data scientists can use to really solve the problem. So I think it really depends on what you're more passionate about. Are you more passionate around writing lots of papers? Are you more passionate around creating new knowledge? Or are you more passionate around really trying to get the last percentage of efficiency or the next 10X growth out of the products that you're building? And depending on what you're more passionate about and what maybe feels more natural to you, I would decide one or the other. I don't think there's a one-size-fits-all, because again, there's not one-size-fits-all data scientists across organizations anyway.
Hugo: The other thing I would say is that if you're going to go to grad school, you have to really, really, really want to go to grad school.
Gabriel: Yeah, you need to really love your maths, right? Otherwise there's no point in doing this for a couple of years.
Favorite Data Science Technique
Hugo: Yeah, exactly. So I'd love to just get slightly in the weeds and a bit technical. I'd just love to know what one of your favorite data science-y techniques or methodologies is?
Gabriel: I don't know whether you count this as data science, but definitely probably my favorite one is Kalman filters, and I really like that because in a way it's so cool to be able to say, "Yeah, I've used rocket technology in order to optimize the organization." So Kalman filters were developed in order to help during the Apollo mission to bring the rockets back to Earth, because actually the precision you need is insane, in terms of the angle etc., and Kalman filters allow you to appreciate that there's measuring error, and then movement error, and it deals with that, and it brings both of them together to actually, it's this really weird thing where you have uncertainty in measurement and uncertainty in movement and somehow through this magic of Kalman filters it has less uncertainty.
Gabriel: And I've used this, so far the BBC is the first organization where I haven't yet been able to implement Kalman filters, but in the past I've used this, because I also believe that if you measure something like price elasticity, for example, you probably have an uncertainty in your measurement, because it's the price elasticity of the users who have bought this product at this point in time. You haven't talked to everyone in your country, so you have the measurement uncertainty, and price elasticity isn't fixed, it probably moves over time. So you actually have quite a neat parallel to this whole rocket idea. So actually beyond it being actually cool to say that you're powered by rocket science, I think it actually also provides some interesting and useful tooling, and it's not very commonly used beyond something like self driving cars, but even there they've moved onto more sophisticated techniques. But I don't think Kalman filters are something that a lot of your data scientists would come across in their training.
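For readers who have not met Kalman filters, the following is a minimal one-dimensional sketch of the idea Gabriel describes: a quantity that drifts over time (movement uncertainty) is observed through noisy measurements (measurement uncertainty), and the filter combines the two. The random-walk model, the noise levels, and the starting elasticity of -1.5 are all assumptions chosen for illustration; this is not Gabriel's implementation.

```python
import random


def kalman_step(estimate, variance, measurement, process_var, measurement_var):
    """One predict/update cycle of a scalar Kalman filter with random-walk dynamics."""
    # Predict: the true value may have drifted, so uncertainty grows.
    predicted_var = variance + process_var
    # Update: blend prediction and measurement, weighted by their uncertainties.
    gain = predicted_var / (predicted_var + measurement_var)
    new_estimate = estimate + gain * (measurement - estimate)
    new_variance = (1.0 - gain) * predicted_var
    return new_estimate, new_variance


rng = random.Random(42)
process_var = 0.01 ** 2      # assumed drift of the true elasticity per period
measurement_var = 0.3 ** 2   # assumed noise in each period's measured elasticity

true_value = -1.5                # hypothetical starting elasticity
estimate, variance = -1.0, 1.0   # deliberately vague initial belief

n_steps = 500
sq_err_raw = 0.0
sq_err_filtered = 0.0
variance_always_shrinks = True
for _ in range(n_steps):
    # Simulate the world: the elasticity drifts, then we measure it with noise.
    true_value += rng.gauss(0.0, process_var ** 0.5)
    measurement = true_value + rng.gauss(0.0, measurement_var ** 0.5)

    predicted_var = variance + process_var
    estimate, variance = kalman_step(
        estimate, variance, measurement, process_var, measurement_var
    )
    # The combined uncertainty is smaller than either source on its own.
    variance_always_shrinks = variance_always_shrinks and (
        variance < min(predicted_var, measurement_var)
    )
    sq_err_raw += (measurement - true_value) ** 2
    sq_err_filtered += (estimate - true_value) ** 2

rmse_raw = (sq_err_raw / n_steps) ** 0.5
rmse_filtered = (sq_err_filtered / n_steps) ** 0.5
print(f"raw measurement RMSE: {rmse_raw:.3f}")
print(f"filtered RMSE:        {rmse_filtered:.3f}")
```

The "magic" is just the update step: the variance after combining, `(1 - gain) * predicted_var`, is always below both the predicted variance and the measurement variance, because two imperfect sources of information together say more than either one alone.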
Hugo: I've never used Kalman filters, and I look forward to checking them out, and I'm also sure you're looking for an opening and it's only a matter of time before you get to use them at the BBC.
Gabriel: Yeah definitely, definitely. I'm trying to not be too distracting to my team though. They need to just get stuff out, right? As we mentioned beforehand, it's all about making sure you have impact on users, not about using the cool techniques. So I need to listen to that advice myself every now and then, as well.
Hugo: That actually brings something else to mind. Just quickly, I'm wondering, mentioning your team, what do you consider the most important part of your role in dealing with your team, or managing your team, or your sense of responsibility there, if that makes sense?
Gabriel: So I really manage a mixed product team. So in my team I have data scientists, but I also have software engineers, data engineers, architects, and actually a product manager, and all of that stuff around this. So for me the biggest job is, again, being that translator I think that I talked about earlier. So trying to find the right opportunities where we can really contribute and explaining that to the business, but also making sure that the business, sorry, that the team who has to build all of that stuff understands how it fits into the bigger context, and to then create the space and the opportunity for the team to really show how they can impact the BBC.
Hugo: I love that idea of creating space in that kind of role to allow your team to flourish and facilitate them doing the best jobs possible.
Gabriel: Yeah. I mean, it comes back to this, right? No one wants to be the savior data scientist that comes in and then gets an impossible task, and then suddenly the organization gets disenchanted, and gets rid again of data science. I think out of managing that possibility of what's actually really possible by giving people the space to grow behind, while giving them the protection, so that they can suddenly come out of the ... and show something for the organization that they never thought was possible. I think that's the stuff that's really the exciting and challenging part of my role.
Call to Action
Hugo: Yeah, incredibly exciting. So my final question, Gabriel, is, do you have a final call to action for our listeners out there?
Gabriel: Yeah. I think for me, the thing that I'm more and more passionate about, is the thing that actually as data scientists it's quite easy for us to say that we're just scientists who observe the world, and I fundamentally disagree with that. I think we are world shapers. So it's not, if you're building a recommendation engine, it's not like you're observing what people are looking at and you guess. You're shaping their decisions, you're shaping their behavior, and as we write more and more algorithms that decide what kind of news people see, what kind of universities people can apply to, whether people get an interview or not, who you get matched to on dating sites, whether you get a mortgage, what rate, whether you're going to go to prison, etc. I think data scientists need to start taking a lot more responsibility about the outcomes.
Gabriel: It's not good enough for us to say, "Well, I've been asked to optimize for this business objective and I just did it, and all of the bias that was in there was already in the data." I think we really need to take a lot more responsibility, because we are really the only ones that properly understand what's happening. Because I think that data science has such a huge role to play for our future, because there's this whole bunch of problems that we cannot solve without it. Be that efficient energy distribution, a whole bunch of health care stuff, we really need to make sure that we can make the most out of the potential, and I think the only way we can do that is by creating proper customer agency. And for customer agency to be there, customers need to trust that whatever we build is in their interest, and not just in the interest of the organizations we work for.
Hugo: Yeah. I really like this idea of thinking about the responsibility of data science and ML, in terms of the impact it has, and I actually had Cathy O'Neil on the last season, author of Weapons of Math Destruction, and she's reconfigured her definition of data science, and she now says, "Data science doesn't just predict the future, it causes the future." So that's a line that she thinks about a lot.
Gabriel: So I talk about data scientists now as market makers, because I think it is actually we create, we connect stuff, and for those connections we change the realities. And recommendations is the simplest one. So I personally don't believe that you have a perfectly formed view of what you would like to see, for example when you go onto Netflix or something like that, or the BBC. I think instead what happens is that we give you a bunch of possibilities and then as you interact with those possibilities your opinion about what you would like to see really forms. And with a recommendation for some entertainment, that might not matter, but for a recommendation for news, or all the other places where machine learning is now being used, around HR, all of that stuff that actually has a fundamental impact on what choices people have in front of them, I think it's really, really important to take to heart that actually you are a market maker.
Hugo: Yeah, and I actually really like the twist you make on ethical data science and data science ethics, which is of course a current huge conversation, but the twist you make in terms of turning it from ethical machine learning to thinking about responsible machine learning, and the responsibility of data analysts and data scientists and machine learning engineers in this context.
Gabriel: Ethics is just this word that creates too much discussion about it. I think, to be honest, responsibility is also not well enough defined. So no one would say we use irresponsible machine learning, right? It's very easy to say, "Of course it's responsible." I think the purpose therefore is we have to push organizations even further to say, "Okay, so what does that mean? What are the trade offs that you're going to take? How are you going to optimize between the benefits for your organizations versus the benefits for the individual?" What's that cost function of individual freedom, if you want, compared to organizational benefits? And I think that's the way that it's going to become more responsible.
Hugo: Absolutely. Gabriel, thank you so much for coming on DataFramed.
Gabriel: Thank you so much for having me.