
Online Experiments at Booking.com (Transcript)

What do online experiments, data science and product development look like at Booking.com, the world’s largest accommodations provider?

Introducing Lukas Vermeer

Hugo: Hi Lukas and welcome to DataFramed.

Lukas: Hey Hugo.

Hugo: Great to have you on the show. I’m really excited today to be talking about online experimentation at Booking.com, talking about the role of the scientific method in data science and diversity in tech. But before we get into all of these issues, I want to find out a bit about you and I want to take a slightly sinuous approach by first finding out what you think your colleagues think that you do.

Lukas: I think if you ask them, they would mumble, "Something, something, experiment, something." Something related to experimentation-

Hugo: I love it.

Lukas: ... at least. But I think practically speaking, people come to me when they have questions about experiments or about metric design or when they have difficulty analyzing the results of their own experiments. So the way we're set up, we allow lots and lots of different teams to run experiments using our shared infrastructure. So I'm responsible for that infrastructure, and the training and the documentation around it, and the methodology and metrics that we use, so that's my real responsibility. But the end result is that people are doing these experiments by themselves and then when they get stuck, they somehow end up with me. So I think if you ask people, "What does Lukas do?" Then, they would say, "He answers my questions about experiments."

Hugo: Great. Hence the, "Something, something, experiment, something." But as you said the basis of what you do is structured around infrastructure and tools, methodology and training.

Lukas: Yes, yes. I try not to get in the way. I try to let them do these things as self-service as much as possible. We're not a central authority, not an ivory tower in that sense, so we really try to enable and empower people to do these experiments themselves.

Hugo: Great. And I'm really looking forward to delving deeper into what experiments look like at Booking.com but first, are there other things that you're responsible for?

Lukas: Yeah. So more broadly speaking, I work with the wider analysis community that's very spread out within Booking and the research community. So analytics and research are a little bit different from experimentation in the sense that analytics is usually more open-ended, it's more exploratory. And user research, I mean more qualitative studies, trying to understand what users are thinking, what our customers want. And I work with both of those communities as well, trying to support them, seeing how they can collaborate with experimentation. Trying to understand how with these three pillars, we can improve the customer experience.

Hugo: Cool. Speaking to the research aspects, it sounds like — as we'll discuss later on — that you do really think about data science and the scientific method and the data scientific approach.

Lukas: Yeah. So I'm not doing data science for fun, we're doing this for a purpose. And our purpose here is to help, we say, empower people to experience the world. So this is really about helping end users and trying to figure out what it is that they need to make a decision. And the other aspect of that is we're doing that by building a product. And so all of the analytics and all of the data science that we do is all geared towards helping product development. So that's either through exploratory analysis, trying to figure out where to go or qualitative understanding, why people want or need a certain thing. And then quantitative, trying to quantify what the impact of something is, trying to confirm that something has actually solved a user issue. So I see all of these things as being related to product development and customer experience, not as fields in and of themselves.

Hugo: I know you mentioned already that the hierarchy at Booking.com is relatively flat, but how many people do you work with or are responsible for?

Lukas: So I have no direct reports, so no one directly reports to me but we have hundreds of people who are using the experimentation infrastructure and I do feel responsible for their behavior so to speak. So the group I work with is I think 1,500 people, 2,000 people, something like this.

Hugo: Amazing.

Lukas: It's a pretty large organization.

Hugo: Absolutely. And you're also, in that respect, involved in hiring, right?

Lukas: Yes. To make this work, you need good people. If you want people to have their own responsibility, if you want them to be accountable for their own results, you want to give them this power, then you need to have people that you really trust. So I spend a lot of time on hiring as well and making sure that we get the right people with the right mindset into this company. So that will be people who are really thinking about how does their work impact customers, so not doing it for the sake of the work but doing it because they want to make the product better. And they're also able to function independently, so not people who like to be told what to do. But they're also vocal and can explain why they do things, why they do them in that way and are willing to change their mind when someone challenges their approach. So I spend a lot of time on hiring.

Hugo: Fantastic. It sounds like a really interesting culture. And are you hiring at the moment?

Lukas: Oh, yeah. Always, always.

How Did You Get Into Data Science?

Hugo: Great. Well, we might circle back to that. But any listeners who're interested and enjoy this conversation, please do get in touch with Booking. So now I know what you do at Booking, I'd like to know a bit about your journey there. How did you get involved in data science initially?

Lukas: Well, in high school, I was good at two things. I was good at computers and I was good at biology. And I was interested in biology because of the systems aspect of it. So I liked figuring out when a patient has a particular diagnosis, how do we figure out what part of the system is failing? How do we fix that? I liked computers and complex systems. And this was back in 2000, and I found a degree program here in Utrecht, at Utrecht University, that was computing science with a machine learning or AI minor. It was called artificial intelligence. And this was before data science was cool, right? This was still when these conferences were still referring to machine learning or statistical analysis.

Hugo: Yeah. I think it was before data science was even a term.

Lukas: Yeah, yeah. It wasn't coined yet. And so that's where my journey began, with a fairly technical background in computer science. And then after I finished my studies, after eight long years because I'm slow, I joined a consulting company. I was trying to get the marketing department to implement recommendation systems and as part of the system, there was a control group. So there was a way for the system to figure out for itself whether it was being effective or not. You want something to compare against. If you're making recommendations, you want to know, do those recommendations actually add value? So it had a control group built in and one of the questions I would get a lot was, "Can we please turn off the control group? Because we just want to give all of our customers the best thing." And that spoke to me as such a fundamental misunderstanding of the scientific method. I thought it was very striking that many of these companies were trying to apply these machine learning methods but didn't really understand the core foundations of data analysis. And so then at some point, I ran into a man who worked at Booking at a conference and we started talking and he was really interested in my work and I was very interested in the culture that they seemed to have. So I joined Booking five years ago mostly because I was fed up with companies that didn't really understand this idea of being open to change, being open to being proven wrong and trying to build a product that actually provably helps people rather than going by gut all the time and just using data to confirm your initial beliefs. So I started building ranking systems and recommendations here at Booking, that was the first thing I did. I wasn't really good at that, I think. It wasn't my forte. But one thing that I noticed is a lot of people were running experiments and I had opinions, very strong opinions about how they were doing that.
And I was so vocal about that that at some point, people just asked me, "Do you want to do that for a full-time job?" That's about four years ago. And so since then, I've been responsible for experimentation. Just because I was so loud.

Hugo: Yeah. And even at a prior point in the trajectory you described, you realized that the experiments with respect to the recommendation systems for marketing weren't actually being done correctly, right? With respect to the control group.

Lukas: The cool thing about that is the tool was working, right? In terms of execution, it was executing that experiment correctly. The method was actually very sound. It was just that people weren't listening. They weren't using that data to then inform their decision. When I go to conferences about data science, I'm always worried that all these fancy methods that are being proposed, that they don't end up with anyone actually changing their mind. What's the point of doing that data analysis if you're going to do that exact same thing that you were going to do anyway?

What are the biggest business challenges faced at Booking.com?

Hugo: Right. And we're actually talking about making business decisions and improving our customers' experiences and customers' lives. As you stated earlier, you're not in data science only for fun, you're facing a lot of business challenges. So I'm wondering, what are the biggest business challenges faced at Booking.com?

Lukas: Oh, boy. So my role is quite technical so I'm not on the business side of things but I think the gist of it is we're the world's largest accommodations provider, so the biggest seller of rooms and we want to help people pick the right place for them to stay. So we have 1.7 million accommodations on our site and that's a lot of places to stay, and we want to give people the information so they can then make a decision of where to stay. Now, that's already a challenge but I think the other thing that's going on is we also see that consumers want more than just a bed to sleep in. We want to empower them to experience the world. That means we need to help them find things that are not just a bed but places to see, things to do as well as places to stay. So those are two sort of business challenges. And I think on the technical end, we've been growing faster than Moore's law for about two decades now. You could think of us as the biggest hockey-stick startup in the world.

Hugo: And this in terms of number of transactions, or number of customers, or number of hotels, or ...?

Lukas: All of those. All of the above.

Hugo: Everything.

Lukas: So growth is growth, right? We're a marketplace so all these things are related.

Hugo: And how do you scale then?

Lukas: Well, one part of scaling is the technical scaling, right? So it's a constant challenge just to make sure that stuff works, just making sure that we survive next summer and that all of our systems keep working. It's an interesting technical challenge but for me, the bigger challenge is the organizational challenge. Our size as a company has also grown a lot and like you said, we're a very flat organization, we like to give people a lot of power, we like to let them make their own decisions at the lower level. So it's not the boss telling them what to do, the boss is telling them what to aim for and then they execute, they decide how to get there. And scaling that is tricky because it's not just building more layers to the org chart. You have to find ways to help people communicate, how do we share knowledge, how do I find out what other people are doing, how do I challenge them. So if I disagree with what they're doing, how do I escalate? How do we do training? When I started doing experiments, I could talk to all of my users. We were all on one floor. I could walk up to everyone who was running experiments and ask them what was on their mind. And if I disagreed with any decisions, I could walk up to them, go talk to them. We're running so many experiments now, I can't even read all of them anymore. Just going too fast.

Hugo: And as you said, you have 1,500 or 2,000 people running experiments, right?

Lukas: Pretty much, yeah.

Hugo: So you've identified three, perhaps four really interesting challenges. There's the matching problem of getting your customers or your clients the information they need, the hotels they need. Then, on top of that, as you say, empowering them to experience the world. Then, scaling up not just technically but organizationally, which involves a whole set of sub-challenges. What I'm interested in is how data science can help you and Booking to solve these challenges.

Lukas: How not? I don't think there's any part of Booking that doesn't use data science in one shape or form. So the obvious ones are where I started, right? Working on ranking and recommendation systems to help people choose from these 1.7 million properties, trying to make sure that the stuff that we show them is relevant to them. That's a sort of prediction problem, right? Trying to predict which things these people like. Then, on a marketplace, there's lots of interesting marketplace dynamics, things like fraud. Trying to prevent fraud to protect our customers but also our partners. Fake credit cards of course are an issue. And then we have our own customer service in-house, so we don't outsource our customer service, these people really work for Booking. That also helps sort of the feeling of belonging. Our CS operation is so huge that just predicting how many phone calls we're going to get in one of these 24 different languages that we support and scheduling staff so that we have the right people available at the right time is an interesting scheduling and prediction problem, with the added problem of many of our agents speaking multiple languages. So that scheduling challenge is actually a very nice one. And of course, analysis and experimentation, more on supporting decision-making at all levels. So I'm more concerned with the infrastructure that allows people that do product development to make decisions but obviously leadership also needs data analysis to figure out where to go. Yeah, so I can't really think of a place where data science doesn't in some way affect how we work.

How is online experimentation used to make decisions at Booking?

Hugo: So Lukas, that provides a great segue into what I'm really interested in talking about next and it's the explicit role of online experimentation in decision-making at Booking. So can you tell me a number of concrete ways in which online experimentation is used to make decisions at Booking?

Lukas: So in the end, we're interested in making the product better for our customers, right? We're in it for improving the customer experience and experiments are a great way to do that in a measurable way. So essentially, you come up with a hypothesis about a customer problem that you're trying to solve, then you propose a solution. You say, "I think the customer is struggling with this but I can fix that by doing ..." Something. And then you use an experiment to validate that the solutions or features help customers find what they need, so that's where experimentation comes in. It really comes in at the end where you're trying to figure out, did this thing actually work in the way that I hypothesized that it would work? And so what that does for us as an organization is that it ensures that these individual teams are empowered to function autonomously, that they can make independent decisions. Because everyone can look at what their intent was, what the outcome was and as long as they trust the shared infrastructure, they can also trust the results of that experiment. So in some sense, this experimentation helps us validate that the changes that we make really help users but also it helps us have this flat organization where these individual teams are empowered to do that autonomously.

Hugo: So if I'm a customer, what type of changes could I expect to see if I'm a subject in one of your experiments?

Lukas: Nothing that you wouldn't see if we were just not running experiments and changing product, right? I think any product on the internet needs to evolve and needs to adapt to customers' needs. So you do things like, "It looks like this button has low contrast ..." For instance, we have a lab here, a user research lab, where we bring people in. And sometimes we notice that a certain piece of information or a certain button is really not obvious to people. You can see this in eye tracking studies or you can see this when you ask them, "Did you notice this?" And they say, "No, no. I didn't see this." So we have some indication that this is not clear enough and then you hypothesize that this information is important and people are not seeing it, so to help users, I must make it more obvious. And so you do things like increase the contrast or make it bigger. I always joke that we should make very important stuff blink but apparently that's more like '90s internet. When I was doing web design, that was cool. But you come up with some proposed solution for how you would help users find this information better. So then you implement that but that's no different from implementing a normal product change. The only thing that's different here is now we're going to validate that when we made that change, users actually responded differently. So that when you made the button more obvious, did more people actually hover over it? Did more people click on it? Did that help them make a decision? Did they appreciate that decision more than decisions they were making before?

Hugo: And also, you want people to do more of the thing that makes business better and the customer more satisfied, right?

Lukas: So for us, luckily, those two things are very much aligned. I think business-wise ... I don't want to get into the philosophy of the value of the capitalist society, right? But in some sense, all companies and their consumers are at odds because a business wants to make money and consumers want to spend as little of it as possible. But we are a travel company, people want to travel, people like to go places. And I think it's in our interest that they go to places that they like so that they come back. And so we're selling a product that makes people happy, we want them to find the product that makes them the happiest so that in the long term, both parties get the most value out of it.

The paradigm of online experimentation

Hugo: I know you're very passionate about experiments. I think you've even told me other people may call you very opinionated about experiments. A lot of the time when people first talk about online experiments, they talk about a red button versus a green button-

Lukas: Oh, God.

Hugo: ... and I'm wondering what your take is on this "paradigm" of online experimentation.

Lukas: Yeah. I think this all started with the 41 shades of blue article that Google shared when they found out one shade of blue was actually better than the other shades of blue. I find that such an uninspired way of thinking about product. That's so narrow.

Hugo: Tell me more about that.

Lukas: Look. I think when people talk about website optimization, they're thinking of this as there's a website, it has a very clear purpose, I can very clearly measure it and I'm just going to optimize that space. And even the term, "website optimization" suggests that it's an optimization problem. And with a background in machine learning, optimization problems are a very specific thing, right? But for product development, none of these things are true. It's very difficult to really get down and measure customer satisfaction or long-term value. These are things that are not easily quantifiable, so thinking that you can use this to optimize is an illusion I think. But moreover, when you're building product, you're not operating within a confined space. This website has lots of different things on it and if you think only about changing the color of a button, then you're not really expanding your product, making it do better things that customers need. So for instance, there might be information that people need to make a decision about which hotel is the best for them. Let's say people want to know whether they can bring their pets. I'm just making this up. So people want to know, "Can I bring my pet to this place? Yes or no?" Now, if you think of website optimization as changing the color of the buttons, there's no color that's going to help people understand whether pets are allowed or not. Just doesn't work that way. What you need is you need to try probably using qualitative methods first. Try to understand what it is that people need, then think of features that you can build that serve that need and then you validate whether that need is served. That's a much broader view than just changing the color of buttons. And it also requires that you start writing down more clearly in the form of a hypothesis, what is the need that you're serving, what is the problem that you're solving. 
And so the example with the two button colors, I use that a lot and I say to people, "Look. Website optimization is saying I'm going to change the button from yellow to blue and then see if the magic number goes up." That's it. That's website optimization. But it doesn't help you understand what's going on. And if I said, "Well, instead of changing the button to blue, let's change it to green", you would be guessing which one of these two would be more effective. But if I said, "We know from eye tracking studies that people don't notice this button and we believe or we suspect that is the case because the contrast between the white font on the button and the yellow background is too low for people to distinguish the text. And we think we can resolve this contrast issue by changing the background color to be darker so that the white text stands out more brightly. And so we will make the button blue and then we'll check whether people actually hover over this button more, whether they click on it more and whether they actually find what they need more." Now, the implementation of this is exactly the same, right? I'm still changing the color of the button but the intent is so much different.
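The validation step Lukas describes — did more people actually click the higher-contrast button? — is, at its statistical core, a comparison of two proportions. Here is a minimal sketch of that comparison in plain Python; the function name and all traffic and click numbers are invented for illustration and have nothing to do with Booking's actual tooling:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for a difference in click-through rates.

    Returns the observed rates and the p-value under the null
    hypothesis that both variants share the same underlying rate.
    """
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)        # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))        # two-sided
    return p_a, p_b, p_value

# Hypothetical numbers: 10,000 users per arm, original button vs.
# the higher-contrast variant from the hypothesis above.
p_a, p_b, p = two_proportion_ztest(clicks_a=300, n_a=10_000,
                                   clicks_b=360, n_b=10_000)
print(f"control CTR={p_a:.1%}, variant CTR={p_b:.1%}, p-value={p:.4f}")
```

Note that the code only answers "did the metric move?"; the hypothesis about *why* — the contrast explanation — lives outside the test, which is exactly Lukas's point.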

Hugo: In that sense, you're really testing a hypothesis that you have for other reasons as well.

Lukas: Yes. Yeah, so there's some data to back this up, right? We already have qualitative data here and we really need it because we're trying to understand, we have some theory about how this implementation might actually work. So the thing that changes is not the implementation but the things that you can now do as a human. And so one of the things that changes is that if I said, "Well, instead of blue, let's make it light green." Before, with the website optimization approach, you would be guessing which of these colors was better. But now that I've told you that the intent is to increase contrast, you look at these two colors and you go, "What? Light green does not at all increase the contrast, so it doesn't even get to the point." So you reject this potential implementation by understanding what it is that it's trying to achieve in the first place. And another thing you can do is actually come up with alternative implementations. So not just reject them but you can also say, "Well, what if instead of changing the background, we change the color of the font? What if instead of changing the contrast, we change the size? We'll make the button bigger. Or do we make it blink?" So we can propose alternatives based on understanding what it is that the thing is trying to achieve.

Hugo: Yeah, absolutely. So what you're really speaking to is making your experiments more robust and being more mindful about your assumptions and why you're doing what you're doing.

Lukas: I'm talking about ... I'm in it to help customers and to help customers, I need to understand what I'm doing to my product. This is not an optimization problem.

Hugo: So I'd like to keep our focus firmly on online experimentation but I'd like to actually take it from a slightly different vantage point. And I'd like to know, people who use the infrastructure that you build, people running online experiments, what they actually do. Do they need to have a strong statistical or computational foundation or do you productize everything to hide a lot of the things that are happening in the back end?

Lukas: Here you mean?

Hugo: Yeah.

Lukas: So we abstract away a lot of the elements of the method. There are similar products out there, right? You can think of Optimizely or Google Optimize. Conductrics, one of my favorites. These are all products that do very similar things. They allow you to set up an experiment without understanding a lot about what's going on underneath the hood. One way in which we are more embedded or more tailored towards the Booking.com culture is that the way our product development is set up is we have lots of small heterogeneous teams, so teams that have different skills in them. We don't have siloed IT and marketing. We put those people together to build a product. And so we don't really need to have these sort of self-service features for setting up experiments that a lot of these companies provide. So it's not point and click, it is an API because every team will have developers on it anyway, so that's not the challenge. And on the flip side, we spend a lot more time thinking of ways to democratize the decision process itself. So how do we make sure that people have all the information that they need to make a decision. So that's providing them with not just statistics but also guidance on method, right? So one thing for instance, power calculations, figuring out how long an experiment needs to run. That is something that we try to really make self-service, let people do that independently but also get them to stick to it. So it's one thing to do power analysis, it's another thing to then run your experiment for exactly that duration. And so the tool is very much geared towards self-servicing the method.
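The power calculation Lukas mentions — deciding up front how long an experiment must run — can be sketched with the standard two-proportion sample-size formula. This is a generic textbook calculation, not Booking's tooling, and every rate and traffic figure below is made up:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(p_base, mde, alpha=0.05, power=0.8):
    """Users needed per arm to detect an absolute lift of `mde`
    over baseline conversion rate `p_base`, at the given
    significance level (alpha) and statistical power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    p_alt = p_base + mde
    variance = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    return ceil(variance * ((z_alpha + z_beta) / mde) ** 2)

# Hypothetical: 3% baseline rate, want to detect a 0.3-percentage-point lift.
n = sample_size_per_arm(p_base=0.03, mde=0.003)
print(f"{n} users per arm")

# The run length then follows from traffic. With, say, 50,000 eligible
# visitors per day split 50/50, days needed = 2 * n / 50_000.
```

The point Lukas makes about "sticking to it" is the important part: committing to this `n` in advance, rather than peeking and stopping when the result looks good, is what keeps the stated alpha honest.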

What are the limitations of data science?

Hugo: So we've spoken a lot about the data science capabilities that you guys leverage at Booking.com and the role of online experimentation. I want to step back a bit. People think of data science as a modern superpower and as with all superpowers, there's a lot of hype and a lot of buzz and a lot of expectations that won't be met. So I'm wondering to your mind, what can't data science do at Booking or elsewhere, or what is the limitation, the event horizon of the possibilities of data science?

Lukas: I think a lot of stuff you read about data science concerns itself with prediction and correlation models. I think it was Wired that said at some point, we won't need theory anymore, we just need more data. That's just so at odds with my own experience, my own world view.

Hugo: Sounds like nonsense.

Lukas: Well, that's a bit extreme to say that. I think there are lots of things where predictions are actually fine. You don't really need understanding. If you're trying to predict what a consumer will buy, then I don't really need to understand why you like this particular type of hotel. So for recommendations and ranking, I don't think we need understanding.

Hugo: Not necessarily but you do need some sort of ... You can't just throw data at a question without understanding how to interpret the question, interpret the data, interpret the answer and build some sort of cognitive model as well.

Lukas: Yeah, absolutely. And so this is what I call or have called in the past, the philosophers and telescope builders.

Hugo: Right.

Lukas: So the thing that I think is missing from the data science field or the thing that is not emphasized enough is the need for people who think about the causal connections between things. Understanding rather than just predicting. Constructing theories. Thinking about what it is that I can learn with this method and what are the things that I cannot learn, what are the blind spots, what is the white space. So those are the people I call philosophers because they think about what is the meaning of what we are finding. And the other group I think is undervalued is the people who build telescopes, the people who work on the telemetry, the people who think about how to measure things. It's sometimes difficult to even think about what data do I have, what metrics can I build and this is a field or a challenge in itself, especially on the internet where a lot of the stuff that we want to know is actually happening on your computer and we don't have access to that, we can't see you. Consumers feel that they don't have a lot of control over their own privacy but they do. We are a remote party, right? We can't really see that much. And so telemetry is hugely important. The example I often use is that the heliocentric model, the idea that the earth was revolving around the sun, that was 2,000 years old before someone built a telescope powerful enough to make some observations that actually corroborated that idea so much that people rejected the geocentric model. So it wasn't that this theory wasn't around and that people didn't think of this before, it was just that you need someone who can build the infrastructure or the telemetry that allows you to measure the things that you need to support that. I think that's something that's hugely undervalued, people who can really think about what data do we have, what data is missing. Missing data is a huge issue.

Hugo: And do you see yourself as a philosopher or a telescope builder?

Lukas: I should've seen that one coming.

Hugo: You set yourself up for it.

Lukas: Yeah. Well, considering that I came up with those two roles, I would say I'm more of a philosopher than a telescope builder, wouldn't you?

Hugo: I would but the rest of our conversation has firmly revolved around your role in telemetry and building telescopes and-

Lukas: Yeah, that's true. Yeah.

Hugo: ... building product and infrastructure for online experimentation. But this is great because I think actually where this discussion is going is a movement from ... The first half of this conversation really was about the telemetry. Now we're moving kind of into the philosophical implications of data science which I think are very much missing from the conversation. And I'm not really talking about metaphysics per se but I'm talking about thinking about the scientific method, the implications of everything we do and working through all of our assumptions and just things we do on a daily basis. And I think that's a really nice way to frame ... I know there's a paper that you love and you've spoken about called, "Many analysts, one dataset" and I'd love for you to tell us a bit about this paper because it makes clear how small variations in an analysis at different points can actually affect results so that different analysts can get totally different results. So I'd love to use this paper as a starting point to really talk about the role of diversity in data science and tech as a whole and to get your opinion on that.

Lukas: I love that paper. It's so good. I recommend it to ... It's on the Open Science Framework.

Hugo: And we'll put it in the show notes.

Lukas: Oh, excellent. Yeah, yeah. So I'd recommend it to any analyst that I speak to, people who are used to doing data analysis. Because I think one assumption that people make a lot is that data is objective. Even data points are already theory-laden, and we can talk about that later, but the process of analysis itself is also a subjective process, and I don't say that lightly; this paper actually shows that this is the case. As an analyst, you feel that you're following the data, that all of the small decisions you make along the way make sense and that they're inconsequential to the objective conclusion you reach in the end. That's not true. There's this concept of the garden of forking paths, which is the idea that while you are making all these small decisions, you're actually meandering towards a particular conclusion, but that's not necessarily the only conclusion you could've ended up with. If you had made slightly different decisions along the way, you could have ended up with a completely different conclusion, even based on the same data. So in this paper, they try to show this by taking a fixed data set of referees giving cards to players in a soccer league. We have data for each player and for each referee: whether yellow or red cards were given, which is what happens when you commit an offense. They also included the ethnicity of each player and then asked thirty-something separate groups of analysts, "Do you think, or is there evidence, that referees are ethnically biased? Yes or no?" So can you find any evidence of bias?
Now, the interesting thing is that even though the data set is fixed, every group was given exactly the same data, and the question seems like a pretty straightforward yes-or-no question, the groups came back with different answers, because each of them decided to analyze the data in slightly different ways: bucketing it differently, modeling it differently, using different techniques. They ended up with, I think, a 60/40 or 70/30 split between yes and no. Not quite 50/50, but close enough. And then what the authors did is they said to these groups, "Now that you've come up with your answer, please discuss amongst yourselves the pros and cons of each of your analyses." So they were actually allowed to discuss the different approaches between groups. And then they were allowed to adjust their analysis and change their approach, and relatively little changed. Even though people now knew that it was possible to come up with a different conclusion, they didn't really change their own.

Hugo: And there are a lot of steps in there that are actually incredibly scary for us as scientists, data scientists and people trying to make business decisions, right? Not only that you can give people the same data set and have what a lot of people presume is an objective analysis come out differently, but that when you then get them in a room together, people are pretty stubborn, which is something I can relate to. Don't get me wrong, I'm not claiming to be the most flexible.

Lukas: Well, everyone has their own confirmation bias, right? Everyone wants to draw a conclusion. It's difficult to change someone's mind. One way to avoid this kind of thing, and this made me think a lot about data science in general, is that in experimentation it's actually common practice to preregister your analysis before you see any data. You describe exactly what you're going to do and how you're going to analyze the data, which is essentially what a hypothesis is, then you run your experiment and analyze the results. Because the analysis is preregistered, it's not influenced by any data that comes up after the fact. So essentially, you avoid the garden of forking paths by laying out the route you're going to take before you start your journey. That's not always possible. For confirmatory analysis, when you know what the data is going to look like, it's a possibility, but sometimes you do exploratory analysis, and then it's a lot more difficult to describe upfront what steps you're going to take. So another approach, which I think people in risk analysis have already adopted, is the idea of bootstrapping the analysis by having many analysts perform the same analysis independently. You get one analyst or one group of analysts to try to answer a question, but you also get two or three or four others trying to answer exactly the same question, and you compare the answers you get. Now, you mentioned diversity, and I think this is one of the reasons diversity is so crucial. Because if all these analysts are like-minded, then you won't actually get a very good bootstrap; you get a biased sample. So you want that group of analysts to have different backgrounds, to come from different parts of the world, to have different training.
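The "garden of forking paths" that Lukas describes can be made concrete with a small sketch: run several defensible analysis specifications over the same data and compare their conclusions, rather than trusting any single path. The data, the selection rules, and the rate metric below are all invented for illustration; a real analysis would use proper statistical tests per specification.

```python
import random

random.seed(1)

# Hypothetical per-player records: (cards_received, games_played, group).
data = [(random.randint(0, 5), random.randint(10, 40), random.choice("AB"))
        for _ in range(200)]

def rate(records):
    """Cards per game across a set of records."""
    cards = sum(c for c, g, _ in records)
    games = sum(g for _, g, _ in records)
    return cards / games

# Several individually defensible "forking path" specifications of the
# same question: is the card rate higher for group A than for group B?
specs = {
    "all players":       lambda r: r,
    "min 20 games":      lambda r: [x for x in r if x[1] >= 20],
    "at least one card": lambda r: [x for x in r if x[0] > 0],
}

for name, select in specs.items():
    subset = select(data)
    a = rate([x for x in subset if x[2] == "A"])
    b = rate([x for x in subset if x[2] == "B"])
    print(f"{name:>16}: A={a:.3f}  B={b:.3f}  higher={'A' if a > b else 'B'}")
```

The point is that each selection rule looks like an inconsequential preprocessing choice, yet together they define which conclusion you meander towards; comparing the specifications side by side, as the "many analysts" design does with people, makes that sensitivity visible.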

Hugo: Different experiences.

Lukas: Yeah. Different experiences, different views of the world. Maybe even different political positions, right? So that you get a very broad sense of what the potential answers to a question could be. Now, of course, that's a very expensive way to answer the question, because now you need more than one analyst. So I'm not saying you should do this for all questions, but for very, very important business questions, I would be very hesitant to have just one person or one group of analysts answer them, because you want some sense of how sure you are that this is the only correct answer to your question. And when I propose this to analysts and data scientists, I almost always get pushback. The idea that data is objective is so pervasive and the confirmation bias is so strong that people don't generally accept this. "Maybe this is true for other people, not for me."

Hugo: Yeah. And people generally don't like to admit that perhaps they've done something wrong, or that something they thought about one way could reasonably be seen a different way.

Lukas: Right, right. And in some sense, this gets at their skill; you are effectively challenging their ability to objectively assess data, and I think there are other disciplines where this is less of a problem. Think about developers. If you say to a developer, "We want to test all the code before we roll it out, because we want to make sure there are no bugs," very few developers would argue that they never write bugs. That's because they've been confronted several times with stuff they wrote that doesn't work, which made them realize that they're not perfect, and that leads to acceptance. I think the same somehow needs to happen in data science: unless we show people that their analysis is not the only way to come to an answer and that maybe their answer is not as objective as they thought, it's going to be very difficult for them to accept the idea that they might be wrong.

Diversity at Booking.com

Hugo: So I want to focus for a second on your statement that the more diverse the group of analysts, the better. And I'm just wondering, is this something which is stated explicitly and implemented at Booking?

Lukas: Oh, yeah. “Diversity gives us strength” is one of our core values. We very firmly and explicitly believe that we want as diverse a group of people as possible working on this product, because it will help us make better decisions and understand customers better. We're a travel company, we're global, so we want as many voices as possible to be represented. An example: the office I'm sitting in is in Amsterdam. It's not a very large building. It's right next to Rembrandtplein, which is really in the heart of Amsterdam, smack in the middle. In this relatively small building, there are a hundred different nationalities, which I think is enormous. And that's not accidental, that's on purpose. We want as much diversity as we can get, because we think it actually is a core strength.

Hugo: That's really interesting and I think that type and magnitude of diversity coupled with the flat hierarchy that we spoke of earlier would make for incredible conversations and dialogue within the company as a whole.

Lukas: Oh, yeah. Our lunch conversations are awesome. Every day, you meet different people from different parts of the world who think about the world completely differently. It's also a very humbling experience, right? Because your own upbringing and your own environment make you who you are, and you tend to think of that as the only right way to be, right? So it's almost like the garden of forking paths: there is also a garden of forking paths in upbringing, in education, in how you live your life. That diversity is something that really gives me a lot of energy. I love this group of weird people who are so alien to me. It's great.

Hugo: Awesome.

Lukas: One other thing about diversity: we wrote a paper about our experimentation infrastructure, the underlying stuff that we use to run these experiments, and one of the points we make is that we actually have two completely independent data pipelines that try to measure the same things but in different ways, and they are maintained and built by different people. Those people are of course allowed to talk to each other, but they hardly look at each other's code, and that's exactly this idea of bootstrapping. We acknowledge that our experimentation infrastructure will have edge cases and bugs. This is not a maybe; it's a certainty. And it's very difficult to uncover them unless you have something to compare against. So we have these two independent data pipelines, and we compare their results for every individual experiment, so that we can say with more confidence, "Here are the results of your experiment, as corroborated by two independent sources."
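The paper describes the two pipelines but not the comparison step itself; a minimal sketch of what such a cross-check might look like is below. The metric names, numbers, and 1% tolerance are invented for illustration and are not Booking.com's actual implementation.

```python
def compare_pipelines(results_a, results_b, rel_tolerance=0.01):
    """Flag metrics where two independently built pipelines disagree
    by more than rel_tolerance (relative difference)."""
    discrepancies = {}
    for metric in results_a.keys() & results_b.keys():
        a, b = results_a[metric], results_b[metric]
        denom = max(abs(a), abs(b), 1e-12)  # guard against division by zero
        if abs(a - b) / denom > rel_tolerance:
            discrepancies[metric] = (a, b)
    return discrepancies

# Hypothetical per-experiment metric values from the two pipelines.
pipeline_a = {"visitors": 10000, "bookings": 412, "clicks": 8210}
pipeline_b = {"visitors": 10003, "bookings": 413, "clicks": 7950}

# Small counting differences pass; the ~3% gap on clicks is flagged
# for investigation before the experiment's results are trusted.
print(compare_pipelines(pipeline_a, pipeline_b))  # → {'clicks': (8210, 7950)}
```

The design choice mirrors the "many analysts" idea: because the pipelines are built by different people who barely read each other's code, their bugs are unlikely to coincide, so agreement between them is meaningful evidence.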

Hugo: That's fantastic. So I think this has been a phenomenal whirlwind tour through the online experiments you do at Booking.com, the infrastructure you've built out, the role of science in data science, and how you consider diversity, particularly in what is, I suppose, a developing ecosystem of data science roles that requires such diverse opinions and backgrounds and experiences.

Lukas: Yeah. So if you're a factory and you want people to do the same thing over and over, then you don't want diversity. Diversity is bad; you want people to be the same, a homogeneous group. But if you're doing innovative things, if you want to learn, if you want to expand and grow new things, then you need diversity, and homogeneous groups are actually harmful. There's lots of science to back this up. I can send you another paper, one of my favorites, where they looked at decisions made by groups. They compared lots of people who are really, really good, but all really good at the same thing, against a group of people who are individually not as strong but more diverse, whose expertise is not as concentrated as the first group's. And even when the first group's expertise is centered on the topic at hand, the diverse group still does better.

A final call to action

Hugo: Amazing. I'd love to see that paper and we'll share it in the show notes as well. So before we close out, I'd like to know if you have a final call to action for our listeners out there.

Lukas: I think we talked about this a lot already. If you want to build a better product, if you want to help customers, then you have to learn and understand what they need. It's not an optimization problem, it's a learning problem. And you and the people around you are probably not representative of the audience that you have. Facebook has this saying about failing fast. It's a bit of a misnomer, because we're not really after failing. Failing is something that happens along the way, but it's not what we're after. We're after learning and improving, and we acknowledge that in that process we will do things that don't work, because we are learning. We have to learn from our mistakes. When I joined Booking, all this thinking hadn't evolved that much yet, so I just jumped headlong into the first problem I was given, and that was on the main page of Booking. When you type in Booking.com and land on our website, we show a few destinations that we recommend to you, and under each destination we had five hotels for that particular destination that we thought were relevant to users. We were trying to help people who want to explore find places without having to type. Because the other thing you can do on this page is type where you want to go, but if you don't really know where you want to go, that won't help you explore. And so one of the things the team I joined was working on was finding the optimal ranking for the hotels being shown. For instance, say the destination is Paris. Which five of the 3,000 hotels that we have in Paris do we show? So for about six months, we tried different ways of ranking those hotels based on academic papers that I found.
And what we tried were things like: should we increase the diversity of the hotels we show, so that people get a better sense of what the city is like? Should we use context about where people are from, so that people from China get different recommendations than people from Europe? Or should we use your past history, so that if you're logged in, we can use hotels you've stayed at before? You can think of many ways to improve the ranking, but the success metric we were using at the time was the classic website optimization metric: how many of these people eventually book and stay at a property. We were not measuring any of the in-between steps, because we considered them irrelevant. We didn't really look at the behavior on the site itself until, at some point, I found a paper that described a ranking that required us to know how many people clicked on each individual hotel. That's not something we had measured up until then, because we didn't think it was important. But then we started measuring how many people clicked on each hotel and, surprise, surprise, it turns out no one clicks on the hotels. That immediately challenged my core assumption that these hotels were helping people find hotels they like and then book them, because all I had been looking at was whether they booked and stayed, not whether they were getting to those hotels by clicking on them.
So that challenged the notion that this ranking was actually important, and it made me realize later that there are multiple layers of assumptions being made here, and that I was only challenging the superficial top layer. That's what A/B testing, website optimization, changing button colors seems to be about: challenging those superficial assumptions, when what you should really be doing is looking at the core assumptions being made, to make sure you're spending your time on things customers value. In this case, the layers were: I was assuming I could improve the ranking, which assumes the ranking is actually important, that it matters which hotels you show there. That assumes the feature itself is important, that showing hotels there at all matters. And that assumes that anything you show on that page is important, because if people only go there to search, then everything else you show there is irrelevant, right? So there are layers upon layers of assumptions, and I had only been challenging the outer layer. As soon as I realized this, we challenged one of the lower layers: rather than trying to optimize the ranking, we hid the hotels themselves. And what happened is that nothing changed. No one cared. Users were not clicking on them before; now that they're not there, they're still not clicking on them. There was no impact whatsoever.

Hugo: That is a wonderful demonstration of how many assumptions we actually implicitly and subconsciously make. Assumptions all the way down, and you need to go down and figure out which-

Lukas: Yes, exactly. It's turtles all the way down.

Hugo: Yeah, exactly. And it's the garden of forking assumptions.

Lukas: And I wasted six months of my time, right? Me and my team, we wasted six months of our time scratching at the superficial assumption. The irony is that actually hiding these hotels was the easiest thing we had done in those six months. All these ranking algorithms took days or weeks to build, whereas just hiding the hotels, challenging that core assumption, was a five-minute change. And because we do continuous delivery, an hour after we decided to do it, the test was running, and you could already see it wasn't doing anything. If I had thought about this upfront, I could've shown within my first day at Booking that there was no value to this product, removed it, and then spent six months doing something else. But instead, because I wasn't thinking about it, because I wasn't challenging those assumptions, I wasted six months of our time. They didn't fire me, so that's good.

Hugo: That's something.

Lukas: I'm still here.

Hugo: Otherwise, we wouldn't be talking right now.

Lukas: Yeah, exactly. That's something. But it did make me realize that oftentimes when we're trying to improve a product, we're making many unspoken assumptions and actually writing those down can be super, super important. That's the philosophy I was talking about.

Hugo: Yeah, exactly. And I think it's very telling that you've learnt from such a mistake, you're improving from it but also that you're communicating about it so that we can tell all the listeners out there to try to uncover what assumptions you're actually making and to challenge them before you do six months-

Lukas: Sometimes people say, "Well, you can only use A/B testing to test small things," and I think this is wrong for two reasons. One is that you can actually A/B test large changes, and we do. But the other is that it's not necessarily about how big the change is; it's about how important the assumption is. In this case, hiding those hotels was a one-line change, a bunch of CSS that hides them and does nothing else. That's a very small "change," but it has a ginormous impact on how we allocate our time and how much we can help customers. Because in the end, we're still trying to improve this product. I'm not just clocking my hours trying not to get fired; I'm trying to make changes that help customers, and to do that, I need to challenge those core assumptions first so I can figure out which things to focus on. I think this is true in many businesses, not just for Booking.
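Lukas describes the change as one line of CSS gated behind an experiment. A server-side sketch of the same idea, deterministically assigning each user to a variant and dropping the recommended-hotels block for the treatment group, might look like this. The hashing scheme, experiment name, and page-block names are all illustrative assumptions, not Booking.com's actual implementation.

```python
import hashlib

def variant(user_id: str, experiment: str) -> str:
    """Deterministically assign a user to 'base' or 'variant' by hashing
    the (experiment, user) pair, so assignment is stable across visits."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "variant" if int(digest, 16) % 2 else "base"

def render_homepage(user_id: str) -> str:
    blocks = ["search_box"]
    # The "challenge the core assumption" experiment: hide the
    # recommended-hotels block entirely for the variant group.
    if variant(user_id, "hide_recommended_hotels") == "base":
        blocks.append("recommended_hotels")
    return "\n".join(blocks)

print(render_homepage("user-42"))
```

Hashing the experiment name together with the user ID means each experiment gets an independent split of the population, which is what lets many teams run experiments concurrently on shared infrastructure.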

Hugo: Oh, yeah. It's even true in basic science research. I think my listeners are probably sick and tired of me telling this story now. But I've seen grad students in biology, which is my former incarnation, pipette for four years and do ten times as many experiments as they needed, because they hadn't worked out how to actually design the experiments to test the hypotheses they'd made, right?

Lukas: Right. And in many cases, these core assumptions are not just important for deciding whether to proceed; you also expect to see much larger effects there. So the irony is that once you uncover these big assumptions, it's easier to challenge them and easier to detect an impact. The difficult part here is the philosophy part: thinking about what assumptions you're making, then challenging them and being open to being challenged.

Hugo: Very much so. Lukas, it's been an absolute pleasure having you on the show.

Lukas: Thank you. It was great. I love to talk about this stuff. You mentioned at the beginning that you were excited to talk about experiments at Booking.com. To me, that's my bread and butter; I do this every day. I love it.

Hugo: Fantastic. That's why I invited you on.

Lukas: If you have any listeners who want to come join me and come work for Booking and talk about experiments all day, then by all means, send them my way.

Hugo: Absolutely.
