Hugo: Ben, welcome to DataFramed.
Ben: Thanks, Hugo, it's great to be here and I am looking forward to speaking with you today about Convoy and data science.
Hugo: I couldn't agree more. It's great to have you here, and in our ongoing exploration on DataFramed of what data science is and what it can be, I'm pretty darn excited to be speaking to you today about your work at Convoy and the role data science can play in revolutionizing the trucking industry, arguably one of the most impactful industries in North America. But first, I want to talk about you.
Ben: Okay, thanks.
What Is A Data Scientist?
Hugo: People always ask me, what do data scientists actually do? Ben, what do your colleagues think that you do?
Ben: That's a great question. I think the term "data scientist" is really vague. It means different things to different people. I come to data science from both a natural science and a social science background. I also took a detour through Silicon Valley before I got my PhD in economics. So for me, I kind of spend half my day in meetings talking about the science of how to run our platform, or how we're developing our approach to data science here, or specific questions, say, experimental design. I spend about half my time mentoring other data scientists, so helping a junior data scientist work on an experimental design for a campaign or, say, build a survival model to understand customer retention.
Then I spend another half of my time working as an individual contributor thinking about the aspects that drive our platform and just trying to have an impact by fixing the thing that's currently on fire. So thinking about things like auctions or matching or pricing and those kinds of issues. Anyone who can do math will probably note that that adds up to three halves, but I work at a startup so things don't add up to one.
Hugo: I couldn't agree more. At DataCamp, I feel like sometimes I'm operating in the form of an infinite sum, actually. And as your background is in physics, and you've told me there are only three numbers to physicists, right?
Ben: Right. There's zero, one, and infinity, and everything else is a matter of scaling. That's the old joke.
Ben's Career in Data Science
Hugo: Exactly. So how did you get into data science?
Ben: That's a great question, and I think I've always been working with software and data and math and statistics and models, and after DJ Patil coined the term and it went viral, then someone in the marketing department rebranded me as a data scientist, but even when I was an undergraduate in college, I was writing software to understand scientific questions. So the first data science-y thing I did was I wrote a program to understand crater relaxation for McKinnon, who's an amazing planetary scientist at Washington University, and the band Frankie was big there, and since we were dealing with crater relaxation, I called my program Frankie. Then more recently, after I did my PhD, I was at University of Chicago and was recruited to work at Amazon. So in that time, that was like 2012. So at that point, I'd say I really made the transition to taking stuff from academia, and applying it in industry.
Hugo: So now you're working as a data scientist at Convoy?
Hugo: Tell me a bit about what Convoy does, what its mission is?
Ben: This is a great example of how it's much easier for software to come into an established industry, than for an established industry to bring software in itself. So the freight industry has been around for a long time. In the US, it's a super huge market. It's like $800 billion. There's this problem, which is most shippers are heavily fragmented, like the 95th percentile ... Sorry, carriers. And carriers are trucking companies, they have one or more trucks. And the carriers are heavily fragmented and the 95th percentile carrier has like two or three trucks. So the way the shippers connect with them are through these brokers, and the brokers operate using 1980s technology. So phones and fax machines. What we're doing at Convoy is we're helping pioneer this new approach called digital freight where we're using technology and in particular data science to make this whole process better for everyone, because it's really important to match the right carrier with the right shipper's load, because then everything works better, so we need to price it, match it, and automate the whole process, and that's just going to revolutionize the cost structure of the industry. For all of you out there, you're all listening and you're also consumers, most things you buy take eight or ten truck trips to get to you. So, if we can lower the cost of freight, prices hopefully should become much more competitive on things you buy.
Hugo: So what does this look like on the ground for either ... So you said carriers are the drivers?
Ben: Carriers are ... It's a kind of complicated definition because a lot of them are like owner operators, so kind of the American dream. They could be ... For instance, here in the Pacific Northwest, they could be like a Russian immigrant who comes over here with nothing and then works his way up into becoming an owner-operator with one truck, and then as he becomes more successful, he could build out his fleet, and that's something Convoy's trying to help owner-operators do. Or it could be a more established mid-sized trucking company. So typically they have one or more trucks. It's hard for an individual to scale more than about 20 trucks because then logistics becomes complicated and you have HR and things like that to deal with.
Hugo: For sure. So what does Convoy look like on the ground for carriers? Do they use apps or ...
Ben: Yeah, that's a great question. So the main way they interact with us is through the Convoy app, and we have an onboarding process that makes it really simple for them to give us their packet of information. So things like insurance and licensing information, and then we can activate their account. At that point they will see offers that are served to them based on the kind of loads they tell us they like. So they might say, "Hey, I like to operate on the I-5 corridor," and then they can accept those loads in-app and never have to talk to a human. So it's super efficient.
They can also bid on loads as well. So if they don't like our price, or they're competing with other carriers for a load, they can bid, the price can go up or down depending on market conditions, and then the other thing we do that's great for them is say they take a load from San Francisco to Los Angeles, well if we see they do that we then say, "Hey, we've got a load from Los Angeles coming back to where you were or going somewhere else." So we're aware of that, to just try to keep them moving and earning money so that they're empty less often, because it turns out that trucks are empty about 40% of the time, which is bad for the environment and bad for the truckers.
Hugo: Yeah, I imagine there are all types of traveling salesmen problems you’re trying to figure out, you don't want to necessarily have a driver drive an empty truck from Seattle to the Bay Area in order to then transport stuff from there elsewhere. You'd like to figure out how they can actually be ... Their trip can be optimized with respect to how much they're carrying?
Ben: Yeah, absolutely. So as we get more liquidity on the platform it becomes easier and easier to be able to solve these kinds of problems and basically keep carriers in a state of almost constant motion.
Hugo: That's really cool. So this is ... One of the reasons I find this so interesting is because we think about the role data science plays in modern industry, we think of tech a lot of the time, which is an industry that was born with this rise in access to data. Whereas you're talking about revolutionizing an industry that predates all of this, this technological infrastructure.
Ben: Absolutely. And if we're successful in what we're trying to do, we are going to change the cost structure for a major industry that affects many Americans. One of the amazing facts I've learned at Convoy is that the most common job in something like 45ish states is truck driver. So this affects a lot of people in terms of their employment, and it also affects everyone who's a consumer. This can have a big impact on many people's lives in a positive way, in particular for the truck drivers themselves, we're making it easier for them to run their own business, grow that business, and just have much, much less friction in how they run that business.
Data Science at Convoy
Hugo: For sure. So how does data science then play a pivotal role in Convoy's mission?
Ben: Data science is central to what we do. In fact, from the first day we opened, we've always been automated. So we have to solve a wealth of fascinating problems with lots of economics, so it's very important to be able to predict the price because we need to price correctly to shippers. Often we enter into long-term contracts with them and we also have to price correctly for the carriers. Then we need to make sure we solve the matching problem to match the right carrier to the right load. Because that has better outcomes for everyone. So, for instance, right could mean less dead head space, so that's the time you drive empty to get to the start of the load, and also the destination endpoints could matter. We worry about auctions and the whole price acceptance mechanism for the carriers themselves, as well as many other things related to the carrier life cycle, and making sure that everything's going well once a carrier's picked up a load. So there are a wealth of problems like that.
Hugo: It strikes me as a business you might come up against the chicken and egg problem, in the sense that to convince shippers to come on board with you, you need to have carriers on board, and to convince carriers you need to have shippers.
Ben: Yeah, that's something that's really hard for any platform. There's a great post by Simon Rothman, who's one of our investors from the series A round, and he wrote this great post and he says basically what you have to do is you have to start two companies at the same time. So you have to bootstrap a company on the supply and the demand side at the same time, and it makes it really tricky because we've got to keep this in balance.
So, fortunately, we have an amazing sales team, and they're really good at generating quality demand, and then we quickly build supply to try to keep everything in balance, and that's something we track. That's something any platform worries about. They think a lot about liquidity and maintaining balance. It's like, if you went to a dating site and let's assume you're heterosexual and if there are no women on there it's a bad dating experience. So you need to have equal numbers of men and women.
Hugo: The reason this is kind of the front of my mind is previously we've had a similar challenge, at DataCamp in the early days, getting students while getting instructors as well, because instructors want an audience and students want the best instructors, right?
Ben: Yeah, absolutely. Then the other thing I'd say is that there are also network effects. So once it gets going, these businesses tend to grow exponentially, so it's really exciting. It means that if you do this right you'll find that you're not sleeping.
Hugo: Absolutely. So you told me one of the most important things is a really strong sales team, how is the data science team integrated into the company with respect to, for example, the sales team?
Ben: Yeah. I think one of the big things we can do is help them understand pricing, things like how to bid on loads, what loads should you bid on, and those are some ways that we can work very closely with them. Their pricing is an incredibly complex thing, and there are all kinds of incentives around it. So that would take hours to unpack.
Hugo: We've been discussing, kind of circling around the role of data science. What specific types of data science questions do you need to answer in your job?
Ben: So I think one of the things that's really good about Convoy in our culture is we're really data-driven, and so we do a huge amount of experimentation just like everyone else, I would assume, in industry, or certainly the top players. So the way we answer questions is often through experimentation, and that's the only way to solve it. So we do a lot of experimentation. We've invested heavily in building a very good experimentation framework that enables us to iterate quickly on experiments so that there are different approaches to running A/B tests, like Bayesian or frequentist, and like the Bayesian or sequential analysis will get you an answer much more quickly. They also make it easier to discuss results with product managers.
Hugo: Let's just step back a bit and maybe you can give us an example of an A/B test that you perform?
Ben: Great. So the kind of thing that you would typically A/B test is, does some new UX flow work better? So let's say ... One of the things we care about is that when a driver completes a trip that they automatically upload their paperwork. It's something called a BOL, bill of lading. If we rolled out, say, a new process to improve that and make it easier, we could run an experiment where we put half the carriers, keep them on the old technology, and put half in the new technology, and then after some time we can say with some probability whether or not the new process works better.
Hugo: That's a great example and a great description of A/B testing as well. I'm going to probe a bit more. You mentioned Bayesian methods converge more quickly or give you results more quickly than frequentist methods. I know I'm putting you on the spot here, but would you mind for the laypeople out there, just giving a brief description of the difference between frequentist statistics, which people may be more familiar with, and the Bayesian methods that you're discussing, in this case?
Ben: Certainly. So the first real approach to statistics was the Bayesian method, that was developed by Thomas Bayes, he was actually a vicar. His idea works a lot the way your intuition works, which is you start out with some prior set of beliefs about something, like my new process is better, say it's going to cause a 10% lift. Then over time as you observe data, you update those beliefs. So this Bayesian updating then will converge to what we call the posterior, and that is the distribution that we expect the lift to have.
The frequentist view ... Before I go on to frequentist, the Bayesian methods were very hard to compute until maybe two decades ago. We didn't really have the computational resources. Since then there have been huge improvements, and it's much easier to compute these models. Before then you could only really solve special cases. So people like R.A. Fisher in particular were very critical, I think, of the Bayesian approach, and so they developed the frequentist approach in the early 1900s, if I'm correct. There the idea is there is some true value of the parameter, and if I sample enough data, as I get more and more data, that's going to converge to the truth.
So frequentists tend to talk about P values and confidence intervals, and that's the traditional hypothesis testing you're used to knowing. So in the frequentist world, you'd have to go to a marketing manager and say, "Conditional on the null hypothesis being true the probability that I observed an effect as big as the one I saw or bigger is 5.7%." So then you'd have this argument because the traditional view on significance is that you would use the significance level of 5% to say that the result was significant. So in this case, if you were a good person you would not be able to reject the null hypothesis, but your marketing manager's probably going to say, "5.7%'s really close to 5%, let's say we just use a 10% significance level." And you're in this world of hurt. Whereas with the Bayesian method you can go to the product manager and say, "There's a 94.3% chance or 98% chance that variant A is better than variant B," and so it's a much easier conversation to have.
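A minimal sketch of the two summaries Ben contrasts here, using made-up upload counts for an old and a new paperwork flow. The counts, the function names, the uniform Beta(1, 1) prior, and the pooled two-proportion z-test are all illustrative assumptions, not a description of Convoy's experimentation framework.

```python
import math
import random


def two_proportion_p_value(success_a, n_a, success_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test (frequentist)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF evaluated at |z|, via the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return 2 * (1 - cdf)


def prob_b_beats_a(success_a, n_a, success_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each rate is Beta(1 + successes, 1 + failures).
        rate_a = rng.betavariate(1 + success_a, 1 + n_a - success_a)
        rate_b = rng.betavariate(1 + success_b, 1 + n_b - success_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical counts: carriers who uploaded a BOL under each flow.
old_uploads, old_n = 400, 1000
new_uploads, new_n = 440, 1000

p_value = two_proportion_p_value(old_uploads, old_n, new_uploads, new_n)
posterior_prob = prob_b_beats_a(old_uploads, old_n, new_uploads, new_n)
print(f"p-value: {p_value:.3f}")
print(f"P(new flow is better): {posterior_prob:.3f}")
```

The frequentist summary is a p-value that has to be compared against a threshold, while the Bayesian summary is a direct probability that the new flow is better, which is the kind of statement Ben says is easier to discuss with a product manager.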
Hugo: In the example you're talking about, variant A and variant B, or the parameter we're looking at would be the number of people who successfully upload all their paperwork?
Hugo: As a function of whatever UX looks like.
Ben: Right. So it's whatever you're testing. Like does my new workflow have higher clickthrough rate or higher checkout? Does it lower churn? Whatever you're interested in. So yeah. I think the Bayesian method makes it much easier to have conversations with product managers, in addition to ... In our simulation studies, we find it converges much more quickly. For our business. Your mileage may vary on your business, and if you're in Europe your kilometers may vary.
Hugo: I think I recall Dave Robinson actually wrote a number of posts for Stack Overflow about Bayesian A/B testing there, and showed that for their experiments they didn't necessarily converge more quickly. I need to check that, and we can put that in the show notes as well.
Ben: Yeah. There also were some great blog posts by Evan Miller on the subject.
Other Data Science Techniques and Methods: Econometrics and Machine Learning
Hugo: You mentioned experimental design. That's a really interesting approach, because I don't think a lot of people would expect that experimental design and these types of methods would play such a huge role in reinventing the trucking industry using data science. What other types of techniques and methods are you guys interested in?
Ben: Yeah, so just before we go on I think one of the things I would ... Just to close out A/B testing is, there's a lot of wisdom that people have about the trucking industry. So our company is half tech startup and half trucking industry veterans, and they have a lot of hard-won knowledge and intuition, but it's not always correct or precise. So by performing experiments, we can make things much more concrete.
Hugo: Is there something political about that as well, and social in the sense that a lot of these people have been around for significant amount of time, have certain amounts of power, and kind of hard-earned knowledge in some ways, and is there a view that tech startups can come and in inverted commas disrupt that and there needs to be social behavior reflecting that?
Ben: Yeah. I mean I can only speak within the confines of the Convoy culture, and we have a great, really team-orientated culture. It feels like when I played on good hockey teams, and I think that both sides are really appreciative of what they bring to the table. You might even have two industry veterans who don't agree about something. So in Convoy we try not to have disagreements that can't be resolved through ... Let me rephrase that. So if a disagreement can't be resolved through some kind of intellectual argument with theory or facts, then we run an experiment. Instead of sitting around and having an argument that can't be resolved.
Hugo: So what else? What other types of techniques and methodologies are you guys interested in using?
Ben: Yeah, so, I mean, we're very much a practical, applied data science shop. We're trying to solve concrete business problems in short amounts of time. So we use kind of the standard toolkit you would expect us to use. So we're very agnostic about tools, so people tend to use the best tool for the job, and so in terms of technologies that could be R or Python, they both have strengths and weaknesses. I think it's good to know both. Then in terms of specific approaches, sometimes machine learning is best, in particular if we need to predict something, like predict whether or not someone's going to upload a BOL or, say, be a good carrier. Then you might build a logistic regression or some kind of boosted classifier. But there are other times where you need to understand if A caused B, maybe we weren't able to run an experiment, then we would be back in the world of applied statistics and do some kind of regression analysis. Kind of my first win at Convoy was before I started they had released a feature without an A/B test, and they wanted to know whether or not this new feature helped, and I was able to use the CausalImpact package that Google developed using Bayesian structural time series to show that the new feature had had a beneficial impact.
Hugo: Fantastic. And that's on data already collected?
Ben: Right. So we have existing data, and so then you have to go and try to make sure that everything is as good as randomly assigned, hopefully. So one of the key features to be an experiment, right, so you need random assignment to treatment, and you also need to satisfy some other things like that your assignment's individualistic and probabilistic and unconfounded. These are technical terms. So if any of these fail, then you're back in the world of observational data, and then you're going to use applied statistical methods or econometrics to try to create something that's as good as randomly assigned so that you can then make some kind of causal statement about whether or not A caused B.
Hugo: Can you remind me what econometrics is?
Ben: Oh, econometrics is the set of statistical tools that economists have developed for dealing with economic problems. For the business world, those tools are super helpful because most of our problems are economic in nature, and so a classic example would be dealing with something like sample selection and other forms of what are called endogeneity, where you have outcomes that are co-determined within the model. The classic example is if you're trying to understand whether or not increasing the size of your police force will reduce crime, well you could think of crime as probably a function of the amount of police you have, but police itself is also a function of the level of crime. So there's this ... That's an example of simultaneity. Sample selection, once you start to look for it, you see it everywhere, that would be like I tried to run an experiment to see if small class size improves reading comprehension, but all the parents of kids who are posh insist that their kids are in the small class, and so now you have kids who are in the smaller classes are more clever, and now you've got selection bias.
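The class-size example can be sketched as a small simulation. This is an illustration under stated assumptions, not anything from the interview: the true effect of 5 points, the unobserved "advantage" variable standing in for posh parents, and all function names are made up.

```python
import random


def mean(values):
    return sum(values) / len(values)


def estimated_effect(assign_small, n_students=20000, true_effect=5.0, seed=1):
    """Difference in mean scores between small and regular classes."""
    rng = random.Random(seed)
    small, regular = [], []
    for _ in range(n_students):
        advantage = rng.gauss(0, 10)  # unobserved by the analyst
        in_small = assign_small(advantage, rng)
        score = 50 + advantage + (true_effect if in_small else 0) + rng.gauss(0, 5)
        (small if in_small else regular).append(score)
    return mean(small) - mean(regular)


def randomized(advantage, rng):
    """Coin-flip assignment, independent of the student's advantage."""
    return rng.random() < 0.5


def self_selected(advantage, rng):
    """Advantaged students are more likely to land in the small class."""
    return advantage + rng.gauss(0, 10) > 0


randomized_estimate = estimated_effect(randomized)
selected_estimate = estimated_effect(self_selected)
print(f"randomized estimate:    {randomized_estimate:.2f}")
print(f"self-selected estimate: {selected_estimate:.2f}")
```

Under random assignment the difference in means sits close to the assumed true effect. Under self-selection the same comparison mixes the class-size effect with the advantage of the students who ended up in the small class, so it overstates the effect.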
Hugo: So econometrics has created a bunch of tools to deal with these types of challenges?
Ben: Right. In particular, there's one set of tools that are well known to economists but not to data scientists outside economists, and that's panel data. These tools are really good for dealing with what economists would call individual heterogeneity. So if I'm looking at the behavior of carriers or shipments, these have individual quirks that I can't observe, and if I can observe a carrier over time, panel data gives me great methods to remove these unobserved individual effects that could confound my estimates.
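One of the standard panel-data tools is the fixed-effects (within) estimator, which subtracts each unit's own average before fitting. The sketch below is a toy simulation under assumed numbers: the carrier "quirk", the true slope of 2.0, and the function names are hypothetical and only illustrate the idea of removing an unobserved individual effect.

```python
import random
from collections import defaultdict


def ols_slope(xs, ys):
    """Slope from a one-regressor least-squares fit with an intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def within_slope(ids, xs, ys):
    """Fixed-effects slope: demean x and y within each unit, then fit."""
    groups = defaultdict(list)
    for unit, x, y in zip(ids, xs, ys):
        groups[unit].append((x, y))
    dx, dy = [], []
    for rows in groups.values():
        mean_x = sum(x for x, _ in rows) / len(rows)
        mean_y = sum(y for _, y in rows) / len(rows)
        for x, y in rows:
            dx.append(x - mean_x)
            dy.append(y - mean_y)
    return ols_slope(dx, dy)


# Simulated panel: 200 carriers observed for 10 periods each.
rng = random.Random(42)
true_slope = 2.0
ids, xs, ys = [], [], []
for carrier in range(200):
    quirk = rng.gauss(0, 1)  # unobserved and constant for this carrier
    for _ in range(10):
        x = quirk + rng.gauss(0, 1)  # regressor correlated with the quirk
        y = true_slope * x + 3.0 * quirk + rng.gauss(0, 0.5)
        ids.append(carrier)
        xs.append(x)
        ys.append(y)

pooled_slope = ols_slope(xs, ys)
fe_slope = within_slope(ids, xs, ys)
print(f"pooled slope: {pooled_slope:.2f}")
print(f"within slope: {fe_slope:.2f}")
```

Because the quirk is constant for each carrier, demeaning within carrier removes it exactly. The pooled fit absorbs the quirk into the slope and is biased upward here, while the within fit lands near the assumed true slope.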
Hugo: That's very interesting. It seems to me, correct me if I'm wrong, that what you're saying is that there's a whole bunch of tools that have been developed by some very smart people in econometrics that could be utilized in data science but haven't seen the light of day yet in this world of data science?
Ben: Right. Hey I love machine learning and it's great, but there are also problems where it doesn't work and I think people become over-focused on machine learning to the point of overlooking econometric methods that are often very useful and can solve problems that can't be solved with machine learning. I remember talking with someone at Uber a while before I joined Convoy and he actually said they had encountered problems that they could not solve with machine learning but they could only solve them by building a structural econometric model, which is a very complicated process, it often takes about a year or more to build one of these models and get it to work well. But you try to model the whole behavior process and utility function, but when you're done you have a very rich and powerful model where you can make good predictions about counter-factual outcomes.
Hugo: Cool. And that was at Uber, you said?
Ben: Yeah. It actually happened during my interview with them, I interviewed with them before I went to Convoy.
Geospatial Data, Quality and Self-Driving Cars
Hugo: It also sounds as though ... We've discussed machine learning, econometrics, experimental design and Bayesian methods for A/B testing. It seems like you have a lot of geographical data, geospatial data and time series of geospatial data. Does this play a role in any of your work?
Ben: Yeah. I think that that data's really important and we're just beginning to unlock what it can do, but we use it in the app in a lot of ways to make carriers' experience better. So for instance, when a carrier shows up to pick up a load, we can automatically check them in based on geospatial data. Some loading facilities are very poor, and if they keep the truck waiting too long they have to pay something called detention. So we can just start auto-paying out detention when the carrier is eligible for it instead of making them go through some laborious documentation process, which is not dissimilar to trying to file an insurance claim in the US.
Hugo: Something that sprung to mind though when you said we have all this data which perhaps you haven't explored in all its potential yet, it seems like there could be a potential for all types of social research with respect to the data you're uncovering as well?
Ben: Yeah. I think that there are a lot of really interesting questions we can answer about matching and platforms and auctions that I would love to get into more deeply. I'm sure there are many academic papers that could be written on the subject.
Hugo: What data science projects in particular at Convoy have you been involved in that you consider the most impactful on or telling about society?
Ben: That's a great question. I've been at Convoy a little over a year and we're a bit over two years old. I primarily focused on pricing and experimentation. I think one of the most interesting experiments that we ran was shortly after I started we ran an experiment where we gave preferential access to loads, to higher quality carriers, and quality went down. Everyone was shocked. Like, wait, we're giving the high-quality people early access to work and quality's going down? What's going on? This makes no sense. A bad manager would say, "Hey, you data scientist, you're stupid." Fortunately we have like a really good data science manager here, Ziad Ismail, who has built a great data culture. He let us dig into it. So I started to think about the matching literature in economics. So my hypothesis was that because we had restricted the pool of carriers to a smaller pool, even though they were higher quality, the match quality on the job went down. So we were able to verify that and then we did some regression analysis to show that match quality had a causal impact on quality. So that was a really exciting discovery because I think it showed how important matching is on our platform.
Hugo: That's incredible. So to parse that, just for myself, giving high-quality carriers early access to loads meant that the matching them to shippers, the quality of that matching algorithm went down in some sense and that caused a reduction in quality of carriers?
Ben: Yeah. The quality of how the work was carried out because there was the smaller pool of eligible carriers, and even though they were better, the fact that there were fewer people to potentially match with the work trumped the fact that they were higher quality.
Hugo: I think there's a Brazilian ant of some sort that has 30% ... I'm not saying that this is a direct analog, but let's say they have 30% of ants in each colony that do nothing, right? If you remove that 30% and come back a day later, there's another 30% of those ants that do nothing. Now I'm not saying that there are shippers or carriers that do nothing, but that there is some sort of stabilizing force happening there. Well, that's super interesting that it actually went the other way, that something that intuitively would ... you think would result in better quality resulted in worse.
Ben: Yeah. This is a great example why it's so important to experiment. When we run an experiment here, we started this nice tradition where people vote on which outcome they think will win. So it gets the whole company involved in experiments, and typically the UX team give the winners donuts or cool stickers or something like that. So we've had many experiments like that where what you expect to happen doesn't happen because it's very easy to fall in love with some feature that you think is amazing and the reality is that it gets ever harder to find something that's going to move the platform forward.
Hugo: In a sense these types of experiments, you're running some sort of laboratory right?
Ben: Yeah, we are. Focused on our business questions.
Hugo: If this is probing too much into company strategy or private material just let me know. I'm just wondering how you gauge quality of carriers or quality of a delivery or anything like that?
Ben: There's an industry-wide problem in trucking, which is 10% of the time roughly when a carrier is committed to take a load they just no-show or they tell you at the last minute they're not going to take it, and usually the excuse is my truck broke down. Trucks don't break down 10% of the time. What it means is that someone offered them a higher paying load. There are some carriers who do this all the time, and there are some carriers, like I've looked at their data where they've done 100, 300 trips, and they basically never fall off. So fall off is one of the key components we use to measure quality because it's super expensive for us when it happens because we're committed to providing a really high-quality shipping experience for the shippers, and so we have to then go find another truck who will cover the load at the last minute, and that's really expensive.
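A per-carrier fall-off rate of the kind Ben describes could be computed along these lines. This is a sketch with hypothetical records and names. The record layout, a carrier id paired with whether the committed load was actually hauled, is an assumption and not Convoy's data model.

```python
from collections import Counter


def fall_off_rates(commitments):
    """Map carrier id -> share of committed loads the carrier did not haul."""
    committed = Counter()
    fell_off = Counter()
    for carrier_id, completed in commitments:
        committed[carrier_id] += 1
        if not completed:
            fell_off[carrier_id] += 1
    return {carrier: fell_off[carrier] / committed[carrier] for carrier in committed}


# Hypothetical history: (carrier id, load completed after committing?)
records = [
    ("carrier_a", True),
    ("carrier_a", True),
    ("carrier_a", False),
    ("carrier_a", True),
    ("carrier_b", True),
    ("carrier_b", True),
]

rates = fall_off_rates(records)
print(rates)
```

A rate computed this way is noisy for carriers with only a few trips, so in practice it would likely be read alongside the number of commitments behind it.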
Hugo: That makes sense. Is there any consideration with respect to the advent of self-driving cars or self-driving trucks within your company as a whole?
Ben: Yeah. I should just say there are some other things that we think about in terms of quality like-
Hugo: Oh please, yeah.
Ben: The on-time percentage of the driver. Getting them to use the app is also really important because that allows us to drive cost down in compliance and safety. So all those are really important things, but in the specific experiment I mentioned, fall off was the main thing I worried about. We do find that if we can get carriers to start using the app then all kinds of good things happen and are possible. So another thing we do as data scientists is very economics-y, which is to think about how to structure the incentives on the platform to get the behavior we want. An example of that would be if a carrier uses the app, they get quick pay. That means we pay them same day and the standard norm in the industry is carriers get paid about 30 days after they do the work, which means they typically sell the liability to a factoring company and lose another 3%. So we're effectively giving them a 3% raise if they ... It's like 2-3% for the factor. So we give them a 2-3% raise if they use the app.
Hugo: Yeah. It's a raise and it also, it means they have more liquid assets as well right?
Ben: Right. I don't know about you but when I do work I like to get paid, and if I have to wait 30 days to get paid it's not a pleasant experience. I've got to buy groceries and rent and bike parts.
Hugo: How about with respect to the advent of self-driving cars and self-driving trucks. Is that something you guys are actively thinking about at Convoy?
Ben: Yeah. We are definitely actively thinking about that and I know the founders have spent a lot of time through their network being very plugged in and on top of that. At the end of the day, when self-driving trucks show up, and it will probably be in phases where there's different levels of automation, they're still going to need to connect with freight, and we have a platform that does that. Our goal is to be able to just integrate that on the supply side of the platform.
Multi-Facetness in Data Science
Hugo: We've been talking all about the impact of data science on trucking, with respect to the work you do at Convoy, and as we've kind of ... We've approached data science from a variety of different directions. It's clear that a lot of things play into this discipline, and you for one, you've got a background in ... Well, you're a computational economist, ex-physicist, and also previously a research scientist at Amazon. So my question is, how do all of these disciplines and histories play into your role in what you do as a data scientist?
Ben: I think you can never have too many tools, and when I was a young physicist, I was really lucky: I worked with John Wheeler, and he used to say, besides channeling Niels Bohr, "Never calculate anything until you know the answer." This is a very famous Wheelerism, but what he means is you should have a sense of what's the right answer for your scientific problem. There was a famous example with Feynman, where Feynman came in and thought he'd proved something and Wheeler said you're wrong, and Feynman was really annoyed because how could Wheeler have done this problem, Feynman just did it, he thought for the first time. There was an error in Feynman's calculation. Wheeler's knowledge of physics was so deep he just knew that it had to be wrong, it didn't make sense. So, for example, if you go out and run an experiment to measure lift on your direct mail campaign and you get 10%, that is probably wrong. You just wouldn't expect that to be true. So I think physics is very helpful in that way, in terms of being scrappy and building up math chops, particularly linear algebra. I think linear algebra is super important for success in data science, perhaps more so than calculus. Economics gave me theoretical tools for thinking about business problems as well as econometrics tools for confronting the theory with data. My time in software engineering gave me the software skills to turn statistics into code. Everywhere you work, hopefully, you're gaining new skills and you're learning. I think that's super important for data scientists. I think also culturally Amazon teaches a very adult way of thinking about problems. Having a sense of urgency, being focused on impact, it's a lot like being in a Ph.D. program where you learn to ask yourself the question regularly throughout the day, is what I'm working on going to get my Ph.D. done, and if the answer is no you're working on the wrong thing.
Hugo: I remember you gave a great talk, at a Data Science Pop-Up Seattle, called Correctness in Data Science. The reason I liked it is because you gave direction to what types of mistakes are made and what we can do as a field to correct those in terms of building a well-defined discipline, which at the moment is I suppose a vague conglomerate of techniques, concepts, and applications. So I'm wondering if you could speak to what you think the major mistakes that you see data scientists are making today.
Ben: I'm really glad you liked that talk.
Hugo: I loved it, and we'll put it in the show notes as well so everyone who listens will watch it.
Ben: Cool. First of all, I hope the viewers enjoy it, and so I think correctness of scientific models is super important, and a lot of people, particularly when they're starting out think, oh my code ran successfully, it produced a number, it must be right. Well, back up. You want to make sure that's the right number. There's an epistemological framework for thinking about that, that came out of the nuclear industry, called verification, validation and uncertainty quantification. I'm indebted to Robert Rosner, who was my postdoc supervisor and more importantly former director of Argonne, for introducing me to VV & UQ. There basically are three parts to VV & UQ. The first V is verification. That's making sure your code correctly implements the model, whether or not the model's correct. That means you should do things like write unit tests. You can also generate synthetic data through Monte Carlo methods, with known parameters, and make sure that you get the expected results, and do things like that to make sure your code is correct. Validation is making sure your model has fidelity to reality, so that's doing things like running experiments afterwards to make sure that your model is an accurate representation of reality. Uncertainty quantification is about thinking about the limits to your model. What assumptions have you made? Do they hold? Could something like a tsunami show up and take out your nuclear power plant? Maybe you should plan for that. So I think those are some basic things. I love to ask BI engineers when I interview them, how do they know if their SQL is correct, and they usually look at me like this is a super crazy weird question. But SQL is crucial, or whatever you're using to pull your data, because if you assemble a rubbish data set, nothing you do is going to get better. Even if you do super fancy statistics, you're not going to be able to fix the fact that you didn't assemble your data correctly.
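The verification idea above (generate synthetic data with known parameters, then confirm the code recovers them) can be sketched as follows. This is a minimal illustration, not anything from Convoy: the linear model, the function names, the parameter values, and the tolerances are all assumptions chosen for the example.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def simulate(intercept, slope, noise_sd, n, rng):
    """Draw synthetic data from a linear model with known parameters."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


# Hypothetical "true" parameters that the estimator should recover.
TRUE_INTERCEPT, TRUE_SLOPE = 2.0, 0.5

rng = random.Random(42)  # fixed seed so the check is reproducible
xs, ys = simulate(TRUE_INTERCEPT, TRUE_SLOPE, noise_sd=1.0, n=5000, rng=rng)
est_intercept, est_slope = fit_line(xs, ys)

# Verification: the code should recover what we put in, within sampling noise.
assert abs(est_slope - TRUE_SLOPE) < 0.05
assert abs(est_intercept - TRUE_INTERCEPT) < 0.15
print(f"slope {est_slope:.3f}, intercept {est_intercept:.3f}")
```

A check like this only shows that the code implements the model correctly. It says nothing about whether a straight line is the right model for real data, which is the separate validation step.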
So it's important to be very methodical and check as you assemble the data that it is correct, so you should think about join plans, you should test it on subsets of the data and make sure aggregate statistics make sense and distributions are appropriate. Check sensible things like you didn't get 10% lift on your direct mail campaign. Then the other thing that's really important too is models go into deployment and so that means you might need integration tests, or other tests to make sure that what's in production is faithful to what was developed in research.
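One way to answer "how do you know your SQL is correct" is to compare row counts and aggregate totals before and after a join. The sketch below uses Python's built-in sqlite3 module on a tiny in-memory database. The table names, columns, and values are hypothetical and chosen only to show a join that silently duplicates rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shipments (shipment_id INTEGER PRIMARY KEY, price REAL);
    CREATE TABLE events (shipment_id INTEGER, event TEXT);
    INSERT INTO shipments VALUES (1, 100.0), (2, 250.0), (3, 175.0);
    INSERT INTO events VALUES (1, 'pickup'), (1, 'delivery'), (2, 'pickup');
""")


def scalar(sql):
    """Run a query that returns a single value."""
    return conn.execute(sql).fetchone()[0]


# Baseline aggregates on the table whose grain we want to preserve.
base_rows = scalar("SELECT COUNT(*) FROM shipments")
base_total = scalar("SELECT SUM(price) FROM shipments")

# Naive join: a shipment with two events appears twice, inflating the total.
naive_join = """
    SELECT s.shipment_id, s.price
    FROM shipments s
    LEFT OUTER JOIN events e ON e.shipment_id = s.shipment_id
"""
naive_rows = scalar(f"SELECT COUNT(*) FROM ({naive_join})")
naive_total = scalar(f"SELECT SUM(price) FROM ({naive_join})")

# Safer join: collapse events to one row per shipment before joining.
safe_join = """
    SELECT s.shipment_id, s.price, COALESCE(e.n_events, 0) AS n_events
    FROM shipments s
    LEFT OUTER JOIN (
        SELECT shipment_id, COUNT(*) AS n_events
        FROM events
        GROUP BY shipment_id
    ) e ON e.shipment_id = s.shipment_id
"""
safe_rows = scalar(f"SELECT COUNT(*) FROM ({safe_join})")
safe_total = scalar(f"SELECT SUM(price) FROM ({safe_join})")

print("baseline:", base_rows, base_total)
print("naive join:", naive_rows, naive_total)
print("safe join:", safe_rows, safe_total)

# The checks that catch the problem: grain and totals must be unchanged.
assert (safe_rows, safe_total) == (base_rows, base_total)
assert (naive_rows, naive_total) != (base_rows, base_total)
```

The same two comparisons, row count and a summed measure, can be run on a subset of a real table before trusting the full query.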
Data Science in Production
Hugo: So where does ... Either where you've worked previously or at Convoy now, where does a data scientist sit in terms of putting what they work on into production?
Ben: That varies a lot by organization and group, so in some organizations or groups a data scientist will just do pure research and then pass things over the wall and engineering will do something magical. That can be problematic, things are often lost in translation. Many engineers are not happy if you give them R code. Hopefully, they're happy with Python. At Convoy we're trying to work so that data scientists own their model end to end, and that we have a machine learning platform that allows us to deploy models. We're not all the way there yet, we have some more work to do in that regard, but I think that's better for everyone because then the engineers can just call against the data service to get whatever result they need. Also, the way we're organized at Convoy I think is very conducive to good data science, in that we are grouped into product groups that all collaborate closely, but a product group will consist of a product manager, one or more data scientists, and a bunch of engineers. The thing that's great, too, about our product managers here is that they're all super technical people. Most of them have a master's in computer science or equivalent, a good MBA, and they can all write SQL. There's a product manager here, in the interview I asked him a SQL question that involved using left outer join, and he solved it in like 60 seconds. I think he's tired of this example, but that's the caliber of PMs we have here. They're technical, data orientated, they can write SQL, they can understand undergrad level stats, and that's very helpful because that makes them advocates for doing data science correctly. Then we have a lot of extra social stuff to make sure that the data scientists continue to connect and collaborate horizontally. So for instance, we run like a data science brown bag.
We have technical one-on-ones where I meet with other data scientists and make sure that they're heading in the right direction and answer any technical questions they have so they're not blocked.
Hugo: I also think a technical product manager, as you say, has a lot of wins, but one of the major ones is that they can really have the conversation with you as well, right?
Ben: Yeah. They are very invested in being data-driven. They know that if they write a plan for a new feature they need to work with the data scientist to have a Stack Overflow plan or some other plan to verify and validate their ideas. They're all advocates for using data, and I think we have a super high bar for PMs, but it's crucial in an organization like this that they can participate in the data conversation, because often they're driving research questions. An example of how data-driven they are is Ziad Ismail, who is the chief product officer. He writes SQL. I mean, this guy stays up late at night writing SQL to understand the business and has as deep a knowledge of the data in our data warehouse as anyone.
Hugo: That's cool. So you yourself though have ... I must say, a very impressive toolbox of statistical, econometric chops and data science techniques. I'm wondering what your favorite technique or methodology for data science is?
Ben: Yeah, that's a great question, and it's kind of like saying what's your favorite bird or whatever, camera. The British expression I guess is how long is a piece of string?
Hugo: And nobody can answer that.
Ben: Right? So I tend to like ...
Hugo: Yeah, what are you interested in? What do you like?
Ben: ... using the best tool for the job. So I've certainly ... I came from a Ph.D. program that was very strong in panel data methods, like there were people at UCL where I studied like Richard Blundell, and others who did a lot to drive that forward. So that's a strength I have, is using panel data methods. But I also like a lot of the core ML tools. It really depends on the problem. I want to use the best tool. So what I'm happiest most about is not using the tool, but solving interesting problems. I'm an applied scientist at the end of the day. I've worked on a range of problems from quantum cosmology to bioinformatics to trucking, among other things. So having interesting problems is what matters and being able to find the right tool to solve it is important. In addition to tool data and software.
Future of Data Science
Hugo: So we've discussed a lot about modern data science. What does future data science look like to you?
Ben: I think we're in this amazing time. Like if you've got math and stats skills, there's so much data just exploding everywhere. I think we're going to have a very fun and interesting time until Elon Musk figures out how to put us all out of business.
Hugo: What will happen until then?
Ben: At some point, there will be tools that are going to automate away a lot of the simple models. I think you're starting to see companies trying to sell commodity churn models and things like that, which I think can be problematic because ultimately every company's data is unique and you might as well build your own churn model from scratch, but you'll probably see more commodity models and more tools that automate a lot of the lower hanging data science fruit. So I think to have a successful and happy career you want to move further up the value chain, where there are things that can't be replaced by automation like automated feature engineering. For instance, at a previous company I worked at, Context Relevant, they made good progress toward automating feature engineering for a large class of problems.
Hugo: What types of skills would you suggest aspiring and even well-seasoned data scientists develop in order to not have their jobs automated?
Ben: The first thing is you can never know too much math. I think that that's something that's really worth investing in, and that starts at a young age. When I taught at Galvanize, where they run a data science boot camp, I found I can remediate lack of programming with someone in eight weeks. I can't remediate lack of math in eight weeks or 12 weeks. That's years of study. So I think you should continually invest in math. I think after that you want to master the core algorithms, then you need to keep reading. It's really important to keep reading. A lot of people stop reading when they get into industry, and particularly for the more advanced people, you want to choose some specialization that plays to your interests. So experimentation is something that's particularly interesting to me, as are Bayesian methods, and these are both things that I've worked on going into much more deeply in recent years. I know a lot of people are going into deep learning, that's a very competitive space. So I think there are a lot of other interesting and important areas in data science. So yeah, sure, if you want to go into deep learning, go into deep learning. But I think there's benefit in being a bit contrarian.
Hugo: I think so as well. So with all that having been said, do you have a final call to action for budding and established data scientists alike?
Ben: I think the main thing I would say to someone who's interested in getting into data science is to understand that you're setting yourself up for a life of learning and that it's a marathon, it's not a sprint. It's like getting a Ph.D. You need to pace yourself. This is something that could take you multiple years to pull off. So you need to keep investing. So maybe you should watch a little less Netflix at night and spend a little more time reading the relevant books, papers, writing code, playing with models, and if you don't have that excitement about data there may be some other place you're more happy.
Hugo: So keep learning, keep reading, keep doing.
Ben: Yeah. For me really, I think for a lot of us in the profession, data's like an Agatha Christie, there's like this mystery in there and I want to unlock it and solve it and figure out if it was Colonel Mustard in the living room with a candlestick.
Hugo: That's fantastic, Ben. So you're living your data science life as detective fiction.
Ben: Yeah. Something like that.
Hugo: That's incredible. I can relate to that a lot because the data just keeps giving. I've got a colleague who always tells us that you need to listen to your data. It will speak as long as you're listening, right?
Ben: Right. I think also, I've seen ... When you talked earlier about the mistakes people make, one mistake I've seen a lot of scientists make is they leap into modeling too soon without doing EDA. EDA is the famous Tukey term for exploratory data analysis. It's really worth investing some time in EDA, because you will discover surprising things. So when I joined Convoy, I started doing some EDA on the data, and I found some data cleaning and outlier problems that had not been addressed, and we fixed those and got like a 10% improvement in the pricing model. That was like free performance.
Hugo: That's very telling. I mean doing something like EDA, exploratory data analysis, because it is tempting to just jump in and try to build models straight away and all of that type of stuff, but I always encourage people to try to visualize their data in 100 different ways and look at their summary statistics and all of that type of stuff before doing anything else.
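A first pass at the summary statistics and outlier checks discussed above might look like the sketch below. It uses only Python's standard library. The price values are invented for illustration, and the 1.5 x IQR fence is one common convention, not a method attributed to Convoy.

```python
import statistics


def summarize(values):
    """Basic summary statistics to eyeball before any modeling."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "n": len(values),
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(values),
        "q3": q3,
        "max": max(values),
    }


def iqr_outliers(values, k=1.5):
    """Flag values outside the fences [q1 - k*IQR, q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical shipment prices with two suspicious entries (0 and 4250).
prices = [410, 395, 430, 405, 420, 415, 400, 4250, 425, 390, 0, 412]

for name, value in summarize(prices).items():
    print(f"{name:>6}: {value}")
print("flagged:", iqr_outliers(prices))
```

A large gap between mean and median, as in this toy data, is often the first hint that a few extreme records are worth inspecting before fitting anything.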
Ben: Right. And I try to teach students a very methodical, standardized approach, and that's one of the things I think is great about CRISP-DM, which is the cross-industry standard process for data mining, and it's probably the best workflow I've seen for a data science project, and it's really good to go through all the steps to make sure you don't leave anything out. So you start out by just understanding the business problem, then understanding what data you have. Prepare your data. Model, evaluate and deploy. And at any point, you may find some mistake that you need to go back and address in one of the earlier steps. So you start modeling and you realize, oh, I didn't do my feature engineering right, or I need to add a feature to capture some key behavior where the model's failing. It's very good to be systematic like that so you don't duplicate effort or miss steps, and ultimately in terms of correctness, which we've also touched on, I'd like to kind of find a way to combine CRISP-DM and VV & UQ, and I think then you're in a very powerful, professional and mature setup.
Hugo: Fantastic. That really speaks to kind of a systematic structure for what future data science could look like or incorporate.
Ben: Right. Then the other part of that is the modeling is the fun part, and getting to the answer, but the stuff that comes before it is super important and it takes about 80% of your time. That's cleaning data, preparing it. Then the modeling is kind of fast and it's like this high and it's like, boom, where'd it go? I'm back to normal. I think that we will see more tools hopefully that will make that pre-modeling period faster, because that's where you're going to get your big productivity gains, is if you can become faster in that phase. So that's a good place to invest. So you should learn things like UNIX and get really good at using all the command line tools and other technology or platforms that are going to help you get your data ready to model more quickly.
Hugo: Exactly. Ben, it's been an absolute pleasure having you on the show.
Ben: Thanks, Hugo. It was a treat to be here and I wish you the best of luck with the show.
Hugo: Thank you.