Building High Performing Data Teams
Syafri is the Vice President (VP) of Data Science at Gojek. He obtained his master's degree in Applied Mathematics (Financial Engineering) from Twente University and a bachelor’s degree in Mathematics from Brawijaya University. He graduated from both universities with honors (cum-laude). He is a former VP of Data Science at FinAccel. Prior to that, he has spent 9+ years living and working in the Netherlands as a Quant (quantitative modeler) at various Financial Institutions ranging from banks, insurance, asset management, consultancy, etc.
Adel is a Data Science educator, speaker, and Evangelist at DataCamp where he has released various courses and live training on data analysis, machine learning, and data engineering. He is passionate about spreading data skills and data literacy throughout organizations and the intersection of technology and society. He has an MSc in Data Science and Business Analytics. In his free time, you can find him hanging out with his cat Louis.
Transcript
Adel Nehme: Hello, this is Adel Nehme from DataCamp, and welcome to DataFramed, a podcast covering all things data and its impact on organizations across the world. One thing we always think about at DataCamp, whether in webinars, white papers, or podcast episodes, is the data maturity of organizations and what the hallmarks of a data-driven organization are.
Adel Nehme: It's oftentimes a combination of infrastructure, skills, tools, organizational models, and processes that enable data-driven decision making at scale. This is why I'm so excited to have Syafri Bahar, VP of data science at Gojek on today's show. Gojek is an Indonesian super app that provides more than 20 data-enabled digital services such as food delivery, transportation, commerce, payments, and more. It was the country's first unicorn and decacorn, for that matter, and Syafri oversees a large portfolio of data products and manages a variety of data roles.
Adel Nehme: Throughout the episode, Syafri discusses his background, the hallmarks of a high-impact data team, how he measures the ROI on data activities, the skills needed in every successful data team, the best organizational model for data mature organizations, how COVID-19 affected Gojek's data teams, his thoughts on data literacy and data governance, future trends in data science and AI, and why data scientists should sharpen their math and machine learning skills in an age of increasing automation.
Adel Nehme: Also, we'd absolutely love your feedback on how we can make DataFramed a better show for you, and ...
Syafri Bahar: Yeah, sure. Thanks a lot, actually, I'm happy to be in this podcast.
Adel Nehme: I'm excited to discuss with you the data science powering Gojek and all the best practices you've developed leading data science at such a data mature organization. Before we begin though, can you give us a brief introduction about your background and Gojek's mission for those who may not be familiar with it?
Syafri Bahar: Yeah, so my name is Syafri. I've been at Gojek for around three years. Prior to this, most of my career was actually spent doing modeling within financial institutions: banks, asset management, insurance. I was hopping from one type of risk to another. And I think what's really nice about it is that each risk type actually incorporates different mathematical models, so it also gave me quite some exposure to the different quantitative techniques available out there.
Syafri Bahar: And then, yeah, now I'm with Gojek. So, I'm overseeing data as a whole actually. So I'm overseeing a portfolio related to consumer science and analytics at Gojek. And I basically oversee several job ladders. At Gojek, we have data science, we have decision science, and also business intelligence. So yeah, that's a little bit about me. I'm sure you're also curious about Gojek.
Syafri Bahar: Gojek in itself is, I would characterize it as, a super app actually, like an on-demand app. Especially with the merger with Tokopedia, we're the largest in Indonesia, for sure for now. And then with Gojek, we have around 20-plus products, you name it, actually. So we have ride hailing, we have food delivery, we have logistics, even a streaming service, actually. So it's quite a lot.
Syafri Bahar: And then those are the different offerings that we can actually give to our customers in Indonesia. I think one out of four Indonesians actually has Gojek installed. So that's a good thing. In terms of our drivers, we have around two and a half million drivers. So it's quite huge, basically. I think together with our new friend, Tokopedia, we've contributed to around 2% of Indonesian GDP. So it's pretty huge. Yeah.
Adel Nehme: Yeah, that's massive. And I think for our Western audiences, the best similarity to Gojek maybe it would be WeChat, if I'm not mistaken. So given the wide variety of tools and services Gojek provides, I'm sure data science plays a massive role throughout the value chain of different Gojek products and services. Do you mind giving a brief overview of some of these key areas of data science that deliver value for Gojek? Whether for customers or internal usage?
Syafri Bahar: Yeah, sure. So I think what I really like about Gojek is that data science plays a central role in the different product offerings that we have. Basically, if we look at the whole lifecycle of customers, from acquisition, activation, retention, maximization, right? From their booking values, even the customer services. We have a lot of machine learning systems actually powering those use cases. So it's actually quite broad.
Syafri Bahar: But if I can name some of them, so definitely, I think one of the first use cases of data science at Gojek would be our matchmaking engine. So basically, this is an engine that is responsible for matching drivers and orders, right? And what's really nice about it, I think from a business perspective, is that it is multi-objective. So one can actually specify what kind of objectives the business wants to optimize. And then the system will actually learn by itself, and then give the best allocation, so that's number one.
Syafri Bahar: Standard, plain vanilla use cases you can also see in the search and recommendation engines that we have. So we also apply it there. And I think one of the biggest machine learning systems that we have at Gojek is called [Gobstopper]. So Gobstopper, in essence, is a promo-allocation engine, right? So it is basically responsible for like 80% of the demand generation budget of the company, which is really huge actually.
Syafri Bahar: So what this engine will do is that it will allocate the right vouchers to the right customers. Now, we're basically combining counterfactual machine learning also with some abstract type of optimization in order to achieve that. So it's quite a combination of multiple things. So I'm pretty excited about it.
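To make the idea of pairing counterfactual predictions with optimization concrete, here is a minimal sketch. It is not Gojek's implementation: the `Candidate` fields, the greedy uplift-per-cost rule, and all numbers are assumptions chosen for illustration. It assumes some upstream model has already predicted each customer's order probability with and without a voucher.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical per-customer inputs from an upstream counterfactual model."""
    customer_id: str
    voucher_cost: float      # cost of giving this customer a voucher
    p_order_with: float      # predicted P(order | voucher)
    p_order_without: float   # predicted P(order | no voucher)

    @property
    def uplift(self) -> float:
        # Incremental effect of the voucher on the order probability.
        return self.p_order_with - self.p_order_without


def allocate_vouchers(candidates, budget):
    """Greedy heuristic: fund the highest uplift per unit cost until the budget runs out."""
    ranked = sorted(
        (c for c in candidates if c.uplift > 0),   # never pay for zero or negative uplift
        key=lambda c: c.uplift / c.voucher_cost,
        reverse=True,
    )
    chosen, spent = [], 0.0
    for c in ranked:
        if spent + c.voucher_cost <= budget:
            chosen.append(c.customer_id)
            spent += c.voucher_cost
    return chosen, spent


candidates = [
    Candidate("a", 2.0, 0.60, 0.55),  # would mostly order anyway
    Candidate("b", 2.0, 0.50, 0.10),  # persuadable
    Candidate("c", 1.0, 0.30, 0.20),  # cheap, moderate uplift
    Candidate("d", 2.0, 0.20, 0.25),  # voucher predicted to hurt
]
chosen, spent = allocate_vouchers(candidates, budget=3.0)
print(chosen, spent)
```

The point of ranking by uplift rather than by raw order probability is that customer "a", who is the most likely to order, is the least worth subsidizing. A production system would likely replace the greedy rule with a proper constrained optimizer.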
Adel Nehme: Yeah, that's a very large portfolio to manage. It must be very exciting. I want to set the stage for today's conversation. So you're currently the VP of data science at Gojek. Gojek is quite a data mature organization. It was born with data in mind, and many organizations look at firms like Gojek as really the gold standard or North Star for where they want to be.
What are the hallmarks or characteristics of a high-performing data team?
Adel Nehme: So there are many levers towards succeeding in data transformation and cultivating high-performance teams. High-performance data teams are such an important part of this. I wanted to know what you think are the hallmarks or characteristics of a high-performing data team?
Syafri Bahar: I think, for me, there are a couple of characteristics I would use to describe a high-performing team, right? So number one, they need to be empowered. And I think the executive sponsorship plays a very big role in this. Because again, if you look at the machine learning investment, sometimes it takes years.
Syafri Bahar: Not even months, it takes years until it manifests into something which can be measurable, right? So having an executive sponsor is very important, definitely number one. And having that will ensure that the team will feel empowered. And I think to me, empowering also means that there should be relative freedom for the teams to try out different approaches. And I think especially within Gojek, we always encourage our data scientists to try out the latest techniques, the latest stacks. And then they come back to us with the result of experiments.
Syafri Bahar: And this is also very interesting, because this is also what we try to do on a company level. So when we talk to executives, executives already expect that we've tried out all those different things. And then when we have the conversation, it's not about whether we can start investigating it or not but, "Hey, these are some of the new approaches that we've tried. Here are the results, should we scale up? Yes or no?"
Syafri Bahar: But I think that also trickles down to the data scientist teams that we have. So empowerment is really important. So the ability to really be very agile with the approach, to experiment, measure, and do very fast iterations. And then still related to empowerment is really also about having the courage to fail fast, but then to learn as well out of it. So that's very important.
Syafri Bahar: And number two, I think everything needs to be measurable. I think we will discuss it also a bit later, about measurability. So in all the different machine learning systems that we have, the first question is about the product engineering system that we want to integrate with: does it have sufficient capability to do measurement? We want to make sure that that's in place before we actually even engage in any machine learning system projects that we have.
Syafri Bahar: So, that's also really important. And definitely, number three, I think the team also needs to be empowered to do decentralized decision making, for sure. It's again, part of the empowerment, because being able to take decisions by themselves, using of course a scientific method, will again empower the team to make the right decisions without needing a very complex decision-making structure. Yeah.
Adel Nehme: And you mentioned here, measurability of output. One thing I've seen you discuss, and you've mentioned here is the importance of ensuring high-leverage teams improving the ROI of every single data scientist on your team. And at the heart of that, is the ability to measure impact. I think a lot of data leaders struggle with quantifying their work, especially in metrics business leaders care about. So can you describe what are the steps that you had to cross in order to reach such a high level of transparency on the quantitative value of your team's output?
Syafri Bahar: Yeah, and I think definitely one very crucial ingredient of having a data-driven conversation, especially with the executives, it's really to start everything by asking the right questions like for example, what is the impact? Where can I find the data? And what are the measurements that we use as North Star? I think those conversations will actually trickle down also to the team level execution, right? So that's definitely ingredient number one, having a sponsorship from the executive. And if I look at our CEO, Kevin, so he used to be head of BI actually for Zalora before Gojek.
Syafri Bahar: So he got together with another senior data leader back then, Crystal, and they actually built the data organization. And so we really understand the value of having proper data and asking the right questions. Having the right North Star metrics is also very important. Because using these metrics, actually, we can march everyone in the same direction, right? So definitely, that is very crucial. Second thing, having proper tooling in place is also very important. Having the proper infrastructure to do the measurements. And I've got to say, having a mature A/B testing capability, for example, is really important.
Syafri Bahar: Configuration management is really important. And all these different infrastructures that we can think of, they are very important to have in place. So that's layer number two. And layer number three is really being able to have the right methodology to measure as well, because not everything can be A/B tested, right? That's also the reason why we have a dedicated job ladder for that. We call it decision science, actually, where they basically incorporate a lot of statistical techniques to really answer some of the biggest questions that we have within the company.
Syafri Bahar: For example, how do we measure the impact of loyalty? You can't really A/B test loyalty, right? And then I think for that purpose, we resort to many causal inference techniques in order to be able to do that. And then, again, there are so many choices, and they really depend on the use cases or questions that we try to answer. But I think what I want to say is that it is really important to have a really proper scientific way to ask the questions and to measure them, and be really intentional about it. Having someone who's really looking into this and is really an expert, because it's a discipline on its own actually, will really help a lot in order to get to that stage.
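For the questions that can be A/B tested, the basic measurement step can be sketched as a two-proportion z-test on conversion rates. This is a generic textbook procedure, not a description of Gojek's experimentation platform; the function name and the traffic numbers below are made up for illustration.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of control (a) and treatment (b).

    Returns (absolute lift, z statistic, two-sided p-value).
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both arms convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return p_b - p_a, z, p_value


# Hypothetical experiment: 10% vs 11% conversion on 10,000 users per arm.
lift, z, p_value = two_proportion_z_test(conv_a=1000, n_a=10_000, conv_b=1100, n_b=10_000)
print(f"lift={lift:.3f}, z={z:.2f}, p={p_value:.4f}")
```

Questions such as the impact of loyalty do not fit this template, because membership cannot simply be randomized, which is where the causal inference techniques mentioned above come in.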
Data Solutions To Invest In
Adel Nehme: So given that you've really emphasized creating this infrastructure and having this multi-layered approach to measuring the impact of data solutions, how does that play into your decision-making process around which data solutions to further invest in down the line?
Syafri Bahar: Yeah, yeah, it really influences a lot of our decision making, right? Because again, machine learning investment, or data science investment in general, is quite expensive. So it is really important to be able to size the market before we even start any machine learning engagement, right? We need to be able to identify, I mean, if we compare five use cases, for example, are we talking about $1 versus $100 impact? Or are we talking about $40 versus $60 impact?
Syafri Bahar: So I think it's really important to have that. And we really use all these different infrastructure to really become the basis for various conversations that we have within the company, for sure. That's how important it is to be able to have this measurement and be able to use that in order to make the right investment. And not only that, actually, I'm talking... I mean, that's even the first layer. So the first layer is really about do we want to make investment yes or no?
Syafri Bahar: Second layer of questions can be, "Okay now, if I want to tackle these problems, I have 10 different solutions, different data science, there'll be different ways to frame the problems, right?" I can frame it as unsupervised, supervised, reinforcement learning, for example. Each comes with its own degree of complexity. And I think it is really important to be able to size that effort; really to measure the trade-off between effort and impact, putting in the nice quadrant, and then really, "Okay, this is the approach that we want to do." So, that's how deep we went into decision making basically using all those different infrastructure to make decisions on our data science projects. Yeah.
Adel Nehme: You mentioned here the use of decision scientists. So from a skills perspective, given how prevalent and diverse data science is within Gojek, what are the different roles you hire for and the different skill sets you think every data team should have?
Syafri Bahar: Yeah, that's an interesting question, actually. So first of all, especially relevant for this podcast, we have data scientists. All right, so I think for us, data scientists at Gojek specialize in building scalable machine learning systems. So inherently, there are expectations for data scientists at Gojek to be full stack and also to be able to apply good software engineering principles in building these machine learning systems.
Syafri Bahar: So that's data scientists, and just to make it even clearer, if I were to articulate it, data scientists at Gojek specialize in helping the company make micro-scale decisions very rapidly: high-frequency micro decisions. Okay. Now, it comes to decision scientists. So decision scientists, if I want to contrast them with data scientists, will specialize in really helping make less frequent, big decisions, which inherently require a lot of statistical knowledge in order to be able to find the problems and apply the right techniques, basically.
Syafri Bahar: And then we also have business intelligence, basically. So the role of business intelligence within Gojek is really to make sure that we have one single version of truth when we look at things. Then they're also responsible for defining the right metrics and making sure the data is available. And they also, to some extent, do the [inaudible] as well. Being able to map the different business flows that we have, and translate them into tables that we have.
Syafri Bahar: Just to make sure that we have proper data models. Because I think that's also very important, because we cannot really do a lot of the advanced analysis that we may want to do if we don't have a very solid foundation in the data, right? Having a single version of truth is really important. Having reliable data that does not break once every week is also very important. So yeah, those are the different personas. And of course, we also have MLOps and machine learning engineers as well, that we hire within Gojek.
Adel Nehme: I'm excited to cover this more as we're talking about data teams. Oftentimes organizations undergoing data or digital transformation struggle with identifying the best possible way to organize their data talent. Some organizations gravitate towards a centralized center of excellence model, others towards an embedded model where data scientists are integrated within functional teams. How is the data talent organized at Gojek?
Syafri Bahar: Yeah, I think for us, it was also an evolution, I would say. So we've actually tried various different models. We started with a matrix organization. I think that was good for a very small team because we wanted to make sure that everyone follows the same practice. So there needs to be like a centralized voice, in terms of how we should do things.
Syafri Bahar: So basically, what we did back then is that we had a central team, and we dispatched them to different products. But we were acting as consultants pretty much back then. But then as the team grew, as the use cases grew within Gojek, that model was not sustainable anymore. So what we do now is operate according to, I would say, a federal system.
Syafri Bahar: So each of the heads of data within Gojek, or data leaders, have their own domain that they need to take care of. We are very embedded within the business teams and also product teams, just to make sure that we really feel the heat as well, like we are really invested, we have skin in the game. And it also really allows us to think bigger than the problem statement that we try to solve.
Syafri Bahar: And that's also very important, right? And just to make sure that we have consistencies in terms of our practices, and consistency of our career path and et cetera, we form a council basically, within data leaders at Gojek just to make sure that we follow also the same practices and standards as well. But so currently, we are at the place where we are fully embedded within business team and product teams, yeah.
Adel Nehme: So do you think as an organization increases in data maturity, the operating model needs to evolve with the organization's data capabilities, and it moves to something that is more hybrid?
Syafri Bahar: Yeah, I would even say that's inevitable. Because otherwise, the organization becomes too complex to manage, right? And then for the functional leaders, the leaders in the central function, it is also very difficult. Especially if we look at the domain of data science, right? If you want to have meaningful conversations, a data leader needs to have context that is two or three layers deeper, basically, in order to be able to have a meaningful conversation with the team and with their business leaders.
Syafri Bahar: And then, if we are putting it that way, it's just very difficult to be able to maintain knowledge here, like having a context, especially within the various domains that we have at Gojek. I think to me that it's inevitable to go to that model.
Adel Nehme: And as you said, skin in the game is really important, because otherwise, data scientists are just creating analysis on an ad hoc or support basis, and that doesn't necessarily maximize the impact.
Syafri Bahar: Correct. And I think one thing also that I try to encourage is this: traditionally, data people, data professionals in general, are seen as service providers, but what I try to set as well in the organization is that data people are not service providers, we are thought partners, right? Meaning that we need to be involved at the very beginning when the problems are being formed. Because that way we can actually give a good recommendation. We have skin in the game, and so on and so forth. I think there are so many benefits of doing that.
Combining Business Acumen and Data
Adel Nehme: Do you think then that data scientists often end up not really forging a sense of business acumen or understanding of the use case they're working on? And do you think data teams should focus on instilling that business acumen within their data teams?
Syafri Bahar: Yeah, I think so as well. I think giving a big picture to our data team is really important for us also to think beyond the predefined problem statements that are given to us. So, yeah, but I think it's really important to have that. And I'm not saying that all organizations work that way. I think it really depends on the type of problems that we solve. I think for a very hyper-optimization type of problems, it makes sense, right?
Syafri Bahar: Then one will just basically stick to that problem, really go 100 layers deeper into the problem, and then really solve it. But especially for problems with a large degree of ambiguity, I think it is really important to be involved in the conversation, right? Because it can go in multiple directions, especially in the objectives that we want to solve for. And so I think that's how I would differentiate when a data team especially needs to be involved at the very beginning.
Syafri Bahar: Or it can be that the problem statement is very clear, right? We want to optimize conversion, for example, by building our recommendation engine, and then it is very clear that we need to hyper-focus into that problem.
Adel Nehme: So as a data leader, and someone who manages multiple data teams that work on a portfolio of different products and services, I'm sure the COVID-19 pandemic greatly affected your team and the different data science solutions you work on and maintain. Do you mind walking us through how you dealt with concept drift affecting your models, and more importantly, how you managed to sustain high-performing data scientists, despite massive uncertainty and stress?
Syafri Bahar: So I'll basically address it from two perspectives. Number one is how the pandemic has affected our team. So I think in terms of the effect of the pandemic itself, or even more specifically, the fact that we need to work remotely, it didn't affect our team that much, because even before the pandemic, we were a distributed team already. So our teams are actually distributed in several cities. We have teams in Singapore, in Thailand, in Vietnam, in Bangalore, for example. Some of our colleagues even work from outside Asia. So it was less of an issue for us.
Syafri Bahar: One thing that we doubled down on was that we need to be very good with documentation. Everybody needs to be very good with documentation, because the fact that everyone needs to work remotely means that the one powerful means of communication will be via documentation. So we take documentation seriously, for sure. We always update on Confluence, just to make sure that the model has proper documentation and it can be followed. There are links to our data source, to GitHub, and multiple other things, right? So that's number one.
Syafri Bahar: Number two, in terms of the effect on the model. And I think, even before the pandemic, we acknowledged and realized that we work in a nonstationary environment already, if you look at the market. And especially the fact that we dominate some of the markets, meaning that any changes that we make to our model can potentially change the behavior of customers, meaning that there are also a lot of feedback loops, right? Meaning that the market will change anyhow. And then we need to take it as business as usual. Not as a phenomenon, right?
Syafri Bahar: But it's just that we always need to make sure that there is no drift in our features, for example. We need to make sure that we have frequent retraining. And more and more, we realize that we need to adopt more adaptive learning techniques in our modeling in order to be able to capture the changing market situations, basically. So I think for us, maybe just to summarize it, it has been business as usual. But of course, in terms of the business itself, it has been largely impacted by the pandemic, right? So that is for sure. We saw that. But in terms of how we work, and how we basically create and monitor our models, nothing changed here significantly, no.
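One common way to check a feature for drift is the population stability index (PSI), which compares the binned distribution of a feature at training time with its distribution in live traffic. The sketch below is a generic illustration, not Gojek's monitoring stack; the function name, the equal-width binning, and the synthetic data are all assumptions.

```python
import math


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample (expected) and a live sample (actual)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range are clamped into the edge bins.
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e_shares = bin_shares(expected)
    a_shares = bin_shares(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_shares, a_shares))


# Synthetic feature: uniform at training time, then shifted upward in production.
training = [i / 100 for i in range(1000)]
live_same = list(training)
live_shifted = [v + 3.0 for v in training]

psi_same = population_stability_index(training, live_same)
psi_shifted = population_stability_index(training, live_shifted)
print(psi_same, psi_shifted)
```

A frequently cited rule of thumb treats a PSI above roughly 0.25 as a significant shift worth investigating or retraining for, though the right threshold depends on the feature and the model.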
Adel Nehme: Yeah, and I think this is where the data maturity of Gojek comes into play. I think a lot of organizations realize this year that they really need to invest in MLOps and the ability to monitor and update models in production. And that may not have been a problem at Gojek because that's a capacity you guys already had.
Syafri Bahar: Yeah, exactly. And I have to give kudos also to the data science platform team. I think they've done a good job at providing us with all the different infrastructure that we need in order to monitor the model performance in real time, having a feedback loop, deployment technologies, and so on and so forth. So they have been amazing.
What are the characteristics of a data mature organization?
Adel Nehme: That's incredible. And I'd love to expand our conversation beyond creating high-impact teams. As we discussed earlier in the episode, Gojek is a really highly data mature organization that lives and breathes data. What do you think outside of high impact teams are the characteristics of a data mature organization?
Syafri Bahar: I think it really lies not only on the tangible things that we can see and touch, but it's also, I think, in the spirit as well, it's very important. And I think, one characterization, and this is a bit less also about data team but it's just organization as a whole. We need to live, breathe, and then we need to use the vocabularies in our day-to-day conversation, right?
Syafri Bahar: We need to really be asking about correlation versus causation, bias, p-values, what are the Bayesian drift? I think that should be a part of the day-to-day conversation in order for an organization to be labeled as a data mature organization, right? So it's really inherently in the culture. I think it's a bit more meta than just tooling and dashboarding. I think that's really one of the characterizations.
Syafri Bahar: And I think it's really important as well, I can't emphasize it more, for your leadership to set an example, right? Because everything starts with asking the right questions to the product team, to the data team. Because those questions actually will trickle down to a lot of things. And I could maybe give some examples from back in the early days, when we said, "Hey, we want to measure, for example, what was the effect of having a certain loyalty membership, right?"
Syafri Bahar: It got us thinking like, "Hey, we haven't had any infrastructure yet in order to measure these things. So we need to develop something more, right?" And then we started to explore and we saw, "Hey, maybe we could approach it using instrumental variables. Randomized encouragement, for example, right?" And then that really triggered us to build capabilities around that as well. Like, "Hey, what are some of the instrumental variables that we can generate as a company in order to help us measure the marginal impact of a certain phenomenon which cannot be handled using traditional A/B testing, right?"
Syafri Bahar: That was just, again, me trying to illustrate how important it is to ask the right questions, especially for a data mature organization, because that will reveal all the flows that we need to build from a data perspective. Yeah.
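The randomized-encouragement idea mentioned above can be sketched with the simplest instrumental-variable estimator, the Wald estimator: randomly nudge some users to join a loyalty program, then divide the nudge's effect on the outcome by its effect on membership. Everything below is simulated under stated assumptions (a constant true effect of 5.0, an unobserved "engagement" confounder, made-up probabilities); it illustrates the technique and is not Gojek's method or data.

```python
import random
from statistics import mean


def wald_estimate(z, d, y):
    """IV estimate: (effect of nudge on outcome) / (effect of nudge on uptake)."""
    y1 = mean(yi for zi, yi in zip(z, y) if zi == 1)
    y0 = mean(yi for zi, yi in zip(z, y) if zi == 0)
    d1 = mean(di for zi, di in zip(z, d) if zi == 1)
    d0 = mean(di for zi, di in zip(z, d) if zi == 0)
    return (y1 - y0) / (d1 - d0)


rng = random.Random(0)
TRUE_EFFECT = 5.0  # assumed effect of membership on spend, used only in this simulation

z, d, y = [], [], []
for _ in range(50_000):
    u = rng.random()                    # unobserved engagement (the confounder)
    zi = rng.randint(0, 1)              # randomized encouragement to join
    p_join = 0.1 + 0.4 * u + 0.3 * zi   # engaged and nudged users join more often
    di = 1 if rng.random() < p_join else 0
    yi = TRUE_EFFECT * di + 10.0 * u + rng.gauss(0, 1)
    z.append(zi)
    d.append(di)
    y.append(yi)

# Naive comparison of members vs non-members is confounded by engagement.
naive = mean(yi for di, yi in zip(d, y) if di == 1) - mean(
    yi for di, yi in zip(d, y) if di == 0
)
iv = wald_estimate(z, d, y)
print(f"naive={naive:.2f}, iv={iv:.2f}, truth={TRUE_EFFECT}")
```

Because the nudge is randomized, it is unrelated to engagement, so the ratio recovers something close to the true effect while the naive member-versus-non-member gap overstates it.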
Adel Nehme: Couldn't agree more on the importance of spirit and culture when it comes to creating a data-driven organization. I think really that's the main differentiator. You mentioned here infrastructure. And one thing that's quite impressive about Gojek is how geared the infrastructure is towards creating high-impact data science.
Adel Nehme: I've seen you speak about this in other interviews and panels and this is something heavily featured in the Gojek Medium blog. Do you mind walking us through the different technologies and infrastructure-level innovations Gojek has undertaken in order to facilitate high-impact data science?
Syafri Bahar: Yeah. So I think there are a couple of... So we can talk about it from an MLOps perspective. We can also talk about it from really the downstream data engineering stacks that we have, right? So I think one thing that I really like about Gojek is we like to develop our own solutions, actually, especially when we think that the third-party solutions cannot actually cater to our needs, right? So we've actually developed a lot of in-house systems as well.
Syafri Bahar: Which were essentially also like wrappers of some of the recent technologies. If, for example, we talk about the storage stack, we built a lot of wrappers around that. Just to give an example, we have built around 20 to 30 data engineering tools within Gojek in order to help us move data from one place to another. We want to do our different transformations, for example. Data cataloging, we also have a solution for that.
Syafri Bahar: We've also built Optimus, for example, which is a CLI for doing data transformation. So we built quite a lot of tools, right? And that's only from the data engineering perspective. So 20, 30 tools. And then when we look at the MLOps team, the data science platform has actually built quite a lot of tools for us. I can maybe tell you about two or three of them.
Syafri Bahar: For example, Merlin. So Merlin is a tool that has been used by data scientists to deploy models. So what used to take two or three weeks to deploy a model now takes like 10 minutes for us. What we need to do is just save the pickle file, the binary file. And then, so we stitch different technologies like MLflow, Kubernetes deployment, Docker, multiple stuff, actually putting them into one, creating a simple abstraction for data science.
Syafri Bahar: And then all the other things are just managed by our data science platform team. So with Merlin, for example, a data scientist just needs to save the binary file, be that from scikit-learn, PyTorch, or TensorFlow, for example, put it there, and then there's a certain API that we need to call. It will automatically upload it, deploy it in a Kubernetes cluster, and create a logging system, monitoring system, and drift monitoring as well. Like all in one single go. It's pretty nice.
Syafri Bahar: We also have, for example, Feast, which is another important product that we co-developed with Google and which is also being used by several companies now. So Feast basically allows us to decouple feature serving from model serving, by basically just creating one layer for training.
Syafri Bahar: For example, we can use the same abstraction also to serve the model. And it also provides good discoverability through feature registration, for example, and the ability, for example, to do historical serving or historical batch calculation, or online real-time serving, right? Just using one generic abstraction. So it's pretty, pretty cool. It's quite a lot of stuff that we've done. We also have Turing, of course, and many other things like Clockwork, which is another product. But I'd suggest the audience just check our blog and see the different tools that we've built.
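The idea of one abstraction serving both training and online use can be sketched with a toy in-memory store. This is not the Feast API: the class `TinyFeatureStore`, its method names, and the feature name `orders_last_7d` are hypothetical, chosen only to show historical (point-in-time) and online (latest value) retrieval sharing the same registered feature.

```python
class TinyFeatureStore:
    """Hypothetical in-memory feature store; not the Feast API."""

    def __init__(self):
        # feature name -> entity id -> list of (timestamp, value), sorted by timestamp
        self._data = {}

    def ingest(self, feature, entity_id, timestamp, value):
        rows = self._data.setdefault(feature, {}).setdefault(entity_id, [])
        rows.append((timestamp, value))
        rows.sort(key=lambda row: row[0])

    def list_features(self):
        """Registered feature names, for discoverability."""
        return sorted(self._data)

    def get_historical(self, feature, entity_id, as_of):
        """Value known at or before `as_of` (point-in-time lookup for training)."""
        rows = self._data.get(feature, {}).get(entity_id, [])
        known = [value for timestamp, value in rows if timestamp <= as_of]
        return known[-1] if known else None

    def get_online(self, feature, entity_id):
        """Most recent value (lookup at serving time)."""
        rows = self._data.get(feature, {}).get(entity_id, [])
        return rows[-1][1] if rows else None


store = TinyFeatureStore()
store.ingest("orders_last_7d", "user_1", timestamp=1, value=3)
store.ingest("orders_last_7d", "user_1", timestamp=5, value=8)

training_value = store.get_historical("orders_last_7d", "user_1", as_of=3)
serving_value = store.get_online("orders_last_7d", "user_1")
```

Because training reads the value as it was known at the time and serving reads the latest value through the same feature definition, the model sees consistently defined features in both settings.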
Adel Nehme: Yeah, I highly encourage everyone to check out the Gojek blog, which is truly a showcase of amazing proprietary technology and tools developed by the team. Are these solutions open source, Syafri?
Syafri Bahar: Yeah, I think most of them by now. Yeah. So I think especially our data engineering tools. I think it was just recently, like one or two months ago, I think we started to open source the tools that we've built in-house. Yeah.
Data Governance
Adel Nehme: Obviously, none of what we've talked about so far is possible without high levels of data cleanliness and quality, and organization-wide data governance. Do you mind describing how important data governance is for scaling data maturity?
Syafri Bahar: Yeah, for sure. And I think that data governance is often overlooked, actually. It is often taken for granted. Well, for us, it is really important because, especially if you look at the Indonesian regulatory framework, it's basically very strict. And we really also want to keep the trust of the consumers that we have. So we are actually very serious about having data governance. We even have our own data protection officer, basically. We have a data governance council that decides who should get access to which column.
Syafri Bahar: They also decide whatever NDA we need to sign before we can get access to that. Recently, we have also launched a tool called [Ocean], our data warehouse. A new data warehouse tool, basically. With Ocean, we can basically separate entities, we can really govern, we can separate between the presentation layer and the data mart layer, for example, and give very specific access to each one.
Syafri Bahar: So I think for us, what I want to say basically, it is a very important thing for us. We do have a council, we do have a governance process in terms of how we make use of the data that we have. There are, of course, various tools that we also develop in-house in order to help with that. And I think the way we work, it's always that first we discuss what we need in terms of having proper data governance and data protection, and then we build the tools based on those requirements.
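Column-level access of the kind described here, where a council decides who may see which column, can be sketched as a simple policy lookup. This is an assumption-laden illustration, not Gojek's tooling: the roles, column names, and the `visible_columns` helper are all hypothetical.

```python
# Hypothetical policy: role -> set of columns that role may read.
POLICY = {
    "analyst": {"order_id", "city", "amount"},
    "support": {"order_id", "customer_phone"},
}


def visible_columns(row, role, policy):
    """Return only the columns of `row` that `role` is allowed to see."""
    allowed = policy.get(role, set())  # unknown roles see nothing
    return {col: val for col, val in row.items() if col in allowed}


row = {
    "order_id": 42,
    "city": "Jakarta",
    "amount": 15000,
    "customer_phone": "0812-xxx",
}

analyst_view = visible_columns(row, "analyst", POLICY)
unknown_view = visible_columns(row, "intern", POLICY)
```

Defaulting unknown roles to an empty set means access has to be granted explicitly, which matches the approval-first process described in the conversation.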
Adel Nehme: And where do you think the data leadership's role stands when enabling high quality data across the organization? Can you briefly describe some of the features of Gojek's data governance program?
Syafri Bahar: It is very important, actually, because I think there are a couple of roles that the data leader plays in this whole scaling up of data governance. I think number one is really in terms of setting expectations. I think it is very important, right?
Syafri Bahar: Because without clear expectations, again, this is an area which is sometimes overlooked by people. So setting expectations is one thing. Second is also about education, having proper education on why we need it, right? Why, for example, a company will need proper data governance before it goes public.
Syafri Bahar: So I think that's the second thing. And I think the third thing is, well, there needs to be a bridge between the various stakeholders in data governance, right? Because we need to bridge multiple stakeholders, from the regulatory perspective and the consumers. Consumers are also our stakeholders in this, but also the different functions and different products within Gojek. So I think bridging is also very important, right? So those are the three things really: setting expectations, bridging, and the third one is education.
Adel Nehme: In terms of tooling, what are some of the tools that you use for data governance at Gojek?
Syafri Bahar: A lot of these capabilities we developed in-house. We create a lot of wrappers around them. And we are also creating, for example, tools to really be able to show the lineage of the data. And also to understand, for example, whether the data comes from reliable tables that we maintain on a regular basis. Because historically, from the hypergrowth legacy that we have, we had problems with data that grew organically. It was really all over the place.
Syafri Bahar: We created derived tables from derived tables, for example. So you end up creating like a forest of tables. So we are also in the process of cleaning that up. So, yeah, but I think in terms of tooling, a lot of these capabilities we tried to develop in-house, right? We might use a third-party tool, but I'm not very informed about that, to be honest.
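Tracing a derived table back through the "forest" to the tables it ultimately depends on is a graph traversal. The sketch below is a minimal illustration under assumed inputs: the table names, the `parents` mapping, and the set of trusted sources are hypothetical, not Gojek's actual lineage tooling.

```python
def upstream_sources(table, parents):
    """Return the root tables that `table` ultimately derives from.

    `parents` maps a table name to the list of tables it is built from;
    a table with no entry (or an empty list) is treated as a root source.
    """
    seen, stack, roots = set(), [table], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        deps = parents.get(current, [])
        if not deps:
            roots.add(current)
        stack.extend(deps)
    return roots


# Hypothetical lineage: derived tables built from other derived tables.
parents = {
    "weekly_report": ["orders_enriched"],
    "orders_enriched": ["orders_clean", "drivers_raw"],
    "orders_clean": ["orders_raw"],
}
trusted = {"orders_raw"}

roots = upstream_sources("weekly_report", parents)
is_reliable = roots <= trusted  # True only if every root is a trusted source
```

Here `weekly_report` would be flagged as not fully reliable, because one of its roots (`drivers_raw`) is not in the trusted set.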
Data Literacy
Adel Nehme: I'd like to pivot to discuss maybe organizational data literacy. As a data science executive, your role consists a lot of gaining executive buy-in, justifying resource allocation, and all of that fun stuff. What do you think is the proper level of data literacy executives need to have in order to be one, productive in these conversations and two, critically assess the success of data projects and initiatives?
Syafri Bahar: But I think this is something which we can't really control. I think, in general, if you look at the companies, right? Especially when you look at the different spectrum of companies, there will be different types of leadership. Leadership within those companies will also come from different backgrounds, right? So it's kind of one of the things that we can't really control. But I think it's really up to the data leadership to be able to articulate the message really well to them such that...
Syafri Bahar: Well, I really hope there are some basics, at least in data, of course. But just assuming that the basics are there, it's really up to the data leaders to cater to their audience, in terms of how they want to push certain agendas with respect to data. As an example, if I want to push for more resources for data science, what I will do, as long as the counterparts are rational, is have a fact-based conversation.
Syafri Bahar: I will bring my data and say, "Hey, this is like a three-person team, we've generated or we saved this amount of money, per user, per data scientist, actually..." So we can actually normalize it to the gains per data scientist. And then, yeah, I can actually use that as an argument to push for more resources. So that's just within the context of data science. But I think in the context of a project like data governance, for example, this is just something that we need to do, I think. This is the hygiene stuff, right? So there is really just no way of not doing that, to be honest.
Syafri Bahar: I think in order to have a productive conversation on this, I think definitely the counterpart needs to have a certain appreciation for data. And I think especially for leadership, who were basically conceived in the last century, so I would say, I think their literacy around data I think should be okay.
Adel Nehme: That's great. And you've mentioned multiple times throughout our chat, maybe data culture, data spirit, and how important that is. I'm sure it also relates to creating very strong self-service analytics capabilities within an organization. Where do you think the data team's role is in creating and forging this culture throughout the broader organization and the use of self-service analytics? And what are some of the best practices you've adopted at Gojek to sustain this, and do you have any lessons to share there?
Syafri Bahar: Yeah, that's a very good question actually. So there are a couple of things that we are trying to do with respect to self-service analytics. So number one is really to have proper and reliable data. And that's really number one. So the first thing that we did is actually, we tried to fix the basics first. And you probably remember all these different problems with thousands of tables that grew organically in the past, right?
Syafri Bahar: We tried to tidy that up first, right? Just to make sure that first of all, it is reliable and it is based on a reliable source of data, and then also being able to create the proper data marts on top of that. So that's definitely the basics that we need to take care of. And then when we go to information retrieval, there are a couple of things that we are trying to do basically.
Syafri Bahar: So number one is also encouraging our analysts or business intelligence folks to create more dynamic dashboards. So we use tools like Streamlit, for example, to create complex visualizations, such that if business people want to create a story, for example, it's just based on several clicks, and intuitive enough.
Syafri Bahar: And then, for example, buttons will appear as soon as they are interested in a certain domain, just to help them navigate all the various information that we have at Gojek. So that's another thing that we've tried to do. And another thing is that we are trying to develop another in-house tool, basically. Maybe I can't really be specific about the name at this moment and what it does, but pretty much the capability will be that we just need to define dimensions and measurements, and then filters as well. So only three things that the business needs to know.
Syafri Bahar: So they can just drag and drop things, they can just add dimensions, they can just add measurements, and then the insights that they need can just be produced nicely. But it also requires, and I'm not saying that we're 100% there, a lot of standardization. And especially given the fact that we have around 20-plus products, it's not necessarily the easiest job to do, to be able to standardize everything to achieve that state. Yeah, but those are some of the things that we're thinking about and are actively doing at this moment.
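The "dimensions, measurements, and filters" model can be sketched as a single aggregation function. This is only an illustration of the concept: the `summarize` function, the column names, and the sample rows are hypothetical, and the in-house tool itself is not described in the conversation.

```python
from collections import defaultdict


def summarize(rows, dimensions, measure, filters=None):
    """Sum `measure` grouped by `dimensions`, keeping rows that match `filters`.

    `filters` maps a column name to the value it must equal.
    """
    filters = filters or {}
    totals = defaultdict(float)
    for row in rows:
        if all(row[col] == val for col, val in filters.items()):
            key = tuple(row[d] for d in dimensions)
            totals[key] += row[measure]
    return dict(totals)


# Hypothetical standardized table shared across products.
rows = [
    {"product": "food", "city": "Jakarta", "orders": 10},
    {"product": "food", "city": "Jakarta", "orders": 5},
    {"product": "ride", "city": "Jakarta", "orders": 7},
    {"product": "ride", "city": "Bandung", "orders": 4},
]

jakarta_by_product = summarize(
    rows, dimensions=["product"], measure="orders", filters={"city": "Jakarta"}
)
```

A function like this only works if every product exposes the same column names and meanings, which is the standardization effort mentioned above.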
Adel Nehme: And on the consumer end, where does data education fit within the broader organization? How important is data literacy education within Gojek?
Syafri Bahar: Yeah, that's an interesting one, right? So because we also realize a lot of our consumers come from various backgrounds. So we are also actively pushing for data literacy. I can maybe mention one program. We have a program called [Data Hero]. So this is really a program which aims to educate data consumers basically, teaching them SQL, teaching them the basics of data.
Syafri Bahar: What a data warehouse, a data mart, a data lake are, and things like that, in order basically to help them help themselves. So, that is really crucial, I think, in terms of creating this awareness. And of course, it also helps a lot that most, if not all, of our leaders have high data literacy, right? So people are always encouraged by the top leadership, who keep asking their people about data, such that people realize how important it is to have a data-driven conversation. And it creates the urgency for them also to educate themselves around how to use data right.
Trends and Insights
Adel Nehme: That's great. And I want to cap off our conversation by discussing some trends and insights and where you view the role of data science and AI playing in the future. So Gojek creates so much impact throughout Indonesia and beyond with its technologies, how do you view advances in AI and data science further fueling value for Gojek customers? And which advances are you excited about the most?
Syafri Bahar: Yeah, I think there are a couple of things, right? So I'm most excited about the rise of causal machine learning, actually. Because a lot of the things that we do, for example if you look at a particular domain like the promo-optimization engine, are very natural to frame as causal problems. And just to give you an example, if you want to do churn prevention, it will not be very useful to only predict churn, because it will create a vicious cycle, right? You predict people who are inherently very difficult to retain anyhow.
Syafri Bahar: So what we need to be able to do is predict when they will churn and then also understand which treatments work most effectively to prevent them from churning, right? So to be able to frame it that way. And I think I'm actually very happy that, especially in recent years, we were able to reframe causal inference into machine learning such that we can actually take advantage of how good machine learning is at dealing with sparse data, like high-dimensional data. That is very crucial.
Syafri Bahar: Back in the old days, we needed to manually specify confounders, for example, but now with the recent advancements in causal machine learning, we just put the data in and are able to get the marginal impact estimation, for example. And the algorithm will basically learn which are the most probable confounders, that is, if the technique uses confounders.
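The difference between predicting churn and estimating which treatment prevents it can be shown with the simplest possible uplift estimate: the difference in retention rates between treated and untreated users within each segment. This sketch assumes the treatment (say, a voucher) was assigned at random, so it does not handle confounders the way the causal machine learning methods mentioned above do. The segment names and records are invented for illustration.

```python
from collections import defaultdict


def uplift_by_segment(records):
    """Estimate retention uplift per segment as a difference in means.

    `records` is an iterable of (segment, treated, retained) tuples.
    Assumes randomized treatment; segments missing either group are skipped.
    """
    # segment -> treated flag -> [retained count, total count]
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for segment, treated, retained in records:
        cell = counts[segment][treated]
        cell[0] += int(retained)
        cell[1] += 1

    uplift = {}
    for segment, cells in counts.items():
        (kept_t, n_t), (kept_c, n_c) = cells[True], cells[False]
        if n_t and n_c:
            uplift[segment] = kept_t / n_t - kept_c / n_c
    return uplift


# Invented data: (segment, received voucher, retained next month).
records = (
    [("at_risk", True, True)] * 3 + [("at_risk", True, False)] * 1
    + [("at_risk", False, True)] * 1 + [("at_risk", False, False)] * 3
    + [("loyal", True, True)] * 4 + [("loyal", False, True)] * 4
)

uplift = uplift_by_segment(records)
```

In this toy data the voucher raises retention for the at-risk segment but changes nothing for loyal users, which is the kind of distinction a plain churn prediction cannot make.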
Syafri Bahar: So that's number one. I think number two, also, I see that there's quite a promising future for reinforcement learning types of algorithms. I'm actually very excited about it. And if you go down a little bit under the hood, especially on the Markov decision process, I think it has very promising applications as well, especially within the context of dynamic markets. Because we want to be able to have algorithms that can learn online, basically, without us needing to download the model first, train it offline, and push it again to production. So as much as possible, we want to push toward that state where it's just basically learning by itself.
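Learning online, without a separate retraining and redeployment cycle, can be illustrated with an epsilon-greedy multi-armed bandit that updates its estimates after every observed reward. A bandit is a stateless simplification of the Markov decision process mentioned above, and the class and voucher names here are hypothetical.

```python
import random


class EpsilonGreedyBandit:
    """Minimal epsilon-greedy bandit with incremental mean updates."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.counts = {arm: 0 for arm in arms}
        self.values = {arm: 0.0 for arm in arms}  # running mean reward per arm
        self._rng = random.Random(seed)

    def select(self):
        """Explore with probability epsilon, otherwise pick the best estimate."""
        if self._rng.random() < self.epsilon:
            return self._rng.choice(list(self.counts))
        return max(self.values, key=self.values.get)

    def update(self, arm, reward):
        """Fold one observed reward into the running mean for `arm`."""
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n


# epsilon=0.0 makes selection purely greedy, so the outcome is deterministic.
bandit = EpsilonGreedyBandit(["voucher_5k", "voucher_10k"], epsilon=0.0)
for arm, reward in [("voucher_5k", 1.0), ("voucher_5k", 0.0), ("voucher_10k", 1.0)]:
    bandit.update(arm, reward)

best = bandit.select()
```

Each call to `update` adjusts the estimate in place, so the policy changes as feedback arrives rather than after an offline retraining job.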
Syafri Bahar: And if you look especially within the domain of behavioral modeling, especially within the context of marketing and the promo-optimization engine, there are quite a lot of areas where we can at least explore the potential applications of that. So that's something I'm also very excited about. Of course, there are also the recent advancements in GPT-3 and, as you mentioned, AutoML, right? I think that's also a very exciting development within the field. And especially with GPT-3, I think we've just barely scratched the surface of what can be done.
Syafri Bahar: So yeah, I think I see a world where data science solutions will be commoditized, and there will be a lot of out-of-the-box solutions, such that, for data scientists or data professionals, what cannot be replaced will really be the creative parts of the work. So I would highly encourage aspiring data scientists out there to really sharpen their problem-solving skills and creativity, and to be able to use these different tools. Because again, we might be able to estimate models with one single click, but it still requires an in-depth understanding of the mathematics and statistics in order to be able to interpret those models, those solutions, and to take decisions out of that.
Adel Nehme: Yeah, I know you're someone who's a big math fan, and I've seen you argue for why data professionals should develop deeper technical understanding of the models they're working with. How do you reconcile this worldview with increasing automation technologies for data scientists?
Syafri Bahar: And I think it has a good purpose, by the way, with all the different automations which are being created. It has the purpose of basically democratizing machine learning and AI, right? I think for certain problem areas, that makes sense. Because it will basically help, especially on the repetitive stuff. And I think I remember also what Andrew Ng said: everything that can be done within one second is now actually a prime use case for machine learning or AI, right?
Syafri Bahar: But I think, more and more, we also discover a lot of different types of problems which cannot really be solved within one second. These are very ambiguous problems around, for example, distributing vouchers, or how to develop users into a more mature state, for example. So it's a very tricky one, right? So again, what I want to say is that it has a good purpose, but it might not be able to take on all the industry problems that we face currently, especially in various domains that can generate a lot of impact. So that's number one.
Syafri Bahar: And number two, even though there will be a large degree of automation in the future, I think it is also very important to understand the mechanics underneath these automated solutions, actually, for humans to be able to make wise decisions about how to use them, right? Being able to interpret the byproducts of those estimations, for example, is also very important, instead of just applying them blindly. But again, there are areas where we can comfortably do that. But there are also other areas where you can't just apply some mechanics blindly, right? So it is really important to understand the mechanics and to really understand what they do.
Call to Action
Adel Nehme: That's awesome. Finally, Syafri, any call to action before we wrap up today?
Syafri Bahar: Yeah, I think there are a lot of... and I just want to also relate this a little bit to this advancement in the automation of machine learning. And I think some people might think that, "Hey, our jobs will be replaced, it will not be sexy anymore." But I think, on the contrary, I would say that more and more, a different breed of data scientists will actually be needed in the future; people who are able to solve problems by applying first principles and who try to combine the various available solutions, stitching them together, and being able to decide which solutions can actually solve a particular problem.
Syafri Bahar: So that's a word of encouragement, basically, to still invest yourself in the domain. Especially in some emerging countries like Indonesia, we've barely scratched the surface of what we can do in terms of impact. Especially if you look at the structural inefficiencies in some of these countries, there are huge opportunities for data professionals to really create impact over there.
Adel Nehme: Yeah, thank you so much, Syafri, for the insights. I really appreciate it.
Syafri Bahar: Yeah, you're welcome.
Adel Nehme: That's it for today's episode of DataFramed. Thanks for being with us. I really enjoyed Syafri's insights on the data science powering Gojek. If you enjoyed this podcast, make sure to leave a review on iTunes. Our next episode will be with Shameek Kundu, former group CTO at Standard Chartered and current chief strategy officer at TruEra. I hope it will be useful for you and we'll catch you next time on DataFramed.