Building a Data Platform that Drives Value with Shuang Li, Group Product Manager at Box

Adel and Shuang explore her career journey, how to build a data platform, ingestion and processing pipelines, challenges and milestones in building a data platform, data observability and quality, future trends and more.

Jun 27, 2024

Guest

Shuang Li

Host

Adel Nehme

Key Quotes

I was always joking with my team, saying, if we look back, building this data platform, it's like climbing a high mountain, right? And you can't reach the top within a day or two. And for Box's case, we didn't reach the top within a year. We spent a couple of years to get where we are today, right? Especially when considering the entire cloud migration we did in the past couple of years.

Building relationships with your stakeholders when building a data platform or during any cloud migration is super important. Everyone in their different departments acts like our partners. So we figured out, they have different use cases on the marketing side because they need to use the data to run marketing campaigns. They want to find the right target customers to run the campaign. And also on the business side, for business analytics, they want to create dashboards to track our revenue, to do the churn forecast. Those are very different use cases from, let's say, the security product team. They want to just detect anomalies in the content of the customers. Those are very different use cases. So we need to understand all these use cases across the company. Because basically the entire company, all the teams are customers, internal customers of data platform. So I believe in this data platform journey, we really enjoy the partnership with data engineering. They are part of us, and they help us broaden our relationship and the engagement of different internal customers.

Key Takeaways

Building a data platform is a complex process that should be approached iteratively, breaking down the journey into smaller, manageable milestones to ensure continuous progress and alignment with stakeholders.

When making technology choices, carefully weigh the trade-offs between building in-house and purchasing third-party solutions, considering factors like feature set, cost, long-term viability, and security requirements.

Build strong relationships between product management, data engineering, and other stakeholders to ensure that the data platform meets the diverse needs of different teams within the organization.

Links From The Show

Box

Connect with Shuang on LinkedIn

[Course] Understanding Modern Data Architecture

Transcript

Adel Nehme: Shuang, it's great to have you on the show.

Shuang Li: Nice meeting you again, Adel. I'm very excited to be on the show.

Adel Nehme: Awesome. So you are the group product manager at Box, where you lead the building of the data platform there. So maybe to set the stage, what got you into this role, and how did you become a data product manager?

Shuang Li: In my career, I have always been passionate about what's made possible by data, ML, and AI, including optimizations, insights, and predictions. So when I was doing my PhD in computer science at The Ohio State University, I was actually doing a lot of theoretical research. And then the summer the year before my graduation, I got an internship at Qualcomm.

So that was my first exposure to industry, and I was very interested, actually, in the project I was working on. So it's basically uplifting the user experience of streaming media over wireless networks by developing a machine learning based algorithm. So I got very excited about that project because I was thinking, this algorithm would be very likely running on the cell phones of millions of people.

So that's actually the first transition in my career. I decided to join Google as a software engineer instead of staying in academia. And then my second transition actually also happened at Google. So I joined Google Fiber. It's like a startup inside a big company.

Our mission was to bring high-speed Internet to households. So it was a small team, and I was actually working with people from different functions, like product managers, business development, marketing, all these different people. And I got into customer calls together with our product manager.

I was amazed about how he asked questions trying to understand the pain points of the customers and then wrote requirements document to solve the problems for our customers. So then I transitioned gradually into product management because I got very interested in this area. So that was my second transition.

So over the past few years, I've been in product management in payments, in electric vehicle charging, and also in big data, cloud, machine learning, and AI. My third transition actually happened about three years ago when I became a group product manager at Box. So I was hired to build and lead a team of product managers to build the Box data platform.

So it's like over my career journey so far, three transitions, but I've been always very interested in big data, ML and AI. So that's how I got here.

Adel Nehme: So maybe let's jump into the meat of today's discussion, Shuang. If we first focus on building a data platform, specifically the why behind it. So maybe walk us through first: what exactly is a data platform, dispel any myths about it, and why should organizations invest in building one?

Shuang Li: So when you talk about a data platform, if you ask different people, they may give you different answers. But I think what's in common is, overall, it's the high-scale data infra every company builds to solve the business problems for the company. So I hope that's brief enough, because you can have different variations depending on which company or which industry you are in.

So I can maybe share a little bit more about the story at Box. So when I first joined Box about three years ago, we didn't actually have a data platform. What we had we called data infra, spread across different teams, right? They worked in data silos. So they had their different data infra.

They had a dedicated set of people managing the data infra, managing all the data. And it was really hard to scale at that time. And also, there was no way for these different teams to share data. They worked in silos. And not to mention, okay, can you guarantee the right performance of your data infra?

And there are cost challenges there, right? It's hard to scale, and there's reliability, quality, all these problems. If we talk about the why here, think about what would happen to a company without it, right? So I think essentially, like I mentioned before, we want to have the right scale, the high-scale data infra.

Maybe I can share a few numbers so people can understand a bit more. So for Box, we are a content cloud company. We have millions of users and we manage billions of objects. So when I talk about objects, these are like files, folders, and other kinds of objects customers store in the Box content cloud, right?

And they interact with these files and folders all the time. Think about all the downloads, uploads, editing, deletions. So imagine that kind of scale, like you have all the user activity data, metadata, file information you need to manage, right? If we're working in data silos, where we just have teams managing their own data infra, it's a really hard problem we're solving.

So it's very important to build such high-scale data infra to solve the business problems for the company. So hopefully, yeah, that's something people can resonate with when they hear about this.

Adel Nehme: Yeah, a hundred percent. It's hard to imagine a company like Box, operating at such high scale of data, not needing something like a data platform, right? And then when you reflect on your experiences leading the data platform group so far at Box, you mentioned you joined three years ago.

A lot of work has happened since then. What do you think are key components of building an effective data platform? What makes a data platform successful?

Shuang Li: That's actually, I think, the core of data platform. So, I always want to keep things simple. But like you said, what are the key components here? Maybe that's where we start when we build a data platform, and then we can build on top of that. So these are like the building blocks of a data platform.

Essentially, we want to gather data in, and then make it available for different teams across the company to consume the data. And then you need to have a pipeline to gather data in. That's the data ingestion pipeline. But you need to think about all the data sources you have at a company.

It could be on your platform, like other teams. They are, let's say, storing the transactional data, but they need to get data ingested into your data platform for other teams to consume. It could also be some third-party tools that, let's say, your marketing team or customer success team are using.

So it's off-platform data. It's another type of data source. But you need to figure out for those data sources, how do you build the right ingestion pipeline to gather data into your data platform? So I think the first key component is the ingestion pipeline, in my mind. And second, you need to be able to process the data.

It could be the ETL pipeline; we work together with our data engineering team to build that. But beyond that, right, there are teams who want to use your data processing capabilities. So probably you gather data in for them, but they want to do some further processing, like application specific objects.

So they probably want to use your batch compute or stream compute capabilities, right? So, at Box, we have streaming use cases for sure, both on the infra side and also on the customer-facing side. So for example, we have a security product team. Their job is to basically detect anomalies in user activities.

Let's say, oh, your company only operates in these regions. But you found some downloads, actually hundreds of downloads within a second, from a region you have never seen before. So that's a suspicious location or suspicious activity, right? They want to detect it right away, or we say near real time. So that's when stream compute comes into the picture.

And that team will need to leverage the stream compute capability we have on data platform. So that's the second key component, data processing. So you could go batch, stream, or you could have both. And the third key component, I think people talk about that all the time. It's the core of the core, the data lake, right?

Once you ingest data and process it properly, you need to store it and manage it somewhere. And then you need to support all the use cases for your customers. Here I talk about internal customers, but of course they build products, or they build analytics for external customers, or dashboards for leadership.

So we need to choose the right tool, the right technology, and then we need to innovate on top of that. For sure, there's performance, there's cost, and of course the feature set you have in mind. So then, to summarize, the three key components I have in mind are the data ingestion pipeline, data processing capabilities, and also the data lake.
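As an illustration of the stream-compute use case described above (a burst of downloads from a region a customer has never used before), here is a minimal sketch of a sliding-window check. It is not Box's implementation: the `DownloadEvent` fields, the `BurstDetector` class, and the threshold and window values are all hypothetical, and it assumes events arrive in timestamp order.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class DownloadEvent:
    """A hypothetical download event; the field names are assumptions."""
    customer_id: str
    region: str
    timestamp: float  # seconds since epoch


class BurstDetector:
    """Flags bursts of downloads from a region a customer has not used before."""

    def __init__(self, known_regions, window_seconds=1.0, threshold=100):
        self.known_regions = known_regions  # customer_id -> set of usual regions
        self.window_seconds = window_seconds
        self.threshold = threshold
        # (customer_id, region) -> timestamps of recent downloads
        self._recent = defaultdict(deque)

    def observe(self, event):
        """Return True if this event completes a suspicious burst."""
        if event.region in self.known_regions.get(event.customer_id, set()):
            return False
        window = self._recent[(event.customer_id, event.region)]
        window.append(event.timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while window and event.timestamp - window[0] > self.window_seconds:
            window.popleft()
        return len(window) >= self.threshold
```

In a real streaming system the same windowing logic would typically be expressed with the stream processor's own windowing primitives rather than in-process state; the sketch only shows the shape of the check.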

Adel Nehme: Okay, that's really great, and there's so much to tease out here. Maybe kind of focusing on that ingestion pipeline, for an organization like Box, walk us through the complexities of trying to capture all those data sources. How do you approach it, especially where you're building out early in the data platform journey?

Shuang Li: So, I briefly mentioned all the data sources. We have on-platform data sources. So it's like, some transactional data we capture in the relational databases. But they couldn't make it available for downstream teams to do all the analytics or build products. It's very hard to do it there.

So that's one of the biggest data sources we need to figure out, right? And then there are other data sources, like, there's some metadata we store, in other places. And beyond that, there's some events data, like user activities, downloads, uploads, which we call enterprise events. That's another data source.

So I think we have at least like 10 on-platform data sources. But on the marketing and customer success or data science side, they leverage some third-party pipelines. We use SnapLogic, actually, to get the data in. And then you have to manage, like, oh, if you only build one pipeline, right?

It's not one size fits all. You need to figure out how you get data from different sources. They're in different formats. They probably have different SLA requirements. And we need to figure all these out. And I think I mentioned the scale we are operating at at Box, right? Imagine those millions of users interacting with their billions of objects every day.

Think about the scale.

Adel Nehme: Yeah, that's really awesome. And then, the second component that you mentioned is the processing pipeline, which I assume the data engineering team is managing, building these ETL pipelines. How integral is the relationship between the product management team and the data engineering team?

And how do you build out those ETL pipelines in a way that satisfies all the requirements that you have, whether for downstream users or to make sure that the right security levels are in place? Maybe walk me through the nuances of building effective ETL pipelines in such a circumstance.

Shuang Li: I'm a product manager. I have a team of product managers working on a data platform. We have engineers. We call them data platform engineers. But the third party you talk about is this data engineering team. So they are not part of us. They are not part of data platform. But we are in this partnership together.

So for Box's case, it's probably more complex than we all thought in the beginning of this conversation, because three years ago when I joined, we didn't have a decent data platform, but at the same time, the entire company was starting the migration to the cloud.

So we have been building the data platform together with doing the cloud migration. It's a great thing, right? By migrating to the cloud, you can build the right data platform, leveraging cloud native solutions, but it also adds complexity to the project or to the program we are managing. So, of course, with the data engineering team, we engaged them very early in this journey.

So we always have all the discussions together, because we always think they are part of us. We are doing this all together. So even today, after we already finished cloud migration, we are kind of brainstorming together on what's next.

At Box, it happens that data engineering is part of our go-to-market org, so they have a lot of close interaction with our marketing and customer success teams. So those are teams that are not part of product and engineering, but data engineering has this good relationship with them. So, building relationships with your stakeholders when building a data platform or in a cloud migration is super important.

So they are like our partners. They work with those marketing and customer success teams. So, of course, I was deeply involved in those conversations together with my team of product managers. So we figured out, probably they have different use cases on the marketing side because they need to use the data to run marketing campaigns.

They want to find the right target customers to run the campaign. And also on the business side, for business analytics, they want to create dashboards to track our revenue, to do the churn forecast. Those are very different use cases from, let's say, the security product team. They want to just detect anomalies in the content of the customers.

Those are very different use cases, with different SLAs and different scale. For example, when do they expect the data to arrive? Those are probably different from what other product teams are expecting. So we need to understand all these use cases across the company.

Because basically the entire Box, all the teams are customers, internal customers of data platform. So I believe in this journey we really enjoy the partnership with data engineering. So they are part of us, and they kind of help us broaden our relationship and the engagement of different internal customers.

Adel Nehme: Yeah, that's very fascinating. I think it's the first time I hear of a data engineering team being part of a go to market team. But it's pretty interesting to see how the interlock works. And maybe kind of switching gears here a bit, let's start at the very beginning, right? Like when you started building the data platform at Box, maybe what were some of the key steps and milestones in building the data platform?

You know, if you were to zoom out outside of Box, what advice would you give for those looking to build a data platform right now? Where do you start?

Shuang Li: Yeah, I was always joking with my team, like, oh, if we look back, building this data platform, it's like climbing a high mountain, and you can't reach the top within a day or two. And for Box's case, we didn't reach the top within a year. We spent a couple of years to get where we are today, especially considering the entire cloud migration we did in the past couple of years.

So like I mentioned, when I joined, we were in the very early phase of the migration to the cloud, so it's not just data platform. It's the entire company. We were doing the cloud migration, and data platform is one of the foundational teams.

So we kind of went first in this cloud journey. So after we finished, together with other platform teams, our service teams and application teams could start doing the migration, because their services have been built on top of the services of platform teams.

So, I think in the very beginning, we need to, of course, identify the key first steps for the entire company. Of course, it's about platform teams: you guys need to go first, and then other teams can build on top of you guys. But if we talk about data platform specifically, we need to identify the problem we had at that time. I think for everybody, it was very clear we were working in data silos. Every team had to allocate resources to build and maintain their own data infra, and sometimes there were data inaccuracies, and they had to figure it out by themselves. And let's say product analysts, they were expecting data for their monthly active user and weekly active user dashboards, but data didn't come through.

Where's the problem? Right? They had to figure it out by relying on their own resources. So with all these things, there was no single source of truth at that time. So that's the biggest problem. So that's why we identified a goal and aligned with all these stakeholders by talking to them constantly.

So we need to build this consolidated data platform, get all the data in one place, build scalable, reliable ingestion pipelines, right? Provide the right data processing capabilities to all these teams. So that's, I think, very important in the very beginning. I think you also mentioned advice. Do you want me to share any advice with other companies?

Adel Nehme: Yeah, I'd love if you can share advice.

Shuang Li: Okay, yeah, I think alignment is very important. That's the most important thing. So, for our case, we're basically dealing with the entire company. So there's product engineering, marketing, customer success, right? The entire go-to-market org is in this picture as well. And we have product support.

They are also using data platform. Our compliance team, they need to store the data and set a specific retention period for auditing. And then our data science and data engineering teams, product analytics, business analytics. So it's a lot of teams we're dealing with. Getting alignment, not just in the early phase, but of course, getting alignment in the early phase is super important.

But along the way, you need to constantly talk to them, because sometimes it requires realignment. Things change, right? They probably have some different use cases, or the way you thought things would work didn't. So those are all the things. But overall, it's all about alignment. I think that's the most important thing.

Adel Nehme: Yeah, and I really like the analogy of climbing a mountain that you use here because, you know, you can extend that analogy and say, okay, the mountain is very high, but it has multiple peaks along the way. And you want to arrive at the first peak, second peak, third peak. So, when we're talking here about the different peaks, how do you chunk up the massive journey of building a data platform into these small, iterative goals that are achievable in the short term? How do you define those over time?

Shuang Li: Yeah, not easy, for sure. It's a company-wide program, right? Everybody is deeply involved in this. But then, it's overwhelming in the beginning. For Box's case, we were not on the cloud. Not many people had cloud experience, not to mention the best practices in the cloud.

If we are like, oh, let's just adopt everything, all the cloud services, all at once, nobody could do that. It's really hard. So we were trying to make things simple; we were trying to build in an iterative way. So we worked very closely with our partners, the architect group. So at Box we have an architect leadership group.

There's all the principal and distinguished architects in this group. So we work very closely with them along the journey. And even today, right, after we finished, we have all the architectural discussions with them. So we identified two groups of use cases overall. Think about all those different use cases and stakeholders, but we just divide them into two groups.

One is about uplift. Group two is about lift and shift. When we talk about uplift, think about our end goal: we need to build the right architecture to handle high-scale data processing in the cloud, right? But lift and shift is just, we want to optimize for delivery in time while meeting the needs.

And probably we'll come back later and see what we can do. There would be some technical debt, but that's what we need to do in order to meet the timeline of the cloud migration or some other goals set by the company. So I think trying to make things simple is really important here in this journey.

Adel Nehme: Yeah, I couldn't agree more. And you mentioned something here, that if you want to adopt all of the cloud tools at the same time, that's not necessarily possible. Maybe when it comes to technology choices, such as databases, frameworks, which cloud provider to choose, etc. How do you make those decisions?

How do you approach these trade offs?

Shuang Li: So at Box, we always have this debate when we talk about different choices, tools, technologies, or services: build versus buy, right? You need to answer this question every time. So we would look at, of course, feature set. These are the use cases we have in mind. Does this tool have all the right features for us to solve these problems?

And cost is another thing, right? If we think about buying this, there's licensing cost. How much would we spend? You need to have a more or less accurate estimate of that, right? With all the forecasts of your traffic in the next two or three years or even longer. And then if you build by yourself, it's your engineering cost.

You need to pay your engineers to build a product, right? It's not free. And then there's maintenance cost. You build your own product, let's say, probably from scratch or on some open source. But you need to maintain it all the time. It's your engineering resource, for sure.

But of course, if there are issues, it might be faster to troubleshoot, because it's your own engineering team. They know the code, right? But if it's the vendor, you have to file all the tickets. Think about the back and forth. Sometimes it doesn't work well. So I would say, if we want to go with a buy, let's say cost is under our budget, and we like the feature set provided by the vendor.

But we want to look at the long term. We have this much traffic. These are the features we want. We need to think about the long term: is this vendor in the right ecosystem? For example, this vendor has great partnerships with others, and you probably want to expand some use cases into areas or features provided by its partners.

What is the innovation speed of this vendor? You may go well beyond what you are looking for this year. And then good partnership is important, right? Not every vendor is easy to work with. You need to look at how easy it is to have this good partnership. And of course, for Box, I have one more thing to add.

It's about all the security requirements. So Box, our security office, has very strict security requirements. When we pick vendors, we need to look at all the requirements provided by our global security office. We need to make sure they check all the boxes. So that's another important thing.

So usually when we decide, oh, we probably will go with this vendor, we'll initiate our request to that global security office very early. That review could take a month or two. We don't want to be like, oh, we decided we'll use this vendor, but it ends up getting declined by our security approval team.

Adel Nehme: Great ideas here on the buy versus build. Maybe an additional component of building a data platform that we haven't touched upon yet is data observability and quality, right? Which is really integral to maintaining high trust in the data platform.

Maybe walk me through some of the best practices that you can share here when it comes to, you know, keeping data quality up, having data observability pipelines that monitor when data breaks. I'd love to learn here, how you've approached that as well. And when does that come in the journey?

Shuang Li: Well, when we were doing the migration to the cloud, we didn't invest much, to be honest, in this area. So actually, we structured our entire data platform and team as data management versus data transformation. So you can hear from the name, they are both about the foundation of data platform. Developer experience was not what we were thinking about. So basically, for data management, we are thinking about data at rest. So think about the data lake and ingestion pipeline. How do you gather data in, right? Of course, there are related capabilities; I'm just trying to keep this simple. But for data transformation, we're talking about data in motion,

right? ETL, the processing, all the capabilities and orchestration we're providing. So that's how we structured our engineering team. And of course, I have my product managers providing the coverage for both teams. But once we finished migration, we were like, oh, foundation, of course, we'll keep innovating.

We'll adopt cloud native solutions. We'll adopt best practices. But how about developer experience, like what you mentioned, right? The data quality, data observability. Take our product analysts. Let's say today at 9 a.m. Pacific time Tuesday, they're expecting all the data from the past 24 hours so that they can show the daily active user and monthly active user dashboards and present to leadership. But the data didn't arrive, and nobody told them. And they found out maybe one or two hours later, and they had to come to data platform. Data platform is like, oh, we build and manage the pipeline. Let's check with the data source. And then we talk to another team, which keeps the transactional data.

That's a very hard process, or no process at all. So we realized the problem, and that's just one of the examples. On the developer experience side, there are many other things, like how to discover data easily, and can we provide kind of a playground for teams to have a production-like environment to play around with our capabilities on data platform before they go to production? So all these things fall into developer experience. And then last year we decided, let's restructure the team here. Instead of having data management versus data transformation, both of which are in foundation,

we have a data platform foundation team versus a data platform developer experience team. So then it's easier to prioritize, because every time there's a developer experience request, that goes to the developer experience team. The foundation team will execute against a separate roadmap, and then we can prioritize accordingly.

So that's actually, you know, structure goes first. We wanted to have data observability, everything, on the roadmap. But they probably got deprioritized because we didn't have a dedicated team investing in this. So actually, starting from last year, and including this year as well,

we have been investing very heavily in developer experience. So I think for data observability, I briefly touched upon data freshness, but I don't know, do you want me to expand more on what we have built?
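The data freshness scenario described above (analysts expecting the last 24 hours of data by a deadline, with nobody telling them it had not arrived) can be sketched as a simple check that compares a table's most recent load time against an allowed lag. This is only an illustration under assumptions: the function names, the table name, and the lag threshold are hypothetical, and the episode does not describe how Box actually implements freshness monitoring.

```python
from datetime import datetime, timedelta, timezone


def check_freshness(latest_loaded_at, now, max_lag):
    """Return (is_fresh, lag) for a table given its most recent load time."""
    lag = now - latest_loaded_at
    return lag <= max_lag, lag


def freshness_alert(table_name, latest_loaded_at, now, max_lag):
    """Return an alert message if the table is stale, otherwise None."""
    is_fresh, lag = check_freshness(latest_loaded_at, now, max_lag)
    if is_fresh:
        return None
    hours = lag.total_seconds() / 3600
    return f"{table_name} is stale: last load was {hours:.1f} hours ago"


# Hypothetical usage: run the check shortly before the dashboard deadline.
deadline = datetime(2024, 6, 25, 16, 0, tzinfo=timezone.utc)
message = freshness_alert(
    "daily_active_users",                  # hypothetical table name
    deadline - timedelta(hours=30),        # last successful load
    deadline,
    max_lag=timedelta(hours=24),
)
```

Running such a check ahead of the deadline, and routing the message to both the consuming team and the pipeline owner, is one way to avoid the situation where analysts discover the gap hours later.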

Adel Nehme: Yeah, I'd love to learn more, but what I actually would like to ask is, what do you think are the key components of a healthy developer experience for a data platform? Because freshness is a part of it, and data observability is a part of it.

But maybe expand on, what makes a great developer experience for a data platform as well.

Shuang Li: That actually comes back to the North Star for building a data platform, like what metrics are you measuring? Because for platform teams like data platform, you're not directly delivering customer-facing products, where you can use revenue to measure how good or how bad this is.

But then for platform teams, we have actually teams across the company using our capabilities, then that's your customer and you can measure. So we use time to value or time to market or time to production. So we always use these three interchangeably. That's the metric we're measuring. So let's talk about time to value.

What does this mean, right? Use that security product team as an example again. They want to build a, let's say, near real time anomaly detection product for our customers, let's say they haven't used data platform at all. They need to onboard to data platform. That's the first step, how easy is the onboarding process?

To be honest, when we first started this journey, it was extremely hard. It would take them a quarter, like three months, to just onboard to data platform. So they become a tenant of data platform. They can start using the capabilities. After they onboard to data platform, how can we make it easier for them to discover all the different capabilities of data platform,

we should have the right documentation for them, right? Playbook, tutorials, office hours, but those are maybe some artifacts together with process. So they start experimenting with those capabilities, can we have the right environment for them to play with them end to end, once they are ready to go from death.

To production, how do we have this? We call shadow environment to provide production like traffic. So you have the volume production volume and production level diversity for them to try. And then they have this confidence they move to production. But you know, along this journey, there are different branches, right?

For example, at a certain point in time, they want to explore data. Is it easy for them to discover the data, the data sets, the data tables, even the columns, right? And then data observability also comes into the picture. They run their jobs. Things happen, right? How can they troubleshoot? Or you can build some monitoring for them.

They get alerts right away, and then even better, you can tell them, oh, this is the issue, or I already auto-recovered for you. So that's the next level for sure.

But along this journey, for a security product team that wants to build a new feature, from onboarding, through all the exploration and experimentation, and then to production, where our customers can use it, how long does it take? That's our North Star. And our goal is to shorten this time. So everything we are doing is to make this time shorter.
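
A minimal sketch of how a time-to-value metric like the one described above could be tracked, assuming each internal team's journey is recorded as an onboarding start date and a production launch date. The class, field, and function names are hypothetical illustrations, not Box's implementation.

```python
from dataclasses import dataclass
from datetime import date
from statistics import median


@dataclass
class TenantJourney:
    """Hypothetical record of one internal team's path onto the platform."""
    team: str
    onboarding_started: date
    production_launch: date


def time_to_value_days(journey: TenantJourney) -> int:
    """Days from the start of onboarding until the feature is in production."""
    return (journey.production_launch - journey.onboarding_started).days


def median_time_to_value(journeys: list[TenantJourney]) -> float:
    """Median time to value across teams; the number the platform tries to shrink."""
    return median(time_to_value_days(j) for j in journeys)


# Made-up example dates, for illustration only.
journeys = [
    TenantJourney("security-product", date(2024, 1, 1), date(2024, 4, 1)),
    TenantJourney("product-analytics", date(2024, 2, 1), date(2024, 3, 2)),
]
```

Tracking the median rather than a single team's number keeps one unusually slow or fast onboarding from dominating the trend.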

Adel Nehme: What's really wonderful about what you're saying is that it resonates a lot with me because, you know, we also have our own data platform at DataCamp, right? And one thing that is really magical when the data platform works is just how democratized data can be for the wider organization. I'm pretty data fluent, but I wouldn't call myself a data scientist, but I do know where the data I need is, I have access to it, and in a semi-production environment I'm able to experiment with it.

That's maybe one of the key aspects of making data accessible to non-technical users. When it comes to data democratization, I'd love to learn here, Shuang, what are the nuances that anyone working on a data platform should be aware of?

Shuang Li: Yeah, I think I briefly talked about that, like the restructuring of data platform. Now we have a dedicated data platform developer experience team. That team is dedicated to features like this, for example data discovery. And of course that's for both technical and non-technical users. Take our product analysts and business analysts, for example. They fall into the non-technical user category.

So they get a request from product teams, sometimes from leadership, like figuring out the daily active users, or the feature usage or product usage of a newly launched feature or product. So I was talking to our manager of product analytics last year. I asked him, how much time do you guys spend trying to figure this out?

He told me they would spend four weeks on average figuring out where the data is. Because the way it works today is, let's say a software engineer from that particular team launches a new feature, right? They just write everything in a BigQuery table. No comments, no annotations, nobody knows which table it is except this person, himself or herself.

So there's no good documentation. And then those product analysts get a request. They have to search all the Confluence pages, maybe the Box documents. No luck, right? In most cases. So they have to try out different data sets and figure out which one is the right data field they need to explore.

That's a big pain point. And that's why we decided we have to invest in this area for data discovery. So now we're pushing the teams, like the owners of the data, the owners of the tables, to tag their tables. They can add descriptions. They can tag columns. For example, this column is specifically for this new feature.

And it's about usage. Something like that. And then we can have metadata management built into data platform, and we are actually leveraging a data catalog today for data discovery, data observability, data lineage, and data classification as well. So I believe a data catalog is a very good tool for companies who want to invest, or invest more, in their data platform developer experience.
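
As a rough illustration of the tagging and discovery idea above, here is a minimal in-memory sketch. It assumes table owners attach descriptions and tags to tables and columns, and that analysts then search that metadata by keyword. Every class, function, and table name here is hypothetical; it does not represent any particular data catalog product or Box's setup.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnMeta:
    name: str
    description: str = ""
    tags: set[str] = field(default_factory=set)


@dataclass
class TableMeta:
    name: str
    owner: str
    description: str = ""
    tags: set[str] = field(default_factory=set)
    columns: list[ColumnMeta] = field(default_factory=list)


def _matches(description: str, tags: set[str], keyword: str) -> bool:
    """Case-insensitive keyword match against a description or any tag."""
    return keyword in description.lower() or any(keyword in t.lower() for t in tags)


def search(catalog: list[TableMeta], keyword: str) -> list[str]:
    """Return names of tables ("table") and columns ("table.column") that match."""
    kw = keyword.lower()
    hits = []
    for table in catalog:
        if _matches(table.description, table.tags, kw):
            hits.append(table.name)
        for col in table.columns:
            if _matches(col.description, col.tags, kw):
                hits.append(f"{table.name}.{col.name}")
    return hits


def undocumented(catalog: list[TableMeta]) -> list[str]:
    """Tables with no description and no tags: the ones nobody else can find."""
    return [t.name for t in catalog if not t.description and not t.tags]


# Made-up catalog entries, for illustration only.
catalog = [
    TableMeta(name="analytics.tmp_events_v2", owner="feature-team"),
    TableMeta(
        name="analytics.new_feature_events",
        owner="feature-team",
        description="Events emitted by the newly launched feature",
        tags={"feature-usage"},
        columns=[
            ColumnMeta("user_id", "User who triggered the event", {"daily-active-users"}),
            ColumnMeta("event_ts"),
        ],
    ),
]
```

A report like `undocumented(catalog)` is one simple way to see which owners still need to be nudged to tag their tables.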

Adel Nehme: Shuang, as we close up, I'd love to discuss some of the challenges that you've encountered along the way that you think are really common to building a data platform. What would you say are the top challenges folks may encounter when building a data platform?

Shuang Li: A lot of challenges, right? Again, trying to keep things simple, maybe I can talk about the top three. So first, and I think I briefly mentioned this one, when we built this data platform while doing the cloud migration, almost everybody was new to the cloud. And the cloud world is very different from on-prem.

And everybody was doing this and learning the best practices from the industry at the same time, right? So we made mistakes, but we moved forward. But it was a big challenge we were tackling back then. So that's the first one.

Second one. This data platform we were building was part of the broader cloud migration project, and that was actually the biggest project ever at the company. So it was overwhelming for everybody. But we were able to break this overwhelming work into milestones and get alignment across the company, across different stakeholders. So that's the second challenge, and I'm also sharing how we tackled these challenges, because otherwise we wouldn't have gotten here.

And the third is about team morale. Many engineers want to build new things. So there's a balance here between new feature development, or what we call uplift, right, versus lift and shift. So we need to tell the right story to them, like, oh, we're probably doing lift and shift for some of the components for now, but we'll come back.

And actually, in our data platform foundation team, that's exactly what we are doing now. We're revisiting the lift and shift work we did during the cloud migration. So these are probably the three challenges I want to share.

Adel Nehme: Yeah, and maybe since you mention cloud migration, right, I think a big trade-off that organizations face here is the trade-off between being cost effective and scaling. How do you approach that as a function? As you're growing the data platform, how do you best approach being cost effective while you're growing the amount of compute and the amount of resources that you're using?

Shuang Li: It's a hard problem to solve. Again, it's like building a data platform, so for Box, it's very specific. We're a SaaS company, software as a service. So we talk about this rule of 40. It's a very important metric for SaaS companies like Box for business health.

So, rule of 40 means your revenue growth rate together with your profit margin should add up to at least 40 percent, and then you're on the right path to sustainable growth. For the data platform team, we contribute to the profit margin part. That's the cost, and that's how it's translated to this business metric.

So we do quarterly, and also monthly, cost forecasts. In these forecasts, we need to take into account organic growth of the traffic and also the new use cases. We do budget planning based on that, but of course, you may go over budget sometimes.

And then we need to think about it: shall we pay the licensing fee this year? Or can we build a working solution this year and then probably re-evaluate next year when we have more budget? Because it's always a trade-off you need to think about in order to make the right decision.

Sometimes paying for a vendor could reduce your overall cost. To show you an example, we're uplifting our logging pipeline, because the logging pipeline is also under data platform at Box. So we could pay a vendor. The vendor could do the log aggregation for us.

So in that way, we can reduce the volume, the ingestion rate, to another vendor, who is our logging vendor, right? So we're paying for the first vendor, but it helps us reduce the ingestion, and then we pay less for the second vendor. So in that way, we play around, right? We can still get this vendor in, but at the same time reduce the overall cost for the company, while also considering the scale we are operating at and the growth.
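
The rule of 40 and the two-vendor logging trade-off described above both come down to simple arithmetic. The sketch below is illustrative only: the function names, prices, volumes, and percentages are made-up assumptions, not Box's figures.

```python
def rule_of_40_score(revenue_growth_pct: float, profit_margin_pct: float) -> float:
    """Rule of 40: revenue growth rate plus profit margin, both in percent."""
    return revenue_growth_pct + profit_margin_pct


def passes_rule_of_40(revenue_growth_pct: float, profit_margin_pct: float) -> bool:
    """True when growth plus margin reaches at least 40 percent."""
    return rule_of_40_score(revenue_growth_pct, profit_margin_pct) >= 40.0


def logging_cost(ingest_gb: float, price_per_gb: float,
                 kept_fraction: float = 1.0, aggregator_fee: float = 0.0) -> float:
    """Total logging cost when an upstream aggregation vendor (hypothetical)
    forwards only `kept_fraction` of the raw volume to the logging vendor."""
    return ingest_gb * kept_fraction * price_per_gb + aggregator_fee


# Hypothetical numbers: 25% growth and an 18% margin give a score of 43.
score = rule_of_40_score(25.0, 18.0)

# Hypothetical numbers: paying a 500 aggregation fee to halve ingestion.
direct = logging_cost(ingest_gb=1000, price_per_gb=2.0)
with_aggregation = logging_cost(ingest_gb=1000, price_per_gb=2.0,
                                kept_fraction=0.5, aggregator_fee=500.0)
```

With these assumed inputs, adding the first vendor lowers the total bill because the savings at the second vendor outweigh the new fee. With a smaller reduction or a higher fee, the comparison could go the other way.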

Adel Nehme: Something else, a challenge that you touched upon, is the story that you have to discuss with engineers when it comes to, let's say, maintenance or foundational work versus innovation work, right? Maybe walk us through in a bit more depth: how do you tell that story so that people are excited by foundational work that may not be, you know, the sexiest thing to add on a resume, but is equally important for the company's bottom line?

Shuang Li: So at Box, we have two sets of metrics across the company. One set, which of course every team is adopting, is the business metrics, and we call them ladder-up metrics. Actually our CTO, Ben Kus, came up with this one. So let me briefly talk about what this is. There are, to make it simple, three levels.

The top level is the company level metric. So we're looking for profitable growth. That's very simple, right? Everything you contribute to that one, you're moving the needle for the company. And the second level is at the product and engineering level. But of course, for our go to market or other orgs, they have their own metrics.

But for product and engineering, we have four metrics we're tracking. I'm not going to share all of them with you. But you know, I can give you an example and show you how we ladder up across the three levels. So this is the second level. Then the third level, or what we call the bottom level, is the one most relevant to everybody: every engineer, every product manager in this org, right?

So it's our own team. For example, let's say for data platform, we're introducing streaming capability on data platform. So making it work, that's our team-level metric, and the engineers working on that can understand it: oh, streaming.

But then how does streaming ladder up to the product and engineering level metric? We have this Box metric called "enable new use cases". By introducing streaming, you can build near-real-time anomaly detection or some other use cases, and then that "enable new use cases" metric at the product and engineering level will ladder up to the company level for sure: profitable growth.

So for every engineer, we have a metric, right, that their work maps to, and gradually it ladders up to the top level. So in the eyes of engineering, that's how I can convince them: what you are doing is making an impact on the company, we're moving the needle here.

And then at the same time, I think I mentioned, at the data platform level we have our own platform metric, time to value. So by introducing this, we're reducing the onboarding time for our internal customers. We're doing a great job. So these are the two sets of metrics we always talk about with our team and with the entire company.

Adel Nehme: What do you see as next for the data platform at Box? And maybe walk us through some broader data engineering or data platform trends that you see happening this year.

Shuang Li: Developer experience, for sure. We'll keep investing, and we're also building some tooling and frameworks to make it even easier for teams to interact with the data platform for insights or for innovation at scale. We have a big group of insights products across Box, but on those teams, not everybody is a big data expert.

So they have to do the aggregation. They have to use stream compute and batch compute, then store the data somewhere, and then make it available for query. So we built this framework to help them aggregate their data, and then they can build business logic on top of that.

So that's one example I want to share: build some frameworks, maybe at the data platform level, put them in a common place, and the other teams can just plug in and use them very easily. And then the log pipeline, I mentioned that. So we're uplifting our log pipeline. Logs are not a super shiny topic people are talking about, but they're very important for the company, for the developers, for troubleshooting. And I think some companies even draw insights from their logs for their businesses, and the compliance team uses the log pipeline for auditing.

So we're thinking about a tiered log pipeline where, if you need real-time logs, that probably goes through one pipeline. It could be more expensive, but that's the price we need to pay. But if you only need to run some analytics, this could be a separate pipeline. And then there's the cold storage use case. You just store for compliance, you retain the data for one year, but you don't even do a lot of analytics.

So that's cold storage we can put in there. That's the simple idea behind building the tiered log pipeline for the company. And of course, AI, right? I put that one last because everybody's talking about AI these days. So basically, how AI can help data platform users do their jobs more easily.

I mentioned data observability, right? AI can do that for sure, right? Detect anomalies in your pipeline: if there's data loss or late data arrival, AI can help you figure this out. And beyond that, I mentioned that for data discovery, we're building a data catalog to do the metadata management, but it can be even better, right?

I think, you know, these days if you go to BigQuery you can just ask a natural language question directly of your dataset, so that's something we can leverage for sure. So these are basically the trends I have seen and want to share with the audience.
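
The tiered log pipeline idea mentioned above can be sketched as a simple routing rule. This is a minimal illustration under stated assumptions: the tier names, the `LogStream` fields, and the priority order (real time first, then analytics, otherwise cold storage) are hypothetical, not a description of what Box built.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    REALTIME = "realtime"          # most expensive, lowest latency
    ANALYTICS = "analytics"        # queried in batch, higher latency is acceptable
    COLD_STORAGE = "cold_storage"  # retained for compliance, rarely read


@dataclass
class LogStream:
    """Hypothetical description of what one log producer actually needs."""
    name: str
    needs_realtime: bool
    needs_analytics: bool
    retention_days: int


def choose_tier(stream: LogStream) -> Tier:
    """Route each stream to the cheapest tier that still meets its needs."""
    if stream.needs_realtime:
        return Tier.REALTIME
    if stream.needs_analytics:
        return Tier.ANALYTICS
    return Tier.COLD_STORAGE


# Made-up streams, for illustration only.
streams = [
    LogStream("anomaly-detection", needs_realtime=True, needs_analytics=True, retention_days=30),
    LogStream("usage-reports", needs_realtime=False, needs_analytics=True, retention_days=90),
    LogStream("audit-trail", needs_realtime=False, needs_analytics=False, retention_days=365),
]
routing = {s.name: choose_tier(s) for s in streams}
```

The design choice here is that only streams that genuinely need low latency pay for the most expensive path. Everything else falls through to a cheaper tier.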

Adel Nehme: That is awesome. Now as we wrap up, Shuang, do you have any final closing words to share with the audience?

Shuang Li: I think data, ML, AI, these areas are evolving so fast that the knowledge you have built over the years could become outdated, and there's new technology coming up. So keep learning, learn from others, and learn from your experience. And I'm very much looking forward to what's next for data, for ML, and for AI.

Adel Nehme: I couldn't agree more, and that's a great way to end today's podcast. Thank you so much, Shuang, for coming on DataFramed.
