Richie helps individuals and organizations get better at using data and AI. He's been a data scientist since before it was called data science, and has written two books and created many DataCamp courses on the subject. He is a host of the DataFramed podcast, and runs DataCamp's webinar program.
So to me, really ownership mentality means somebody who takes, you know, treats this company as their own and who really doesn't look at it as, you know, let's say a nine to five or nine to six. They look at it as, you know, how can I really make an impact to the end user, and the end user in this case could be a member, like actually a member on Thrive Market, or it could be a stakeholder. And if you have that mentality, then you will not look at solving problems in a silo, you will not look at solving problems in a limited fashion, you will look at it holistically.
Focus on three things to improve your organization's data engineering: the most important thing is to understand your stakeholders and spot common patterns. What is their north star? What are they really trying to achieve? So you can look at the problem holistically. The second big thing is self-service. That's the only way to scale up and that will allow you to be in a place where you could be chilling at the beach in Florida and the stakeholders have self-service access to that data. And even if there are issues that are raised, issues that you've got through your alerts and monitoring, they can be self-corrected by the engine itself. Now we have some ways to get there, but that's the goal, right? Have a very clean, easy and accessible way for stakeholders to access their data and trust their data. The third piece of advice, I would say, is embrace managed services. You have tools out there, and we're in a very fortunate time right now, where you don't have to reinvent the wheel, you don't have to start from scratch. Your goal should be, how can I add maximum business value and how do I get myself out of this business of maintaining the infrastructure?
Leverage AI for problem-solving: Utilize AI to proactively identify patterns and derive insights from common errors. Explore smarter retries and automated solutions to fix issues in data pipelines.
Drive self-service: Enable self-service capabilities for stakeholders to access and trust their data. Implement tools like dbt and common utilities to provide abstraction and documentation for data pipelines.
Collaborate between data engineers and analysts: Foster collaboration and encourage both groups to speak the same language. This promotes smart problem-solving and holistic understanding of data challenges.
Richie Cotton: Welcome to DataFramed. This is Richie. While data science and AI are perhaps the most photogenic data roles, because you get to solve business problems and build fun things like chatbots, actually none of your corporate data ambitions will come to fruition unless your employees get the right data, in the right form, at the right time.
Making sure that happens is the job of the data engineering team. So today we're going to have a deep dive into data engineering. As with a lot of the topics we discuss on this show, there are many devils in the details. I want to get into how data engineering teams work. And today's guest joins us from Thrive Market.
So I also want to learn about some of the retail use cases and challenges for data engineering. Mo Sabah is Thrive's Senior Vice President of Engineering and Data. He's got a particularly fascinating role because as well as being in charge of data engineering, he also runs the data analytics and data science teams.
So I really want to pick his brains on how these teams interact. Data engineering is one of those things where it's easy to do simple things on small data sets, but the real magic happens when you try to scale things up. So I also want to quiz Mo on how he scaled Thrive's data systems.
Hi Mo, welcome to the show.
Mo Sabah: Thank you, Richie.
Richie Cotton: So, I'd like to get an overview of how you make use of data at Thrive Market.
Mo Sabah: Absolutely. So I've been here at...
And the reason I ended up there was I actually built my own startup in the space of personalization. So we were solving the cold start problem for people who are new to a platform. It would actually learn from data that we have about you and recommend things to you and personalize your experience right off the bat.
That company got acquired by Honest Company, so I ended up there. And prior to that, I was in the Bay Area. So I've been living in LA for the last, close to eight and a half, nine years. Before that, I was in the Bay Area. I was running engineering and data teams at Workday. Prior to that at Facebook and I was a data scientist and engineer at Netflix and Yahoo.
So I spent a lot of time in this domain of direct to consumer, and that's a space I really enjoy, where you can actually learn from the actual interactions of the customers and learn from the customers and the members directly.
Richie Cotton: That's cool. And are there any particular goals that Thrive has around using data?
Mo Sabah: So Thrive is really a health-first membership company for conscious living. And our mission is really to make healthy living easy, affordable, and accessible to everyone.
We operate on a very environmentally sustainable operating model and also a hyper-curated catalog. So think of it as a mix of Amazon and Whole Foods. And so you get on this membership similar to what you would do on an Amazon: you pay a yearly fee to access our hyper-curated catalog, which is also very sustainable.
Now, data plays a very critical role in all of this, as you can imagine, right? Data forms the very bedrock of everything we do. And if I were to break it down, we use data for all strategic decision making. We're going to dive into it a bit more in the latter part, but I'll just say, if I look at it at a top level, we use data to make decisions, which means we have a very strong bar on data availability, data reliability, and quality.
The second area would be, we use data to personalize the user experience, which actually makes the user experience and the member experience so much better. And the third area would be, we use it for a bunch of optimizations, right? We would optimize our campaigns, we would optimize our spend, based on what we are learning from the way every dollar is spent and what it does for us.
As I said, data forms the very foundation of everything at Thrive.
Richie Cotton: That's interesting that you're using data throughout the whole business. I'm curious as to how the data engineering team fits into these data goals across Thrive.
Mo Sabah: A few things about the way we function at Thrive. We function in the form of squads, which are these cross-functional groups. And that really encourages better collaboration, better data standardization, and also better reusability across the board. So the data engineers are embedded within these squads, so they also learn about the business domain and they can make smarter decisions in terms of what is the best way to surface that data.
And that also puts them in a very good position in terms of learning about how the data is used by the rest of the company. So that's the first aspect I'll give you: how we operate on a day-to-day basis in the form of squads. And these squads could have folks from, let's say, finance, for a given squad, or they could have folks from the front end team, the back end team, operations, and so on.
The second thing I'll say on this is we have organized data engineering and data analytics into one organization. And the reason for that is very simple.
Data analysts form the majority of the user base of the data that is generated by the data engineers, right? So data engineers are really responsible for the data warehousing, for the ETL, data reliability, data availability, data quality checks, and all that stuff. But really, data analysts are the ones who are using that data in a big way.
And of course, the rest of the company too, like there's data scientists, there are applied ML engineers, and so on. But the way we have positioned it is we have built a team around data analysts and data engineers working together. And it solves multiple problems for us.
One is the data engineering team is not siloed on its own, which I've seen in the past in some other companies. That silo is something that hurts you down the road because the data engineers could be building solutions that may not be pragmatic, may not be reusable, and may be very specific to a certain domain.
Having them embedded with the data analysts encourages a good discussion between the analyst and the data engineer so they can look at the problem holistically and solve it holistically. The second big area is what I highlighted earlier on reusability. We don't want to be reinventing the wheel. So if the user access patterns exist, one of the things that I would like the data engineers to look at, as they are talking to these analysts, is to understand what are the common themes that are emerging. What are the common threads? What are the common sources of data that are accessed together? And we can use that to form these core data sets, which are wide and accessible, and can serve multiple use cases. So that, at a high level, is the way we operate in data engineering, and we have done it very successfully: we've been able to reduce tech debt, we've been able to move fast, and also build data sets that can be reused across domains.
Richie Cotton: That's really interesting. Think about the data analyst being almost a customer for the data engineers, but then you bring them closer together so they can work more efficiently. I really like this idea you mentioned of reusing patterns, of reusing data sets maybe from one project to the next.
Can you give me an example of how this reuse works? You mentioned these user access patterns; can you give an example of one of those?
Mo Sabah: Yeah, absolutely. And before I begin this one, one thing to summarize what I talked about, which is going to be a premise for this answer, is I see stakeholders and data analysts as partners with the data engineers, and data engineers are not ticket takers. I think that changes the perspective.
So data engineers are actually problem solvers and they sit at the table with the analysts and with the stakeholders. So having said that, there's a few examples that come to my mind. I'll give you one example: one of the bigger projects that we undertook over the last year was migrating our data warehouse and our visualization engine.
So we were on Redshift and Tableau as a visualization engine. Redshift obviously was our data warehouse. We migrated from there to Snowflake as a data warehouse and Domo as our visualization engine. Now, it was not a straight-up migration. What we ended up doing is, we had, let's say, 400-plus data sources that we used in the old world, and we were able to condense that to one tenth.
And that is the reusability that I would like to encourage. Now, that is still higher than where I would expect it to be, right? So the next phase would be how do we go to, maybe, one fourth of that number.
And that is the reusability that I'm talking about. What that does is you actually have a single unified definition of metrics. The core data sets stay the same. Now the stakeholder, whether it is somebody from finance or somebody from ops or somebody from marketing, can still access the base core data set, which is certified, and they can still slice and dice it in the way they want, but the underlying definitions, the underlying tables, and the underlying data will not change, right?
And that to me is the most important win coming out of this reusability, or this notion of having data engineers as partners. Because in the absence of that, what we had in the past, and I've seen it in other places as well, was marketing would have a different definition of a metric.
Finance would have a different definition, and we couldn't agree on the numbers, right? The numbers told a different story. Now, a reusability pattern, let's call it a reusability access pattern just for this discussion, allows you to have a single definition.
And an offshoot of that is, we have also built good documentation in the form of data dictionaries to improve self-service access. That's an example that comes to my mind immediately, where reusability is a big win, not just for the business, but for the analysts and for the data engineers as well.
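As an illustration of the single-definition idea Mo describes, here is a minimal Python sketch. The rows, field names, and the `average_order_value` metric are assumptions made up for the example, not Thrive Market's actual data model: the point is only that every team slices the same certified data set with the same metric function, so the numbers cannot drift apart.

```python
from collections import defaultdict

# Hypothetical rows from a certified core "orders" data set.
CORE_ORDERS = [
    {"order_id": 1, "channel": "web", "region": "west", "net_revenue": 40.0},
    {"order_id": 2, "channel": "app", "region": "west", "net_revenue": 60.0},
    {"order_id": 3, "channel": "web", "region": "east", "net_revenue": 50.0},
]


def average_order_value(rows):
    """The one shared definition: total net revenue divided by order count."""
    rows = list(rows)
    if not rows:
        return 0.0
    return sum(row["net_revenue"] for row in rows) / len(rows)


def sliced(rows, dimension, metric=average_order_value):
    """Slice the core data set by any dimension, reusing the same metric."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[dimension]].append(row)
    return {key: metric(group) for key, group in groups.items()}


by_channel = sliced(CORE_ORDERS, "channel")  # e.g. a marketing view
by_region = sliced(CORE_ORDERS, "region")    # e.g. a finance view
```

Both views are computed from the same rows with the same function, so a disagreement between teams can only come from the slice they chose, not from the metric definition.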
Richie Cotton: That does sound tremendously important. I think that's something that often isn't obvious from a business point of view, that when analysts are reinventing metrics for every project in a slightly different way, that leads to, like, chaos, and you're never quite sure what the right number is.
So yeah, that reusability does sound tremendously important. You mentioned that you've just been going through a migration from, I think you said, Redshift and Tableau to Snowflake and Domo. I know that migrating these technology stacks is a complete nightmare for everyone involved, so I'm just curious as to your motivation, about what made you decide to switch.
Mo Sabah: It is a nightmare, I would say, if you don't plan it well. It is going to be challenging. The big motivation for us was, I would say there's a few of them. The first one I already highlighted, which is we had too many data sources, too many reports in the old world. Now, we wanted to get to a world which is a lot cleaner, a lot more usable, a lot more flexible.
Because you can have wide data sets and a smaller number of data sets. So removing tech debt, I would just classify this as tech debt, was one of the biggest motivators. The second was really you don't want to be in the business of maintaining or upgrading or keeping infrastructure up to date. You want to be in the business of running the business, which is, I want to be able to get my stakeholders the data that they need for strategic decision making and be a partner to them and move much quicker than having to maintain and upgrade these servers, right?
That was the second major reason, and that is one of the themes that I've seen emerge in the industry over the last, I would say, 5 to 8 years, and it's just growing, which is a great thing: the use of managed services. When you have AWS, or an Azure and so on, or GCP for Google, you don't really want to be in the business of maintaining your own servers.
Similarly, if you have a cloud based service, why would you maintain your own infrastructure? So Domo was an example of that. We looked at other vendors as well. They're also cloud based. Domo kind of checked all the boxes for us. Now to answer the second part of your question, which is, yes, it was challenging.
But the way we did it is we looked at it from all angles. We knew where the North Star was. And we decided that even though the North Star is to come to a clean state, where the number of data sets is really small and we have unique data sets and reusability, we knew that we couldn't get all the way there, because at the same time, you know, we are in the middle of this migration.
So if you change too many parameters, if you change too many things in this migration, you won't know what broke or what caused something to break. So we were very methodical and we built out a project plan and we were very clear in what we were trying to do here. And the other thing that I would say, going back to the motivations, was performance. Snowflake is very extensible. It's very flexible. It auto-scales and you can add capacity as and when needed, as opposed to the old world of Redshift. And we had different kinds of stakeholders and we were able to build different, call it virtual data warehouses, and we were able to build different data marts for different stakeholders, whether it is operations, whether it is finance, whether it is marketing.
And this is another big motivation for us. So the way we executed on it was, I would say the planning was a big part of it. The second piece is we decided not to migrate them one after the other, because what would have happened is, just let's say for the sake of argument, if you migrate from Tableau to Domo and leave Redshift alone, then you are building a bridge; instead of going from Redshift to Tableau, now it is going from Redshift to Domo, which is throwaway.
Now we have to build another bridge from Snowflake to Domo. So we said, let's do it at the same time. It was a bold decision, but also we did a good amount of planning around it and we phased it out. So we are actually out of Redshift and Tableau; we only have Snowflake and Domo.
And so now phase two is beginning, which is, now that we have taken care of, I would say, 80 percent, the remaining 20 percent is what we are attacking now in the form of different projects.
Richie Cotton: That's absolutely fascinating. I'm sure that the planning must have been incredibly extensive, but I found it interesting that you wanted to shift everything at once, which must go against the sort of agile instincts that I think most data engineers have, because, as you say, you would be sort of creating redundant connections otherwise.
I'd like to talk a bit about who's involved in your data engineering team. What sort of roles do you have within your team?
Mo Sabah: First up, we actually clubbed the data engineering team and the data analyst team together. So there are three separate roles, and I will explain them, within this org. There are very thin lines between them, so they are not hard and fast, right? And I talked about why we decided to merge those two teams, to encourage really good collaboration.
So there are three kinds of roles in this group. One is data engineers, and inside data engineers, there are two kinds of roles. One is more of the ETL engineers who are responsible for cleaning and building those pipelines, whether it is Airflow, whether it is something else. Again, these are not clear distinctions, right?
There's an overlap. The other is people who are seasoned and good in dimensional modeling and who can see the forest for the trees, that's the way I would put it. Instead of building a solution that works for a given set of use cases, they would look at it holistically, and see what are the common access patterns, what are things that go together, what are things that we can put into these core data sets. So that's the data engineering team. So I have data engineers; some are more dimensional modeling focused, and some are more ETL focused. But they are still data engineers.
And then we have data analysts, and inside data analysts there are two subgroups. One is focused on the core ecom experience. They would work in this squad model and they would look at what is happening at the top of the funnel: how are members getting in, the funnel analysis, and so on.
And also for repeat orders, people who are placing orders, the core ecom experience: cart, checkout. Then there is the other set, which I would call business data analysts, right? They are embedded inside these orgs, let's say operations or finance, and they act as a partner to these orgs.
For them, it is like an extension to their team, and for us, they are like an extension of our team. So it works very well because of this embedded approach that we have. So as you can imagine, right, there is big overlap between these three roles. And so the title that has been floating around a bit is analyst engineer.
We don't have anybody with that title, but on a day-to-day basis, I would say most of them are doing similar work because of the way we've set it up. And because of this strong collaboration, we've been able to not just execute quicker, but also build with quality.
Now, I'll give you an example, right, in that migration project that I just talked about, of why we were able to swap both the data warehouse and the visualization engine at the same time and get efficiency out of it. The reason for that was the strong connection and the strong collaboration between the data analysts and the data engineers, right?
So we were building validation layers or validation suites at different steps of the process, starting with the data lake at the very bottom, then the data warehouse and the data marts. So between all these three, we were able to rectify and we were able to look at them side by side.
What happens in the old world? What happens in the new world? And if the definitions have changed, can we explain it? And we were able to bring the error rate or the divergence down to less than 0.5 percent, or less than 0.1 percent in many cases. And that is an example of a strong collaborative team. So I'm of the strong opinion that you have to get the analysts and the engineers working together because that is where you achieve great things in terms of reusability, in terms of a single unified definition, and in terms of building things that are very performant at the bottom, but also very flexible at the top.
And what I mean by that is the visualization engine should be simple. You don't want to have 10-page SQL queries, because that is hard to debug; for you and me, it's even hard to understand what is going on. Whereas if you break it up, and this is happening because of the strong collaboration between data engineers and data analysts, we've been able to build a lot of heavy domain knowledge and business logic at the bottom layer: the data lake and the data warehouse and the data marts.
So the visualization engine is very simple. You just do a select star or select star with some where clauses. That's it. And all the things that are below it are the ones that are highly validated and clean and certified.
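The old-world versus new-world comparison Mo mentions can be sketched roughly as follows. This is a hedged illustration, not the validation suite Thrive built: the metric names, the sample numbers, and the `compare_layers` helper are all hypothetical, and the 0.5 percent default tolerance simply mirrors the figure quoted in the conversation.

```python
def relative_divergence(old_value, new_value):
    """Relative difference between the legacy and migrated value, as a fraction."""
    if old_value == new_value:
        return 0.0
    if old_value == 0:
        return float("inf")  # any change from zero counts as unbounded divergence
    return abs(new_value - old_value) / abs(old_value)


def compare_layers(old_metrics, new_metrics, tolerance=0.005):
    """Return metrics that are missing on one side or diverge beyond the tolerance."""
    failures = {}
    for name in sorted(set(old_metrics) | set(new_metrics)):
        if name not in old_metrics or name not in new_metrics:
            failures[name] = "missing on one side"
            continue
        divergence = relative_divergence(old_metrics[name], new_metrics[name])
        if divergence > tolerance:
            failures[name] = f"diverges by {divergence:.2%}"
    return failures


# Hypothetical aggregates computed from the same layer in each stack.
legacy = {"orders": 10_000, "net_revenue": 250_000.0, "active_members": 4_000}
migrated = {"orders": 10_020, "net_revenue": 250_500.0, "active_members": 4_100}

report = compare_layers(legacy, migrated)
```

Running the same comparison at each layer (lake, warehouse, marts) narrows down where a divergence was introduced; anything left in `report` is either a bug or a definition change that someone should be able to explain.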
Richie Cotton: That's absolutely fascinating. And I do find it interesting where you've got this interaction between the data engineers and data analysts. So I'm curious: do you find that they can speak the same language about these data problems? Or are there any cases where you find they have different opinions or the terminology is different and they get confused?
How do you find they interact with each other?
Mo Sabah: If you look at the skill set of the two groups that I mentioned, they are slightly different. And that is there for a reason. As far as the language is concerned, they are beginning to speak the same language because they have been collaborating so well together. Now, at the end of the day, what this really encourages, I would say, is smart problem solving.
At the end of the day, what are we trying to achieve from this? We want the data to be reliable. We want the data to be available. We want to have quality and trust in the data, and ideally we want the data to be self-service, right? I really don't want the data team to become the bottleneck. So what this encourages is the data engineering team, because of the strong collaboration, now really understands how the analyst is thinking about it from the business standpoint, and the analyst understands what challenges the data engineers face.
So because of this, to answer your question more directly: yes, they are speaking a language that is a mix of the two, which is great because then both sides see things from the other's perspective, and that really allows us to build solutions that are more testable and more reliable. And to add another point on this, right, we are an e-commerce company, and in a standard e-commerce company you shouldn't have to reinvent the wheel every year. You shouldn't have to create new metrics on the fly. The core metrics that we use should be well understood.
Now, a question that comes up is how are the stakeholders, and analysts are really representing the stakeholders, going to be looking at it? How are they slicing and dicing it? That discussion wouldn't normally happen organically, but because of the setup that we have, it happens organically, where we solve problems together.
There's an architecture design, and both of them review it from different perspectives. And so in that way, I would say we've been able to move faster and also solve problems before they appear as problems, right?
Because we can actually see, again, going back to that analogy I was giving earlier, one of the big things in data engineering is you should be able to see the forest for the trees. You shouldn't just focus on the problem at hand, but look at the common access patterns. And this setup really allows us to do that.
And of course, in terms of lingo, yeah, they're speaking very similarly. That is where that analyst engineer is a very interesting thing, a phenomenon that I see in the industry. It is happening.
Richie Cotton: Do you want to talk a bit more about that? Like, where would the analyst engineer fit in within your team, if they exist, or within a broader data engineering scope within an organization?
Mo Sabah: Yep. So I have a few analysts in my team who have expressed interest in transitioning to be an analyst engineer. And I love that. That's a great outcome. So I definitely see that as a good transition point. And the way I would position it is somebody who is still an analyst, their skill set is still an analyst.
I mean, as an analyst, you are strong in statistics, you are strong in understanding the business stakeholders, great communication ability. And at the same time, you want to be more into the nitty-gritty of how the, let's say, dimensional model is built, or how the data sources are organized, or the ETL.
I see it as a transition. I don't see it as a final state. Similarly, on the data engineering side, this setup really allows you to see what is the end goal of my analysis, or what is the end goal of this data set that I've created. And there can be a case that I can see where a data engineer would want to do more analysis, maybe 80 percent data engineering and 20 percent analysis.
And as I said, it's a slider. It's not really a thick line between the two. Nobody has reached out yet, but I can see that happening too on the data engineering side. But data analysts for sure; a couple of them have expressed interest in moving to an analyst engineer role.
Richie Cotton: Excellent. And so I'd like to talk a bit about what's going on in retail. And I'm curious as to whether retail data is different. If there are any quirks of the industry that have an impact on what you're doing in data, both in data engineering and in data analytics.
Mo Sabah: Now, having spent time in tech companies, which are also direct to consumer, like Netflix and Facebook, for example, the data needs and the data sets are pretty similar. The problems that you're trying to solve are pretty similar. The problems are around scale, around availability, around data quality, and so on.
In the retail space, especially in the space that we are in, in e-commerce (Amazon is in the same space, for example, Instacart and so on), you would see that the problems are similar to what you would see in, let's say, an Instagram or a Facebook, but the actual metrics are different.
The stakeholders are different. To give you an example, the stakeholders that you would work with in retail, in our case as well, would be merchandising, for example, which is people who actually talk to the vendors and bring those products onto our site. You would have the marketing team, which is responsible for spreading the knowledge about Thrive Market, and there are two kinds.
One is acquisition, acquiring new customers, and one is retention: retention campaigns or outreach, or CRM. And then you would have operations, which is again not something that you would see in a tech company, but in retail you would see it, because your life doesn't end when somebody does some interaction on the website.
You also have to ship that physical order to them, similar to an Amazon. So merchandising, for example, and operations, these are different for retail. And of course there are others; finance is finance, finance is everywhere, right? Then there's member services as well, people who pick up the phone or chat.
There are things that we can do there as well. Now, having looked at it from that angle, if you look at the actual metrics and the data sets, what makes retail different? The first thing I would say is it's the metrics and the fact that you are looking at it holistically. As I said, it is not just an app that you play with.
It's not an app like an Instagram where you upload photos and you forget it. Here you're making a full transaction. So the most important thing is what happens from the time you onboard to me providing you an experience where you can browse the site, you can search for things, we can recommend things to you, we can personalize it right off the bat, and we can do that much better than if you're walking into a brick and mortar.
That is because we have so much data about what you did, and we can use that to create a really amazing experience for you. And then cart and checkout: you have to be able to put things in your cart. You have to be able to pay through your credit card or Apple Pay or whatever. And then those products have to be shipped to you.
The entire cycle is different in retail. And you have to have metrics that attack every single part of it. Order to cash is obviously one of the most important problems, and order to cash is not just that you have to be able to account for every single penny; the tricky part here is the metrics.
Going back to some of the metrics, they could be based on shipment date or order date, which are different. I could place an order now on the site or on the mobile app, but the order may actually be shipped, let's say, in the night, and the order may arrive maybe in a day or two days, right?
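To make the order-date versus shipment-date distinction concrete, here is a minimal Python sketch. The orders, dates, and field names are invented for the example; the only point is that the same orders produce different daily totals depending on which date the metric is keyed on, even though the grand total is identical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical orders: some ship the same day, some ship a day later.
ORDERS = [
    {"order_id": 1, "order_date": date(2024, 3, 1), "ship_date": date(2024, 3, 1), "amount": 30.0},
    {"order_id": 2, "order_date": date(2024, 3, 1), "ship_date": date(2024, 3, 2), "amount": 45.0},
    {"order_id": 3, "order_date": date(2024, 3, 2), "ship_date": date(2024, 3, 3), "amount": 25.0},
]


def daily_revenue(orders, date_field):
    """Sum order amounts per day, keyed on the chosen date field."""
    totals = defaultdict(float)
    for order in orders:
        totals[order[date_field]] += order["amount"]
    return dict(totals)


by_order_date = daily_revenue(ORDERS, "order_date")
by_ship_date = daily_revenue(ORDERS, "ship_date")
```

Here March 1 shows 75.0 on an order-date basis but only 30.0 on a shipment-date basis, which is why a metric definition needs to state its date basis explicitly.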
So depending on what you're looking at, each vertical or each domain will have its own set of metrics. So merchandising is more concerned about providing visibility into the vendor experience for all the vendors, like how is their performance, because they're paying us for these products and so on.
Marketing is concerned with every dollar that they're spending, whether it is acquisition or whether it is CRM: can we track it back to how the customers are reacting? Similarly, you know, operations will be focused on the time from when the order was placed to when the order arrived at your home and all the steps that happen in the middle. So you have different state transition diagrams for different teams, and that is what is different and so exciting about retail and e-commerce. And I'll just say, I talked about this example of brick and mortar. One of the big things with e-commerce is because you have access to what the users are doing, you can actually customize and personalize their experience right off the bat and also in every single session.
And that to me is very exciting. Retail and especially e-commerce has a nice mix of an app kind of setup, like, let's say, an Instagram or a Facebook or a Twitter, and then you have the brick-and-mortar experience of shipping things, and so you can actually cover the entire gamut of the experience.
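The state transition diagrams Mo refers to for operations can be pictured with a small Python sketch. The states and allowed transitions below are assumptions chosen for illustration; a real order lifecycle will have more states and different rules.

```python
# Hypothetical order states and the transitions allowed between them.
ALLOWED_TRANSITIONS = {
    "placed": {"paid", "cancelled"},
    "paid": {"shipped", "cancelled"},
    "shipped": {"delivered"},
    "delivered": set(),
    "cancelled": set(),
}


def is_valid_history(states):
    """Check that an order starts at 'placed' and only makes allowed transitions."""
    if not states or states[0] != "placed":
        return False
    return all(
        nxt in ALLOWED_TRANSITIONS.get(cur, set())
        for cur, nxt in zip(states, states[1:])
    )
```

For example, `is_valid_history(["placed", "paid", "shipped", "delivered"])` is true, while a history that jumps from `"placed"` straight to `"shipped"` is rejected. A check like this could feed data quality alerts when event data arrives out of order.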
Richie Cotton: So it sounds like there's quite a complex setup because there are so many different stages to an order and you've got different places you can order and things like that. How do you manage keeping everything in sync, having all that different complexity in a form that is sort of consistent and understandable?
Mo Sabah: This is where data engineering plays a huge role, because this data is collected at different places. You can have data coming from the website around user interactions. You could have data collected from shipping, for example on how the shipping vendors are doing. All of this has to go through ETL: it first has to land in a data lake, let's call it, in its raw form, and then we have to do multiple transforms to make it available in a data warehouse. From there, we have to be able to surface reports, let's say in Domo, in the form of dashboards. And then, of course, the flip side is that you also want alerts and monitoring, and ideally smarter alerts.
One of the things we have been able to do, if you look at that pipeline, is build validation layers at each step: at the data lake, at the data warehouse, and at the data mart. We've also set up alerts, and these alerts are based on data quality checks that we have put in place. If, let's say, a required column is empty, or we see a deviation of more than two and a half standard deviations from the last two weeks or last month, we can capture some of those vagaries, raise them, and flag them. And the earlier we flag it, the better. You don't want to capture things at the very end. You want to capture them at the data lake layer, or the data warehouse, or the data mart. This allows us to be very thorough in the way the data is landing and also in the way the data is used.
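The two checks mentioned here, an empty required column and a value more than two and a half standard deviations from recent history, could look roughly like the Python sketch below. The function names, the sample rows, and the 14-day history are assumptions made for illustration; the actual Thrive Market utilities are not described in the conversation.

```python
from statistics import mean, stdev


def missing_required(rows, column):
    """Return True if any row has no value for a required column."""
    return any(row.get(column) in (None, "") for row in rows)


def deviates(history, today, max_sigmas=2.5):
    """Return True if today's value is more than max_sigmas sample standard
    deviations away from the mean of the trailing history."""
    if len(history) < 2:
        return False  # not enough history to estimate the spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) > max_sigmas * sigma


# Hypothetical daily order counts for the last two weeks.
last_two_weeks = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101, 99, 102, 98, 100]
rows = [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": None}]

has_gap = missing_required(rows, "amount")
normal_day = deviates(last_two_weeks, 101)
suspicious_day = deviates(last_two_weeks, 150)
```

The same pair of functions can be called at each layer (lake, warehouse, mart) so that a bad load is flagged at the earliest layer where it shows up.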
Richie Cotton: I like the idea of capturing the problems early. The worst case is that it gets caught by the customer after you've made a mistake. So yeah, solving the problems as early as possible seems like a good idea. And you mentioned that you have these tools for validating the quality of the data at different stages.
I'm wondering, have you made use of artificial intelligence tools at all to help you with this?
Mo Sabah: We haven't used AI per se in this process yet, but we are getting there. We actually have plans in our roadmap to start using it. As for the tools we have used: in Snowflake you can set up all kinds of alerts, in Domo you can set up all kinds of alerts, we have Python utilities, and we use Airflow for scheduling a bunch of these jobs.
So we do set up alerts, which automatically go to Opsgenie, a vendor we use that is similar to PagerDuty, and also land up as an email or a Slack message in one of the Slack groups. These Slack groups are open not to data engineers alone but also to stakeholders, for when the issue is actually downstream.
As a result of these tools, we are getting smarter now. In the future, as I said, in the second part of the year, we are going to be looking at automated data quality checks, and we are looking at vendors. We use AI in a big way on personalization, on a different layer, to optimize the experience and so on.
On the data engineering side, we haven't used it yet, but that is definitely the direction we are going in.
Richie Cotton: And you mentioned you've been using tools like PagerDuty. Can you tell me how you've been using these tools and how they fit into a bigger data governance program?
Mo Sabah: Let's start with the data checks. As an example, let's say that for a given day you see that the marketing spend went up to 100,000 in a single campaign, and the limit is 1,000, just making it up as an example. Then you know that something went wrong. So one of the pipeline jobs will automatically trigger an alert in Opsgenie, which will then be acknowledged and handled by somebody.
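A limit check like the marketing-spend example can be sketched in a few lines of Python. Everything named here (`Alert`, `check_spend_limit`, `dispatch`, the campaign name) is hypothetical, and the two lists stand in for the paging and chat destinations; no real Opsgenie or Slack API is called.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Alert:
    check: str
    message: str


def check_spend_limit(campaign: str, spend: float, limit: float) -> Optional[Alert]:
    """Return an Alert when a campaign's daily spend exceeds its limit."""
    if spend > limit:
        return Alert(
            check="marketing_spend_limit",
            message=f"{campaign}: spend {spend:,.0f} exceeds limit {limit:,.0f}",
        )
    return None


def dispatch(alert: Alert, sinks: List[Callable[[Alert], None]]) -> None:
    """Send the alert to every configured destination."""
    for sink in sinks:
        sink(alert)


# Stand-ins for an on-call pager and a shared chat channel (assumptions).
paged: List[Alert] = []
posted: List[Alert] = []

alert = check_spend_limit("spring_campaign", 100_000, 1_000)
if alert is not None:
    dispatch(alert, [paged.append, posted.append])
```

Sending the same alert to both the on-call tool and a shared channel mirrors the setup described here, where stakeholders can see an issue that affects them downstream.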
Now, Opsgenie, Airflow, and Slack, of course, are part of our stack. We also have native alerts in Snowflake and Domo that allow you to capture some of these at the different layers. We have also integrated these things as part of CI/CD. We use Concourse as a CI/CD platform, so even at the time of deployment, we run some checks. And if you look at what this setup allows you to do, it allows you not just to capture things before the customers have got them, but also to be more agile and to look for problems.
So going back to that question on how we would want to use AI: one of the things I would like to see is, if you look at the common errors that we have seen over the last, let's say, 30 or 60 days, what kind of patterns can we derive from them? What are things that we can use proactively? Right now, let's say we use two and a half standard deviations. Should we get smarter? Those are things that will definitely come into play. And the other theme that I'm looking at is more abstraction for the data pipelines.
So there are tools like dbt and others that are becoming pretty standard, and we will be looking at those. We're also looking at tools like Soda, for example, which will allow you to work with the stakeholders when setting up alerts. And again, going back to the model I talked about of data engineers being a partner to the stakeholders, using tools like those will allow the stakeholders to also play a role in data quality.
And so going back to your question on data governance and data quality, that is definitely one of the big foundations for strong data engineering, and we are definitely focusing on it. The goal, I would say the north star for me, concerns what happens as the data needs of the organization increase.
We are using data everywhere in Thrive. But as the analytical needs of the business increase, you don't really want to be growing the data engineering and data analyst teams. What you want to be doing is building governance and self-service tools that can then be exposed to these stakeholders, the end users, so they can be assured that the queries they're writing, or the data they're pulling, or the metrics they are slicing and dicing are actually accurate.
That is where data governance will play a huge role. So I would say data governance, yes, it's important, but the end goal and the north star really is driving self-service access and really scaling up the teams.
At that point, you don't want to be growing the team with the size of the business. You really want to be growing the tool set and the data dictionaries, and improving data governance and certification.
Richie Cotton: Ah, that's interesting. So you're suggesting that the way to scale your data engineering is really just about making self service access as great as possible. Do you have any other tips for this on how you can improve self service access?
Mo Sabah: Self-service comes with, I would say, a few prerequisites. The first one is that data has to be available and reliable, something that you trust. That is where data quality is a given. If you don't have data quality, you cannot go to self-service.
The second thing would be data dictionaries and good documentation, and thinking as a stakeholder, which is where working in a squad model helps you. So you want to be thinking: okay, if I'm in marketing and I want to look at sales in a given cohort, how will I do that using the existing tables?
And really increasing the democratization of the data, as well as knowledge of the schemas and the tables that are used. So that to me is the second layer. Data quality is the first layer; the second layer is data dictionaries or data documentation.
The third layer would be what we just talked about, data governance. As a stakeholder, I could be building some metrics, but I want to be assured that what I'm doing does not overwrite the main definitions, and that I'm not doing anything that can compromise the quality of that metric. That is where governance comes in. And the fourth layer, I would say, is that as you do this, you should also look at cost optimization. If you had unlimited resources, yes, you wouldn't worry about cost. But in this case, efficiency and optimization play a huge role.
So, coming back to the data engineering lens, one of the most important things, which I would put even lower than the three steps of the ladder I talked about, is efficiency or optimization as a goal. As an example, if you're running a daily pipeline to generate the revenue for the last day and it takes six hours, what can I do to bring it down to maybe 20 minutes?
Should I run it more frequently? That's the perspective a data engineer needs to bring to the table as well, because it's not just cost as a dollar amount, but also cost to the business. The sooner the data is available, the better, and also the more reliable the data is. Those are the four angles, or four pieces of advice, that I would give you if you were to really scale up the data engineering world.
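The governance layer described in this answer, where stakeholders can build their own metrics without overwriting the main definitions, can be illustrated with a small Python sketch. The `MetricRegistry` class and the metric names and expressions are hypothetical; they are one possible way to enforce the rule, not a description of Thrive Market's tooling.

```python
class MetricRegistry:
    """Holds metric definitions; certified ones cannot be overwritten."""

    def __init__(self):
        self._definitions = {}
        self._certified = set()

    def certify(self, name, definition):
        """Register a governed, canonical metric definition."""
        self._definitions[name] = definition
        self._certified.add(name)

    def define(self, name, definition):
        """Register a stakeholder metric, refusing to replace a certified one."""
        if name in self._certified:
            raise PermissionError(f"'{name}' is certified and cannot be overwritten")
        self._definitions[name] = definition

    def get(self, name):
        return self._definitions[name]


registry = MetricRegistry()
registry.certify("net_revenue", "SUM(amount) - SUM(refunds)")

# A stakeholder can add a derived metric under a new name.
registry.define("net_revenue_new_members", "net_revenue for members in their first 30 days")

# Attempting to redefine the certified metric is rejected.
try:
    registry.define("net_revenue", "SUM(amount)")
    overwrite_blocked = False
except PermissionError:
    overwrite_blocked = True
```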
Richie Cotton: That's interesting that you said going from having data that's 60 minutes out of date to 20 minutes out of date can bring you a benefit. Is the end goal there that you're hoping for real-time data everywhere, or how close do you need to get to that?
Mo Sabah: Yes, that's a great question. It depends on the use case. We do have some real-time data sets that we use today. In some cases, we are fine with an hour of delay, or with 15 minutes. In some cases, we are fine with even a day. It really depends on the data set and the use cases.
I would echo the same point, right? As you get closer to real time, it is not just more efficient for the stakeholder making business decisions, in most cases, but it's also more efficient for the data engineering team and the stack, because you may only have to deal with the deltas, the incremental loads.
And if you raise the frequency and lower the interval, you're dealing with much smaller data sets, so you can actually be more performant. So going back to that other theme, efficiency is one of the main lenses that a data engineer should have in their repertoire.
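The idea of dealing only with the deltas can be sketched in Python as a watermark-based incremental load. The `updated_at` field, the in-memory `warehouse` dict, and the sample rows are assumptions for illustration; a real pipeline would read from and write to actual tables.

```python
from datetime import datetime


def incremental_load(source_rows, target, watermark):
    """Copy only rows updated after the watermark.

    Returns the new watermark and the number of rows loaded.
    """
    delta = [row for row in source_rows if row["updated_at"] > watermark]
    for row in delta:
        target[row["id"]] = row  # upsert by primary key
    if delta:
        watermark = max(row["updated_at"] for row in delta)
    return watermark, len(delta)


# Hypothetical source table and an empty target.
source = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, 9), "status": "placed"},
    {"id": 2, "updated_at": datetime(2024, 1, 1, 10), "status": "placed"},
]
warehouse = {}

# First run loads everything; the watermark moves to the latest update seen.
watermark, loaded_first = incremental_load(source, warehouse, datetime.min)

# One order changes upstream; the next run only touches that row.
source[0] = {"id": 1, "updated_at": datetime(2024, 1, 1, 11), "status": "shipped"}
watermark, loaded_second = incremental_load(source, warehouse, watermark)
```

Running the load more often shrinks each delta, which is the efficiency gain described here: shorter intervals mean smaller batches to process.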
It's one of the main things they should be focused on: how do I get efficient? And efficiency, as I said, can have different pillars. One of them is cost, the dollar cost. There are others as well: the time to make the data available, and even the quality of the data is something I would put under efficiency, because you don't want the team to be fighting fires. And this is where AI will also help, right? That's my goal in the near future. Today, once we know there is a problem, a human has to come in and rerun the pipeline. We do have retries and all that, but sometimes a retry is not going to solve it.
This is where AI comes in: how can I have smarter retries that fix the problem and then rerun, which normally requires a human? And this is not a dream; it's available, right? It's something that we can do. That is one of the directions we'll be going in: how do we self-correct?
How do we have the system self-correct and rerun, so there is no human involved?
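As a rough illustration of "fix the problem and then rerun", the Python sketch below retries a task but first applies a remediation matched to the failure. It deliberately uses a static lookup table of known fixes rather than AI, and every name in it (`run_with_remediation`, `MissingTableError`, the staging-table scenario) is hypothetical.

```python
def run_with_remediation(task, remediations, max_attempts=3):
    """Run task(); on a known failure, apply the matching fix and rerun.

    `remediations` maps an exception type to a function that repairs the cause.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            fix = remediations.get(type(exc))
            if fix is None or attempt == max_attempts:
                raise  # unknown failure or out of attempts: escalate to a human
            fix(exc)


class MissingTableError(Exception):
    """Hypothetical failure: the staging table was never created."""


state = {"table_exists": False, "runs": 0}


def load_orders():
    state["runs"] += 1
    if not state["table_exists"]:
        raise MissingTableError("staging.orders")
    return "loaded"


def create_table(exc):
    # A blind retry would fail again; the fix removes the cause first.
    state["table_exists"] = True


result = run_with_remediation(load_orders, {MissingTableError: create_table})
```

A plain retry would hit the same error on every attempt; pairing each known failure with a fix is what lets the rerun succeed without a person stepping in, while unknown failures still escalate.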
Richie Cotton: That does seem like the dream: okay, something's gone wrong with my data engineering, let's have the system fix itself and try again, and hopefully it gets it right this time. So you've mentioned this idea of efficiency and automation a few times. I'm wondering if there are any areas of data that are easier to make more efficient or easier to automate.
Mo Sabah: I think everything is easy to automate. I wouldn't draw a strong line; it really depends on the use case, I would say. I would start with the end goal in mind. If you're automating it, what does it solve? And again, this is where you have to have a holistic point of view.
And this is where having the analysts and the stakeholders working together will give you that perspective. I can give you an example. On the ops side, the operations side, we built some dashboards. But we don't just go ahead and build them. We work with the team to understand: what are the latency requirements?
What are the different views that will be needed? And what are the different ways you will slice and dice it? That to me is really the crux of it: understanding, and doing it holistically, so you don't build something that only works for a given use case. So when you talk about automation of a data set, it really boils down to the use cases that are driven from that data set.
Any data set can be automated. There have been cases, for example in marketing, where we had manual uploads, where in the past somebody was manually entering the spend for the last day on marketing. As opposed to that, now we are pulling it through APIs. I'm using this as an example, but you can do the same thing in any case, whether it is merchandising, marketing, or something else; there are APIs available. So when you talk about automation, the first thing you should do is see if there is an API available, so there is no human in the loop and no manual entry.
With manual entry you can always have fat-fingering and things like that. If you have an API in the loop, that takes care of that problem. So, most importantly, data ingestion is the first step. The second step is data quality.
Now you should have checks on: is my raw data accurate? And as I said, the further upstream you catch the problem, the better it is for everyone involved. If, let's say, that checks out, then you can look at the next layer: how am I using this data set, and what are the derived tables? Have checks at that level as well.
So that to me is the way to do it: again, think holistically from the use cases, but also drill down into each of the layers and automate all the way.
Richie Cotton: So it's just layer upon layer of checks all the way through, to make sure you've got data quality at every stage. Okay. Since there are a lot of people who are curious about getting jobs as data engineers, I'd like to know what sort of skills you look for when you are hiring.
Mo Sabah: Really, if I'm hiring data engineers, there are two kinds of skills that I look for, or maybe three, I would say. One is hard skills, one is soft skills, and one is how seasoned, how experienced you are. Let's start with the hard skills. If you look at data engineering, I talked about some of the stacks.
A high-level language like Python is very popular. SQL and Python are going to be your bread and butter; you have to use them every day. Knowledge of dimensional modeling and data modeling also helps, because you can then see the forest and look at the common access patterns. And so does experience with data warehouses.
That's the hard skills part. For soft skills, I need a very strong ownership mentality: somebody who really thinks of the workplace as their own business and asks, how can I think like a stakeholder? How can I look at the goals behind this data pull, or this data set, or this report, as opposed to just doing what was asked?
Two-way communication is very important here, as are strong communication and collaboration abilities. The third area would be strong seasoning at, let's say, tech companies or e-commerce companies and the like. When I was at Netflix, we were not looking for people fresh from school, and we are doing the same thing here. We really need people who have done a stint at some other companies: five years, eight years, ten years, depending on the level of experience. Going back to Thrive Market, we are a very fast-paced startup, and we really want ramping up to be quick. We have a good process in place, but we really want people who are experienced and have done this before. That's my advice to people who want to get into data engineering.
The third section, around seasoning, is really me putting my hiring manager hat on at Thrive Market. So those are the three things I look for. But for people getting into data engineering, the hard skills and the soft skills are what I would recommend; you need both of them.
Richie Cotton: Yeah, that does sound like the seasoning thing is going to be the hard part for people who want to break into the industry because you can't have experience until you've got experience. It's always a bit of a catch-22 situation. But you mentioned ownership as being one of the important soft skills.
Maybe can you give me an example of how ownership works within your data engineering team?
Mo Sabah: Going back to the example, let's say a metric is broken, let's say a metric like revenue is wrong. I would like to understand what I can do so it doesn't break again. That is where the ownership mentality comes in.
It is important for you as a data engineer to understand the business value of that metric, and the business value of the pipeline and the data being up to date and reliable. To me, ownership mentality really means somebody who treats this company as their own and doesn't look at it as, let's say, a nine-to-five or nine-to-six. They look at it as: how can I really make an impact for the end user? And the end user in this case could be an analyst, or an actual member on Thrive Market, or a stakeholder.
So if you have that mentality, you will not look at solving problems in a silo or in a limited fashion; you will look at them holistically. And again, I've used this term a few times: common access patterns, or the common scenarios or common problems that I've seen.
Okay, I'm looking at this particular problem, but what other things have I seen in the recent past that could be related to it? Can I look at it holistically and solve all of them together, or a big chunk of them together?
That to me is ownership mentality. It's going above and beyond, really thinking and asking a lot of questions. You may not know all the answers, which is fine. But having that mentality will allow you to really question things. Even for a metric that you're developing: does it exist somewhere else?
What is the end goal that the stakeholder has? What are they trying to achieve? That to me is ownership mentality: going above and beyond in terms of the scope, the requirements, and the solution.
Richie Cotton: Nice. So it's really about being able to think about problems even before you've been asked to solve them. Just thinking about what's going on in data engineering right now, is there anything that you're particularly excited about?
Mo Sabah: Yes, I'm very excited that we have finished this migration. It was, of course, a challenging project, but we did an excellent job at it. Now, the next thing: again, remember my north star is always self-service access, because that is the only way you can scale up. In there, we talked about those three pillars, right?
Data governance, data quality, and data documentation. We are making good progress on all of them. If I were to pick one, I would pick self-service as one of the big things on my mind and on my data team's mind. If I were to pick a second one, it would have to be consolidation of the data sets. Remember, we talked about going down to one-tenth of the number of data sets in this migration.
Now, how do I go from there to one-fourth of that, and really have data sets that allow better self-service? That is where data dictionaries are also going to be a big chunk of our work. I'm very excited about that. The team is working closely with the stakeholders, and the feedback has been very positive.
And it depends on the level of familiarity of the stakeholders. Some people may want to do self-service at the Domo layer, the visualization layer. Some may want to do it one layer lower: if they're familiar with SQL, they want to look at the Snowflake layer, right? So that to me is very exciting, because A, there is certification and data governance in place; B, there is data quality; and C, there are data dictionaries and tooling in place.
So once you have these three, self-service really becomes more than a dream. It becomes a reality.
Richie Cotton: That sounds great. And I do like the idea that you need to provide access at a self-service level for people who don't want to write any code and just want to point and click or look at a dashboard. And you also have that other layer where people might want to drill in, write their own queries, and explore the data themselves.
All right, so do you have any final advice for organizations wanting to improve their data engineering?
Mo Sabah: If I were to summarize it into three points, I would say the first and most important thing is to understand your stakeholders. And when I say understand, I mean see the common patterns. What is their north star? What are they really trying to achieve? So you can look at the problem holistically.
The second big thing is self-service. That's the only way to scale up. You really want to be in a place where you could be chilling at the beach in Florida and the stakeholders have self-service access to that data. That's how your world should be. And even if there are issues raised through your alerts and monitoring, they can be self-corrected by the engine itself. Now, we have some ways to get there, but that's the goal, right? Have a clean, easy, and accessible way for stakeholders to access their data and trust their data. And this is where we talked about data governance, common utilities, documentation, and so on.
So that's the second one: drive self-service. And the third piece of advice, I would say, is to embrace managed services, just like we did. And we have done it not just in the data team but even outside of it. You have tools out there, and we're in a very fortunate time right now where you don't have to reinvent the wheel; you don't have to start from scratch. Your goal should be: how can I add maximum business value, and how do I get myself out of this business of maintaining the infrastructure? So those are the three pieces of advice I give to data engineers. To repeat: understand your stakeholders and the requirements, build towards self-service, and embrace managed services. That's your friend.
Richie Cotton: All right. That all sounds like great advice to me, and I really do love the idea of chilling on the beach with all the data services still working. That sounds like a dream. Okay, thank you very much for your time, Mo.
Mo Sabah: Thank you, Richie. It was great talking to you.