Building Trust Through Data with Prukalpa Sankar, Co-Founder of Atlan

Richie and Prukalpa explore challenges within data discoverability, the inception of Atlan, the importance of a data catalog, human collaboration in data governance, the future of data management and much more.
Jun 6, 2024

Guest
Prukalpa Sankar

Prukalpa Sankar is the Co-founder of Atlan. Atlan is a modern data collaboration workspace (like GitHub for engineering or Figma for design). By acting as a virtual hub for data assets ranging from tables and dashboards to models and code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Slack, BI tools, data science tools and more. A pioneer in the space, Atlan was recognized by Gartner as a Cool Vendor in DataOps, as one of the top 3 companies globally. Prukalpa previously co-founded SocialCops, a world-leading data-for-good company (New York Times Global Visionary, World Economic Forum Tech Pioneer). SocialCops is behind landmark data projects including India’s National Data Platform and SDGs global monitoring in collaboration with the United Nations. She was awarded Economic Times Emerging Entrepreneur for the Year, Forbes 30u30, Fortune 40u40, and Top 10 CNBC Young Business Women 2016. She is also a TED speaker.


Host
Richie Cotton

Richie helps individuals and organizations get better at using data and AI. He's been a data scientist since before it was called data science, and has written two books and created many DataCamp courses on the subject. He is a host of the DataFramed podcast, and runs DataCamp's webinar program.

Key Quotes

That was kind of what started the journey of what turned into Atlan. We were like, how do we solve this problem for ourselves as a team? And some of the inspirations that we took were actually other teams, right? Let's look at how engineers use GitHub: your code has a profile on GitHub. Sales teams, you know, every lead and opportunity has a profile on Salesforce. Why can you not do the same thing with your data assets? Why can't you have a simple profile that tells you, like, where does it come from? What does it mean? Who owns it? What's the lineage of it? Like, how has it changed over time?

What does it take to build trust and collaboration across these diverse people? And I think the first step is knowing that there is diversity and embracing that. The second is then, I think, investing in systems that create trust and collaboration. So what creates trust and collaboration? Typically, transparency creates trust and collaboration. When people see things, they believe it more. So for example, if at that point, when the number is broken, I just have a proactive announcement that says the pipeline broke today, then as a business person I'm going to trust my data team a little bit more. So the best data teams that we know actually measure this. They measure an NPS, they measure a score, they measure themselves on proactive versus reactive communication. That becomes the one foundation for trust. And the third is then, once business understands what the core challenges of data are, and once they believe and trust that you're doing the right thing, then they become a lot more involved in actually enabling data teams with, like, hey, this is the metric, I'm changing the metric definition. And so I think some of the most empowering stories that we've heard have been when we've had these, you know, data leaders come to us and say that, you know, before we used to be a service organization and people just used to ask us for data. And now we are a partner. Like, they look at us as a partner. They're calling us with their strategic problems.

Key Takeaways

1

Utilize automated tools to construct data lineage, ensuring a detailed map of data flow and transformation within the organization to facilitate easier debugging and compliance.

2

When starting data governance initiatives, focus on projects that offer high business value with low implementation complexity to quickly demonstrate the benefits and gain buy-in.

3

Embed data governance tools within existing systems and workflows (e.g., Slack, Jira, Teams) to improve user adoption and reduce resistance to new tools.

Transcript

Richie Cotton: Hi Prukalpa. Great to have you on the show.

Prukalpa Sankar: Hi. Thank you for having me. I'm super excited.

Richie Cotton: I think this is gonna be fun. Uh, but I'd like to start off with one of my biggest frustrations. So as a data scientist, quite often I get asked to analyze some data and then it's like, hey, go and have a look at this database.

And I have no idea where to find the data that I need. So do you have any advice on how to make data more discoverable?

Prukalpa Sankar: Well, I mean, the easy one is use Atlan, but, uh, you know, I mean, we heard this so many times in terms of, like, the frustration that most data people have, right? Like if you think about it, what seems like the fun part of the job, which is let's work on this data and analyze it and find interesting trends for the business and like solve big problems, which is why all of us like do what we do, right?

I mean, like, that's why we decided to pick up this, like, almost techno-business role in the companies that we work in. And the sad thing is, most of us don't actually ever get to spend most of our time doing that stuff, because you're spending like 80 percent of your time firstly finding the data that you need for running an analysis, and then you're trying to actually understand it. You're like, oh, what does this metric mean?

Oh, there was this new Salesforce field that was created by the sales ops team, but we have no idea what it actually means. Then you're trying to like understand what, you know, like what the actual metrics are and how do we define annual recurring revenue? And, you know, what does a customer really mean?

And most of your job is doing that rather than actually solving the business problem. Uh, you know, at Atlan, we started as a data team ourselves. We were data practitioners. I was a data leader running a team that was doing this. Uh, and on the other side, there was this massive frustration that I had as a data leader, which is, I felt that we were so fragile.

Like, I remember this one time where, uh, we had this really big project that was due, um, and I had, like, looked into the eyes of my client and I was like, we will make sure this happens and I'm going to deliver it. And, um, exactly a week before the project is due, our lead analyst on the project quit. And I remember spending, like, hours. I took him out for, like, three hours, and I literally begged him to stay back. I was like, please don't leave, because he was the only one who had all the context about the data, uh, and if he left, I did not know how I was going to deliver the project for my client.

That is a pretty fragile place to be as a leader, you know, and, uh, that was kind of what started the journey of what turned into Atlan. We were like, how do we solve this problem for ourselves as a team? And some of the inspirations that we took were actually other teams, right?

Uh, let's look at how engineers use GitHub: your code has a profile on GitHub. Uh, sales teams, you know, every lead and opportunity has a profile on Salesforce. Uh, why can you not do the same thing with your data assets? Why can't you have a simple profile that tells you, like, where does it come from?

What does it mean? Who owns it? What's the lineage of it? Like, how has it changed over time? Uh, and so we started saying, like, what does a data asset 360 look like? Um, and then around that, how do you build an ecosystem for search? I mean, like, it's so much easier for, like, today on Google, you and I can search to find what we need.

And, like, on, um, uh, on Amazon, it's so easy to browse for stuff. So how do you kind of bring that ecosystem around your data assets inside an organization? Yeah. Um, and you know, we think of this as sort of like, there's been some form of this in the industry called data discovery, data catalogs. We, you know, they've traditionally failed and there's a lot of reasons they fail.

We failed three times over implementing a data catalog ourselves as a team, and the fourth time around we built Atlan. Uh, and so, you know, we call ourselves sort of like a third generation, uh, offering of a catalog to actually make this work inside an organization. And we can talk a little bit more about what it takes to make that happen.

Richie Cotton: Okay. So that idea of, like, uh, fragility is really interesting, because it's all very well your data team performing in good times, but when the times go bad and something goes wrong, you still need to be able to deliver something and avoid those disasters. So you mentioned the idea of a data catalog, and I'd like to dig into that further.

Just to begin with, can you explain what is a data catalog?

Prukalpa Sankar: Yeah. So, I mean, the origins of the data catalog kind of sit in this principle of, if you treat your data, the way, you know, an organization would treat any other asset, right? So if you would just look at like, like the best example that I would think of is on Amazon, when you're going to be able to go browse for like a specific, um, you know, for shoes, and then you can go and look at the types of shoes that you want.

I mean, those shoes have all the context that you want, so you can make a decision on, hey, here's the shoe I want to buy. You take that same example and you bring it to data. Uh, the most basic first step for a data catalog inside an organization is, if I'm a business user and I'm like, I want to understand why our customers are churning, I should be able to go search for customer churn, and I should be able to find the metrics, like, you know, churn rate, and then, hey, this is a certified dashboard.

Someone's already done analysis on. If I want to dig in a little bit more, I can dig in. I understand what, what each of those metrics mean. I understand, you know, how I can use it. Um, and then I basically hopefully make my decision. But, you know, when I go to my second order, third order, fourth order, all of those things are available.

So the way I think of it is, is it is the home for data inside an organization. Um, how do you build that one home interface for, for data people?

Richie Cotton: Okay, I like that idea of, um, shopping for data. It's in the same way that you'd shop for shoes. Um, so, just carry on with that analogy. So, if I go on a shopping website, I see a pair of shoes.

There's like a product page that tells, you know, it's got pictures, it's got descriptions, it's got reviews. Um, what's the equivalent for data? And, in fact, who would create all that information?

Prukalpa Sankar: And that's where, like, the nuance becomes really interesting, right? Uh, because if you think about data, first, what is a data asset or a data product?

Uh, and it's, is it just the table? No, not really, right? Like, if you think about today, data assets are not just tables. Metrics are data assets, dashboards are data assets, code itself is data assets, pipelines are data assets, like each of these things are data assets, and ideally they should be reproducible and reusable, um, and so if you think about the, that's where the nuance becomes really interesting, uh, where depending on who is looking for churn, and like, let's pick, if I'm, if I'm a data scientist versus I'm a business user, if I'm a business user, I just want the dashboard.

And ideally, my system should just show me the dashboard. And then, you know, when I go into the dashboard, I should be able to see what is this, like, it's almost a profile page for my dashboard, right? The first step. This is the dashboard. Each metric. Here's what this metric means. Here's who owns it. Um, here's like, you know, like, you often have this question, like, when you look at the data, you're like, you know, I've had data business leaders telling me this, like, I look at the data and I'm like, that doesn't look right.

And the thing is, what I don't know is, is there something wrong with my business or is there something wrong with my data? And in an ideal case, the profile should answer that question for you. You should be able to say, yeah, like, you know, there's nothing wrong with my data. There's something off in my business.

Let me dig in. Let me understand why there's something that's changed this week versus last week, right? Um, so that's, you know, the proliferation of this for an ideal business user. And for example, we actually just delivered this into the experience in a dashboard. So they don't even know they're using a data catalog.

They just get this context in like a Tableau or a Power BI when they're looking at it. Versus if I'm a data scientist and I look for churn, now I'm probably like trying to understand, like I'm trying to do a deep dive churn analysis. So the first thing I need to know is what are all the other projects that have ever been worked on in the organization?

Uh, what are the tables associated with it? And so when I go into a table profile, I probably want to know every column. What does it mean? Uh, audit logs. I want to know version history. I want to know who owns this. I want to know the lineage, like how was this created? Uh, I want to know, is this, when was this last updated?

I want this whole 360 context that is, that, that I need to be able to figure out, okay, like, Should I use this data for my project or not? Um, and so that's where it's, it's very interesting where if you take consumer patterns, I think the ideal solution for this and data is actually a mix of Google and Amazon and Netflix, uh, where there's an element of Google and Amazon, but in the thing about like shoes is that if you and I are looking at shoes and we're making a decision on shoes, we probably want to see the same information.

That's not true for data because, and that's where like the Netflix analogy comes in, because if you're a data scientist and I'm a business user and we're looking at the same thing, we want to see different pieces of information to help us make decisions on it. And that's where like Netflix kind of personalizes the experiences for you and me.

I think that's the last leg for making this really work inside organizations.
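The "profile page plus personalization" idea in this answer can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `AssetProfile` fields, the `PERSONA_FIELDS` mapping and the `view_for` helper are hypothetical names invented for this sketch, not Atlan's data model or API.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List


@dataclass
class AssetProfile:
    """A hypothetical '360' profile for one data asset."""
    name: str
    asset_type: str                 # e.g. "dashboard", "table", "metric"
    description: str                # what does it mean?
    owner: str                      # who owns it?
    last_updated: date              # when was it last refreshed?
    upstream: List[str] = field(default_factory=list)      # lineage: where it comes from
    columns: Dict[str, str] = field(default_factory=dict)  # column name -> meaning


# Assumed mapping: which profile fields each persona sees first.
PERSONA_FIELDS = {
    "business_user": ["name", "description", "owner", "last_updated"],
    "data_scientist": ["name", "description", "owner", "last_updated",
                       "upstream", "columns"],
}


def view_for(profile: AssetProfile, persona: str) -> dict:
    """Return only the fields that are relevant to the given persona."""
    return {f: getattr(profile, f) for f in PERSONA_FIELDS[persona]}


churn_profile = AssetProfile(
    name="customer_churn",
    asset_type="table",
    description="Monthly churn rate per customer segment",
    owner="analytics-team",
    last_updated=date(2024, 6, 1),
    upstream=["raw.subscriptions"],
    columns={"segment": "Customer segment", "churn_rate": "Churned / active"},
)

print(view_for(churn_profile, "business_user"))
print(view_for(churn_profile, "data_scientist"))
```

The same underlying profile serves both people; only the view differs, which is the Netflix-style personalization described above.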

Richie Cotton: Okay, that's really interesting the idea that different people want to see different things from data. And so you've got to cater to all those different audiences. Um, I'd love to get into that in a bit more detail later, but you also mentioned a term data lineage.

Um, can you just talk me through what that involves?

Prukalpa Sankar: The best way to explain it in my mind is the flow of data inside an organization, right? So typically, um, when you're thinking about building a dynamic real-time data system, in most organizations the data is actually coming from a source system.

So it's something like a Salesforce or a NetSuite or like an internal tool. Um, that data typically goes into a larger-scale compute system, right? So that's typically where a data warehouse, like a Snowflake, comes in. Once you go there, you typically have your raw data, you're transforming it, and then you're turning it into your final goal-state data, which is like business datasets.

Those are then getting pulled into a dashboard, which is a completely different tool, right? So if you think about this process, your data is hopping through at least five or six systems; sometimes, in large enterprises, I've seen, you know, hundreds of hops before something makes its way down to a final business user on a dashboard.

Which is why, when you ask this question, why is this number on my dashboard wrong? It seems like a really simple question. When you're a business user, you look at an Excel sheet because, like, your, your, your analogy is Excel. And you're like, it should be easy to figure this out. It's actually a very hard question to answer.

Because why this number could be wrong could be because the data ingestion pipeline from the source system to the warehouse failed today and didn't update the data. Or it could be because some human updated the definition of annual recurring revenue, and so the way we measure this metric changed. Or it could be because the data quality check in the transformation pipeline failed.

Uh, and so there could be, you know, hundreds of reasons, and at least 10 people involved in that simple question of why something is broken. Uh, and so the foundation for this is the map. If you want to solve this problem, you need a map, and you need to have a map of exactly what I described, down to the individual field and column level, to be able to say, hey, this is how data flows across the organization.

Uh, and that's what data lineage is.
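As a rough illustration of the "map" described here, column-level lineage can be modelled as a directed graph and walked upstream from a dashboard number to every field that feeds it. The sketch below is a toy under stated assumptions: the field names in `EDGES` and the `upstream` helper are made up for this example and do not come from any real system mentioned in the conversation.

```python
from collections import deque

# Hypothetical column-level lineage: each key is a field, each value lists
# the fields it is directly derived from (one "hop" upstream).
EDGES = {
    "dashboard.arr_chart.arr": ["warehouse.finance.arr"],
    "warehouse.finance.arr": [
        "warehouse.raw.opportunities.amount",
        "warehouse.raw.opportunities.close_date",
    ],
    "warehouse.raw.opportunities.amount": ["salesforce.opportunity.amount"],
    "warehouse.raw.opportunities.close_date": ["salesforce.opportunity.close_date"],
}


def upstream(field_name):
    """Breadth-first walk returning every field that feeds `field_name`."""
    seen = []
    queue = deque([field_name])
    while queue:
        current = queue.popleft()
        for parent in EDGES.get(current, []):
            if parent not in seen:
                seen.append(parent)
                queue.append(parent)
    return seen


# Everything that could explain a wrong ARR number on the dashboard.
for candidate in upstream("dashboard.arr_chart.arr"):
    print(candidate)
```

With a map like this, "why is this number wrong?" becomes a bounded list of places to check instead of a hunt across every tool.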

Richie Cotton: Okay, uh, so it just seemed like, well, there's the common sort of idea that, uh, debugging anything is twice as hard as building something in the first place, and it just seemed like once you've got sort of half a dozen tools or even, like you said, hundreds of tools, debugging what went wrong in the dashboard is going to be an absolute nightmare, so understanding that process does seem, uh, incredibly important.

So, um, what, what would you need to do to set up a data lineage? Like what, what goes into building this?

Prukalpa Sankar: So the way we think about lineage is, lineage is a fairly complex problem. And the reason most people don't have lineage is because it's, you kind of have to construct it automatically. Otherwise, you'll never really construct lineage.

Um, so the way we do this is in, like, three or four layers. First is read everything that's happening in... So, for example, the first step is we'll read through your SQL query history, anything that's written in SQL, and reverse-construct it. So the first step is: whatever transformations are already happening, whatever is written in code, reverse engineer the code to deduce how data is flowing inside the enterprise.

The second layer is integration systems and integration points. Uh, this is where we've worked very closely with, uh, BI tools to expose more API endpoints. We've worked very closely with transformation tools like dbt, and data ingestion frameworks like, you know, Fivetran and Airflow, to, again, expose more API endpoints through which you can construct lineage across the ecosystem.

So that's what I would think of as the second layer. And the third layer then ends up becoming almost the last leg. It's likely that there are going to be internal systems that don't have either of these, or, you know, you've written your, um, your script in like Scala or something. And like, you know, you can't really reverse engineer this.

That's where what we see with customers is this idea of like shipping, like integrating into the way that developers code. So we wrap, uh, we have a customer that calls this, um, ELTP, extract, load, transform, publish, uh, and publish is the last step where you actually publish a transform, like the change that's happening in your data.

Uh, and you integrate it into the way that developers work. Uh, and what's been really exciting, I think, now with generative AI is, and we, we made one of the earliest investments into AI in the space, uh, is you can actually, like, your AI can do, like, a first level of deduction, uh, directly as a part of the way people are coding.

Uh, and so then a developer kind of gets, like, a recommendation, and then they just kind of have to say yes or no. Um, and make some changes and publish. Um, so that's sort of like what I would think of as the third layer of constructing data lineage. Uh, which is what then creates the map across the enterprise.

Uh, vis-a-vis, just to give you a sense of how this used to happen like five years ago: people used to manually construct this. Like, I have seen people manually build maps, uh, at like a system level. And they've shown us these manually drawn maps, uh, which just doesn't scale inside organizations.

And so if you want to do this, this has to be automated. This has to be system first. Uh, otherwise it's never getting done.
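The first layer mentioned above, reading SQL query history and reverse-constructing lineage from it, can be illustrated with a deliberately naive sketch. It assumes a simplified setting: queries are plain `CREATE TABLE ... AS SELECT` or `INSERT INTO ... SELECT` statements, and two regular expressions are enough to find target and source tables. A production system would use a proper SQL parser; the `table_lineage` function and the sample `QUERY_HISTORY` are hypothetical and are not how Atlan implements this.

```python
import re

# Naive patterns: the table being written, and the tables being read.
TARGET = re.compile(
    r"(?:insert\s+into|create\s+(?:or\s+replace\s+)?table)\s+([\w.]+)",
    re.IGNORECASE,
)
SOURCE = re.compile(r"(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def table_lineage(query_history):
    """Map each written table to the set of tables it was built from."""
    lineage = {}
    for sql in query_history:
        target = TARGET.search(sql)
        if target is None:
            continue  # a pure read creates no new lineage edge
        name = target.group(1)
        sources = set(SOURCE.findall(sql)) - {name}
        lineage.setdefault(name, set()).update(sources)
    return lineage


# Made-up query history for illustration only.
QUERY_HISTORY = [
    "CREATE TABLE analytics.arr AS SELECT a.id, SUM(o.amount) "
    "FROM raw.opportunities o JOIN raw.accounts a ON o.account_id = a.id "
    "GROUP BY a.id",
    "SELECT * FROM analytics.arr",
]

print(table_lineage(QUERY_HISTORY))
```

Because the edges are derived from queries that already ran, nobody has to draw the map by hand, which is the point being made about automation.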

Richie Cotton: Yeah, I can imagine that's an incredibly tedious job and you're probably just hiring lots of interns because no one wants to do this sort of thing, just like working out what system connects to what.

Prukalpa Sankar: Yeah, I have a funny story. So recently we were working with a large Fortune 100, uh, you know, they plugged in the system, they got their lineage out in like a couple of weeks. We go in, we show them lineage and this is like across like complex third generation systems. They see this, people are like, wow, this happened and they were like, what did it take to do this?

And then our solutions engineer basically said, 20 interns over six months, um, and everybody... and then, um, it was a joke that nobody got. And then our sales leader, he was like, just to clarify, that was a joke. You just plugged in the software.

But yeah, it's, um, that's just the reality of how things have been done over time.

Richie Cotton: Nice. Uh, yeah. So, um, sad for the people who wanted to be an intern, but at least it's not a terrible job. It's good that this is happening automatically. Um, all right. So, um, I like, uh, your idea that, um, you can have several different layers of data lineage. You don't have to build everything at once. Um, maybe it's worth just, you know, talking through how you get started with this, like, um, what do you need to do?

Like, do you build the data catalog first and then worry about lineage? Like, what's the order of this?

Prukalpa Sankar: The way I think about this is using what the category is. All of these problems are very interrelated. Uh, and there used to be a world where, you know, these were almost like separate software category products, and there would be, you know, a separate data catalog, separate data lineage, separate data governance, separate data stewardship; like, these were all separate.

But if you fundamentally break this down, the fundamental problem you're trying to solve is you're trying to create a single source of truth across your data ecosystem. And that single source of truth translates into all of these different applications. One use case of that is making it really easy for people to find data.

Another use case of it is, like, preventing breaking changes, uh, or root cause analysis of dashboards. Another use case of it is compliance and GDPR. And so the way I think about it is that these are use cases more than, you know, categories, right? Um, and so the way we see this, or the pattern that we see, is you invest in building your metadata foundation.

And we think of this as almost like your control plane. Like, the data stack needs a control plane. Uh, and the control plane needs to be in the metadata layer. And so what you want to do is basically invest in building a foundational metadata layer, which is building this data 360 of sorts across your enterprise.

Um, and then on top of that you almost enable use cases. So the pattern we see is that, you know, like we, we've mapped 40 use cases for this, each of which have business value, et cetera. Typically what you want to do is you want to invest in platform, but enable the first or the second use case. Uh, and that's dependent.

And then you kind of, you know, grow over time. And the first or the second use case is a function of what's most important to your business today. Is the biggest bottleneck that you're having root cause analysis and, you know, um, or, or breaking changes in dashboards? Is the biggest challenge that you're having compliance?

Is the biggest challenge that you're having democratization and self service? Depending on that, you kind of prioritize which use cases you want to enable and then go along.

Richie Cotton: Okay, so I mean, you mentioned a lot of sort of common business problems there. I'm wondering, is there, do you have advice on how or which bits of data you should get started with?

Like, do you start with the easy, simple data sets, or do you start with like the business critical data sets? Like, where do you begin?

Prukalpa Sankar: That's a great question. And the way I think about this is, yeah, you kind of break it down into implementation complexity versus implementation value. Uh, and in an ideal case, you are going after something that is high value, low effort.

Uh, and the reason for this is just, you know, you want to as quickly as possible get to business value. Uh, and you want to start proving as much business value as possible. Uh, so the best examples of this that I've seen have been actually, like, we have teams that actually have this quadrant, and it's basically based on value and complexity.

Um, and, you know, there's dots on it and they're prioritizing new use cases. Um, so, you know, it's not a straightforward answer because, for example, the most business critical problem to solve might be something that's going to take you a year to solve. And this is not because the technology is hard: the technology systems part of this today is actually the easy part of the problem, uh, or the easier part of the problem. Like, if you pick the right tool, uh, I don't think that's the hard part of the problem.

Uh, the hard part of the problem is the human last mile. Like, for example, things like metric definitions: what does a metric mean? How do we define customer? No tool in the world is going to define how a business defines customer, right? So if it's a use case like that, and it's going to take you six months to drive through change management, then, you know, you could argue that maybe there's a slightly lower impact, lesser effort use case that you want to enable first. So, you know, I think what people get wrong with this is strategy, uh, and they don't invest the amount of time it takes to get the strategy right. Uh, and you know, we highly recommend, like, in fact, we now in our university have a video on why strategy is important, and we'd be like, don't buy unless you see this, like, you have to align to this.

Um, because at the end of the day, when you're thinking about building a single source of truth, and this is true for any CRM system like Salesforce, or Lattice for people management, it's not just a tool that you just switch on, switch off, right? Like, for example, data ingestion is a switch on, switch off.

Uh, anything that sits one layer above, that's a little bit more strategic in nature across the organization, needs people's buy-in across the organization. Uh, and so if you get the strategy wrong, everything else breaks. Um, and so that's, for example, been a very, very big investment that we've made in, like, saying, how can we be a strategic partner in the journey that we help our customers take?
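The value-versus-complexity quadrant described in this answer can be approximated with a small scoring sketch. Everything here is an assumption made for illustration: the 1 to 5 scales, the threshold of 3, the `UseCase` class and the example scores are hypothetical, and real teams would choose their own scales and weights.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    name: str
    value: int   # business value, 1 (low) to 5 (high); assumed scale
    effort: int  # implementation complexity, 1 (low) to 5 (high); assumed scale


def quadrant(use_case, threshold=3):
    """Place a use case in one of the four value/effort quadrants."""
    value = "high value" if use_case.value >= threshold else "low value"
    effort = "high effort" if use_case.effort >= threshold else "low effort"
    return f"{value}, {effort}"


def prioritize(use_cases):
    """Order use cases so high-value, low-effort work comes first."""
    return sorted(use_cases, key=lambda uc: (uc.effort - uc.value, -uc.value))


# Hypothetical scores for three of the use cases named in the conversation.
backlog = [
    UseCase("Agree metric definitions across the business", value=5, effort=5),
    UseCase("Root cause analysis for broken dashboards", value=4, effort=2),
    UseCase("Compliance tagging", value=3, effort=2),
]

for uc in prioritize(backlog):
    print(f"{uc.name}: {quadrant(uc)}")
```

In this toy backlog the metric-definition project scores highest on value but sorts last, mirroring the point that a six-month change-management effort may not be the right first use case.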

Richie Cotton: It's interesting that it's like, technology is easy, it's your colleagues that have a problem with that. Um, so one thing you mentioned was, um, that, like, no one can agree on definitions of business metrics. This is certainly something I've seen. Like, um, at DataCamp we, you know, we argue about, like, what is the definition for customer lifetime value, or how do you do marketing attribution, or something like that.

These things change a lot. So can you talk me through what you need to do to get to clear definitions of these critical business metrics?

Prukalpa Sankar: I think of these as like two step problems. The first is a human collaboration problem, uh, which is what is, what is the metric, define the metric, and define the change management process around the metric.

And so what's the workflow when the metric changes? Uh, and so the way we think about step one is, let's figure that out. Uh, and, uh, you know, we almost think of this as, like, if you have Asana for project management, what is your Asana for metrics management? Uh, and because it has so many different stakeholders, right?

Cause it has your business leader. It has your business ops, it has your data person, it has your compliance, like there's a whole layer of people who need to be involved around that process, both in the definition, as well as the change management. The interesting thing is actually the definition is actually, is maybe the easier part of the problem, right?

Like, you get everyone in the room the first time, you get it right. The challenge is it keeps changing, because business changes. And if business changes, the metric changes. This is a... So I think that's the first step of the problem: how do you enable effective tooling to enable the true human collaboration challenge that exists when something changes?

The second layer is tying this into execution. And the biggest challenge actually, like, so Gartner used to have a category called information stewardship. Like, there used to be an entire category of tools called information stewardship and, you know, have you ever spoken to a data leader who's like, we've solved this problem?

No, it's never, like, you know, I've literally never spoken to a data leader who said that this problem is solved. And I think that's because this doesn't get tied into the way people work. Uh, so let's think about when something changes: typically you're moving very quickly. So someone in business is telling someone in data.

And at that point, nobody's actually, like, you know, you're just fixing something and shipping. That's the reality of what happens. So if you're not integrating the way you've defined your metric with the way you're defining your code and the way your code works, this breaks. Uh, and so we think of this as, you know, governance as code.

Uh, so we call it, like, shift-left governance: take it back into the way developers work, data engineers work, analytics engineers work, so that when you're shipping, uh, as a part of that ship process, it automatically updates your metric. So there's a full flow there, which is what makes the biggest connection between the way your data ecosystem works and the way your, uh, developer workflows work.

And if you get both of those things right, then you actually have a relatively seamless system, uh, in making this work.
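One way to picture "governance as code" is a publish step that runs whenever code ships and syncs metric definitions kept alongside the code into the catalog. The sketch below is a minimal illustration under stated assumptions: the in-memory `catalog` dict stands in for a real catalog, and `fingerprint`, `publish` and the `arr` definition are hypothetical names, not an actual Atlan or dbt interface.

```python
import hashlib
import json


def fingerprint(definition):
    """Stable short hash of a metric definition, used to detect changes."""
    payload = json.dumps(definition, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def publish(metrics_in_code, catalog):
    """Sync metric definitions from the codebase into the catalog.

    Returns the names of metrics that were created or changed, so a
    notification or approval workflow could be triggered for them.
    """
    changed = []
    for name, definition in metrics_in_code.items():
        new_fp = fingerprint(definition)
        entry = catalog.get(name)
        if entry is None or entry["fingerprint"] != new_fp:
            version = 1 if entry is None else entry["version"] + 1
            catalog[name] = {**definition, "fingerprint": new_fp, "version": version}
            changed.append(name)
    return changed


# Hypothetical metric definition living next to the transformation code.
metrics = {
    "arr": {
        "description": "Annual recurring revenue from active subscriptions",
        "owner": "finance",
        "sql": "SELECT SUM(amount) * 12 FROM subscriptions WHERE active",
    }
}

catalog = {}
print(publish(metrics, catalog))   # first ship: the metric is registered
print(publish(metrics, catalog))   # nothing changed: nothing to publish

# Someone changes the definition in code; the next ship updates the catalog.
metrics["arr"]["sql"] = "SELECT SUM(amount) * 12 FROM subscriptions WHERE active AND paid"
print(publish(metrics, catalog), catalog["arr"]["version"])
```

Because the catalog entry is updated as part of shipping, the business-facing definition cannot silently drift from what the code computes. Who approves the change is left out here, since that workflow varies by organization.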

Richie Cotton: Okay, so just to make sure I've understood this correctly, what you want is a situation where, if you make a change to the code, the official business, um, definition of the metric is going to automatically change at the same time.

Uh, stay in sync.

Prukalpa Sankar: Exactly. And I mean, and you can, in some ways, like the, the way we've tried to define this, we've also realized the workflow of this is different depending on the organization, right? Because in some organizations, business defines the metrics and data services it. In some organizations, data defines the metric and business consumes it.

Uh, and so I think the actual workflow of the way this works, like where the approval is could change and where the definition is could change. And so I think it's important to build this in a way that's flexible. Like, the only thing we've realized is that the only reality in data is diversity.

No two teams are the same. And so you want to be able to build this almost as a workflow system, but ideally the end state is that the way people are working and making changes in their metrics and code is directly related to the way business is consuming those metrics and those definitions, and your change management is integrated into the way the organization works.
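As a rough illustration of the "governance as code" idea described above, here is a minimal Python sketch in which a metric definition lives alongside the code and a publish step in the ship process keeps the business-facing definition in sync. Everything here is an assumption made for illustration: `MetricDefinition`, `publish_metric`, and the in-memory `catalog` dictionary are hypothetical names, not Atlan's API, and the `approved` flag is only a stand-in for whatever approval workflow a given organization uses.

```python
from dataclasses import dataclass
import hashlib


@dataclass(frozen=True)
class MetricDefinition:
    """Hypothetical metric definition kept in the same repo as the code."""
    name: str
    description: str  # business-facing definition
    sql: str          # how the metric is actually computed

    def fingerprint(self) -> str:
        # Hash of definition plus logic, used to detect any change.
        payload = f"{self.name}|{self.description}|{self.sql}"
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def publish_metric(catalog: dict, metric: MetricDefinition, approved: bool = True) -> str:
    """Sync a metric into the catalog as part of the ship process.

    Returns "unchanged", "pending_approval", or "updated".
    """
    current = catalog.get(metric.name)
    if current is not None and current["fingerprint"] == metric.fingerprint():
        return "unchanged"
    if not approved:
        # Who approves (business or data) differs by organization.
        return "pending_approval"
    catalog[metric.name] = {
        "description": metric.description,
        "sql": metric.sql,
        "fingerprint": metric.fingerprint(),
    }
    return "updated"


# Illustrative usage with made-up metric text.
catalog = {}
v1 = MetricDefinition(
    name="active_customers",
    description="Customers with at least one order in the last 30 days",
    sql="SELECT COUNT(DISTINCT customer_id) FROM orders WHERE order_age_days <= 30",
)
print(publish_metric(catalog, v1))
```

Because the fingerprint covers both the description and the query, a change to either one forces the catalog entry to be refreshed (or sent for approval) before the change ships.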

Richie Cotton: Maybe let's talk more about like what needs to happen with the people and the teams in order to make this work. So, uh, you mentioned a few job titles like data steward or data governor, uh, but then there's also the business people. So in terms of, um, making sure that all your data is set up right, like can you talk me through like which teams and roles need to be involved in these decisions?

Prukalpa Sankar: So the core owner of this has to be the data team. It needs to be a core layer in the data team. Um, the ideal case that we've seen is we almost see these structures that we call like, sort of like the, the modern data governance team or the modern like, and the way we think about that is you, you kind of have to, like the best analogy that I think of is if you think about a sales team, typically sales teams have two, two key leads outside of sales leaders.

One is SalesOps and the other is Sales Enablement. SalesOps is typically working on your systems and your tech. Uh, Sales Enablement is typically working on people and change management, human change management. Uh, to make this work inside an organization from a data perspective, you need both data tooling and engineering, and you need human change management to work hand in hand.

So the ideal structure of this team that I would think of is like a tag team between a data ops engineer and a data governance lead, uh, who are both working together to own the project. From there, actually, I think the, it's pretty decentralized, right?

So, uh, the, the, the key is how do you work with your data engineering team and change the way like data engineering team needs to go from publishing data to shipping data products. Uh, your business teams need to start owning their data. Uh, and you know, so there, there are ways that you want to like empower all the other teams and embed into the way that they work.

But the best case scenario that we've seen is if you, if you have these two kinds of personas driving this inside an organization, you almost always win.

Richie Cotton: Okay. Um, so I, I like the, um, the sort of parallels for the sales ops and sales enablement. Um, data ops, I certainly think, I don't think I've heard of anyone with a job title data enabler.

Do you think that should be a real role?

Prukalpa Sankar: I wish there was, like, you know, like, I, and in some ways I actually think that's what data governance is. Like, you know, if you really think about what a data governance role inside an organization is, it exists specifically for the purpose of bringing people and process and tools together to drive people to use data better, which is a fundamentally enablement-focused role.

I'd actually written this article called Data Governance Has a Serious Branding Problem, because, you know, when people think governance, they think compliance and risk and process and bureaucracy, like that's what they, that's what you think. Um, but actually, like, there's a reason government exists. You know, like there's a reason like, you know, like we don't live in anarchy and like, you know, like there's a, if you think about the parallel to the word governance, like it's not, it's not all bad.

Um, and you know, it, it, it kind of probably helps us as a society work a little better than, than we would otherwise. Right. Um, and so, sometimes I think like maybe we should rebrand this function to call it data enablement. Um, or sometimes, like, maybe you should rebrand this function to call it data product management,

or some, some version of that, just because it has a bad rep. Uh, but foundationally, I think that's what, like, if data governance is done right, it helps people use data better and enables people to use data better.

Richie Cotton: All right, super. I think we've just solved the data governance branding problem. From now on, data enablement.

And so, related to this, um, what sort of skills do people need in order to do this data governance better or have smoother data workflows?

Prukalpa Sankar: I'd say two skills need to come together. Two important skills need to come together. Um, one is change management and people. Uh, and the interesting thing about this is there used to be a time where it was okay to deploy this stuff top down.

Like there was like top down rigid, like, you know, like you have to do this. The reality is like, nobody works like that anymore. So we kind of need something that's like a little bit more bottom up, collaborative. So to give you an example, one of our champions, Emily, um, you know, and when she was working at WeWork, she was the change management data enablement person.

Uh, and, and she basically, um, when she launched, she launched with a data personality, and it was basically this like personality quiz for, you know, all these different types of people, business people, to find out what their data personality is. It sounds like a very simple thing, but actually, like, what would have otherwise been a survey, which no one would take, suddenly became this, like, exciting thing.

People were really excited about being part of it, and they were sharing their data personalities and like things like that. So it became a thing, right? Change management is hard. Data people, well, honestly, we, we don't like doing it,

'cause like, that's why we chose our careers, right? We were not product marketers and we're not like, you know, that we're not community people. Like that's not who we are. Uh, but that skill becomes super, super critical, um, in driving, helping diverse people come together. The second though, is that, but doing that without deep data engineering experience means that you're basically doing a bunch of human stuff that doesn't get automated and run at scale, which means that you're, you're not going to focus on the right problem.

So the best parallel to that, I would say is like a data architect or a data engineer, or like, you know, data governance engineer, whatever we want to call them, uh, who is able to scalably automate wherever is possible. So that, uh, human at the end of the day, change management is hard. So you don't want to do change management if you can avoid doing change management.

Like if you can, if you can automate it, like don't do it. Uh, I think both of those skillsets and personas need to come together to do this effectively.

Richie Cotton: Okay. So, uh, this sounds like, um, it's going to be the tricky part then is that you've got to have data people speaking to business people and they're both gonna get along and communicate.

I suppose this is especially true once you start like trying to, um, like label your data and say, well, the business person knows what the data thing means. And then the data person like wants all the technical details. It's going to be very different who writes what. So how do you make sure that both of these teams can work together?

Prukalpa Sankar: I think in some ways it kind of comes down to what makes a great team, uh, right. Uh, the reality is that the, the only reality in data is diversity. It might be maybe the most diverse team ever created, uh, like, because if you need to solve a data problem, you need business, you need analytics, you need data, like, you know, data engineers, you need science, machine learning researchers, now there are prompt engineers, like there's just a variety of people and skill sets that need to come together to make a data problem work.

And none of these people can speak the same language. None of these people understand each other's work, which is very different from any other team. Like, let's think about a sales team, right? Like most sales people kind of have always, they've done the same job. So like they know, but like, for example, if you, if I remember when, uh, I was a data leader and I'm a little bit more on the analytics business side, I'm not a data engineer.

And I remember when like a number on the dashboard was broken and like, I had to answer a question to my client. I was like, is, did the pipeline really break or is my data engineer not doing his job? And, you know, that, that was a very real question. Uh, and that breaks trust. And when that breaks trust, and people stop believing the data or they stop, the worst case, they stop believing the data team.

And then there are these like shadow analysts. It's just everything breaks and then you have to restart it all over again. And so then the question is, what does it take to build trust and collaboration across these diverse people? And I think the first step is knowing that there is diversity, uh, and embracing that.

The second is, then, I think, investing in systems that create trust and collaboration. So what creates trust and collaboration? Typically, transparency creates trust and collaboration. When people see things, they believe it more. So for example, if at that point when the number is broken, I just have a proactive announcement that says pipeline broke today.

As a business person, I'm going to trust my data team a little bit more. So the best data teams that we know actually measure this. They measure an NPS. They measure a score, they measure themselves on proactive versus reactive communication. That becomes the one foundation for trust. Um, and the third is then, you know, you, you, you find a way to get, once business understands, uh, what the core challenges of data are, and you know, once they, once they believe and trust that you're doing the right thing, then they become a lot more involved

in actually enabling data teams with like, Hey, this is the metric. Like I'm changing the metric definition. And so I think some of the most empowering stories that we've heard have been when we've had these, like, you know, data leaders come to us and say that, like, you know, before we used to be a service organization.

Um, and like people just used to ask us for data. Uh, and now we are a partner. They look at us as a partner. They're calling us with their strategic problems. They're not calling us with, I want data. Um, and that transformation, I think, is a function of transparency, trust, and then finally enabling collaboration on top of it.

Richie Cotton: I do like that idea of, um, just being proactive about saying if there's a problem up front, rather than, you know, some business person calling you saying, hey, I think you've screwed up the dashboard. Uh, yeah, tell people up front and that's gonna, uh, help with the trust. Um, I'm curious, um, from a product point of view, how do you design a product that's going to be suitable for very different audiences, like for a business audience and a technical audience?

Prukalpa Sankar: Diversity needs to be baked in fundamentally into the way the product is built. Uh, so I, I talked to you a little bit about like third generation or why generations of tooling here have failed over time. It's mainly been because you've had tools that were built for one type of audience. And the reality is that like, data is diverse, people are diverse.

So for example, we think a lot about these systems of how do you create personalization as a default. So when you think about the UI, like you are seeing a different, um, different view versus me versus somebody else, depending on the context, depending on who I am, what projects I'm working on, what my role is, uh, my entire experience.

So for example, I'll give you one simple example. Something that data engineers really, really want is they want to know pipeline updates from Airflow. Like, they're like, did the pipeline update or fail or not? Uh, they want to know, did the data quality test fail or not? A business person doesn't want to know any of those things.

They just want to know, can I trust it, can I use it? That's it, just give me an announcement, red, green, yellow. You're actually confusing me by giving me more information. So for example, if you're a data engineer and you log in, you see that stuff. If you're a business user, you don't see that stuff. That's a very simplistic example for like, how do you build systems that work for this diversity?
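A minimal sketch of that role-based view, under stated assumptions: the `AssetStatus` fields, the role name `"data_engineer"`, and the rule that maps technical state to red, yellow, or green are all invented here for illustration and do not describe how Atlan implements personalization.

```python
from dataclasses import dataclass


@dataclass
class AssetStatus:
    """Hypothetical technical state of one data asset."""
    name: str
    pipeline_succeeded: bool
    failed_quality_tests: int


def trust_signal(status: AssetStatus) -> str:
    # Assumed mapping: a failed pipeline is red, failing tests are yellow.
    if not status.pipeline_succeeded:
        return "red"
    if status.failed_quality_tests > 0:
        return "yellow"
    return "green"


def render_view(status: AssetStatus, role: str) -> dict:
    # Everyone gets the simple "can I trust it?" signal.
    view = {"asset": status.name, "trust": trust_signal(status)}
    # Only engineers see pipeline and data quality details.
    if role == "data_engineer":
        view["pipeline_succeeded"] = status.pipeline_succeeded
        view["failed_quality_tests"] = status.failed_quality_tests
    return view


status = AssetStatus("revenue_dashboard", pipeline_succeeded=True, failed_quality_tests=2)
print(render_view(status, "business_user"))
print(render_view(status, "data_engineer"))
```

The same underlying metadata feeds both views; only the amount of detail shown changes with the role.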

And the second is how do you interface in points that people already use? So one of the biggest challenges with rolling out systems that touch all different kinds of users, and the reality of data is it touches all different kinds of users across the organization, is integrating into systems they already use.

So for example, we have a Chrome extension that sits on Salesforce. It sits on top of BI tools. Uh, we sit in Slack, we sit in Jira, we sit in Microsoft Teams. So for example, there's typically some Slack or Teams channel where people will be like, What does customer mean? How are we defining it? Imagine a bot just answers that, right there.

And you have a workflow built right there. Uh, or let's say you're in Jira. You directly get a ticket there that is connected to the data asset that's broken for that day. Uh, and you can auto trigger it. You can actually say pipeline in Airflow broke, auto trigger Jira ticket. So when I start my day, I just start in Jira and I get all the context that I need to work.

Uh, so one key element is we call it kind of like activating metadata and taking it back into the daily workflows that people are already using, uh, and just integrating this context and trust into the way that they're already working. That isn't forcing them to move systems.
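The "pipeline broke, auto trigger a ticket" flow mentioned above could be sketched roughly as follows. This is an illustration only: the event dictionary shape, `InMemoryTracker`, and `handle_pipeline_event` are hypothetical, and no real Airflow or Jira API is called. A real integration would replace the in-memory tracker with a client for the ticketing system.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Ticket:
    title: str
    asset: str
    context: dict


@dataclass
class InMemoryTracker:
    """Stand-in for a ticketing system; stores tickets in a list."""
    tickets: list = field(default_factory=list)

    def create(self, ticket: Ticket) -> Ticket:
        self.tickets.append(ticket)
        return ticket


def handle_pipeline_event(event: dict, lineage: dict, tracker: InMemoryTracker) -> Optional[Ticket]:
    """Open a ticket with context when a pipeline run fails."""
    if event.get("state") != "failed":
        return None
    asset = event["asset"]
    ticket = Ticket(
        title=f"Pipeline failed for {asset}",
        asset=asset,
        context={
            "pipeline": event["pipeline"],
            # Downstream assets give the engineer impact context up front.
            "downstream": sorted(lineage.get(asset, [])),
        },
    )
    return tracker.create(ticket)


# Illustrative usage with made-up asset names.
tracker = InMemoryTracker()
lineage = {"orders_table": ["revenue_dashboard", "churn_model"]}
event = {"pipeline": "daily_orders_load", "asset": "orders_table", "state": "failed"}
ticket = handle_pipeline_event(event, lineage, tracker)
print(ticket.title, ticket.context["downstream"])
```

The point of attaching lineage context to the ticket is that the person starting their day in the ticketing tool does not have to switch systems to see what is affected.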

Richie Cotton: Yeah, I like the idea of not forcing people to use yet another tool, explicitly building things into what they're using already.

Certainly, I find like, even like, I'm kind of a techie nerd, but I get overwhelmed by the number of different tools I have to use. And I say, Oh, I have to go to this app to see this thing and go to the app to see a new thing.

Prukalpa Sankar: Yeah, 40 percent of people at login flow. Get them to log into a different system, that's it. 40 percent is gone.

Richie Cotton: Nice. Okay. So, um, it seems like a lot of the stuff we've been talking about is dealing with things that go wrong. So this data pipeline is broken. This dashboard is not right. Is, um, is the main benefit just around, um, making it easier to deal with problems, or are there benefits for when things go smoothly as well?

Prukalpa Sankar: The way we think about benefits, and the way, you know, people have described this to us, when it works well. When it works well, so when it works well, your, your organization is able to make quick decisions confidently across the ranks. Uh, and I'll break all of those three points. Great. We live in a world where things are changing extremely fast.

Uh, and you have to respond extremely fast to external stimulus. Like just let's pick the last few years, like war, supply chain crisis, COVID. There's like, you know, like you're, you're responding extremely fast to external stimulus. Um, so first, how do you move quickly? And how do you, how are you agile from a decision making perspective?

The second, when you're making the decision, how do you make it confidently? How do you make the right decision? Uh, and that's where trust comes in. Can I trust the data? Is something wrong with my business? Or is something wrong with my data? Um, and the third is, um, is, you know, can you do it across the ranks?

So right from your board, to the last person on the ground. Like, last front line. Can they all make decisions quickly and fast, uh, confidently? I'd say those are the three broad, so when done well, that's what it enables. Uh, and there's this very cool, I was, I was looking at this the other day, there's this like top 10 companies by market cap.

And you look at the companies in 2010, you look at the companies in 2022, and they're very different companies. And the difference between the companies in 2010, companies in 2022, the companies in 2022 are digital first, tech first, but data first. Uh, and you, you know, they're, they're, the way they use data is celebrated across the world.

They've invested in these automated systems. They're very good about the way that they do this, right? Um, and I think that's the best case scenario. To get there, you have to be able to get out of reactive mode to proactive mode. And I think that's the trap that most data leaders fall into or data teams fall into, which is there's a lot of foundational stuff to fix.

And so you're constantly servicing, fixing the problem. I mean, I think the chief data officer tenure is the shortest that I've ever heard. Like, I think it's like 18 months or something like that because you jump in and you're constantly fixing stuff and you don't have the time to get to proactive mode, just to this end state that I was talking to you about.

Uh, and so that's why getting the first set of things fixed is important. So you actually have the time to do the second set of things, but ideally done well, that's what it leads to.

Richie Cotton: Okay. Um, yeah. So I like the idea that if you trust the data, you don't have to waste time second guessing yourself and, you know, sending out that message to the data person saying, is this number right?

And that's going to speed up your decision making. So I can see how that's going to be a productivity benefit, even if things aren't actively going wrong.

Prukalpa Sankar: I mean, productivity, but also like productivity. And this is the hard part, right? Like, I think data teams suck at, like, finding the business value of data, but like, for example, our CEO, Omar Krupa, tells the story in his previous company, he was running a sales team, and he was like, we, you know, like, I brought in an analyst who, within the first two months of being there, identified this gap, helped us improve our conversion rate in this process, and we made an extra 10 million that quarter.

Um, and like, it would, like, there was nothing I could do that could have been more impactful than what this analyst did. And so what is the value of a confident business decision? We're not very good at, uh, measuring and quantifying the value of that, especially in short periods of time.

Cause like, you know, even in this case, was it, was it the fact that we identified this gap or was it the fact that we executed well to actually improve the conversion rate? Both of those things actually needed to happen for the business to improve, but both of those things were, they did impact it. Right.

Uh, and I think that, to me, I think a lot of data people fall into the trap of thinking about their job from a productivity lens, but that's not the value of a data team. The value of a data team is you're helping make the right decisions confidently. Uh, and there is a value to business that comes with that.

Um, how do you define it? Harder question.

Richie Cotton: Yeah, definitely. And I think, uh, most businesses wouldn't, uh, mind too much if they got an extra 10 million a quarter. That's a nice little bonus there from a change in, uh, just finding some new data. Um, okay. So it seems like, um, this sort of better data governance and better observability of data could help you out for industries that have a lot of compliance requirements and regulations to deal with.

Can you talk me through how it's going to help?

Prukalpa Sankar: So typically I think most organizations today are dealing with a ton of regulation, compliance, um, and it's actually amplifying now in the AI world. Uh, and so, you know, like we traditionally have been dealing with things like, or like in the last five years, GDPR, CCPA, local regulation, which country residency acts, right?

Like there's, so first step is like, as you think about this, there's like, what's the policy, and second, how do you, again, execute the policy across your data estate? Um, so the way we're seeing customers that, you know, leverage this is to say, Okay, like, let's define our policy. This is how we use GDPR. But then we have to connect that across my single source of truth in my data estate.

What is PII? Uh, and then I use lineage to propagate PII. So if one column is PII, I propagate that everywhere else. And I say any column that's derived from this all the way to my dashboarding layer is also PII. Um, and then the last step of this is enforcement. So you say, Hey, let me make sure that wherever in the ecosystem there is PII data, it's masked and it's available.

So that's the foundation metadata control plane that can enable that across the ecosystem. Um, and what we're starting to now see is with AI, this is even becoming, you know, we're seeing customers who wanted to deploy different bots for different use cases. So I had a customer the other day, say, um, so when I'm deploying a bot for my HR team, it's okay if we query like if payroll data feeds into the answer.

But when I'm deploying for the rest of my organization, you probably don't want a bot saying this is my colleague's salary, right? Uh, so how do I enable that? And you know, that's a relatively simplistic use case, but like custom, like you're buying data today and there's terms and conditions that are associated with buying data.

We had a customer that said we buy LinkedIn data, but there's terms and conditions that's associated with what, what I can use this for. So how do you, first, define policy, the human collaboration problem, and then how do you execute it across the entire stack so that you're actually adhering to policy, you know, when something's broken, you're able to monitor it.

Um, and how do you kind of like build that flow across the ecosystem? And then the use cases become real time enforcement, data access governance, uh, but also audits. Um, you know, there are specific, like, how do you build out audit reports for specific use cases? How do you trust the report that you're sending to a regulatory authority?

Um, all of those become use cases, uh, associated with this.
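The lineage-based PII propagation described above (tag one column, then treat everything derived from it as PII all the way to the dashboard layer) amounts to a graph traversal. The sketch below is a minimal illustration in Python under stated assumptions: the lineage is modelled as a plain dictionary from a column to the columns derived from it, the column names are made up, and `propagate_pii` and `mask_row` are hypothetical helpers rather than any vendor's API.

```python
from collections import deque


def propagate_pii(lineage: dict, pii_sources: set) -> set:
    """Return every column reachable downstream of a PII source column."""
    tagged = set(pii_sources)
    queue = deque(pii_sources)
    while queue:
        column = queue.popleft()
        for derived in lineage.get(column, []):
            if derived not in tagged:
                tagged.add(derived)
                queue.append(derived)
    return tagged


def mask_row(row: dict, pii_columns: set) -> dict:
    """Enforcement step: replace values of PII columns with a mask."""
    return {col: ("***" if col in pii_columns else val) for col, val in row.items()}


# Made-up lineage: source column -> columns derived from it.
lineage = {
    "raw.users.email": ["staging.users.email"],
    "staging.users.email": ["mart.customers.contact_email"],
    "mart.customers.contact_email": ["dashboard.customer_list.email"],
    "raw.users.signup_date": ["mart.customers.signup_date"],
}

pii_columns = propagate_pii(lineage, {"raw.users.email"})
print(sorted(pii_columns))

row = {"mart.customers.contact_email": "a@example.com", "mart.customers.signup_date": "2024-01-01"}
print(mask_row(row, pii_columns))
```

A breadth-first traversal is used here so each column is visited once even if it has several upstream sources; the same tagged set could also drive access rules or audit reports.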

Richie Cotton: Okay, certainly, uh, a lot of those things seem like fairly grim tasks that I'm sure most humans don't want to do, like saying, Hey, is this table or is this column actually PII that I can't use for some use cases? And then I'm sure nobody loves writing reports for auditors.

So having all that done automatically does sound incredibly useful. Um, all right. So before we wrap up, what are you most excited about in the world of data management and governance?

Prukalpa Sankar: I think what I am most excited about is we are in this very interesting moment where we can get it right, right now. Um, and a few things have come together.

This year, Gartner did a survey of chief data officers. The highest spend, budget that people have is for data governance this year. And that's mainly because as you think about where we are in the ecosystem, many companies have implemented the cloud, they've implemented data warehouse, they've implemented, there's this last mile.

And that last mile is data usability. And so most like we, like, we've had customers tell us this where, like, our business users have gone through three generations of data warehouse implementations, but it doesn't actually matter to them because at the end of the day, they can't like find or trust or use that data.

But with what's happened with the cloud and the data stack migration, I actually think that we've gotten it right. Like, the foundation of the modern data stack, as it's called, is there today. And we're now at the last mile problem. Today it's possible to solve the last mile problem because it's possible to automate things because of what's happened with the cloud and companies like Snowflake and Databricks and, and, you know, in the last, you know, five years. That was never possible before.

Uh, so for example, we could never solve this last mile problem if we needed like hundreds of people to be drawing manual lineage scripts. Like, it's never gonna happen. We're now at a point where we can solve it. Um, and so that's sort of like the second wave that's, I think, most important,

uh, where we can solve this problem. We can do this right. Um, and I kind of see this as the, you know, if we do this right over the next two or three years, we can kind of hit that final, you know, I, I think the, the Mecca that we've all as data people been trying to get to, right? Like, you know, we have, you know, data driven decisions across the organization, like everyone's using data, like we are, like we're truly there.

Um, and this time, you know, I've, I've seen many, many waves of data and technology over, over the last few decades, and it feels like this time we are finally a few years away from, as an ecosystem, solving data management challenges once and for all.

Richie Cotton: Okay. I think that's the most optimistic take I've ever heard on this because normally it's like you speak to the chief data officer and they're like, data quality will never be a solved problem, this is a disaster.

And a few years away from success, that's, that's wonderful. So, I think we'll end on that note. End on a happy story. Uh, so, uh, yeah. Thank you so much for your time, Prukalpa.

Prukalpa Sankar: Thank you so much, Richie. It was lovely speaking to you.
