
Manage Your Data Better with Shinji Kim, CEO at Select Star

Richie and Shinji explore the importance of data governance, challenges in data usage, active metadata and data lineage, improving collaboration between data and business teams, data governance trends to look forward to, and much more.
Aug 1, 2024

Guest
Shinji Kim

Shinji Kim is the Founder & CEO of Select Star, an automated data discovery platform that helps you understand your data. Previously, she was the CEO of Concord Systems (concord.io), a NYC-based data infrastructure startup acquired by Akamai Technologies in 2016, where she led the development of Akamai's new IoT data platform for real-time messaging, log processing, and edge computing. Prior to Concord, Shinji was the first Product Manager hired at Yieldmo, where she led the Ad Format Lab, A/B testing, and yield optimization. Before Yieldmo, she analyzed data and built enterprise applications at Deloitte Consulting, Facebook, Sun Microsystems, and Barclays Capital. Shinji studied Software Engineering at the University of Waterloo and General Management at Stanford GSB. She advises early-stage startups on product strategy, customer development, and company building.


Host
Richie Cotton

Richie helps individuals and organizations get better at using data and AI. He's been a data scientist since before it was called data science, and has written two books and created many DataCamp courses on the subject. He is a host of the DataFramed podcast, and runs DataCamp's webinar program.

Key Quotes

Active metadata is one of the critical parts if you want to automate more of your processes. If you want to do more streamlined data governance and data documentation, providing this context, I would say, is really a core part of making that happen.

I would say the key to managing data better is in the metadata. Looking into metadata, whether you're doing it manually yourself or using a tool, is all completely fine. But I feel like as you start diving into more metadata, you'll realize that there are a lot of insights you can get around managing the data.

Key Takeaways

1

Utilize tools like GenAI and Select Star to automate the documentation process, reducing manual effort and ensuring up-to-date, accurate information for all data users.

2

Use data lineage to trace data origins and transformations, which helps in understanding dependencies and assessing the impact of changes on downstream processes and dashboards.

3

Focus on capturing and analyzing metadata to gain insights into data usage, popularity, and relationships, which can help in optimizing data models and improving data governance.

Transcript

Richie Cotton: Hi, Shinji. Thank you for joining me on the show.

Shinji Kim: Thanks for having me, Richie. Excited to be here. 

Richie Cotton: Data governance is just such a huge topic. Maybe beginning with, can you just tell me why it's important?

Shinji Kim: Well, it's becoming a lot more important now because of GenAI. But to go back in history, most data governance has been implemented and considered mainly for organizations to meet data privacy regulations, GDPR, CCPA, or, even before those, just to protect individuals' privacy and to make sure that your company's confidential data and your customers' confidential data are kept intact and are accessed and utilized correctly. I would say that is one part of data governance. At the same time, we also see a lot of need for data governance because people want to utilize the data. If the data is not organized or managed well, then people will say, oh, we don't have governance; we don't have a good way of saying what this data is, or how the data actually gets processed and becomes available and trustable inside the system.

So I think this term, quote unquote, govern, also encompasses that. Today, it is also applied to a lot of the data management side of the issue, and that's why it's a big topic and why it's also very important. I would say today's data governance is really around providing the right way of accessing the data.

And when we say the right way, there is still the part of protecting the data and making sure that it's used for the right purposes by the right people. But the right way of using the data is also about giving you an indication of which data I can actually use and all the possibilities of what I can use this data for. And those situations for utilizing data are starting to increase.

I would say up until the whole wave of ELT, cloud data warehouses, and cloud data lakes, data was collected a lot but not necessarily utilized a lot. But I think we are now going into an era where more and more enterprises are leveraging and utilizing their data. They can finally run a 360-degree analysis of their customers, because all of the applications are starting to allow users or customers to export the application data into one place to join it together and analyze it together. So when you're trying to utilize the data, it's really important to understand which are the right ways to join these tables together, and which are the right curated or prepared data sets that I can actually use. And this is also where governance comes in.

Richie Cotton: That's a really broad range of things, from compliance, making sure that you're not breaking any laws, through to actually making sure that all your data analysts and data scientists have access to the data, so they can utilize it and do things that are going to make you money or help your customers. I like that there's such a broad range there. And sorry, I think I interrupted you.

Shinji Kim: No, I was going to say that I was moving to GenAI, because there is a lot of usage of data by the data teams and almost an explosion of leveraging data in day-to-day business decisions. Now we are trying to plug our enterprise data into GenAI so that machines can help us make decisions or help us automate operations.

And if the GenAI model is just looking at the data as a raw data set, it does not have the context of which data is the right one to use for which use cases. That's where governance should really come in: to ensure that there is good data quality and integrity in the data that is provided to or accessible by GenAI or others, but also that there is a good amount of context associated with it, so that the data can be used for the right purposes, the purposes it was designed for and collected for.

Richie Cotton: That's really interesting, that generative AI is providing another use case for data governance. It seems like all these large language models are very much the hyped, cool thing at the moment, but actually a lot of the secret to their success is that underlying data governance work that provides the extra performance.

You mentioned the idea of data quality and context of data. Can you go into how you define data quality? Like, how good is good enough?

Shinji Kim: So to me, there are a few things about data quality. I think whenever somebody says data quality, the first thing that everyone thinks about, especially if you're technical, is: is my data fresh? Was this data collected the right way, in the right format that I wanted? Does it have the right cardinality, or is it filled with a lot of blanks?

So there are these types of technical data quality that you want to get right. But that is, I would say, one half of data quality. The other half is the context of the data: what is this data meant to be used for, and how should it be used?

Because when somebody looks at a dashboard and says, this data doesn't seem right, our data quality is not great, that's what one person may think. But that might be because the analyst that put it together did not realize they were using the wrong data. The data itself may be totally fine, up to date, and looking great, but it may have had some field or filters or other things added.

So when you think about organizational data quality, it's important to consider what the correct data to use is, because that encompasses the technical data quality. And as you mentioned, it can be very overwhelming to think, oh, I just need to have all the data up to date. At the same time, if you understand where each data set is actually being utilized, how often, and by whom, you may realize that you don't have to keep track of SLAs and technical data quality for every single data set inside the company. It is very expensive to have those types of checks and quality tests everywhere. You do want to have them, however, for your most used and most critical data, the data that powers and is utilized for automation, like your pricing engine or your recommendation systems, for example.

Richie Cotton: Yeah, I can certainly see how, if there's a data set that isn't really being used much, or maybe not being used by anyone at all, you don't care whether it has 99.99 percent uptime or not. But if it's, you know, something like, how much money does your company have, what's your current revenue, then yeah, you really want to make sure that number is correct all of the time. You mentioned the idea that it's often quite hard for data scientists to know they're using the right data set. This seems incredibly important: you want to be using the right data, and that's an important part of data quality.

So why is it such a big problem for data scientists to find the right data set to use for any particular analysis?

Shinji Kim: To put it simply, we have too many data sets. In the past, when we didn't have all these data sets, a lot of data analysts and data scientists were able to just export the raw data and then prep it to create the reports, and that was fine. A lot of this data analysis was done in Excel spreadsheets, and we lived fine. But now we want live dashboards, data that is enriched from multiple data sources, that is up to date, and that provides richer and more accurate results for our decisions. In the past, the biggest complaint I always heard from data teams, and it could still be the complaint today, was that they spent too much time prepping the data, because they were accessing the raw data. Today there are all these transient tables; you have dbt and other tools to build jobs that prep the data automatically.

And that's all great, but now you have created a lot more tables and joins and views. It's a matter of understanding which one is the right model to use. That's becoming more important, but it is hard, because you now have to consider hundreds or thousands of different tables that were created for analysis, and you need to figure out which one is the right one to use.

Richie Cotton: Once you start looking through hundreds or thousands of different tables, I can see how it becomes a very difficult problem to discover what's going on. And certainly, even in a fairly small enterprise or small company, you're going to end up with a pretty large set of data.

Shinji Kim: Yeah, I would say there are two parts. One is that if you've been in the company for, I don't know, more than six months, then you might already have a pretty good idea of which are the trusted data sets and which are the main data sets that you can work from. Another issue that happens a lot is that you have a data scientist or data analyst or engineer that has built a lot of these models. One day, that person leaves the company, and it's very hard for the next data scientist, or the rest of the data team, to parse through what the main sources of truth were that this person was using. There might have been documentation, but it's always really hard to dig through all the work they've done, because a lot of it is embedded inside their code and SQL queries.

Overall, in the phase we are in, the industry is starting to mature as we adopt more software engineering practices. In the past, and still in many companies, a lot of data teams work in their local environments. They run different SQL queries and then embed them in their specific dashboards. So the understanding of which data sets to use, how to join them, and what to filter on is embedded as tribal knowledge in organizations today. And this is why it's hard to find the right data to use.

Richie Cotton: I can certainly see how people leaving the company, or notes about how the data set was processed being stored in a code comment somewhere, is not really ideal for understanding what was going on. So, you mentioned the idea of documentation. What does good documentation look like, and who would be responsible for creating it?

Shinji Kim: I think documentation is one of those assets of a company that helps you operate more efficiently and effectively. As organizations grow, you cannot just have an onboarding or transition meeting every single time. You want documentation that people can read and contribute to, so that knowledge is shared. In terms of data, a lot of it trickles down to: what are considered your core tables, like in your data mart? How are these data sets related to your current business processes or business operations? Most of the data is a reflection of the tools you are using or how the business operates internally. It also carries a lot of, I guess, indicators. If it's about product analytics, then a lot of the data is about user behavior and activities, and so on. If it's more about financial analytics, it's about the sales data and operations data that comes from different ERP systems, and so on.

So being able to map those out in a high-level picture is, I would say, one of the more important pieces of documentation, though it could also be very obvious to people. The part that gets a little trickier is mapping that internal business process or analytics model to the data sets.

So that's one part. The other part of documentation is more like your data dictionary: your important tables. Now you know that these are the core tables, or denormalized tables, that everyone should use if they were to do an analysis of a user or a customer or an account. It is going to be important to actually document the fields: all the attributes, all the different columns that make up that data set. What is each column expressed in? Maybe include some sample data. What's the grain of the table? What is the unique key you can use?

Things like that, I think, are important as part of the data dictionary, or the table and column representation. Last but not least, what a lot of companies also try to have is a business glossary. This helps map the terms from the business processes to the data sets, but also to share the notation: something marked as field X in the data may be called Y or Z or something else in the business. A lot of that, I think, comes down to the acronyms the business uses. It gives you an understanding of the linkage: whenever we use this acronym, there is a definition, and you can link that definition to how it is represented in the data as metrics. That linkage is also very important. So those are the three main parts I would consider important in data documentation.
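For illustration, the three layers Shinji lists, the high-level mapping, the data dictionary, and the business glossary, might be captured in structures like the following sketch. Every table, key, and definition here is invented, not taken from the episode.

```python
# A data dictionary entry: description, grain, and per-column docs.
data_dictionary = {
    "reporting.customer_orders": {
        "description": "One row per order per day.",
        "grain": ["order_id", "order_date"],  # the unique key to join on
        "columns": {
            "order_id": "Surrogate key from the orders service.",
            "amount_usd": "Order total in USD, including tax.",
        },
    },
}

# A business glossary entry: the acronym, its definition, how the
# metric is calculated, and which physical fields it maps to.
business_glossary = {
    "ARR": {
        "definition": "Annual Recurring Revenue.",
        "calculation": "12 * sum of active monthly subscription amounts",
        "maps_to": ["reporting.customer_orders.amount_usd"],
    },
}

# The glossary ties a business term back to the fields behind it.
print(business_glossary["ARR"]["maps_to"])
```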

Richie Cotton: I love that idea of a business glossary, which just explains all the nonsense that your colleagues talk about that you don't quite understand. Certainly, I have to say, I've been in a lot of different teams: I've done data work, I've worked with salespeople, and now with marketing people, and they all have their own ecosystems of terms, so having those things written down sounds incredibly useful.

In general, I think people don't like writing documentation. So who ends up being responsible for this? And are there any shortcuts to make it easier?

Shinji Kim: Yeah, the people responsible usually end up being, quote unquote, the data team. And it makes sense, because they are the ones that created the data models. In a lot of companies they are also called data stewards or data owners. You created this model to transform the data from the raw sources to your transient tables or your reporting tables, so you have your own idea of how that data should be prepped, aggregated, or transformed, and that needs to be embedded into your documentation.

So if you think about it that way, what about the source data? A lot of the time, I believe the source data should be documented at the source. You can pull the documentation from the source application: if you have Salesforce, for example, a lot of its APIs have documentation of the fields and what data is coming through. But the product manager or the developer that created the initial data model can also document the data. So it's actually both the producer and, in a way, the consumer, because consumers sometimes have a lot of say in how the data model should be structured and in the intention behind collecting or transforming that data. As for ways to make this work, or make it easier for the team, there are multiple options, but here are some of the things we've seen work well in companies. First and foremost, having ownership of these data models or data tables defined is, I would say, quite important.

And you can assign this by looking at who has been querying the table the most, probably because they requested that the table be created, so they can document it. At Select Star, we surface this by showing whether you are one of the top users of that table. The other way to help documentation along, beyond asking people to own tables, is to try to automate it. We've been doing this in two ways. One is by looking at the data relationships in the pipeline: if there is a raw field that came from a source and has already been documented, and there are other transient tables feeding the reporting table that actually use the same data,

we can trace that field through lineage and propagate the documentation. There are two benefits to this. One is that you don't have to worry about documenting it in the first place; it will use the same documentation that the original writer wrote. The second is that, using lineage, you can keep the documentation up to date, because if I later make a change, the new documentation can be propagated to the other fields it's linked to. So utilizing these models to associate similar or related documentation, and actually propagating it throughout the data model, increases the documentation rate far beyond what it was before.
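A minimal sketch of this propagation idea: given column-level lineage as a graph, copy each documented field's description downstream until a field with its own documentation is reached. The lineage edges and the one documented field are invented, and this is not Select Star's actual implementation.

```python
from collections import deque

# Hypothetical column-level lineage: source column -> downstream columns.
lineage = {
    "raw.orders.amount": ["staging.orders.amount"],
    "staging.orders.amount": ["reporting.revenue.amount"],
    "reporting.revenue.amount": [],
}

# Only the raw field was documented by its original owner.
docs = {"raw.orders.amount": "Order total in USD, including tax."}

def propagate(docs, lineage):
    """Copy each documented field's description to every downstream
    field it flows into, unless that field already has its own doc."""
    result = dict(docs)
    queue = deque(result)
    while queue:
        col = queue.popleft()
        for child in lineage.get(col, []):
            if child not in result:
                result[child] = result[col]  # inherit the upstream text
                queue.append(child)
    return result

print(propagate(docs, lineage))
# All three columns now carry the original writer's description.
```

The same traversal also supports the second benefit Shinji mentions: re-running it after an upstream description changes refreshes every linked downstream field.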

We have seen, just with lineage, about a 300 percent increase in field documentation fill rate across our customers. Now, that is only if you have documentation somewhere. If you have no documentation at all, something we recommend our customers use is GenAI. GenAI, given the schema information, can put out some documentation for you.

For us at Select Star, we also feed the GenAI with other types of active metadata, its source columns and its downstream dashboards, so that the full context of the object is fed into GenAI to create richer, more accurate documentation. A good example: we had a customer documenting their Looker Explore fields, which are like Looker views, and none of their view fields were defined. But because we fed in the metadata, and also the downstream dashboard metadata, we were able to give them unique documentation for 30 different fields named 'count'.

Richie Cotton: Weren't these all supposed to be the same field?

Shinji Kim: No, they were all different, because the actual view or raw field each one came from was different. In Looker, there are a lot of different ways to create your own measure; it could be a sum of x or a count of x. But for that customer, all those fields were just named 'count'.

So providing the context of what each field is actually counting was very helpful. GenAI is definitely turning out to be a really handy tool. It also gives encouragement to the users, owners, and stewards, who no longer feel like, oh, I have to start from scratch, where do I even start? They can just review what GenAI has generated for them and edit as needed. And I think that's a huge uplift for a lot of data teams.

Richie Cotton: I mean, I can't imagine having to scroll through 30 different fields, trying to work out which one I wanted. That seems like a horrible situation. So I like that the documentation side of things is automatically taken care of by AI, so you can be more productive. That's incredibly helpful. You mentioned the idea of data lineage, and I want to come to that in a moment, but before we get there, I want to go back to something you said at the start: that having context around data is incredibly important. It seems like metadata is going to be an important part of that context.

So, can you talk me through what sort of metadata you need to capture about your data?

Shinji Kim: Well, metadata is defined as data that describes your data. So for anything we interact with as data, whether it's a file or a record or anything else, there is always some type of metadata associated with that object. Inside a database, there is a set of system metadata tables called the information schema, which keeps a record of all your databases, schemas, and tables: when they were created, when they were last updated, who created them, how big each table is, how many rows there are. All of these are, I would say, part of the metadata. Even your video files have metadata: the file name is x, the video is this long, and so on.

And when I think about data in general, a data point can be just one fact, and even a data set is one self-contained thing. But if you start to analyze metadata, you can figure out what the correlations are, or you can lay out, say, how many tables were created. And if you start looking at the activity logs, which we consider part of the metadata, but a different type, because they describe what actions have happened on the data set, you can gather a lot of information to build context around the usage of that data set.
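For example, most warehouses let you query this metadata directly. The sketch below uses the common INFORMATION_SCHEMA.TABLES convention; the size and freshness columns shown follow Snowflake's naming, so treat the query as a template and adjust for your vendor. The connection is a placeholder, not a real object.

```python
# Standard information-schema metadata plus the size/freshness columns
# some warehouses (e.g. Snowflake) add; column names vary by vendor.
METADATA_QUERY = """
SELECT table_schema,
       table_name,
       row_count,      -- how big the table is
       created,        -- when it was created
       last_altered    -- when it was last updated
FROM   information_schema.tables
WHERE  table_type = 'BASE TABLE'
ORDER  BY last_altered DESC
"""

# Any DB-API connection could run it; `conn` is a placeholder here.
# cursor = conn.cursor()
# cursor.execute(METADATA_QUERY)
# for schema, table, rows, created, altered in cursor.fetchall():
#     print(schema, table, rows, created, altered)
print(METADATA_QUERY)
```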

So I would say the combination of those, in a buzzword, is active metadata. Metadata that describes your data can stand on its own. But as your data changes, which your activity logs will tell you, you can make your metadata, quote unquote, active, where the metadata describing your data sets stays up to date and carries more associated information.

For any data set, the data set itself can be useful, but if you can join it with other data sets because they're related, or augment it with more context, that's a lot more value. So I would say active metadata is one of the critical parts if you want to automate more of your processes. If you want to do more streamlined data governance and data documentation, providing this context is really a core part of making that happen, and that is basically what we do at Select Star: mapping metadata and making it very active in our customers' data ecosystems.

Richie Cotton: I love that idea that just looking at metadata properties, like how old the table is, will tell you whether it's going to be useful to you or not. Or, maybe this was created by someone in marketing, so that field that was just labeled 'count' is probably something to do with a marketing metric.

I also like this idea of active metadata, where you're making use of the metadata to do further analyses and get some value out of it. How does this use of metadata then lead into the idea of data lineage?

Shinji Kim: There are, I guess, three parts of active metadata that we consider quite important. First is data lineage. Data lineage is about tracing where the data originated and how it has traveled through different pipelines to be consumed by different users or applications. This lets you see the dependencies between data sets, wherever data pipelines and other data assets query or move data from one place to another. The second part is what we call popularity, which is data usage. You may want to monitor which are the most important data sets, or find out which are the important data sets in one part of the company.

And if we have thousands or tens of thousands of data sets, do we really need all this data? Usage can really answer that. We've also seen a lot of use cases that combine usage and lineage: when you want to do FinOps analysis, or a data migration, or, let's say, implement data contracts, both usage and lineage are really helpful.
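A toy version of computing popularity from activity logs: count how often each table appears in the query history. The log excerpt is invented, and a real tool would parse the SQL properly rather than using a regex.

```python
import re
from collections import Counter

# Hypothetical excerpt from a warehouse query log / activity history.
query_log = [
    "SELECT * FROM reporting.revenue WHERE day >= '2024-07-01'",
    "SELECT customer_id, amount FROM reporting.revenue",
    "SELECT * FROM staging.orders JOIN reporting.revenue USING (order_id)",
]

# Crude table extraction; good enough to sketch the idea.
TABLE_RE = re.compile(r"(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)

popularity = Counter(
    table for q in query_log for table in TABLE_RE.findall(q)
)
print(popularity.most_common())
# [('reporting.revenue', 3), ('staging.orders', 1)]
```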

The last part of active metadata, which I would say is really important, is understanding the entity relationships. In the past, in the relational database world, coming from transactional databases, we worked with primary keys and foreign keys, and we could see what the data models looked like by knowing where to join. In a lot of data lakes and data warehouses, there are no such restrictions or hints. So understanding the join keys, and which related data sets you can use with the data set you've just found, is quite important; you can think of it as the entity relationship diagram. We infer this from the SELECT queries that run within the data warehouse, and it's really important for understanding all the other possible ways to look at the data. So those are the three main things that are quite important.

Going deeper into data lineage, you can think of it as: I have a dashboard, where did this data come from? Or, I have this table; if I change anything in it, will it break anything downstream? Those are the two main use cases we see the most for data lineage. At the same time, we're seeing more interesting use cases where you combine the lineage of multiple tables to understand which shared data sets or fields are actually being queried and utilized, versus not.

That gives you something to work from if you're trying to prioritize during a data migration, or if you're trying to remodel your data because you feel like, we have this giant denormalized table, how many of those columns are actually being utilized?

We've seen customers splitting or remodeling their tables this way, and when you combine data lineage with usage metrics like popularity, it becomes a lot more handy, both for planning the work and for communicating with other teams when you need to rely on them to change their models.

Richie Cotton: Some interesting problems there that I hadn't really thought about before. The first one: in the traditional data warehouse setup, it's obvious how all the tables can be joined together, but in a data lakehouse, or a data lake, it's maybe less obvious. You don't know whether you can do an analysis involving multiple tables. So yeah, that's a tough problem, and I can certainly see how having lineage that describes how things fit together will make people more productive. The second one was also very interesting: if you've got a very complex data pipeline and you change something, you're probably going to break some dashboard somewhere, and then everyone's going to hate you, I suppose.

So you need to make sure you're not going to break stuff. Can you talk me through how workflows change for people developing data pipelines or data products once you have data lineage?

Shinji Kim: One part is that you can look at the data lineage to understand, if you want to make a change to a column or table, which other tables may need to migrate from the old column to a new one, or should be made aware that a new data set is available, things like that. So one part is planning the change and then communicating it to the downstream users. The other way this gets used in the workflow, for Select Star, is through our GitHub Actions integration. It can be wired into your CI/CD with our lineage API, so that every time a PR changes a column or table definition, the DDL or DML that creates or refreshes the table, we can post an automated PR comment listing all the downstream impact the PR could have. And this is not just, here are the five tables and two dashboards affected; that is one part of it.

The other part is, for each table: there are two monthly active users of this table, and there are five dashboard consumers downstream of it. Having that understanding, automatically, before the code gets merged is a game-changing workflow for customers today.
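A rough sketch of what such an automated PR comment could look like. Here `get_downstream` is a hypothetical stand-in for a lineage API call; Select Star exposes a lineage API, but this signature and all the numbers are invented for illustration.

```python
def get_downstream(column: str) -> dict:
    """Placeholder for a lineage API lookup; returns invented data."""
    return {
        "tables": ["reporting.revenue"],
        "dashboards": ["Executive Revenue", "Finance Weekly"],
        "monthly_active_users": 2,
    }

def pr_comment(changed_columns: list[str]) -> str:
    """Build the text of an automated PR comment summarizing impact."""
    lines = ["Downstream impact of this change:"]
    for col in changed_columns:
        impact = get_downstream(col)
        lines.append(
            f"- {col}: {len(impact['tables'])} table(s), "
            f"{len(impact['dashboards'])} dashboard(s), "
            f"{impact['monthly_active_users']} monthly active user(s)"
        )
    return "\n".join(lines)

print(pr_comment(["staging.orders.amount"]))
```

In a CI job, the returned string would be posted as a comment on the pull request, so reviewers see the blast radius before merging.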

A good example of this is one of our customers, Xometry. They mentioned that integrating column-level lineage into their CI workflow has let them sleep through the night, and allowed their engineering teams to focus on proactive work rather than the reactive work of fixing data pipelines when their production data models break.

When I say production, I'm talking about the production data before it comes into the data warehouse. So from a cross-team collaboration perspective, between the data teams and the product engineering teams, this can also be super helpful on the workflow side.

Richie Cotton: I can certainly see how that could be a real game changer: knowing what you're going to break before you break it, rather than receiving a phone call at 9 p.m. on a Friday night saying, hi, we don't know how much revenue our company has because the dashboard broke when the data pipelines went down.

Shinji Kim: It can be even more critical than that. For this company, Xometry, it was very critical, because they were maintaining a data pipeline that feeds into their pricing engine, 24/7. If that pricing engine does not operate correctly, they can lose a lot of revenue.

Richie Cotton: One thing that seems to be a recurring theme in this episode is that what the data team does has an impact on what the business team does, and maybe the business teams want some say over how the data is defined, and things like that. Can you talk me through how data teams and business teams can work together efficiently?

Shinji Kim: The more closely we work with data teams, the more I see that modern data teams, and where data teams should go overall, act as a partner to the business teams. I think this is starting to become more true, but in contrast, a lot of traditional data teams in the past, and many still today, act as if they are a service organization or an IT team for the business teams. There are a few ways we've seen this change. First and foremost, the data teams should be aware of the business goals and the context of the business, so that they can ask questions back: how important is this analysis for what the business is trying to achieve? They can then suggest different ways of looking at the data, because they are the experts in how the data is modeled, and can apply their business knowledge to perhaps provide an even better analysis than what the business asked for. In a way, this requires some type of embedding of data analysts or business analysts alongside the business teams, and a lot of companies are moving in that direction, which is really great to see.

On the other hand, it's also important for the business teams to be aware of the data models underneath the dashboards, because that tells them how the numbers were calculated. It is always the frequently asked question when they get a number: oh, so how did we get this number? So I think both sides are making strides to make that happen, and data discovery, data governance, and data documentation are all very helpful for it.

Richie Cotton: It's interesting that you say business people need to be able to dive in and see how things were calculated. It seems like a really tricky design problem: designing things so that both the technical data people and the maybe less technical business people are happy. Do you have any tips or tricks for designing so that less technical people can understand how something was calculated?

Shinji Kim: I would say that a lot of the time, the business teams access data through their BI tools, and within the BI tool there may be ways to share the fields that were chosen and the formulas that were created for that dashboard. So that's definitely one place they can see it.

The second part is the documentation, right? Like we talked about with the business glossary: business users are very familiar with all those business terms. If the business glossary includes the metrics and how those metrics are calculated, then it becomes a shared reference everyone can point to. Instead of thinking, well, I calculate my revenue this way, did you use something else?, you can always point to the documentation. So I think that's another part. But at the end of the day, for business teams to be more self-service and to work more with the data models, they do have to have that interest and curiosity around data, and there has to be a way to foster it.

This is maybe just human nature overall, but I believe one of the most important things for making that happen is for business users to be able to use data for some type of analysis, or some type of task, that they weren't able to do themselves before.

This comes down to training and enablement and everything else, but giving them the tools, the space, and the support to play around with the data and explore what the organization has is, I think, super critical to getting more people up to speed.

Because I believe a lot of people do have this interest. A lot of business team members are very good at Excel, or they want to utilize more data, but they just don't know how. Giving them examples, okay, here is a dashboard; now if you want to see it sliced this way, you can answer this question, or what other questions can we answer? There's a lot of that interaction the data team can have with the business teams, and it might be happening already. But having the business teams able to use the tools themselves is, I think, also a key part of getting them up to speed on the data models.

Richie Cotton: So I like that idea. It seems we've come back to the idea of business glossaries again, and to having the calculations made transparent so that everyone can see how they're done. That just seems like a good way to help different teams collaborate better.

Shinji Kim: Yeah, and in a way, because it's hard to document everything, part of why we exist, and part of what we are trying to provide, is that if you have anything embedded in the tools, like code, SQL queries, things like that, we'll parse it automatically and turn it into your documentation or your lineage, so that everything is laid out clearly and you have that transparency with other data teams and team members, as well as your business teams.

Richie Cotton: Okay, yeah, I do like that idea of automatic documentation, because again, it's so much better than having to write it myself. Wonderful. So, before we wrap up, what are you most excited about in the world of data governance at the moment?

Shinji Kim: Right now, I think this market is really interesting in how it's evolving with the introduction of, and all the usage around, GenAI: using GenAI and AI overall for data governance, as well as leveraging data governance for AI. I think both are very interesting. But for me, I've got to say, at Select Star we were able to get a lot of benefit and acceleration by using GenAI for our customers, and this is something I'm very excited to keep forging on.

Richie Cotton: Yeah, get AI to solve all the nasty problems so you don't have to do them yourself, and make data easier to work with all around.

Shinji Kim: Yeah, it's really converting a lot of manual processes into a more automated fashion. The other part, so we talked a lot about documentation, but another part that is very important in governance, is tagging: semantic-level tagging, like how sensitive this data is; or, if you were to organize your data, which are the gold, silver, and bronze tables; or which tables you should mark as do-not-use or to-be-deprecated. A lot of these can now act as the semantics for your search results, your documentation, your chatbot or assistant, or SQL query generation.
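As a sketch of how such tags can drive downstream behavior, here is a toy filter that hides deprecated tables and surfaces gold-tier ones first, the way a catalog search or a SQL-generating assistant might. All tag names and tables are invented.

```python
# Hypothetical semantic tags layered on top of table metadata.
tags = {
    "reporting.revenue": {"tier": "gold"},
    "staging.orders": {"tier": "silver"},
    "scratch.tmp_export_2022": {"deprecated": True},
    "raw.users": {"pii": True, "tier": "bronze"},
}

def usable_tables(tags: dict) -> list[str]:
    """Hide deprecated tables; rank gold-tier tables first."""
    ok = [t for t, meta in tags.items() if not meta.get("deprecated")]
    return sorted(ok, key=lambda t: tags[t].get("tier") != "gold")

print(usable_tables(tags))
# ['reporting.revenue', 'staging.orders', 'raw.users']
```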

The fact that we can decouple the semantic tagging and documentation, and then utilize them on top to create and automate more workflows, is why this is very exciting to me.

Richie Cotton: Absolutely, yeah. I can see how having something that says, okay, this table is about to be deprecated, is going to be really helpful for anyone who's thinking of using it. And that's going to make it much easier to find the right data.

Shinji Kim: And then there's the other part of governance, the traditional side around regulations and complying with the rules, data privacy, and so on. There's a lot of, I would say, legalese there, and it can now be processed through GenAI to build better and more refined policies for the different types of data that come into the picture.

So that's another part that I feel will be a step change from before.

Richie Cotton: And do you have any final advice for anyone who wants to get better at managing data within their organization?

Shinji Kim: I would say the key is in the metadata. Looking into metadata, whether you're doing it manually yourself or using a tool, is all completely fine. But I feel like as you start diving into more metadata, you will realize there are a lot of insights you can get around managing the data.

So, yeah, that's my main advice, and we are happy to help. Well, Richie, thanks so much for having me here. This was a really fun conversation with great questions. Thanks again.
