
AI's Impact on Databases with Shireesh Thota, CVP of Databases at Microsoft

Richie and Shireesh explore how AI agents are reshaping data stacks, why unified platforms like Fabric matter, how semantic models and ontologies reduce confusion in metrics, SQL and NoSQL choices on Azure from Postgres to Cosmos DB, guidance for builders, and much more.
Apr 13, 2026

Guest
Shireesh Thota
LinkedIn

Shireesh is the CVP of Databases at Microsoft. He leads product management, engineering, and cloud operations for Azure Databases as well as App Development for Microsoft Fabric. The products in his team's portfolio include Azure SQL Database (on-prem, hybrid, and cloud), Azure Cosmos DB, Azure PostgreSQL, and Azure MySQL.

Previously, as the Senior Vice President at SingleStore, Shireesh was responsible for the end-to-end engineering and product vision of the company. Before moving to SingleStore, Shireesh was a founding member of Cosmos DB, where he architected, designed, and directly contributed to multiple key pieces of the service.

Shireesh has 20+ years of experience with large-scale, big-data, scale-out, relational, and schema-agnostic distributed systems across SQL, Azure Cosmos DB, and PostgreSQL/Citus.


Host
Richie Cotton

Richie helps individuals and organizations get better at using data and AI. He's been a data scientist since before it was called data science, and has written two books and created many DataCamp courses on the subject. He is a host of the DataFramed podcast, and runs DataCamp's webinar program.


Key Quotes

We have basically invented this unified data platform. We call it Fabric. And so Fabric becomes now the one-stop shop for everything data in Microsoft. It basically has all the things that I talked about, in terms of going from data integration, data science, data engineering, real-time analytics, Power BI. All of those pieces are already stitched together. And you have one security, one business model. It's OneLake, which basically has the consistent data across the board. Most importantly, we've moved from unified analytics to unified data.

I truly believe that this industry needs to embrace autonomous experiences in the data layer, and we're very deeply committed. What happens with that is that you really liberate the app developers, the database developers, DBAs, and all these data professionals to go meet the needs of the apps much better and really scale better. Once we do that, they can 10x their experiences. They're still the conduit for converting their business objectives into the data layer.

Key Takeaways

1

Treat AI agents as an added reasoning layer, not a replacement for data fundamentals: keep investing in resiliency, security, lineage, and data quality because agentic analytics only works reliably on a trusted source of truth.

2

Before enabling natural-language analytics, build a semantic model that encodes shared business definitions (e.g., what “revenue” means for finance vs. sales) so agents generate consistent queries and avoid silently returning different answers across teams.

3

Keep SQL and data modeling skills in your team even as agents write queries: humans still need to validate intent, choose the right consistency/scale trade-offs (SQL vs. NoSQL), and prevent downstream performance and governance problems caused by poorly modeled data.

Links From The Show

Microsoft Fabric

Transcript

Richie Cotton: Hi, Shireesh, welcome to the show. 

Shireesh Thota: Great to be here, Richie. 

Richie Cotton: Yeah, great to have you here. I wanna kick off with you talking about the data stack. I'm curious: now we've got all these AI agents, has that changed what you need in a data stack?

Shireesh Thota: Yeah, so there are definitely some things that will change and some things that don't change in the data stack.

The fundamental resiliency, security, quality of data, et cetera, the things that we've been working on for decades, those don't change. Nor does the ability for us to really get to the source of truth and enhance the amount of data you can use to gather insights. There's a lot of tooling available now that makes it much, much easier.

But at the end of the day, these are tools that can help you with your tasks. The one thing that massively changes is the ability to reason with data, if you have the right context. Much like when we went from ChatGPT to the next evolution with reasoning, and now to agentic experiences, data is going through a similar kind of journey. We used to ask a very deterministic question: you have the right predicate, you ask that question, you get the right answer.

Now you can basically go and do reasoning on the data, and you can apply agentic experiences as the next ultimate evolution. And that's really what Microsoft is doing with what we think of as a data stack. But at the end of the day, the fundamentals do remain the same. We went from the modern data stack to a unified data platform as a paradigm. Fabric, for instance, is one of those things that basically made that happen.

But before the modern data stack, it was all fragmented. You'd have to go do your own bespoke architectures. We went from that place to the modern data stack, but you still had to stitch together a lot of services. You go from there to a unified data platform, which makes it easier by pre-integrating all those things into a SaaS-like environment.

Now we are evolving further into a unified intelligence layer on top of data, because data enables you. So while the fundamentals don't change, the data stack really is moving towards a point where it can help you reason with it and deeply get to insights and intelligence with all the latest and greatest tooling that's available.

So those are, in a sense, the changes that we will see more and more of.

Richie Cotton: Okay. Lots of great points. And I do love the idea of having more reasoning, allowing you to ask different kinds of questions about your data compared to simple deterministic questions. Maybe we'll get into that later, but let's step back a bit.

So you mentioned the modern data stack, and then it being simplified and becoming a unified data platform. Do you wanna talk me through what that shift was and what the different components are in each case?

Shireesh Thota: Yeah. One of the biggest challenges that you'd see across pretty much all of the data practitioners is that they'd have to deal with a lot of different pieces.

When you're trying to achieve anything meaningful in the data platform, in the data space, any data project, it's not the case that you just deal with an operational database on its own, or just a data warehouse on its own, and you'd be done. That's rarely the case.

You'd have to deal with your data being born. Typically, it's born in an operational database: OLTP systems, systems of record, and that kind of stuff. And then you'd have to take that data, do all kinds of manipulations, ETL pipelines, then bring it into an analytical stack. When you bring it into the analytical stack, you need to marry that with various kinds of data sets.

You may have to deal with real time. You then have to deal with an analytical piece where you have to ask: how do I store this? A data lake? What formats, open-source formats, proprietary formats? What engine do I have? You have different kinds of options. How do you serve the data back through that engine?

And you have, again, different kinds of real-time analytical layers, operational layers that can help you serve that. And then finally, you are probably doing data science work and data engineering work. Ultimately it'll all lead to some kind of BI, where you have to report all that stuff into a dashboard that is really useful for the business users.

Now, when you think about all these pieces of the stack, it's incredibly hard to get it all right, to get it simplified to a point where you really can sleep well and not have to worry about the SLAs that you promise. That was the biggest challenge. So unified data platforms are effectively trying to simplify that by having a consistent experience across the board.

One of the most important mechanisms to do that is to have a lake where all of the data is consistent. It's typically an open-source format, so you don't have to feel like you're locked into anything. It is really exactly one copy of everything, and you decouple the writers from the readers. Writers can all write in the common format into OneLake.

Readers will all be operating on that same data. All the compute engines can be working on the same data format, independent of who wrote it. So you're really going from a multiplication to an addition: instead of M-by-N kinds of failure points, you have M plus N, with M readers and N writers. But that's all possible when you have a lake that is basically an open-source format, consistent, and one copy, et cetera.

But it's not just that, it's more than that. It is about stitching everything together. So you have one security, one lineage and governance kind of model. All the engines basically have the same business model, so you can really mix and match different kinds of capacity provisioning and all that stuff. So it goes deeper, and that's the reason why you need a unified data platform that understands all these pieces and doesn't treat them as piecemeal. Typically a SaaS experience works better than just giving out all these tools and letting the developers handle them.

So that's really the evolution here, and we are seeing good success with that.

Richie Cotton: I love that idea of the flow. If I've understood this correctly, it's basically gather all your data from almost everywhere, shove it into a data lake, transform it into something that's usable for other people, and then you can serve your analytics to all the business users.

Very simple flow. And it sounds like a lot of the architectural questions, like what should be in this stack, are things we've been trying to figure out for a couple of decades now. What are the sorts of problems that people are still trying to solve by evolving this technology?

Shireesh Thota: A lot, so obviously the traditional data projects are still there, right?

The traditional data projects of trying to create a place where you can gather insights immediately, as soon as the data is born, as soon as the stream is born. The challenge is: whenever you have data that's born in an operational system, how much time does it take to really gather insights so that I can then complete the feedback loop?

Perhaps you're building some kind of small e-commerce retail system or something, just to take an example. There's a lot of flow coming in. There are a lot of eyeballs from the customers who are trying to do some shopping or whatever. You go from that point to the point where you really have to create pricing adjustments, perhaps, and that requires you to go do a lot of things, gather insights, and then push them back to the pricing. Maybe you're having to intersect with the inventory, supply chain, and all those other pieces. How do you do all that stuff really efficiently and effectively?

That's just the nature of one of those problems, and those kinds of problems are not going away with the evolution of AI. Though those problems, I would say the opportunities, not even problems, got multiplied. You can gather a lot more insights a lot more efficiently. You really can go to the point of researching the data and deeply analyzing it, which massively enhances your operational footprint. All up, your application space, et cetera, can get much richer.
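To make the feedback loop Shireesh sketches concrete, here is a toy Python example. All the data, names, and thresholds are invented for illustration; a real system would use streams and a lake rather than an in-memory list, but the shape is the same: events land in the operational layer, an aggregate insight is computed, and the result is pushed back as an action.

```python
# Toy sketch of the operational -> insight -> action feedback loop.
# Invented data and thresholds, purely illustrative.

from collections import Counter

# 1. Events as they are born in the operational (OLTP) layer.
events = [
    {"sku": "tea", "action": "view"}, {"sku": "tea", "action": "view"},
    {"sku": "tea", "action": "buy"},  {"sku": "mug", "action": "view"},
]

# 2. Analytical step: count views and purchases per SKU.
views = Counter(e["sku"] for e in events if e["action"] == "view")
buys = Counter(e["sku"] for e in events if e["action"] == "buy")

# 3. Feedback: views without purchases suggests a price cut.
adjustments = {sku: -0.05 for sku in views if buys[sku] == 0}
print(adjustments)  # {'mug': -0.05} -> pushed back to the pricing service
```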

Richie Cotton: I love the way you rebranded problems as opportunities. It's a good mindset to have, 'cause yeah, there are problems everywhere. No, we've got opportunities. I love it.

Shireesh Thota: Yeah, no, I genuinely think that is the case.

'Cause there are definitely more: you gotta do more work, you gotta embrace new technologies. But the amount of opportunities it opens up, some of them you couldn't have even thought about just a year ago.

Richie Cotton: Absolutely. And it sounds like a lot of the challenges are around integrating stuff and making sure that all the data is in the right place to fit your business needs at the moment, and maybe your business needs change.

So you then gotta shift around all the data once again. It's a never-ending task, but I guess, again, opportunities for excitement. Alright, we talked a bit about the data stack in general. I'd love to get a bit of help with navigating Azure, 'cause you've got, what, two or three hundred services or something?

Now I get confused about what's what. So maybe do you wanna talk me through what's available for data on Azure?

Shireesh Thota: Yeah. So you know, that's exactly the motivation for us. What you said is true to a good degree. Now, as a hyperscaler, as a provider of technology, we are effectively a software factory as a company, where we have to serve the needs of not just the high-end enterprises, and not just the sort of hobbyist or mid-market, but anybody who's just coming up, et cetera.

We have to serve the needs of the world; that's really our mission, to empower everyone. And so in the process, you'd have to go through different kinds of offers, different kinds of options, for every different persona, every different opportunity that may exist, et cetera. But we realized that it is complex.

So, what we have today, in a nutshell. I run databases, and in the world of operational databases we've been in the business of OLTP systems on-premises since as long ago as 1989, right? We announced the first version of SQL Server back in the day, and we just announced a new version of it in 2025.

We totally recognize that there are customers who want to run SQL Server in private data centers, in other clouds, whatnot. We offer that. But the best place to run it is in the cloud. And in the cloud we offer IaaS, where you can bring your database, run it in a VM, and it's good to go. We have PaaS offerings for databases.

We have databases that are born in the cloud, like Cosmos DB, for example, which is non-relational and elastic. Great for geo-replication, great for RPO/RTO kinds of engagements. It's born in the cloud, only in the cloud, and it's a PaaS offering. We've also embraced open-source databases, so we're very big on MySQL, and even more bullish on Postgres, where we're contributing a lot more.

So there are a lot of PaaS offerings there. SQL Server of course runs in PaaS as well, in the form of Azure SQL. Now, for the rest of the stack, we've of course evolved from shipping Power BI as a standalone entity. We had Analysis Services, which helps with BI tasks, building cubes, all the way from that to the dashboards of Power BI.

So we started with that. We have Data Factory, which can help you do the pipelines of data transformations and migrations from one piece of data to another, et cetera. But what we've noticed is that it really is not easy for developers, engineers, and customers to reason with all the variety of data issues, for all the reasons I was explaining earlier about the modern data stack and all the integration challenges.

We have basically invented this unified data platform. We call it Fabric. And so Fabric becomes now the one-stop shop for everything data in Microsoft. It basically has all the things that I talked about, in terms of going from data integration, data science, data engineering, real-time analytics, Power BI.

All of those pieces are already stitched together. And you have one security, one business model. It's OneLake, which basically has the consistent data across the board. Most importantly, we've moved from unified analytics to unified data. The delta is bringing operational databases also into Fabric, so you have the place where the data is born, the place where the insights are happening, everything going through OneLake. Everything has one security model. It's truly integrated. So Fabric becomes a one-stop shop, to answer your question. We really have a very elegant answer that arguably we didn't have a few years ago.

So now you really go to Fabric; think about it as: set the tenancy, pick your workspace, and you can interoperate with the various pieces.

Richie Cotton: Okay. Wow. In theory it sounds like things might be getting easier, but there's a lot of different things there. You talked about SaaS and PaaS, so software as a service and platform as a service.

You mentioned a lot of different databases with different properties. Some of 'em have geo-replication, some of 'em are super robust, some of them are more for small data. Does Fabric take away all the need for deciding what infrastructure goes underneath, or do you still have to make these decisions about which database you wanna run stuff on?

Like, how do I decide what should go inside Fabric?

Shireesh Thota: So I think we will get there eventually. There is definitely a need from many of the developers who really know what they want, and we don't wanna take anything away from that need. There's a lot of power when a customer knows what they really want.

What kind of database do they need? What kind of application are they building? Because there are some concrete and really hard trade-offs that you have to make when you're picking a NoSQL database versus a SQL database. You trade off ACID guarantees to a certain degree because you want scale and high availability.

These are physics rules that you have to bend in terms of giving you an aggressive RPO and RTO. Then you need to give up on some things. How do you do unique constraints on a petabyte of data when the data is sharded across multiple instances? The writes will slow down significantly.

So there are some really hard trade-offs, again tied to the speed-of-light problem at the end of the day. Our philosophy is that we wanna simplify, but not simplify to the point where it is so simple that it is not useful.

We don't wanna remove the power of that choice. What we wanna simplify is the complexity of really integrating all these pieces: trying to think about different services in terms of what they have to offer, how you provision them, how you manage them. Does each service have a different kind of security posture?

Does each service have a different kind of DR posture? How do you do SLAs differently? How do you do the business model differently? How do you provision capacity? All those kinds of challenges we wanna remove, because that's not really where your energy should go. Your energy should go into the architectural trade-offs that you want: hey, this is a workload where I really want immense scale; this is a workload where I want all the richness of SQL and full ACID guarantees, because consistency is everything for this application. That choice and that sort of trade-off, we don't wanna get in the way of. So that's really our philosophy.

Now as we go up the stack, as we build more and more agentic applications, agents and AI will certainly make those choices easier too, because you are then talking to the agent in natural language instead of going and making those decisions yourself. So we'll get there. But ultimately, as a platform, our goal is to simplify where we should be simplifying, without removing any of the power of the platform that you as a developer, as a customer, would absolutely need.
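To make the unique-constraint trade-off Shireesh describes concrete, here is a toy Python sketch. The in-memory dictionaries standing in for shards, and all the names, are invented for illustration, and this is not how any particular Azure service implements it; in a real distributed system each probe across shards is a network round trip, which is exactly why the writes slow down.

```python
# Why a global unique constraint fights horizontal sharding: every insert
# must consult all N shards before it can be acknowledged.

from typing import Any

NUM_SHARDS = 4
shards: list[dict[str, Any]] = [{} for _ in range(NUM_SHARDS)]

def shard_for(key: str) -> int:
    # Hash-partition keys across shards.
    return hash(key) % NUM_SHARDS

def insert_with_global_unique(email: str, row: Any) -> None:
    # The uniqueness check is a scatter-gather across EVERY shard, so write
    # latency grows with the shard count instead of staying constant.
    if any(email in shard for shard in shards):
        raise ValueError(f"duplicate key: {email}")
    shards[shard_for(email)][email] = row

def insert_local_only(email: str, row: Any) -> None:
    # Relaxing to per-shard uniqueness touches one shard only, which is how
    # NoSQL systems keep writes fast at petabyte scale.
    target = shards[shard_for(email)]
    if email in target:
        raise ValueError(f"duplicate key on this shard: {email}")
    target[email] = row

insert_with_global_unique("a@example.com", {"name": "Ada"})
insert_local_only("b@example.com", {"name": "Bo"})
```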

Richie Cotton: Absolutely. Yeah, I can certainly see how different users are gonna have different opinions on how much control they want. So if you're a software developer or a database admin, you're really gonna want that fine-grained control over what your database is doing.

Whereas if you're a business user, then maybe less so; you don't care so much about the specific architecture, you just want it to work. You mentioned a lot of the different types of database there, like the SQL databases versus the NoSQL databases. Of course, vector databases are the more recent sort of hot new thing, but a lot of databases seem to be incorporating a lot of these features.

Is that a trend you've seen, where rather than having distinct databases for specific purposes, you've got databases that encompass a lot of different things?

Shireesh Thota: Yeah. At the end of the day, we basically have two kinds of databases. There are the databases that really care about ACID, operational kinds of characteristics, where you really want atomicity, consistency, isolation, durability.

The ACID properties that we've learned and care about deeply. Those are generally relational databases. You have relational algebra. They really know how to model the data in a certain way, rows and columns. And they basically offer you enormous power: very deep, very rich SQL.

And so there's a class of databases where we really need that. These kinds of databases are incredibly good, and they've started scaling as well, and they're getting better and better. But ultimately these databases adopt what we call a shared-data architecture, meaning whatever amount of compute you provision, they still have one copy of the data.

They're sharing the same view of the data; they're not partitioning the data. Then there is the other family of databases, the non-relational, quote-unquote NoSQL databases, where they partition the data. So you have one piece of data with one node, or a set of nodes with replicas, another set of data with another set of compute nodes, and so on.

So you can keep sharding and keep horizontally scaling them as much as you want. You can go into petabytes, really. And when you have those kinds of databases, you basically get a lot of elasticity. You can position each of these shards in different geos, so you get a lot of geo-resiliency, geo-affinity, and lots of great benefits.

The trade-off there, again, is that you wouldn't really get the same kind of atomicity, consistency, and isolation guarantees that you'd get on the other side. Now, both these paradigms are definitely moving towards each other, and they're gonna meet at some point, because in the relational world there's a notion of scalability with disaggregated architectures. We are doing this with SQL Hyperscale, and we have a new database called HorizonDB for Postgres. They scale quite a bit; you can go from gigabytes to tens or hundreds of terabytes easily. Compute can scale, but you still have one writer and one copy. In the non-relational databases, we are slowly adding ACID characteristics, but the key difference is gonna remain.

So our thesis is that we think of our portfolio as a two-by-two matrix, where on one axis you have SQL databases and NoSQL databases, and on the other, homegrown, industry-leading databases and fantastic OSS databases. So in the relational world, the homegrown one is SQL Server.

In the NoSQL world, we have Cosmos DB. In the OSS space, we have both MySQL and Postgres; we have quite a lot of Postgres offerings there, and in fact we recently announced something called HorizonDB. And then on the OSS side for NoSQL, we also have something called DocumentDB.

So you have the needs for SQL and NoSQL; homegrown, Microsoft first party; and then OSS, which is also first party. We've fully embraced Postgres, by the way. In fact, we contribute a lot more to Postgres than many of our peers. That is our portfolio. When you look at many of the prolific applications of today's world, things such as ChatGPT, for example: they rely on Cosmos DB for all the messages, and they rely on Postgres for transactions. Many of Microsoft's large-scale, mission-critical applications, like SharePoint and Dynamics, rely on SQL Server. These trade-offs will remain.

But our portfolio is very well designed for this two-by-two matrix.

Richie Cotton: Okay, I like that splitting out: the relational SQL stuff versus non-relational NoSQL, and then I guess you've got your internal products versus the open-source stuff. And actually, on that last point, it seems like there's been a bit of a shift in mindset at Microsoft in terms of embracing open source over the last few years.

You go back a couple of decades and that was just not a thing at all. Do you wanna talk me through what's happened there? 

Shireesh Thota: No, this is a great question, and your observation is spot on. It is true that we have basically come around and embraced open-source databases fully. If you look at what we've been doing with Fabric, our OneLake formats are all open source, right?

This is Iceberg, Delta, Parquet. We're making sure that we are fully embracing those kinds of formats. And anything that we're doing in terms of developer extensions, VS Code extensions, they're all open source. The major one that I work with is Postgres. Postgres is obviously very well regarded, and it has one of the strongest communities.

I would say this is probably the number-two biggest open-source project after Linux, I would think, and the community is getting stronger. We are fully invested in Postgres, to the point where we have a group of committers whom we are nurturing and encouraging to commit code to Postgres upstream.

And their goal is not really to help Microsoft Azure Postgres. The goal is to help Postgres. Obviously everybody benefits, and we will benefit too, but we are committed to advancing the art of Postgres, and thereby we are supporting and nurturing our committers. We are very deeply invested in that space.

In fact, if you look at Postgres 16, 17, 18, the last three releases, if you combine all of that and look at the amount of changes that went in, the most changes came from Microsoft, and that's a surprise to most of the community members. We have more committers, we commit the most code to Postgres, and we'll continue to do that as well.

The reason is that we recognize that many of our customers want that flexibility, and our affinity is to helping our customers succeed, not to one specific tech. We have points of view in terms of when to use what, but we embrace it, and Postgres is a great database. So we definitely have been embracing that.

One other example that I wanna point out is DocumentDB. When you look at the NoSQL world, the amount of open-source effort there has significantly diminished. In the early days we used to have Cassandra, for instance. Mongo started as open source, but it's no longer open source.

And I think the effort in the NoSQL space has completely diminished. So we've recognized that, and we do have many customers who would say, hey, I wanna have the optionality to take this database elsewhere, run it somewhere else, even though Azure is a great place to do it.

So to do that, we have invented DocumentDB. DocumentDB is Mongo-compatible, but a fully open-source NoSQL database. We've donated it to the Linux Foundation, and we're in fact partnering with many of our cloud friends out there, hyperscalers. It's a true open-source database.

And we've just begun there. We look forward to investing a lot more deeply in advancing it. Any of the new things that we are doing in terms of VS Code extensions, a lot of the AI work that we're doing, all of that gets open sourced quite a bit. So we've embraced it, and whenever it makes sense for us to open source, we really lean towards doing that. It's become a theme.

Richie Cotton: Yeah, that's very interesting, 'cause I think in the early days of NoSQL there were a lot of different open-source projects. You mentioned Cassandra, and Mongo used to be open source, and then there's Redis and the like; there were a ton of them. So it seems like maybe there's been less excitement in the last few years in terms of open-source progress for NoSQL.

I'd love to go back to the thing you mentioned at the start, which is about how you've got fancy reasoning generative AI that lets you ask more interesting questions of your data. Do you want to add a little color to that? What are the sort of interesting questions that you can solve with these reasoning generative AIs?

Shireesh Thota: I wanna really set some context here, and context is a word that I'll use a lot in answering your question. Data is gonna be only as effective as the context it has around it. It is, of course, important to have clean data. It's a cliche to say that, but it's really important to understand that quality is a lot more important than quantity.

It's been said multiple times, but it's true. The way to think about reasoning on data is that you have to annotate the purpose of the data, the context of the data, and the questions around the data. So imagine that a model is given a set of tables with some amount of information, and it basically sees a bunch of numbers on the left side and a bunch of numbers on the right side.

Of course, you do have some schema, so there is a little bit of context just from looking at the schema, but it's often not enough. You may have comments, you may have some text, but not very prescriptive names in the schema. And applications may use entity frameworks or whatever, which can really muddle up how the schema looks, et cetera.

So you may end up with a bunch of numbers where we don't know, unless you have context, whether these are social security numbers or credit card numbers. These are just numbers, a bunch of numbers floating around. That's just a simple example. Often they do get annotated with some degree of information, but as you go deeper and deeper, this information is not sufficient.

So what ends up happening is that AI can go and look at these things, try to infer a few things, and easily hallucinate, easily come up with a query that's not what you want. In your natural-language-to-query-language interpretation, you know what you're trying to connect with the data.

If you just use an agent, it can easily hallucinate the answer. The simplest way to solve that is to make sure that you really have the right context for the data. You need to explain what an entity is, how these entities are related to each other, what the hierarchy of information is. And then you can define some measures: hey, when I ask these queries in my dashboard, this is how you really should measure them and come back with the right answer, et cetera.

And different departments may have different kinds of interpretations. Even a simple thing like asking for revenue from a data set: product may have a different interpretation than finance; finance may have a different interpretation than sales. Subtle differences. All those things need to be annotated.

We've done this in Power BI; we call it semantic modeling. Having the right data, quality data, clean data, and unifying every piece of data is important. That's step one. Step two is to really build that semantic model. And then step three is that now that you have the nouns of the data, you need to put in some verbs: you need some kind of actions and policies that can act on the data.

You can create the policies of your data, of your company, of your organization. Let's imagine that you're trying to model a fleet of cars or vehicles that you wanna manage. You could have information like: hey, a car is an entity, an operator is an entity, a trip is an entity. You could have some such definitions.

You need to annotate them, explain the boundaries, the limits of those nouns, and how they're related to each other, et cetera. That's semantic modeling. And then on top of it, you can start building policies, which say that anytime you see a truck's fuel level going below a certain level, you need to alert a nearby operator to go refuel it. Just a simple example there, right? That is a policy tied to an action, and you need a mechanism to define those things. We call that an ontology, and that's built on top of the semantic model. So that would be my third level: you go from clean, curated, unified data in a lake, effectively; you build a semantic model; then you build these policies and actions.

Once you have that, you can apply deep research agents, which can really research your data better and come back with a lot more effective answers than simply asking a direct lake or a direct database or a warehouse or whatever, because those don't have the right context.

That annotation is work that needs to happen on top of it. We make it very easy with Fabric. We call this Fabric IQ, and that's exactly the approach that we have taken: to go beyond just looking at the data, to the point where you can really reason about it and deeply do some research on the data. And the good thing is that the layer at the bottom is a unified data layer. So you can intersect data from SQL databases, NoSQL databases, some streams, maybe even SharePoint data. You can even bring in S3 buckets from other clouds.

You can shortcut from there without even moving the data; you just need a virtualized pointer back to your buckets in S3. We don't mind that. So a unified data layer is important, building semantic models is important, and then you build the ontology. These are the building blocks, the ingredients, to get to that kind of reasoning.
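To make the three layers concrete, here is a minimal Python sketch of the fleet example Shireesh walks through. All the names are hypothetical, and this only illustrates the idea, not the Fabric IQ API: entities are the nouns of the semantic model, a shared measure definition keeps every team (and every agent-generated query) computing the same thing, and a policy tied to an action is the ontology layer.

```python
# Toy sketch: semantic model (nouns + measures) plus an ontology-style
# policy (verbs). Hypothetical names throughout.

from dataclasses import dataclass

@dataclass
class Truck:           # entity: a noun in the semantic model
    truck_id: str
    fuel_level: float  # fraction of a full tank, 0.0 to 1.0
    region: str

@dataclass
class Operator:        # entity: another noun
    operator_id: str
    region: str

# Shared measure definition: one meaning of "low fuel" for everyone.
LOW_FUEL_THRESHOLD = 0.15

def is_low_fuel(truck: Truck) -> bool:
    return truck.fuel_level < LOW_FUEL_THRESHOLD

# Ontology-style policy: when the measure fires, take an action.
def refuel_policy(trucks: list[Truck], operators: list[Operator]) -> list[str]:
    alerts = []
    for truck in trucks:
        if is_low_fuel(truck):
            nearby = [op for op in operators if op.region == truck.region]
            if nearby:
                alerts.append(f"alert {nearby[0].operator_id}: refuel {truck.truck_id}")
    return alerts

print(refuel_policy(
    [Truck("t1", 0.08, "west"), Truck("t2", 0.70, "east")],
    [Operator("op9", "west")],
))  # -> ['alert op9: refuel t1']
```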

Richie Cotton: This is fascinating, and I think you maybe just inadvertently settled an argument that came onto my LinkedIn feed the other day. I've seen quite a few companies building out semantic layer software; that seems like one of the new hot things. And as you mentioned, this is about making sure that every single team, if asked "how many users do we have?", gives the same answer rather than slightly different definitions.

But a lot of people in the, I guess, data steward community seem to think that ontologies are a better thing than semantic layers, and that semantic layers are a much simpler thing. It was a whole weird argument, I think. But it sounds like you might need both. Is that right?

Shireesh Thota: I think so. And I think it's a little bit of splitting hairs between the two. One could say that an ontology is just a natural extension of a semantic model, instead of thinking of them as completely different things.

But the way you define the ontology is very business-specific. You have the right rules and policies and actions that you can apply in the ontology. And there are lots of industries that have embraced ontologies for a long time. If you look at supply chain, for instance, there's deep vocabulary and grammar to how they think about lots of different pieces that I'm not an expert at.

Every industry has those kinds of ontologies. You need to apply those things on top of the semantic information about the data; one expands on the other, one way or the other. It's like splitting hairs between these two things. The idea, though, is simple: it's about creating the right context on the data.

Richie Cotton: Okay, I like that. Making sure that everyone understands what the data really means, and you've not got different definitions of things.

Shireesh Thota: Correct. And having something that the organization is aligned with.

Richie Cotton: Absolutely. That sounds incredibly useful. And this brings us back to the original question.

So once you've got all the data in place, and you've got the context for your agents, what kind of cool stuff can you do with them?

Shireesh Thota: We've announced something called Database Hub just recently. I'll pick that as an example. And I'll walk you through what we are doing here.

Obviously you can do a lot more once you have these layers, but let me pick Database Hub and database agents as an example. One of the things that we see with a lot of our customers is that they generally don't have one environment, one database.

It's not like that, right? It's typically messy, it's heterogeneous. You're dealing with on-premises databases, PaaS, various kinds of databases, different kinds of environments, and multiple types of databases. You wanna have one view that helps you monitor and govern all those pieces.

So what we've done is built a hub which gives you all that view. On top of that, we've added agents to help you monitor all the databases' state, vigorously, thoroughly, and continuously, so that any performance issues you may have can be highlighted and corrected, working with the DBAs.

We're not replacing, but enhancing the experience, really helping the DBAs to go focus on the higher things. That's all possible because of the richer telemetry that we gather. It's like a meta-application of what I'm talking about.

But it's really an important meta-application, because we use the same kind of stack to give very rich information to our customers, DBAs, and database developers to deal with security issues, compliance issues, and how each database is interacting: how an application is interacting with databases, because ultimately databases are here to solve the needs of applications. So that's an example where we have internally benefited from the same stack that we offer externally.

There are a lot of customers. One of the coolest examples that we've talked about in public is airline operations. There's a lot going on in running an airline operation. How do you really simplify all of that? Simplification is hard, but you want to make sure that everybody agrees on the same challenges, everybody's working on the same things, and you can go deeper and deeper on, hey, what is it that's causing the delay?

And how can you trace some of these backlogs, et cetera? All those pieces can be created into an ontology. We are effectively helping them create a digital twin of their world first. It's a graph, and then on top of it you marry it with the data, and then you can really reason about it.

There's one thing that I do wanna add in the context of these examples. We believe, obviously, that most of the work happens in your chat messages, in your emails, in your documents, et cetera. Colleagues at organizations interact with each other, not through a database directly.

They interact through a Teams channel, Slack messages, or whatever, and a lot of the information is actually captured there, and most of the operations are really happening there, right? So at Microsoft we have the whole stack. We have Work IQ, just like the Fabric IQ that I just explained.

Work IQ is basically the same kind of idea. It's a contextual intelligence layer on top of your Office and office-related engagement: all the productivity information that you may have in chats, email, SharePoint, all of that stuff.

There is something called the Microsoft Graph, which basically understands all these pieces. It has very rich information, and we marry that with the Fabric business data that we have. In fact, we also have something called Foundry IQ, which helps you bring in institutional knowledge.

All these pieces put together are the Microsoft IQ stack, and that is unique, because you go all the way from having chats in your Teams to understanding your data, going back and forth. Different people write and read, and you of course need a data layer, but it's married with the interactions that are happening in the work layer.

So we have many customers who are taking advantage of this full stack. 

Richie Cotton: That sounds very cool, having all these things integrated with your email, with your Slack, with your Teams, or whatever it is that you use to communicate, because you don't have to go away from your day-to-day interaction with people to think about data, at least in a lot of cases.

Okay, alright. Building on this: we talked about how AI agents are dramatically changing data work. Is SQL still worth learning at this point, when agents, or generative AI in general, can write SQL for you?

Shireesh Thota: My personal recommendation is: absolutely. And this is one of those things where it's less about whether you could write a CTE expression yourself.

It's more about trying to really understand what your data model is, trying to understand the fundamentals of how you gather insights: what the efficient structure is, and what happens when you don't have the right efficient structures in place. Data modeling is incredibly important.

SQL is an extension of that, right? If you don't have that ability, then you end up just accepting whatever the agent may have recommended, and down the line there will be tons of significant performance issues, et cetera. What AI can help you with is the ability to do that efficiently, and it scales you better.

You don't have to deal with the grunt work of figuring out the exact syntax of how to write it, but you need to understand SQL. You really need to figure out: is this generating exactly what I want? Is it converting the intent that I have? Is it converting the business objectives that I have into the right data model?

That conduit is not going away. Having these tools liberates the DBAs and the database developers to focus on those things, and I think that's significant value. So my recommendation is: please continue to learn SQL. SQL has been there for a long time, and it'll be there for a lot more time.

'Cause it's truly the language of data, and in many ways it's been proven over and over again that there's no better way to do it than SQL. There will be a point in time where, just like how compilers abstract away some of the detail, you're still coding, but you're coding at a higher level.

You're not coding in machine assembly language, right? You're coding at a higher level, in a managed language. So there will be abstractions that can help you get there, but the understanding remains: how do I optimize, how do I convert my business model into the data model, how do I think about all the SQL data entity relationships, et cetera?

SQL is very important in that context. Without that, I think we'll have a lot of challenges.
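As a small illustration of the skill Shireesh is describing, here is a self-contained sketch using Python's built-in sqlite3 module with invented table and column names. An agent could easily generate a CTE like this; the human job that remains is checking that the query matches the intent, for example that "revenue" here really should exclude refunds.

```python
# A CTE the way an agent might write it, plus the human check: does the
# WHERE clause encode the business definition of "revenue"?

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, region TEXT, total REAL, refunded INTEGER);
    INSERT INTO orders VALUES
        (1, 'west', 120.0, 0), (2, 'west', 80.0, 1), (3, 'east', 200.0, 0);
""")

query = """
    WITH region_revenue AS (
        SELECT region, SUM(total) AS revenue
        FROM orders
        WHERE refunded = 0          -- business definition: exclude refunds
        GROUP BY region
    )
    SELECT region, revenue FROM region_revenue ORDER BY revenue DESC;
"""
for region, revenue in conn.execute(query):
    print(region, revenue)  # east 200.0, then west 120.0
```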

Richie Cotton: Absolutely. So SQL is still worth learning, even if you can get the machines to write it for you; certainly understanding it. And it also sounds like you said data modeling is maybe even more important now.

Do you wanna talk through the basics of data modeling that you think everyone should know?

Shireesh Thota: The usual basics are very important: the way to normalize schemas, the way to really understand the efficient way to convert your business model, your business objectives, into the right data model.

A lot depends on that, because the speed of writes, the efficiency of queries, where you index and where you don't, those things will define the performance. It'll also define the efficiency, and ultimately the consistency and governance aspects flow through that.

Because if you don't have the right permissioning models, if you put the same data in multiple different places, you may have missed some allocation of permissions, and that can lead to a major attack. It's truly the fundamental piece of it. AI can help accelerate that.

I'm not against it; of course you should use AI tools, you should a hundred percent do that. But you as a human need to be in the loop in converting the business objectives into that exercise. The whole goal of trying to convert the business objective into your data is data modeling, in my mind.
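Here is a small sketch of the normalization point, using sqlite3 again with an invented schema. In the denormalized table, the customer's email is repeated on every order, so a change must touch many rows and can drift; in the normalized pair, each fact lives in one place, and the index reflects the query you expect to run. A toy example under those assumptions, not a prescription for any particular workload.

```python
# Denormalized vs. normalized: where does a customer's email live?

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Denormalized: customer data duplicated per order; easy to drift.
    CREATE TABLE orders_flat (
        order_id INTEGER PRIMARY KEY,
        customer_name TEXT, customer_email TEXT, total REAL
    );

    -- Normalized: each fact stored exactly once.
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name TEXT, email TEXT
    );
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        total REAL
    );

    -- Index where the workload reads: 'all orders for a customer'.
    CREATE INDEX idx_orders_customer ON orders(customer_id);
""")

conn.execute("INSERT INTO customers VALUES (1, 'Ada', 'ada@example.com')")
conn.execute("INSERT INTO orders VALUES (10, 1, 99.5)")
# Updating the email now touches one row, not every historical order.
conn.execute("UPDATE customers SET email = 'ada@new.example' WHERE customer_id = 1")
print(conn.execute(
    "SELECT o.order_id, c.email FROM orders o JOIN customers c USING (customer_id)"
).fetchall())  # -> [(10, 'ada@new.example')]
```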

Richie Cotton: So the idea is, you're translating: I've got a business problem, and this needs to be represented as data somehow. And this is an important thing for humans to think about. Okay, alright. And I guess more generally, how do you feel the role of data professionals is changing with the introduction of powerful AI?

Shireesh Thota: Yeah. Firstly, with the unified data platforms that we've been building, with all the databases coming into Fabric, with all the SaaS experiences that we are building, and also just embracing the greatest and the latest, including open source, we are trying to move to the point where we actually empower application developers to succeed without having to look under the hood. They want to be higher up the stack. They wanna spend all their time thinking about: how do I build my next-generation application? How do I really bring value to my business? Not necessarily dealing with the nitty-gritty of every data layer.

So that problem needs to be solved, and we are very committed to doing that. Whenever we talked about autonomous experiences for databases or data platforms, et cetera, I think maybe we peaked a little too early. The timing was probably a little off, because anytime we had those kinds of ambitions, what ended up happening was that the results weren't as spectacular as the claims.

And I think we as an industry need to acknowledge and accept that. I do think that it's not a reason for us to not embrace it again, because we have one of the most powerful tools the tech industry has ever had, humanity has ever had, and we have to go embrace that. I truly believe that this industry needs to embrace autonomous experiences in the data layer, and we're very deeply committed.

What happens with that is that you really liberate the app developers, the database developers, DBAs, and all these data professionals to go meet the needs of the apps much better and really scale better. Once we do that, they can 10x their experiences. They're still the conduit for converting their business objectives into the data layer.

But these agentic experiences are the new vocabulary, the new medium with which they interact with the applications. Everybody wants to have the SaaS experience. The time to value needs to be significantly reduced. They're really thinking higher up the stack. It's like going from somebody who owned a car in the early 20th century to what they do now: somebody who owned a car in the early 20th century had to be an expert at how the car engine worked.

Not anymore. I really don't even know most of the details. It took a hundred years to get there; with AI, I think it's gonna be much, much shorter. But the data professional world and the application world are gonna come together much more closely. I think the objectives are not changing.

The mechanics are getting simplified significantly. 

Richie Cotton: I love that analogy with the car. I'm just thinking, I can just about change the oil and the windscreen washer fluid; that's kind of it. And the rest you don't need to touch most of the time, unless you're a professional mechanic.

And maybe it should be that way with databases. There's only a limited number of people who get very excited about playing around with database settings.

Shireesh Thota: And it is hard. I don't wanna pretend that it is simple. Databases are complex, but cars are equally complex, and we got there.

Richie Cotton: Absolutely. Okay. And I guess before we wrap up, I realize Cosmos DB is your baby, and I've not really given you a chance to tell us about it. Talk me through: what's Cosmos DB?

Shireesh Thota: Oh, thank you, Richie. Yeah, I've worked on Cosmos DB ever since we formed the product, and it's been a long time.

So firstly, Cosmos is a NoSQL database; it's in the non-relational world. It has its trade-offs in terms of making sure that you get enormous scale and great high availability. We are geo-distributed, so you can go from one region to any number of Azure regions, all of the Azure regions, and you can have a copy in all these regions. And you can do that in either a sync mode or an async mode. Meaning: if you want global strong consistency, as soon as you write the data in one place, do you want that data to be replicated everywhere before you even acknowledge to the client? Or you can replicate it to the other regions in the background. There are lots of different trade-offs to both of these, and we support both of them.

It is a scale system, and so it trades off a few of the ACID and query capabilities, but it is a phenomenal system for going from gigabytes to petabytes. I mentioned earlier that OpenAI's ChatGPT runs on Cosmos DB. In fact, many of the mission-critical applications inside Microsoft and Azure run on Cosmos DB, including Teams, which is one of the most prolific professional chat applications.

All the messaging goes through Cosmos DB. It's really great as a system of engagement, whenever you're trying to have these interfaces. E-commerce applications trying to manage the shopping cart, for instance: Walmart uses Cosmos DB. So all these kinds of experiences are really what Cosmos is designed for.

The fundamental value of Cosmos DB is that it basically lets you interact with the database with JSON as a primitive. Instead of inserting individual columns and having relational algebra, you are ingesting a full JSON document. So as soon as a JSON object is born, you can just take it and put it into Cosmos DB.

And by default, it indexes everything, so the data modeling exercise is very minimal. You do need to think about partitioning the data effectively, but once you've made that decision, there's a direct connection from the point where the data is born. Typically most of the internet's data is JSON data, so as soon as the data is born, it can go into the database and you can start querying right away.

There's very little pipeline, very little in the way of ingestion challenges, conversions, et cetera. And almost every popular application framework knows how to deal with JSON, 'cause JSON is that ubiquitous. That makes it very easy for it to be applicable anywhere. Headless APIs, HTTP APIs, they all work great with Node, Python, all of the new ecosystems that are there. Great with .NET, Java; of course, we've got great drivers. But a lot of our innovation is about partitioning the data. Every piece of data is independent, and it can scale across multiple partitions, sharded data. This is what systems like OpenAI do.

They have petabytes of storage, so they need shards. Each shard is independent, with a replica set that it can manage, and it's highly available. And then the other great thing about Cosmos DB is geo-DR and HA. We have two capabilities. In one, a region can be active and the other regions passive, meaning the active region is the place where all the writes go.

The write region can also serve reads, but every other region is read-only. And if the write region is down, it automatically fails over to one of the read regions, which becomes primary. So you don't have any downtime except the failover time. There is another mode, called active-active, where every region can take writes, not just reads; every region does both writes and reads. In that mode you can write locally, wherever you are. So let's say one of the writers is in US West and the other writer is in Australia, and they're both writing independently to their databases.

We are synchronizing and doing conflict resolution. So there are very complex and advanced mechanisms that make it very easy for the developers to do those kinds of things. And most importantly, if a region is down, there's no failover; the other region will automatically take the writes.

Your client can just directly talk to the other region, and writes will flow. This is important for applications that want the RTO, meaning downtime, to be almost zero. Basically, there's no downtime at all, because even if your region is down, the writes will continue to flow.

So those are a few of the modes. Great for internet-scale, large-scale applications; great for getting started, schemaless, so you don't have to worry about schema upgrades, et cetera. Again, different trade-offs, but it's one of the most mission-critical-ready databases out there.
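For readers who want to see the JSON-in, JSON-out flow, here is a minimal sketch using the azure-cosmos Python SDK (pip install azure-cosmos). The endpoint, key, and all the names are placeholders, so it won't run without a real account; the partition key choice (/userId here) is the one real modeling decision Shireesh calls out. Treat this as an illustration rather than production guidance.

```python
# Ingest a JSON document as-is and query it right away.

from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<your-account>.documents.azure.com:443/",
                      credential="<your-key>")
db = client.create_database_if_not_exists("chat")
container = db.create_container_if_not_exists(
    id="messages",
    partition_key=PartitionKey(path="/userId"),  # decides how data shards
)

# Every property of the document is indexed by default.
container.upsert_item({
    "id": "msg-001",
    "userId": "u42",
    "text": "hello",
    "sentAt": "2026-04-13T09:00:00Z",
})

# Scope the query to one partition so it touches one shard.
for item in container.query_items(
    query="SELECT * FROM c WHERE c.userId = @uid",
    parameters=[{"name": "@uid", "value": "u42"}],
    partition_key="u42",
):
    print(item["text"])
```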

Richie Cotton: Okay, so I see a lot of your earlier examples now suddenly clicking into place. So this is all about: if you are an application developer, and you don't necessarily want to worry so much about your database, you've got a relational database that just works for your application. Is that a reasonable description of Cosmos?

Shireesh Thota: Yeah, I would say relational databases and non-relational databases are very distinct, and they're great for what they do, and they're getting stronger every day. We're very committed to SQL, of course; we're first and foremost a SQL company.

We embrace the Postgres database, et cetera. They each have their own sort of thing, but it's all about what you want. Do you want functionally deep, rich queries, with deep, rich data modeling, with the right normalizations, knowing exactly how data consistency is managed, et cetera, with an enormous amount of SQL power?

If you want that, then relational is great. If you want something that's JSON in and JSON out, where I don't wanna know about the application upgrade problems but I want really great scale, and I'm okay letting go of SQL richness, then NoSQL is great. So you have different pros and cons there.

Richie Cotton: Yeah. Okay. Alright. So many choices. 

Shireesh Thota: I love all my databases. 

Richie Cotton: No, it's amazing. And at this point there are just hundreds of different databases. Whatever your use case, there's gonna be a database for you somewhere, I think. So, finally, I always want more people to learn from. Whose work are you most interested in right now?

Shireesh Thota: There are a lot of people that I'm very excited to follow and learn a lot from. Fortunately, being at Microsoft, we have lots of great leaders. I always look up to my own chain of leaders: Aaron, Scott, and Satya, of course. There's a lot of great work that they do, and I always learn a lot whenever I meet them, talk to them, or just listen to them. When I think externally: Andrej Karpathy, an incredible thinker, a systems thinker. I love the work that he's been doing recently, with auto researcher, for example.

Just phenomenal. He has this brilliant framing of Software 2.0: how he thinks about the evolution of programming, and how he thinks about applying LLMs from the ground up, a first-principles approach. I really love his thinking. I would also throw in one other example, of community work that I'm always impressed and inspired by, which is the Postgres community. It's not a person, it's an organization. But I always look at what they are doing, how they're thinking about working together. It's a miracle when you have these people across the globe working together for one cause. They're not really getting paid to commit to Postgres, right?

They all have their own jobs, of course, independently. But the fact that there is that kind of attention, that kind of care, it's a commitment to craft that's always inspiring.

Richie Cotton: Okay, yeah, lots of great choices there. Obviously Andrej Karpathy is just ridiculously prolific in terms of coming up with cool new ideas.

And of course the Postgres community is just absolutely amazing. That's why it's one of the world's most popular databases right now; it's a purely community-driven effort. Wonderful. Alright, thank you so much for your time, Shireesh.

Shireesh Thota: It's a pleasure, Richie, and I'm really thankful.

Thank you.

Related Topics

podcasts

Not Only Vector Databases: Putting Databases at the Heart of AI, with Andi Gutmans, VP and GM of Databases at Google

Richie and Andi explore databases and their relationship with AI, key features needed in databases for AI, GCP, AlloyDB, federated queries in Google Cloud, vector and graph databases, practical use cases of AI in databases and much more.

podcasts

No More NoSQL? How AI is Changing the Database with Sahir Azam, Chief Product Officer at MongoDB

Richie and Sahir explore the evolution of databases beyond NoSQL, enhancing developer productivity, integrating AI capabilities, modernizing legacy systems, and much more.

podcasts

[AI and the Modern Data Stack] Adding AI to the Data Warehouse with Sridhar Ramaswamy, CEO at Snowflake

Richie and Sridhar explore Snowflake and its uses, how generative AI is changing the attitudes of leaders towards data, the challenges of enterprise search and management, the role of semantic layers in the effective use of AI, a look into Snowflake's products including Snowpilot and Cortex, advice for organizations looking to improve their data management, and much more.

podcasts

The Data to AI Journey with Gerrit Kazmaier, VP & GM of Data Analytics at Google Cloud

Richie and Gerrit explore AI in data tools, the evolution of dashboards, the integration of AI with existing workflows, the challenges and opportunities in SQL code generation, the importance of a unified data platform, and much more.

podcasts

[AI and the Modern Data Stack] How Databricks is Transforming Data Warehousing and AI with Ari Kaplan, Head Evangelist & Robin Sutara, Field CTO at Databricks

Richie, Ari, and Robin explore Databricks, the application of generative AI in improving services operations and providing data insights, data intelligence and lakehouse technology, how AI tools are changing data democratization, the challenges of data governance and management and how Databricks can help, the changing jobs in data and AI, and much more.

podcasts

The Data Team's Agentic Future with Ketan Karkhanis, CEO at ThoughtSpot

Richie and Ketan explore AI agents for analytics, why “self‑service BI” often fails, using agents to answer questions, build dashboards and automate data modeling, how analyst and engineer roles shift toward governance and agent design, and much more.