Salim Syed is VP and Head of Engineering for the Capital One Slingshot product. He led Capital One’s data warehouse migration to AWS and is a specialist in deploying Snowflake to a large enterprise. Salim’s expertise lies in developing Big Data (Lake) and Data Warehouse strategy on the public cloud. He leads an organization of more than 100 data engineers, support engineers, DBAs and full stack developers in driving enterprise data lake, data warehouse, data management and visualization platform services.
Salim has more than 25 years of experience in the data ecosystem. His career started in data engineering where he built data pipelines and then moved into maintenance and administration of large database servers using multi-tier replication architecture in various remote locations. He then worked at CodeRye as a database architect and at 3M Health Information Systems as an enterprise data architect. Salim has been at Capital One for the past six years.
Adel is a Data Science educator, speaker, and Evangelist at DataCamp where he has released various courses and live training on data analysis, machine learning, and data engineering. He is passionate about spreading data skills and data literacy throughout organizations and the intersection of technology and society. He has an MSc in Data Science and Business Analytics. In his free time, you can find him hanging out with his cat Louis.
When it comes to cost optimization, it's never a one-and-done deal. You have to be vigilant. It's a constant battle. You need to be watching for new users, new usage, and new usage patterns. New workloads are always being introduced into the system. So don't focus on reducing cost; focus on reducing waste and inefficiencies in the system.
The more companies I speak to about data management, the more they realize that it's going to be very hard to manage it from a centralized point of view, right? That's the shift that's happening, but it's not happening as fast as I'd like to see. It's clearly evident from the strategies of different companies. The way they execute it is a little different. They may keep data across multiple platforms, but what's interesting is that you have a central catalog and central access policies across all your data. Those are the key parts, right? That way you break the silos and you have a single way to get data, even though it's owned by different lines of business.
In cloud data management, cost optimization is an ongoing process. It requires constant vigilance to new users, usage patterns, and workloads, focusing on reducing waste and inefficiencies rather than just costs.
When starting a journey to the cloud, prioritize thinking about data governance and management. Addressing these aspects early on helps avoid challenges and inefficiencies later in the process.
Effective cloud transformation and data management require strong leadership buy-in. Leaders should champion the shift towards efficient, cloud-based data management and support the necessary changes in tooling and skills.
Adel Nehme: Hello, everyone. Welcome to DataFramed. I'm Adel, Data Evangelist and Educator at DataCamp. And if you're new here, DataFramed is a weekly podcast in which we explore how individuals and organizations can succeed with data and AI. I think it's safe to say that effective data management is essential for any organization looking to succeed with data.
And most of the time, organizations look to storing their data in the cloud to begin that journey of data management. However, things can easily get out of control as you're transitioning to the cloud. Costs can balloon as you set up your infrastructure, data governance becomes more and more challenging, and a lot more.
So how do you make sure you navigate that cloud journey effectively? Enter Salim Syed. Salim is the VP and Head of Engineering for Capital One Slingshot, a tool that Capital One Software developed to enable organizations to optimize their data in the cloud. He led Capital One's data warehouse migration to AWS and is a specialist in deploying Snowflake to a large enterprise.
Salim's expertise lies in developing big data lakes and data warehouse strategies on the public cloud. He leads an organization of more than a hundred data engineers, support engineers, database administrators, and full stack developers in driving enterprise data lake, data warehouse, data management, and visualization platform services.
Throughout the episode, we explore cloud data management and the evolution of Capital One Slingshot into a majo...
If you enjoyed this episode, make sure to subscribe to the show, give it a rating, and share it on social. We'd love to hear your feedback. And now, on to today's episode.
Salim Syed, it's great to have you on the show.
Salim Syed: Yeah, happy to be here.
Adel Nehme: So you are the VP and Head of Slingshot Engineering at Capital One Software. So maybe first, to set the stage, walk us through Capital One Software and Slingshot, and your role as its head of engineering.
Salim Syed: So Capital One Software is an enterprise B2B software business of Capital One. It's dedicated to providing data management solutions to companies that are operating in the cloud. And our foray into the software business started a long time ago, with building so many different pieces of in-house software.
When we first got to the cloud, we just didn't have those capabilities. How do you manage cost? How do you manage efficiency? How do you do data governance in the cloud? All of these we had to build ourselves. And then we also realized that we're not unique in facing this challenge. There are other companies facing similar challenges while they're going to the cloud, or that are already in the cloud and now experiencing these challenges.
So our offering is to help other companies be more efficient in the cloud.
Adel Nehme: That's great. And maybe expand a bit on Slingshot as well and what Slingshot is.
Salim Syed: So Slingshot is our first B2B software product. It's a data management solution that helps companies optimize cost and accelerate their journey into Snowflake, and it allows you to remove waste and inefficiencies from the way you're using your data platform.
Adel Nehme: And I want to center today's discussion on how organizations can maximize the value of managing their data in the cloud. You know, in many ways, centralizing your data in a cloud data warehouse and building a modern data platform is really table stakes for any organization today trying to be data driven.
A lot of data leaders have this top of mind. So maybe to get a bit of background as well, walk us through why that is, and why nailing data management and data infrastructure should be top of mind for every data leader today.
Salim Syed: The way you get value in a business really starts from data, right? Understanding data, and the insights that are coming out of data, is going to drive your business. It's going to help you with your customers. It's going to help make their lives easier. The data is very crucial. And what we're seeing is that when you move to the cloud, the amount of data, the type of data, all of this increases exponentially, right?
If you have all that data coming in and it's not well managed and well governed, if you don't know where the data is and you don't have the right access and the right security around it, all of that will make it more challenging for you. Either your business cannot get insights in a timely manner, or you will create silos that prevent teams from accessing each other's data and seeing the value that comes with sharing data across your organization.
So having good data management is really crucial from the start.
Adel Nehme: I couldn't agree more, because ultimately you're creating that pipeline, right? From collection to enabling people to draw insights. And that pipeline needs to be resilient, highly efficient, and well managed. So maybe zooming out to the broader context of the industry: you work with a lot of industry partners and have worked on data management extensively at Capital One internally.
Where do you think we are today? When it comes to organizations effectively managing their data, where are we on this journey as organizations mature their data management?
Salim Syed: This is a very good question. Different companies are at different places in their maturity of data governance and adoption of the cloud. But one of the places where I see some friction is that a lot of companies are moving to the cloud, but they still have the on-premise mindset.
And let me explain what that means. In the on-premise world, you had a centralized team managing all of your data, right? They're the ones who are saying, this is the data you can have. They're doing the data governance, the data publishing, all of that.
And that really is hard to maintain in the cloud, especially because now you're going to have an explosion of data, as I explained. The businesses really want to go at the speed of their demand, and they cannot be dependent on a central team. And the central team doesn't have the domain expertise across the lines of business, so they become a bottleneck in the cloud. So that's where we see a lot of friction: around adopting this new mindset where you want to allow the businesses to move at the speed that they need. But there are two things you have to be really careful of. One is creating silos. And the second is how you enforce centralized enterprise governance policies uniformly across your organization.
And to find that balance, the way we look at things at Capital One is that you have central policy and central tooling, but then federated ownership. What that allows a centralized team to do is create the policies and make sure the tooling enforces them, or puts the right guardrails around how you publish data, how you access data, and how you govern data, but then gives the ownership to the lines of business. So they can move, they can publish on their own, and they can find data on their own without having a bottleneck on the central team. And I see that as the biggest challenge for companies to really, truly adopt the cloud mindset.
Adel Nehme: Okay, that's really great. And maybe expanding on that cloud mindset, what ways have you seen to deal with this particular challenge of having a federated approach while still centralizing data governance and data controls? Maybe walk us through this solution landscape in a bit more detail, and how data leaders should be approaching this challenge of data management.
Salim Syed: It starts with what Capital One believes is the way to solve this problem, which is centralized tooling and centralized policy. What we've built is centralized tooling for, say, our data publishing. I'll give you two examples: publishing to your data environment, and then what the consumption needs are.
First, let's talk about publishing. One way to do publishing is you create centralized policies and give them to the lines of business, and they publish data into their data environment. For a central team, it then becomes very difficult to enforce all of those policies and to guarantee that everyone is following the practices.
So what we've done is we've built central tooling that allows you to publish your data while taking care of all the governance needs for publishing data. For example: Are you registering your data? Are you checking for data quality? Are you identifying what is sensitive and what is not?
Are you making sure that the sensitive columns, like credit card or personal information, are encrypted or tokenized? All of that is a governance need for publishing a pipeline. And what we've done is make that transparent to the business, so that if they're using our central tooling, all of that comes for free.
So not only is it going to help you be more compliant, but by making sure that everything is registered and the data quality is collected, it also helps on the consumption side for our users and employees. When they're accessing data, first of all, they can find relevant data very quickly, because now we have all the data registered in the right places and we're collecting the right metadata.
Also, we give the consumers of the data enough information about the data that they can trust it. So think about: What is the result of your data quality checks? What do the sample data, the min and max values, and the data profile look like? All of that is given to you so that you can trust it.
And then in the same experience, we allow you to request access to the data, so that the request goes to the right owner of the data, you provide your business justification, and they can approve it. The point I'm trying to make is you want to make sure that all the data governance needs are as easy as possible and as transparent as possible.
That way they don't become a hindrance to the developers, the data producers, or the consumers in getting their job done, right? And you get both benefits. You get the benefit of compliance and meeting all the regulatory requirements, and it's also easy to find the data, easy to get to the data quickly, and easy to get the value out of the data.
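To make the publishing checks described above more concrete, here is a minimal Python sketch. It is an illustration only and not Capital One's tooling: the `Dataset` and `Column` classes, the `publishing_violations` function, and the sample dataset are all hypothetical names chosen for this example. It assumes the checks are the ones mentioned in the conversation (registration, data quality, and protection of sensitive columns).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Column:
    name: str
    sensitive: bool = False
    protection: Optional[str] = None  # "encrypted", "tokenized", or None


@dataclass
class Dataset:
    name: str
    owner: str  # the line of business that owns the data
    registered_in_catalog: bool
    quality_checks_passed: bool
    columns: List[Column] = field(default_factory=list)


def publishing_violations(ds: Dataset) -> List[str]:
    """Return the governance checks this dataset fails (hypothetical policy)."""
    violations = []
    if not ds.registered_in_catalog:
        violations.append("dataset is not registered in the catalog")
    if not ds.quality_checks_passed:
        violations.append("data quality checks have not passed")
    for col in ds.columns:
        if col.sensitive and col.protection not in ("encrypted", "tokenized"):
            violations.append(
                f"sensitive column '{col.name}' is not encrypted or tokenized"
            )
    return violations


# Made-up example: one sensitive column has no protection applied yet.
transactions = Dataset(
    name="card_transactions",
    owner="payments",
    registered_in_catalog=True,
    quality_checks_passed=True,
    columns=[
        Column("transaction_id"),
        Column("card_number", sensitive=True),
        Column("email", sensitive=True, protection="tokenized"),
    ],
)
print(publishing_violations(transactions))
```

In a setup like this, the central team would own the list of checks, while each line of business would only call the publishing tool, which is one way to read "central policy, federated ownership".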
Adel Nehme: That's really great. And you mentioned the data governance aspect a lot here. But one thing you mentioned earlier was the need to develop this federated mindset and to let business teams also lead with their data while having centralized policies.
What effective ways have you seen for teams to organize themselves around this model as data management evolves? Maybe walk us through the different ways successful data teams, data producers, and data consumers have organized themselves as a consequence of this shift in mindset.
Salim Syed: The people who know the data best are the businesses. They have all the domain expertise. They understand what the data means and how to use it, and you want them to own it, right? That's where the mindset comes from: they're the ones who know it. So give them the tooling that allows them to publish data quickly and to ingest the metadata about the data, so that everyone can benefit.
The other thing you need to make sure of is that data has proper access control, and that the architecture doesn't limit data sharing between organizations. If the architecture allows sharing, then you can build access controls on top of it. But as soon as you create an architecture where you have to copy data out of one place into another and jump through multiple hoops, it becomes very hard, and it just slows down data sharing.
So data sharing is going to be a very key component of any organization, and you want to make that easy, but with the right controls in place.
Adel Nehme: And we're talking about controls, we're talking about data governance, right? There's maybe a bit of friction sometimes, as you mentioned, between having centralized data governance policies and also a federated model. So who should be the stakeholders involved in setting up a data governance strategy?
Maybe walk us through in a bit more detail what a successful data governance journey looks like for organizations as they set that up.
Salim Syed: You have to have buy-in from leadership, and it requires buy-in from the businesses. The businesses also have to create their own governance and their own risk officers that allow them to manage their own risk, because every line of business has different risk around it, right? You want to make sure that they have the right ownership and the right risk reviewers within their lines of business, as well as a central policy.
And the way it really works is if you have leadership that says, this is the way we're going to do it, and this is the benefit the whole organization is going to get from it. Then you have a lot of buy-in. But without buy-in from leadership, it becomes very hard to start this from the ground up. So that, I think, is a requirement.
Adel Nehme: Yeah, I couldn't agree more here, especially on the importance of buy-in from leadership. Now, speaking of buy-in, I think in a lot of ways the other side of cloud transformation and effectively managing data in the cloud is ensuring that people have the tooling, the skills, and the ability to adopt data, right?
And this is much more of a people and tooling problem than it is only a data management problem. So maybe walk us through good examples of data adoption that you've encountered at Capital One and with Capital One Software customers.
Salim Syed: So it really comes down to, again, the leadership, as I said, but it's also a data mentality, right? You want everyone to be educated on making data-driven decisions, and it needs to show from the top down, everywhere. Even during development processes, you're making decisions based on the data, right?
So that's the culture you want to create first. But on top of that, you want to make sure that it's very easy to find, access, and trust data. The democratization of data is very important in any organization: allow not just data engineers to get access to data, but make it so easy that anyone with the right access and need can get value out of the data.
And then the third piece would be education. Train the trainers, and whenever there's an experience where there are inefficiencies in the way you're accessing data, make it a teachable moment. Teachable moments have a trickle-down effect across the organization.
So that's how I see it, yeah.
Adel Nehme: Okay, that's great. And maybe expanding here on the aspect of providing access to data: you mentioned effective ways of surfacing data. I think one common anti-pattern we see in organizations today is that there's really limited context on how this data is useful and how it could be used.
And a lot of organizations are trying to solve this with a metadata platform, right? Or a data catalog. I would love it if you could comment here on the importance of surfacing metadata and providing that data catalog for the wider organization. And what effective ways have you seen to provide that context for the wider organization?
Salim Syed: Yeah, metadata is everything, and the catalog is the key to that. Metadata allows you to know about your data. It allows you to know not just the business context or the technical context of the data; it's actually becoming bigger than how metadata was used in the past. In the past it was about the catalog.
You kept the business and technical metadata. But now there is a concept called passive metadata, which is all the cost, the security, and the resiliency information associated with the data set. How do you track that along with the metadata, along with the data? So, think about a table you have in your environment.
How often is it used? Is anyone using it? How often is it updated? How much does it cost to maintain that table? How much does it cost to access that table? Those are all passive metadata. And the bigger the view you have across all the metadata, the more of an edge it gives a company.
It's not only about providing information on what data you have in your environment, but also about what's valuable, what is used more, which two datasets are related, and how often people join the two datasets together to create insights. All of that becomes very important. So you always want to start with a catalog, but it needs to expand beyond just static metadata.
It needs to be alive. Everything that's happening to that data needs to be collected at the same time, to give you even more insight into your operational excellence, right?
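As a rough illustration of the usage and cost metadata described above, the following Python sketch keeps static catalog fields next to usage and cost fields for each table, then flags tables that nobody reads but that still cost money. The `TableMetadata` class, its field names, the `unused_but_costly` function, and every number in the sample data are hypothetical and invented for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TableMetadata:
    # Static (business and technical) metadata, as in a classic catalog.
    name: str
    owner: str
    description: str
    # Usage and cost metadata collected continuously (hypothetical fields).
    reads_last_30d: int
    updates_last_30d: int
    monthly_maintenance_cost: float  # cost to load and store the table
    monthly_access_cost: float       # cost of queries that read the table


def unused_but_costly(
    tables: List[TableMetadata], min_monthly_cost: float = 100.0
) -> List[str]:
    """Names of tables with no reads whose total monthly cost meets the threshold."""
    return [
        t.name
        for t in tables
        if t.reads_last_30d == 0
        and t.monthly_maintenance_cost + t.monthly_access_cost >= min_monthly_cost
    ]


# Made-up sample data for illustration.
catalog = [
    TableMetadata("daily_balances", "finance", "End-of-day balances", 420, 30, 300.0, 150.0),
    TableMetadata("legacy_campaigns", "marketing", "Old campaign results", 0, 30, 250.0, 0.0),
    TableMetadata("tmp_scratch", "analytics", "Scratch table", 0, 0, 5.0, 0.0),
]
print(unused_but_costly(catalog))
```

The same records could answer the other questions raised in the conversation, such as which tables are updated far more often than they are read.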
Adel Nehme: Yeah, this often makes me think about the importance of also providing the business context on how data is being used: which teams are using it, and in which queries and which tables this data is used. And I couldn't agree more here on the importance of seeing how that data is being transformed.
Now, as we're discussing a lot of the challenges organizations face when it comes to data management, I'd love to also learn more about Capital One Slingshot, right? So maybe walk us through in a bit more detail what Capital One Slingshot aims to solve. In a lot of ways, the challenges that we're discussing here today were challenges faced by Capital One, so I'd love to learn how Capital One Slingshot aims to solve a lot of these challenges.
Salim Syed: Before I get into Capital One Slingshot, I want to explain the concept of optimization. When you are on premise, your data platform costs are pretty much static, right? You buy a certain number of servers, and then you have some constraint on how much you can use, but the cost doesn't fluctuate until you buy a bigger server or expand, and it took four to five months to do that.
But when you move to the cloud, for the first time cost is dependent on how you use it and how much you use. So the more compute you use and the more queries you run, the more your cost is going to be. And what that means is there's room for a lot of inefficiencies that you probably had in the old environment too, but there they didn't affect your cost.
They affected your performance, but not your cost. Now they do affect your cost. So what Slingshot does is try to help you optimize your cost, and it does that in a few steps. First, it gives you visibility into your cost. Where are the costs spiking?
Where are the costs high or low? Second, it gives you near-real-time alerts and insights into the cost drivers. The last thing you want to do is wait until the end of the month, when the bill comes, and realize that you had one query running for 30 days, a runaway query. So you want to know about that as quickly as possible.
And then the last one: not everyone is going to be enough of an expert in Snowflake to know how to optimize their queries or their server configuration. So when we find inefficiencies, we give you a recommendation: this is why we're seeing the inefficiency, and this is exactly what you can do, or here are a couple of options.
And then the tool allows you to make the changes to your settings, right? All three together allow you not only to save cost and remove inefficiencies, but also to have everything that happens in the Snowflake environment managed and well governed, because we also have a way for you to provision resources and change resources through an approval process with the proper tagging. So it's well managed and well governed as well.
That's, in a nutshell, what Slingshot does.
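The runaway-query alert mentioned above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Slingshot or Snowflake implements it: the `find_runaway_queries` function, the query records, and the two-hour threshold are assumptions made for this example.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def find_runaway_queries(
    running: List[Dict],
    now: datetime,
    max_runtime: timedelta = timedelta(hours=2),  # assumed threshold
) -> List[str]:
    """Return the ids of queries that have been running longer than max_runtime."""
    return [q["id"] for q in running if now - q["started_at"] > max_runtime]


# Made-up snapshot of currently running queries.
now = datetime(2024, 1, 31, 12, 0)
running_queries = [
    {"id": "q1", "started_at": datetime(2024, 1, 31, 11, 30)},  # running 30 minutes
    {"id": "q2", "started_at": datetime(2024, 1, 1, 12, 0)},    # running 30 days
]
print(find_runaway_queries(running_queries, now))
```

Running a check like this every few minutes, rather than waiting for the monthly bill, is the point of the near-real-time alerting described in the conversation.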
Adel Nehme: That's great. And since you mentioned cost optimization, maybe we should cover that in a bit more depth. Maybe walk us through the drivers behind ballooning costs when managing data in the cloud.
Salim Syed: Yeah, the way I see it, there are four different areas where cost can be highly variable and can explode if you don't manage it well. The first one is the compute cost. In modern data warehouses and data platforms, you always have a compute aspect, which is separate from your storage, and the more queries you run, the more you spend.
The second one is the query itself, the query optimization. The third one is your data set optimization: How much data do you store? Is it being modeled correctly? And the last one is environment optimization. There are going to be lower environments, dev and QA, as well as production.
And there are a lot of inefficiencies that can happen in the lower environments as well. But let's go back to the first one, the compute. What we understood is that there's no one size that fits all times of the day. What we see is that workload usually goes up and down based on the time of the day and the day of the week, and you want your Snowflake resources to size up and down accordingly. And today, Snowflake doesn't let you do that.
It gives you one size, so it's very important to know the ups and downs and to make sure your compute goes with them. That way you're spending most effectively. It's also important to make sure that your servers are shut down when not in use, and that you have timeouts on your queries so that a runaway query isn't keeping a server up.
So watching for those is very important. The second one is query optimization. Like I said, the longer a query runs, the more it costs. So if you have a badly written query that does, for example, a Cartesian join, it can run forever. All of that will cost you a lot of money, and you won't even know about it unless somebody's watching.
So it's very important not only to build in alerts for runaway queries, but also to provide advice on how to rewrite the query to get the best performance and lower the cost. And our Slingshot tool does that as well. The third one is data set optimization. I call it data set optimization because it's got a few aspects to it.
One is storage cost. You'd think storage is much cheaper in the cloud compared to on premise, but when you're dealing with multiple petabytes of data, its cost will sneak up on you. So it's very important to have a retention strategy for your data. That means knowing how long you want to keep it.
It could be based on a business requirement or a regulatory requirement. Figure out a way for your storage not to keep going up and up. Otherwise, what happens is that even if you start with storage at 5 percent of your cost, in year three it'll be 40 percent, because you're not purging or archiving anything old.
So it's very important to keep a retention strategy. Second, for a lot of the data that we're loading into our data warehouses today, you have to understand the consumption patterns as well. There's no point loading something in real time when you're consuming it once a month, because there is a cost to loading data as well.
These are the insights you want to draw from a tool: what is the best way to load the data that matches the usage? And the last point is the environment cost. One of the things you notice is that when you're building data pipelines in lower environments, there's a lot of room for inefficiencies.
For example, one of the things we did was make a rule that says that in a dev environment you can't have a server larger than, for example, a small size. Just by enforcing that rule, we saw significant savings. And if there is a need for something larger than that, you go through a certain approval process.
So enforce rules like that, and also make sure that the jobs you're testing or developing aren't just running when you're not working, right? Those things can happen. So it's very important that you pay attention to the lower environments, even though they're a much smaller percentage of total cost.
They can also balloon out of control if you're not putting in the same efficiency checks that you're going to put in production, right? So the point I'm trying to make is: don't forget about the lower environment cost, because that can sneak up on you as well.
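The lower-environment sizing rule described above could be expressed as a small policy check, sketched here in Python. The environment names, the size cap, and the `needs_approval` function are hypothetical; the size labels follow Snowflake's warehouse size naming, but the list is shortened for illustration.

```python
from typing import Dict, List

# Warehouse sizes in increasing order (shortened list for illustration).
SIZE_ORDER: List[str] = ["X-Small", "Small", "Medium", "Large", "X-Large"]

# Assumed policy: dev and QA are capped at Small; production has no cap here.
MAX_SIZE_BY_ENV: Dict[str, str] = {"dev": "Small", "qa": "Small"}


def needs_approval(environment: str, requested_size: str) -> bool:
    """True if the requested size exceeds the cap for this environment."""
    cap = MAX_SIZE_BY_ENV.get(environment)
    if cap is None:
        return False  # no cap configured for this environment
    return SIZE_ORDER.index(requested_size) > SIZE_ORDER.index(cap)


print(needs_approval("dev", "Medium"))   # above the dev cap
print(needs_approval("dev", "Small"))    # within the dev cap
print(needs_approval("prod", "X-Large")) # no cap configured
```

A request that returns `True` would be routed to an approval workflow instead of being rejected outright, matching the "go through a certain approval process" step mentioned in the conversation.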
Adel Nehme: Yeah, I love this overview. And I love how you've thought about all of these edge cases and all of these drivers of ballooning costs when designing Capital One Slingshot. But maybe switching gears slightly here, let's think about how leaders and organizations should approach their data management journey to avoid ballooning costs.
A lot of organizations are still early in their cloud journeys. A lot of organizations are still setting up their cloud infrastructure. What advice would you give them to make sure that they're driving as much ROI as possible when it comes to data management? How do they avoid being in a mode where investment is increasing and costs are ballooning without necessarily driving a lot of ROI from data activities?
Salim Syed: There are times when companies that are moving to the cloud don't see what's right around the corner. They go, and then they face all these different challenges around governance, around cost, around accessing data. So my recommendation and advice has always been: think about data governance and data management when you're starting off the journey to the cloud.
It's much harder to put everything back once you've opened Pandora's box. If you don't have your data registered across your organization, now you're going to have to go through and collect all of that.
But if you think about central tooling and central policy from the start, then any new data that's created will automatically be well managed and well governed. So that's the first piece you should think about: invest in good tooling that allows you to create and enforce the centralized policies, but gives ownership to the lines of business.
On the cost side, I would recommend the same thing. It's very important to put the right policies in place from the beginning on how you're going to do chargeback to your lines of business. Are you going to have a budget or not? Are you going to have a way to request more funding?
All of that needs to be part of a data platform in this modern world. Since it's a variable cost, you need to have proper funding, proper budgeting, and a proper way of asking for more resources. The right approval workflows should be built in, and visibility into the way you're spending your money is very important.
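As a small illustration of the chargeback and budgeting idea described above, the following Python sketch totals cost per line of business and compares it with a budget. The `chargeback` function, the record layout, and all numbers are hypothetical and only meant to show the shape of such a report.

```python
from collections import defaultdict
from typing import Dict, List


def chargeback(usage_records: List[Dict], budgets: Dict[str, float]) -> Dict[str, Dict]:
    """Total cost per line of business and whether it exceeds its budget."""
    spend = defaultdict(float)
    for record in usage_records:
        spend[record["line_of_business"]] += record["cost"]
    report = {}
    for lob, total in spend.items():
        budget = budgets.get(lob)  # None means no budget was set
        report[lob] = {
            "spend": total,
            "budget": budget,
            "over_budget": budget is not None and total > budget,
        }
    return report


# Made-up usage records and budgets for illustration.
records = [
    {"line_of_business": "marketing", "cost": 120.0},
    {"line_of_business": "marketing", "cost": 95.5},
    {"line_of_business": "risk", "cost": 40.0},
]
budgets = {"marketing": 200.0, "risk": 100.0}
print(chargeback(records, budgets))
```

An over-budget result is where a funding request or approval workflow, as mentioned in the conversation, would come in.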
Adel Nehme: Yeah, that's good. I couldn't agree more. And what's interesting, hearing about your journey leading Capital One Slingshot, is that in a lot of ways Slingshot was built upon Capital One solving the problems that we're discussing right now. Maybe walk us through, briefly, the journey of taking a tool built internally and productizing it for the wider market. What changed, if anything, in your approach as you were productizing Slingshot?
Salim Syed: The first thing was that the tool we built within Capital One was very specific to Capital One, right? There was a lot of hidden integration with Capital One infrastructure already. So we had to build a SaaS platform that is multi-tenant.
It has to protect each customer's data. It has to make sure that customers' data doesn't overlap. So we had to create a whole new SaaS framework. We had to create a different software development practice to help with this journey.
And one of the things that you're going to see from software built by Capital One is that security, performance, scale, and resiliency are just built into our DNA, right? Everything we build is built as a hardened platform, and that's how we've always started. Other startups will start by just providing you the features and then figure out how to harden them later, but we've always made sure that the product you see is going to be hardened, enterprise-grade software.
Adel Nehme: And as we close out today's episode, Salim, when looking ahead, what are the trends that you're excited about when it comes to organizations managing their data effectively? How do you think the landscape will evolve over the foreseeable future?
Salim Syed: I think it's already evolving into that federated mindset. The more companies I speak to, the more they realize that it's going to be very hard to manage it from a centralized point of view, right? So that's the shift that's happening. It's not happening as fast as I'd like to see, but it's clearly evident from the strategies of different companies.
The way they execute it is a little different. They may keep data across multiple platforms. But what's interesting is that you have a central catalog and a central access policy across all your data, and those are the key parts, right? That way, you break the silos and you have a single way to get data, even though it's owned by different lines of business.
Adel Nehme: Yeah, that's definitely an exciting development as we see that federated mindset evolving and empowering organizations and different lines of business to use their data effectively. Finally, Salim, it was great having you on the show. Before we wrap up, do you have any final call to action or notes to share with listeners before we end today's episode?
Salim Syed: Yeah, absolutely. What I say quite a lot is that when it comes to cost optimization, it's never a one-and-done deal. You have to be vigilant. It's a constant battle. You need to be watching for new users, new usage, and new usage patterns. New workloads are always being introduced into the system.
So don't focus on reducing cost. I would say focus on reducing waste and reducing inefficiencies in the system, and then you will be able to scale with peace of mind, knowing that the money you're spending is going toward the value generated for the business.
Adel Nehme: That's really great. Salim, it was great having you on DataFramed.
Salim Syed: Thank you so much. It's been my pleasure.