Hands-on learning experience
Companies using DataCamp achieve course completion rates 6X higher than traditional online course providers.
Large organizations are inherently different from young startups, or even medium-sized companies. They have usually been around longer, have stricter compliance requirements and company cultures that may go back generations.
What is the path to making large institutions data-driven and making it stick? What are the types of unique challenges that arise in these types of organizations, and how can data leaders and practitioners overcome them?
In this webinar, Maksim Pecherskiy, former CDO of the City of San Diego and lead data engineer for the Development Data Partnership at the World Bank, will demystify the unique challenges to operationalizing data science within large organizations, how large organizations can get their data science practice off the ground as quickly as possible, and the best practices that facilitate a data-driven culture within large organizations.
Hello, everybody. Thank you for having me here and it's great to be here. And I'm very excited to present to such a diverse crowd today. So I'll start with a little bit about me. Right now, I work as a data engineer at the World Bank. It is a very large IT organization. I mean, they do other things, but the IT organization in the World Bank is very large. And what I work on specifically is ingesting large data sets for a subsection of the bank called the development data partnership.
And really, the bigger picture is to build out the data structure for data sharing across several international organizations. Before this, I worked as the chief data officer for the City of San Diego. And I know there are some people from the city-- from City of San Diego here, so I'm very, very excited about that. And I worked to build the data infrastructure for moving data around the city. I also worked to implement an open data policy and the open data program, as well as build out our data team, the process, and the people.
Initially, I got into government working with Code for America in Puerto Rico. And I was part of a very small team that built a very simple tool that helped a lot of small businesses in the area, which was very exciting for me and showed me that I could have a lot of impact by doing tech in government. And really my background is in software engineering and solutions architecture, and really, I just tend to try to find the right ways technology can have an impact on solving thorny business problems.
So before we get started, let's set the stage. The big tech companies we know and well, maybe love, today, have been really good at taking data and making use of it. Scaling and operationalizing data science has been a core component of how these companies have become successful. They were able to build top-notch data teams, they were able to design and build their own data tools, procure the best data tools, and they were really able to focus and design powerful internal processes, and architectures for moving their business forward using data.
But if you look at these companies, they're not that old. They grew up with data. They grew up with data being around, it was a thing, it was-- they built their business around it. But older organizations, for example, financial institutions or health care institutions, or the place I know most, governments, they didn't have that privilege. The City of San Diego was founded in the 1800s. And so they're still learning to adjust, and they're still learning how to effectively align data to their operations.
And so in this presentation, I really want to speak to the data leader or data practitioner in this type of organization. How can you incorporate data into your processes? How can you make your company data driven? And how can you use data to empower the organization to operate in a better way?
So today on the agenda, what exactly are we going to talk about? So first, let's talk about the current state of data in organizations, in these specific organizations that are older that I mentioned. Next, we can talk about challenges faced by these types of organizations to becoming more of a data driven organization. Next, we can talk about how data leaders and practitioners within these companies can overcome challenges. And then finally, we can take a look at some of the new stuff coming out in the next year or two that I'm pretty excited about and how it's going to impact large companies.
So let's start with the current state. So for the past several years, many organizations began investing in data science. They hired chief data officers; I started in San Diego in 2014. They built centers of excellence, they built teams, they invested in infrastructure, and they hired data scientists, data engineers, and they began to really think about, how can the data be useful to the business, how can we share it within the organization, and how can we become really more data driven.
So there's a NewVantage Partners survey of executives that covers 70 leading firms and comes out every year. And as of January 2020, 50% of US state governments had a chief data officer. 63% of the financial services firms they surveyed had a chief data officer, and 47% of life sciences firms, so that's health care and biotech, also had a CDO.
But there's still a ways to go. So many firms are embracing AI and data science solutions. 95% report ongoing investment in data science and AI. I'm sure you've heard of data scientist as the most popular job of the 21st century. But only 14% report having deployed AI capabilities into widespread production. And we can link to the survey as well, or you can just Google it. And so many firms struggle to create repeatable processes and build supporting infrastructure for taking their data and actually connecting it to their business.
And so out of the 70 firms that NewVantage Partners surveyed, most of them were investing in big data or AI initiatives, as I mentioned. But only 37.8% could say they were able to create a truly data driven organization. And then out of those 70 firms, 90% cited people and process challenges as barriers to becoming data driven. All this to say we have a lot of really great technology. Technology is not the issue, it's the people, it's the processes, it's the culture.
So data driven is a nice buzzword, and we're all excited about it. What does it really mean to be data driven? And why do so many firms that are investing all this money in data operations not feel like they're really getting there? And really, the core of it, the way that I think about it, is that in order to improve the people and processes that use the data, the data needs to become used and trusted. Or in other words, it should be easy to make decisions based on data. And going against what the data says should really be the exception and not the rule.
As you know, there are many different types of users in an organization. There's executives, there's managers, there's analysts, there's data scientists or data engineers. And each of these groups is going to look at this slightly differently. But used and trusted boils down to a certain combination of these factors. So first is data collection. The right data is collected at the right time, it's collected correctly, and this data is collected with purpose.
Then data users across the organization know that this data that's been collected exists. And they know what analyses have been done on that data, and they're able to find those analyses, and they're able to find that data. So that's what we put under the bucket of discoverable.
Next, data is reliable and of good quality, with no gaps or inconsistencies. So if you have data that gets collected, but there's an outage at a random time once an hour, that doesn't really make the data reliable, and it doesn't make that data good quality, it's just garbage data. So reliability is important.
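One way to make the "no gaps" idea concrete is a small check for missing readings in a timestamped feed. The following is a minimal sketch, not something shown in the talk: the 10-minute reporting interval, the sample readings, and the `find_gaps` helper are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, expected_interval, tolerance=timedelta(0)):
    """Return (start, end) pairs where consecutive readings are further apart than expected."""
    ordered = sorted(timestamps)
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > expected_interval + tolerance:
            gaps.append((previous, current))
    return gaps

# Hypothetical feed that should report every 10 minutes, from 00:00 to 01:20.
start = datetime(2020, 1, 1, 0, 0)
readings = [start + timedelta(minutes=10 * i) for i in range(9)]

# Simulate an outage by dropping the 00:40 and 00:50 readings.
outage = {start + timedelta(minutes=40), start + timedelta(minutes=50)}
readings = [t for t in readings if t not in outage]

gaps = find_gaps(readings, expected_interval=timedelta(minutes=10))
for gap_start, gap_end in gaps:
    print(f"gap from {gap_start:%H:%M} to {gap_end:%H:%M}")
```

A check like this could run every time the data is refreshed, so users hear about an outage from the data team rather than discovering it in the middle of an analysis.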
Next, data is understood. So I'm sure for the people in this webinar that have worked with data, looking at a spreadsheet with a bunch of columns on it, trying to figure out what each column means, what the caveats are, what the nuances are, how this data was generated, that's going to influence your analysis a lot. And a lot of times, it's hard to figure out. And so data being understood is crucial to its use.
Next, data is compliant. So security protocols control access to data. For sensitive data, computing environments are provided that prevent leaks. Legal regulations are followed, and we can talk a little bit more about that later on. So making sure that using data is possible in a secure environment. Finally, data is actionable. So data users have the technology, the training, and the ethical frameworks, I put that in the bucket as well, to use this data correctly.
So these are kind of like the six pillars of used and trusted.
So what's holding back organizations? Can I get a drink of water? So there's three buckets of challenges that hold back organizations that want to take data, make it useful for their business, and operationalize it. First, there's organizational, or we can think of it as the bureaucracy and processes. There's cultural, or the unspoken rules of how an organization operates. And then there's the technical, there's the actual software and hardware, and how all those pieces fit together.
So let's start off with organizational. So I'm going to make a crazy statement here. Large organizations are really large, right? And they have silos. And the silos are separate from each other and tend to-- it's another word for teams that rely on each other but don't necessarily trust each other or have a smooth communication flow. So silos could be-- because of legal requirements, so if you think of a compliance unit in a bank. But also silos could be because that's the way it's always been done, and these teams never had to communicate, and changing the culture to enable them to communicate is difficult.
And so this silo situation tends to lead to issues with communications and alignment. So that means that information doesn't flow freely between these two units or these multiple units. And as a result, not everyone is moving towards the same goal and not everyone has the same priorities. Organizational politics also come into play. So managers of these organizations and these silos and middle management, they may have different interests that may not directly align with what's best for the organization.
And they want to build up their resources, and they want to have a bigger team, and they want to have more responsibility, so they have-- so they get promoted, which is fine, but that's what happens in large organizations. And so what results is treating data like a project in a subunit and not necessarily a cross-cutting initiative. So an example is when you see a large organization having three separate or four separate AI centers of excellence. That's exactly how that might happen.
The exact opposite of what I just described, which in a way is another symptom of silos, is the trade-off of speed for coordination. So it's the, we want to make sure everybody's on board with this before we go and do this thing. And we want to make sure that everyone's on the same page and everybody's aligned, and that may slow an organization's movement down.
Cultural issues come into play too. So not understanding the bigger picture and why getting access to data or using data to make decisions is important. Questions like, well, we've never used data, things work now, why would we start? It could be better. It could be worse. The data does not cover caveat X or caveat Y, even though they don't happen super often. Lack of data literacy, so not everyone necessarily knows what machine learning does. People read the news and they hear really bad stories, so that creates risk aversion.
Managers often think that if I share my data with another team, they're going to analyze my performance and I'm going to get punished, because they may not know how else their data can be used. And so-- and that kind of also leads into risk aversion, right? If you don't do something new, you don't get punished. But if you do something new, you could either succeed or you could fail and get punished. And so the impetus for trying something new has to be really strong.
So depending on the organization there can also be misalignment of incentives culturally. And so maybe management wants eye-catching dashboards so they can present them to their superiors. But engineers want to use the latest and greatest tools and do the coolest project because they want to build up their resume. And business users just need to answer a question so they can make decisions and move on with their lives, or maybe they want a process automated.
And so if you just build to make managers happy, you can get value out of that but sometimes, you can't. If you make developers happy, again, you can get value out of that but sometimes, you can't. And then, if you build to make the business users happy, again, you can also get value out of that but sometimes, they're actually asking the wrong question, and if you reframe the question slightly differently, you get a lot more value out of that. And so these symptoms kind of result in a lot of projects that begin at the prototype and kind of end there and don't go past it.
Another one is legacy project management methods. So the waterfall framework has been used forever and ever in IT. And the idea is, let's plan everything out with timelines and deliverables and we're going to know everything in advance, and inevitably you're going to be wrong. And so that's why agile has gotten so popular. And so what that creates is an environment where, oh, you blew the timeline because it took you longer to train your model than you expected. So the project is a failure, let's never do this again.
And then lastly, data is not everyone's thing. So in the city, people that pave roads, they don't really think about data, it's an administrative burden for them. And so they may not think of data as core to their operations, and so they may not invest that hard in data system management or making sure it's up to date or prioritizing that work because it's not really core to what they do, it's just a burden that they have to work with. And that can lead to, again, unreliable data.
So now let's talk technology. So organizations again, as we talked about they grew up based on a number of needs but older organizations weren't exactly built around the concept of using data. And so what's resulted is a patchwork of legacy systems that has gotten put together over the years, and those systems tend to be the record of data or sorry, the core systems of record for the data.
And these systems can be funky, they can be mainframes or they can be difficult to pull data out from and so you end up having to work with custom connectors and you end-- your project ends up ballooning because just to get the data out, you have to do this work that you didn't expect. And owners, the data-- the owners of these systems, the department directors, they may be hesitant to replace them because obviously, there's the switching costs and the project-- potential project failure cost, and it could be a really big project, but there's also retraining costs. If you have 100 staff that are used to using the system, the cost to that changeover and the loss of productivity is actually pretty high.
And so this likely happened, as I mentioned, because there was no unified data strategy before, because people weren't thinking about data maybe 10, 15 years ago. Or, at least, not the way that people are thinking about data now. And then last is compliance and security. So there's HIPAA, there's FERPA, there's GDPR, there's California data regulations, and so it's really, really important to adhere to these and to prevent unethical data use. Companies don't want to end up in a lawsuit or end up in the news, and so this is another security or technical challenge that can present itself to making data operationally useful.
OK, so that was really negative, all these challenges. And so all of these things can contribute to what Brian Balfour, he's a really, really knowledgeable person about this kind of stuff and has a great blog, describes as the data wheel of death. And so it kind of starts like this. Data isn't constantly maintained, so it gets forgotten, it's hard to access, maybe you forget to update it, or it's not core to your operations, then it becomes irrelevant or it becomes flawed.
And then you start hearing things like, well, we can't use this data because of this caveat, they don't collect it in this situation. And then people start to lose trust in the data. And then eventually, people just use it less because it's easier to just go with your gut and make a decision than try to figure out all the caveats or what's going on. I have to drink a lot of water.
OK, but it's fine. This is normal. This is, I would say, part of the fun or the challenge of working in a big organization. These organizations were not built with data in mind and data science wasn't a thing 10 years ago. Databases have only been around for like 25 years and large enterprises are older than that. And so they weren't built around using data to run the business.
So great, we've laid out all the negative stuff. How can we approach this? How do you figure this out?
And so first, I would say that it's really important to understand the landscape. What are those different silos in the organization? How do different teams communicate? Are there teams that communicate better with each other than with a third team? Who are the powerful players? Who are the people that you need to appease, you need to sell them, you need to get their support? Who can be your champions as you go along on this journey?
And then there's also people who don't have official power, they have a lot of credibility and they get listened to. And who are those people? And how do you get them on your side? And how do you make them happy? And then lastly, who are your blockers? Who are the people that may not want to change or may stand in your way and how do you bring them along? How do you leverage your champions?
And how do you-- and then lastly, how do you communicate to different types of stakeholders? As we mentioned, there's executives, managers, data scientists, and just from my experience, if all of a sudden you find yourself in a meeting trying to explain ETL to a CEO or mayor, you should probably change your strategy.
So part of understanding your landscape is understanding the users in your organization that use the data, and I already mentioned that. So the manager maybe wants to look at Tableau dashboards. The executive may want a report. The data analyst wants to use Excel or maybe write some SQL. The data scientist wants to access the data lake and to be able to run multiple models at the same time and figure out which ones are working better and which ones are failing.
And for you, as you're going about this, it's really important to empathize deeply with your users. Pick who it is that you're going to serve first, and how that's going to affect your other personas. Don't neglect making higher-ups happy. Managers and executives are important, they're your champions. But also make sure that you're getting buy-in from regular business users who actually do the work, who are going to come back to you when the manager is not around.
And then finally, build a community of practice where your data users can come together. So a place to share ideas, best practices, learnings, and code. And that'll let you build empathy and create a loop that helps you understand your users and build powerful leverage across the organization.
As you approach projects, as somebody with a technical background, it's really easy to say, Oh, I want to do this cool thing and it's going to be really impressive and hard.
But there's a discipline involved in keeping things simple because you won't be able to fix everything at once, you won't be able to cover everything at once, and it's really, really important in the beginning to prove success by having a win and taking it to production. And so I think of a project that is visible and impactful across the organization and to day-to-day users, does not require a ton of stakeholders or coordination, allows you to leverage work you've done on this project for the next thing, and also significantly impacts your chosen business user.
Someone is humming, I think. Or singing, it's really beautiful. So for example, in San Diego, I started with a street work visualization map. It pulled data automatically from a street pavement management system. And here's kind of why I picked it. It was important to the mayor and a political issue. It required coordination with two stakeholders from the transportation department, so there weren't a lot of people involved.
It let me make the case for infrastructure for data automation across the city because this data needed to be updated continuously. It provided significant help to crews planning work, and also it let me get to know and build relationships with people across the organization and understand what doing these types of projects is actually like. And so success on a small project lets you show value for future engagements.
It lets you understand the pitfalls of making data actionable, it helps you prototype a framework for project delivery, and it lets you build tools that you will use later on. And then finally, it lets you reach into other parts of the organization and build relationships, which is one of the most important things.
And as you progress, it's important to align data strategy with the strategy of the business.
And I know I've said this, but don't work on a project because it's cool or because it uses NLP or computer vision or neural networks. Those are all fun things, but go for projects that let you expand and test your infrastructure, let you serve your business users, and align with company goals and results. So my team defined a template scope of work for every project that we did, which tested for some of these conditions, identified users, and made sure that stakeholders supported it. Was it perfect every time? No. But that's what you should aim for.
And this will eventually let you start thinking of data as a product, creating feedback loops and improving your work with each project that you do.
As you go through these loops, take opportunities to empower users. Work with different data governance groups, define authoritative data sets. Users shouldn't have to think about, Oh, where is the official roads file if they work in a city?
And then make these data sets available. Also, provide tools to users. So don't become a bottleneck for every data project, let people serve themselves using tools that they're comfortable with. So for example, in San Diego, we had several Tableau licenses, we provided those to people, but we also provided APIs and downloadable CSVs for people that can do work on their own. And people could also find the data that they needed with an external and internal catalog.
Create systems for sharing sensitive data. So the work I do at the World Bank is for the Development Data Partnership, and it creates a legal and technical framework for sharing data across organizations. So I didn't do this much at the city, but I've been learning a lot here about the legal implications of this, and also what that platform needs as well.
Now, your users are empowered, so motivate them. Don't reward shiny dashboards. Reward processes improved, reward outcomes gained, reward things that really move the business forward, and make sure that the right data uses are rewarded and others are discouraged. Then provide best practices. Don't make people reinvent the wheel. Provide Docker containers with pre-installed libraries, so it's easy for people to spin up and work on.
Provide starter repositories with basic documentation and folder structure. People shouldn't have to think about, Oh, where did I put the raw data and the processed data and the outcome data? Secure compute environments, provide those with SageMaker notebooks or tools like Databricks. And also, I would say, provide simple checklists for completing a data project. And really, the overall goal here is that best practices should be simple to implement. They shouldn't be a thing that you have to hammer people to do, it should just be naturally easy to do as part of the project.
And then finally, create a place for users to search for data. Search for authoritative data sets, search for work previously done by others so they can build off of it. So Lyft and Uber recently open-sourced some of their internal tools for it, and in San Diego, as I mentioned, we had an internal and external data catalog.
Then build onto your infrastructure. So it's really important to draw the distinction between what IT does and what the data team does. So Joy Bonaguro, who's the CDO of California, who is awesome, she kind of draws this distinction. IT is the gym. So when you want to go work out, they are the ones who provide the machines and the weights, and all that. The data team are the personal trainers. They're the ones that help you get in shape, learn how your body works, make sure you don't strain your back, things like that.
And then business users are clients, right? So they're the ones that come to the gym and work with a personal trainer, so that they can improve how they operate physically. Provide the base layer of infrastructure. So people shouldn't have to figure out ETL for every project they do. People should be able to automate a process without bringing on a full IT team. As you move up the stack, provide standard interfaces for data access. So Python users, R users, Tableau users should feel welcome to use the data, and it should be easy for them to do so, even Stata users.
Remember, large organizations use many different tools. And you don't want to create a bottleneck by forcing somebody to use a tool that they don't know or they're not comfortable with. And then as I mentioned earlier, make best practices easy. So every project should have documentation, reproducibility, and the basics as the default.
As a data scientist, I should be able to spin up a repo that has basic documentation, has good project structure, and has a solid compute environment that is in line with how I need to use the data in this case. I need to know where to store my raw data, my processed data, my outcome data. I need to-- this project needs to be set up to access the data lake and to have a minimum set of boilerplate documentation.
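A starter repository like the one described here can be generated by a very small script. This is a minimal sketch under assumptions: the folder names, the README headings, and the `scaffold_project` helper are hypothetical and are not the template the San Diego team actually used.

```python
import tempfile
from pathlib import Path

# Hypothetical layout; adjust the folder names to your organization's conventions.
PROJECT_DIRS = ["data/raw", "data/processed", "data/output", "notebooks", "src", "docs"]

README_TEMPLATE = """# {name}

## Question this project answers

## Business users and stakeholders

## Data sources

## How to reproduce
"""

def scaffold_project(root, name):
    """Create the standard folders and a boilerplate README under root/name."""
    project = Path(root) / name
    for relative in PROJECT_DIRS:
        (project / relative).mkdir(parents=True, exist_ok=True)
    readme = project / "README.md"
    if not readme.exists():  # never overwrite documentation someone already wrote
        readme.write_text(README_TEMPLATE.format(name=name))
    return project

# Example: scaffold a project in a throwaway directory and list what was created.
with tempfile.TemporaryDirectory() as tmp:
    project = scaffold_project(tmp, "street-work-map")
    for path in sorted(project.rglob("*")):
        print(path.relative_to(project).as_posix())
```

The point of the design is that the default is the best practice: the raw, processed, and output folders and the documentation skeleton exist before anyone writes a line of analysis.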
And I also need a way of securely experimenting with and deploying my models as APIs. Again, if that's what your users need; a lot of organizations are not there yet, a lot of them are, and so that's something you have to look at. This will allow for better-documented projects and less rework. And encourage this, encourage best practices by making them the standard. Social pressure is powerful, and if everyone else is doing it, the new person on the block will start doing it as well.
So what does success look like? How do you know if what you're doing is effective, right? It's really tempting to use the number of projects completed, number of data sets available, number of repo followers, the number of people that come to your meetings. Those are all good metrics, but they don't really address the core of your success. The things that I would think more about are the number of decisions that get made with data, how long it takes someone to find a set of data, or the time to data, how many goals, objectives, or OKRs get set and tracked with data, and how those are reviewed. Those are the things that I would think about.
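Two of these measures, time to data and the share of decisions made with data, are simple to compute once someone is logging the underlying events. The following is a minimal sketch; the request log, the decision log, and both helper functions are hypothetical and exist only to show the arithmetic.

```python
from datetime import datetime
from statistics import median

# Hypothetical request log: when someone asked for data and when they had it in hand.
requests = [
    {"asked": datetime(2020, 3, 2, 9, 0), "received": datetime(2020, 3, 2, 11, 0)},
    {"asked": datetime(2020, 3, 3, 14, 0), "received": datetime(2020, 3, 5, 14, 0)},
    {"asked": datetime(2020, 3, 9, 10, 0), "received": datetime(2020, 3, 9, 16, 0)},
]

# Hypothetical decision log: whether each decision cited a data set or an analysis.
decisions = [
    {"topic": "repaving schedule", "used_data": True},
    {"topic": "crew overtime", "used_data": False},
    {"topic": "pothole priorities", "used_data": True},
    {"topic": "permit staffing", "used_data": True},
]

def median_time_to_data_hours(log):
    """Median number of hours between asking for data and receiving it."""
    hours = [(r["received"] - r["asked"]).total_seconds() / 3600 for r in log]
    return median(hours)

def share_of_decisions_with_data(log):
    """Fraction of logged decisions that were backed by data."""
    return sum(d["used_data"] for d in log) / len(log)

print(f"median time to data: {median_time_to_data_hours(requests):.1f} hours")
print(f"decisions made with data: {share_of_decisions_with_data(decisions):.0%}")
```

The median is used rather than the mean so that one request that sat for weeks does not hide the typical experience.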
All right, so we've come to an end, and let's wrap it up. Let's bring it all together. So here are the kind of key takeaways that I want to leave you with. The biggest challenge is not technology. It's the people, it's the processes, it's the organization, it's the culture. And it's a really fun challenge. Take the time to understand your users. Understand what their challenges are, what they need to accomplish in their work. You're building for them, they're your customers.
And it's also really tempting to build everything right. Especially, again, in my experience coming from an engineering background, it's really tempting to say, Oh, this has to be perfect. But there's a certain discipline that comes from making compromises between long-term gains and quick wins, and it's really important. So if I can take a shortcut here and deliver it faster, and the shortcut isn't going to come and bite me later on, I should probably take the shortcut.
Also make sure that the incentives are in place, that the people in the organization are rewarded for using data correctly, and using data to move the organization forward. There's a fine line between doing that and creating a shiny dashboard culture where people just build dashboards and visualizations and projects that stay prototypes and don't move forward. Also enable base infrastructure but allow flexibility and access. So that's what I was mentioning earlier, ETL should be one thing, data automation should be one thing, but analysis can be flexible.
And then try to empower people with best practices by making them easy and making them the default. And then the last big one is I really, really recommend getting some mentors. So I'm super fortunate and thankful to have a close circle of people that I trust and respect that I go to and ask these questions. And honestly about 80% of what I told you today, and maybe even more, I learned from somebody else.
And so now that we have our takeaways, let's look a little forward. So 2020 will be over soon, and hopefully, with it all the 2020-ness. And let's look at the future of how data is going to get used in the enterprise. I'm specifically excited about data orchestration platforms, data discovery tools and engines, and these concepts that have been popping up of data lakehouses.
So orchestration platforms, they go by different names. But I think they're going to get more popular and they're going to gain more ground, especially in the data engineering space. So we have Airflow, Prefect, Luigi, Stitch, and there's a lot more of them. And it's interesting because a lot of older companies have been doing this for a long time, such as Alteryx, Talend, FME. But these newer tools, they're compatible with modern software approaches, things like version control, DevOps, continuous integration.
So for example, we used Python in the City of San Diego, and we used Airflow in the City of San Diego because we could version our code, and that was really important, whereas with some of these other ones you can't do that. And there is starting to be some overlap and some convergence with tools like Domino and Databricks, where you can schedule jobs. And so I'm curious to see how that will play out. And yeah, so there's now overlap in functionality.
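The reason code-based orchestration can be versioned is that the whole pipeline, tasks and dependencies alike, lives in a plain source file. The sketch below illustrates that idea using only the Python standard library (`graphlib`, Python 3.9+). It is not Airflow's API, and the task names and sample rows are hypothetical.

```python
from graphlib import TopologicalSorter

def extract():
    # Stand-in for pulling rows out of a source system.
    return [
        {"street": "Main St", "status": "paved"},
        {"street": "Oak Ave", "status": "planned"},
    ]

def transform(rows):
    # Keep only the streets where work has been completed.
    return [row for row in rows if row["status"] == "paved"]

def load(rows):
    # Stand-in for writing to a warehouse or publishing a file.
    return f"loaded {len(rows)} row(s)"

# Each task maps to the set of tasks it depends on.
DEPENDENCIES = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
TASKS = {"extract": extract, "transform": transform, "load": load}

def run_pipeline():
    """Run every task after its dependencies, passing upstream results along."""
    results = {}
    order = list(TopologicalSorter(DEPENDENCIES).static_order())
    for name in order:
        upstream = [results[dep] for dep in sorted(DEPENDENCIES[name])]
        results[name] = TASKS[name](*upstream)
    return order, results

order, results = run_pipeline()
print(order)
print(results["load"])
```

Because the dependency graph is an ordinary dictionary in a text file, a change to the pipeline shows up as a reviewable diff. Real orchestration tools add scheduling, retries, and monitoring on top of this basic structure.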
I've also been seeing more big tech companies release data discovery engines. So Lyft released Amundsen a little while ago. Uber just recently released Databook, which is a really interesting project. A startup called Alation has been around for a while, and they seem to be doing super well. And as data increases, companies I think will invest more resources in enabling their teams to find it and document it and reduce rework of analysis that's already been done.
I also have been seeing this really interesting trend. Right now we have data lakes and data warehouses, and we've had those for a while, right? So the concept of a data warehouse is that it's for BI and it's structured, and a data lake is just flat file storage where you can put your data. But I've been seeing, for example, Delta Lake and also Snowflake introducing these lakehouses where it's still flat files, but the schema and the versioning of the data are there as well. And so I'm really, really excited to see how that plays out, and maybe we won't have data warehouses and data lakes anymore but we'll just have lakehouses, which sounds really funny. But I'd also like to be at a lake house.
Data Science Evangelist at DataCamp
Data Engineer at World Bank