Intel CTO Steve Orrin on How Governments Can Navigate the Data & AI Revolution
Steve Orrin is Intel’s Federal Chief Technology Officer. He leads Public Sector Solution Architecture, Strategy, and Technology Engagements and has held technology leadership positions at Intel where he has led cybersecurity programs, products, and strategy. Steve was previously CSO for Sarvega, CTO of Sanctum, CTO and co-founder of LockStar, and CTO at SynData Technologies. He was named one of InfoWorld's Top 25 CTO's, received Executive Mosaic’s Top CTO Executives Award, is a Washington Exec Top Chief Technology Officers to Watch in 2023, was the Vice-Chair of the NSITC/IDESG Security Committee and was a Guest Researcher at NIST’s National Cybersecurity Center of Excellence (NCCoE). He is a fellow at the Center for Advanced Defense Studies and the chair of the INSA Cyber Committee.
Adel is a Data Science educator, speaker, and Evangelist at DataCamp where he has released various courses and live training on data analysis, machine learning, and data engineering. He is passionate about spreading data skills and data literacy throughout organizations and the intersection of technology and society. He has an MSc in Data Science and Business Analytics. In his free time, you can find him hanging out with his cat Louis.
Key Quotes
I think one of the exciting things about the time we're in is that it doesn't require an advanced PhD in data analytics and data science to play with some of these technologies and tools. And so I would encourage everybody, even policymakers, to get online and play with the tools, see what they can do, and understand that things like PyTorch and some of these other Jupyter notebooks are not rocket science. So you can actually, with a little bit of coding experience, do a lot. Even ChatGPT doesn't need any coding experience. You can just go online and interact with it and learn what it means to work with these tools. And so they're not frightening, they're really exciting. And I think all of us can benefit from getting more experience with these technologies.
Working with data in the public sector is both similar and different to any other data driven organization. Every organization has some aspect that is gonna be data-driven. Even the ones that aren't on the bleeding edge of AI and machine learning, data drives the mission and drives the enterprise. Data is at the heart of a lot of what any large organization does. What you'll find across the US government, is that in order to provide citizen services or national defense or policy, it's all gonna be driven by data. And the effectiveness of a given organization is how well they're leveraging that data. And so maybe the one way to look at it is which ones are being driven by the data or which ones are driving the data. Which side of the equation do you find them solving? You'll find a lot of organizations are drowning in data that they can't consume fast enough, or they don't know how to take full advantage of. And then you have others that are sort of driving towards, how do I leverage this data that we have access to to affect mission, to do better citizen service, reduce time to response? And so you'll find, as you would in any organizations, sort of that whole chasm of sort of from the laggards to the leaders. But I think at its core, most of the US government has recognized that data is a central component to affecting the mission and driving the scale that they have. The challenge that you have in public sector is that the data scales are much larger than you would find in most organizations, the sheer volume of data that they're dealing with, whether that be sort of a tax, the IRS dealing with tax information, the VA, which is one of the world's largest healthcare providers. The sheer numbers that we're talking about makes it a much bigger problem, but fundamentally it's the same problem you'd find in an insurance company or a hospital just with bigger scales.
Key Takeaways
When handling sensitive data, it's important to implement 'privacy by design' from the onset. This involves tagging data with the right policy controls and governance at the start of the data lifecycle, which can help maintain data security and control.
For data scientists and AI practitioners, iterative learning is not just for the model, but also for themselves. They should embrace real-world testing and experimentation, learning what works and what doesn't, and trying different model approaches to solve problems.
To become a data-driven organization, it's not just about having large volumes of data, but about leveraging that data effectively. This involves understanding the unique needs of the business, applying appropriate data management strategies, and using data to drive mission-critical decisions.
Transcript
Adel Nehme: Hello everyone. Welcome to DataFramed. I'm Adel, Data Evangelist and Educator at DataCamp. And if you're new here, DataFramed is a weekly podcast in which we explore how individuals and organizations can succeed with data and AI. When thinking about the role of government in driving positive change for society at large, I think it's safe to say that today's government agencies are faced with an increasingly complex and challenging world.
From pandemics to supply chain resiliency, banking failures, and climate change, as we covered on last week's episode, governments across the world need to be empowered with technology to accelerate the value they provide for their stakeholders, and data and AI are crucial parts of that. However, the road to data and AI transformation for governments is fraught with unique challenges and risks, and is full of unique opportunities.
So we invited Steve Orrin on today's show. Steve Orrin is Intel's Federal Chief Technology Officer. He leads all of Intel's public sector solution architecture, strategy, and technology engagements. Part of his role is understanding how and where governments across the world can use data and AI to provide higher value for their stakeholders and citizens.
Throughout the episode, we talked about the unique challenges governments face when driving value with data and AI, how agencies need to align their data ambitions with their actual mission, the nuances of data privacy laws between the United States, Europe, and China, how to best approach...
As always, if you enjoyed this episode, do let us know by leaving a comment or rating or letting us know on social media. And now on to today's episode. Steve Orrin, great to have you on the show.
Steve Orrin: Yeah. Thank you for having me here today, Adel.
Adel Nehme: Very excited to have you. So you are the Chief Technology Officer at Intel Federal. I'm very excited to speak to you about what it means to succeed with data in the public sector, but maybe to set the stage, walk us through in a bit more detail: what does your role entail at Intel Federal?
And maybe share with the audience where Intel Federal sits within Intel's overall suite of solutions.
Steve Orrin: My role is a very interesting one, even inside Intel. There are three aspects to what I do. Part of my job is to help the US government and the broader federal ecosystem adopt and understand how technology can enhance their enterprise and their mission use cases, how they can get the best performance or advanced functionality from the technologies that are available today, and what's coming down the road in future technologies and future architectures.
So in some respects, I'm translating all the architecture and technology from Intel into government speak. The other component of my job is then translating government requirements back into Intel speak, so our business units and our technologists and our architects understand the needs and requirements of federal use cases, both at the enterprise and mission level, and are able to translate that into next-generation capabilities or use cases that we can drive to help meet those needs.
And so a lot of my role is playing that back-and-forth translating function. Then the third area, which is really where I find a lot of the fun, is doing a lot of innovation: understanding what technologies we have to bring to bear, understanding these really interesting problems across the federal government, and then innovating together with the government and with Intel technologists and my team to solve those big hairy problems.
And so coming up with novel approaches or novel applications of technology to address the current or future needs of the government customer and its ecosystem. So it allows for both the engagement with customers with real problems and engaging with the technologists on what's coming down the pike.
And like I said, my role is somewhat unique at Intel, where we have CTOs for various technology domains: a memory CTO, a CPU architecture CTO, a client CTO. My role spans across all of Intel. I focus on a large customer base with unique needs that are representative of the broader industry.
One day I'll be talking about an IoT-based problem set. The next day could be high performance computing. So you get the full gamut of technology capabilities, and Intel Federal was set up and really chartered with that kind of eye. George recognized we needed a central entity that could represent Intel and engage with the federal government.
But it's all of Intel, everything from, like I said, client to foundry to high performance computing, and it enables Intel to engage both the US government and governments at large on their unique requirements and missions and some of the unique regulatory and compliance requirements that come with engaging with a government entity.
Doing it in a natural way that meets the government's needs while also being able to service the broad range of Intel technologies and capabilities, so that you don't have 50 business units all calling on the government at the same time.
Adel Nehme: That's very exciting and very insightful. A lot to unpack there, right? You mentioned that your role really requires a holistic perspective on how to leverage technology to drive value within the public sector. And what I really wanna focus on in today's discussion is specifically how the public sector, especially federal agencies, can really drive value with data.
So, I'd love to understand from your perspective, working with a myriad of government agencies on their data-driven ambitions: first, how do you define a data-driven government, and what do you think are the key components of becoming data-driven in the public sector?
Steve Orrin: The first thing to start off with is, every organization has some aspect that is gonna be data-driven. Even for the ones that aren't on the bleeding edge of AI and machine learning, data drives the mission and drives the enterprise. Whether it's logistics management or the VA servicing the veteran, data is at the heart.
We may call it different things: enterprise business intelligence, machine learning, data analytics. But data is at the heart of a lot of what any large organization does. What you'll find across the US government, and really it would be the same in any government, is that in order to provide citizen services or national defense or policy, it's all gonna be driven by data.
And the effectiveness of a given organization is how well they're leveraging that data. And so maybe the one way to look at it is which ones are being driven by the data or which ones are driving the data. Which side of the equation do you find them solving?
You'll find a lot of organizations are drowning in data that they can't consume fast enough, or they don't know how to take full advantage of. Then you have others that are driving towards, how do I leverage this data that we have access to, to affect mission, to do better citizen service, to reduce time to response?
And so you'll find, as you would in any organizations, that whole chasm from the laggards to the leaders. But I think at its core, most of the US government has recognized that data is a central component to affecting the mission and driving the scale that they have.
The challenge that you have in public sector is that the data scales are much larger than you would find in most organizations. The sheer volume of data that they're dealing with, whether it be the IRS dealing with tax information, or the VA, which is one of the world's largest healthcare providers.
The sheer numbers that we're talking about make it a much bigger problem, but fundamentally, it's the same problem you'd find in an insurance company or a hospital, just with bigger scales.
Adel Nehme: That's really awesome. And you mentioned here quite a few different agencies, right? From the IRS to the VA to even NASA and these types of different government organizations, right? Maybe digging a bit deeper here is how does the relationship between the agency's mission affect the ambitions and how much data a particular agency can use?
You know, maybe in a nutshell, can you elaborate on how important the alignment between the agency's mission and the overall data-driven ambitions of a particular agency should be?
Steve Orrin: So that's a really interesting question. And what you won't find is a common answer across the board. It's gonna be somewhat unique to the agency, given the complexity of many of these agencies and the variety of missions. So you pick someone like the IRS. You think, oh, they just do the taxes at the end of the year, but they actually have multiple missions across the agency.
Everything from corporate to personal tax. You have fraud detection, you have how the money flows, and so the whole backend of how payments happen. There's a lot of different sub-missions all within that broader agency. And so at its core, when you look at who the data-driven agencies are, it's the ones that are looking at the mission where data is transformative.
We're finding that it's an ongoing process. More and more agencies are learning that where they've had success in one area, they can apply those successes in others. Not everyone, even within an agency, is on the same path at the same time. I think the ones that are more advanced are the ones where they've got either the funding to drive that mission, and so are able to advance the technologies and take better advantage,
or they have the mission imperative where the data is the enabler. So obviously in the area of defense and intelligence, data is king, but you'll find in other places it's also a very powerful tool, and in some places it's the key component. I'll pick on one example, USDA, which covers things like the drug administration, you know, and the health of the livestock and the food, and has inspectors out making sure that the livestock is healthy and that the food is sanitary.
They're collecting data. And one of the things that was recognized early on is that being able to get good, accurate data and being able to scale a human workforce with the addition of sensors and data communications enables their mission. So they were able to tie those two worlds together and say their mission actually requires better use of data.
And so they drove sort of the sensorification of many of those environments so they can get better coverage, and so that they get more real-time responses rather than waiting for someone to come out and do their yearly inspection. And so those kinds of use cases are driving that data. Like I said, the data-driven aspect is when they can map the data as an enabler to their mission.
The other side of the camp, and this is one that I've seen that's very interesting, is that you also see data-driven being used on, let's call it, the more mundane enterprise side of the case. There's a great example going back a number of years where the Air Force kicked off one of the very first AI projects, and it was actually on looking at their own internal contracts and acquisition process, and looking at reducing inefficiencies and redundancies in the contract language and the procurement processes and how they go about doing contracting.
And so their data set was their own contracts database and acquisition process. The outcome was reducing, like I said, redundancy, over-expenditures, and just some of the complexity of having multiple terms where they didn't need them, where everybody did their own special contract when it wasn't needed.
Finding all of those unique things and getting that out of the process to streamline how they do contracting. So it wasn't the Air Force flying-planes part of the mission. It was, how do we get better at buying things by using AI and data analytics to reduce inefficiencies? And so it's an example of where data drove massive efficiencies and better use of dollars and of budget. At the same time, it did effect actual change in that it impacted the mission: they can now buy things quicker, which means they can procure the planes or the equipment they need in a more efficient way. And so it's not always on the tippy edge of the spear of affecting the actual mission.
Some of those backend things are equally important and really equally interesting.
Adel Nehme: And in a lot of ways, they play into each other because, as you mentioned, if you're able to gain efficiencies from a data application on a backend process that saves you costs and streamlines a particular process, you're able to even allocate better funding or more funding to achieve your mission at a much more effective pace.
So let's start unpacking maybe how to get to a data-driven state, or how agencies can approach their data journey, right? Of course it all starts with data collection and ingestion. You mentioned here that government agencies are in a lot of ways facing a lot more challenges than other organizations in different industries when it comes to data volumes, the amount of data that they need to grapple with.
And of course in the government setting as well, there's the thorny issue of data privacy and regulation. How do you manage that particular conversation, right? So maybe can you discuss the challenges when it comes to data ingestion and creation in government in a bit more detail, and what that looks like?
Steve Orrin: So it's a really good thing, and we talk about sort of the life cycle, if you will. A lot of folks wanna skip to the end and say, let me do AI, 'cause that's the sexy thing. I want AI everything. And even when there's a good use of AI, people often forget that the heavy lift is everything that leads you to that, which is the data wrangling, data curation, and data management, which starts with ingestion and data set management, all the way through.
I like to use the analogy of an iceberg. The machine learning, AI, and all that algorithmic part is the tip of the iceberg you see. That stands on the shoulders of the large iceberg of all the work that went into getting you quality, usable data. And what we find often in the government agencies, and I think you'll find the same in the private sector as well, is that while everyone's looking at that north star of what can I do with the cool AI du jour, the heavy lift and the work has to go into the data, and that's always the long pole in the tent.
What we've seen is that a lot of the experimentations that did some really cool stuff in the lab never crossed the chasm, because they didn't plan for how you actually get the right set of data and scale that to operationalize it when you get out of the lab, where you had your own little laptop and could do all sorts of fun things in PyTorch.
How do I operationalize it? How do I scale that when I don't have the data management infrastructure and the data architecture to support it on the outside? And so a lot of organizations have gone through that, learning the hard way: finding that really cool project, funding a lab experiment, and then it never transitioned to practice.
It never got out of the lab, or when it did, it failed miserably. And so what you see is an investment in data architecture and data services and data curation, that investment in getting the data into a quality enough place and an accessible enough way for the advanced machine learning and analytics and AI to happen.
So there's a lot of work being done on, you know, you'll hear things like digital transformation and modernization, which are buzzwords that are used. But really at its core, it's how do I get my infrastructure ready to handle the data, to do data ingestion at scale. No one can put all their data in one big lake.
So it's, how do I have the micro lakes, the data lakes, the data swamps, whatever you wanna call them, where I can get my processing to the data, I can do curation at different domains, and then connect them together in an efficient way to be able to drive the analytic problems in the government?
As you mentioned, regulations and policies can often get in the way of dynamic data sharing. I mean, you think about how the social media networks have a huge repository of data that they can do all sorts of fun training on. But when you get into a government, and you'll even find this in regulated private sectors like healthcare or financial services, data sharing becomes hard because now you've got to anonymize PII,
or, in the case of government data, sometimes you have to deal with classification of data, sensitivities, foreign data, local data. So those challenges present themselves in unique ways: how do I do analytics across data sets where the data sets can't be connected? And there are some novel approaches that people are looking at, doing multiple types of training: iterative training and feedback training.
So there are new models for how to do those multi-data-set training approaches, where I don't necessarily have the massive data lakes that you find at the Facebooks and Googles, but I actually can get really good results when I get domain-specific, well-curated data sets and then do some spot training on the different data sets as I move on.
I think that's where a lot of the work is, and you hear a lot of the talks now where people are talking about how data governance is gonna be one of those key unlocks for public sector data architecture and data management. Whereas in other industries, you know, we always joke that data governance is like security:
it's the bolt-on after the fact, which we wish was better. In the government space, data governance is actually becoming part of the forefront, and you see it in a lot of the policy. So whether it be the DoD policy that was put out around AI, or the same thing on the civilian side and data.gov,
in their policy data governance is right there, you know, top five with a bullet, as one of the things that has to happen to actually be able to do the kind of things we want to do with data. And so data governance, being able to apply those policies at the early stage of where you ingest from, how you make sure you have quality data without bias, and how you label it and curate it to get to the exciting stuff, is where a lot of the work is happening now.
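To make the idea of training across data sets that cannot be pooled a little more concrete, here is a minimal Python sketch. It is only one possible reading of the "iterative training" and "spot training" Steve mentions, not a description of any agency's or Intel's method: a single small linear model visits each siloed data set in turn and is updated in place, so the raw records are never merged into one lake. The silo names, the toy data, and the helper functions `sgd_pass` and `train_across_silos` are all hypothetical.

```python
# Hypothetical sketch: update one model on several siloed data sets in turn,
# without ever merging the raw records into a single data lake.

def sgd_pass(weights, bias, records, lr=0.05):
    """One pass of stochastic gradient descent for a linear model y = w.x + b."""
    for features, target in records:
        prediction = sum(w * x for w, x in zip(weights, features)) + bias
        error = prediction - target
        # Gradient step on squared error for this single record.
        weights = [w - lr * error * x for w, x in zip(weights, features)]
        bias -= lr * error
    return weights, bias


def train_across_silos(silos, n_features, rounds=200):
    """Cycle through the silos; only the model parameters move between them."""
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(rounds):
        for records in silos.values():
            weights, bias = sgd_pass(weights, bias, records)
    return weights, bias


# Made-up data that follows y = 2x + 1, split across two imaginary agencies.
silos = {
    "agency_a": [([0.0], 1.0), ([1.0], 3.0)],
    "agency_b": [([2.0], 5.0), ([3.0], 7.0)],
}
weights, bias = train_across_silos(silos, n_features=1)
print(round(weights[0], 2), round(bias, 2))
```

In this toy setup the only thing that crosses the boundary between silos is the pair of model parameters, which is the property that makes approaches of this kind attractive when the underlying records cannot be connected.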
Adel Nehme: That is really fascinating. And connecting to our conversation on agency mission, an additional wrinkle a lot of times is timeliness, right? For example, you may have an agency like FEMA, who in the time of a hurricane or a natural disaster really needs timely data to be able to act effectively and to save people's lives.
Right. So how do you advise agencies that have that particular wrinkle to their mission when trying to understand how to best solve for those data management and data curation challenges?
Steve Orrin: So let's pick on the FEMA example. It's a good example to talk about the aspects of a lot of those disaster recovery, search and rescue kind of missions. There's two data sets you're dealing with. You're dealing with the data from last week or last month, before the hurricane: what the coastline looked like, how many houses, how many buildings, what were the streets, where were the cameras. All of that information is the current situational awareness that you knew up until the point that the hurricane hit.
And then you have post-hurricane, and now suddenly you need to be able to get eyes on the street. What's the reality? That's for the real-time mission around search and rescue, and then the longer-term mission around recovery. And so what you'll see is the application of advanced technology like drones, being able to go out and scan the environment.
Being able to recognize what cameras are still in place, in the form of surveillance cameras, cameras on buildings, things that they can get access to that are still operational, and being able to see what's the current state of the streets. Where are good local meeting points that are stable enough for people to do the search and rescue missions?
And so the advice is, you need to have that standard data curation for how the world existed beforehand, and that's managing your data. And then you have to have a system that's capable of pulling in dynamic, less structured data, 'cause you don't necessarily know the formats from a drone versus a camera that was on a McDonald's or things like that.
And so having the ability to take in less structured data and quickly be able to categorize it and say, okay, I've got feeds from different areas, geolocated. It requires a slightly different approach to how you manage that data to be able to get you what you need in real time. And one of the things you recognize as well is that you probably wanna have really good quality data about what the environment was right beforehand,
so the Google Earth views and other kinds of sensor data that you've collected around an environment, especially in a hurricane-prone section like Florida or in a wildfire area in California. You also recognize that you want to get multiple sensors that may not have the same quality or the same fidelity when you're dealing with the real time.
And so you may not have as good quality images, but you can compare. And that's one of the benefits of having access to both data sets. You may not have the best quality data from the drone that's flying over. Maybe it's an amateur drone. It's not the expensive satellite from Google.
But the flip side is you have something to compare against. So as long as you get that data available, you can enrich the quality of your data by knowing what was there before and being able to detect whether it's what you're seeing now. And so a lot of it is having that ability to stream in data in real time, with a different set of governance controls, in order to be able to affect the mission they have on hand.
The other side then is that prolonged mission of being able to do recovery, and how do you help people figure out, you know, is the house salvageable? Things like that. That requires a combination of human and machine. So having the sensors and the things to be able to collect the information as you have it,
and augmenting that with people going through and labeling the data and providing the human expertise. And at the same time, as we look towards the future, not today, but in the future, taking that expertise of that labeling and then training models to get better at recognizing those things at scale in the future.
And so that's where the human-machine component is still a factor, especially in that data curation, as we look at affecting mission towards the future.
Adel Nehme: This is really great insight. And we mentioned here earlier in our discussion, you know, outside of these different challenges really relevant to the agency, there's also the entire data privacy conversation. Maybe to give listeners a bit of an overview, walk us through the landscape of data privacy challenges government agencies face when collecting data, curating data, and trying to operationalize certain use cases.
Steve Orrin: It's a big challenge because different agencies will have different, what they call, authorities on what they can do with the data. And it's very similar to what you would find inside the private sector. The challenge for the government is everyone sees the US government as one big entity, when in actuality it's lots of entities.
And those entities are representative really of all of the private sector. You have financial services functions in the US government. We talked about the IRS. CMS, the Medicaid and Medicare programs, is one of the largest payers of reimbursement. So they have that. The Treasury and others are dealing in financial transactions and regulatory things for financial.
For healthcare, I mentioned the VA and the DHA, Defense Health, which are providing healthcare. So they have PII and have controls. Law enforcement has a certain authority about being able to capture data, but only under certain conditions, like with warrants and with probable cause, which is different from, say, when you're talking to urban planning and you're giving your information, opting in. So the data governance is gonna be very much specific to that agency and the authorities that agency has. Just like you would not expect your healthcare provider to be sharing data with your financial institution, because they have different authorities, in this case regulations, controlling the privacy of that data.
We have the same thing on the government side. And so what you'll find is that the data governance and the policies are applied both to the collection, and a lot of the time the opt-in and the publicized use of how the data is used, but also in the background to how that data can or can't be shared, or when it is shared, how it's anonymized.
And that's done by policy. And that's one of the things where I think the government has somewhat of a leg up on the private sector: there are very well-defined governance policies on what you can and can't do. And so, because those policies are 50-plus years old, predating a lot of the technology, it's ingrained into the system that I need to anonymize my data, or I can't hold this kind of data, or I can't request this kind of data.
Then they've built the systems with that in mind. Now the flip-side problem is that it becomes harder to share data across agencies, and it sometimes takes unnatural acts to figure out how to do information sharing when it's legitimate. One of the best examples is around cyber information sharing.
When we're looking at cyber attacks across different agencies and across different sectors, by government mandate, you're supposed to share that information, but now you have to anonymize a lot of it and be able to share just the information that's relevant to that cyber event or that indicator of compromise.
Those systems were typically built with an eye towards, well, I can't share anything 'cause it's government data that was obtained on a warrant in the case of law enforcement, or it's data from a vet, which is covered by privacy controls around healthcare information. And so sometimes you have to figure out how to work within those regimes and provide the right assertions and attestations that the data has been kept private.
And that's where, concepts we're hearing about now are starting to take flight, what we call privacy by design. Where people are building their, their tools and their architectures to build privacy in. Now this is mostly focused on sort of the next wave of data acquisition and data management, where they're, looking at the ingestion process.
And some of the automatic labeling will add context, will add metadata around the policy or the regulatory regime that governs this data, so that as it's moved through the engine, that policy sort of lives with the data. So if it's VA data that's being collected, it can be tagged as veteran healthcare information.
And so even if it's going into a system that's gonna look at payer fraud problems, the patient data has been tagged and is therefore separated from whether or not the payments went to a legitimate location. That's the way the new models work. And we're seeing this not just in the government.
We're seeing it in the private sector. We're seeing it in a lot of the more regulated environments, social environments, like healthcare providers doing telehealth, designing those tools to tag the data as part of the ingestion so that you get better controls on the governance side.
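As an editorial illustration of the tagging idea described above (it is not something shown in the conversation), here is a minimal Python sketch. The source names, regime labels, field names, and the `TaggedRecord`, `ingest`, and `non_pii_view` helpers are all hypothetical; a real system would take these from its own governance catalog.

```python
from dataclasses import dataclass

# Hypothetical mapping from source system to the regime governing its data.
SOURCE_REGIMES = {"va_health": "veteran_healthcare", "payments": "financial"}
# Hypothetical set of field names treated as PII in this sketch.
PII_FIELDS = {"name", "ssn", "date_of_birth"}


@dataclass(frozen=True)
class TaggedRecord:
    payload: dict          # the ingested data itself
    regime: str            # policy/regulatory regime that travels with the data
    pii_fields: frozenset  # which keys in the payload are considered PII


def ingest(source: str, payload: dict) -> TaggedRecord:
    """Attach governance metadata at ingestion time, not after the fact."""
    regime = SOURCE_REGIMES.get(source, "unclassified")
    pii = frozenset(key for key in payload if key in PII_FIELDS)
    return TaggedRecord(payload=dict(payload), regime=regime, pii_fields=pii)


def non_pii_view(record: TaggedRecord) -> dict:
    """Return only the fields that were not tagged as PII."""
    return {k: v for k, v in record.payload.items() if k not in record.pii_fields}


record = ingest(
    "va_health",
    {"name": "Jane Doe", "ssn": "000-00-0000", "claim_id": "C-1", "payee": "Clinic A"},
)
print(record.regime)
print(non_pii_view(record))
```

A downstream fraud-analysis job could then work from `non_pii_view(record)` while the regime tag stays attached to the full record.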
Adel Nehme: That's really great. Diving a bit deeper into privacy by design, what are some of the key considerations and best practices that government leaders and data leaders in this space should think about when handling sensitive information, secrets, and personally identifiable information?
Maybe walk us through what privacy by design looks like throughout the data life cycle.
Steve Orrin: Good question, and it's not one size fits all. I wish it was a silver bullet where I'd say, here, push this button, you get privacy by design. It's more of a process than a set of technologies, but again, it looks at that whole life cycle. So as you build, as you're doing everything from the data set management to data ingestion, it's being able to assert governance at the very beginning, and that's one of the key changes around privacy by design.
It's not waiting until the data's in and being used that you start applying governance. It's governance at the start. So one technical implementation is around tagging: tagging the data with what type it is or what control space it's in, and being able to separate the actual PII from all the other ancillary data.
One of the big problems we've had in a lot of data ingestion is that it's all a bunch of data and you over-classify it: well, this is all PII. Really, it's not all PII, but it was all collected as part of a PII exercise. And so part of it is being able to separate what's actual PII from what can be shared more broadly, because that'll make life easier downstream when you're actually starting to look at some of the analytic use cases.
So step one is to make sure to both tag the data with the right policy controls and governance, and also to characterize your data on ingestion so that you can apply those policies as it goes through. The other aspect of privacy by design is that it's not just on the data, but also on the tools, on the products, and building them to understand and accept privacy.
Privacy by design is a great concept, but it's not only that; security by design is also part of that package. That means having controls in place about who can access the data. So that's not just the data flowing and the controls on it, but controls on the user, whether that user be an administrator, a data curator, a data labeler, or an algorithm developer who is building the next cool large language model.
Having controls built into the tools that apply to my access to data means strong authentication and being able to apply access control policies about how much of the data I need to see. And this is where some newer technologies are being looked at: confidential computing and homomorphic encryption are ways of maintaining security, control, and privacy of the data while still allowing someone who maybe isn't allowed to have access to the data to perform actions on the data.
That's really one of those interesting future cases the technology is starting to enable today, where I can actually break that problem set of having an algorithm developer who wants to do analytics and wants to test a model against data that they don't actually have rights to see, cuz it's either classified data or it's private data from a PII perspective, where it's under regulatory control and can only be seen by someone who has a legitimate need to know from a warrant perspective.
Being able to do analytics on the data without compromising the actual data itself is one of those next-generation properties that confidential computing, homomorphic encryption, and some of the other security models are enabling, to let privacy by design scale better. But in a lot of the current implementations, it's both access control applied to the data through data labeling, and the tools that you're using enforcing those access control policies as you interact with the data.
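To make the "tools enforce the policy attached to the data" point concrete, here is a small editorial sketch in Python. It is plain policy enforcement, not confidential computing or homomorphic encryption: the data is not encrypted, the tool simply refuses row-level reads to roles that are only cleared for aggregates. The role names, regime label, and `ROLE_POLICY` table are assumptions made up for illustration.

```python
# Hypothetical policy table: role -> regime -> permitted access level.
ROLE_POLICY = {
    "data_curator": {"veteran_healthcare": "read"},
    "algorithm_developer": {"veteran_healthcare": "aggregate_only"},
}


class AccessDenied(Exception):
    """Raised when a role is not cleared for the requested operation."""


def read_rows(role: str, regime: str, rows: list) -> list:
    """Row-level access: only roles with full 'read' clearance may see records."""
    if ROLE_POLICY.get(role, {}).get(regime) != "read":
        raise AccessDenied(f"{role} may not read {regime} records")
    return rows


def mean_of(role: str, regime: str, rows: list, column: str) -> float:
    """Aggregate access: returns a statistic without exposing individual rows."""
    level = ROLE_POLICY.get(role, {}).get(regime)
    if level not in {"read", "aggregate_only"}:
        raise AccessDenied(f"{role} may not analyze {regime} records")
    values = [row[column] for row in rows]
    return sum(values) / len(values)


rows = [{"patient": "a", "cost": 100}, {"patient": "b", "cost": 200}]
print(mean_of("algorithm_developer", "veteran_healthcare", rows, "cost"))
```

In this sketch the developer can compute the mean cost but a call to `read_rows` with the same role raises `AccessDenied`. A production system would also need to guard against aggregates over very small groups, which can leak individual values.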
The other big challenge, and this is something that a lot of people are working on today, is when you start aggregating data together in data lakes and data oceans, where you're pulling data from different sources that may have different regulatory requirements. Even if you can apply the governance within the data management system, when you start applying analytics or models that go across data regimes, what's the outcome of the model that's trained? Does a model trained on PII data itself become a PII model? These are questions that the AI policymakers are struggling with right now.
I think there's a tendency to over-classify, to say everything's gonna be PII, because some of those models do expose information if you ask them the right questions. And so we're all learning. I think the industry is learning, and I think policymakers are trying to catch up and work out what the right policies are.
But from an architectural perspective, one of the things that I think helps certain government agencies is that the regulatory or policy regime applies within their domain. So the IRS is governed by IRS policies for its data, and so it can operate within its data.
It's when they start sharing with other agencies that it gets complicated. One example is in the US, where the Affordable Care Act presented some unique challenges because you had your healthcare data, so the fact that you were on the program, and your IRS data, showing that you were eligible or that the payments could go through.
So there were some backend connections. And what they had to build was some policy brokers between the IRS systems and the CMS systems to be able to make those linkages without compromising privacy in either direction. That was a really interesting approach: creating data governance brokers that basically applied policy and access control across systems without having to re-engineer systems, some of which had been around for a long time and weren't necessarily ready for data governance beyond what they were already doing within their agency.
I think that model is starting to be used in other places as we look at more community sharing across agencies. Again, you can't go and throw out your systems and start with a green field. That's just not reality. And so while the organizations themselves are evolving and modernizing their infrastructure, they're seeing this data governance broker concept, whether it be in the form of a CASB or certain other technologies, such as SASE, to apply policy on the seam between systems.
That allows them to do the data sharing in a controlled way without having to break everything that they're doing on their enterprise side.
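The broker pattern described above can be sketched in a few lines of Python. This is an editorial illustration only, not how the actual IRS or CMS integration works: the `TaxSystem` class, the `PolicyBroker` class, the allowed requester and purpose pair, and the income limit are all hypothetical. The point is that the broker sits on the seam, answers a narrow question, and never hands the underlying record to the requesting system.

```python
class TaxSystem:
    """Stand-in for a legacy system that is left unmodified."""

    def __init__(self, incomes: dict):
        self._incomes = incomes

    def income_for(self, person_id: str) -> int:
        return self._incomes[person_id]


class PolicyBroker:
    """Applies policy between two systems and discloses only a yes/no answer."""

    # Hypothetical allow-list of (requesting system, purpose) pairs.
    ALLOWED = {("health_system", "eligibility_check")}

    def __init__(self, tax_system: TaxSystem, income_limit: int):
        self._tax = tax_system
        self._limit = income_limit

    def is_eligible(self, requester: str, purpose: str, person_id: str) -> bool:
        if (requester, purpose) not in self.ALLOWED:
            raise PermissionError(f"{requester} is not cleared for {purpose}")
        # The raw income never leaves the broker; only the boolean does.
        return self._tax.income_for(person_id) <= self._limit


tax = TaxSystem({"p1": 30_000, "p2": 90_000})
broker = PolicyBroker(tax, income_limit=50_000)
print(broker.is_eligible("health_system", "eligibility_check", "p1"))
```

Because the policy lives in the broker, neither backend system has to be re-engineered to take part in the exchange.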
Adel Nehme: That's incredible. I love that example of the Affordable Care Act because it's always really interesting to see the data implications and the data process behind major political decisions such as the Affordable Care Act, and what that means from a data sharing perspective. Now, we've been talking about the data ingestion and data curation side of things, which is at the beginning of the data science lifecycle.
But let's talk about the end of the data science lifecycle, a bit more on the sexy stuff, as you mentioned, like machine learning, AI, and data science. Based on your experience working with government agencies that are operationalizing machine learning and AI use cases, what are the key success factors for the development and deployment of these machine learning applications in the public sector?
Steve Orrin: So, Adel, let's look at it from two perspectives. I wanna give some examples of where they've been very successful, and really I think there are three factors that drive the success of those projects. So number one is when they choose an impactful mission. The most successful ones are where, if they do the work and they spend the time and they build something, it means something to the people who aren't AI scientists.
It affects the mission and allows them to do more, do it better, do it faster, do it more at scale. Whatever you're doing, make sure it matters. The other key lesson learned is to make sure it's not the most important thing you're doing, cuz you're gonna screw some things up, and some things are not gonna work.
So what I suggest is you don't take the most important project and start there. You take the second or third most important project and you target that. So there'll definitely be people who can see the value, but you're not gonna disrupt the agency or disrupt the mission if something doesn't go right or if you don't get the right accuracy out of the gate.
So you pick what I call the medium-level project: not the most important project, but it has to be medium. If it's some sort of throwaway project, then no one's gonna care to take it to scale. So step one is to pick the right project that affects mission. And then the other thing is, a lot of data scientists get what I call analysis paralysis.
They want the perfect algorithm with 99.999% accuracy, and they wait for it. That's not gonna be reality. You'll spend 12 years designing the perfect algorithm for a very small niche problem and it will never see the light of day. So one of the things is to get out there and try things. And that's why you pick a medium-level project: if it's not accurate enough out of the gate, it didn't hurt anyone, but it can start to show the relevance to mission.
One example, and I'll pick on forestry, is a proof of concept they did on drones. One of the challenges of forestry is they gotta go out to all the national forests and check the health of the trails and check for blight and disease on the trees. And of course now in California, a lot of it is checking for potential hotspots of future wildfires.
That's a very labor-intensive, people-walking-trails kind of thing, with clipboards or even a laptop or a tablet, taking stock of what they see. And then you're only getting coverage of the areas that someone can physically walk. And so one of the projects they tried was to help scale those folks.
Not replace them, but scale them, by having drones fly through the forest with a recognition algorithm and AI in the drone to look for blight on trees. And so, number one, they got better coverage, cuz drones don't have the same limitations as walking through the brush and can navigate around obstacles.
They were able to send them out semi-autonomously: you have this geofenced region, go scan. So they got much better coverage. And while the drone wasn't 99.9% accurate, if it could identify a blight with 80% accuracy, it meant that someone could go there and check it out in person, and that reduced the amount of time before somebody detected it.
Because by the time it's a broad enough problem that someone on a trail would've seen it, you now have a real issue of blight. Whereas having the drone be able to detect what looked like blight early enough and get a human to verify it allowed them to reduce that window of exposure significantly.
And so even though the model wasn't a hundred percent accurate, it didn't need to be. It actually affected and enhanced the mission in an operational way, and that was a huge success. The other thing about it is that it didn't require massive data sets. There were plenty of good online and recorded data sets of what blight looks like and the different styles of trees.
So it was a good narrow problem set that they could go after and didn't require a whole rearchitecting of the entire infrastructure. So it was a medium level problem with a medium level of investment, but they were able to affect the actual mission fairly quickly with something that was good enough.
And that was, I think, one of the key successes there. When we look into other parts of the government, there's a variety of use cases around data. When you think about it, the US government, and really all of industry, is seeing the sensorification of everything. Everything's got a sensor: it's a camera, it's an RF signal. There are lots of sensors out there, and how to make sense of all that is a key challenge.
And so what we're seeing is folks looking at ways to look across those different domains and get better situational awareness. Think smart cities. How do I do traffic management within an environment? Well, I've got cameras on all the different stoplights. I've got sensors that are doing traffic flow analysis.
I've got time of day analysis and backend data on where things are going and telematics information. So how do I merge that all together to get a model of whether there are congestion spots where we need to do new infrastructure projects, or whether there are some optimizations in just the stoplights that I can do to get better flows at certain times of day?
There's an example out of the East Coast where one of the things that we're looking at is being able to change the traffic flows based on a game letting out of a stadium, and being able to adjust the traffic lights both to handle the Uber and Lyft kind of traffic that's gonna be necessary to take people home and to deal with the large crowds leaving the stadium, going to their cars, going to the street, adjusting so that you get more efficient exit from and entry into those environments.
All of that was done through the sensorification and AI analytics of those data sets.
Adel Nehme: Those are very, very interesting use cases. I love the use case on the blight as well. It's a very interesting application of machine learning. What I wanna do now is shift gears slightly and discuss a global perspective on how government agencies across the world leverage machine learning and data.
We've been speaking a lot about US use cases, but I think what's interesting here is that different countries have different standards and considerations when it comes to data privacy, data collection, and the deployment of machine learning systems and applications. In Europe, we have GDPR. In the United States, we have HIPAA, for example, for healthcare.
In China, there's a pretty different approach to data privacy as well. In a lot of ways you have a unique vantage point, in that you consult with a lot of agencies globally. Maybe walk us through how big the differences can be between countries. How does this alter the type of use cases that can be deployed?
And how does your approach adjust when working with different government agencies globally?
Steve Orrin: So, Adel, it's a huge problem. I think we'll pick on the EU as an example. GDPR is the one that everyone knows about, but there are other regulations in place in the EU around citizen sovereignty. The data of each individual EU nation needs to stay within that nation. So your citizen data in Germany versus citizen data in Brussels versus citizen data in France needs to be either geolocated within that environment, or there are special safe harbors that can be created when the governments on both sides agree that they can have that infrastructure in place for that citizen data.
That creates huge hurdles for broad information sharing and information transfer. Now, GDPR helps because it provides a consumer right around that data: you have the right to be forgotten.
You have the right for them to protect your data, and that operates on the consumer level. But on the government services side, there are very strict controls on where that data can reside and how that data can be processed. For government services, that works pretty well, except in places where you have these distributed clouds.
So you have a cloud that's gonna have data that could be across the world. And so one of the things that the cloud providers are doing is providing regions in the right locations with governance controls around where that application data can flow, in order to satisfy those government regulations.
In the US we have GovCloud, which is a specific set of cloud regions for the federal government. Those applications must reside within those regions and they can't move; they're physically blocked from it. You'll have the same thing in the EU. You'll have the same thing in, say, India or in the US, where they have geopolitical boundaries around data use.
Where it becomes really hard is when you look at a lot of these AI use cases that work best when trained across data sets. Think of the large social media companies, but even cross-government agency collaboration is sometimes hampered by the local data governance and data sovereignty laws.
Now, what we're finding is cooperation agreements at the government level for the sharing of certain types of data to basically help both sides. We're seeing that in the EU with the US, with the Five Eyes, and with other countries on specific data domains. COVID was probably one of the best examples of this, sadly: being able to share epidemiology information and information about the effectiveness of treatments across these geopolitical boundaries required acts of government and information sharing agreements in order to enable that, and to enable it with private industry so that the pharmaceutical companies could make sure that the vaccines were effective across different regions with different populations.
And so we're seeing that the modern era is stressing a lot of what we think of as good regulations, and they are good for protecting data. But ultimately, as we drive towards a data-driven world, not just a data-driven organization, it's gonna require those more cooperative agreements for sharing the right data.
I think what's helping is that no one is trying to do broad data sharing, like, I'm just gonna give you the keys to the kingdom and share everything. It's gonna be very specific, for a specific domain and a specific use. Healthcare and science collaboration is gonna be a key area where you'll see those cooperation agreements.
A lot of this is done through universities and others, where they create a safe harbor, a safe environment for data to be shared for particular uses. And they can put strong controls on who can access it, which goes back to privacy by design and data governance by design around who gets access to that data.
They use brokers to be able to make sure that only legitimate folks with the right credentials are accessing the data. We're seeing a similar problem when we talk about national defense on a global scale. When you're looking at, whether it be the cyber crime activists, how different government agencies and law enforcement globally are working on trying to share data on what we're seeing in the U... you know, what Interpol may be seeing as far as a ransomware gang versus what the US or someone in South America is seeing, it's been a challenge as governments figure out how to share that information in an effective way. What I think you find is that when there's a strong need, that becomes a strong will.
I wouldn't say that it's operational in the sense that we have a working mechanism where governments share data easily, but for those specific cases where there is a strong desire and a strong need, and it's a good outcome that everyone collectively sees, will makes the way, and you find these cooperation agreements or special data sharing arrangements that enable the sharing of the right kind of data to affect that mission. Like I said, in the case of ransomware or other cybercriminal activities, the need is so high. In the case of epidemiology for COVID, the need was so high globally.
We need to work more on how we operationalize those so that we can do more data sharing. Cuz the traffic pattern analysis that we're doing, let's say, in Sacramento can probably be equally good for Berlin and vice versa.
And how those two organizations, those two entities, could share that telematics data with each other to make each more efficient is a challenge we haven't solved yet.
Adel Nehme: That's very insightful in a lot of ways. You mentioned that when there's a strong will, there's a way, right? We find that in crises, a lot of the time, years of progress happen in days, as we saw with COVID when it comes to data sharing. As we close out our conversation, Steve, this has been a very insightful discussion, and there's one topic I'd be remiss not to discuss with you.
You touched on training large language models earlier in our discussion. How do you see tools like ChatGPT, generative AI, and large language models affecting government in the future? What type of applications do you foresee for these types of tools in a government setting, especially given their black box nature as well as what we discussed in terms of data privacy and the importance of making sure that citizen data is protected in these types of applications?
Steve Orrin: So it's a hard question. I'll say I wanna separate out the large language models that we all know, like the ChatGPTs, the Bards, and the others, which are these massive billions and billions and sometimes trillion-node large language models. OpenAI and Facebook and Microsoft and others have access to that kind of large data to train those models.
The US government or any other government may not have that level of openness, nor do they wanna share their intrinsic data with an external model like that. So let's separate those two. I think everyone sees that this ChatGPT style, the generative models, the transformers, are going to change how we do things.
They're gonna enable all sorts of use cases that we're just starting to scratch the surface on, whether that be better citizen services or more automated chat instead of listening to a dial tone and waiting on the line. So, you know, when you need help with your IRS taxes, having something that's actually better than someone with a call sheet interact with you is a benefit.
There's also a use case around data quality and code quality. There's a lot of work being done on prediction and being able to do prediction models. So there's a lot of excitement. It's the practical application side that we're still getting our heads around.
And what I'm seeing, and what certain organizations are seeing, is that there's actually a benefit to, instead of these huge large language models, doing more of the domain-specific language model approach: providing that generative model trained on a more domain-specific dataset. Because again, ChatGPT is trying to be everything to everyone.
It's gotta be able to beat the MCAT scores and tell you what Macbeth is really all about and everything in between. Whereas if I'm looking at a language model approach for a specific subdomain, you can actually reduce the overall size and get pretty good results without having to spend the millions and millions of dollars that an OpenAI or a Microsoft would typically spend.
And so what we're seeing is that the practical applications of large models are gonna be very domain-specific. And that means I can bring it in-house, or at least into a GovCloud environment, and focus it on the problem, whether that problem is looking at the common interactions that someone has when having trouble submitting their taxes or filling out their healthcare forms, or providing better prediction analysis on infrastructure and when it's gonna fail.
I think one of the really cool areas is gonna be looking at contracts and how we can get better contracts by predicting which contracts led to successful programs versus others. There are a lot of really cool efficiencies that we will get. And I think we'll also start to see them applied in ways we can't even imagine today when we start training them on the key problem sets that are affecting government.
I think that when you look at the policy side, there are a lot of people who are worried or scared. They see ChatGPT and it's like, oh my god, it's gonna take over, it's gonna eliminate jobs, or it's gonna do nefarious things I don't understand.
And I think that's just a lack of understanding of the technology, and the fact that a lot of times regulations trail technology, and when the technology is changing every couple of weeks, it's hard to keep up. And so I think that the policy regulators are working at trying to get their hands around what the ethics and the biases around these kinds of large language models are.
And it's hard because a lot of them are black boxes. You know, you don't know why it gave you that answer, and sometimes the answer is weird. And so I think, as people in the industry and as data scientists and experts, it's incumbent on us to help policy regulators and to educate the legislatures and the policymakers and the influencers to understand what the technology does and what it doesn't do.
Let's call them policy technologists. Spend time with your representatives and with your lawmakers to help them understand how the technology works. What are some good policies versus bad policies? There are a lot of knee-jerk reactions right now that aren't necessarily gonna be useful, and some of them can't even be applied because the technology doesn't work the way the policy thinks it does.
And so I think as we look at this adoption of these next generation transformative models and everything that follows beyond that and around graph neural networks and some of the really cool things that we're gonna be able to do in the future, we need to help the regulators and help the policymakers keep track of what this technology means and how best to leverage it.
As opposed to always going with the, well, let's try to block the technology or try to keep it from becoming Terminator, which is not gonna happen.
Adel Nehme: Steve, as we wrap up our conversation, do you have any final call to action before we end today's chat?
Steve Orrin: Step one is to pick a project that actually is important and makes sense to your organization. Do the work upfront to get the data into the right format, do that data curation and data wrangling to enable the cool AI and machine learning approaches, and go off and try things, get them out in the field, test them in real-world environments, and see what works and what doesn't.
For a lot of the data scientists and AI folks I work with, that iterative learning is not just for the model; it's for them as well, learning what works and what doesn't, and trying different models and approaches to solve the problem. I think one of the things about the exciting time we're in is that it doesn't require an advanced PhD in data analytics and data science to play with some of these technologies and tools.
And so I would encourage everybody, even policymakers, to get online and play with the tools, see what they can do, and understand that things like PyTorch and some of these other Jupyter books are not rocket science. So you can actually, with a little bit of coding experience, do a lot. Even ChatGPT doesn't even need you to have coding experience.
You can just go online and interact with it and learn what it means to work with these tools. And so they're not frightening. They're really exciting, and I think all of us can benefit from getting more experience with these technologies.
Adel Nehme: Thank you so much, Steve, for coming on DataFramed.
Steve Orrin: Thank you.