
How to Build AI Your Users Can Trust with David Colwell, VP of AI & ML at Tricentis

Richie and David explore AI disasters in legal settings, the balance between AI productivity and quality, the evolving role of data scientists, the importance of benchmarks and data governance in AI development, and much more.
Nov 17, 2025

Guest
David Colwell

David Colwell is the Vice President of Artificial Intelligence and Machine Learning at Tricentis, a global leader in continuous testing and quality engineering. He founded the company’s AI division in 2018 with a mission to make quality assurance more effective and engaging through applied AI innovation. With over 15 years of experience in AI, software testing, and automation, David has played a key role in shaping Tricentis’ intelligent testing strategy. His team developed Vision AI, a patented computer vision–based automation capability within Tosca, and continues to pioneer work in large language model agents and AI-driven quality engineering. Before joining Tricentis, David led testing and innovation initiatives at DX Solutions and OnePath, building automation frameworks and leading teams to deliver scalable, AI-enabled testing solutions. Based in Sydney, he remains focused on advancing practical, trustworthy applications of AI in enterprise software development.


Host
Richie Cotton

Richie helps individuals and organizations get better at using data and AI. He's been a data scientist since before it was called data science, and has written two books and created many DataCamp courses on the subject. He is a host of the DataFramed podcast, and runs DataCamp's webinar program.

Key Quotes

One of the things that AI has done is it magnifies all of the efficiencies and inefficiencies of your organization. We've been helping our customers shift testing left. It means move it earlier into the cycle. AI accelerates that. Customers can shift testing left quicker. They can do more of their testing earlier. They can automate sooner. They can get their requirement validations, they can contribute in peer reviews. All of this is done earlier, but it doesn't change the fundamental equation. If you're rolling out a ton of AI code and you have bad security practices, now you've got worse security practices because you're doing more of it in an ungoverned way.

Every person in your organization is probably telling you ‘be more efficient, be more productive.’ But it's a dangerous trap to fall into because we mistake the volume of output for completed work. As organizational leaders, we need to take a step back. A person that's generating 500 Confluence pages talking about how they're doing their work and their great plans is probably not generating 500 pages' worth of value, that's generating 500 pages' worth of stuff. You should start to measure what is actually important to you.

Key Takeaways

1. Avoid over-reliance on AI-generated content by implementing robust validation processes to ensure accuracy and relevance, especially in high-stakes fields like legal and compliance.

2. Shift focus from output volume to meaningful outcomes by measuring velocity and value, rather than sheer quantity of AI-generated work.

3. Enhance data governance practices to prevent unauthorized use and leakage of sensitive information, ensuring AI models are trained and validated with clean, compliant data.

Links From The Show

Tricentis 2025 Quality Transformation Report

Transcript

Richie Cotton

David, welcome to the show.

David Colwell

Richie, pleasure to be here. How are you today?

Richie Cotton

Life is good. I'm trying to keep things jolly. Just to kick things off, I like a good disaster story, so we can joke about it. What's the biggest AI disaster you've seen?

David Colwell

I have been in the field for a little while, so I've got stories on both fronts. When it comes to the biggest AI disaster story that I've seen, I think my favorite is every time lawyers try to use artificial intelligence for their cases. Because AI is just so effective at seeming like it knows what it's talking about.

David Colwell

And we ascribe a lot of intelligence to somebody that is able to speak with our lingo, someone that knows the words that we use and uses them correctly. So I've seen quite a few different examples now of when legal teams get their hands on AI and they give it a simple task like, summarize this brief for me and, you know, generate some arguments that I might want.

David Colwell

And because it uses all the right legal terminology, because it's going, yep, you can see in this section, here are the reasons and here's the rationale that conflicts with this law, or something like that, they start overly trusting it at the beginning. And there was an absolutely comedic brief, I think it came out of the New York Circuit Court, in which a lawyer had created their entire argument on a very serious case completely using ChatGPT.

David Colwell

And it read very naturally. It was a very good, well-constructed document that was full of completely hallucinated citations. And the result was, I don't know what you call a legal slap-back, but whatever the opposing counsel wrote back in their motion was more or less them scratching their heads and saying, well, we don't really know how to respond to this, but every citation in this appears to be completely made up.

David Colwell

And the judge, going through reading it, issued this scathing opinion saying, I have no idea how you have been this stupid, but I'm trying to read all of these citations and I cannot find a single one of them. I mean, the lawyer was lucky to keep their license. They got sanctioned, they got publicly slapped down about it, and the defense they came back with was, I thought that this was like a search engine, that I could just kind of put things in and I only got facts back out.

David Colwell

And so I started to trust it. I think that's probably the only reason that lawyer still has their license, to be honest: the judge put them into the ignorant and not malicious category. But all of the major AI disasters that I've seen, at least the ones that I can have a bit of fun with, not the ones that have truly gone off the rails, fit into that category of people having decided to use AI for something that they should not have been using AI for, and putting a bit too much confidence into AI's ability to sound like it knows what it's talking about, and speak with great confidence

David Colwell

and a vast amount of ignorance.

Richie Cotton

Absolutely. I mean, it's a brilliant story. It's very easy to enjoy, a bit of schadenfreude there: okay, lawyers doing stupid stuff, brilliant. I think a lot of people are being pressured to use AI more and more for this kind of content generation. So I think we're going to see a lot more examples of this, where we get mistakes happening because AI did something that looked plausible, or generated

Richie Cotton

something that looks plausible, but it's just not. So how do you protect against this? I mean, where do you start with stopping the tide of these kinds of errors?

David Colwell

It's difficult, because on one side, every person in your organization, or at least a lot of the executive leaders, is probably telling you: use AI, be more efficient, be more productive. And speaking as a people leader myself, sometimes I fall into this trap as well, where I'll be looking at my team and I'll be like, come on, you've got access to our tooling.

David Colwell

You could have done this much quicker. You know, I believe you could be doing twice as much if you just used all the tools that we had at hand. It's a good thing, because in many cases it is true. But it's a dangerous trap to fall into, because we mistake the volume of output for completed work. And that's really where, as organizational leaders, we need to take a step back and say: that person that's generating 500 Confluence pages talking about how they're going to be doing their work, and their great plans and everything like that, is probably not generating 500 pages' worth of value.

David Colwell

That's generating 500 pages' worth of stuff. And this person over here who's, you know, trundling along, getting their work done, don't discount that. You should start to measure what is actually important to you. Measure things like your velocity, how fast you're getting work done; measure the things that are meaningful, not the outputs. That's often what we got used to doing, because we're used to a world where the only way you get to an output is a human having gone through the steps to create that output. Now that that cycle is broken, we're focusing a lot more on how you determine that the output is the right output.

David Colwell

That's the thing that you wanted. And there are different ways that we go about this in the industry. Benchmarks are great ways of analyzing how good the output generally is for the task at hand, if you've got a volume of data. But there are also enterprise testing practices which customers are implementing, and this is a lot of the work that we do at Tricentis, where in order to validate that the AI has done the right thing, you need to put these kinds of validation points in place so that we can confirm the output. For example, let's say you've got an AI generating code.

David Colwell

It wrote a lot of code, but that code is causing the tests to fail. So that's actually a net negative: you are worse off now than you were before. Having those tests there allows us as people to get a high-level view of what the AI is doing and say, yeah, it looks like you're on the right track.

David Colwell

And it also allows really smart engineers or smart AI practitioners to go one layer deeper and say, well, I'm going to integrate that into my work cycle so that I don't just wait for someone else to tell me that the AI got it wrong. I can feed that information directly to the AI and allow it to kind of self-correct along the way.

David Colwell

That's the game that we're playing at the moment, us and these artificial minions that we've generated: our role has shifted from typing a lot of things out to peer reviewing and setting up frameworks so that the artificial minion doesn't get too far off track. I think that's where most organizations are trying to crack the nut of AI productivity: it's less about generation and more about verifiability.
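To make that verification loop a bit more concrete, here is a minimal Python sketch of the pattern David describes: generate, run the existing tests, and feed failures straight back to the model. The call_llm and apply_patch helpers are hypothetical stand-ins for whatever model client and code-writing step a team actually uses.

import subprocess

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever LLM client you use."""
    raise NotImplementedError

def apply_patch(code: str, path: str = "generated.py") -> None:
    """Write the model's code into the repo (simplified to a single file)."""
    with open(path, "w") as f:
        f.write(code)

def run_tests() -> tuple[bool, str]:
    """Run the existing test suite and return (passed, report)."""
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def generate_with_feedback(task: str, max_attempts: int = 3) -> str:
    code = call_llm(f"Write Python code for this task:\n{task}")
    for _ in range(max_attempts):
        apply_patch(code)
        passed, report = run_tests()
        if passed:
            return code  # verified output, not just output
        # Feed the failure report back so the model can self-correct
        code = call_llm(
            "Your previous code failed these tests:\n" + report +
            "\n\nHere is the code:\n" + code + "\n\nFix it so the tests pass."
        )
    raise RuntimeError("Still failing after retries; escalate to a human reviewer.")

The point of the sketch is the shape of the loop: the tests, not the human, are the first reviewer, and the human only steps in when the loop gives up.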

Richie Cotton

Okay. Yeah. I mean, certainly as a content creator I feel that pressure: I've got to create all this content, I've got to put out a podcast every week, and yet I'm not going to replace myself with AI, at least on the show. But yeah, having different ways of measuring what success looks like, it's more about the quantity of high-quality output rather than the quantity of output itself.

Richie Cotton

Okay. So, since you mentioned software testing, that's obviously one of the big use cases for AI. Talk me through: how does your workflow change, then? Suppose you've got all these agents helping you out, how does your workflow change to make sure that you're still creating high-quality output?

David Colwell

Yeah, it's a good question. The first thing that you have to start with is real facts and data about what it is we're trying to achieve, and we're a very data-focused organization. In fact, we recently did the Tricentis Quality Transformation Report. You can find it on our website. It's a good read.

David Colwell

It surveys a large number of organizations across the globe, different countries, different regions, different-sized organizations, and there are a lot of interesting statistics in there. But the main goal for us out of that report was to ask: are we actually solving the right problem here? Are we dealing with the challenges that our customers are facing? And one of the interesting statistics that came out of it was that the quality engineering teams had the same challenge that the development teams had.

David Colwell

They were being told, you need to be moving faster, you need to focus on speed. I can't recall the exact figure off the top of my head, but for a large share of organizations, the primary focus for their quality organization was on speed. Now, as quality engineers, that's a good goal to have, but it puts you in that productivity bind.

David Colwell

You really are the last line of defense in this sort of AI waterfall that's coming down, because upstream, and this is a real story from my personal history in developing AI organizations, you enable your product teams with AI, and they start using AI to generate requirements and generate what needs to be done. They may even be using AI to come up with ideas of what we could do and how we can help our customers.

David Colwell

So they've now got more ideas and more things being generated, and that traditional kind of bottleneck that would happen when you would tell the product team, okay, great, can you go write all this down, put some thought into it, come up with your priority lists, and then start siphoning those to the team, that natural bottleneck of work they had to do before they could communicate that vast brain dump of ideas to the engineers.

David Colwell

That bottleneck disappeared overnight. So now they've got a funnel where they can just throw huge volumes of things down to the engineers, and the engineers are picking them up and going, wow, okay, I guess all these things are coming my way, I'd better use AI to start creating the code. And so now you've got fast ideas, fast code, and you get to the last point where you're supposed to be going, all right.

David Colwell

Not just, is this idea implemented correctly? That's an important part of testing. But also, is this idea benefiting the business? Is it actually meeting the original goals that were defined? This is something often called static testing: validating that the requirement you got is actually a good requirement to solve the use case it came from.

David Colwell

And then once you've got those: are we using the right tools and products to be able to deliver it? So is it, you know, the correct outcome? Now, when you take that mission of making sure that it's the right thing and that it's built right, and you say your main goal is now speed, that starts to come with sacrifices.

David Colwell

And the quality teams are saying, well, I'm snowed under, I can't handle this massive pipeline of work that's descending on me without some help from artificial intelligence. And so you end up with artificial intelligence being both the cause of and the solution to the problem, where people say, all right, because there's a huge volume of work, I need to be moving faster.

David Colwell

The only way I can move faster is to use artificial intelligence. This is where it becomes critically important that these organizations have access to good, reliable AI tooling to use for this. And it really does split into two parts. There's a whole section about how do you test an artificial intelligence itself if you're building a product based on that?

David Colwell

But the more common problem is just, how do you test artificially generated content? So artificially generated code, artificially generated requirements, etc. And this is where, if your testers are also trying to upskill into becoming artificial intelligence engineers at the same time, they're going to be completely snowed under. We recommend that they look for companies, and this is a selfish plug because I work for one of them, that produce this type of tooling.

David Colwell

My day job, day after day, is to find ways to use AI to make testers' lives easier. We have to accept onto ourselves the burden of figuring out how to not just generate more of what we affectionately call internally AI slop, which is just AI-generated content that isn't adding any value. We have to figure out, how do we make it relevant?

David Colwell

How do we make it accurate? How do we communicate degrees of confidence to the customer when the AI may or may not know what it's talking about, so that we can unlock the speed that these quality engineering teams are seeking in a way that doesn't sacrifice the original mission statement of protecting the production environment, protecting the end consumer from more AI generated slop.

David Colwell

I'll give you one example of where AI has proved critical to solving the problems that AI creates. When AI writes a requirement, or realistically, when AI writes any kind of narrative document, it is extremely prone to filling in the gaps with what it thinks you're thinking. Now, I'm not sure it really thinks; I think it more or less does mathematical calculations, and we ascribe to it the anthropomorphic 'thinking' label.

David Colwell

But when the AI gets a requirement from you, you can almost imagine that in its head it's reading all the other requirements that look the same: okay, let me think about this requirement that I saw online, or these JIRA tickets, or these issues in GitHub, or wherever it's seen this kind of thing. And what it comes back with is, well, all of those talked about security and scalability as well.

David Colwell

So I'm just going to add some security and scalability things into this requirement. And that one also talked about, you know, policy-making decisions, so I'm going to have some policy-making things, because your requirement was just close enough to those ones for it to trigger those ideas. Here's where the first of the major AI syndromes comes in, the, you know, the diseases that infect humanity when it's exposed to artificial intelligence.

David Colwell

If you'll excuse the somewhat dramatic phrasing, the first of these diseases is that we feel like we can't slow the machine down. We don't want to be the bottleneck. And so when you get these huge narratives generated by AI, the temptation is that we skim-read them. We go, oh yeah, that looks about right. That looks about right.

David Colwell

Yeah, that's roughly what I'm after. We don't subject them to a detailed review where we're going through and asking: do I need that? Why are there security requirements in this? Like, I know security requirements in general are a good thing to have, but are those the right ones? Are they relevant to my problem? It is unnatural for us to do that, because we feel like the work is complete.

David Colwell

And the final step before we hit the go button, dust it off, and claim our KPI is us reviewing this thing. So we naturally want to get through that step as quickly as possible. And that is where most AI errors creep in: the AI adding things that don't belong there because it hallucinated up an idea of what it thought you wanted.

David Colwell

The cure for this is, yes, we need to retrain humans, because ideally that thing never ends up there in the first place. We need to teach people that it's okay to be the bottleneck. But we also have specialized AI agents, part of the quality agent componentry at Tricentis, that are looking for exactly that, tasked with saying, hey, read this requirement.

David Colwell

Do you think that any parts of it are superfluous? Is there anything here that really isn't relevant to the business, that's not addressing the core concern? And if there are, raise those to the human. Because, yeah, AI is very malleable. It'll take on whatever role you assign it. So when it's the requirement creator, it's taking on the role of, let me think of all the things. When you give it a different role, okay,

David Colwell

you're a critic, you're trying to find missing components, it does a pretty good job of that. So AI is a massive productivity benefit, but without a good testing practice that's got good AI-enabled tools, you run the risk of throwing a ton of AI slop into your customer's lap, at which point they're likely to pick it up and say, I didn't actually want this.

David Colwell

What is this? You've done a lot of stuff, but it's not very useful stuff. So yeah, that's how we think about testing. It's kind of like that final bastion of defense against huge volumes of AI rubbish being thrown into production.
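As a rough illustration of that critic pattern, here is a short Python sketch; the prompt wording and the call_llm helper are illustrative assumptions, not Tricentis's actual quality agents.

CRITIC_PROMPT = (
    "You are a critic reviewing a draft software requirement.\n"
    "Business goal: {goal}\n"
    "Draft requirement: {draft}\n"
    "List any sections that are superfluous or not relevant to the goal, "
    "one per line. If everything is relevant, reply with exactly NONE."
)

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your model client."""
    raise NotImplementedError

def review_requirement(goal: str, draft: str) -> list[str]:
    reply = call_llm(CRITIC_PROMPT.format(goal=goal, draft=draft))
    if reply.strip().upper() == "NONE":
        return []
    # Surface the concerns to a human reviewer rather than silently editing the draft
    return [line.strip("- ").strip() for line in reply.splitlines() if line.strip()]

The design choice worth noting: the critic only flags concerns for a person, it does not rewrite the requirement itself, which keeps the human in the loop David describes.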

Richie Cotton

There's a lot to take in there. But you had me worried in the middle there, saying, okay, actually one of the really useful skills now is just reviewing AI slop. That sounds like a terrible job, so I'm glad there are agents to help you out. It seems like that's quite a common architecture.

Richie Cotton

In this you have one agent creating stuff, and then you have a second agent that's reviewing it or acting as a critic. Does it change your testing posture as well? You mentioned that quite often quality people come at the end of the process, and they're trying to decide, okay, is this any good or not.

Richie Cotton

It seems like you could move quality earlier into the process to, I guess, eliminate that final bottleneck?

David Colwell

You can and you should. One of the things that AI has done is it magnifies all of the efficiencies and inefficiencies of your organization. We've been helping our customers do what we call shift testing left, which means move it earlier into the cycle. We've been helping our customers do that for quite a while.

David Colwell

And AI really just accelerates that: they can shift testing left quicker, they can do more of their testing earlier, they can automate sooner, they can get their requirement validations, they can contribute in peer reviews. All of this is done earlier, more so with the help of AI. But it doesn't change the fundamental equation, and that's something that's important to remember: if you're rolling out a ton of AI code and you have bad security practices, now you've got worse security practices, because you're doing more of it in an ungoverned way.

David Colwell

And if you've got good security practices, you're moving faster in a safer way, so you multiply the benefit. You can almost imagine AI is like hiring a bunch of extra people that aren't particularly well trained. If your process can handle that and integrate those people into your environment and get good value out of them, then you're going to get even more value out of artificial intelligence.

David Colwell

But when it comes to testing, something we've spent some time training our internal QA on is this: we've let them know that you are now sort of the custodians of AI risk at Tricentis, and part of your job is to analyze the requirements and to recognize that there will be a desire to use AI

David Colwell

to solve some of these. You have to determine if AI is the right solution in those cases, because there's a bit of a saying that we've ingrained into everyone: if your use case cannot tolerate consistent, persistent, incurable failure, then you can't use AI, because we trained AI to make mistakes sometimes. That's baked into the cake.

David Colwell

AI training involves a degree of accuracy, which, if you invert it, is another way of saying a degree of error. If we ever had an AI that was 100% accurate, we would have just reinvented a database. So we know that AI can make errors. So when our testers are looking at use cases, they will ask questions right up front and say, what will happen when this fails?

David Colwell

What will happen when the error happens? And if your use case doesn't have an answer for that, either by saying, well, when an error happens there will be a person to review it, or when an error happens there is a way of mitigating that via recovery mechanisms later on, if the business owner or the product owner doesn't have a good answer to that question, then you can't use AI for that.

David Colwell

A good example, perhaps, goes back to the original one we started with. AI in legal is actually a good use case, but the lawyer should be reviewing what the AI puts out and saying, is this right or is it wrong? Something we call a human in the loop. A bad case for AI?

David Colwell

This has actually happened with a couple of companies: now your AI, with no lawyer review, is giving legal advice to laypeople that do not understand the law. Now, when it makes a mistake, who's going to review that? Do you expect an engineer to pick up a legal document and say, oh, you misquoted the law there, or actually, that's not what that case said about how we apply this?

David Colwell

But no, there is no potential for human review in that use case. So we would call that a use case defect, where you have a critical flaw in your use case. And we expect testers to shift all the way back there, to consider themselves not just guardians of quality in organizations, but guardians of use cases, determining whether that risk is acceptable to the group.

David Colwell

So, a longer answer to your question about whether we can shift testing left: absolutely, we can. We can be moving all the way to the left, and we should be, but we should also be moving quality back into the definition phase to determine if this is the right thing to be doing in the first place.

Richie Cotton

Okay. Yeah. Certainly you can have something that's very high quality but completely pointless as well; you can build pointless things that are great quality. So I like the idea of thinking about what you should build, as well as having all those different steps in order to ensure high quality throughout the process.

Richie Cotton

It does sound like a lot of it is about getting the fundamentals right. You mentioned the idea that you've got to have good security policies throughout. I guess there's a role for things like data governance in there as well, so can you walk me through that: what do you need to put in place before you can actually create high-quality AI products?

David Colwell

This is going to get a little bit into designing an AI product organization, because there's been a shift in the last three years when it comes to how much data you need to get productive. If I cast my mind back several years, when I started rolling out AI products to our customers, you needed to lead any of your rollouts with a six-month data gathering, data labeling, data cleaning, data science exercise.

David Colwell

So you could not even get to the first results, are we getting an accurate enough model to deliver this outcome, until you already had a data set. And the data set sizes were quite large: tiny data sets would contain thousands of labeled records, and for realistically sized data sets

David Colwell

you were into the millions of records required. So it was a very big exercise, and there was a huge focus on data lakes and data pipelines and everything to make sure that that worked. What changed was transformers, which is the underlying architecture of current large language models, as sort of more general-purpose machines.

David Colwell

And we learned that a model that has been trained on this vast volume of data that they have access to, generally the whole internet, more or less, can actually solve a large number of problems with very, very limited data. I think the original paper that hypothesized this was from OpenAI.

David Colwell

It was called 'Language Models are Few-Shot Learners.' What that means is, with a task the language model is unfamiliar with, you can give it a few examples of how to do that task, and its inherent ability to understand natural language and sort of translate between tasks yields a pretty good outcome with no training at all. Now, what this meant is that an entirely new category of data volume emerged, and that was benchmark-ready data.

David Colwell

That means that you are not training the model. You're not fine-tuning the model. In fact, you are not touching the model at all. All you're doing is tuning everything else: tuning the context, tuning the prompt, tuning the pipelines and the tooling and everything else. And the set of data that you need is only enough to tell you whether you are accurately solving the problem, which can be a small fraction of the volume of data you used to need.
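For readers who haven't seen few-shot prompting, here is a toy Python example of the idea; the classification task and the examples are invented purely for illustration, and call_llm is whatever client you use to reach a model.

# Few-shot prompting: no training or fine-tuning, just a handful of worked
# examples in the prompt. Only the context changes; the model is untouched.
FEW_SHOT_PROMPT = """Classify each bug report as UI, API, or PERFORMANCE.

Report: The save button overlaps the footer on small screens. -> UI
Report: GET /orders returns a 500 when the id is missing. -> API
Report: Search takes 12 seconds on accounts with many items. -> PERFORMANCE

Report: {report} ->"""

def classify(report: str, call_llm) -> str:
    # The benchmark-ready data David mentions is just enough labeled cases
    # to check whether outputs like this one are right, not to train anything.
    return call_llm(FEW_SHOT_PROMPT.format(report=report)).strip()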

David Colwell

Now, what that changed about the way we as product organizations considered use cases is that instead of having a narrow set of use cases, where we said we can only really operate in a world where we've got a huge amount of data to validate whether it's true or false, our use case horizons broadened massively. In the world of testing, for example, we went away from only doing things like classifying whether a test is broken or not, or whether an issue should be raised or not,

David Colwell

areas where there was a lot of information density. We were then able to say, oh no, we can generate test cases from requirements, because even though we don't have a huge amount of data on that, the language model is doing a fairly good job. And so we can just expand that out: we can have it build automation, we can have it build test case data sets.

David Colwell

We can have it deal with the queries to the database. So the world became our proverbial oyster in the space where we didn't have a lot of labeled data. Now, the challenge this starts to create for a data organization is that if you don't have good data hygiene, you'll start to suffer from one of two problems: you'll start using data illegally, or you will start to inadvertently create data leaks.

David Colwell

So imagine that you're operating in this new world where you say, all I need to do is give a few examples to the machine, and then the machine will be able to figure it out from there. Well, what's in those few examples? Do those examples contain customer-confidential information? Do those examples contain sensitive things like API keys, for example?

David Colwell

Do you know what's in those examples? Because we learned pretty early on that anything you tell the AI, a sufficiently motivated user can get back out of the AI. That's your classic context extraction. Or the AI could inadvertently share it: say we gave it an example of how to set up a project in GitHub that contains the actual API key that you used to connect.

David Colwell

And then a customer says, hey, how do I connect to a project in GitHub? And it says, oh, you should use this API key, because it's in my input. So data hygiene becomes critical in environments where you are no longer going through that rigorous process of take the data, label the data, review the data, which used to be a six-month exercise.

David Colwell

Now product teams left and right are just picking data up from here and there and putting it into the large language model. So you start to need a really strong data governance organization that is really confirming what's going in, what's going out, what is being used to train the model, and what is being presented as context to the model. Do we have a good handle on who owns that data?

David Colwell

Do we have the rights to use that data? Does that data contain sensitive information? And you need all of this not just because it's the right thing to do, which it is. You need all of this because your customers will demand it of you. When we got our ISO certification, I think some months back, it might have been sooner than that,

David Colwell

it was a big exercise in going through and saying, yes, we have confirmed that we know the data lineage of each record that gets presented to the AI, that we can confirm when AIs are trained and when they're not, who owns that data, that we have rights to access that data, that it doesn't have any sensitive information in it, even down to the point where we are presenting the correct data to different prompts that go to different customers, so that we never end up in cross-data-pollution environments.

David Colwell

So, in this new world of AI, and we tend to use AI and generative AI synonymously, so let me split them out: in the world of generative AI, if you're running a product organization, you need to get a very good handle on where your data is, who owns your data, what type of data it is, and what sensitive information it contains. And you need to have that process automated to the point where your product orgs can confidently use the data they have access to and don't have to ask all these questions individually, because if they're asking individually, then you're answering individually to your customers.

David Colwell

So it becomes even more important to have strict data governance policies just the same as it is to have strict security policies. Because this will help you move fast. In a world where AI is doing all of the lifting for you.
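One hedged example of the kind of automated check that supports this: a small Python gate that scans anything destined for a prompt for obvious secrets before it leaves your environment. The patterns below are illustrative and nowhere near a full governance solution; ownership, lineage, and consent still need their own controls.

import re

# Illustrative secret patterns only; real scanners use far richer rule sets.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                      # AWS-style access key id
    re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    re.compile(r"(?i)api[_-]?key\s*[:=]\s*\S+"),          # generic api_key = ...
]

def safe_for_prompt(text: str) -> bool:
    return not any(p.search(text) for p in SECRET_PATTERNS)

def build_context(snippets: list[str]) -> str:
    clean = [s for s in snippets if safe_for_prompt(s)]
    blocked = len(snippets) - len(clean)
    if blocked:
        print(f"Blocked {blocked} snippet(s) that look like they contain secrets.")
    return "\n\n".join(clean)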

Richie Cotton

Absolutely, that's great advice, to have great data governance policies. But at the start of this conversation, we talked about how, with quality control engineers, they become the human bottleneck in the process, and then you need AI to help them out there. Is the same true if you're working in a data governance position?

David Colwell

It can be. We have a lot of our data governance processes automated, so you can say things like, we know that you can only present data to the AI that the person using that product has access to, because our authorization controls within the company have already automated that. Where it starts to get into necessary human oversight,

David Colwell

and this really is the evolution of the data scientist, is that there is an entirely new currency being built up around how much gen-AI-ready data you have. That is, the data that you have the right to access, that has been properly cleaned and confirmed, and that can be used for things like fine-tuning.

David Colwell

How much of that do you have? Where is it, who owns it and how is it being processed? So as an example, if we want to generate test cases, one of the goals that we have is to improve the accuracy of the test case generation. And over the long and medium term, we know that a solution to improve that accuracy is fine tuning.

David Colwell

How do you get better results out of an AI? You can add better context, you can add a better prompt, or you can improve the base model; that's what we call fine-tuning. Now, if we rely on all the product orgs to figure out how to fine-tune, what the right data is, and so on, then we're going to end up with a governance nightmare, with everyone having their own view as to what's correct and what's incorrect.

David Colwell

And so we have our internal group (oh sorry, banged my microphone there), our internal group focusing on how to get that data gen-AI ready. And that means it needs an automated process set up for each use case: for ingesting the appropriate data, for turning that data into anonymized data that can be used safely in the algorithm, for setting up spot-checking policies so that people can go in and review subsets of that data and make sure that it is, in fact, anonymized, and then implementing training practices and policies that will then actually do the part

David Colwell

that data scientists and machine learning engineers are most used to, which is feeding the data to the model and doing the benchmarking. So the thing that used to be the hardest part of everyone's job, get the model, train the model, is now somewhat commoditized, and the part that used to be someone else's problem, which is how do I get the data and move it into a state where it's ready to be used, is now very much the challenge of data scientists.

David Colwell

So that's the new bottleneck that emerges. The good news is that it's kind of a trailing concern. The pattern that we have in our organization is that you go to market with tens of data points, where you know how to validate that, that's in the benchmarking territory, so you know that the process is good, with a model that has not been fine-tuned, with examples that are not created from real data.

David Colwell

So, synthetic examples that have been ingested into the model. And the reason you want to do that is because there is a huge usability question with every AI product, which is: we think we're solving the right problem, but we don't know if we're solving the right problem in the right way. And the only way to determine that is to get it into the hands of the people that are doing it and to observe them: all right, how do you interact with this?

David Colwell

Like, the best example here would be there's a benchmark out there called SWE bench, which is the software engineering benchmark. And if you look at the deviation between what the questions are in SWE bench and what real engineers ask the AI to do, it's a massive gap and it's filled with usability challenges. You need to get there as quickly as possible.

David Colwell

Once you've identified that, yes, this is the right problem to solve and that we've kind of nailed down those usability challenges. That's when you move into the world of collect the data, clean the data, get the fine tuning set up, because now you're trying to optimize, you're trying to go from it's the right problem we're solving into we're solving that problem quickly, cheaply and accurately.

David Colwell

And that's the realm where data scientists play. So they do become a bottleneck over the long term, because as you start to put more and more AI capabilities into production, you start to have more and more of these cases where, like, now we need this optimized and that optimized. But the lucky thing is that this is almost always trailing deployment.

David Colwell

So you're getting value to your customers quickly, and it really becomes the internal fuel for your AI organization to build that data up over time.

Richie Cotton

Okay. So that's interesting, you mentioned the changing role of the data scientist there. I think there's been this longstanding thing where you have to spend way too much time cleaning data, and then you get to the fun part of modeling. Actually, the modeling has been sort of taken away; there's more data cleaning to be done, more data governance to be done. And you've gone from being a manager of a data science team, and then machine learning teams, and now you're running an AI department.

Richie Cotton

Talk me through, what's the difference? How has the profile of who you're hiring changed?

David Colwell

Yeah. So I think the big shift was from AI being a thing that was done in one piece of the organization. You would have your data science team, you'd have your machine learning team, and they would operate sort of like a services team, where there would be requests coming in from the business somewhere saying, hey, we need to improve our fraud detection.

David Colwell

We've tried all these algorithms and we can't quite get it, can you guys do one of those magic model things that you do and see if we can get better results out of that? Or sometimes it would be a reach-out: in the first examples we had, data scientists would reach out and say, hey, we know you've got this problem.

David Colwell

You know, we can solve that problem with a computer vision object detection model, for example, or improve the accuracy of self-healing with regression models. So it used to be that AI was kind of locked away in a corner, told to go do your magic stuff and then come back to us, and we'll try to figure out how to deploy that giant ball of math in a scalable way.

David Colwell

The big unlock here was not actually large language models. It was commoditized access to large language models, because now every product organization under the sun can just connect to an OpenAI endpoint or an Azure endpoint or a Bedrock endpoint, and now they have AI. They've got what they wanted out of the box. And so the biggest challenge at the beginning was making that decision.

David Colwell

Do you just let everyone loose? Because you will gain a lot of cool use cases, and you'll get fast delivery by not putting any gatekeeping in place. You can say, go nuts. And this almost made it feel like the entire discipline of being AI people would be relegated to niche companies that deliver AI models, what we would call the AI labs.

David Colwell

And as a machine learning engineering person, I found that confronting, and the natural feeling was to say, oh no, you can't possibly do that, you don't know all of the activation functions that we know. The challenge was realizing that that didn't matter anymore for a large number of these organizations and a large number of product teams within those organizations.

David Colwell

And so we had a very challenging discussion internally about, okay, well, what is it that we do now that every place in the organization can roll out AI? And we started to see that every product team was able to take AI to production, but they very rapidly started running into problems that we'd solved decades ago. So they would say, hey, the AI is not doing what I want it to do sometimes, and my tests are failing because I've written a specific test saying if I do X, then I expect Y, because they didn't understand that they were dealing with a probabilistic machine that would sometimes answer the question like this and sometimes

David Colwell

answer the question like that. And they found it frustrating and irritating, and they're like, we don't know how to handle this. And we're like, what do you mean? You should benchmark it, doesn't everyone know that? And that was when we realized that, no, everyone doesn't know that. Most product orgs are used to operating in that world of: I input x and I get x plus one out, because this is an adding function and it's reliable and it's unit-testable and everything else.
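A toy sketch of the difference in Python: a deterministic unit test asserts one exact output, while a probabilistic system gets measured as a pass rate over repeated runs against an acceptance check. The threshold, run count, and call_llm helper are all assumptions for illustration.

def call_llm(prompt: str) -> str:
    """Hypothetical model client."""
    raise NotImplementedError

def is_acceptable(answer: str) -> bool:
    # Domain-specific check, e.g. the answer names the right endpoint
    return "orders endpoint" in answer.lower()

def deterministic_test():
    # Fine for "x in, x + 1 out"; brittle for a probabilistic machine
    assert call_llm("What endpoint lists orders?") == "Use the orders endpoint."

def benchmark(prompt: str, runs: int = 20, threshold: float = 0.9) -> bool:
    # Measure how often the output is acceptable instead of asserting one string
    passes = sum(is_acceptable(call_llm(prompt)) for _ in range(runs))
    rate = passes / runs
    print(f"pass rate: {rate:.0%}")
    return rate >= threshold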

David Colwell

And so AI engineers evolved to take on that role of sort of being AI architects or AI advisors across these organizations, teaching them: no, that's not how AI works, you should treat it like this. Or, okay, prompt engineering is just another way of saying we've got a magic book of prompt spells that we roll out, and for some reason they make the AI behave differently.

David Colwell

The AI engineers were the ones saying, no, no, this is why it makes it behave differently. So it was unlocking that connection between outcome and cause. And unlocking that connection allows you to improve, because up until then, we found tons of products with tons of AI rolling out where they didn't have any way of improving it, because they didn't know what was actually making it tick.

David Colwell

And so the best AI engineers got into that role very quickly and said, all right, I am no longer the gatekeeper that says you can't use AI unless you come through me. I'm the accelerator that says you can use AI however you want, but if you want to get good outcomes out of it, I'm going to come and help you get there.

David Colwell

And so they started doing much more communication, much more holding town halls and saying, hey, here's what's working, here's what's not, let's communicate the why behind it to people. Not because we think everyone will pick up the backpropagation algorithm overnight, but because everyone that we can educate to use AI better will unlock the whole organization to move faster.

David Colwell

So that was the journey for the AI engineers. The journey for the data scientists was much harder, because at the beginning it very much felt like, why do we even need data scientists? We've got this machine that can operate on very little data and realistically do its job without any custom data; you could just invent the data on the spot and say, hey, here's what this use case might look like.

David Colwell

You know how to do this task, go do it. And so at the beginning we were like, what really is the purpose of having a data science team? And then we discovered that, just like with AI engineering, when you propagate the anyone-can-do-it attitude out to the whole organization, you fail to propagate all of the things that that team already knew how to do, all of the expertise that they had built up.

David Colwell

You're now just kind of asking everyone to absorb that expertise overnight. And so when it came to things like, how do you design a good benchmark, how do you come up with the categories that you need to measure across, how do you guard against bias in your benchmark, how do you ensure appropriate distribution, what does your confusion matrix look like?

David Colwell

I talked to one team and I said, okay, for this you're going to need a confusion matrix, because I was still in the mode where I was like, of course you do, this is the way we operate. And I got confusion back from the team, but I didn't get a confusion matrix back, because it's a term that they were completely unaware of.

David Colwell

They had to Google it, and they're like, how do you produce this? What's the point of this? So that's when I realized you actually do need to start propagating data science out as an exercise, as an expert team, almost a guild, if you will, that is able to accelerate everyone else. And that was how we unlocked AI productivity as an AI product team.

David Colwell

Now, how does this change your hiring? Well, you still do need very good machine learning engineers and you need very good data scientists. But we started to say that to roll AI products out, you also need engineers that can fill the gap: engineers that can say, okay, that's what the data looks like, and that's what the machine learning solution is going to accomplish.

David Colwell

And this is what the customer needs, so I need to sew all of these together in a rapid way to develop a prototype and take it to production, so that we can see if we're solving the problem. And so we were hiring more Python engineers, we hired more machine learning engineers, we hired a few more data scientists, because we saw that AI was becoming so pervasive within the organization that we needed to start capturing all of these skills very early on, so that we would have a sufficiently sized enabling team to go start working with the product teams.

David Colwell

At this point in our journey, every single one of our product teams has AI people working in that team. They've got machine learning engineers, they've got data scientists, they've got these Python prototyping engineers that are living and breathing their business use cases, but they are implementing the machine learning and data science practices that we built up over time.

David Colwell

So I guess a summary of that journey is: every organization is going to face this journey, and you will face the same confronting feeling, there's no better word for it, that we did, when you look at where the organization is heading and you get this sense that, that's not my role anymore, I don't see it there.

David Colwell

Like, you're not doing, you know, giant data analysis, building and training models, so there's no home for me there. And the encouragement that I would give is: there certainly is. You have now become an expert enabler, not a gatekeeper. And that's actually a very satisfying role, because what gave us satisfaction in our jobs was not 'I trained a model', that was just the work.

David Colwell

What gave us satisfaction was: I solved that use case, I gave value back to the customer. We're not a university; we don't do things just so we can write a paper based on them. We do things so that our customers can get some value out of it. Right now, one machine learning engineer can roll out five times more delivery by enabling other teams than they could have in the sort of centralized, gate-kept model.

David Colwell

So it is a very satisfying outcome, but there's some pride, some kind of historical baggage, that you need to give up in order to get there.

Richie Cotton

I do love the idea of existing data scientists and existing machine learning engineers being enablers for the rest of the organization. To make that a bit more concrete, how do you go about teaching all these maybe less technical people, people with a different background, those data skills, those machine learning skills?

Richie Cotton

What does that rollout look like?

David Colwell

That rollout is challenging. If you want to do it smoothly, good executive support is necessary. And executive support generally comes from the fact that it is very hard to go through and confirm that you're doing AI safely, that you're doing it well, that you're doing it securely. That burden is high, and necessarily so, because AI is very powerful and everyone is both enthusiastic and concerned about it.

David Colwell

And so having a sort of centralized function, we know how to do the benchmarking, we know how to do this, and then distributed execution where you send them out, is a model that, at least from an executive point of view, makes a lot of sense, because they go, great, I can centralize my community of practice.

David Colwell

I can make sure that there is a way we do this within our organization. So getting that buy-in early on is critical. After you've got that buy-in, the challenge that you're going to face almost immediately is that people usually don't feel the pain of doing an AI rollout wrong until they put it into production.

David Colwell

And that's when they'll say, oh, we've got this great use case, we rolled it out to customers, and then support ticket after support ticket comes in, or they're being told, all right, it needs to do better at this task. And the engineers start to feel that frustration. What do you mean by better? How do I get better?

David Colwell

How do I go from here to there? If I change the prompt now, it breaks all these other tests that I'd written; how do I handle this? And so if your organization is at a low level of maturity, then you'll find a lot of pushback from the engineering teams, like, no, no, you're the lab people.

David Colwell

We move fast, we break things, you know, we've got this speed to market. And that will disappear over time as they start to feel that pain. It's important to engage early on in the education. And so we've got a lot of good examples that we run internally of saying, hey, we rolled out this thing, check this out.

David Colwell

It does this, this, and this better. Now we upgraded the model, and it does these things better and these things worse. Is it better or worse? And just letting the engineers feel that pain of going, how did it get worse over here? I thought models only got better over time. No, models get different over time.

David Colwell

And it's like doing a heart transplant on your product every time you change a model out. So you don't want to end up in this locked place where you can't change anything for fear of breaking something. That actually relates very closely to engineers, because it's what doing a code refactor is like when you have no regression tests.

David Colwell

So that rollout starts with, yes, getting executive support. It also starts with education and showing people, hey, this is what we do, this is why we do it. Not just hiding in the research environment and saying, you all don't get it, you don't understand our statistical brilliance. You need to get much, much better at communicating the why behind what you do.

David Colwell

Because that's the secret to unlocking people's minds and going, yeah, I actually do want that thing that you do, that sounds very useful. Please come over here and build a benchmark for our team. Give us some class-based balancing to tell us whether or not we're doing the right thing, or whether we've got bias in our model.

David Colwell

Help us understand what level of hallucination we have. How do we measure the distance between these phrases? That's something that doesn't make sense in a lot of engineers' minds; it's not the world they've lived in for a long time. So if you can communicate that out, it makes that journey a lot smoother. It makes it easier to go from step to step.

David Colwell

If people are convinced that they want the thing that your team offers, and they're not just being told, yes, go consult the grand, intelligent people in the lab and they'll emit an Excel sheet that will tell you whether you did good or bad.

Richie Cotton

Okay. Yeah, I can certainly see how that's good advice, that you want to spread all the knowledge you've built up on the job around the rest of the organization, but no one's actually going to listen to it until they've built something that doesn't work, and then they need your help.

Richie Cotton

One thing I realized: we've talked a lot about benchmarks throughout this conversation, but we haven't actually explained what a benchmark is. Do you just want to give a quick description of what one of these benchmarks for your model performance is going to involve?

David Colwell

Ha, you've already picked up on me using those terms, right? Old habits die hard. The easiest way to understand a benchmark: I grew up from a very young age, you know, with PC games, so I was used to this. A benchmark is just where you run a bunch of different scenarios and you measure how good or bad it is, and you say, all right, overall, this one has this much in the good column and this much in the bad column.

David Colwell

Benchmarking is not that much different in the world of large language models. You'll have a bunch of use cases that you'll have to think through. Let me give you an example that we've got internally. You've got a task: you want to take in a JIRA ticket and generate a set of test descriptions out of it, like, what should I be validating to make sure this JIRA ticket is correct? Without a benchmark,

David Colwell

you've just got JIRA tickets in and a set of test cases out. Are these the right ones or the wrong ones? I don't know, who reviews it? All right, give it to a tester. And we were doing this for a while, to be perfectly frank: go give it to some testers and see what they think.

David Colwell

And they'd go through and be like, oh, I like this one, that one's a bit rough. And we were getting this really fuzzy feedback from people about whether it was good or bad. So what we did was go through and gather a set of these examples, and we curated what we thought were the ideal sets of test scenarios to be generated for different categories.

David Colwell

So, what are the functional tests? What are the security tests? What are the performance tests? And then whenever we change the AI, which is that combination of context, prompt, model, you know, training data, all of that, we loosely call that the AI, because people think, oh, you only benchmark new models. No, you benchmark any change to that stack.

David Colwell

So when we change that stack, we would go back through, feed it a set of JIRA tickets, look at the outputs, and ask: did it get the test cases that we curated, how many of them did it get, how many did it miss, and so on and so forth.

David Colwell

And that's a benchmark, because we get out of that a set of statistics saying: this is how good it was, this is how bad it was, this is where it got better, this is where it got worse. Now, we view that as a testing activity, because those are the test cases of an AI product, where you go in and you ask, how right is it and how wrong is it?
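Here is a minimal sketch of that kind of harness in Python. The data shapes and the naive substring-matching rule are simplifying assumptions, not the actual Tricentis benchmark, which would typically use richer semantic matching.

from collections import defaultdict

def matches(generated: list[str], expected: str) -> bool:
    # Naive matching rule; real benchmarks often use semantic similarity instead
    return any(expected.lower() in g.lower() for g in generated)

def run_benchmark(curated, generate_tests):
    """curated: {ticket_id: [(category, expected_test_title), ...]}
    generate_tests(ticket_id) -> list of test titles produced by the AI stack."""
    hits, totals = defaultdict(int), defaultdict(int)
    for ticket_id, expectations in curated.items():
        generated = generate_tests(ticket_id)
        for category, expected in expectations:
            totals[category] += 1
            if matches(generated, expected):
                hits[category] += 1
    # Score per category (functional, security, performance, ...), not one number
    return {cat: hits[cat] / totals[cat] for cat in totals}

Rerunning the same harness after any change to the stack (prompt, context, or model) is what turns "it feels better" into the per-category numbers David describes next.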

David Colwell

And if you do that well, you unlock engineering teams to iterate much faster, because when engineering teams are trying to build their AI product, that stack I talked about earlier, they will often be operating, and this sounds stupid but it is very true, off feeling when they see the results, because they've got no data to use, they've got no benchmark.

David Colwell

And so they'll be going, that looks about right, oh, I think that got better at doing this. It's literally them iterating over and over again like that. And it feels very productive, because you're like, hey, the results are getting better and better, until you throw someone else on it and they're like, oh, it keeps changing.

David Colwell

It feels wrong to me somehow. Now you've got two people arguing about their feelings about how good your AI model is, and the thing we tell our engineering teams is: yeah, that happens within the team, but now imagine you've got customers arguing with you about how they feel about the model. That's like a whole football stadium full of people doing a group therapy session with your results.

David Colwell

Do you want to engineer based on that? No. You want to engineer based on some outcome. You want to know if what you did improved, degraded, or stayed the same. So build a benchmark, spend that time. Accuracy is the currency of progress in AI, and based on our internal statistics, you will get to accuracy two times faster if you have a good benchmark than if you have no benchmark.

David Colwell

And when we say a good benchmark: a good benchmark should not just report a single overall percentage. It should also break the results down along different business value axes, categories of things. For us, it was how many functional tests it built, how many nonfunctional, how many of these were edge cases, how many were high value or low value, and so on.

David Colwell

So when we give the results back to the engineers, we're saying: that change you made improved the number of functional tests it was generating, because you put something in the prompt saying focus on functional tests, or maybe you gave better examples or better context with regard to the functionality, but it degraded the security tests.
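
As a hypothetical extension of the earlier sketch, a category-aware benchmark could report recall per category, so the "functional tests went up, security tests went down" conversation is backed by numbers rather than impressions. The nested-dictionary layout of the golden set is an assumption for illustration.

```python
# Hypothetical extension: break benchmark results down by category
# (functional, security, performance, edge case, ...) instead of one number.
from collections import defaultdict


def run_benchmark_by_category(
    golden: dict[str, dict[str, set[str]]],  # ticket_id -> category -> expected scenarios
    generate,
) -> dict[str, float]:
    """Return recall per category so a change's effect on each area is visible."""
    hits: dict[str, int] = defaultdict(int)
    expected_counts: dict[str, int] = defaultdict(int)
    for ticket_id, by_category in golden.items():
        generated = set(generate(ticket_id))  # AI under test, same callable as before
        for category, expected in by_category.items():
            hits[category] += len(expected & generated)
            expected_counts[category] += len(expected)
    return {
        category: hits[category] / count
        for category, count in expected_counts.items()
        if count > 0
    }
```

Comparing these per-category numbers before and after a prompt or model change is what hands the product manager and engineers a concrete trade-off to reason about.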

David Colwell

Now, is that good? Is that bad? That's back to the product manager and the engineering team to figure out. But at least now they know what to look for. Is it bad at generating security tests? All right, well, maybe we need to make some changes to improve the security context. Is it generating a lot of edge cases but missing the straight-through paths?

David Colwell

Maybe I need to rearchitect how the algorithm works, maybe split it up into different agents and get them to tackle these problems independently. Creating a good benchmark is something we view as the tester's role, or really the tester combined with the data scientist: together they figure out what the benchmark is. The tester brings the categories and the risks and the things he or she wants to see.

David Colwell

So that unlocks productivity for the engineering teams, and it also gives you something to go back to your customers with and say, okay, this new version improves in these areas. The last point on benchmarking is: how do you keep your benchmarks alive? You need to be gathering results from production, but you often can't gather all the data from production, because customers will not want to give it to you.

David Colwell

They'll say it's sensitive, but customer tolerance for you using that data for benchmarking is much higher than for using it for training. Training data ends up in the model and can show up all over the place; benchmarking is a bit safer. So we have customers that opt in to having their data be part of the benchmarks, because that gives them some comfort that, hey, when you're seeing if the model is good or bad, when you're improving it, you're actually trying to improve it.

David Colwell

You're improving it for me, because my data is part of that. So there's some reward for them in doing so. But you need to make sure you're continually collecting that data and putting it into the benchmarks, so that you're keeping them current and dealing with those shifting user behaviors over time, so that you're not just hallucinating about what you think the inputs and outputs are going to be, but gathering real inputs and outputs from production.
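
A hedged sketch of what that continual collection might look like, assuming a record format, an opt-in registry, and a list-like benchmark store that are all hypothetical: only opted-in customers' production inputs and outputs are appended, and each sample is flagged for human curation before it becomes part of the golden set.

```python
# Hypothetical sketch: keep the benchmark current with opted-in production data.
from datetime import datetime, timezone


def collect_benchmark_samples(production_records, opted_in_customers, benchmark_store):
    """Append real production inputs/outputs to the benchmark store,
    but only for customers who have explicitly opted in to benchmarking use."""
    for record in production_records:
        if record["customer_id"] not in opted_in_customers:
            continue  # consent covers benchmarking only; skip everyone else
        benchmark_store.append({
            "input": record["input"],        # e.g. the JIRA ticket text
            "output": record["output"],      # what the AI actually generated
            "collected_at": datetime.now(timezone.utc).isoformat(),
            "needs_curation": True,          # a human still defines the expected result
        })
```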

David Colwell

It's very much something that the data scientists lean into in our organization.

Richie Cotton

Yeah, it's certainly a fascinating topic. And I suppose an example of benchmarking would be: if you've got a customer service chatbot, anything the customers ask the chatbot could then be used for benchmarking.

David Colwell

Yes. If your customers allow you to.

Richie Cotton

Yeah. Of course, of course. Always ask for a customer's consent before capturing the data. Okay, brilliant. So, just to wrap up, what are you most excited about in the world of data and AI?

David Colwell

So when it comes to what I'm most excited about, it's also what I'm most nervous about. It's going to sound super nerdy: it's everything to do with knowledge management, because we've gotten to the point where, if you give AI the right information, it generally gets a pretty good answer out of it. So the whole game we're fighting at the moment is, how do you take this vast sea of unstructured information and turn it into something that the AI can very easily get accurate, up-to-date information out of?

David Colwell

And I'm seeing a lot of movement in agentic search. I'm seeing a lot of movement in knowledge graphing, for example, and knowledge searching. I think those spaces are the key unlocks for the future of artificial intelligence. In the more near-term, less nerdy view of what I'm excited about, it's a lot to do with agents.

David Colwell

I'm very excited about agents. Everyone says an agent is going to be up at your level; I kind of view it as an agent is going to be your employee. As an employee of an organization, you're going to have your own set of agents that you build and manage, and they do work for you.

David Colwell

And that excites me because, for our customers, there's that feeling of being snowed under, of drowning in the volume of work they have, and I see agents as a cure for that. We are addressing this: we are rolling out these agents for our customers. We've already launched one, and there are two more in the pipeline to help our customers get this value sooner.

David Colwell

That excites me because I think the world is a lot more fun when you can delegate some of that dull work to agents to get done. As long as we don't lose our intelligence and our ability to review and understand when the AI is right and wrong, I see that as a very exciting future. So on the nerdy side, knowledge management gets me, because I'm still an AI nerd at heart.

David Colwell

But on the side of what's actually coming soon, productive, and out there in the world, agents are something I find incredibly exciting.

Richie Cotton

Oh yeah, right there with you. Knowledge graphs and graph transformers are very exciting on a technical front, but the big story of this year, for sure, is AI agents automating the work. Fingers crossed all the dull work goes away.

Richie Cotton

And we don't just end up as reviewers of AI output. So, whose work are you most excited about at the moment?

David Colwell

I follow a lot of what the AI frontier labs are doing. And for me, the most exciting company out there is still Anthropic. I love the work that they're doing on being responsible with artificial intelligence, and they're very closely aligned with our domain. They just recently released the latest Claude model, which is already a very good coder, and it's just getting better and better at these technical tasks, and we're in that sort of enterprise space with them.

David Colwell

I follow a lot of what they do: they put MCP into the world, they've released their coding agents. I think that channel still has a lot of potential to milk out of it. When it comes to knowledge graphing work, it's kind of a toss-up for me; there are so many companies I follow in this space that I couldn't really pick a winner out of all of them.

David Colwell

And when it comes to research, this sounds kind of boring, but I'll tell you why I find it so exciting: Stanford are doing a lot of research on understanding the true productivity of artificial intelligence. That research is incredibly valuable to me because, in the same way that you want to benchmark your use case to figure out whether you're progressing or not, if you can get to a structured and rigorous way of determining not just whether the AI result is accurate, but whether the productivity of the system involving people and AI is increasing or decreasing, then that gives us the real information we need to determine

David Colwell

the path forward for our entire organizations. So I really enjoyed the research they did on determining the productivity of coding AIs. They recently presented that at, I think it was, the AI Engineering World Summit, and they're doing a lot more research on governance of those agents and improvements of productivity in that space.

Richie Cotton

Absolutely. I mean, so many people will claim, okay, you can get a ten-times productivity improvement from using AI, and that's not really going to happen very often. I do like that there's real academic research going on to measure how much you're actually going to get by adopting a technology. So yeah, the work out of Stanford is great stuff.

Richie Cotton

Yeah, thank you so much. Thank you so much for your time, David.

David Colwell

Thank you. Richie, it's been a pleasure.
