LLM Observability: 6 Lessons From Datadog's CTO

Ahead of DASH 2026, Datadog co-founder Alexis Lê-Quôc explains how AI changed code review, why production is the real test, and where agents should take over.

Jun 9, 2026 · 9 min read

Engineering teams are shipping more code than they can read. AI assistants now write a large share of it, faster than any reviewer can keep up line by line. That shift is the backdrop for Datadog's DASH conference in New York this week, where co-founder and CTO Alexis Lê-Quôc is running a session called "The New Shape of Engineering."

His argument is straightforward. The way teams operate software has not changed: you ship a change, roll it out, and watch what happens. What has changed is the volume and pace, and that changes what keeps software safe.

In this article, I'll break his thinking into six central lessons, from changes in the review process to using production as the ultimate test, and cover what you can take from each.

If you're new to the concept of LLM observability, I recommend reading our guides to getting started with MLOps and LLM evaluation as a starting point.

In a Nutshell

Lê-Quôc's through-line is that observability becomes the control layer for software that AI writes, tests, and ships, for the people operating it and for the agents themselves.

The six lessons, in brief:

Review moves off the code itself. There is too much AI-written code to read line by line, so the real check is the tests, specs, and proofs you design upfront, including guarding against agents that game those tests.
Production is the only test that counts. A green CI run proves little once real users hit assumptions you couldn't check in advance, and a model's output is never fully certain, so you monitor it live and keep a stop button.
Let agents take the toil. Hand them the dashboard-watching and hypothesis-chasing that fatigues humans, and keep people for the high-judgment calls.
Split the work into two loops. Use a development loop (write, ship, verify, fix) and an operations-and-security loop (detect, investigate, resolve).
Keep AI spend in check. Right-size which model does which job using agent trajectory data, and leave that decision with the developers and SREs who make it.
Learn how to learn. Models are patient tutors, but the skill is interrogating them: understanding systems layer by layer, and asking why the code they wrote actually worked.

Lesson 1: AI Broke the Old Way of Reviewing Code

Let's start with the pressure that sets everything else off: there is more code than anyone can read.

Lê-Quôc is blunt that the old model, a human reading a pull request line by line, does not survive contact with AI-assisted development. The anxiety he hears across the industry is about reviews becoming impossible, because there is too much going on to follow by reading PRs.

His response is not to ask people to read faster, but to move the review somewhere else.

The review isn't the line of code anymore; there's too much, you can't keep up. It's about what tests we design upfront, and telling the agent not to cheat them.
Alexis Lê-Quôc, CTO at Datadog

That last clause is easy to miss. Once you orchestrate one agent to plan, another to write, and another to test, you also have to stop the writer from gaming the automated tests instead of solving the problem.
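One simple guard against a writer agent gaming its tests is to reject any change that touches the tests it is supposed to pass. The sketch below is a minimal illustration of that idea, not Datadog's method: the `PROTECTED_DIRS` convention and both function names are assumptions made up for this example.

```python
from pathlib import PurePosixPath

# Hypothetical convention: top-level directories the writing agent may not modify.
PROTECTED_DIRS = ("tests", "specs")


def protected_changes(changed_paths):
    """Return the changed paths that fall under a protected directory."""
    flagged = []
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        if parts and parts[0] in PROTECTED_DIRS:
            flagged.append(path)
    return flagged


def review_agent_change(changed_paths):
    """Reject a change outright if it edits the tests it is supposed to pass."""
    flagged = protected_changes(changed_paths)
    if flagged:
        return ("reject", flagged)
    return ("accept", [])
```

A check like this only covers the crudest form of cheating. An agent can still special-case test inputs inside the implementation, which is why the tests themselves need to be designed upfront with that in mind.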

He goes beyond tests. Datadog now adds semi-formal and formal proofs that a spec does what it should, something too taxing to attempt widely before agents took on the heavy lifting. It works best on backend and coordination systems, where the behavior is mathematical enough to reason about precisely.

Lesson 2: Production Is the Only Test That Counts

Passing every test in CI is necessary and nowhere near sufficient. The failures that matter happen later.

The place where it really matters is production.
Alexis Lê-Quôc, CTO at Datadog

Every release rests on assumptions you can't fully check beforehand, about the shape of the data and how users behave. Hold those assumptions up to enough real traffic, and the rare cases stop being rare; they become the everyday slowdowns and errors of data and model drift.

LLMs make this harder. With ordinary code, you can at least reason through every branch. No one can explain mechanistically why a model returns what it returns, and the same input is not guaranteed to give the same output, so the occasional strange result can't be engineered away.

So you stop trying to prove a system correct before it ships. Instead, you:

Write evaluations for the behavior you want
Monitor it in production
Keep a stop control for a rollout that turns bad

The question is no longer whether it passed, but whether a problem is a one-off or the start of a trend.
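One way to make the one-off versus trend distinction concrete is to look at how many monitoring intervals in a row breach a threshold. The sketch below is a deliberately simple heuristic: the function name, the 2% threshold, and the three-interval rule are illustrative assumptions, not values from the talk.

```python
def classify_errors(error_rates, threshold=0.02, min_consecutive=3):
    """Label a series of per-interval error rates.

    Returns "ok" if no interval exceeds the threshold, "trend" if at least
    min_consecutive intervals in a row exceed it, and "one-off" otherwise.
    """
    longest = current = 0
    for rate in error_rates:
        # Extend the current run of bad intervals, or reset it on a good one.
        current = current + 1 if rate > threshold else 0
        longest = max(longest, current)
    if longest == 0:
        return "ok"
    if longest >= min_consecutive:
        return "trend"
    return "one-off"
```

Real monitors usually compare against a baseline or seasonality model instead of a fixed threshold, but the decision they feed is the same one.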

That live signal is not just a dashboard for humans. Wired into the deployment system, it lets an agent roll a change out the way a careful engineer would, to one percent of users, then five, judging from real data whether the change is doing what was intended.
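A staged rollout with a stop control can be sketched in a few lines. This is a minimal illustration under stated assumptions: `error_rate_at` is a hypothetical callable standing in for the live production signal, the 2% limit is arbitrary, and only the 1% and 5% stages come from the text (the later stages are made up).

```python
# Percent of users per stage; 1 and 5 come from the text, 25 and 100 are assumed.
STAGES = (1, 5, 25, 100)


def staged_rollout(error_rate_at, max_error_rate=0.02, stages=STAGES):
    """Advance through rollout stages, stopping at the first unhealthy one.

    error_rate_at is a hypothetical callable: given a percentage of users,
    it returns the error rate observed in production at that exposure.
    Returns (status, last_healthy_percent).
    """
    last_healthy = 0
    for percent in stages:
        if error_rate_at(percent) > max_error_rate:
            # Stop control: do not widen exposure past an unhealthy stage.
            return ("rolled_back", last_healthy)
        last_healthy = percent
    return ("completed", last_healthy)
```

For example, `staged_rollout(lambda percent: 0.01)` runs every stage, while a signal that degrades at 5% stops the rollout with 1% as the last healthy exposure.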

Lesson 3: Let Agents Take the Toil

Lê-Quôc's case for agents is not that they replace engineers, but that they take the parts of the job that wear people down.

Troubleshooting an incident means throwing hypotheses at a symptom, and on long incidents, it is often a far-fetched one that proves true. Datadog's Bits AI agent checks them all in parallel, ahead of the engineer, while the person steers it toward the hunch a dashboard would never surface.
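The idea of checking every hypothesis at once, far-fetched ones included, maps naturally onto a thread pool. The sketch below shows only the general pattern and says nothing about how Bits AI is built: the evidence fields and the three checks are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def check_hypotheses(hypotheses, evidence):
    """Run every hypothesis check concurrently and return the names that hold.

    hypotheses maps a name to a function that takes the evidence dict and
    returns True if the hypothesis is consistent with it.
    """
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(check, evidence) for name, check in hypotheses.items()}
        return sorted(name for name, future in futures.items() if future.result())


# Hypothetical incident evidence and checks, for illustration only.
evidence = {"deploy_minutes_ago": 12, "db_p99_ms": 40, "cert_days_left": 0}
hypotheses = {
    "recent deploy": lambda e: e["deploy_minutes_ago"] < 30,
    "slow database": lambda e: e["db_p99_ms"] > 500,
    "expired certificate": lambda e: e["cert_days_left"] <= 0,
}
```

In practice each check would query logs, traces, or metrics, which is where running them in parallel saves real time.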

The deeper point is fatigue. An on-call rollout is sudden alertness followed by hours of nothing, repeated until your judgment frays.

You're on high alert mode, and then you're watching paint dry.
Alexis Lê-Quôc, CTO at Datadog

An agent does not mind, and it does not get worse after four hours of staring at numbers. Stress and fatigue degrade human performance, which is why teams rotate people through on-call in the first place.

Hand the tireless watching to a machine, and people come back rested for the calls that need them. The same logic covers security triage, where analysts burn out sorting false positives from real threats.

Lesson 4: Split the Work Into Two Loops

Lê-Quôc organizes Datadog's agent work around two loops.

The development loop

Most engineers will recognize the first loop:

Write code
Ship it
See if it works
Fix it
Repeat

Datadog's angle is that a problem originating in code usually has its fix in code, so the platform tries to hand you that fix, informed by what it knows about the application: its ownership, its recent changes, and the errors it has thrown.

He points to database query optimization as an example. Any model can rewrite a slow query; the harder part is proving the rewrite is faster and safe before it reaches production, so Datadog tests it against a realistic copy of the production data first and hands over a pull request with the evidence attached.
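The "prove it before production" step can be illustrated with a small harness that runs both queries against a copy of the data, confirms they return the same rows, and records timings. This is a sketch under assumptions, not Datadog's implementation: it uses an in-memory SQLite table as a stand-in for a production copy, and the schema, queries, and function names are invented for the example.

```python
import sqlite3
import time


def run_timed(conn, sql):
    """Execute a query and return (sorted rows, elapsed seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return sorted(rows), time.perf_counter() - start


def compare_rewrite(conn, original_sql, rewritten_sql):
    """Check that a rewrite returns the same rows, and report both timings."""
    original_rows, original_time = run_timed(conn, original_sql)
    rewritten_rows, rewritten_time = run_timed(conn, rewritten_sql)
    return {
        # Rows are compared after sorting, so row order is ignored.
        "same_results": original_rows == rewritten_rows,
        "original_seconds": original_time,
        "rewritten_seconds": rewritten_time,
    }


# Hypothetical stand-in for a realistic copy of production data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

original = (
    "SELECT id FROM orders WHERE customer_id IN "
    "(SELECT customer_id FROM orders WHERE total > 990)"
)
rewritten = (
    "SELECT DISTINCT o.id FROM orders o "
    "JOIN orders x ON x.customer_id = o.customer_id WHERE x.total > 990"
)
report = compare_rewrite(conn, original, rewritten)
```

Timings on a toy table like this mean little. The part worth keeping is the shape of the evidence: an equivalence check plus measurements, attached to the pull request.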

The operations and security loop

The other loop runs in parallel, handled either by the same people or by a different team:

Detect
Investigate
Fix
Repeat

This is where Datadog's AI Guard triages security events and blocks attacks faster than an analyst working through them by hand. Agents can also handle routine operational chores that engineers do daily without much enthusiasm, like resizing that one Kubernetes pod.

Across both loops, Lê-Quôc is firm about the order of operations. Datadog does not start from "here is AI, what problem can it solve?" It starts from a problem customers already complain about, usually some version of "I don't want to do this repetitive thing", and works back to whether an agent can be trusted with it.

Lesson 5: Keep AI Spend in Check

Cost is the constraint sitting next to safety, and keeping the price of operationalizing large language models in check is becoming its own discipline. Lê-Quôc's answer at DASH is Datadog's Agent Console.

Ask a developer which model they need, and often they will name the most powerful (and expensive) one. Sometimes that's the right choice, but a lot of work is boilerplate that a cheaper, faster model handles just as well. Telling the two apart means reading the trajectories of an organization's agents, which tools they call, and how often they succeed, until patterns appear.

Those patterns become heuristics rather than rules: a frontier model, such as the latest Claude Opus or GPT, for planning, and something cheap, such as Claude Haiku, for generating tests.

Task	Model tier	Why
Planning and hard reasoning	Frontier (e.g., Claude Opus, GPT)	The strongest reasoning earns its cost here
Routine, boilerplate code	Mid-tier (e.g., Claude Sonnet, GPT-mini)	Capable enough, and far cheaper to run often
Generating tests and simple transforms	Cheap, fast (e.g., Claude Haiku, GPT-nano)	Speed and price win while quality holds
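A heuristic like the table above can be derived from trajectory data by picking, for each task, the cheapest tier whose observed success rate clears a bar. The sketch below assumes a made-up record format (`task`, `tier`, `succeeded`), made-up tier labels, and an arbitrary 90% bar. It does not reflect how Agent Console works.

```python
from collections import defaultdict

# Hypothetical tier labels, ordered from cheapest to most expensive.
TIERS_BY_COST = ("cheap", "mid", "frontier")


def success_rates(trajectories):
    """Compute the success rate per (task, tier) from agent trajectory records."""
    totals = defaultdict(lambda: [0, 0])  # (task, tier) -> [successes, runs]
    for record in trajectories:
        key = (record["task"], record["tier"])
        totals[key][0] += 1 if record["succeeded"] else 0
        totals[key][1] += 1
    return {key: successes / runs for key, (successes, runs) in totals.items()}


def pick_tier(task, rates, min_success=0.9):
    """Pick the cheapest tier whose observed success rate clears the bar."""
    for tier in TIERS_BY_COST:
        if rates.get((task, tier), 0.0) >= min_success:
            return tier
    # Fall back to the strongest tier when no tier has enough evidence.
    return "frontier"
```

Because the output is a heuristic and not a rule, the developer who owns the agent can still override the pick when a task turns out to be harder than its label suggests.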

The principle underneath is about who owns the decision. Roll cost up to a single number, and you get what Lê-Quôc calls "very low actionability": either everyone stops spending, which kills useful work, or everyone keeps spending, which the business can't sustain. He would rather put the data in front of the developers and SREs who choose the models.
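The difference between a single rolled-up number and an actionable view is mostly a matter of grouping. The sketch below breaks spend down by team and model so each line item has an owner. The record fields (`team`, `model`, `cost_usd`) are assumptions made up for this example.

```python
from collections import defaultdict


def cost_by_owner(usage_records):
    """Break total spend down by (team, model) instead of one rolled-up number."""
    breakdown = defaultdict(float)
    for record in usage_records:
        breakdown[(record["team"], record["model"])] += record["cost_usd"]
    # Largest line items first, so the owners of the biggest spend see it first.
    return sorted(breakdown.items(), key=lambda item: item[1], reverse=True)
```

Each row of the result points at a team that can decide whether a cheaper model would do the same job, which is the decision a company-wide total cannot inform.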

Lesson 6: Learn How to Learn

Asked what new engineers should study, Lê-Quôc gives an answer that sounds old and isn't.

You've got to learn how to learn.
Alexis Lê-Quôc, CTO at Datadog

Models are the most patient tutors ever invented, able to explain anything at any pace, a level of access that used to be reserved for royalty with private teachers. But a tutor is only useful if you interrogate it. The skill is knowing what to ask and how to check the answer.

He recommends understanding computers layer by layer instead of treating them as magic. Take a scheduler, a load balancer, or a sandbox, ask a model to explain how it works, then keep pushing:

What does this term mean?
How do you measure it?
What is the math behind it?
How do you know it is working well?

Studying the classics this way is slow on purpose. He compares it to learning an instrument; you can listen to music all day, but to play piano, you have to put your hands on the keys.

The same goes for AI-written code. Vibe coding is fine, he says, as long as you come back and ask why it worked: why was it built this way, are there better approaches, what was it modeled on. The aim is not to write less code with AI. It is to understand the code you now produce so much more of.

Final Thoughts

Lê-Quôc's central message is that the loop hasn't changed, but the pace has. What is different is that no human can watch closely enough at the speed AI now moves, so the watching, and a growing share of the building, moves to agents that don't tire and don't panic.

He argues for treating observability as a control plane rather than a set of charts. If agents are going to write, test, ship, and operate software, they need the same grounding in real production data that good engineers rely on, plus a person holding the judgment calls and the stop button. Datadog is positioning observability as the layer that makes that trade safe.

The skill this framing asks of engineers is clear: read systems through how they behave in production, not just through their source. If you want to build that habit, our Machine Learning in Production skill track is a good place to start.

Author

Tom Farnschläder
