How Agentic AI Is Reshaping CDN Incident Response

Agentic AI is changing how SREs handle CDN incidents. A Principal SRE at NVIDIA shares five lessons on building workflows you can trust.
Franz Knupfer
Published: May 19, 2026

1. Training an AI agent is like onboarding a new employee.

The reason AI agents hallucinate and behave unexpectedly usually isn’t a model problem. Instead, it’s a context problem. According to Mercereau, building a reliable agentic workflow is less like configuring software and more like onboarding a new hire. You need to give them access to the right tools, explain how things work, and test them before you trust them with anything critical.

“We as humans have the benefit of all of our surroundings — feel, touch, sight, smell, hearing,” Mercereau said. “All of that gives us context clues. And it’s really important that you give your agentic workflows those same sort of context clues to be able to produce what you need out of them.”

The complaints Mercereau hears most often about AI trace back to this. An agent given a vague prompt and no tools will behave like a new employee given no training and no system access. The output reflects the input.

For teams getting started, Mercereau recommends the NVIDIA NeMo agent toolkit as the scaffolding for building these workflows. It’s what he uses day-to-day in his role as Principal SRE for CDN at NVIDIA.

Learn how to write an agentic skill for analyzing CDN logs.

2. Don’t give an agent responsibility until it’s fully tested.

Once you’ve built a well-contextualized agent, the next question is how much to trust it, and when. Mercereau’s answer draws on the same new-employee logic: you don’t put an intern on call their first week.

According to Mercereau, “Before I give my agent the authority to decide whether or not I get paged at 3:00 in the morning, I want to validate and put it through its paces. Just like I do with any other colleague that I work with, I’m going to train them up, give them access to the tools that they need, tell them how different things work.”

The path to that trust is iterative. Ouderkirk framed it as a process of gaining confidence rather than a switch you flip. According to Ouderkirk, “It’s not a zero-to-one. You’re not going to install a cloud scale and everything is going to change overnight. It’s a process of iteration, keeping human-in-the-loop, and then at some point you give it the agency to do what you want it to do.”

Consistent behavior across varied prompts is Mercereau’s benchmark for trust. If you alter the question slightly and still get the expected answer, that’s a good sign. When the agent reliably answers as expected, you can give it more authority.
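One way to put the "alter the question slightly" test into practice is to score an agent against a set of paraphrased prompts. The sketch below is a minimal illustration, not Mercereau's method: `consistency_score`, `fake_agent`, the prompts, and the expected label `origin-timeout` are all hypothetical, and `fake_agent` stands in for whatever client actually calls your agent.

```python
from typing import Callable, Iterable


def consistency_score(agent: Callable[[str], str],
                      prompts: Iterable[str],
                      expected: str) -> float:
    """Return the fraction of prompts for which the agent gives the expected answer."""
    prompts = list(prompts)
    if not prompts:
        raise ValueError("need at least one prompt")
    want = expected.strip().lower()
    hits = sum(1 for p in prompts if agent(p).strip().lower() == want)
    return hits / len(prompts)


# Hypothetical stand-in for a real agent call; swap in your own client.
def fake_agent(prompt: str) -> str:
    text = prompt.lower()
    return "origin-timeout" if "5xx" in text or "error" in text else "unknown"


# Three phrasings of the same question (illustrative only).
PARAPHRASES = [
    "Why are 5xx responses spiking in us-east?",
    "What is causing the error spike in us-east?",
    "Explain the jump in 5xx for the us-east POPs.",
]

score = consistency_score(fake_agent, PARAPHRASES, expected="origin-timeout")
print(f"consistency: {score:.0%}")
```

A score that stays high as you add more paraphrases is one signal that the agent may be ready for more authority. Exact string matching is a simplification here; free-text answers would need a looser comparison.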

3. Once ready, agents can make the 3AM incident call easier.

Mercereau described what mature agentic workflows can look like in practice. “Imagine being able to have an agent troubleshoot and investigate exactly what’s going on when 20 people on this conference bridge have been analyzing data for hours and are unable to figure it out. But you, from picking up the phone to implementing a solution, only took you nine minutes. And it’s all because of the toolsets that you’re able to employ.”

In the situation Mercereau described, an agentic approach combined with Hydrolix data resolved a CDN issue quickly.

Faster incident resolution isn’t the only benefit. In the long term, agentic workflows can use event triggers to decide whether an issue warrants escalation at all, investigating first and paging a human responder only if necessary. As Mercereau puts it, “Imagine a world where instead of being paged at 3:00 in the morning, you can stay in bed for this one.”
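The investigate-first, page-only-if-necessary flow can be sketched as a small triage function. Everything here is an assumption for illustration: the `Finding` fields, the `handle_alert` name, the alert dictionary, and the 0.8 confidence cutoff are hypothetical, and `investigate` stands in for the agent's actual investigation step.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Finding:
    summary: str
    customer_impact: bool  # did the investigation find user-visible impact?
    confidence: float      # agent's self-reported confidence, 0.0 to 1.0


def handle_alert(alert: dict,
                 investigate: Callable[[dict], Finding],
                 min_confidence: float = 0.8) -> str:
    """Investigate first, then decide whether a human needs to be paged."""
    finding = investigate(alert)
    # Page on confirmed impact, or when the agent is not sure enough to rule it out.
    if finding.customer_impact or finding.confidence < min_confidence:
        return "page"
    return "log_only"


# Hypothetical investigators standing in for a real agent.
def benign(alert: dict) -> Finding:
    return Finding("monthly batch job, seen before", False, 0.95)


def unsure(alert: dict) -> Finding:
    return Finding("pattern not recognized", False, 0.40)


alert = {"source": "cdn", "signal": "5xx_rate", "region": "us-east"}
print(handle_alert(alert, benign))  # log_only
print(handle_alert(alert, unsure))  # page
```

Failing toward a page when confidence is low keeps a human in the loop while the agent is still earning trust.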

Security concerns remain critical. While agents can help solve many issues, agentic workflows handling sensitive operational data need security reviews before deployment. The speed at which these tools act can increase the risk of data exposure.

4. Logs remain foundational, and log retention is critical for agents.

For agentic CDN observability to work, the agent needs access to high-fidelity, long-retention log data. This is where a lot of teams are underinvested without realizing it.

According to Ouderkirk, if you’re discarding logs after seven days, any monthly or seasonal pattern looks like an anomaly to your agent. You end up getting paged for things that aren’t problems and quickly losing confidence in what your agents are flagging.
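The retention effect can be shown with a toy example. The sketch below uses made-up daily request counts with a spike every 30 days and a deliberately simple relative-deviation check. The function name, the 30-day period, and the 0.5 tolerance are assumptions for illustration, not a recommended detector.

```python
from statistics import mean
from typing import Optional, Sequence


def is_anomalous(history: Sequence[float], today: float,
                 period: Optional[int] = None, tolerance: float = 0.5) -> bool:
    """Flag `today` if it deviates from the baseline by more than `tolerance` (relative)."""
    if period is not None and len(history) >= period:
        # Enough retention: compare against the same phase of earlier cycles.
        baseline = history[-period::-period]
    else:
        # Too little retention to see a full cycle: use all retained days.
        baseline = history
    base = mean(baseline)
    if base == 0:
        return today != 0
    return abs(today - base) / base > tolerance


# Synthetic daily request volume (made-up numbers): a 3x spike every 30 days.
PERIOD = 30
series = [300.0 if day % PERIOD == 0 else 100.0 for day in range(91)]
history, today = series[:-1], series[-1]  # day 90 is a spike day

seven_day_view = is_anomalous(history[-7:], today, period=PERIOD)
ninety_day_view = is_anomalous(history, today, period=PERIOD)
print(seven_day_view)   # True: the spike looks new with only 7 days retained
print(ninety_day_view)  # False: earlier cycles show the same spike
```

With only seven days retained, the check has nothing to compare the spike against except ordinary days, so it flags a recurring event as a problem.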

Mercereau extended this to the metrics-versus-logs question. Metrics are useful, but they’re a reduction of the underlying data. You decide what to measure, and that decision encodes assumptions about what will matter later. Raw logs are necessary if you want to have flexibility about the questions you can ask and the answers you can get.

“You’re never going to get the granularity from a metric that you’ll be able to get from a log,” Mercereau said. “Especially when we’re talking about CDN—especially when we’re talking about global streaming, global content delivery—there are micro-outages all the time all over the world that won’t even flag in a metric if your metric doesn’t have that level of granularity.”
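A small simulation illustrates how a micro-outage can vanish inside a coarse metric. The log records, POP names, 2% threshold, and bucket sizes below are all invented for this sketch: one POP fails every request for ten seconds, which barely moves a global five-minute error rate but stands out when the same raw logs are grouped per POP in ten-second buckets.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

LogLine = Tuple[int, str, int]  # (timestamp in seconds, POP name, HTTP status)


def error_rates(logs: Iterable[LogLine], bucket_seconds: int,
                by_pop: bool) -> Dict[Tuple[int, str], float]:
    """Share of 5xx responses per time bucket, optionally split by POP."""
    totals: Dict[Tuple[int, str], int] = defaultdict(int)
    errors: Dict[Tuple[int, str], int] = defaultdict(int)
    for ts, pop, status in logs:
        key = (ts // bucket_seconds, pop if by_pop else "*")
        totals[key] += 1
        if status >= 500:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


# Synthetic logs: 3 POPs, 10 requests per second each, for 5 minutes.
# The hypothetical "nrt" POP fails every request from second 120 to 129.
logs = [
    (ts, pop, 503 if pop == "nrt" and 120 <= ts < 130 else 200)
    for ts in range(300)
    for pop in ("fra", "nrt", "iad")
    for _ in range(10)
]

THRESHOLD = 0.02  # assumed alerting threshold: 2% errors
coarse = error_rates(logs, bucket_seconds=300, by_pop=False)
fine = error_rates(logs, bucket_seconds=10, by_pop=True)

coarse_flags = [k for k, rate in coarse.items() if rate > THRESHOLD]
fine_flags = [k for k, rate in fine.items() if rate > THRESHOLD]
print(coarse_flags)  # []: 100 errors out of 9000 requests is about 1.1%
print(fine_flags)    # [(12, 'nrt')]: that bucket is 100% errors
```

The coarse view could only be computed this way because the raw logs were kept. A pre-aggregated five-minute metric would have discarded the detail needed to find the affected POP.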

For operators building agentic workflows, this is a foundational architectural decision. Full-fidelity, long-retention log data isn’t just a nice-to-have. It’s required context for your agents.

5. When CDN infrastructure fails, your business takes the hit.

CDN sits between your platform and your customers. When something goes wrong in that last mile, customers don’t see a CDN failure. They see a business failure. “It’s the last thing your customers see and the one thing that they never blame when there’s a problem,” Mercereau said.

Issues like CDN outages, degraded streams, and misconfiguration errors can lead to churn and reputation damage if you don’t identify and mitigate problems quickly.

This is what makes CDN observability a business problem, not just an ops problem. The faster you can detect and respond to issues in that last mile, the better you can protect your brand.

Next Steps

Interested in learning more about Hydrolix? Check out these resources.

With Hydrolix MCP, you can ask natural language questions on petabyte-scale CDN data.
CDN Insights provides pre-bundled dashboards across all major CDNs so you get one unified dashboard with the ability to drill down into full-fidelity raw logs in seconds.
Get a demo or quick value assessment.
