3 Stages of Building Self-Healing IT Systems With Multiagent AI

The ability to shift to a self-healing system for operations management, where humans are brought in only when necessary, is the true ROI of AI agents.

May 13th, 2025 9:00am by João Freitas


Image by Gerd Altmann on Pixabay.

Modern IT teams are under pressure to minimize downtime and constantly improve operations management, but the tools they’re given to accomplish their goals often fall short, even if they allow some automation of routine tasks.

At a basic level, organizations can improve the operations management pipeline by using individual AI agents to complete simple tasks and offer alternative options if known fixes don’t work. AI agents can go far beyond this when they act autonomously within a network of agents, which moves an organization closer to self-healing systems.

AI agents are already in wide use: one survey found that more than half of organizations have deployed them. However, multiagent AI systems offer a glimpse into the future of operations. These networks of agents, as opposed to individual agents acting autonomously for individual processes, enable collaboration between AI tools to diagnose and resolve IT issues in real time.

Multiagent AI systems can bring significant improvements to existing processes across the operations management lifecycle. From intelligent ticketing and triage to autonomous debugging and proactive infrastructure maintenance, these systems can pave the way for IT environments that are largely self-healing. Eventually, human intervention may be limited to cases where human expertise is genuinely needed, helping to keep businesses online and deliver constant innovation for their customers.

To realize this potential, organizations should focus on three key stages — diagnosis, remediation and continuous learning — to build progressively toward a self-healing IT environment.

Diagnosis

AI agents in operations management will significantly accelerate incident diagnostics. Incident investigation and diagnosis is a time-consuming, multistep process that often starts with a single incident responder. That is a heavy burden for one person, particularly as regulations impose short reporting deadlines on organizations.

In Europe, the EU has introduced the NIS2 Directive and the Digital Operational Resilience Act (DORA), and in the UK, a Cyber Security and Resilience Bill is expected later this year. These regulations all demand that organizations increase digital operational resilience, which is extremely difficult for humans to achieve when working manually.

At this initial stage of operations management, an AI agent network can be deployed to triage the issue, assess an incident’s severity and potential blast radius, and escalate it as needed. AI agents can understand ongoing support tickets, as well as which services are directly affected in an incident.

By operating within a network and with more advanced reasoning capabilities, these agents can also identify which services depend on others and provide the additional context human responders need.
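As a rough illustration of how an agent might estimate blast radius from service dependencies, here is a minimal Python sketch. The service names and the `DEPENDS_ON` map are hypothetical; a real agent would pull this data from a service catalog or tracing data rather than a hardcoded dictionary.

```python
from collections import deque

# Hypothetical dependency map: each service lists the services it depends on.
DEPENDS_ON = {
    "checkout": ["payments", "cart"],
    "payments": ["database"],
    "cart": ["database"],
    "search": ["index"],
    "database": [],
    "index": [],
}


def blast_radius(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map so we can walk from a service to the services that use it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk outward from the failed service.
    affected, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in affected:
                affected.add(service)
                queue.append(service)
    return affected
```

In this toy map, a failure in "database" reaches "payments", "cart" and "checkout", while a failure in "index" reaches only "search". That is the kind of context a human responder would otherwise have to assemble by hand.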

AI agent systems allow human responders to leverage intelligent automation to detect and triage incidents, depending on their severity and whether a known fix exists. At the triage stage, agentic AI can run diagnostics, such as checking for application errors, inspecting computational resources like CPU or memory, and gathering logs and trace data, so that responders can rule out typical, recurring issues.
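The triage decision described here, based on severity and whether a known fix exists, could be sketched as follows. Everything in this example is an assumption made for illustration: the severity scale (1 is most critical), the `signature` fingerprint field, the `KNOWN_FIXES` table and the routing thresholds are not taken from any particular product.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTO_REMEDIATE = "auto_remediate"
    SUGGEST_TO_HUMAN = "suggest_to_human"
    ESCALATE = "escalate"


@dataclass
class Incident:
    service: str
    severity: int   # assumed scale: 1 = critical ... 4 = low
    signature: str  # hypothetical fingerprint of the error


# Hypothetical table of recurring issues and their known fixes.
KNOWN_FIXES = {"disk_full": "rotate_logs", "oom_killed": "restart_pod"}


def triage(incident, known_fixes):
    """Pick a route from the incident's severity and whether a fix is known."""
    if incident.signature not in known_fixes:
        return Route.ESCALATE          # novel issue: hand it to a human
    if incident.severity <= 2:
        return Route.SUGGEST_TO_HUMAN  # known fix, but too critical to run unreviewed
    return Route.AUTO_REMEDIATE        # known fix and low severity
```

The thresholds are a policy choice. A team that is new to agents might route every incident through a human at first and loosen the rule as confidence grows.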

Remediation

Following the diagnosis stage, an AI agent network can bring major improvements to incident remediation. When an incident is detected, AI agents can attempt to resolve it with known fixes drawn from past incident information. When multiple agents are combined within a network, they can work out alternative solutions if the initial remediation effort doesn’t work, while keeping engineers informed of their progress.
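One simple way to picture this "try a known fix, fall back to alternatives, keep engineers informed" loop is the sketch below. It is not a description of any vendor's implementation. The `fixes` list of (name, callable) pairs and the `notify` callback are hypothetical stand-ins for real runbook actions and a real messaging channel.

```python
def remediate(incident, fixes, notify):
    """Try each candidate fix in order.

    `fixes` is a list of (name, callable) pairs; each callable takes the
    incident and returns True if it resolved the issue. `notify` receives a
    status message at every step so engineers can follow along.
    Returns the name of the fix that worked, or None if none did.
    """
    for name, apply_fix in fixes:
        notify(f"Trying {name} for {incident}")
        try:
            if apply_fix(incident):
                notify(f"{name} resolved {incident}")
                return name
        except Exception as exc:
            # A crashing fix should not stop the agent from trying the next one.
            notify(f"{name} failed with error: {exc}")
            continue
        notify(f"{name} did not resolve {incident}")
    notify(f"No fix worked for {incident}; escalating")
    return None
```

Ordering the candidate fixes from least to most disruptive, for example a restart before a rollback, is a sensible default but remains a design decision for each team.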

Keeping a human in the loop (HITL) is vital to verifying the outputs of an AI model, but agents must be trusted to work autonomously within a system to identify fixes and then report these back to engineers.

In a network, AI agents have enough historical data to understand how to resolve incidents and when to escalate an issue to a human responder. The AI agents can even suggest a resolution to a human and then, if accepted, take autonomous action, allowing the human to act as an intermediary and verify the model’s output.
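The suggest-then-act pattern can be reduced to a small approval gate. In this sketch, `approve` stands in for whatever channel a human uses to accept or reject a suggestion (a chat prompt or a ticket action, for example) and `execute` stands in for the agent's action; both names, and the `audit_log` list, are hypothetical.

```python
def propose_and_apply(suggestion, approve, execute, audit_log):
    """Run `execute` only if the human reviewer approves the suggestion.

    Every decision is appended to `audit_log` so there is a record of what
    the agent proposed and what the human decided.
    """
    if approve(suggestion):
        result = execute(suggestion)
        audit_log.append((suggestion, "approved", result))
        return result
    audit_log.append((suggestion, "rejected", None))
    return None
```

Keeping the audit trail also produces the record of accepted and rejected suggestions that agents can later learn from.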

By combining an HITL approach with autonomous AI agents, organizations can iteratively improve their operations management processes and shift more toward agentic models and closer to a self-healing system.

Learning

The most important step in creating a self-healing system is training AI agents to learn from each incident, as well as from each other, so they can become truly autonomous.

For this to happen, AI agents cannot be siloed into incident response. Instead, they must be incorporated into an organization’s wider system, communicate with third-party agents and be allowed to draw correlations from each action taken to resolve each incident. In this way, each organization’s incident history becomes the training data for its AI agents, ensuring that the actions they take are organization-specific and relevant.

With time, AI agents can identify successful patterns in operations management and adjust their strategies accordingly. The result is a system that is truly self-healing, in which engineers can trust their AI agents to resolve simple, recurring incidents and escalate only high-priority or novel issues to a human.
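As a deliberately simple illustration of learning from incident history, the sketch below tracks how often each fix has worked for each kind of incident and only allows autonomous resolution once a fix has a proven track record. The `FixHistory` class, the signature strings and the default thresholds are all assumptions made for this example. Real agents would likely rely on far richer signals than a success counter.

```python
from collections import defaultdict


class FixHistory:
    """Track outcomes per (incident signature, fix) pair."""

    def __init__(self):
        # (signature, fix) -> [successes, attempts]
        self._stats = defaultdict(lambda: [0, 0])

    def record(self, signature, fix, succeeded):
        stats = self._stats[(signature, fix)]
        stats[1] += 1
        if succeeded:
            stats[0] += 1

    def success_rate(self, signature, fix):
        successes, attempts = self._stats.get((signature, fix), (0, 0))
        return successes / attempts if attempts else 0.0

    def should_auto_resolve(self, signature, fix, min_attempts=5, min_rate=0.9):
        """Allow autonomy only with enough evidence and a high success rate."""
        _, attempts = self._stats.get((signature, fix), (0, 0))
        if attempts < max(min_attempts, 1):
            return False  # not enough history yet: keep a human involved
        return self.success_rate(signature, fix) >= min_rate
```

With a history like this, a recurring incident whose fix has worked nine times out of ten can be handled autonomously, while anything unseen still goes to a human.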

Trust AI Agents To Heal Your Systems

Since the latest AI hype cycle began, organizations have been struggling to derive true return on investment (ROI) from AI. While generative AI tools are good for summarization and content creation, they do not offer the transformative benefits that AI agents can.

AI agents will completely change the way engineers are able to spend their working time. Instead of fighting fires and spending hours on call and in war rooms, engineers will be able to focus on driving innovation and improving services while trusting AI agents to manage the toil of incident remediation.

The ability to shift to a self-healing system for operations management, where humans are brought in only when necessary, is the true ROI of AI agents. This new paradigm for operations may shift system downtime to near zero, all while drastically improving engineers’ day-to-day working experience.

João Freitas is general manager and engineering lead for AI at PagerDuty. With about 20 years of experience in software development, machine learning and as a people manager, he was previously CTO at an AI startup and has taken several...