Cybersecurity

Advancing Cybersecurity Operations with Agentic AI Systems

The age of passive AI is over. A new era is beginning, where AI doesn’t just respond—it thinks, plans, and acts. The rapid advancement of large language models (LLMs) has unlocked the potential of agentic AI systems, enabling the automation of tedious tasks across many fields, including cybersecurity. 

Traditionally, AI applications in cybersecurity have focused primarily on detecting malicious or anomalous activity across different data sources, cyber environments, and stages of the cyber kill chain. With detection being the center of automation, a substantial part of the security operation remains manual. Security analysts still spend large portions of their time manually investigating alerts, cross-referencing intelligence, and assessing and responding to potential threats.

With the rise of agentic systems, AI applications in cybersecurity are beginning to reframe around the needs of security analysts. These systems automate many of the time-consuming, tedious tasks analysts currently perform, freeing them to focus on higher-level judgment calls and deeper investigations. By leveraging advanced reasoning, dynamic decision-making, and tool-calling capabilities, agentic systems can now take on complex but repetitive tasks, such as researching threat intelligence, correlating security alerts, and executing preliminary response actions.

This post explores two practical agentic applications in alert management and vulnerability triage, offering a glimpse into the transformative potential that agentic systems hold for cybersecurity operations.

What is an agentic AI system?

In an agentic AI system, LLMs are connected to tools and enabled to reason, plan, and take actions iteratively. Rather than simply responding to prompts, the model works toward a goal by breaking it into steps, deciding what to do next, using tools to gather or analyze information, and adjusting its plan along the way. This setup makes it possible to automate complex, multistep tasks that weren’t feasible before. 
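The reason-act-observe cycle described above can be sketched in a few lines. This is a minimal illustration, not a production framework: the `decide` function stands in for an LLM's reasoning step, and the `lookup` tool is a hypothetical placeholder.

```python
# Minimal sketch of an agentic loop: pick an action, observe the result,
# and repeat until the goal is reached or a step budget runs out.

def decide(goal, history):
    """Stand-in for an LLM call: choose the next action given the history."""
    if not history:
        return ("lookup", goal)           # first, gather information
    return ("finish", history[-1][1])     # then, conclude with what was found

TOOLS = {"lookup": lambda query: f"context for {query}"}

def run_agent(goal, max_steps=5):
    history = []
    for _ in range(max_steps):
        action, arg = decide(goal, history)
        if action == "finish":
            return arg
        history.append((action, TOOLS[action](arg)))  # act, then observe
    return None  # step budget exhausted without finishing

result = run_agent("InstanceDown alert")
```

The step budget (`max_steps`) matters in practice: without it, a looping agent can burn tokens indefinitely.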

Agentic AI applications in cybersecurity

This section explores two examples of agentic AI applications in cybersecurity: alert management and vulnerability triage.

Transforming alert management 

Alert management in cybersecurity presents several challenges that hinder operational efficiency, including:

  • Overwhelming alert volume: As organizations become more security-conscious, they continue to deploy more security products and detection rules. This leads to an ever-increasing number of alerts, which can quickly overwhelm understaffed security teams.
  • Institutional knowledge dependency: Triage relies heavily on institutional knowledge and experience of senior analysts, making it difficult to scale and standardize decision-making.
  • Labor-intensive context gathering: Relevant data for triage is often scattered across systems and requires manual effort to collect and consolidate for investigation.
  • Tedious documentation: Writing up findings is essential but time-consuming, often done poorly or skipped altogether.

Agentic systems address key challenges in alert management by scaling triage through automation, reducing reliance on individual expertise by encoding expert knowledge into repeatable workflows, and using data querying tools to automatically retrieve investigation context. Additionally, agents can generate clear, structured documentation as part of their process, turning a traditionally tedious task into a built-in feature.

Agentic system for server alert triage

The alert triage agent (Figure 1) is an event-driven system designed to automate the triaging of server-monitoring alerts. Unlike chatbot systems that rely on human prompting, this system is triggered automatically by events (the generation of new alerts) and requires minimal human involvement beyond final report review.

Architecture diagram showing how alerts flow from cloud-hosted systems to an analyst. Multiple hosts send data to a Cloud Monitoring System, which generates performance, security, and system health alerts. These alerts are processed by an Alert Triage Agent, which works with a Cloud Metric Analysis sub-agent. The output is a triage report, reviewed by an analyst.
Figure 1. Architecture of the alert triage agent

An example of an input alert for the triage system is shown below:

{
  "__name__": "ALERTS",
  "alertname": "InstanceDown",
  "alertstate": "firing",
  "aspect": "availability",
  "component": "instance",
  "instance": "alert-triage-agent-test-host.nvidia.com:9200",
  "job": "file_sd",
  "location": "e111a",
  "region": "na",
  "service": "instance",
  "severity": "critical",
  "host_id": "alert-triage-agent-test-host.nvidia.com"
}

This system ingests alerts from a cloud monitoring platform that oversees a cluster of hosts. When an alert is triggered, the agent starts an automated investigation (Figure 2). It first interprets the alert, then iteratively suggests and runs the next best step, using tools to collect and analyze relevant data. This cycle continues until the root cause is found. Once the investigation is complete, the agent generates a triage report with a summary of the alert, the investigation steps, key insights from the data, and recommended actions. The report is then stored for a human analyst to review.
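The event-driven entry point, including the maintenance check that short-circuits the investigation, might look like the following sketch. The function names and the hardcoded maintenance set are illustrative assumptions, not the system's actual implementation.

```python
# Sketch of the event-driven trigger: a new alert (like the JSON example
# above) starts triage automatically, with a maintenance check first.

def under_maintenance(host_id):
    # Assumed lookup against a maintenance schedule; hardcoded here.
    return host_id in {"patch-window-host.nvidia.com"}

def investigate(alert):
    # Placeholder for the iterative tool-calling investigation loop.
    return {"root_cause": "hardware", "evidence": ["telemetry gap"]}

def on_alert(alert):
    host = alert["host_id"]
    if under_maintenance(host):
        # Host is in a maintenance window: report directly, skip triage.
        return {"host": host, "status": "maintenance", "root_cause": None}
    finding = investigate(alert)
    return {"host": host, "status": "triaged", **finding}

report = on_alert({"alertname": "InstanceDown", "severity": "critical",
                   "host_id": "alert-triage-agent-test-host.nvidia.com"})
```

Putting the cheap deterministic check before the expensive agentic investigation is what keeps known-benign alerts from consuming LLM budget.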

Architecture diagram showing the end-to-end workflow of an alert triage system. When an alert is received, a Maintenance Check first determines if the host is under maintenance. If it is, a report is generated directly. If not, the triage process begins. The Alert Triage Agent coordinates diagnostic checks, including telemetry metrics analysis, network connectivity tests, monitoring system status checks, host performance analysis, and hardware status checks. These components use data from cloud metrics, host systems, and a hardware management system. A Root Cause Categorizer processes the findings and produces a final report for the analyst.
Figure 2. Tools and execution flow of the alert triage agentic system

Multi-agent collaboration for smarter alert triage

This system is designed to be multi-agent, with each agent specializing in a distinct part of the alert triage process. At the core is the Alert Triage Agent, acting as a security analyst responsible for interpreting alerts, guiding the investigation, and assembling the final report. Supporting it is the Cloud Metric Analysis Agent (the data scientist “sidekick”), which, upon receiving context about the alert, queries the most relevant cloud metrics, analyzes patterns, and returns a structured analysis. 

The two agents use separate prompts and have disjoint toolsets, enabling each to be tailored for its specific role. The Cloud Metric Analysis Agent serves as an agentic tool to the main Alert Triage Agent, and is only invoked when needed. This clear separation of responsibilities improves modularity, simplifies maintenance, and makes it easier to evolve the system over time. 
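The sub-agent-as-tool pattern can be sketched as below. The tool names, metric values, and decision thresholds are hypothetical; the point is the structure: disjoint toolsets, with the sub-agent registered in the main agent's toolset like any other tool.

```python
# Sketch of the two-agent split: the Cloud Metric Analysis Agent is exposed
# to the Alert Triage Agent as just another callable tool.

METRIC_AGENT_TOOLS = {"query_metrics": lambda host: {"cpu": 0.92, "io_wait": 0.4}}
TRIAGE_AGENT_TOOLS = {}  # populated below; disjoint from the metric agent's

def cloud_metric_analysis_agent(context):
    """Sub-agent: queries relevant metrics, returns a structured analysis."""
    metrics = METRIC_AGENT_TOOLS["query_metrics"](context["host"])
    return {"metrics": metrics, "anomaly": metrics["cpu"] > 0.9}

# Register the sub-agent as an agentic tool of the main agent.
TRIAGE_AGENT_TOOLS["analyze_metrics"] = cloud_metric_analysis_agent

def alert_triage_agent(alert):
    # The main agent invokes the sub-agent only when metric context is needed.
    analysis = TRIAGE_AGENT_TOOLS["analyze_metrics"]({"host": alert["host_id"]})
    verdict = "investigate" if analysis["anomaly"] else "benign"
    return {"alert": alert["alertname"], "verdict": verdict}

out = alert_triage_agent({"alertname": "InstanceDown", "host_id": "h1"})
```

Because each agent sees only its own tools and prompt, either one can be swapped out or tuned without touching the other.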

NVIDIA Agent Intelligence toolkit for cybersecurity use cases

This system is built natively with the open-source NVIDIA Agent Intelligence toolkit, which enables fast and simple development through configuration-based agent creation. This toolkit is well-suited for enterprise cybersecurity use cases due to its modular architecture. 

In large organizations, different security teams may build agents for a variety of use cases. However, many of these rely on common investigative functions, such as retrieving data from centralized cloud storage, analyzing system logs, or collecting host-level statistics. The toolkit provides standardized interfaces and reusable components that support these shared functions, reducing duplication and accelerating the development of new agents.

Evaluating the alert triage agent

To evaluate the effectiveness of the alert triage agent, the team curated a labeled dataset covering all root cause categories. On this dataset, the agent achieved a multiclass classification accuracy of 84.6%. Figure 3 shows the confusion matrix for this evaluation, a table that compares predicted labels against ground truth to show where the model is accurate or makes mistakes. The matrix shows strong performance in categories like hardware and false_positive. 
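The evaluation metrics are standard multiclass measures. The sketch below shows how accuracy and a confusion matrix are computed, using toy labels rather than the team's actual evaluation set.

```python
# Illustrative multiclass accuracy and confusion matrix, as used to
# evaluate the triage agent. Labels here are toy data.
from collections import Counter

LABELS = ["hardware", "software", "false_positive"]
truth = ["hardware", "hardware", "software", "false_positive", "software"]
preds = ["hardware", "hardware", "software", "false_positive", "hardware"]

# Accuracy: fraction of predictions matching the ground-truth label.
accuracy = sum(t == p for t, p in zip(truth, preds)) / len(truth)

# confusion[true_label][predicted_label] = count; the diagonal holds the
# correct predictions, off-diagonal cells show where the model confuses classes.
confusion = {label: Counter() for label in LABELS}
for t, p in zip(truth, preds):
    confusion[t][p] += 1
```

Reading the off-diagonal cells (here, one `software` alert predicted as `hardware`) is what surfaces systematic confusions like those visible in Figure 3.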

In addition to the quantitative results, human experts reviewed the generated reports to assess their quality (Figure 4). Security analysts rated the outputs as Very Good for correctness and relevance, and Good for coverage and actionability. While the reports were generally accurate and focused, some lacked depth or included unclear recommendations. These initial results show that the system is promising, with clear areas for improvement. As a next step, we’re collaborating with security analysts to refine the system and improve how it supports human workflows.

Confusion matrix visualizing the classification performance of an alert triage system across six categories: hardware, software, network_connectivity, repetitive_behavior, need_investigation, and false_positive. Most predictions align with ground truth along the diagonal, with notable misclassifications in the software and need_investigation categories.
Figure 3. Confusion matrix of the classification test results of the alert triage agentic system
Bar chart showing three analysts’ average scores for four report quality components: Coverage, Correctness, Relevance, and Actionability. Scores range from 0 to 5. Relevance has the highest average score (3.7), followed by Correctness (3.6). Coverage and Actionability both have an average of 3.4. Analysts' evaluations vary across components, especially in Coverage and Actionability.
Figure 4. Security analysts’ review of the alert triage agent’s report quality

Supercharging software vulnerability analysis with agentic AI

Like alert triage, software vulnerability analysis is a repetitive yet critical task that often overwhelms analysts. Enterprise software containers often have complex dependencies and must undergo a vulnerability scan before release. The vulnerabilities found in these scans require a tedious, manual triage process, involving the retrieval and analysis of hundreds of pieces of information. The software security agent is designed to shorten this triage process from hours or days to seconds (Figure 5).

Architecture diagram showing the workflow of a software security agent system. An event triggers pre-processing, followed by checklist generation. Task agents process checklist tasks in parallel, feeding into summarization and justification modules. These outputs populate a recommendations dashboard reviewed by an analyst.
Figure 5. Architecture of the software security agent

When given a vulnerability ID for a specific container, the agentic system kicks off its investigation. It has access to all relevant information about the container, including the code repository, software bill of materials, and documentation. 

First, the agent searches the internet to gather broader context around the vulnerability. Then, it creates a custom investigation plan based on what it knows about the vulnerability. Using that plan, it digs into the available data sources, reasons around them, and ultimately produces a report that helps the human analyst determine whether the vulnerability is actually exploitable in that specific environment. For a more detailed explanation, see Applying Generative AI for CVE Analysis at an Enterprise Scale.
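The plan-then-investigate flow can be sketched as follows. The CVE fields, checklist tasks, and container data below are all hypothetical stand-ins for the real context-gathering and reasoning steps.

```python
# Sketch of checklist-driven vulnerability triage: generate a custom plan
# from the vulnerability context, run each task against container data,
# then combine the results into an exploitability verdict.

def generate_checklist(cve):
    # Plan depends on what is known about the vulnerability.
    tasks = [("package_present", cve["package"])]
    if cve.get("requires_network"):
        tasks.append(("network_exposed", cve["package"]))
    return tasks

def run_task(kind, pkg, container):
    if kind == "package_present":
        return pkg in container["sbom"]            # software bill of materials
    if kind == "network_exposed":
        return pkg in container["listening_services"]

def triage(cve, container):
    results = {f"{kind}:{pkg}": run_task(kind, pkg, container)
               for kind, pkg in generate_checklist(cve)}
    # Exploitable only if every condition on the checklist holds.
    return {"results": results, "exploitable": all(results.values())}

verdict = triage(
    {"id": "CVE-0000-0000", "package": "libexample", "requires_network": True},
    {"sbom": ["libexample", "openssl"], "listening_services": []},
)
```

In this toy case the vulnerable package is present but not network-exposed, so the verdict is "not exploitable in this environment" — exactly the kind of environment-specific judgment the report is meant to support.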

From blueprint to deployment: saving analysts time

The open-source NVIDIA AI Blueprint for vulnerability analysis includes an interactive experience, where users can provide a custom vulnerability ID and observe the agent perform live vulnerability analysis on a container. The blueprint makes it easier for enterprises to build and operationalize their own agentic AI applications. It is available through NVIDIA-AI-Blueprints/vulnerability-analysis on GitHub.

The agent has been deployed at scale to accelerate the NVIDIA vulnerability triage process and demonstrate the real-world impact of agentic AI in security operations. NVIDIA analysts estimated time savings of 5 to 30 minutes per vulnerability. Since each analyst reviews more than 10 vulnerabilities per week on average, the time savings can easily add up to several hours per week. Analysts can use this time to focus on issues that are more difficult to diagnose and prioritize high-risk vulnerabilities. 

Beyond deployment: accuracy and efficiency

A successful deployment is just the beginning. For an agentic system to stay useful in production, it needs to maintain accuracy and efficiency as real-world workloads evolve.

Accuracy: Analyst annotations guide ongoing model improvement

An annotation tool helps improve accuracy (Figure 6). Analysts can review agent outputs, flag errors, and provide corrections. The tool captures whether a result is correct and why it might be right or wrong. This feedback loop helps with monitoring accuracy over time, identifying coverage gaps across vulnerability categories, and aligning LLM-as-a-judge outputs with human judgments. Continuously evaluating model performance helps ensure that the system stays accurate and steadily improves.

Screenshot of the UI of the annotation tool where analysts review and comment on vulnerability analysis results.
Figure 6. The annotation tool interface supports confirming exploitability status, validating justifications, and adding feedback to improve system accuracy

Efficiency: Profiling insights led to an 8.3x runtime improvement

For efficiency, the system was migrated to the Agent Intelligence toolkit, which offers built-in profiling and telemetry on execution time, token usage, tool invocation patterns, and more. This simplifies identifying and targeting performance bottlenecks. Figure 7 shows results from using the profiling insights to optimize execution time, with time in seconds on the x-axis and tool and function calls in execution order on the y-axis.

Side-by-side Gantt charts comparing system performance on 2 data points before and after optimization. The x-axis represents time in seconds, and the y-axis lists tool and function calls. The left chart shows longer and more staggered task durations (~48 seconds total), while the right chart shows more compact and parallelized execution (~29 seconds total). Agent Intelligence toolkit profiling insights enabled targeted optimizations that improve processing speed by 1.7x.
Figure 7. Agent Intelligence toolkit profiling results before and after optimization 

The Gantt charts created by the Agent Intelligence toolkit visualize the time taken for each step of the workflow, enabling the identification of synchronous steps deep within the agent tool call stack. Optimizing these led to a speedup that scaled with the input size, improving end-to-end latency by 1.3x on one data point, 1.7x on two data points (shown in Figure 7), and 8.3x on 46 data points. The runtime was reduced from 20 minutes to just 3 minutes.
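The kind of optimization described above, replacing synchronous steps with concurrent ones, can be illustrated with `asyncio`. This is a generic sketch of the pattern, not the toolkit's implementation; the tool names and timings are invented.

```python
# Independent, formerly synchronous tool calls are dispatched concurrently,
# so total latency approaches the slowest call instead of the sum.
import asyncio
import time

async def tool_call(name, seconds):
    await asyncio.sleep(seconds)   # stands in for an I/O-bound tool call
    return name

async def sequential():
    # One call at a time: latency is the sum of both.
    return [await tool_call("a", 0.05), await tool_call("b", 0.05)]

async def parallel():
    # Both calls in flight at once: latency is roughly the slower one.
    return await asyncio.gather(tool_call("a", 0.05), tool_call("b", 0.05))

start = time.perf_counter()
asyncio.run(sequential())
seq_time = time.perf_counter() - start

start = time.perf_counter()
results = asyncio.run(parallel())
par_time = time.perf_counter() - start
```

This is also why the observed speedup grew with input size: with more independent items in flight, the gap between "sum of all calls" and "slowest call" widens.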

Selecting the correct agentic structure for the problem

When designing agentic systems, one of the key challenges is finding the best architecture for the task. A good rule of thumb: make the system as complex as needed, but as simple as possible.

Consider alert triaging as an example. When the system is handling a single alert type with a well-defined investigation flow, a fixed execution path workflow works best (Figure 8a). LLM operations can be combined with programmatic steps into a custom, predetermined sequence. This system is simple, stable, efficient, and avoids unnecessary overhead.

When the system needs to support multiple alert types, each with its own (but still fixed) investigation path, adding a router becomes useful (Figure 8b). We can define a manageable set of execution paths, and let the router handle dispatching each alert to the appropriate one at runtime. This approach preserves the robustness and predictability of fixed logic paths, while introducing enough flexibility to scale across different alert types.
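The router pattern can be sketched as below. The dispatcher here is a plain dictionary lookup for clarity; in practice it could be a lightweight classifier or an LLM call. Path contents and alert names are hypothetical.

```python
# Sketch of the router design (Figure 8b): each alert type is dispatched
# to its own fixed, predefined investigation path.

def triage_instance_down(alert):
    return {"path": "instance_down", "steps": ["ping host", "check monitor"]}

def triage_high_cpu(alert):
    return {"path": "high_cpu", "steps": ["pull cpu metrics", "list processes"]}

ROUTES = {"InstanceDown": triage_instance_down, "HighCPU": triage_high_cpu}

def route(alert):
    handler = ROUTES.get(alert["alertname"])
    if handler is None:
        # Unknown alert types fall through to a human analyst.
        return {"path": "escalate", "steps": ["send to human analyst"]}
    return handler(alert)

result = route({"alertname": "InstanceDown"})
```

The explicit fallback branch is part of the pattern's robustness: an alert type the router has never seen takes a safe, predictable path instead of an improvised one.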

Figure comparing four designs for agentic systems. The top left shows a fixed-execution path LLM workflow that handles a single alert type through a linear sequence of LLM and programmatic operations. The top right shows fixed paths with routing, where different alert types are directed through separate but predefined sequences. The bottom left shows an adaptive agentic system in which an LLM agent dynamically selects among various tools based on the alert type. The bottom right shows a hybrid system that blends structured steps with a flexible LLM agent for a balance of consistency and adaptability.
Figure 8. Four different agentic system designs (clockwise from top left): fixed-execution path workflow, fixed-execution path workflow with routing, hybrid, and adaptive

When the logic is no longer fixed, the situation shifts. Agents are useful when the system has to handle too many alert types for predefined paths to be practical, or if the investigation flow for a single alert type depends heavily on context and data retrieved during execution (Figure 8c). Agents can reason through ambiguity and dynamically choose the right steps to take on the fly. The adaptiveness is powerful, but it comes with trade-offs, including higher token usage, added latency, and more effort required to tune.

For these reasons, hybrid designs are often used in practice (Figure 8d). In this structure, steps that are always required are implemented as deterministic logic outside the agent. The agent is then responsible only for the parts that require dynamic decision-making. This approach provides stability where possible, flexibility when needed, and better token efficiency overall.
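The hybrid structure can be sketched like this. The deterministic steps and the `agent_decide` stand-in (which would be an LLM agent in practice) are illustrative assumptions.

```python
# Sketch of a hybrid pipeline (Figure 8d): always-required steps are plain
# deterministic code; only the ambiguous middle step goes to an agent.

def enrich(alert):
    # Always required: deterministic enrichment, no tokens spent.
    return {**alert, "owner": "infra-team"}

def agent_decide(alert):
    # Dynamic part: stand-in for an LLM agent choosing the response action.
    return "isolate_host" if alert["severity"] == "critical" else "monitor"

def write_report(alert, action):
    # Always required: deterministic report generation.
    return f"{alert['alertname']}: recommended action is {action}"

def hybrid_pipeline(alert):
    alert = enrich(alert)
    action = agent_decide(alert)
    return write_report(alert, action)

report = hybrid_pipeline({"alertname": "InstanceDown", "severity": "critical"})
```

Keeping enrichment and reporting outside the agent means those steps never vary between runs and consume no LLM tokens, which is where the stability and efficiency gains come from.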

In many ways, selecting the correct agentic structure has become the hyperparameter tuning of the agentic world. It requires iteration, good instincts, and a deep understanding of the problem space. With the right structure, the system becomes both more effective and much easier to operate and maintain.

Evaluating a complex agentic system

Like any machine learning (ML) project, creating a good dataset is the foundation for success. For agentic systems, the approach is similar but with some important differences. 

Unlike traditional ML datasets, which typically focus on inputs and final outputs, agentic systems benefit from capturing expected intermediate steps along the reasoning path. These expected outputs enable trajectory evaluation, which involves analyzing the agent’s entire decision-making process rather than just its end result. This more detailed view helps surface where reasoning may break down or deviate from expectations. It’s also useful to track expected tool usage, including tool calls and their inputs, to better evaluate the agent’s planning and tool selection throughout the task.
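A minimal trajectory-evaluation scorer might compare the agent's actual tool-call sequence against the expected one recorded in the dataset. The greedy in-order matching and the tool names below are illustrative choices, not a standard metric.

```python
# Sketch of trajectory evaluation: score the agent's decision-making path
# (tool calls and inputs) against an expected trajectory, not just the
# final answer.

def trajectory_score(expected, actual):
    """Greedy in-order match: fraction of expected calls found in sequence."""
    i = 0
    for call in actual:
        if i < len(expected) and call == expected[i]:
            i += 1
    return i / len(expected)

expected = [("query_metrics", "host1"),
            ("check_network", "host1"),
            ("write_report", None)]

# Extra, unexpected calls are tolerated as long as the expected sequence appears.
perfect = trajectory_score(
    expected,
    [("query_metrics", "host1"), ("list_processes", "host1"),
     ("check_network", "host1"), ("write_report", None)],
)

# A skipped step blocks all later matches in this simple greedy version.
partial = trajectory_score(expected, [("query_metrics", "host1"),
                                      ("write_report", None)])
```

Even a crude score like this surfaces where a run deviated from the expected path, which a final-output-only comparison would hide.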

One characteristic that sets agentic systems apart from traditional ML applications is their generative nature. This means that large datasets aren’t required to begin experimentation. One effective principle is to avoid overcomplicating or overoptimizing early in development. Instead, focus on building a quick proof of concept and getting it in front of users. This is when meaningful data collection and iterative system tuning truly begin.

Recruiting reliable LLM judges

LLM-as-a-judge is becoming one of the fundamental approaches to evaluating LLM outputs and agentic systems, thanks to its ability to assess natural language outputs. The process involves passing system outputs to a language model and prompting it to score specific dimensions such as clarity, correctness, relevance, or groundedness. 

Before onboarding a new LLM judge, it’s important to collect a small set of human-labeled examples for calibration. Using these examples, it’s possible to align the LLM’s scoring behavior with human expectations by selecting the right model and engineering the prompt appropriately. Since LLMs will always return an answer, even when uncertain, it’s essential to ground their behavior in truth before relying on them for evaluation.
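The calibration step can be as simple as comparing judge scores against the human-labeled examples and gating on agreement. The scores, tolerance, and 0.9 threshold below are illustrative, not recommended values.

```python
# Sketch of onboarding an LLM judge: measure agreement with human labels
# on a small calibration set before trusting the judge in evaluation.
# Scores are toy values on a 1-5 scale.

human = [5, 4, 2, 5, 3, 1]
judge = [5, 4, 3, 5, 3, 2]

# Strict check: exact score match.
exact_agreement = sum(h == j for h, j in zip(human, judge)) / len(human)

# Looser check: judge within one point of the human label.
within_one = sum(abs(h - j) <= 1 for h, j in zip(human, judge)) / len(human)

# Simple onboarding gate: rely on the judge only once it tracks humans.
judge_is_calibrated = within_one >= 0.9
```

If the gate fails, the remedy is to change the judge model or its prompt and re-run calibration, not to lower the threshold.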

Once aligned, LLM judges make it easy to compare prompt variations, model versions, or structural changes. This accelerates iteration and supports long-term quality improvements. Notably, the Agent Intelligence toolkit provides built-in support for LLM-as-a-judge (RAGAS) evaluation, simplifying integration of this method into the development cycle.

The agentic future of cybersecurity

When it comes to what agentic AI can do in cybersecurity, alert management and vulnerability triage are just the beginning. These example use cases show how agentic systems can go beyond simple automation to take on complex, more context-dependent tasks that typically require human expertise. 

As agentic systems continue to mature, we believe they’ll become trusted assistants for analysts, streamlining investigations, connecting the dots, and handling the heavy lifting with ease. We’re excited to see how the community builds on this foundation and can’t wait to see the creative, impactful cybersecurity use cases you come up with. 

Explore how you can use the NVIDIA Agent Intelligence toolkit and experience agentic AI examples on build.nvidia.com. For the AI Blueprint for vulnerability analysis, explore the interactive demo or access tools and reference code for deployment.

To learn more about the alert triage use case and see it in action, watch the NVIDIA GTC 2025 session, Transform Cybersecurity With Agentic Blueprints, on demand. You can also register to join the upcoming NVIDIA Agent Toolkit Hackathon.
