How to use AI
Assistants with your
Monitoring system
Alexander Marshalov, 19.06.2025
victoriametrics.com
Simple, Reliable, Efficient Monitoring
Agenda
What is MCP and how it works
How to use VictoriaMetrics MCP Server
Features overview
About VictoriaLogs MCP Server
Integration with Cloud
Future plans
Usage scenarios and examples
victoriametrics.com
What is MCP and how it
works?
victoriametrics.com
Simple, Reliable, Efficient Monitoring
victoriametrics.com
MCP – Model Context Protocol
Open standard that describes a way for AI applications and agents to
connect to and work with your data sources and tools
Introduced by Anthropic in November 2024
What is MCP?
victoriametrics.com
Since MCP was introduced
Simple, Reliable, Efficient Monitoring
https://github.com/modelcontextprotocol/servers/
https://github.com/punkpeye/awesome-mcp-servers
What is MCP?
https://modelcontextprotocol.io
“Think of MCP like a USB-C port for AI applications. Just as USB-C provides a
standardized way to connect your devices to various peripherals and accessories,
MCP provides a standardized way to connect AI models to different data sources
and tools.”
Simple, Reliable, Efficient Monitoring victoriametrics.com
How MCP works?
User
query
Some software
with chat interface
Hey, tell me
something about
my metrics…
get all necessary data and documentation
Any web services
or storages
. . .
I'm glad you asked
6:94:20
?
Simple, Reliable, Efficient Monitoring victoriametrics.com
Some software
with chat interface
(MCP host app)
User
LLM
Any web services
or storages
. . .
query
How MCP works?
?
What is LLM?
LLM → Large Language Model
A type of artificial intelligence model trained on vast amounts of text data to
understand and generate human-like language. These models can perform
various language tasks such as writing, translation, summarization, and
question-answering by predicting the most likely next words or phrases based on
patterns learned during training.
Simple, Reliable, Efficient Monitoring victoriametrics.com
Simple, Reliable, Efficient Monitoring victoriametrics.com
Some software
with chat interface
(MCP host app)
User
LLM
Any web services
or storages
. . .
query
How MCP works?
Tricky
integrations,
specific for
client
Tricky
integrations,
specific for
client
Tricky
integrations,
specific for client,
service, LLM
M x N
problem
Simple, Reliable, Efficient Monitoring victoriametrics.com
Some software
with chat interface
User
query
How MCP works?
Any web services
or storages
. . .
MCP Server 1
MCP Server 2
MCP Server 3
LLM
API
calls
API
calls
API
calls
Simple, Reliable, Efficient Monitoring victoriametrics.com
Some software
with chat interface
User
LLM
Any web services
or storages
. . .
query
API
calls
How MCP works?
unified
API MCP Servers
MCP Servers
MCP Servers
Simple, Reliable, Efficient Monitoring victoriametrics.com
Some software
with chat interface
(MCP host app)
User
LLM
MCP Client MCP Servers
MCP Servers
MCP Servers
Any web services
or storages
. . .
query
API
calls
How MCP works?
MCP
protocol
Simple, Reliable, Efficient Monitoring victoriametrics.com
Some software
with chat interface
(MCP host app)
User
LLM
MCP Client
Any web services
or storages
. . .
query
API
calls
Tools - some executable functionality for clients
How MCP works?
Prompts - reusable prompt templates
Resources - some data to provide context for LLM
MCP Servers declare and provide entities:
API Endpoints
(can be generated based on your OpenAPI Spec)
Documentation,
specifications,
etc…
Common
recipes or
scenarios
that work with this MCP server
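For example, a tool declaration returned by an MCP server in response to a tools/list request looks roughly like this (a simplified sketch of the MCP format; the tool name and schema below are invented for illustration):
{
  "name": "get_weather",
  "description": "Returns the current weather for a given city",
  "inputSchema": {
    "type": "object",
    "properties": { "city": { "type": "string" } },
    "required": ["city"]
  }
}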
You can create OpenAPI based MCP Servers
Simple, Reliable, Efficient Monitoring victoriametrics.com
https://jedisct1.github.io/openapi-mcp/
https://github.com/higress-group/openapi-to-mcpserver
https://github.com/harsha-iiiv/openapi-mcp-generator
https://github.com/ckanthony/openapi-mcp
https://mcp-link.vercel.app/
https://github.com/hannesj/mcp-openapi-schema
https://github.com/janwilmake/openapi-mcp-server
https://github.com/baryhuang/mcp-server-any-openapi
Simple, Reliable, Efficient Monitoring victoriametrics.com
User
Some software
with chat interface
(MCP host app)
Any web services
or storages
MCP Servers
MCP Servers
MCP Servers
. . .
API
calls
LLM
MCP Client
How MCP works?
query
MCP
protocol
Here is the query (prompt) and
a list of the available tools
with their descriptions.
Based on this data, which of
the tools should I use to
answer the query?
Here is the query (prompt) and
the description of the tool and
its parameters.
Based on this data, what
param values should I use
for this tool?
Here is the query (prompt) and
extra context from the tool.
Based on this data, what
should I answer next?
Simple, Reliable, Efficient Monitoring victoriametrics.com
User
MCP Servers
MCP Servers
MCP Servers
Some software
with chat interface
(MCP host app)
Any web services
or storages
. . .
API
calls
LLM
MCP Client
How MCP works?
MCP
protocol
query
AI agent
Simple, Reliable, Efficient Monitoring victoriametrics.com
User
Some software
with chat interface
(MCP host app)
AI agent
LLM
How MCP works?
Autonomous software tools
Use AI to pursue goals and complete tasks
Show reasoning, planning and have a level of
autonomy to make decisions
Can work with other agents to coordinate and
perform more complex workflows
VictoriaMetrics
MCP Server
victoriametrics.com
Simple, Reliable, Efficient Monitoring
mcp-victoriametrics
Simple, Reliable, Efficient Monitoring victoriametrics.com
https://github.com/VictoriaMetrics-Community/mcp-victoriametrics
victoriametrics.com
How to install with docker
▶ docker pull ghcr.io/victoriametrics-community/mcp-victoriametrics:latest
Simple, Reliable, Efficient Monitoring
victoriametrics.com
How to install with go
▶ go install github.com/VictoriaMetrics-Community/mcp-victoriametrics/cmd/mcp-victoriametrics@latest
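Once installed (and assuming $GOBIN is on your PATH), a quick way to point the server at a VictoriaMetrics instance is via environment variables; the playground instance below is the one used later in this deck:
▶ VM_INSTANCE_ENTRYPOINT="https://play.victoriametrics.com" VM_INSTANCE_TYPE="cluster" mcp-victoriametrics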
Simple, Reliable, Efficient Monitoring
victoriametrics.com
How to install from source code
▶ git clone https://github.com/VictoriaMetrics-Community/mcp-victoriametrics.git
▶ cd mcp-victoriametrics
▶ make build # if you want to build with local tooling
▶ docker build -t mcp-victoriametrics . # if you want to build with docker
Simple, Reliable, Efficient Monitoring
victoriametrics.com
How to install using binaries or other methods
https://github.com/VictoriaMetrics-Community/mcp-victoriametrics
Simple, Reliable, Efficient Monitoring
victoriametrics.com
Awesome MCP clients list
Simple, Reliable, Efficient Monitoring
https://github.com/punkpeye/awesome-mcp-clients
https://www.pulsemcp.com/clients
victoriametrics.com
Which clients are supported?
Claude
VS Code
Zed
JetBrains IDEs
Windsurf
Cline
Goose
Cursor
RooCode
ChatGPT (in beta)
Continue
LibreChat
Warp
Simple, Reliable, Efficient Monitoring
Simple, Reliable, Efficient Monitoring victoriametrics.com
How to set up in your client? (stdio mode)
Name of mcp-server: just a human-readable name you'll see in the UI
Command to run: the server binary, or a tool that runs it (e.g. docker)
Arguments (args): command-line arguments for the binary
Environment variables (env): parameters of the MCP server
Simple, Reliable, Efficient Monitoring victoriametrics.com
How to set up in your client?
{
  "mcpServers": {
    "VictoriaMetrics": {                              ← the name of MCP Server
      "command": "/path/to/mcp-victoriametrics",      ← the command to run
      "env": {                                        ← parameters in form of environment variables
        "VM_INSTANCE_ENTRYPOINT": "<YOUR_VM_INSTANCE>",
        "VM_INSTANCE_TYPE": "<YOUR_VM_INSTANCE_TYPE>"
      }
    }
  }
}
Claude → claude_desktop_config.json
Simple, Reliable, Efficient Monitoring victoriametrics.com
How to set up in your client?
{
  "servers": {
    "VictoriaMetrics": {
      "type": "stdio",
      "command": "/path/to/mcp-victoriametrics",
      "env": {
        "VM_INSTANCE_ENTRYPOINT": "<YOUR_VM_INSTANCE>",
        "VM_INSTANCE_TYPE": "<YOUR_VM_INSTANCE_TYPE>"
      }
    }
  }
}
Visual Studio Code → settings.json → “mcp”
Simple, Reliable, Efficient Monitoring victoriametrics.com
How to set up in your client?
{
  "mcpServers": {
    "VictoriaMetrics": {
      "command": "/path/to/mcp-victoriametrics",
      "env": {
        "VM_INSTANCE_ENTRYPOINT": "<YOUR_VM_INSTANCE>",
        "VM_INSTANCE_TYPE": "<YOUR_VM_INSTANCE_TYPE>"
      }
    }
  }
}
Cursor → mcp.json
Simple, Reliable, Efficient Monitoring victoriametrics.com
How to set up in your client?
"context_servers": {
  "VictoriaMetrics": {
    "command": {
      "path": "/path/to/mcp-victoriametrics",
      "env": {
        "VM_INSTANCE_ENTRYPOINT": "<YOUR_VM_INSTANCE>",
        "VM_INSTANCE_TYPE": "<YOUR_VM_INSTANCE_TYPE>"
      }
    }
  }
}
Zed → settings.json
Simple, Reliable, Efficient Monitoring victoriametrics.com
How to set up in your client?
{
  "mcpServers": {
    "VictoriaMetrics": {
      "command": "/path/to/mcp-victoriametrics",
      "env": {
        "VM_INSTANCE_ENTRYPOINT": "<YOUR_VM_INSTANCE>",
        "VM_INSTANCE_TYPE": "<YOUR_VM_INSTANCE_TYPE>"
      }
    }
  }
}
JetBrains IDE → settings
Simple, Reliable, Efficient Monitoring victoriametrics.com
How to set up in your client?
{
  "mcpServers": {
    "victoriametrics": {
      "command": "docker",
      "args": [
        "run", "-i", "--rm",
        "-e", "VM_INSTANCE_ENTRYPOINT",
        "-e", "VM_INSTANCE_TYPE",
        "ghcr.io/victoriametrics-community/mcp-victoriametrics:latest"
      ],
      "env": {
        "VM_INSTANCE_ENTRYPOINT": "https://play.victoriametrics.com",
        "VM_INSTANCE_TYPE": "cluster"
      }
    }
  }
}
Claude → claude_desktop_config.json
victoriametrics.com
Tools in
your client
Simple, Reliable, Efficient Monitoring
mcp-victoriametrics configuration
Simple, Reliable, Efficient Monitoring victoriametrics.com
https://github.com/VictoriaMetrics-Community/mcp-victoriametrics
Simple, Reliable, Efficient Monitoring victoriametrics.com
How to set up in your client?
MCP Client MCP Servers
MCP Servers
MCP Servers
?
STDIO
Simple, Reliable, Efficient Monitoring victoriametrics.com
STDIO
The stdio transport enables communication through standard input and output streams. This is particularly useful for local integrations and command-line tools.
● The most common transport
● Simple, secure, no network
● Only local usage, no scalability
● Perfect for CLIs, desktop apps and local integrations
● The fastest (~100μs)
Streamable HTTP
The Streamable HTTP transport uses HTTP POST requests for client-to-server communication and optional Server-Sent Events (SSE) streams for server-to-client communication.
● Good for web services, APIs
● Multi-tenancy, concurrency
● Scalability
● Slower than the others: ~1-10ms (network + HTTP overhead)
● Adoption is still limited
SSE
The legacy SSE transport enabled server-to-client streaming with HTTP POST requests for client-to-server communication.
● Deprecated, replaced with Streamable HTTP
● But still supported, as it is the only way to connect some clients
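When the server is exposed over Streamable HTTP or SSE instead of stdio, the client is configured with a URL rather than a command. A rough sketch in the VS Code format (host, port and path are placeholders; check the mcp-victoriametrics README for the actual serve mode and endpoint):
{
  "servers": {
    "VictoriaMetrics": {
      "type": "sse",
      "url": "http://<mcp-server-host>:<port>/sse"
    }
  }
}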
Features overview
victoriametrics.com
Simple, Reliable, Efficient Monitoring
Simple, Reliable, Efficient Monitoring victoriametrics.com
Available tools
rules / alerts / test_rules
active_queries / top_queries
query / query_range
metrics_relabel_debug / downsampling_filters_debug / retention_filters_debug
explain_query / prettify_query
metrics / labels / label_values / series
tenants (for cluster)
documentation
tsdb_status (cardinality explorer) / metric_statistic
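On the wire, calling one of these tools is an MCP tools/call JSON-RPC request from the client. A simplified sketch for the query tool (the argument name is illustrative, not necessarily the server's exact input schema):
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "query",
    "arguments": { "query": "up" }
  }
}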
Usage scenarios
and examples
victoriametrics.com
Simple, Reliable, Efficient Monitoring
Simple, Reliable, Efficient Monitoring victoriametrics.com
Which tenants do we have?
I'll check which tenants are available in your VictoriaMetrics instance.
Your VictoriaMetrics instance has 4 tenants:
● 0:0 (default tenant)
● 100:0
● 10:0
● 1:0
The tenant format follows the pattern accountID:projectID. The tenant "0:0" is typically the default tenant used when
no specific tenant is specified in queries.
tenants
from now on, let's work with default tenant
Understood! I'll use the default tenant "0" (which corresponds to "0:0") for all subsequent queries and operations. This
is actually the default behavior, so all VictoriaMetrics API calls will automatically use the default tenant unless you
specify otherwise.
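The tenants tool presumably maps onto the cluster tenants API, which can also be queried directly from vmselect (address is a placeholder):
▶ curl http://<vmselect>:8481/admin/tenants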
Simple, Reliable, Efficient Monitoring victoriametrics.com
What metrics do we have?
I'll get the list of available metrics from your VictoriaMetrics instance.
Your VictoriaMetrics instance has a comprehensive set of metrics across various categories:
● Infrastructure & System Metrics:
○ Kubernetes components: Extensive metrics from API server (apiserver_*), kubelet (kubelet_*), controller
runtime (controller_runtime_*)
○ Node metrics: CPU, memory, disk, network statistics (node_*)
○ Container metrics: Resource usage, filesystem, network (container_*)
● Monitoring & Observability:
○ Prometheus ecosystem: Alertmanager (alertmanager_*), kube-state-metrics (kube_*)
○ VictoriaMetrics components: VM storage (vm_*), VMAgent (vmagent_*), VMAlert (vmalert_*), VMAuth
(vmauth_*)
○ VictoriaLogs: (vl_*) for log management
○ Vector log collector: (vector_*)
○ ArgoCD: Application and GitOps metrics (argocd_*)
● Application & Custom Metrics:
○ Go runtime: Garbage collection, memory, goroutines (go_*)
○ HTTP & networking: Request durations, connection stats (grpc_*, rest_client_*)
○ Custom applications: Various business metrics like all_visits_count, unique_visits_count, github_*, docker_*
You have 808 unique metrics covering infrastructure monitoring, application performance, and business analytics. This
suggests a robust monitoring setup with full observability across your stack.
Would you like me to analyze any specific metric category or help you explore particular aspects of your monitoring data?
metrics
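The same list of metric names is available from the Prometheus-compatible API; for a cluster setup and tenant 0 it would look roughly like this (single-node installations drop the /select/0/prometheus prefix and use port 8428):
▶ curl "http://<vmselect>:8481/select/0/prometheus/api/v1/label/__name__/values"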
Simple, Reliable, Efficient Monitoring victoriametrics.com
Show me argcd metrics
I'll get the ArgoCD metrics from your VictoriaMetrics instance.
Perfect! Here are all the ArgoCD metrics available in your VictoriaMetrics instance:
ArgoCD Metrics Categories
Application Management
● argocd_app_info - Information about ArgoCD applications
● argocd_app_orphaned_resources_count - Count of orphaned resources per application
● argocd_app_reconcile_bucket/count/sum - Application reconciliation performance metrics
Cluster Management
● argocd_cluster_api_resource_objects - Number of API resource objects in clusters
● argocd_cluster_api_resources - Count of API resources in clusters
● argocd_cluster_cache_age_seconds - Age of cluster cache data
● argocd_cluster_connection_status - Connection status to managed clusters
● argocd_cluster_events_total - Total cluster events processed
● argocd_cluster_info - Information about managed clusters
Git Operations
● argocd_git_request_duration_seconds_* - Git request latency metrics (histogram)
● argocd_git_request_total - Total Git requests made
. . .
metrics
Simple, Reliable, Efficient Monitoring victoriametrics.com
What labels does the argocd_app_info have?
Kubectl Operations
● argocd_kubectl_rate_limiter_duration_seconds_* - Rate limiting duration metrics
● argocd_kubectl_request_duration_seconds_* - Kubectl request performance
● argocd_kubectl_request_size_bytes_* - Size of kubectl requests
● argocd_kubectl_requests_total - Total kubectl requests
● argocd_kubectl_response_size_bytes_* - Size of kubectl responses
Redis Operations
● argocd_redis_request_duration_* - Redis request latency metrics
● argocd_redis_request_total - Total Redis requests
Repository Management
● argocd_repo_pending_request_total - Pending repository requests
Resource Processing
● argocd_resource_events_processed_in_batch - Git request latency metrics (histogram)
● argocd_resource_events_processing_* - Resource event processing performance
General Info
● argocd_info - General ArgoCD instance information
Would you like me to query specific ArgoCD metrics to see their current values or analyze particular aspects of your
ArgoCD deployment performance?
Simple, Reliable, Efficient Monitoring victoriametrics.com
Simple, Reliable, Efficient Monitoring victoriametrics.com
What labels does the argocd_app_info have?
I'll query the series for argocd_app_info to see what labels are available.
Based on the series data, the argocd_app_info metric has the following labels:
ArgoCD-Specific Labels
● name - The name of the ArgoCD application (e.g., "grafana", "monitoring", "argocd")
● autosync_enabled - Whether auto-sync is enabled ("true"/"false")
● dest_namespace - The destination namespace where the app is deployed
● health_status - Application health status (e.g., "Healthy")
● project - ArgoCD project name (e.g., "default")
● repo - Git repository URL or Helm chart repository
● sync_status - Synchronization status (e.g., "Synced")
Standard Kubernetes/Prometheus Labels
● namespace - Kubernetes namespace where ArgoCD is running ("argocd")
● cluster - Cluster identifier ("sandbox")
● service - Kubernetes service name ("argocd-metrics")
● job - Prometheus job name ("argocd-metrics")
● instance - Instance endpoint ("10.71.10.3:8082")
● pod - Pod name ("argocd-application-controller-0")
● container - Container name ("argocd-application-controller")
● endpoint - Metrics endpoint ("metrics")
series
. . .
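For reference, the series tool corresponds to the /api/v1/series endpoint, which can be called directly with a match[] selector (cluster-style path shown; adjust for single-node):
▶ curl "http://<vmselect>:8481/select/0/prometheus/api/v1/series?match[]=argocd_app_info"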
Simple, Reliable, Efficient Monitoring victoriametrics.com
What values i have for label name of this metric?
● sync_status - Synchronization status (e.g., "Synced")
Standard Kubernetes/Prometheus Labels
● namespace - Kubernetes namespace where ArgoCD is running ("argocd")
● cluster - Cluster identifier ("sandbox")
● service - Kubernetes service name ("argocd-metrics")
● job - Prometheus job name ("argocd-metrics")
● instance - Instance endpoint ("10.71.10.3:8082")
● pod - Pod name ("argocd-application-controller-0")
● container - Container name ("argocd-application-controller")
● endpoint - Metrics endpoint ("metrics")
● exported_namespace - Original namespace label ("argocd")
● prometheus - Prometheus instance ("monitoring/monitoring-vmagent")
Your Current Applications
I can see you have 24 ArgoCD applications deployed, all with:
● ✅ Auto-sync enabled
● ✅ Healthy status
● ✅ Synced state
Some notable applications include monitoring tools (Grafana, Prometheus), VictoriaMetrics components, and various
operational tools.
Would you like me to query specific applications or analyze the health/sync status across your ArgoCD deployment?
Simple, Reliable, Efficient Monitoring victoriametrics.com
What values i have for label name of this metric?
I'll get the distinct values for the name label of the argocd_app_info metric.
Here are all the values for the name label of the argocd_app_info metric, representing your deployed ArgoCD applications:
Your ArgoCD Applications (24 total)
Monitoring & Observability
● analytic-exporter
● grafana
● grafana-dashboards
● kube-state-metrics
● monitoring
● node-exporter
● play-grafana
● play-grafana-dashboards
VictoriaMetrics Ecosystem
● victoria-metrics-operator
● vmlicense
● vmlogs-single
● vmlogs-slack2logs
● benchmark-vm-cluster
MCP (Model Context Protocol) Services
● mcp-victorialogs
label_values
. . .
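Equivalently, the label_values tool maps to the /api/v1/label/<label>/values endpoint; here for the name label, filtered to the argocd_app_info series:
▶ curl "http://<vmselect>:8481/select/0/prometheus/api/v1/label/name/values?match[]=argocd_app_info"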
Simple, Reliable, Efficient Monitoring victoriametrics.com
show me current alerts
I'll check the current alerts in your VictoriaMetrics instance.
Alert Summary
You have many firing alerts and several pending alerts currently active in your sandbox cluster.
Critical Alerts (Firing) 🔴
HostFilesystemDeviceError - Multiple instances
● Severity: Critical
● Status: Firing (since May 26, 2025)
● Issue: Permission denied errors on filesystem devices
● Affected instances: Multiple nodes (10.142.0.x:9100)
● Devices affected:
○ tmpfs filesystems (Kubernetes projected volumes, secrets, config maps)
○ /dev/sdb, /dev/sdc persistent disk mount points
○ Various Kubernetes volume mount paths
HostCpuIsUnderutilized - Some instances
● Severity: Info
● Status: Firing
● Issue: CPU utilization < 20% for extended periods
● Recommendation: Consider reducing CPU allocation
Pending Alerts ⏳
HostCpuIsUnderutilized - Multiple instances
alerts
. . .
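The alerts tool is presumably backed by vmalert, whose Prometheus-compatible API can also be queried directly (default vmalert port shown):
▶ curl http://<vmalert>:8880/api/v1/alerts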
Simple, Reliable, Efficient Monitoring victoriametrics.com
show me the graph for the queries for these alerts
○ Various Kubernetes volume mount paths
HostCpuIsUnderutilized - Some instances
● Severity: Info
● Status: Firing
● Issue: CPU utilization < 20% for extended periods
● Recommendation: Consider reducing CPU allocation
Pending Alerts ⏳
HostCpuIsUnderutilized - Multiple instances
● Severity: Info
● Status: Firing
● Issue: CPU utilization < 20% for 1 week
● Recommendation: Various nodes showing underutilization
Key Issues to Address
1. Permission Issues: The filesystem device errors suggest there are permission problems accessing mounted volumes
on your Kubernetes nodes. This could impact application functionality.
2. Resource Optimization: Multiple CPU underutilization alerts suggest you may be able to optimize your cluster
resource allocation.
3. Cluster Health: The large number of filesystem permission errors across multiple nodes indicates a systemic issue
that should be investigated.
Most of these alerts are from your sandbox-gke environment in the sandbox cluster. The filesystem permission errors in
particular warrant immediate attention as they could affect application performance and stability.
Simple, Reliable, Efficient Monitoring victoriametrics.com
what alerting rules i have?
I'll check what alerting rules are configured in your VictoriaMetrics instance.
Great! Now I can see your alerting rule groups. You have quite a comprehensive alerting setup with 16 different rule
groups. Here's what I found:
Your Alerting Rule Groups
You have 16 alerting rule groups configured across several files:
Rule Groups by Category:
Infrastructure & Host Monitoring:
● node-exporter - Host-level monitoring alerts
● container-events - Container event monitoring
● kubernetes-apps - Kubernetes application alerts
● kubernetes-storage - Kubernetes storage alerts
VictoriaMetrics Monitoring:
● vm-health - VictoriaMetrics health monitoring
● vm-license - VictoriaMetrics license monitoring
● vmagent - VMAgent monitoring alerts
● vmalert - VMAlert monitoring alerts
● vmcluster - VMCluster monitoring alerts
Application Monitoring:
● ArgoCD - ArgoCD application alerts
● analitycs-exporter - Analytics exporter monitoring
rules
. . .
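Likewise, the rules tool presumably corresponds to vmalert's rules API:
▶ curl http://<vmalert>:8880/api/v1/rules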
Simple, Reliable, Efficient Monitoring victoriametrics.com
Show me 3 slowest queries
I'll help you find the 3 slowest queries by checking the top queries with the biggest average execution duration.
Here are the 3 slowest queries by average execution duration in your VictoriaMetrics instance:
Top 3 Slowest Queries
1. Filesystem Space Prediction (4h) - 0.249 seconds
2. Filesystem Inodes Prediction (24h) - 0.206 seconds
3. Filesystem Inodes Prediction (4h) - 0.194 seconds
top_queries
( node_filesystem_avail_bytes{job="node-exporter",fstype!=""} /
node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 15 and
predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0 and
node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 )
( node_filesystem_files_free{job="node-exporter",fstype!=""} /
node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 40 and
predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0 and
node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 )
( node_filesystem_files_free{job="node-exporter",fstype!=""} /
node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 20 and
predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0 and
node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 )
. . .
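The top_queries tool corresponds to VictoriaMetrics' query statistics endpoint, which accepts topN and maxLifetime parameters (cluster-style path shown; adjust for single-node):
▶ curl "http://<vmselect>:8481/select/0/prometheus/api/v1/status/top_queries?topN=3&maxLifetime=30m"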
Simple, Reliable, Efficient Monitoring victoriametrics.com
Key Observations
All of your slowest queries are related to filesystem monitoring and use predict_linear() functions, which are
computationally expensive as they perform linear regression calculations over time series data. These queries are likely
part of alerting rules for predicting when filesystems will run out of space or inodes.
The slowest query takes about 249ms on average, which suggests your VictoriaMetrics instance is performing well overall
with relatively fast query execution times.
Let's work with the first one. Please prettify it
I'll prettify the slowest query for you using VictoriaMetrics' query formatting endpoint.
Here's the prettified version of your slowest query:
prettify_query
(
(
(
(
node_filesystem_avail_bytes{job="node-exporter",fstype!=""}
/
node_filesystem_size_bytes{job="node-exporter",fstype!=""}
)
*
100
)
<
15
)
and
(
predict_linear(
node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h],
(4 * 60) * 60
)
<
0
)
)
and
(node_filesystem_readonly{job="node-exporter",fstype!=""} == 0)
promql
. . .
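The prettify_query tool wraps the query formatting endpoint mentioned above; a direct call against a single-node instance would look roughly like this (query shortened for readability):
▶ curl -G "http://<victoriametrics>:8428/prettify-query" --data-urlencode 'query=(node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 15) and (node_filesystem_readonly == 0)'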
Simple, Reliable, Efficient Monitoring victoriametrics.com
The prettified version makes the query structure much clearer:
1. Main condition: Filesystem usage > 85% (less than 15% available)
2. Prediction condition: Linear prediction suggests the filesystem will be full within 4 hours (14,400 seconds)
3. Safety condition: Filesystem is not read-only
This query is designed to alert when a filesystem is both critically low on space AND predicted to run out completely
within 4 hours based on the current trend.
now explain me this query
What This Query Does
This query identifies filesystems that are critically low on disk space AND predicted to run out completely within 4
hours. It's a proactive alerting rule for preventing disk space exhaustion.
Step-by-Step Breakdown
1. Current Usage Check (First Condition)
What it does:
● node_filesystem_avail_bytes: Available space in bytes (space non-root users can use)
● node_filesystem_size_bytes: Total filesystem size in bytes
● Division: Calculates the ratio of available space to total space
● Multiply by 100: Converts to percentage
● < 15: Finds filesystems with less than 15% free space
Filters applied:
● job="node-exporter": Only data from Node Exporter
● fstype!="": Excludes entries without filesystem type (filters out pseudo-filesystems)
2. Future Prediction Check (Second Condition)
What it does:
explain_query
(node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""}) * 100 < 15 promql
predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], (4 * 60) * 60) < 0 promql
. . .
● predict_linear(): Uses linear regression to predict future values
● [6h]: Analyzes the last 6 hours of data to establish the trend
● (4 * 60) * 60 = 14,400 seconds: Predicts the value 4 hours into the future
● < 0: Checks if the predicted available space will be negative (completely full)
How prediction works:
● Looks at how disk space has been decreasing over the past 6 hours
● Extrapolates this trend 4 hours into the future
● If the trend suggests 0 or negative bytes available, it triggers
3. Read-Only Filter (Third Condition)
What it does:
● node_filesystem_readonly: Indicates if filesystem is mounted read-only (1) or read-write (0)
● == 0: Only includes read-write filesystems
● Why this matters: No point alerting on read-only filesystems since you can't free up space on them
predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], (4 * 60) * 60) < 0 promql
. . .
node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 promql
Combined Logic
The query uses AND operators, so ALL three conditions must be true:
Why This Query is Slow
1. predict_linear() function: Computationally expensive as it performs linear regression analysis
2. 6-hour lookback window: Must analyze 6 hours of historical data for each filesystem
3. Multiple metric queries: Three different metrics with filtering
4. Label matching: Complex filtering across multiple label dimensions
Practical Use Case
This would typically fire alerts like:
● "Filesystem /var/log on server-01 has only 12% free space and is predicted to be completely full in 4 hours based on
current usage trends"
This gives administrators time to take preventive action before the disk actually fills up, which is much better than alerting
only when the disk is already 95% full.
node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 promql
(Current free space < 15%)
AND
(Predicted to be full in 4 hours)
AND
(Filesystem is writable)
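This expression closely resembles the NodeFilesystemSpaceFillingUp alert from the node-exporter mixin. A minimal sketch of how such a query could be wrapped into an alerting rule file for vmalert or Prometheus (the alert name, "for" duration, severity and annotations below are illustrative, not taken from the demo):

groups:
  - name: node-exporter
    rules:
      # Illustrative alert wrapping the query explained above
      - alert: NodeFilesystemSpaceFillingUp
        expr: |
          (
            node_filesystem_avail_bytes{job="node-exporter",fstype!=""}
              / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 15
          )
          and
          (
            predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4 * 60 * 60) < 0
          )
          and
          (node_filesystem_readonly{job="node-exporter",fstype!=""} == 0)
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Filesystem on {{ $labels.instance }} is predicted to run out of space within 4 hours"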
{
"functions_info": {
"predict_linear": {
"name": "predict_linear",
"description": "`predict_linear(series_selector[d], t)` is a [rollup function](#rollup-functions), which calculates the value `t` seconds
in the future using\nlinear interpolation over [raw samples](https://docs.victoriametrics.com/victoriametrics/keyconcepts/#raw-samples) on the
given lookbehind window `d`.\nThe predicted value is calculated individually per each time series returned from the given
[series_selector](https://docs.victoriametrics.com/victoriametrics/keyconcepts/#filtering).\nThis function is supported by PromQL.\nSee also
[range_linear_regression](#range_linear_regression).",
"category": "Rollup functions"
}
},
"metrics_info": {
"node_filesystem_avail_bytes": {
"group": "node-exporter",
"name": "node_filesystem_avail_bytes",
"help": "Filesystem space available to non-root users in bytes.",
"type": "gauge",
"labels": ["device", "device_error", "fstype", "mountpoint"]
},
"node_filesystem_readonly": { "group": "node-exporter", "name": "node_filesystem_readonly", "help": "Filesystem read-only status.", "type":
"gauge", "labels": ["device", "device_error", "fstype", "mountpoint"] },
"node_filesystem_size_bytes": { "group": "node-exporter", "name": "node_filesystem_size_bytes", "help": "Filesystem size in bytes.",
"type": "gauge", "labels": ["device", "device_error", "fstype", "mountpoint"] }
},
"syntax_tree": {
"bool": false,
"group_modifier": { "args": null, "op": "", "type": "ModifierExpr" },
"join_modifier": { "args": null, "op": "", "type": "ModifierExpr" },
"join_modifier_prefix": null,
"keep_metric_name": false,
"left": {
"bool": false,
"group_modifier": { "args": null, "op": "", "type": "ModifierExpr" },
"join_modifier": { "args": null, "op": "", "type": "ModifierExpr" },
"join_modifier_prefix": null,
explain_query → response.json
"types_info": {
"BinaryOpExpr": {
"name": "Binary operation",
"description": "BinaryOpExpr represents binary operation such as `+`, `-`, `*`, `/`, etc.",
"fields": {
"bool": { "name": "", "description": "Bool indicates whether `bool` modifier is present. For example, `foo u003ebool bar`",
"data_type": "bool" },
"group_modifier": { "name": "", "description": "GroupModifier contains modifier such as "on" or "ignoring".", "data_type":
"ModifierExpr" },
"join_modifier": { "name": "", "description": "JoinModifier contains modifier such as "group_left" or "group_right".", "data_type":
"ModifierExpr" },
"join_modifier_prefix": {
"name": "",
"description": "JoinModifierPrefix is an optional prefix to add to labels specified inside group_left() or group_right() lists. The
syntax is `group_left(foo,bar) prefix "abc"`",
"data_type": "StringExpr"
},
"keep_metric_names": { "name": "", "description": "If KeepMetricNames is set to true, then the operation should keep metric names.",
"data_type": "bool" },
"left": { "name": "", "description": "Left contains left arg for the `left op right` expression.", "data_type": "Expr" },
"op": { "name": "", "description": "Op is the operation itself, i.e. `+`, `-`, `*`, etc.", "data_type": "string" },
"right": { "name": "", "description": "Right contains right arg for the `left op right` expression.", "data_type": "Expr" }
}
},
"DurationExpr": {
"name": "Duration",
"description": "DurationExpr represents a duration, e.g. `5m`, `1h`. Supported suffixes are `s` (seconds), `m` (minutes), `h` (hours),
`d` (days), `w` (weeks), and `y` (years).",
"fields": { "value": { "name": "", "description": "Value is the duration value as a string, e.g. `5m`, `1h`.", "data_type": "string" } }
},
"FuncExpr": {
"name": "Function",
"description": "uncExpr represents MetricsQL function such as `foo(...)`",
"fields": {
"args": { "name": "", "description": "Arguments of the function, which can be other expressions.", "data_type": "array of Expr" },
"keep_metric_names": { "name": "", "description": "If KeepMetricNames is set to true, then the function should keep metric names.",
"data_type": "bool" },
explain_query → response.json
Simple, Reliable, Efficient Monitoring victoriametrics.com
please trace this query
Simple, Reliable, Efficient Monitoring victoriametrics.com
I'll trace the execution of this query to show you exactly how VictoriaMetrics processes it step by step.
query
Simple, Reliable, Efficient Monitoring victoriametrics.com
Here's a detailed trace analysis of your query execution:
Query Execution Summary
● Total Time: 217.9ms
● Series Fetched: 3,309 time series
● Result: 0 matching filesystems (none currently meet all criteria)
● Data Scanned: Over 2.1 million samples
Execution Breakdown
1. Read-Only Filesystem Check (21ms)
● Time: 21ms
● Series Found: 1,110 filesystems
● Purpose: Filter out read-only filesystems
● Result: Found 1,110 writable filesystems
2. Current Usage Percentage (75ms)
query
node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 promql
(node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 15 promql
. . .
Simple, Reliable, Efficient Monitoring victoriametrics.com
● Time: 75ms
● Series Processed: 733 filesystems
● Data Fetched: 67,816 samples
● Storage Calls: Parallel calls to 2 storage nodes
● Result: Found 733 filesystems with <15% free space
3. Prediction Analysis (113ms) - The Bottleneck
● Time: 113ms (52% of total execution time)
● Series Processed: 733 filesystems
● Lookback Window: 6 hours (21,600 seconds)
● Data Analyzed: 1,244,406 samples over 6 hours
● Storage Impact: Fetched 9,913 blocks from storage
● Linear Regression: Performed on each of 733 time series
● Result: None predicted to be full in 4 hours
Why This Query is Slow
Primary Bottleneck: predict_linear() Function
● 94ms spent on linear regression calculations
● 1.2M+ samples analyzed for trend calculation
● 6-hour lookback requires significant historical data
● No optimization available for predict_linear() function
(node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 15 promql
. . .
predict_linear(node_filesystem_avail_bytes[6h], 14400) < 0 promql
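For reference, the same execution trace can be requested directly over HTTP: VictoriaMetrics includes it in the response when the trace=1 query argument is passed to /api/v1/query or /api/v1/query_range (the host below is an assumption; adjust it to your single-node or vmselect URL):

# Ask VictoriaMetrics to include the execution trace in the response
curl -s 'http://localhost:8428/api/v1/query' \
  --data-urlencode 'query=predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4 * 60 * 60) < 0' \
  --data-urlencode 'trace=1' | jq '.trace'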
Simple, Reliable, Efficient Monitoring victoriametrics.com
● Efficient label filtering - used cached metricIDs
● Parallel processing - storage nodes worked simultaneously
● Memory efficient - only 744KB RAM needed for calculations
Performance Metrics
● Samples Scanned: 2,111,040 total
● Blocks Fetched: 25,635 blocks
● Network Efficiency: Good - parallel storage access
● Cache Hit Rate: High - used tag filters cache effectively
Final Result
Currently, no filesystems meet all three criteria:
● ✅ 1,110 filesystems are writable
● ✅ 733 filesystems have <15% free space
● ❌ 0 filesystems are predicted to be full in 4 hours
The prediction algorithm determined that none of the low-space filesystems have a consumption trend steep enough to
exhaust space within 4 hours, which is actually good news for your infrastructure!
show me the graph for this query for last 24 hours
Simple, Reliable, Efficient Monitoring victoriametrics.com
I'll fetch the data for this query over the last 24 hours and create a graph for you.
query_range
Simple, Reliable, Efficient Monitoring victoriametrics.com
The query returned no results over the past 24 hours, which means no filesystems met all three criteria simultaneously.
This is actually good news! Let me create a more informative dashboard by breaking down the query components to show
you what's happening with your filesystem monitoring.
Simple, Reliable, Efficient Monitoring victoriametrics.com
Perfect! Now I have filesystem data to create a meaningful visualization. I can see the filesystem usage patterns over the
past few hours. Let me create an interactive dashboard showing the filesystem monitoring data.
I've created a comprehensive filesystem monitoring dashboard that visualizes your query results and explains why no
alerts were triggered. Here are the key findings:
query_range
query_range
query_range
query_range
Filesystem Monitoring Dashboard
Interactive artifact
. . .
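Behind the query_range tool calls above is the standard /api/v1/query_range endpoint; a rough HTTP equivalent of the 24-hour request (host, step and the exact expression are assumptions) could look like this:

# Fetch 24 hours of the "current usage" component with 5-minute resolution (GNU date syntax)
curl -s 'http://localhost:8428/api/v1/query_range' \
  --data-urlencode 'query=(node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""}) * 100' \
  --data-urlencode "start=$(date -d '24 hours ago' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=5m'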
Simple, Reliable, Efficient Monitoring victoriametrics.com
Simple, Reliable, Efficient Monitoring victoriametrics.com
Available tools
rules / alerts / test_rules
active_queries / top_queries
query / query_range
tsdb_status (cardinality explorer) / metric_statistic
metrics_relabel_debug / downsampling_filters_debug / retention_filters_debug
explain_query / prettify_query
metrics / labels / label_values / series
tenants (for cluster)
documentation
Simple, Reliable, Efficient Monitoring victoriametrics.com
Give me the top metric names by cardinality
Simple, Reliable, Efficient Monitoring victoriametrics.com
I'll get the top metric names by cardinality from your VictoriaMetrics instance.
tsdb_status
Here are the top 20 metric names by cardinality in your VictoriaMetrics instance:
Rank Metric Name Series Count Requests Count
1 etcd_request_duration_seconds_bucket 11,304 2,658,477
2 apiserver_request_duration_seconds_bucket 10,920 3,606,927
3 github_downloads_total 8,387 4,181,762
4 apiserver_request_sli_duration_seconds_bucket 6,908 1,751,128
… … … …
17 go_gc_pauses_seconds_bucket 1,517 446,691
18 go_sched_latencies_seconds_bucket 1,517 381,902,240
19 csi_operations_seconds_bucket 1,498 109,372
20 kube_node_status_condition 755 590,471
. . .
Simple, Reliable, Efficient Monitoring victoriametrics.com
Key Observations:
● Total series across all metrics: 249,452
● Top contributors: The etcd and apiserver duration bucket metrics have the highest cardinality, likely due to multiple
label combinations (different endpoints, methods, status codes, etc.)
● High query activity: go_sched_latencies_seconds_bucket has extremely high request count (381M+) despite
moderate cardinality
● Bucket metrics dominate: Most high-cardinality metrics are histogram buckets (_bucket suffix), which is typical as
they create multiple series per metric
The histogram bucket metrics typically have high cardinality because they create separate time series for each bucket
boundary (le label) combined with other dimensional labels.
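As an aside, the tsdb_status tool is backed by the Prometheus-compatible /api/v1/status/tsdb endpoint, which VictoriaMetrics extends with topN and date arguments; a rough HTTP equivalent of this request (host is an assumption) would be:

# Top 20 metric names by series count
curl -s 'http://localhost:8428/api/v1/status/tsdb?topN=20' | jq '.data.seriesCountByMetricName'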
Which of them are never queried?
I can see from the data that all of the top 20 metrics by cardinality have been used in queries (they all have non-zero
request counts). However, let me check for metrics that are rarely or never queried to give you a better picture.
metric_statistics
Simple, Reliable, Efficient Monitoring victoriametrics.com
Based on the query statistics, all of the top 20 metrics by cardinality are actively being used in queries. None of them
have zero query requests.
However, I did find some completely unused metrics (0 queries) in your system, though these are not among the
high-cardinality ones:
Completely unused metrics (0 queries):
● apiserver_nodeport_repair_reconcile_errors_total
● kube_pod_init_container_status_last_terminated_reason
● vmanomaly_reader_response_parsing_seconds_created
Key findings about your top cardinality metrics:
● Most actively queried: go_sched_latencies_seconds_bucket with 381M+ queries
● Least queried among top 20: volume_operation_total_seconds_bucket with 96,335 queries
● All have substantial usage: Even the least queried high-cardinality metric has nearly 100k queries
This actually indicates a healthy situation - your high-cardinality metrics are being actively used, meaning they're
providing value and not just consuming storage space unnecessarily.
If you're concerned about storage costs, you might want to focus on:
● Retention policies for less frequently queried metrics
● Downsampling strategies for histogram buckets that don't need full granularity
● Label optimization to reduce unnecessary cardinality combinations
Simple, Reliable, Efficient Monitoring victoriametrics.com
ok, let's create relabeling rule to drop all completely
unused metrics
Great idea! Let me first get a complete list of completely unused metrics (0 queries), then we'll create relabeling rules to
drop them.
metric_statistics
Simple, Reliable, Efficient Monitoring victoriametrics.com
Perfect! I found several completely unused metrics (0 queries). Let me create relabeling rules to drop them. Here are the
metrics with exactly 0 queries:
Relabeling Rules To Drop Unused Metrics
Code
Simple, Reliable, Efficient Monitoring victoriametrics.com
Now let me test these relabeling rules using the debug tool to ensure they work correctly:
metric_relabel_debug
Simple, Reliable, Efficient Monitoring victoriametrics.com
Let me test another unused metric to verify the pattern works:
metric_relabel_debug
Simple, Reliable, Efficient Monitoring victoriametrics.com
Perfect! Now let me test that we don't accidentally drop metrics that should be kept:
metric_relabel_debug
Simple, Reliable, Efficient Monitoring victoriametrics.com
Excellent! The relabeling rules are working perfectly. Here's a summary:
✅ Validation Results
. . .
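The generated rules themselves are only visible as a screenshot on the slide; a minimal sketch of what such drop rules could look like in a vmagent/Prometheus metric_relabel_configs section, built from the unused metric names listed earlier (the regex and its placement are assumptions, validate with the metrics_relabel_debug tool before applying):

metric_relabel_configs:
  # Drop metrics that the statistics tool reported as never queried
  - source_labels: [__name__]
    regex: "apiserver_nodeport_repair_reconcile_errors_total|kube_pod_init_container_status_last_terminated_reason|vmanomaly_reader_response_parsing_seconds_created"
    action: drop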
Simple, Reliable, Efficient Monitoring victoriametrics.com
How to write data to VictoriaMetrics Cloud?
Simple, Reliable, Efficient Monitoring victoriametrics.com
Thinking about writing metrics to VictoriaMetrics Cloud platform
documentation
Based on the documentation search results, I can see information about various ways to…
documentation
Now I have comprehensive information about writing data to VictoriaMetrics Cloud.
documentation
Based on the documentation, here's a comprehensive guide on how to write data to VictoriaMetrics Cloud:
Synthesized comprehensive strategy for VictoriaMetrics Cloud data writing.
. . .
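The guide itself is shown as screenshots on the following slides; as one common option, a Prometheus or vmagent remote_write section pointing at a VictoriaMetrics Cloud deployment typically looks roughly like this (the URL and token are placeholders, take the real values from your deployment's access page):

remote_write:
  # Endpoint and access token are placeholders - copy the real ones from the Cloud UI
  - url: https://<your-cloud-deployment-endpoint>/api/v1/write
    authorization:
      credentials: <YOUR_ACCESS_TOKEN>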
Simple, Reliable, Efficient Monitoring victoriametrics.com
https://github.com/modelcontextprotocol/modelcontextprotocol/issues/204
Integration with
VictoriaMetrics Cloud
victoriametrics.com
Simple, Reliable, Efficient Monitoring
mcp-victoriametrics + VictoriaMetrics Cloud
Simple, Reliable, Efficient Monitoring victoriametrics.com
https://docs.victoriametrics.com/victoriametrics-cloud/api/
Simple, Reliable, Efficient Monitoring victoriametrics.com
How to set up in your client?
{
"mcpServers": {
"VictoriaMetrics": {
"command": "/path/to/mcp-victoriametrics",
"env": {
"VMC_API_KEY": "<YOUR_CLOUD_API_KEY>"
}
}
}
}
Claude → claude_desktop_config.json
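For a self-hosted VictoriaMetrics instance, the server is pointed at the instance URL through environment variables instead of a Cloud API key; the variable names below are recalled from the mcp-victoriametrics README and should be double-checked there:

{
  "mcpServers": {
    "VictoriaMetrics": {
      "command": "/path/to/mcp-victoriametrics",
      "env": {
        "VM_INSTANCE_ENTRYPOINT": "http://localhost:8428",
        "VM_INSTANCE_TYPE": "single"
      }
    }
  }
}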
Simple, Reliable, Efficient Monitoring victoriametrics.com
Additional cloud-specific tools
cloud_providers supported cloud providers
regions supported regions
tiers supported tiers for VM instances
deployments available VM instances
access_tokens access tokens of VM instance
rule_filenames alerting and recording rules filenames of VM instance
rule_file content of specific alerting or recording rules file of VM instance
Future plans
victoriametrics.com
Simple, Reliable, Efficient Monitoring
Roadmap
Simple, Reliable, Efficient Monitoring victoriametrics.com
Support tool for query execution stats analysis
Add extra knowledge to the server
Implement multi-instance version
Support tools for VMAgent API
Support Alertmanager API
Support metrics metadata
Improve tools for VMAlert API
mcp-victorialogs
victoriametrics.com
Simple, Reliable, Efficient Monitoring
mcp-victorialogs
Simple, Reliable, Efficient Monitoring victoriametrics.com
https://github.com/VictoriaMetrics-Community/mcp-victorialogs
Simple, Reliable, Efficient Monitoring victoriametrics.com
Available tools
flags non-default flags of the VictoriaLogs instance
hits number of matching log entries grouped by buckets
query query logs with LogsQL
stream_field_names / stream_field_values list field names or values for specified log stream
stream_ids / streams list stream IDs or streams for specified query
field_names / field_values list field names or values for specified query
facets most frequent values per each log field
documentation search in embedded VictoriaLogs documentation
stats_query / stats_query_range instant or range stats query (for alerting)
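As with the metrics server, these tools wrap the VictoriaLogs HTTP API; for example, the query tool corresponds to the /select/logsql/query endpoint (the host and the LogsQL expression below are just examples):

# Return log entries containing "error" from the last 15 minutes
curl -s http://localhost:9428/select/logsql/query \
  --data-urlencode 'query=_time:15m error'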
Before the end
victoriametrics.com
Simple, Reliable, Efficient Monitoring
How to improve answers
Simple, Reliable, Efficient Monitoring victoriametrics.com
Quality of responses strongly depends on:
LLM (number of parameters, context size, thinking / reasoning options, etc.)
Client (how well it works with the MCP tools)
Wording
How to improve answers
Simple, Reliable, Efficient Monitoring victoriametrics.com
The power of LLMs is in keeping context, moving step by step, and combining
different tools
If you know the available capabilities (tools), terms and keywords, and form
queries based on them rather than using abstract queries, the results will be
more accurate
Be tolerant of LLM “hallucinations”; try different approaches to improve your
results
How to improve answers
Simple, Reliable, Efficient Monitoring victoriametrics.com
It works much more accurately with well-known metrics from open-source software,
because LLMs already have context about them in their internal knowledge
Combining the capabilities of different MCP servers from the same domain (for
example mcp-victoriametrics + mcp-victorialogs + mcp for k8s) should enhance
the capabilities and open up new, more complex scenarios
You should help LLMs with a plan/context for abstract queries
Can AI/ML Guide Observability?
Simple, Reliable, Efficient Monitoring victoriametrics.com
https://www.youtube.com/watch?v=7t0ZMhX7OIY
VictoriaMetrics
Cloud
Thank you for listening
marshalov@victoriametrics.com
victoriametrics.com
  • 1.
    How to useAI Assistants with your Monitoring system Alexander Marshalov, 19.06.2025 victoriametrics.com Simple, Reliable, Efficient Monitoring
  • 2.
    Agenda What is MCPand how it works How to use VictoriaMetrics MCP Server Features overview About VictoriaLogs MCP Server Integration with Cloud Future plans Usage scenarios and examples victoriametrics.com
  • 3.
    What is MCPand how it works? victoriametrics.com Simple, Reliable, Efficient Monitoring
  • 4.
    victoriametrics.com MCP – ModelContext Protocol Open standard that describes a way for AI applications and agents to connect to and work with your data sources and tools Introduced by Anthropic in November 2024 What is MCP?
  • 5.
    victoriametrics.com MCP – ModelContext Protocol Open standard that describes a way for AI applications and agents to connect to and work with your data sources and tools Introduced by Anthropic in November 2024 What is MCP?
  • 6.
    victoriametrics.com MCP – ModelContext Protocol Open standard that describes a way for AI applications and agents to connect to and work with your data sources and tools Introduced by Anthropic in November 2024 What is MCP?
  • 7.
    victoriametrics.com Since MCP wasintroduced Simple, Reliable, Efficient Monitoring https://github.com/modelcontextprotocol/servers/ https://github.com/punkpeye/awesome-mcp-servers
  • 8.
    What is MCP? https://modelcontextprotocol.io “Thinkof MCP like a USB-C port for AI applications. Just as USB-C provides a standardized way to connect your devices to various peripherals and accessories, MCP provides a standardized way to connect AI models to different data sources and tools.” Simple, Reliable, Efficient Monitoring victoriametrics.com
  • 9.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Some software with chat interface How MCP works?
  • 10.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Some software with chat interface How MCP works? Any web services or storages
  • 11.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Some software with chat interface How MCP works? Any web services or storages . . .
  • 12.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Some software with chat interface User query How MCP works? Any web services or storages . . .
  • 13.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Some software with chat interface User query How MCP works? Hey, tell me something about my metrics… Any web services or storages . . .
  • 14.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Some software with chat interface User Any web services or storages . . . query How MCP works? Hey, tell me something about my metrics…
  • 15.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Some software with chat interface User Any web services or storages . . . query How MCP works? Hey, tell me something about my metrics… get all necessary data and documentation
  • 16.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Some software with chat interface User Any web services or storages . . . query How MCP works? Hey, tell me something about my metrics… I’m glad you asked 6:94:20
  • 17.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Some software with chat interface User query How MCP works? Any web services or storages . . . ?
  • 18.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Some software with chat interface (MCP host app) User LLM Any web services or storages . . . query How MCP works? ?
  • 19.
    What is LLM? LLM→ Large Language Model A type of artificial intelligence model trained on vast amounts of text data to understand and generate human-like language. These models can perform various language tasks such as writing, translation, summarization, and question-answering by predicting the most likely next words or phrases based on patterns learned during training. Simple, Reliable, Efficient Monitoring victoriametrics.com
  • 20.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Some software with chat interface (MCP host app) User LLM Any web services or storages . . . query How MCP works? ?
  • 21.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Some software with chat interface (MCP host app) User LLM Any web services or storages . . . query How MCP works? Tricky integrations, specific for client Tricky integrations, specific for client Tricky integrations, specific for client, service, LLM ?
  • 22.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Some software with chat interface (MCP host app) User LLM Any web services or storages . . . query How MCP works? Tricky integrations, specific for client Tricky integrations, specific for client Tricky integrations, specific for client, service, LLM M x N problem
  • 23.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Some software with chat interface User query How MCP works? Any web services or storages . . . MCP Server 1 MCP Server 2 MCP Server 3 LLM API calls API calls API calls
  • 24.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Some software with chat interface User LLM Any web services or storages . . . query API calls How MCP works? unified API MCP Servers MCP Servers MCP Servers
  • 25.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Some software with chat interface (MCP host app) User LLM MCP Client MCP Servers MCP Servers MCP Servers Any web services or storages . . . query API calls How MCP works? MCP protocol
  • 26.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Some software with chat interface (MCP host app) User LLM MCP Client Any web services or storages . . . query API calls How MCP works? MCP Servers MCP Servers MCP Servers MCP protocol ?
  • 27.
    Simple, Reliable, EfficientMonitoring Some software with chat interface (MCP host app) User LLM MCP Client Any web services or storages . . . query API calls How MCP works? MCP Servers MCP Servers MCP Servers MCP protocol victoriametrics.com
  • 28.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Some software with chat interface (MCP host app) User LLM MCP Client Any web services or storages . . . query API calls Tools - some executable functionality functionality for clients How MCP works? Prompts - reusable prompt templates Resources - some data to provide context for LLM MCP Servers declare and provide entities:
  • 29.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Some software with chat interface (MCP host app) User LLM MCP Client Any web services or storages . . . query API calls Tools - some executable functionality for clients How MCP works? Prompts - reusable prompt templates Resources - some data to provide context for LLM MCP Servers declare and provide entities:
  • 30.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Some software with chat interface (MCP host app) User LLM MCP Client Any web services or storages . . . query API calls Tools - some executable functionality for clients How MCP works? Prompts - reusable prompt templates Resources - some data to provide context for LLM MCP Servers declare and provide entities:
  • 31.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Some software with chat interface (MCP host app) User LLM MCP Client Any web services or storages . . . query API calls Tools - some executable functionality for clients How MCP works? Prompts - reusable prompt templates Resources - some data to provide context for LLM MCP Servers declare and provide entities:
  • 32.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Some software with chat interface (MCP host app) User LLM MCP Client Any web services or storages . . . query API calls Tools - some executable functionality for clients How MCP works? Prompts - reusable prompt templates Resources - some data to provide context for LLM MCP Servers declare and provide entities: Common recipes or scenarios that works with this MCP
  • 33.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Some software with chat interface (MCP host app) User LLM MCP Client Any web services or storages . . . query API calls Tools - some executable functionality for clients How MCP works? Prompts - reusable prompt templates Resources - some data to provide context for LLM MCP Servers declare and provide entities: Documentation, specifications, etc… Common recipes or scenarios that works with this MCP
  • 34.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Some software with chat interface (MCP host app) User LLM MCP Client Any web services or storages . . . query API calls Tools - some executable functionality for clients How MCP works? Prompts - reusable prompt templates Resources - some data to provide context for LLM MCP Servers declare and provide entities: API Endpoints (can be generated based on your OpenAPI Spec) Documentation, specifications, etc… Common recipes or scenarios that works with this MCP
  • 35.
    You can createOpenAPI based MCP Servers Simple, Reliable, Efficient Monitoring victoriametrics.com https://jedisct1.github.io/openapi-mcp/ https://github.com/higress-group/openapi-to-mcpserver https://github.com/harsha-iiiv/openapi-mcp-generator https://github.com/ckanthony/openapi-mcp https://mcp-link.vercel.app/ https://github.com/hannesj/mcp-openapi-schema https://github.com/janwilmake/openapi-mcp-server https://github.com/baryhuang/mcp-server-any-openapi
  • 44.
    Simple, Reliable, Efficient Monitoring victoriametrics.com User Some software with chat interface (MCP host app) Any web services or storages MCP Servers MCP Servers MCP Servers . . . API calls LLM MCP Client How MCP works? query MCP protocol Here is a query (prompt) and a list of available tools with their descriptions. Based on this data, which of the tools should I use to answer the query? Here is a query (prompt) and a description of the tool and its parameters. Based on this data, what parameter values should I use for this tool? Here is a query (prompt) and extra context from the tool. Based on this data, what should I answer next?
  • 48.
    Simple, Reliable, EfficientMonitoring victoriametrics.com User MCP Servers MCP Servers MCP Servers Some software with chat interface (MCP host app) Any web services or storages . . . API calls LLM MCP Client How MCP works? MCP protocol query AI agent
  • 49.
    Simple, Reliable, EfficientMonitoring victoriametrics.com User Some software with chat interface (MCP host app) AI agent LLM How MCP works?
  • 50.
    User MCP Servers MCP Servers MCP Servers Some software with chat interface (MCP host app) Any web services or storages . . . API calls AI agent LLM How MCP works? Simple, Reliable, Efficient Monitoring victoriametrics.com AI agents are autonomous software tools: they use AI to pursue goals and complete tasks, show reasoning and planning, have a level of autonomy to make decisions, and can work with other agents to coordinate and perform more complex workflows.
  • 53.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Some software with chat interface (MCP host app) User MCP Servers MCP Servers MCP Servers Any web services or storages . . . query How MCP works? Hey, tell me something about my metrics… I’m glad you asked 6:94:20
  • 54.
  • 55.
    mcp-victoriametrics Simple, Reliable, EfficientMonitoring victoriametrics.com https://github.com/VictoriaMetrics-Community/mcp-victoriametrics
  • 56.
    victoriametrics.com How to install with docker ▶ docker pull ghcr.io/victoriametrics-community/mcp-victoriametrics:latest Simple, Reliable, Efficient Monitoring
  • 57.
    victoriametrics.com How to install with go ▶ go install github.com/VictoriaMetrics-Community/mcp-victoriametrics/cmd/mcp-victoriametrics@latest Simple, Reliable, Efficient Monitoring
  • 58.
    victoriametrics.com How to install from source code ▶ git clone https://github.com/VictoriaMetrics-Community/mcp-victoriametrics.git ▶ cd mcp-victoriametrics ▶ make build # if you want to build with local tooling ▶ docker build -t mcp-victoriametrics . # if you want to build with docker Simple, Reliable, Efficient Monitoring
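    After building (or installing via go install), a quick smoke test is to start the server by hand: in stdio mode it should stay running and wait for MCP messages on standard input. The entrypoint and instance type below are the public playground values used later in this deck; the binary path depends on how and where you built it, so treat this as a sketch rather than an exact command:
    ▶ VM_INSTANCE_ENTRYPOINT=https://play.victoriametrics.com VM_INSTANCE_TYPE=cluster ./mcp-victoriametrics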
  • 59.
    victoriametrics.com How to install using binaries or other methods https://github.com/VictoriaMetrics-Community/mcp-victoriametrics Simple, Reliable, Efficient Monitoring
  • 60.
    victoriametrics.com Awesome MCP clients list Simple, Reliable, Efficient Monitoring https://github.com/punkpeye/awesome-mcp-clients https://www.pulsemcp.com/clients
  • 61.
    victoriametrics.com Which clients are supported? Claude VS Code Zed JetBrains IDEs Windsurf Cline Goose Cursor RooCode Simple, Reliable, Efficient Monitoring ChatGPT (in beta) Continue LibreChat Warp
  • 62.
    Simple, Reliable, Efficient Monitoring victoriametrics.com How to set up in your client? (stdio mode) Name of mcp-server: just a human-readable name you'll see in the UI. Command to run: can be the binary file of the server or a tool for running it (i.e. docker). Arguments (args): command-line arguments of the binary file. Environment variables (env): parameters of the MCP server.
  • 69.
    Simple, Reliable, Efficient Monitoring victoriametrics.com How to set up in your client? { "mcpServers": { "VictoriaMetrics": { "command": "/path/to/mcp-victoriametrics", "env": { "VM_INSTANCE_ENTRYPOINT": "<YOUR_VM_INSTANCE>", "VM_INSTANCE_TYPE": "<YOUR_VM_INSTANCE_TYPE>" } } } } Claude → claude_desktop_config.json Here "VictoriaMetrics" is the name of the MCP server, "command" is the command to run, and "env" holds its parameters in the form of environment variables.
  • 73.
    Simple, Reliable, EfficientMonitoring victoriametrics.com How to set up in your client ? { "servers": { "VictoriaMetrics": { "type": "stdio", "command": "/path/to/mcp-victoriametrics", "env": { "VM_INSTANCE_ENTRYPOINT": "<YOUR_VM_INSTANCE>", "VM_INSTANCE_TYPE": "<YOUR_VM_INSTANCE_TYPE>" } } } } Visual Studio Code → settings.json → “mcp”
  • 74.
    Simple, Reliable, EfficientMonitoring victoriametrics.com How to set up in your client ? { "mcpServers": { "VictoriaMetrics": { "command": "/path/to/mcp-victoriametrics", "env": { "VM_INSTANCE_ENTRYPOINT": "<YOUR_VM_INSTANCE>", "VM_INSTANCE_TYPE": "<YOUR_VM_INSTANCE_TYPE>" } } } } Cursor → mcp.json
  • 75.
    Simple, Reliable, EfficientMonitoring victoriametrics.com How to set up in your client ? "context_servers": { "VictoriaMetrics": { "command": { "path": "/path/to/mcp-victoriametrics", "env": { "VM_INSTANCE_ENTRYPOINT": "<YOUR_VM_INSTANCE>", "VM_INSTANCE_TYPE": "<YOUR_VM_INSTANCE_TYPE>" } } } } Zed → settings.json
  • 76.
    Simple, Reliable, EfficientMonitoring victoriametrics.com How to set up in your client ? { "mcpServers": { "VictoriaMetrics": { "command": "/path/to/mcp-victoriametrics", "env": { "VM_INSTANCE_ENTRYPOINT": "<YOUR_VM_INSTANCE>", "VM_INSTANCE_TYPE": "<YOUR_VM_INSTANCE_TYPE>" } } } } JetBrains IDE → settings
  • 78.
    Simple, Reliable, Efficient Monitoring victoriametrics.com How to set up in your client? { "mcpServers": { "victoriametrics": { "command": "docker", "args": [ "run", "-i", "--rm", "-e", "VM_INSTANCE_ENTRYPOINT", "-e", "VM_INSTANCE_TYPE", "ghcr.io/victoriametrics-community/mcp-victoriametrics:latest" ], "env": { "VM_INSTANCE_ENTRYPOINT": "https://play.victoriametrics.com", "VM_INSTANCE_TYPE": "cluster" } } } } Claude → claude_desktop_config.json
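    The same configuration can be exercised as a one-off command before wiring it into Claude; this is simply the config above rewritten as a shell invocation, so if the container exits immediately or prints errors, the problem is in the image or the environment values rather than in the client:
    ▶ docker run -i --rm -e VM_INSTANCE_ENTRYPOINT=https://play.victoriametrics.com -e VM_INSTANCE_TYPE=cluster ghcr.io/victoriametrics-community/mcp-victoriametrics:latest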
  • 79.
    victoriametrics.com Tools in your client Simple, Reliable, Efficient Monitoring
  • 80.
    mcp-victoriametrics configuration Simple, Reliable, Efficient Monitoring victoriametrics.com https://github.com/VictoriaMetrics-Community/mcp-victoriametrics
  • 81.
    Simple, Reliable, EfficientMonitoring victoriametrics.com How to set up in your client ? MCP Client MCP Servers MCP Servers MCP Servers
  • 82.
    Simple, Reliable, EfficientMonitoring victoriametrics.com How to set up in your client ? MCP Client MCP Servers MCP Servers MCP Servers ?
  • 83.
    STDIO Simple, Reliable, Efficient Monitoring victoriametrics.com The most common transport. The stdio transport enables communication through standard input and output streams, which is particularly useful for local integrations and command-line tools. Simple, secure, no network. Perfect for CLIs, desktop apps and local integrations. The fastest (~100μs). Only local usage, no scalability. streamable HTTP The Streamable HTTP transport uses HTTP POST requests for client-to-server communication and optional Server-Sent Events (SSE) streams for server-to-client communication. Good for web services and APIs. Multi-tenancy, concurrency, scalability. Slower than others: ~1-10ms (network + HTTP overhead). Still small adoption. SSE Deprecated, replaced with Streamable HTTP, but still supported as the only way to connect some of the clients. The legacy SSE transport enabled server-to-client streaming with HTTP POST requests for client-to-server communication.
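    For comparison with the stdio configs shown earlier: clients that support network transports are pointed at a URL instead of a command. The exact key names vary between clients and the address below is purely a placeholder, so treat this as a sketch of the idea rather than a copy-paste config:
    {
      "servers": {
        "VictoriaMetrics": {
          "type": "http",
          "url": "https://mcp.example.internal/mcp"
        }
      }
    }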
  • 87.
  • 88.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Available tools rules / alerts / test_rules active_queries / top_queries query / query_range metrics_relabel_debug / downsampling_filters_debug / retention_filters_debug explain_query / prettify_query metrics / labels / label_values / series tenants (for cluster) documentation tsdb_status (cardinality explorer) / metric_statistic
  • 96.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Which tenants do we have? I'll check which tenants are available in your VictoriaMetrics instance. Your VictoriaMetrics instance has 4 tenants: ● 0:0 (default tenant) ● 100:0 ● 10:0 ● 1:0 The tenant format follows the pattern accountID:projectID. The tenant "0:0" is typically the default tenant used when no specific tenant is specified in queries. tenants from now on, let's work with default tenant Understood! I'll use the default tenant "0" (which corresponds to "0:0") for all subsequent queries and operations. This is actually the default behavior, so all VictoriaMetrics API calls will automatically use the default tenant unless you specify otherwise.
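    Under the hood the tenants tool corresponds to the cluster's tenant-listing API on vmselect. If you want to verify the same data without the assistant, a direct request looks roughly like this (host and port are placeholders for your own vmselect):
    ▶ curl http://<vmselect>:8481/admin/tenants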
  • 101.
    Simple, Reliable, EfficientMonitoring victoriametrics.com rules / alerts / test_rules active_queries / top_queries query / query_range explain_query / prettify_query metrics_relabel_debug / downsampling_filters_debug / retention_filters_debug Available tools metrics / labels / label_values / series tenants (for cluster) documentation tsdb_status (cardinality explorer) / metric_statistic
    Simple, Reliable, EfficientMonitoring victoriametrics.com What metrics do we have? I'll get the list of available metrics from your VictoriaMetrics instance. Your VictoriaMetrics instance has a comprehensive set of metrics across various categories: ● Infrastructure & System Metrics: ○ Kubernetes components: Extensive metrics from API server (apiserver_*), kubelet (kubelet_*), controller runtime (controller_runtime_*) ○ Node metrics: CPU, memory, disk, network statistics (node_*) ○ Container metrics: Resource usage, filesystem, network (container_*) ● Monitoring & Observability: ○ Prometheus ecosystem: Alertmanager (alertmanager_*), kube-state-metrics (kube_*) ○ VictoriaMetrics components: VM storage (vm_*), VMAgent (vmagent_*), VMAlert (vmalert_*), VMAuth (vmauth_*) ○ VictoriaLogs: (vl_*) for log management ○ Vector log collector: (vector_*) ○ ArgoCD: Application and GitOps metrics (argocd_*) ● Application & Custom Metrics: ○ Go runtime: Garbage collection, memory, goroutines (go_*) ○ HTTP & networking: Request durations, connection stats (grpc_*, rest_client_*) ○ Custom applications: Various business metrics like all_visits_count, unique_visits_count, github_*, docker_* You have 808 unique metrics covering infrastructure monitoring, application performance, and business analytics. This suggests a robust monitoring setup with full observability across your stack. Would you like me to analyze any specific metric category or help you explore particular aspects of your monitoring data? metrics
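    The metrics tool maps to the standard Prometheus-compatible metric-names API that VictoriaMetrics serves, so the same list can be fetched directly; the base URL below is a placeholder, and on a cluster setup the request goes to vmselect with the path prefixed by /select/<accountID>/prometheus:
    ▶ curl '<your-vm-instance>/api/v1/label/__name__/values'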
  • 109.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Show me argcd metrics I'll get the ArgoCD metrics from your VictoriaMetrics instance. Perfect! Here are all the ArgoCD metrics available in your VictoriaMetrics instance: ArgoCD Metrics Categories Application Management ● argocd_app_info - Information about ArgoCD applications ● argocd_app_orphaned_resources_count - Count of orphaned resources per application ● argocd_app_reconcile_bucket/count/sum - Application reconciliation performance metrics Cluster Management ● argocd_cluster_api_resource_objects - Number of API resource objects in clusters ● argocd_cluster_api_resources - Count of API resources in clusters ● argocd_cluster_cache_age_seconds - Age of cluster cache data ● argocd_cluster_connection_status - Connection status to managed clusters ● argocd_cluster_events_total - Total cluster events processed ● argocd_cluster_info - Information about managed clusters Git Operations ● argocd_git_request_duration_seconds_* - Git request latency metrics (histogram) ● argocd_git_request_total - Total Git requests made . . . metrics
  • 110.
    Simple, Reliable, EfficientMonitoring victoriametrics.com What labels does the argocd_app_info have? Kubectl Operations ● argocd_kubectl_rate_limiter_duration_seconds_* - Rate limiting duration metrics ● argocd_kubectl_request_duration_seconds_* - Kubectl request performance ● argocd_kubectl_request_size_bytes_* - Size of kubectl requests ● argocd_kubectl_requests_total - Total kubectl requests ● argocd_kubectl_response_size_bytes_* - Size of kubectl responses Redis Operations ● argocd_redis_request_duration_* - Redis request latency metrics ● argocd_redis_request_total - Total Redis requests Repository Management ● argocd_repo_pending_request_total - Pending repository requests Resource Processing ● argocd_resource_events_processed_in_batch - Git request latency metrics (histogram) ● argocd_resource_events_processing_* - Resource event processing performance General Info ● argocd_info - General ArgoCD instance information Would you like me to query specific ArgoCD metrics to see their current values or analyze particular aspects of your ArgoCD deployment performance?
  • 112.
    Simple, Reliable, EfficientMonitoring victoriametrics.com What labels does the argocd_app_info have? I'll query the series for argocd_app_info to see what labels are available. Based on the series data, the argocd_app_info metric has the following labels: ArgoCD-Specific Labels ● name - The name of the ArgoCD application (e.g., "grafana", "monitoring", "argocd") ● autosync_enabled - Whether auto-sync is enabled ("true"/"false") ● dest_namespace - The destination namespace where the app is deployed ● health_status - Application health status (e.g., "Healthy") ● project - ArgoCD project name (e.g., "default") ● repo - Git repository URL or Helm chart repository ● sync_status - Synchronization status (e.g., "Synced") Standard Kubernetes/Prometheus Labels ● namespace - Kubernetes namespace where ArgoCD is running ("argocd") ● cluster - Cluster identifier ("sandbox") ● service - Kubernetes service name ("argocd-metrics") ● job - Prometheus job name ("argocd-metrics") ● instance - Instance endpoint ("10.71.10.3:8082") ● pod - Pod name ("argocd-application-controller-0") ● container - Container name ("argocd-application-controller") ● endpoint - Metrics endpoint ("metrics") series . . .
  • 113.
    Simple, Reliable, EfficientMonitoring victoriametrics.com What values i have for label name of this metric? ● sync_status - Synchronization status (e.g., "Synced") Standard Kubernetes/Prometheus Labels ● namespace - Kubernetes namespace where ArgoCD is running ("argocd") ● cluster - Cluster identifier ("sandbox") ● service - Kubernetes service name ("argocd-metrics") ● job - Prometheus job name ("argocd-metrics") ● instance - Instance endpoint ("10.71.10.3:8082") ● pod - Pod name ("argocd-application-controller-0") ● container - Container name ("argocd-application-controller") ● endpoint - Metrics endpoint ("metrics") ● exported_namespace - Original namespace label ("argocd") ● prometheus - Prometheus instance ("monitoring/monitoring-vmagent") Your Current Applications I can see you have 24 ArgoCD applications deployed, all with: ● ✅ Auto-sync enabled ● ✅ Healthy status ● ✅ Synced state Some notable applications include monitoring tools (Grafana, Prometheus), VictoriaMetrics components, and various operational tools. Would you like me to query specific applications or analyze the health/sync status across your ArgoCD deployment?
  • 115.
    Simple, Reliable, EfficientMonitoring victoriametrics.com What values i have for label name of this metric? I'll get the distinct values for the name label of the argocd_app_info metric. Here are all the values for the name label of the argocd_app_info metric, representing your deployed ArgoCD applications: Your ArgoCD Applications (24 total) Monitoring & Observability ● analytic-exporter ● grafana ● grafana-dashboards ● kube-state-metrics ● monitoring ● node-exporter ● play-grafana ● play-grafana-dashboards VictoriaMetrics Ecosystem ● victoria-metrics-operator ● vmlicense ● vmlogs-single ● vmlogs-slack2logs ● benchmark-vm-cluster MCP (Model Context Protocol) Services ● mcp-victorialogs label_values . . .
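    Similarly, the labels, label_values and series tools wrap the Prometheus-compatible metadata APIs. The label_values call from this dialogue roughly corresponds to a request like the following (base URL is a placeholder):
    ▶ curl '<your-vm-instance>/api/v1/label/name/values?match[]=argocd_app_info'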
  • 118.
    Simple, Reliable, EfficientMonitoring victoriametrics.com rules / alerts / test_rules active_queries / top_queries query / query_range metrics_relabel_debug / downsampling_filters_debug / retention_filters_debug explain_query / prettify_query Available tools metrics / labels / label_values / series tenants (for cluster) documentation tsdb_status (cardinality explorer) / metric_statistic
  • 122.
    Simple, Reliable, EfficientMonitoring victoriametrics.com show me current alerts I'll check the current alerts in your VictoriaMetrics instance. Alert Summary You have many firing alerts and several pending alerts currently active in your sandbox cluster. Critical Alerts (Firing) 🔴 HostFilesystemDeviceError - Multiple instances ● Severity: Critical ● Status: Firing (since May 26, 2025) ● Issue: Permission denied errors on filesystem devices ● Affected instances: Multiple nodes (10.142.0.x:9100) ● Devices affected: ○ tmpfs filesystems (Kubernetes projected volumes, secrets, config maps) ○ /dev/sdb, /dev/sdc persistent disk mount points ○ Various Kubernetes volume mount paths HostCpuIsUnderutilized - Some instances ● Severity: Info ● Status: Firing ● Issue: CPU utilization < 20% for extended periods ● Recommendation: Consider reducing CPU allocation Pending Alerts ⏳ HostCpuIsUnderutilized - Multiple instances alerts . . .
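    The alerts and rules tools surface what vmalert already exposes on its Prometheus-compatible endpoints, so the raw data behind this summary can also be inspected directly; the host below is a placeholder and 8880 is vmalert's default HTTP port:
    ▶ curl http://<vmalert>:8880/api/v1/alerts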
  • 124.
    Simple, Reliable, EfficientMonitoring victoriametrics.com show me the graph for the queries for these alerts ○ Various Kubernetes volume mount paths HostCpuIsUnderutilized - Some instances ● Severity: Info ● Status: Firing ● Issue: CPU utilization < 20% for extended periods ● Recommendation: Consider reducing CPU allocation Pending Alerts ⏳ HostCpuIsUnderutilized - Multiple instances ● Severity: Info ● Status: Firing ● Issue: CPU utilization < 20% for 1 week ● Recommendation: Various nodes showing underutilization Key Issues to Address 1. Permission Issues: The filesystem device errors suggest there are permission problems accessing mounted volumes on your Kubernetes nodes. This could impact application functionality. 2. Resource Optimization: Multiple CPU underutilization alerts suggest you may be able to optimize your cluster resource allocation. 3. Cluster Health: The large number of filesystem permission errors across multiple nodes indicates a systemic issue that should be investigated. Most of these alerts are from your sandbox-gke environment in the sandbox cluster. The filesystem permission errors in particular warrant immediate attention as they could affect application performance and stability.
  • 125.
    (screenshots: the graphs for these alert queries as rendered in the client)
  • 129.
    Simple, Reliable, EfficientMonitoring victoriametrics.com what alerting rules i have? I'll check what alerting rules are configured in your VictoriaMetrics instance. Great! Now I can see your alerting rule groups. You have quite a comprehensive alerting setup with 16 different rule groups. Here's what I found: Your Alerting Rule Groups You have 16 alerting rule groups configured across several files: Rule Groups by Category: Infrastructure & Host Monitoring: ● node-exporter - Host-level monitoring alerts ● container-events - Container event monitoring ● kubernetes-apps - Kubernetes application alerts ● kubernetes-storage - Kubernetes storage alerts VictoriaMetrics Monitoring: ● vm-health - VictoriaMetrics health monitoring ● vm-license - VictoriaMetrics license monitoring ● vmagent - VMAgent monitoring alerts ● vmalert - VMAlert monitoring alerts ● vmcluster - VMCluster monitoring alerts Application Monitoring: ● ArgoCD - ArgoCD application alerts ● analitycs-exporter - Analytics exporter monitoring rules . . .
  • 130.
    Simple, Reliable, EfficientMonitoring victoriametrics.com rules / alerts / test_rules active_queries / top_queries query / query_range metrics_relabel_debug / downsampling_filters_debug / retention_filters_debug explain_query / prettify_query Available tools metrics / labels / label_values / series tenants (for cluster) documentation tsdb_status (cardinality explorer) / metric_statistic
  • 136.
    Simple, Reliable, Efficient Monitoring victoriametrics.com Show me 3 slowest queries I'll help you find the 3 slowest queries by checking the top queries with the biggest average execution duration. Here are the 3 slowest queries by average execution duration in your VictoriaMetrics instance: Top 3 Slowest Queries 1. Filesystem Space Prediction (4h) - 0.249 seconds 2. Filesystem Inodes Prediction (24h) - 0.206 seconds 3. Filesystem Inodes Prediction (4h) - 0.194 seconds top_queries ( node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 15 and predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 ) ( node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 40 and predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 ) ( node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 20 and predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 ) . . .
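    The top_queries tool is backed by the instance's query-statistics endpoint; a direct request looks roughly like this (the base URL is a placeholder, and the topN parameter limits how many entries come back):
    ▶ curl '<your-vm-instance>/api/v1/status/top_queries?topN=3'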
  • 137.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Let's work with the first one. Please prettify it 2. Filesystem Inodes Prediction (24h) - 0.206 seconds 3. Filesystem Inodes Prediction (4h) - 0.194 seconds Key Observations All of your slowest queries are related to filesystem monitoring and use predict_linear() functions, which are computationally expensive as they perform linear regression calculations over time series data. These queries are likely part of alerting rules for predicting when filesystems will run out of space or inodes. The slowest query takes about 249ms on average, which suggests your VictoriaMetrics instance is performing well overall with relatively fast query execution times. node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 15 and predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 ) ( node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 40 and predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 ) ( node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 20 and predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 )
  • 139.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Let's work with the first one. Please prettify it I'll prettify the slowest query for you using VictoriaMetrics' query formatting endpoint. Here's the prettified version of your slowest query: The prettified version makes the query structure much clearer: prettify_query ( ( ( ( node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} ) * 100 ) < 15 ) and ( predict_linear( node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], (4 * 60) * 60 ) < 0 ) ) and (node_filesystem_readonly{job="node-exporter",fstype!=""} == 0) promql . . .
  • 142.
    Simple, Reliable, EfficientMonitoring victoriametrics.com now explain me this query The prettified version makes the query structure much clearer: 1. Main condition: Filesystem usage > 85% (less than 15% available) 2. Prediction condition: Linear prediction suggests the filesystem will be full within 4 hours (14,400 seconds) 3. Safety condition: Filesystem is not read-only This query is designed to alert when a filesystem is both critically low on space AND predicted to run out completely within 4 hours based on the current trend. node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} ) * 100 ) < 15 ) and ( predict_linear( node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], (4 * 60) * 60 ) < 0 ) ) and (node_filesystem_readonly{job="node-exporter",fstype!=""} == 0) promql
  • 144.
    Simple, Reliable, EfficientMonitoring victoriametrics.com now explain me this query What This Query Does This query identifies filesystems that are critically low on disk space AND predicted to run out completely within 4 hours. It's a proactive alerting rule for preventing disk space exhaustion. Step-by-Step Breakdown 1. Current Usage Check (First Condition) What it does: ● node_filesystem_avail_bytes: Available space in bytes (space non-root users can use) ● node_filesystem_size_bytes: Total filesystem size in bytes ● Division: Calculates the ratio of available space to total space ● Multiply by 100: Converts to percentage ● < 15: Finds filesystems with less than 15% free space Filters applied: ● job="node-exporter": Only data from Node Exporter ● fstype!="": Excludes entries without filesystem type (filters out pseudo-filesystems) 2. Future Prediction Check (Second Condition) What it does: explain_query (node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""}) * 100 < 15 promql predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], (4 * 60) * 60) < 0 promql . . .
  • 145.
    Simple, Reliable, EfficientMonitoring victoriametrics.com ● Division: Calculates the ratio of available space to total space ● Multiply by 100: Converts to percentage ● < 15: Finds filesystems with less than 15% free space Filters applied: ● job="node-exporter": Only data from Node Exporter ● fstype!="": Excludes entries without filesystem type (filters out pseudo-filesystems) 2. Future Prediction Check (Second Condition) What it does: ● predict_linear(): Uses linear regression to predict future values ● [6h]: Analyzes the last 6 hours of data to establish the trend ● (4 * 60) * 60 = 14,400 seconds: Predicts the value 4 hours into the future ● < 0: Checks if the predicted available space will be negative (completely full) How prediction works: ● Looks at how disk space has been decreasing over the past 6 hours ● Extrapolates this trend 4 hours into the future ● If the trend suggests 0 or negative bytes available, it triggers 3. Read-Only Filter (Third Condition) What it does: ● node_filesystem_readonly: Indicates if filesystem is mounted read-only (1) or read-write (0) ● == 0: Only includes read-write filesystems ● Why this matters: No point alerting on read-only filesystems since you can't free up space on them predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], (4 * 60) * 60) < 0 promql . . . node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 promql
  • 146.
    Simple, Reliable, EfficientMonitoring victoriametrics.com 3. Read-Only Filter (Third Condition) What it does: ● node_filesystem_readonly: Indicates if filesystem is mounted read-only (1) or read-write (0) ● == 0: Only includes read-write filesystems ● Why this matters: No point alerting on read-only filesystems since you can't free up space on them Combined Logic The query uses AND operators, so ALL three conditions must be true: Why This Query is Slow 1. predict_linear() function: Computationally expensive as it performs linear regression analysis 2. 6-hour lookback window: Must analyze 6 hours of historical data for each filesystem 3. Multiple metric queries: Three different metrics with filtering 4. Label matching: Complex filtering across multiple label dimensions Practical Use Case This would typically fire alerts like: ● "Filesystem /var/log on server-01 is 12% full and predicted to be completely full in 4 hours based on current usage trends" This gives administrators time to take preventive action before the disk actually fills up, which is much better than alerting only when the disk is already 95% full. node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 promql (Current free space < 15%) AND (Predicted to be full in 4 hours) AND (Filesystem is writable)
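    To sanity-check one leg of this alert yourself, the prediction condition can be run on its own, narrowed to a single filesystem, either in vmui or through the query tool. Everything below is taken from the rule except the mountpoint value, which is a placeholder:
    predict_linear(
      node_filesystem_avail_bytes{job="node-exporter", fstype!="", mountpoint="/var"}[6h],
      4 * 60 * 60
    ) < 0
    A result for a series means that, extrapolating the last 6 hours of data linearly, that filesystem is expected to run out of available space within the next 4 hours.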
  • 150.
    { "functions_info": { "predict_linear": { "name":"predict_linear", "description": "`predict_linear(series_selector[d], t)` is a [rollup function](#rollup-functions), which calculates the value `t` seconds in the future usingnlinear interpolation over [raw samples](https://docs.victoriametrics.com/victoriametrics/keyconcepts/#raw-samples) on the given lookbehind window `d`.nThe predicted value is calculated individually per each time series returned from the given [series_selector](https://docs.victoriametrics.com/victoriametrics/keyconcepts/#filtering).nThis function is supported by PromQL.nSee also [range_linear_regression](#range_linear_regression).", "category": "Rollup functions" } }, "metrics_info": { "node_filesystem_avail_bytes": { "group": "node-exporter", "name": "node_filesystem_avail_bytes", "help": "Filesystem space available to non-root users in bytes.", "type": "gauge", "labels": ["device", "device_error", "fstype", "mountpoint"] }, "node_filesystem_readonly": { "group": "node-exporter", "name": "node_filesystem_readonly", "help": "Filesystem read-only status.", "type": "gauge", "labels": ["device", "device_error", "fstype", "mountpoint"] }, "node_filesystem_size_bytes": { "group": "node-exporter", "name": "node_filesystem_size_bytes", "help": "Filesystem size in bytes.", "type": "gauge", "labels": ["device", "device_error", "fstype", "mountpoint"] } }, "syntax_tree": { "bool": false, "group_modifier": { "args": null, "op": "", "type": "ModifierExpr" }, "join_modifier": { "args": null, "op": "", "type": "ModifierExpr" }, "join_modifier_prefix": null, "keep_metric_name": false, "left": { "bool": false, "group_modifier": { "args": null, "op": "", "type": "ModifierExpr" }, "join_modifier": { "args": null, "op": "", "type": "ModifierExpr" }, "join_modifier_prefix": null, explain_query → response.json
  • 154.
    "types_info": { "BinaryOpExpr": { "name":"Binary operation", "description": "BinaryOpExpr represents binary operation such as `+`, `-`, `*`, `/`, etc.", "fields": { "bool": { "name": "", "description": "Bool indicates whether `bool` modifier is present. For example, `foo u003ebool bar`", "data_type": "bool" }, "group_modifier": { "name": "", "description": "GroupModifier contains modifier such as "on" or "ignoring".", "data_type": "ModifierExpr" }, "join_modifier": { "name": "", "description": "JoinModifier contains modifier such as "group_left" or "group_right".", "data_type": "ModifierExpr" }, "join_modifier_prefix": { "name": "", "description": "JoinModifierPrefix is an optional prefix to add to labels specified inside group_left() or group_right() lists. The syntax is `group_left(foo,bar) prefix "abc"`", "data_type": "StringExpr" }, "keep_metric_names": { "name": "", "description": "If KeepMetricNames is set to true, then the operation should keep metric names.", "data_type": "bool" }, "left": { "name": "", "description": "Left contains left arg for the `left op right` expression.", "data_type": "Expr" }, "op": { "name": "", "description": "Op is the operation itself, i.e. `+`, `-`, `*`, etc.", "data_type": "string" }, "right": { "name": "", "description": "Right contains right arg for the `left op right` expression.", "data_type": "Expr" } } }, "DurationExpr": { "name": "Duration", "description": "DurationExpr represents a duration, e.g. `5m`, `1h`. Supported suffixes are `s` (seconds), `m` (minutes), `h` (hours), `d` (days), `w` (weeks), and `y` (years).", "fields": { "value": { "name": "", "description": "Value is the duration value as a string, e.g. `5m`, `1h`.", "data_type": "string" } } }, "FuncExpr": { "name": "Function", "description": "uncExpr represents MetricsQL function such as `foo(...)`", "fields": { "args": { "name": "", "description": "Arguments of the function, which can be other expressions.", "data_type": "array of Expr" }, "keep_metric_names": { "name": "", "description": "If KeepMetricNames is set to true, then the function should keep metric names.", "data_type": "bool" }, explain_query → response.json
  • 159.
    Simple, Reliable, Efficient Monitoring victoriametrics.com Why This Query is Slow The query uses AND operators, so ALL three conditions must be true: (Current free space < 15%) AND (Predicted to be full in 4 hours) AND (Filesystem is writable) 1. predict_linear() function: Computationally expensive as it performs linear regression analysis 2. 6-hour lookback window: Must analyze 6 hours of historical data for each filesystem 3. Multiple metric queries: Three different metrics with filtering 4. Label matching: Complex filtering across multiple label dimensions Practical Use Case This would typically fire alerts like: ● "Filesystem /var/log on server-01 is 12% full and predicted to be completely full in 4 hours based on current usage trends" This gives administrators time to take preventive action before the disk actually fills up, which is much better than alerting only when the disk is already 95% full. please trace this query
  • 160.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Why This Query is Slow 1. predict_linear() function: Computationally expensive as it performs linear regression analysis 2. 6-hour lookback window: Must analyze 6 hours of historical data for each filesystem 3. Multiple metric queries: Three different metrics with filtering 4. Label matching: Complex filtering across multiple label dimensions Practical Use Case This would typically fire alerts like: ● "Filesystem /var/log on server-01 is 12% full and predicted to be completely full in 4 hours based on current usage trends" This gives administrators time to take preventive action before the disk actually fills up, which is much better than alerting only when the disk is already 95% full. (Filesystem is writable) please trace this query I'll trace the execution of this query to show you exactly how VictoriaMetrics processes it step by step. query
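    The trace shown on the following slides comes from VictoriaMetrics' built-in query tracing, which the demo requests through the query tool. Outside of the chat workflow the same trace is available by adding the trace=1 query arg to the query API, for example (single-node instance with the default port assumed):
      http://victoriametrics:8428/api/v1/query?query=<your_query>&trace=1
    The JSON response then carries a trace section with the per-stage breakdown summarized below.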
  • 161.
    Simple, Reliable, EfficientMonitoring victoriametrics.com please trace this query I'll trace the execution of this query to show you exactly how VictoriaMetrics processes it step by step. Here's a detailed trace analysis of your query execution: Query Execution Summary ● Total Time: 217.9ms ● Series Fetched: 3,309 time series ● Result: 0 matching filesystems (none currently meet all criteria) ● Data Scanned: Over 2.1 million samples Execution Breakdown 1. Read-Only Filesystem Check (21ms) ● Time: 21ms ● Series Found: 1,110 filesystems ● Purpose: Filter out read-only filesystems ● Result: Found 1,110 writable filesystems 2. Current Usage Percentage (75ms) query node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 promql (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 15 promql . . .
  • 162.
    Simple, Reliable, EfficientMonitoring victoriametrics.com ● Result: Found 1,110 writable filesystems 2. Current Usage Percentage (75ms) ● Time: 75ms ● Series Processed: 733 filesystems ● Data Fetched: 67,816 samples ● Storage Calls: Parallel calls to 2 storage nodes ● Result: Found 733 filesystems with <15% free space 3. Prediction Analysis (113ms) - The Bottleneck ● Time: 113ms (52% of total execution time) ● Series Processed: 733 filesystems ● Lookback Window: 6 hours (21,600 seconds) ● Data Analyzed: 1,244,406 samples over 6 hours ● Storage Impact: Fetched 9,913 blocks from storage ● Linear Regression: Performed on each of 733 time series ● Result: None predicted to be full in 4 hours Why This Query is Slow Primary Bottleneck: predict_linear() Function ● 94ms spent on linear regression calculations ● 1.2M+ samples analyzed for trend calculation ● 6-hour lookback requires significant historical data ● No optimization available for predict_linear() function (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 15 promql . . . predict_linear(node_filesystem_avail_bytes[6h], 14400) < 0 promql
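    Taken together with the read-only check from the previous slide, these fragments combine into a single alert expression. The exact selectors used in the demo are not shown in full, so the following is a plausible reconstruction rather than the literal query that was traced (14400 seconds = 4 hours):
      (
        (node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""}) * 100 < 15
      )
      and
      (
        predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 14400) < 0
      )
      and
      (
        node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
      )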
  • 163.
    Simple, Reliable, EfficientMonitoring victoriametrics.com ● Efficient label filtering - used cached metricIDs ● Parallel processing - storage nodes worked simultaneously ● Memory efficient - only 744KB RAM needed for calculations Performance Metrics ● Samples Scanned: 2,111,040 total ● Blocks Fetched: 25,635 blocks ● Network Efficiency: Good - parallel storage access ● Cache Hit Rate: High - used tag filters cache effectively Final Result Currently, no filesystems meet all three criteria: ● ✅ 1,110 filesystems are writable ● ✅ 733 filesystems have <15% free space ● ❌ 0 filesystems are predicted to be full in 4 hours The prediction algorithm determined that none of the low-space filesystems have a consumption trend steep enough to exhaust space within 4 hours, which is actually good news for your infrastructure! show me the graph for this query for last 24 hours
  • 164.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Performance Metrics ● Samples Scanned: 2,111,040 total ● Blocks Fetched: 25,635 blocks ● Network Efficiency: Good - parallel storage access ● Cache Hit Rate: High - used tag filters cache effectively Final Result Currently, no filesystems meet all three criteria: ● ✅ 1,110 filesystems are writable ● ✅ 733 filesystems have <15% free space ● ❌ 0 filesystems are predicted to be full in 4 hours The prediction algorithm determined that none of the low-space filesystems have a consumption trend steep enough to exhaust space within 4 hours, which is actually good news for your infrastructure! show me the graph for this query for last 24 hours I'll fetch the data for this query over the last 24 hours and create a graph for you. query_range
  • 165.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Final Result Currently, no filesystems meet all three criteria: ● ✅ 1,110 filesystems are writable ● ✅ 733 filesystems have <15% free space ● ❌ 0 filesystems are predicted to be full in 4 hours The prediction algorithm determined that none of the low-space filesystems have a consumption trend steep enough to exhaust space within 4 hours, which is actually good news for your infrastructure! show me the graph for this query for last 24 hours I'll fetch the data for this query over the last 24 hours and create a graph for you. The query returned no results over the past 24 hours, which means no filesystems met all three criteria simultaneously. This is actually good news! Let me create a more informative dashboard by breaking down the query components to show you what's happening with your filesystem monitoring. query_range query_range
  • 166.
    Simple, Reliable, EfficientMonitoring victoriametrics.com ● ❌ 0 filesystems are predicted to be full in 4 hours The prediction algorithm determined that none of the low-space filesystems have a consumption trend steep enough to exhaust space within 4 hours, which is actually good news for your infrastructure! show me the graph for this query for last 24 hours I'll fetch the data for this query over the last 24 hours and create a graph for you. The query returned no results over the past 24 hours, which means no filesystems met all three criteria simultaneously. This is actually good news! Let me create a more informative dashboard by breaking down the query components to show you what's happening with your filesystem monitoring. query_range query_range query_range
  • 167.
    Simple, Reliable, EfficientMonitoring victoriametrics.com exhaust space within 4 hours, which is actually good news for your infrastructure! show me the graph for this query for last 24 hours I'll fetch the data for this query over the last 24 hours and create a graph for you. The query returned no results over the past 24 hours, which means no filesystems met all three criteria simultaneously. This is actually good news! Let me create a more informative dashboard by breaking down the query components to show you what's happening with your filesystem monitoring. query_range query_range query_range query_range
  • 168.
    Simple, Reliable, EfficientMonitoring victoriametrics.com show me the graph for this query for last 24 hours I'll fetch the data for this query over the last 24 hours and create a graph for you. The query returned no results over the past 24 hours, which means no filesystems met all three criteria simultaneously. This is actually good news! Let me create a more informative dashboard by breaking down the query components to show you what's happening with your filesystem monitoring. Perfect! Now I have filesystem data to create a meaningful visualization. I can see the filesystem usage patterns over the past few hours. Let me create an interactive dashboard showing the filesystem monitoring data. I've created a comprehensive filesystem monitoring dashboard that visualizes your query results and explains why no alerts were triggered. Here are the key findings: query_range query_range query_range query_range Filesystem Monitoring Dashboard Interactive artifact . . .
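    The query_range tool used above maps onto the Prometheus-compatible range-query API exposed by VictoriaMetrics, so the data behind the 24-hour graph can also be pulled directly, for example (single-node instance with the default port assumed; timestamps are placeholders):
      http://victoriametrics:8428/api/v1/query_range?query=<your_query>&start=<unix_ts_24h_ago>&end=<unix_ts_now>&step=5m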
  • 171.
    Simple, Reliable, EfficientMonitoring victoriametrics.com rules / alerts / test_rules active_queries / top_queries query / query_range metrics_relabel_debug / downsampling_filters_debug / retention_filters_debug explain_query / prettify_query Available tools metrics / labels / label_values / series tenants (for cluster) documentation tsdb_status (cardinality explorer) / metric_statistic
  • 172.
    Simple, Reliable, EfficientMonitoring victoriametrics.com rules / alerts / test_rules active_queries / top_queries query / query_range tsdb_status (cardinality explorer) / metric_statistic metrics_relabel_debug / downsampling_filters_debug / retention_filters_debug explain_query / prettify_query Available tools metrics / labels / label_values / series tenants (for cluster) documentation
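    Under the hood an MCP host calls these tools via the MCP tools/call JSON-RPC method. A minimal request for the query tool might look roughly like the following; the argument name here is an illustrative assumption, since each tool advertises its real input schema through tools/list:
      { "jsonrpc": "2.0", "id": 1, "method": "tools/call", "params": { "name": "query", "arguments": { "query": "up" } } }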
  • 174.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Give me the top metric names by cardinality
  • 175.
    Simple, Reliable, EfficientMonitoring victoriametrics.com I'll get the top metric names by cardinality from your VictoriaMetrics instance. tsdb_status Give me the top metric names by cardinality
  • 176.
    Simple, Reliable, Efficient Monitoring victoriametrics.com
    Give me the top metric names by cardinality
    I'll get the top metric names by cardinality from your VictoriaMetrics instance. tsdb_status
    Here are the top 20 metric names by cardinality in your VictoriaMetrics instance:
    Rank | Metric Name | Series Count | Requests Count
    1 | etcd_request_duration_seconds_bucket | 11,304 | 2,658,477
    2 | apiserver_request_duration_seconds_bucket | 10,920 | 3,606,927
    3 | github_downloads_total | 8,387 | 4,181,762
    4 | apiserver_request_sli_duration_seconds_bucket | 6,908 | 1,751,128
    … | … | … | …
    20 | kube_node_status_condition | 755 | 590,471
    Key Observations: . . .
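    The tsdb_status tool shown above is backed by VictoriaMetrics' cardinality explorer API, so the same top-N view can be fetched directly when needed (single-node instance with the default port assumed):
      http://victoriametrics:8428/api/v1/status/tsdb?topN=20&date=2025-06-19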
  • 177.
    Simple, Reliable, Efficient Monitoring victoriametrics.com
    17 | go_gc_pauses_seconds_bucket | 1,517 | 446,691
    18 | go_sched_latencies_seconds_bucket | 1,517 | 381,902,240
    19 | csi_operations_seconds_bucket | 1,498 | 109,372
    20 | kube_node_status_condition | 755 | 590,471
    Key Observations:
    ● Total series across all metrics: 249,452
    ● Top contributors: The etcd and apiserver duration bucket metrics have the highest cardinality, likely due to multiple label combinations (different endpoints, methods, status codes, etc.)
    ● High query activity: go_sched_latencies_seconds_bucket has extremely high request count (381M+) despite moderate cardinality
    ● Bucket metrics dominate: Most high-cardinality metrics are histogram buckets (_bucket suffix), which is typical as they create multiple series per metric
    The histogram bucket metrics typically have high cardinality because they create separate time series for each bucket boundary (le label) combined with other dimensional labels.
    Which of them are never queried?
  • 178.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Key Observations: ● Total series across all metrics: 249,452 ● Top contributors: The etcd and apiserver duration bucket metrics have the highest cardinality, likely due to multiple label combinations (different endpoints, methods, status codes, etc.) ● High query activity: go_sched_latencies_seconds_bucket has extremely high request count (381M+) despite moderate cardinality ● Bucket metrics dominate: Most high-cardinality metrics are histogram buckets (_bucket suffix), which is typical as they create multiple series per metric The histogram bucket metrics typically have high cardinality because they create separate time series for each bucket boundary (le label) combined with other dimensional labels. Which of them are never queried? 19 csi_operations_seconds_bucket 1,498 109,372 20 kube_node_status_condition 755 590,471 I can see from the data that all of the top 20 metrics by cardinality have been used in queries (they all have non-zero request counts). However, let me check for metrics that are rarely or never queried to give you a better picture. metric_stastistics
  • 179.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Which of them are never queried? I can see from the data that all of the top 20 metrics by cardinality have been used in queries (they all have non-zero request counts). However, let me check for metrics that are rarely or never queried to give you a better picture. Based on the query statistics, all of the top 20 metrics by cardinality are actively being used in queries. None of them have zero query requests. However, I did find some completely unused metrics (0 queries) in your system, though these are not among the high-cardinality ones: Completely unused metrics (0 queries): ● apiserver_nodeport_repair_reconcile_errors_total ● kube_pod_init_container_status_last_terminated_reason ● vmanomaly_reader_response_parsing_seconds_created Key findings about your top cardinality metrics: ● Most actively queried: go_sched_latencies_seconds_bucket with 381M+ queries ● Least queried among top 20: volume_operation_total_seconds_bucket with 96,335 queries ● All have substantial usage: Even the least queried high-cardinality metric has nearly 100k queries This actually indicates a healthy situation - your high-cardinality metrics are being actively used, meaning they're providing value and not just consuming storage space unnecessarily. If you're concerned about storage costs, you might want to focus on: ● Retention policies for less frequently queried metrics ● Downsampling strategies for histogram buckets that don't need full granularity ● Label optimization to reduce unnecessary cardinality combinations metric_stastistics
  • 180.
    Simple, Reliable, EfficientMonitoring victoriametrics.com let's create relabeling rule to remove these completely unused metrics Based on the query statistics, all of the top 20 metrics by cardinality are actively being used in queries. None of them have zero query requests. However, I did find some completely unused metrics (0 queries) in your system, though these are not among the high-cardinality ones: Completely unused metrics (0 queries): ● apiserver_nodeport_repair_reconcile_errors_total ● kube_pod_init_container_status_last_terminated_reason ● vmanomaly_reader_response_parsing_seconds_created Key findings about your top cardinality metrics: ● Most actively queried: go_sched_latencies_seconds_bucket with 381M+ queries ● Least queried among top 20: volume_operation_total_seconds_bucket with 96,335 queries ● All have substantial usage: Even the least queried high-cardinality metric has nearly 100k queries This actually indicates a healthy situation - your high-cardinality metrics are being actively used, meaning they're providing value and not just consuming storage space unnecessarily. If you're concerned about storage costs, you might want to focus on: ● Retention policies for less frequently queried metrics ● Downsampling strategies for histogram buckets that don't need full granularity ● Label optimization to reduce unnecessary cardinality combinations
  • 181.
    Simple, Reliable, EfficientMonitoring victoriametrics.com ok, let's create relabeling rule to drop all completely unused metrics high-cardinality ones: Completely unused metrics (0 queries): ● apiserver_nodeport_repair_reconcile_errors_total ● kube_pod_init_container_status_last_terminated_reason ● vmanomaly_reader_response_parsing_seconds_created Key findings about your top cardinality metrics: ● Most actively queried: go_sched_latencies_seconds_bucket with 381M+ queries ● Least queried among top 20: volume_operation_total_seconds_bucket with 96,335 queries ● All have substantial usage: Even the least queried high-cardinality metric has nearly 100k queries This actually indicates a healthy situation - your high-cardinality metrics are being actively used, meaning they're providing value and not just consuming storage space unnecessarily. If you're concerned about storage costs, you might want to focus on: ● Retention policies for less frequently queried metrics ● Downsampling strategies for histogram buckets that don't need full granularity ● Label optimization to reduce unnecessary cardinality combinations Great idea! Let me first get a complete list of completely unused metrics (0 queries), then we'll create relabeling rules to drop them. metric_stastistics
  • 182.
    Simple, Reliable, EfficientMonitoring victoriametrics.com ok, let's create relabeling rule to drop all completely unused metrics ● Least queried among top 20: volume_operation_total_seconds_bucket with 96,335 queries ● All have substantial usage: Even the least queried high-cardinality metric has nearly 100k queries This actually indicates a healthy situation - your high-cardinality metrics are being actively used, meaning they're providing value and not just consuming storage space unnecessarily. If you're concerned about storage costs, you might want to focus on: ● Retention policies for less frequently queried metrics ● Downsampling strategies for histogram buckets that don't need full granularity ● Label optimization to reduce unnecessary cardinality combinations Great idea! Let me first get a complete list of completely unused metrics (0 queries), then we'll create relabeling rules to drop them. Perfect! I found several completely unused metrics (0 queries). Let me create relabeling rules to drop them. Here are the metrics with exactly 0 queries: metric_stastistics Relabeling Rules To Drop Unused Metrics Code
  • 183.
    Simple, Reliable, EfficientMonitoring victoriametrics.com ok, let's create relabeling rule to drop all completely unused metrics providing value and not just consuming storage space unnecessarily. If you're concerned about storage costs, you might want to focus on: ● Retention policies for less frequently queried metrics ● Downsampling strategies for histogram buckets that don't need full granularity ● Label optimization to reduce unnecessary cardinality combinations Great idea! Let me first get a complete list of completely unused metrics (0 queries), then we'll create relabeling rules to drop them. Perfect! I found several completely unused metrics (0 queries). Let me create relabeling rules to drop them. Here are the metrics with exactly 0 queries: Now let me test these relabeling rules using the debug tool to ensure they work correctly: metric_stastistics Relabeling Rules To Drop Unused Metrics Code metric_relabel_debug
  • 184.
    Simple, Reliable, EfficientMonitoring victoriametrics.com ok, let's create relabeling rule to drop all completely unused metrics ● Downsampling strategies for histogram buckets that don't need full granularity ● Label optimization to reduce unnecessary cardinality combinations Great idea! Let me first get a complete list of completely unused metrics (0 queries), then we'll create relabeling rules to drop them. Perfect! I found several completely unused metrics (0 queries). Let me create relabeling rules to drop them. Here are the metrics with exactly 0 queries: Now let me test these relabeling rules using the debug tool to ensure they work correctly: Let me test another unused metric to verify the pattern works: metric_stastistics Relabeling Rules To Drop Unused Metrics Code metric_relabel_debug metric_relabel_debug
  • 185.
    Simple, Reliable, EfficientMonitoring victoriametrics.com ok, let's create relabeling rule to drop all completely unused metrics Great idea! Let me first get a complete list of completely unused metrics (0 queries), then we'll create relabeling rules to drop them. Perfect! I found several completely unused metrics (0 queries). Let me create relabeling rules to drop them. Here are the metrics with exactly 0 queries: Now let me test these relabeling rules using the debug tool to ensure they work correctly: Let me test another unused metric to verify the pattern works: Perfect! Now let me test that we don't accidentally drop metrics that should be kept: metric_stastistics Relabeling Rules To Drop Unused Metrics Code metric_relabel_debug metric_relabel_debug metric_relabel_debug
  • 186.
    Simple, Reliable, EfficientMonitoring victoriametrics.com Great idea! Let me first get a complete list of completely unused metrics (0 queries), then we'll create relabeling rules to drop them. Perfect! I found several completely unused metrics (0 queries). Let me create relabeling rules to drop them. Here are the metrics with exactly 0 queries: Now let me test these relabeling rules using the debug tool to ensure they work correctly: Let me test another unused metric to verify the pattern works: Perfect! Now let me test that we don't accidentally drop metrics that should be kept: Excellent! The relabeling rules are working perfectly. Here's a summary: ✅ Validation Results metric_stastistics Relabeling Rules To Drop Unused Metrics Code metric_relabel_debug metric_relabel_debug metric_relabel_debug . . .
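    The generated rules live in the "Relabeling Rules To Drop Unused Metrics" artifact and are not reproduced on the slides. A minimal sketch of what such a rule typically looks like, built from the three unused metric names found above (standard Prometheus-compatible relabeling syntax, which is also what the relabel debug tool validates):
      - action: drop
        source_labels: [__name__]
        regex: "apiserver_nodeport_repair_reconcile_errors_total|kube_pod_init_container_status_last_terminated_reason|vmanomaly_reader_response_parsing_seconds_created"
    Dropping by exact metric name keeps the rule safe for the actively queried high-cardinality metrics discussed earlier.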
  • 188.
    Simple, Reliable, EfficientMonitoring victoriametrics.com rules / alerts / test_rules active_queries / top_queries query / query_range tsdb_status (cardinality explorer) / metric_statistic metrics_relabel_debug / downsampling_filters_debug / retention_filters_debug explain_query / prettify_query Available tools metrics / labels / label_values / series tenants (for cluster) documentation
  • 190.
    Simple, Reliable, EfficientMonitoring victoriametrics.com How to write data to VictoriaMetrics Cloud?
  • 191.
    Simple, Reliable, EfficientMonitoring victoriametrics.com How to write data to VictoriaMetrics Cloud? Great idea! Let me first get a complete list of completely unused metrics (0 queries), then we'll create relabeling rules to drop them. Thinking about writing metrics to VictoriaMetrics Cloud platform documentation
  • 192.
    Simple, Reliable, Efficient Monitoring victoriametrics.com How to write data to VictoriaMetrics Cloud? Great idea! Let me first get a complete list of completely unused metrics (0 queries), then we'll create relabeling rules to drop them. Thinking about writing metrics to VictoriaMetrics Cloud platform documentation Based on the documentation search results, I can see information about various ways to… documentation
  • 193.
    Simple, Reliable, Efficient Monitoring victoriametrics.com How to write data to VictoriaMetrics Cloud? Great idea! Let me first get a complete list of completely unused metrics (0 queries), then we'll create relabeling rules to drop them. Thinking about writing metrics to VictoriaMetrics Cloud platform documentation Based on the documentation search results, I can see information about various ways to… documentation Now I have comprehensive information about writing data to VictoriaMetrics Cloud. documentation
  • 194.
    Simple, Reliable, Efficient Monitoring victoriametrics.com How to write data to VictoriaMetrics Cloud? Great idea! Let me first get a complete list of completely unused metrics (0 queries), then we'll create relabeling rules to drop them. Based on the documentation, here's a comprehensive guide on how to write data to VictoriaMetrics Cloud: Thinking about writing metrics to VictoriaMetrics Cloud platform documentation Based on the documentation search results, I can see information about various ways to… documentation Now I have comprehensive information about writing data to VictoriaMetrics Cloud. documentation Synthesized comprehensive strategy for VictoriaMetrics Cloud data writing. . . .
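    The full guide lives in the chat screenshots rather than on the slides, but the most common path is Prometheus remote_write pointed at the deployment's write endpoint with an access token that has write permission. A minimal sketch (the endpoint is a placeholder; take the real URL and token from the VictoriaMetrics Cloud console):
      remote_write:
        - url: https://<your-deployment-endpoint>/api/v1/write
          authorization:
            credentials: <access-token-with-write-permission>
    Running vmagent with -remoteWrite.url pointed at the same endpoint is the other typical option.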
  • 202.
    Simple, Reliable, EfficientMonitoring victoriametrics.com https://github.com/modelcontextprotocol/modelcontextprotocol/issues/204
  • 204.
    mcp-victoriametrics + VictoriaMetrics Cloud Simple, Reliable, Efficient Monitoring victoriametrics.com https://docs.victoriametrics.com/victoriametrics-cloud/api/
  • 205.
    Simple, Reliable, Efficient Monitoring victoriametrics.com How to set up in your client? Claude → claude_desktop_config.json: { "mcpServers": { "VictoriaMetrics": { "command": "/path/to/mcp-victoriametrics", "env": { "VMC_API_KEY": "<YOUR_CLOUD_API_KEY>" } } } }
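    The config above targets VictoriaMetrics Cloud through VMC_API_KEY. For a self-hosted instance the same pattern applies with the server pointed at your own endpoint; the environment variable names below are illustrative assumptions, so verify the exact names in the mcp-victoriametrics README:
      { "mcpServers": { "VictoriaMetrics": { "command": "/path/to/mcp-victoriametrics", "env": { "VM_INSTANCE_ENTRYPOINT": "http://localhost:8428", "VM_INSTANCE_TYPE": "single" } } } }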
  • 206.
    Simple, Reliable, Efficient Monitoring victoriametrics.com
    Additional cloud-specific tools
    cloud_providers: supported cloud providers
    regions: supported regions
    tiers: supported tiers for VM instances
    deployments: available VM instances
    access_tokens: access tokens of VM instance
    rule_filenames: alerting and recording rules filenames of VM instance
    rule_file: content of specific alerting or recording rules file of VM instance
  • 208.
    Roadmap Simple, Reliable, Efficient Monitoring victoriametrics.com
    Support tool for query execution stats analysis
    Add extra knowledge to the server
    Implement multi-instance version
    Support tools for VMAgent API
    Support Alertmanager API
    Support metrics metadata
    Improve tools for VMAlert API
  • 210.
    mcp-victorialogs Simple, Reliable, EfficientMonitoring victoriametrics.com https://github.com/VictoriaMetrics-Community/mcp-victorialogs
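    Setup mirrors the VictoriaMetrics server shown earlier: register the binary as an MCP server and point it at your VictoriaLogs instance. A sketch for claude_desktop_config.json, assuming the default VictoriaLogs port 9428 and an illustrative environment variable name (verify the exact name in the mcp-victorialogs README):
      { "mcpServers": { "VictoriaLogs": { "command": "/path/to/mcp-victorialogs", "env": { "VL_INSTANCE_ENTRYPOINT": "http://localhost:9428" } } } }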
  • 211.
    Simple, Reliable, Efficient Monitoring victoriametrics.com
    Available tools
    flags: non-default flags of the VictoriaLogs instance
    hits: number of matching log entries grouped by buckets
    query: query logs with LogsQL
    stream_field_names / stream_field_values: list field names or values for specified log stream
    stream_ids / streams: list stream IDs or streams for specified query
    field_names / field_values: list field names or values for specified query
    facets: most frequent values per each log field
    documentation: search in embedded VictoriaLogs documentation
    stats_query / stats_query_range: instant or range stats query (for alerting)
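    As an illustration of the kind of input the query and stats_query tools take, a LogsQL expression that counts error entries over the last five minutes might look like this (the filter word is an example, not taken from the demo):
      _time:5m error | stats count() errors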
  • 212.
    Before the end victoriametrics.com Simple, Reliable, Efficient Monitoring
  • 213.
    How to improve answers Simple, Reliable, Efficient Monitoring victoriametrics.com Quality of responses strongly depends on: LLM (number of parameters, context size, thinking/reasoning options, etc.) Client (how well it works with the MCP tools) Wording
  • 216.
    How to improve answers Simple, Reliable, Efficient Monitoring victoriametrics.com The power of LLMs is in keeping context, moving step-by-step, and combining different tools. If you know the available capabilities (tools), terms and keywords, and form queries based on them rather than using abstract queries, the results will be more accurate. Be tolerant of LLM “hallucinations”; try different approaches to improve your results.
  • 219.
    How to improve answers Simple, Reliable, Efficient Monitoring victoriametrics.com It works much more accurately with well-known metrics from open-source software, because LLMs already have context about them in their internal knowledge. Combining the capabilities of different MCP servers from the same domain (for example mcp-victoriametrics + mcp-victorialogs + an MCP server for Kubernetes) should enhance the capabilities and open up new complex scenarios. You should help LLMs with a plan/context when using abstract queries.
  • 222.
    Can AI/ML Guide Observability? Simple, Reliable, Efficient Monitoring victoriametrics.com https://www.youtube.com/watch?v=7t0ZMhX7OIY
  • 223.
    VictoriaMetrics Cloud Thank you for listening marshalov@victoriametrics.com victoriametrics.com