[reliability] Daily Reliability Review - 2026-06-30 #42586

@github-actions

Executive Summary

Reliability triage for github/gh-aw over the last 24h (Sentry org github, project gh-aw, run §28481991416).

Telemetry is flowing — the spans dataset has fresh data through 2026-06-30T21:56Z. Health is degraded, not down: most spans are ok, and long-running agent spans (up to ~39 min, status ok) are normal. Failures concentrate in three recurring classes: (1) a failed agent invocation with repeated model-call errors, (2) a recurring 60-second timeout producing error spans across many runs, and (3) short-lived gateway request errors.

Two correctness caveats shape this report:

  • The Sentry MCP build exposed here has no search_events and no get_trace_details — all queries used list_events (Sentry query syntax) with client-side aggregation.
  • list_events renders only a fixed field subset and does not surface custom attributes, so per-workflow-name attribution was not possible from the query side. Findings are reported at transaction granularity (invoke_agent, gateway.request).

Companion datasets errors and logs both returned No results (24h) — see Notes; treat as an observability gap, not proof of zero errors.
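The client-side aggregation mentioned above can be pictured roughly as follows. This is only a minimal sketch: the row shape (`transaction`, `op`, `status`, `duration_ms`) and the sample values are assumptions made for illustration, not the actual output format of list_events.

```python
from collections import Counter

# Hypothetical rows, loosely modeled on the fixed field subset described above.
# Field names and values are illustrative assumptions, not real query output.
SAMPLE_ROWS = [
    {"transaction": "invoke_agent", "op": "gen_ai", "status": "error", "duration_ms": 786.0},
    {"transaction": "invoke_agent", "op": "gen_ai", "status": "error", "duration_ms": 788.55},
    {"transaction": "invoke_agent", "op": "gen_ai", "status": "ok", "duration_ms": 7508.0},
    {"transaction": "gateway.request", "op": "default", "status": "error", "duration_ms": 59996.0},
    {"transaction": "gateway.request", "op": "http.server", "status": "ok", "duration_ms": 60013.0},
]


def summarize(rows):
    """Count spans per (transaction, status) and track the longest non-ok span per transaction."""
    counts = Counter()
    max_error_ms = {}
    for row in rows:
        counts[(row["transaction"], row["status"])] += 1
        if row["status"] != "ok":
            tx = row["transaction"]
            max_error_ms[tx] = max(max_error_ms.get(tx, 0.0), row["duration_ms"])
    return counts, max_error_ms


if __name__ == "__main__":
    counts, max_error_ms = summarize(SAMPLE_ROWS)
    for (tx, status), n in sorted(counts.items()):
        print(f"{tx:16s} {status:6s} {n}")
    print(max_error_ms)
```

Because each query is capped, counts produced this way describe the sampled rows only and should be read as lower bounds.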

Top Reliability Findings

| Priority | Workflow / Scope | Problem | Evidence | Next Action |
| --- | --- | --- | --- | --- |
| P1: broken user-visible behavior | invoke_agent (agent run) | Agent invocation failed after repeated model-call errors | Trace 7570a964...: root invoke_agent error, dur 31,686 ms; 12 gen_ai child spans error clustered at 785–788 ms (max 788.55 ms) + some ok. 2nd trace 46769f6b...: invoke_agent 77 ms error + 2 gen_ai error | Pull provider error/status on the 786 ms gen_ai spans; confirm whether retry/back-off or a 4xx/5xx hard-fail; verify gh-aw.run.status is ERROR for these runs |
| P2: timeouts | gateway.request POST /mcp/agenticworkflows | Recurring 60-second timeout producing error spans | 23 default-op error spans pinned at ~59.99 s (max 59,996 ms) across ~13 distinct traces (dff68ea...×4, 34d005f6...×3, 4072957004ae×3, d17d3c3c...×2, a52e1349...×2, a8526a4d...×2, b7ac182e...×2, + singles). In dff68ea... parent http.server gateway.request reports ok at 60.0 s while child default spans error (status-propagation gap) | Locate the 60 s deadline on the agenticworkflows MCP path; raise/handle it and propagate child error to the parent span status |
| P3: transport errors | gateway.request (HTTP layer) | Short-lived gateway request failures | default-op error spans ~7 ms–0.9 s across many traces (0717caf0..., 8534bd3d..., 6017ae24..., c2a81c7a..., 85a44aec..., a070c915...) | Classify by HTTP status / client-disconnect; low severity unless rate climbs |
| P4: instrumentation / correlation | project-wide | Release & truncation correlation not usable from query side | find_releases returned No releases despite service.version being emitted (resource attr → Sentry release). has:gen_ai.response.finish_reasons returned 0 though emitted on the conclusion span (send_otlp_span.cjs:2146). gh_aw.workflow_name absent; correct key is gh-aw.workflow.name (present) | Register releases / map service.version→release; verify finish-reasons indexing; report truncation as inconclusive until queryable |

Representative Traces

View representative traces (continuity verified)

P1 — failed agent invocation · 7570a964f4c8045176ccd886c805ef1a

  • Root invoke_agent error, 31,686 ms. Children: 12× gen_ai error @ ~785–788 ms (tight cluster ⇒ same repeated failure), plus gen_ai ok @ 23,425 ms / 7,508 ms. Parent→child lineage intact under one invoke_agent transaction.
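The "tight cluster ⇒ same repeated failure" reading can be made explicit with a small spread check. The sketch below is illustrative only: the 5 ms threshold and the sample durations are assumptions, chosen to sit inside the 785–788 ms band reported above.

```python
def is_tight_cluster(durations_ms, max_spread_ms=5.0):
    """Return True when all durations lie within max_spread_ms of each other.

    max_spread_ms is an assumed threshold; tune it to the noise of the system.
    """
    if len(durations_ms) < 2:
        return False  # a single span cannot form a cluster
    return max(durations_ms) - min(durations_ms) <= max_spread_ms


if __name__ == "__main__":
    # Hypothetical durations inside the reported 785-788 ms band.
    print(is_tight_cluster([785.2, 786.1, 787.4, 788.55]))
    # A 77 ms span next to a 786 ms span is clearly not one cluster.
    print(is_tight_cluster([77.0, 786.0]))
```

A tight cluster is suggestive rather than conclusive; the provider status or message on the spans is still needed to name the actual error.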

P2 — 60 s timeout · dff68ea53f6e74d5bea86470a277acf9

  • Transaction gateway.request POST /mcp/agenticworkflows. 5× http.server @ ~60,013 ms (ok), 4× gen_ai @ ~60,002 ms (ok), default @ ~59,993 ms (error), plus gh-aw.activation.setup @ 11,618 ms / 2,014 ms. Continuity intact; the timeout failure is visible only on the default children, not the parent.
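A status-propagation gap of this kind (child in error, parent still ok) can be searched for mechanically once span rows with parent links are available. The following is a sketch under assumptions: the `Span` record and its field names are hypothetical and do not mirror any Sentry or OTLP schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Span:
    # Hypothetical minimal span record; field names are illustrative only.
    span_id: str
    parent_id: Optional[str]
    op: str
    status: str


def find_status_gaps(spans: List[Span]) -> List[Tuple[str, str]]:
    """Return (parent_id, child_id) pairs where the child errored but the parent reports ok."""
    by_id = {s.span_id: s for s in spans}
    gaps = []
    for s in spans:
        parent = by_id.get(s.parent_id) if s.parent_id else None
        if parent is not None and s.status == "error" and parent.status == "ok":
            gaps.append((parent.span_id, s.span_id))
    return gaps


if __name__ == "__main__":
    trace = [
        Span("a", None, "http.server", "ok"),
        Span("b", "a", "default", "error"),
        Span("c", "a", "gen_ai", "ok"),
    ]
    print(find_status_gaps(trace))
```

Run over a whole trace, a non-empty result marks the spans whose failure would be invisible to a dashboard that only looks at the parent.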

P3 — gateway transport error · 0717caf0b117d7a7b142def7f78ab8f6

  • gateway.request default-op span error @ 6.7 ms — fast-fail at the HTTP layer.

Recommendations

  1. Surface the agent-run failure cause (smallest first). Inspect the 786 ms gen_ai error spans in trace 7570a964... for the provider status/message; the uniform duration strongly implies one repeated error (rate-limit/4xx) rather than diverse failures. Confirm gh-aw.run.status=failure is recorded so these are dashboard-visible.
  2. Fix the 60 s timeout + status propagation on the agenticworkflows MCP path. The deadline recurs across ~13 traces. Even if the limit stays, propagate the child default error up to the parent gateway.request span so the parent is not reported ok at 60.0 s (currently it is).
  3. Make release correlation usable. service.version is emitted (send_otlp_span.cjs:360) but find_releases is empty — register releases or fix the service.version→Sentry-release mapping so regressions can be compared across versions.
  4. Close the truncation/observability blind spot. gen_ai.response.finish_reasons is emitted (send_otlp_span.cjs:2146) yet not queryable here, and errors/logs datasets are empty — verify ingestion/indexing so runaway-token and export failures are detectable rather than inconclusive.
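For recommendation 2, one possible roll-up policy is "a span is in error if any descendant is in error". The sketch below shows that policy on a plain dictionary tree; it is not the gateway's implementation, and the span names and data shapes are assumptions for illustration.

```python
def roll_up_status(status, children, root):
    """Return a copy of `status` where every span with an errored descendant is 'error'.

    status:   dict mapping span name -> 'ok' or 'error'
    children: dict mapping span name -> list of child span names
    root:     name of the span to start from
    """
    result = dict(status)  # leave the caller's mapping untouched

    def visit(span):
        child_error = False
        # Visit every child (no short-circuit) so all subtrees are rolled up.
        for child in children.get(span, []):
            if visit(child) == "error":
                child_error = True
        if child_error:
            result[span] = "error"
        return result[span]

    visit(root)
    return result


if __name__ == "__main__":
    # Hypothetical tree shaped like the 60 s timeout trace described above.
    status = {"gateway.request": "ok", "http.server": "ok", "default": "error", "gen_ai": "ok"}
    children = {"gateway.request": ["http.server", "gen_ai"], "http.server": ["default"]}
    print(roll_up_status(status, children, "gateway.request"))
```

Whether every child error should fail the parent is a design decision; a handled or retried child error may reasonably leave the parent ok, so a real implementation would likely need an explicit opt-out.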

Notes

View notes — missing telemetry, ambiguous fields, tool limits
  • MCP build limitations: search_events and get_trace_details are not available; used list_events + client-side aggregation. list_events caps results (~100/query) and renders a fixed field set, so counts are from sampled queries and per-workflow-name attribution was not possible from the query side.
  • errors dataset: No results (24h) — no Sentry Issues/error events ingested for gh-aw. Reliability signal here rests entirely on the spans dataset.
  • logs dataset: No results (24h).
  • Attribute presence (verified via has: on spans): span.status ✅ · gh-aw.workflow.name ✅ (note: gh_aw.workflow_name ❌ — naming mismatch) · gh-aw.run.status ✅ · release ✅ (but no registered Releases) · service.version not queryable as a span attr (resource→release) · gen_ai.response.finish_reasons ❌ via has: despite being emitted.
  • Truncation / runaway tokens: Inconclusive. gen_ai.response.finish_reasons:length returned no results, but the attribute is not queryable in this build, so neither presence nor absence of truncation is confirmed.
  • Cancellations: none explicitly observed; the 60 s pattern presents as error timeouts, not cancelled.
  • Long spans: gen_ai spans up to ~2,353 s (~39 min) are status ok and treated as normal agent execution, not failures.
  • Emit-side cross-checks against actions/setup/js/send_otlp_span.cjs: workflow id gh-aw.workflow.name (1297/2068), gh-aw.run.status (2076), OTLP status.code/message ERROR=2 (301–333/2049), finish-reasons (2145–2146), service.version resource attr (360).
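The gh_aw.workflow_name versus gh-aw.workflow.name mismatch noted above is the kind of error a punctuation-insensitive lookup can catch before a query silently returns nothing. A minimal sketch follows; the helper names `canonical` and `suggest_key` are hypothetical, and the list of available keys is taken from the attributes named in this report.

```python
import re


def canonical(key):
    """Collapse '-', '_' and '.' so differently punctuated spellings compare equal."""
    return re.sub(r"[-_.]", "", key).lower()


def suggest_key(wanted, available):
    """Return the first available key whose canonical form matches `wanted`, else None."""
    target = canonical(wanted)
    for key in available:
        if canonical(key) == target:
            return key
    return None


if __name__ == "__main__":
    available = ["gh-aw.workflow.name", "gh-aw.run.status", "span.status"]
    print(suggest_key("gh_aw.workflow_name", available))
    print(suggest_key("gh_aw.unknown", available))
```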

References: §28481991416

Generated by 🚨 Daily Reliability Review · 180.9 AIC · ⌖ 43 AIC · ⊞ 5.5K

  • expires on Jul 2, 2026, 3:27 PM UTC-08:00
