[reliability] Daily Reliability Review - 2026-06-30 #42586

Executive Summary

Reliability triage for github/gh-aw over the last 24h (Sentry org github, project gh-aw, run §28481991416).

Telemetry is flowing — the spans dataset has fresh data through 2026-06-30T21:56Z. Health is degraded, not down: most spans are ok, and long-running agent spans (up to ~39 min, status ok) are normal. Failures concentrate in three recurring classes: (1) a failed agent invocation with repeated model-call errors, (2) a recurring 60-second timeout producing error spans across many runs, and (3) short-lived gateway request errors.

Two correctness caveats shape this report:

The Sentry MCP build exposed here has no search_events and no get_trace_details — all queries used list_events (Sentry query syntax) with client-side aggregation.
list_events renders only a fixed field subset and does not surface custom attributes, so per-workflow-name attribution was not possible from the query side. Findings are reported at transaction granularity (invoke_agent, gateway.request).

The companion errors and logs datasets both returned No results (24h); see Notes. Treat this as an observability gap, not proof of zero errors.
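The client-side aggregation mentioned above can be sketched roughly as follows. This is a minimal illustration, not the tooling used for this report: the row shape (`trace`, `transaction`, `op`, `status`, `durationMs`) and the function name `aggregateSpans` are assumptions, and the sample rows are made up for the example.

```javascript
// Hypothetical row shape: { trace, transaction, op, status, durationMs }.
// This is an assumed structure, not the actual list_events output format.
function aggregateSpans(rows) {
  const groups = new Map();
  for (const row of rows) {
    const key = `${row.transaction}|${row.op}|${row.status}`;
    let group = groups.get(key);
    if (!group) {
      group = {
        transaction: row.transaction,
        op: row.op,
        status: row.status,
        count: 0,
        maxDurationMs: 0,
        traces: new Set(),
      };
      groups.set(key, group);
    }
    group.count += 1;
    group.maxDurationMs = Math.max(group.maxDurationMs, row.durationMs);
    group.traces.add(row.trace);
  }
  // Replace the trace set with its size and sort the largest groups first.
  return [...groups.values()]
    .map((g) => ({ ...g, traces: g.traces.size }))
    .sort((a, b) => b.count - a.count);
}

// Illustrative rows only; these are not rows from the report.
const sampleRows = [
  { trace: "t1", transaction: "gateway.request", op: "default", status: "error", durationMs: 59990 },
  { trace: "t1", transaction: "gateway.request", op: "default", status: "error", durationMs: 59996 },
  { trace: "t2", transaction: "gateway.request", op: "default", status: "error", durationMs: 7 },
  { trace: "t3", transaction: "invoke_agent", op: "gen_ai", status: "ok", durationMs: 23425 },
];

const summary = aggregateSpans(sampleRows);
console.log(summary);
```

Because list_events caps results per query, counts produced this way describe the sampled rows only, not the full 24h population.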

Top Reliability Findings

| Priority | Workflow / Scope | Problem | Evidence | Next Action |
| --- | --- | --- | --- | --- |
| P1: broken user-visible behavior | `invoke_agent` (agent run) | Agent invocation failed after repeated model-call errors | Trace `7570a964...`: root `invoke_agent` `error`, dur 31,686 ms; 12 `gen_ai` child spans `error` clustered at 785–788 ms (max 788.55 ms) + some `ok`. 2nd trace `46769f6b...`: `invoke_agent` 77 ms `error` + 2 `gen_ai` `error` | Pull provider error/status on the 786 ms `gen_ai` spans; confirm whether retry/back-off or a 4xx/5xx hard-fail; verify `gh-aw.run.status` is ERROR for these runs |
| P2: timeouts | `gateway.request` → `POST /mcp/agenticworkflows` | Recurring 60-second timeout producing `error` spans | 23 `default`-op `error` spans pinned at ~59.99 s (max 59,996 ms) across ~13 distinct traces (`dff68ea...`×4, `34d005f6...`×3, `4072957004ae`×3, `d17d3c3c...`×2, `a52e1349...`×2, `a8526a4d...`×2, `b7ac182e...`×2, + singles). In `dff68ea...` parent `http.server gateway.request` reports `ok` at 60.0 s while child `default` spans `error` (status-propagation gap) | Locate the 60 s deadline on the agenticworkflows MCP path; raise/handle it and propagate child `error` to the parent span status |
| P3: transport errors | `gateway.request` (HTTP layer) | Short-lived gateway request failures | `default`-op `error` spans ~7 ms–0.9 s across many traces (`0717caf0...`, `8534bd3d...`, `6017ae24...`, `c2a81c7a...`, `85a44aec...`, `a070c915...`) | Classify by HTTP status / client-disconnect; low severity unless rate climbs |
| P4: instrumentation / correlation | project-wide | Release & truncation correlation not usable from query side | `find_releases` → No releases despite `service.version` being emitted (resource attr → Sentry `release`). `has:gen_ai.response.finish_reasons` → 0 though emitted on the conclusion span (`send_otlp_span.cjs:2146`). `gh_aw.workflow_name` absent; correct key is `gh-aw.workflow.name` (present) | Register releases / map `service.version`→release; verify finish-reasons indexing; report truncation as inconclusive until queryable |

Representative Traces

View representative traces (continuity verified)

P1 — failed agent invocation · 7570a964f4c8045176ccd886c805ef1a

Root invoke_agent error, 31,686 ms. Children: 12× gen_ai error @ ~785–788 ms (tight cluster ⇒ same repeated failure), plus gen_ai ok @ 23,425 ms / 7,508 ms. Parent→child lineage intact under one invoke_agent transaction.
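The "tight cluster" reasoning above can be made explicit with a small spread check. This is a sketch under assumptions: the helper names `durationSpread` and `isTightCluster`, the 5% threshold, and the sample durations (chosen to sit inside the reported 785–788 ms range) are all illustrative and are not the measured per-span values.

```javascript
// Summarize how tightly a set of span durations (in ms) is grouped.
function durationSpread(durationsMs) {
  if (durationsMs.length === 0) {
    throw new Error("durationSpread needs at least one duration");
  }
  const min = Math.min(...durationsMs);
  const max = Math.max(...durationsMs);
  const mean = durationsMs.reduce((sum, d) => sum + d, 0) / durationsMs.length;
  // Relative spread: total range as a fraction of the mean duration.
  return { min, max, mean, relativeSpread: (max - min) / mean };
}

// Hypothetical rule of thumb: a range within 5% of the mean counts as "tight".
function isTightCluster(durationsMs, threshold = 0.05) {
  return durationSpread(durationsMs).relativeSpread <= threshold;
}

// Illustrative values within the reported 785–788 ms band (not actual span data).
const clustered = [785.1, 786.0, 786.4, 787.2, 788.55];
// Illustrative mixed durations, as a contrast case.
const scattered = [7, 120, 900];

console.log(isTightCluster(clustered)); // true
console.log(isTightCluster(scattered)); // false
```

A tight cluster is only a hint that one failure mode is repeating; the provider status or message on the spans is still needed to confirm the cause.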

P2 — 60 s timeout · dff68ea53f6e74d5bea86470a277acf9

Transaction gateway.request → POST /mcp/agenticworkflows. 5× http.server @ ~60,013 ms (ok), 4× gen_ai @ ~60,002 ms (ok), 4× default @ ~59,993 ms (error), plus gh-aw.activation.setup @ 11,618 ms / 2,014 ms. Continuity intact; the timeout failure is visible only on the default children, not the parent.
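A status-propagation gap like the one in this trace could be detected mechanically once parent/child links are available. The sketch below assumes a span shape of `{ spanId, parentSpanId, op, status }` and a helper named `findStatusGaps`; both are hypothetical. The sample lineage is also assumed, since the report does not list which parent each `default` span hangs from.

```javascript
// Return a Map of parent spanId -> list of child spanIds, for every parent
// that reports "ok" while at least one direct child reports "error".
function findStatusGaps(spans) {
  const byId = new Map(spans.map((span) => [span.spanId, span]));
  const gaps = new Map();
  for (const span of spans) {
    if (span.status !== "error" || !span.parentSpanId) continue;
    const parent = byId.get(span.parentSpanId);
    if (parent && parent.status === "ok") {
      if (!gaps.has(parent.spanId)) gaps.set(parent.spanId, []);
      gaps.get(parent.spanId).push(span.spanId);
    }
  }
  return gaps;
}

// Illustrative spans with made-up ids and an assumed lineage.
const traceSpans = [
  { spanId: "http-1", parentSpanId: null, op: "http.server", status: "ok" },
  { spanId: "default-1", parentSpanId: "http-1", op: "default", status: "error" },
  { spanId: "default-2", parentSpanId: "http-1", op: "default", status: "error" },
  { spanId: "genai-1", parentSpanId: "http-1", op: "gen_ai", status: "ok" },
];

const gaps = findStatusGaps(traceSpans);
console.log(gaps);
```

This only inspects direct children; a deeper walk would be needed if the erroring span sits several levels below the parent that reports `ok`.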

P3 — gateway transport error · 0717caf0b117d7a7b142def7f78ab8f6

gateway.request default-op span error @ 6.7 ms — fast-fail at the HTTP layer.

Recommendations

Surface the agent-run failure cause (smallest first). Inspect the 786 ms gen_ai error spans in trace 7570a964... for the provider status/message; the uniform duration strongly implies one repeated error (rate-limit/4xx) rather than diverse failures. Confirm gh-aw.run.status=failure is recorded so these are dashboard-visible.
Fix the 60 s timeout + status propagation on the agenticworkflows MCP path. The deadline recurs across ~13 traces. Even if the limit stays, propagate the child default error to the parent gateway.request span; the parent is currently reported ok at 60.0 s.
Make release correlation usable. service.version is emitted (send_otlp_span.cjs:360) but find_releases is empty — register releases or fix the service.version→Sentry-release mapping so regressions can be compared across versions.
Close the truncation/observability blind spot. gen_ai.response.finish_reasons is emitted (send_otlp_span.cjs:2146) yet not queryable here, and errors/logs datasets are empty — verify ingestion/indexing so runaway-token and export failures are detectable rather than inconclusive.
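For the status-propagation recommendation, one possible shape of the fix is to roll a child error up to the parent before export. This is a sketch only, not the code in send_otlp_span.cjs: the function name `rollUpStatus` and the span object shape are assumptions. The numeric codes follow the OTLP status enum (UNSET=0, OK=1, ERROR=2).

```javascript
// OTLP span status codes.
const STATUS_CODE_UNSET = 0;
const STATUS_CODE_OK = 1;
const STATUS_CODE_ERROR = 2;

// Hypothetical helper: return a copy of the parent marked ERROR when any
// direct child is ERROR. A parent that is already ERROR, or that has no
// erroring children, is returned unchanged.
function rollUpStatus(parent, children) {
  const failedChild = children.find(
    (child) => child.status && child.status.code === STATUS_CODE_ERROR
  );
  const parentIsError = parent.status && parent.status.code === STATUS_CODE_ERROR;
  if (!failedChild || parentIsError) return parent;
  return {
    ...parent,
    status: {
      code: STATUS_CODE_ERROR,
      message: failedChild.status.message || "child span reported an error",
    },
  };
}

// Illustrative spans; names and messages are made up for the example.
const parentSpan = { name: "gateway.request", status: { code: STATUS_CODE_OK } };
const childSpans = [
  { name: "default", status: { code: STATUS_CODE_ERROR, message: "deadline exceeded" } },
  { name: "gen_ai", status: { code: STATUS_CODE_OK } },
];

const rolledUp = rollUpStatus(parentSpan, childSpans);
console.log(rolledUp.status);
```

Whether a parent should inherit a child error is a design choice rather than default tracing behavior, so it may be preferable to apply it only on the agenticworkflows MCP path where the 60 s timeout is observed.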

Notes

View notes — missing telemetry, ambiguous fields, tool limits

MCP build limitations: search_events and get_trace_details are not available; used list_events + client-side aggregation. list_events caps results (~100/query) and renders a fixed field set, so counts are from sampled queries and per-workflow-name attribution was not possible from the query side.
errors dataset: No results (24h) — no Sentry Issues/error events ingested for gh-aw. Reliability signal here rests entirely on the spans dataset.
logs dataset: No results (24h).
Attribute presence (verified via has: on spans): span.status ✅ · gh-aw.workflow.name ✅ (note: gh_aw.workflow_name ❌ — naming mismatch) · gh-aw.run.status ✅ · release ✅ (but no registered Releases) · service.version not queryable as a span attr (resource→release) · gen_ai.response.finish_reasons ❌ via has: despite being emitted.
Truncation / runaway tokens: Inconclusive — gen_ai.response.finish_reasons:length returned no results, but the attribute is not queryable in this build, so neither presence nor absence of truncation is confirmed.
Cancellations: none explicitly observed; the 60 s pattern presents as error timeouts, not cancelled.
Long spans: gen_ai spans up to ~2,353 s (~39 min) are status ok and treated as normal agent execution, not failures.
Emit-side cross-checks against actions/setup/js/send_otlp_span.cjs: workflow id gh-aw.workflow.name (1297/2068), gh-aw.run.status (2076), OTLP status.code/message ERROR=2 (301–333/2049), finish-reasons (2145–2146), service.version resource attr (360).

References: §28481991416

Generated by 🚨 Daily Reliability Review · 180.9 AIC · ⌖ 43 AIC · ⊞ 5.5K · ◷

expires on Jul 2, 2026, 3:27 PM UTC-08:00
