Skip to content

Conversation

@imogenkraak
Copy link
Contributor

@imogenkraak imogenkraak commented Oct 23, 2025

Description

This change prevents the gateway from blocking on RPC calls to MDCB during org session expiry checks.

  • Added emergency mode check for fast-path when MDCB is down
  • Implemented background refresh pattern

Related Issue

TT-15954

Motivation and Context

How This Has Been Tested

Screenshots (if appropriate)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Refactoring or add test (improvements in base code or adds test coverage to functionality)

Checklist

  • I ensured that the documentation is up to date
  • I explained why this PR updates go.mod in detail with reasoning why it's required
  • I would like a code coverage CI quality gate exception and have explained why

Ticket Details

TT-15954
Status In Code Review
Summary Request pipeline blocked by synchronous RPC calls every 10 minutes when MDCB unavailable

Generated at: 2025-10-27 12:01:45

@probelabs
Copy link

probelabs bot commented Oct 23, 2025

🔍 Code Analysis Results

This PR introduces a non-blocking mechanism for checking organization session expiry, preventing the gateway from stalling on RPC calls to the Multi-Data Center Bridge (MDCB). This is achieved by implementing a background refresh pattern and adding a fast-path check for emergency mode. On a cache miss for an organization's session expiry, a default value is returned immediately, while a background goroutine fetches the actual value from MDCB to populate the cache for subsequent requests.

Files Changed Analysis

The changes are localized to two files:

  • gateway/middleware.go: Contains the core logic modification. The OrgSessionExpiry function is refactored to be non-blocking. It now checks for emergency mode, then for a cached value. On a cache miss, it returns a default value and triggers the new refreshOrgSessionExpiry function in a goroutine.
  • gateway/middleware_test.go: The test suite for OrgSessionExpiry has been significantly expanded to cover the new asynchronous behavior. New test cases validate the logic for cache hits, cache misses, successful background refreshes, handling of non-existent org sessions, and the emergency mode bypass.

The number of additions is substantially higher than deletions, primarily due to the comprehensive new test cases required to validate the asynchronous logic.

Architecture & Impact Assessment

What this PR accomplishes

This PR enhances the gateway's resilience and performance by decoupling the request-processing path from the latency of synchronous RPC calls to MDCB. By making the organization session expiry check non-blocking, it ensures that slow or unavailable MDCB services do not degrade the gateway's ability to handle traffic.

Key technical changes introduced

  • Non-blocking Cache Refresh: Implements a "stale-while-revalidate" pattern. On a cache miss, a default value is served, and the cache is updated asynchronously.
  • Background Goroutine: A new function, refreshOrgSessionExpiry, is executed in a background goroutine to handle the RPC call, preventing the main request thread from blocking.
  • Emergency Mode Check: An rpc.IsEmergencyMode() check is added to immediately return a default value, providing a fast-path failure mode when MDCB is known to be down.
  • Single-flight Execution: The background refresh continues to use a singleflight.Group (orgSessionExpiryCache) to ensure that only one refresh operation is triggered for a given organization ID, even if multiple requests trigger a cache miss concurrently.

Affected system components

  • Gateway: The primary component affected. Its request processing becomes more resilient to backend latency.
  • MDCB: The interaction pattern changes from synchronous to asynchronous from the gateway's perspective. The gateway's dependency on its immediate availability is reduced.
  • Analytics: The accuracy of the expiry timestamp for analytics records may be temporarily affected. For requests involving an uncached organization, the initial analytics record will use a default expiry time until the background refresh completes.

Flow Diagram

sequenceDiagram
    participant Requester
    participant Gateway
    participant Cache
    participant MDCB (RPC)

    Requester->>Gateway: API Request
    Gateway->>Gateway: Analytics Handler needs Org Expiry
    alt Emergency Mode Active
        Gateway-->>Gateway: Return Default Expiry
    else Normal Mode
        Gateway->>Cache: Get expiry for org
        alt Cache Hit
            Cache-->>Gateway: Return cached expiry
        else Cache Miss
            Cache-->>Gateway: Not found
            Gateway-->>Gateway: Return Default Expiry (non-blocking)
            par Background Refresh
                Gateway->>Gateway: go refreshOrgSessionExpiry()
                Gateway->>MDCB (RPC): Get org session
                MDCB (RPC)-->>Gateway: Return session data
                Gateway->>Cache: Set new expiry from data
            end
        end
    end
    Gateway->>Requester: API Response
Loading

Scope Discovery & Context Expansion

The function OrgSessionExpiry is primarily called by the analytics subsystem within SuccessHandler.RecordHit and ErrorHandler.HandleError. Its purpose is to determine the Time-To-Live (TTL) for analytics records. This change means that when an organization's expiry is not cached, the analytics record will initially be created with a default TTL. The correct TTL will be fetched and cached in the background for subsequent requests. This is a reasonable trade-off, prioritizing low request latency over the immediate accuracy of an analytics record's TTL, which is not critical to the request-response cycle.

The continued use of singleflight.Group within refreshOrgSessionExpiry is a crucial detail, as it prevents a "thundering herd" of RPC calls when many concurrent requests for the same uncached organization arrive.

Metadata
  • Review Effort: 3 / 5
  • Primary Label: bug

Powered by Visor from Probelabs

Last updated: 2025-10-27T12:07:06.047Z | Triggered by: synchronize | Commit: e9d61cf

💡 TIP: You can chat with Visor using /visor ask <your question>

@probelabs
Copy link

probelabs bot commented Oct 23, 2025

🔍 Code Analysis Results

Security Issues (2)

Severity Location Issue
🟠 Error gateway/middleware.go:389-391
The change from a blocking RPC call to a non-blocking background refresh introduces a "fail-open" behavior. On a cache miss for an organization's session expiry, the gateway now immediately returns `DEFAULT_ORG_SESSION_EXPIRATION` and attempts to refresh the value in the background. If the authoritative backend (MDCB) is unavailable, an organization whose access should be revoked could be granted a new session lifetime, allowing continued access.
💡 SuggestionTo mitigate the risk of unauthorized access, consider making the fail-open behavior configurable with a secure-by-default setting (fail-closed), or fall back to a much shorter grace period (e.g., 1-5 minutes) instead of the long default expiration. This limits the window of potential unauthorized access while still preventing long-term blocking of requests.
🟡 Warning gateway/middleware.go:389
The line `go t.refreshOrgSessionExpiry(orgid)` spawns a new goroutine for every incoming request that results in a cache miss. While the `singleflight.Group` used within the function prevents redundant backend calls for the *same* organization ID, it does not limit the creation of goroutines. A malicious actor could exploit this by sending a high volume of requests for many *different*, uncached organization IDs, potentially leading to resource exhaustion and a denial-of-service condition.
💡 SuggestionTo prevent unbounded resource consumption, implement a concurrency-limiting pattern such as a worker pool. Instead of spawning a goroutine directly for each request, push refresh jobs for unique organization IDs onto a buffered channel. A fixed-size pool of worker goroutines can then process these jobs, capping the number of concurrent refreshes and preventing resource exhaustion.

Architecture Issues (1)

Severity Location Issue
🟡 Warning gateway/middleware.go:375
The `OrgSessionExpiry` function launches a background goroutine on a cache miss, which is a side effect not obvious from its signature. This asynchronous behavior should be documented to improve code clarity and maintainability for future developers.
💡 SuggestionAdd a comment to the `OrgSessionExpiry` function explaining its non-blocking, asynchronous nature. Clarify that it returns a default value on cache miss and triggers a background refresh.

Example:

// OrgSessionExpiry returns the expiry for a given orgID.
// It returns a cached value if available. On a cache miss, it returns a default
// value immediately and triggers a background refresh to fetch the latest value.
// This ensures the request path is not blocked by RPC calls.
func (t *BaseMiddleware) OrgSessionExpiry(orgid string) int64 {

Performance Issues (1)

Severity Location Issue
🟡 Warning gateway/middleware.go:389
The `OrgSessionExpiry` function spawns a new goroutine via `go t.refreshOrgSessionExpiry(orgid)` for every cache miss. In a scenario with a large number of distinct organizations and a cold cache (e.g., after a gateway restart), a sudden influx of requests could trigger a 'thundering herd' of goroutines. This can lead to excessive resource consumption (memory, CPU) and could overwhelm the downstream RPC service. While the `singleflight` mechanism within `refreshOrgSessionExpiry` prevents multiple concurrent refreshes for the *same* `orgid`, it does not limit concurrent refreshes for *different* `orgid`s.
💡 SuggestionTo mitigate this risk, implement a worker pool or use a semaphore to limit the number of concurrent background refresh tasks. This will control the concurrency of the RPC calls and prevent resource exhaustion. For example, a semaphore could be used to cap the number of active refresh goroutines at a reasonable, configurable limit.

Quality Issues (2)

Severity Location Issue
🟡 Warning gateway/middleware.go:404
The error check `if err != nil` is unreachable because the function passed to `orgSessionExpiryCache.Do` always returns a `nil` error. This constitutes dead code and can be misleading.
💡 SuggestionRemove the `if err != nil` block as it will never be executed.
🟡 Warning gateway/middleware.go:385
Spawning a new goroutine for each cache miss via `go t.refreshOrgSessionExpiry(orgid)` can lead to a large number of concurrent goroutines under high load for many different uncached organizations. This could cause resource contention or a thundering herd problem against the downstream service.
💡 SuggestionConsider using a worker pool or another concurrency-limiting pattern to manage the number of concurrent background refreshes.

✅ Dependency Check Passed

No dependency issues found – changes LGTM.

Connectivity Issues (2)

Severity Location Issue
🟠 Error gateway/middleware.go:394
The `refreshOrgSessionExpiry` function is invoked in a new goroutine without a panic recovery mechanism. An unhandled panic within this goroutine, which contains an RPC call (`t.OrgSession`), will terminate the entire gateway process. This introduces a significant reliability risk, as a transient issue during the background RPC call could cause a service outage.
💡 SuggestionAdd a `defer` block with `recover()` at the beginning of the `refreshOrgSessionExpiry` function. This will catch any potential panics within the goroutine, log the error, and prevent the gateway from crashing, thereby isolating the background task from the main application's stability.
🟡 Warning gateway/middleware.go:388
The `OrgSessionExpiry` function spawns a new goroutine for every cache miss via `go t.refreshOrgSessionExpiry(orgid)`. While `singleflight` protects against multiple concurrent refreshes for the same organization ID, a large number of requests for many different organization IDs with expired caches could lead to an unbounded number of goroutines being created simultaneously. This could cause a 'thundering herd' of RPC calls to the downstream MDCB service, potentially overwhelming it, and could also lead to excessive resource consumption (memory, CPU) on the gateway itself under high load.
💡 SuggestionTo mitigate this risk, use a worker pool to limit the number of concurrent background refresh operations. A worker pool with a configurable size would control the concurrency of the outgoing RPC calls, protecting both the gateway and the MDCB service from load spikes. A similar pattern is already used for analytics processing in the gateway and could be adapted here.

Powered by Visor from Probelabs

Last updated: 2025-10-27T12:07:07.477Z | Triggered by: synchronize | Commit: e9d61cf

💡 TIP: You can chat with Visor using /visor ask <your question>

@imogenkraak imogenkraak marked this pull request as ready for review October 24, 2025 10:48
@github-actions
Copy link
Contributor

API Changes

no api changes detected
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment