[TT-15954] Make org session expiry check non-blocking #7480

imogenkraak · 2025-10-23T10:43:14Z

Description

This change prevents the gateway from blocking on RPC calls to MDCB during org session expiry checks.

Added emergency mode check for fast-path when MDCB is down
Implemented background refresh pattern

Related Issue

TT-15954

Motivation and Context

How This Has Been Tested

Screenshots (if appropriate)

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Refactoring or add test (improvements in base code or adds test coverage to functionality)

Checklist

I ensured that the documentation is up to date
I explained why this PR updates go.mod in detail with reasoning why it's required
I would like a code coverage CI quality gate exception and have explained why

Ticket Details

TT-15954


Status	In Code Review
Summary	Request pipeline blocked by synchronous RPC calls every 10 minutes when MDCB unavailable

Generated at: 2025-10-27 12:01:45

probelabs · 2025-10-23T10:47:26Z

🔍 Code Analysis Results

This PR introduces a non-blocking mechanism for checking organization session expiry, preventing the gateway from stalling on RPC calls to the Multi-Data Center Bridge (MDCB). This is achieved by implementing a background refresh pattern and adding a fast-path check for emergency mode. On a cache miss for an organization's session expiry, a default value is returned immediately, while a background goroutine fetches the actual value from MDCB to populate the cache for subsequent requests.

Files Changed Analysis

The changes are localized to two files:

gateway/middleware.go: Contains the core logic modification. The OrgSessionExpiry function is refactored to be non-blocking. It now checks for emergency mode, then for a cached value. On a cache miss, it returns a default value and triggers the new refreshOrgSessionExpiry function in a goroutine.
gateway/middleware_test.go: The test suite for OrgSessionExpiry has been significantly expanded to cover the new asynchronous behavior. New test cases validate the logic for cache hits, cache misses, successful background refreshes, handling of non-existent org sessions, and the emergency mode bypass.

The number of additions is substantially higher than deletions, primarily due to the comprehensive new test cases required to validate the asynchronous logic.

Architecture & Impact Assessment

What this PR accomplishes

This PR enhances the gateway's resilience and performance by decoupling the request-processing path from the latency of synchronous RPC calls to MDCB. By making the organization session expiry check non-blocking, it ensures that slow or unavailable MDCB services do not degrade the gateway's ability to handle traffic.

Key technical changes introduced

Non-blocking Cache Refresh: Implements a "stale-while-revalidate" pattern. On a cache miss, a default value is served, and the cache is updated asynchronously.
Background Goroutine: A new function, refreshOrgSessionExpiry, is executed in a background goroutine to handle the RPC call, preventing the main request thread from blocking.
Emergency Mode Check: An rpc.IsEmergencyMode() check is added to immediately return a default value, providing a fast-path failure mode when MDCB is known to be down.
Single-flight Execution: The background refresh continues to use a singleflight.Group (orgSessionExpiryCache) to ensure that only one refresh operation is triggered for a given organization ID, even if multiple requests trigger a cache miss concurrently.

Affected system components

Gateway: The primary component affected. Its request processing becomes more resilient to backend latency.
MDCB: The interaction pattern changes from synchronous to asynchronous from the gateway's perspective. The gateway's dependency on its immediate availability is reduced.
Analytics: The accuracy of the expiry timestamp for analytics records may be temporarily affected. For requests involving an uncached organization, the initial analytics record will use a default expiry time until the background refresh completes.

Flow Diagram

sequenceDiagram
    participant Requester
    participant Gateway
    participant Cache
    participant MDCB (RPC)

    Requester->>Gateway: API Request
    Gateway->>Gateway: Analytics Handler needs Org Expiry
    alt Emergency Mode Active
        Gateway-->>Gateway: Return Default Expiry
    else Normal Mode
        Gateway->>Cache: Get expiry for org
        alt Cache Hit
            Cache-->>Gateway: Return cached expiry
        else Cache Miss
            Cache-->>Gateway: Not found
            Gateway-->>Gateway: Return Default Expiry (non-blocking)
            par Background Refresh
                Gateway->>Gateway: go refreshOrgSessionExpiry()
                Gateway->>MDCB (RPC): Get org session
                MDCB (RPC)-->>Gateway: Return session data
                Gateway->>Cache: Set new expiry from data
            end
        end
    end
    Gateway->>Requester: API Response

Scope Discovery & Context Expansion

The function OrgSessionExpiry is primarily called by the analytics subsystem within SuccessHandler.RecordHit and ErrorHandler.HandleError. Its purpose is to determine the Time-To-Live (TTL) for analytics records. This change means that when an organization's expiry is not cached, the analytics record will initially be created with a default TTL. The correct TTL will be fetched and cached in the background for subsequent requests. This is a reasonable trade-off, prioritizing low request latency over the immediate accuracy of an analytics record's TTL, which is not critical to the request-response cycle.

The continued use of singleflight.Group within refreshOrgSessionExpiry is a crucial detail, as it prevents a "thundering herd" of RPC calls when many concurrent requests for the same uncached organization arrive.

Metadata

Review Effort: 3 / 5
Primary Label: bug

Powered by Visor from Probelabs

Last updated: 2025-10-27T12:07:06.047Z | Triggered by: synchronize | Commit: e9d61cf

💡 TIP: You can chat with Visor using /visor ask <your question>

probelabs · 2025-10-23T10:47:27Z

🔍 Code Analysis Results

Security Issues (2)

Severity	Location	Issue
🟠 Error	`gateway/middleware.go:389-391`	The change from a blocking RPC call to a non-blocking background refresh introduces a "fail-open" behavior. On a cache miss for an organization's session expiry, the gateway now immediately returns `DEFAULT_ORG_SESSION_EXPIRATION` and attempts to refresh the value in the background. If the authoritative backend (MDCB) is unavailable, an organization whose access should be revoked could be granted a new session lifetime, allowing continued access. 💡 Suggestion To mitigate the risk of unauthorized access, consider making the fail-open behavior configurable with a secure-by-default setting (fail-closed), or fall back to a much shorter grace period (e.g., 1-5 minutes) instead of the long default expiration. This limits the window of potential unauthorized access while still preventing long-term blocking of requests.
🟡 Warning	`gateway/middleware.go:389`	The line `go t.refreshOrgSessionExpiry(orgid)` spawns a new goroutine for every incoming request that results in a cache miss. While the `singleflight.Group` used within the function prevents redundant backend calls for the same organization ID, it does not limit the creation of goroutines. A malicious actor could exploit this by sending a high volume of requests for many different, uncached organization IDs, potentially leading to resource exhaustion and a denial-of-service condition. 💡 Suggestion To prevent unbounded resource consumption, implement a concurrency-limiting pattern such as a worker pool. Instead of spawning a goroutine directly for each request, push refresh jobs for unique organization IDs onto a buffered channel. A fixed-size pool of worker goroutines can then process these jobs, capping the number of concurrent refreshes and preventing resource exhaustion.

Architecture Issues (1)

Severity	Location	Issue
🟡 Warning	`gateway/middleware.go:375`	The `OrgSessionExpiry` function launches a background goroutine on a cache miss, which is a side effect not obvious from its signature. This asynchronous behavior should be documented to improve code clarity and maintainability for future developers. 💡 Suggestion Add a comment to the `OrgSessionExpiry` function explaining its non-blocking, asynchronous nature. Clarify that it returns a default value on cache miss and triggers a background refresh. Example: // OrgSessionExpiry returns the expiry for a given orgID. // It returns a cached value if available. On a cache miss, it returns a default // value immediately and triggers a background refresh to fetch the latest value. // This ensures the request path is not blocked by RPC calls. func (t *BaseMiddleware) OrgSessionExpiry(orgid string) int64 {

Performance Issues (1)

Severity	Location	Issue
🟡 Warning	`gateway/middleware.go:389`	The `OrgSessionExpiry` function spawns a new goroutine via `go t.refreshOrgSessionExpiry(orgid)` for every cache miss. In a scenario with a large number of distinct organizations and a cold cache (e.g., after a gateway restart), a sudden influx of requests could trigger a 'thundering herd' of goroutines. This can lead to excessive resource consumption (memory, CPU) and could overwhelm the downstream RPC service. While the `singleflight` mechanism within `refreshOrgSessionExpiry` prevents multiple concurrent refreshes for the same `orgid`, it does not limit concurrent refreshes for different `orgid`s. 💡 Suggestion To mitigate this risk, implement a worker pool or use a semaphore to limit the number of concurrent background refresh tasks. This will control the concurrency of the RPC calls and prevent resource exhaustion. For example, a semaphore could be used to cap the number of active refresh goroutines at a reasonable, configurable limit.

Quality Issues (2)

Severity	Location	Issue
🟡 Warning	`gateway/middleware.go:404`	The error check `if err != nil` is unreachable because the function passed to `orgSessionExpiryCache.Do` always returns a `nil` error. This constitutes dead code and can be misleading. 💡 Suggestion Remove the `if err != nil` block as it will never be executed.
🟡 Warning	`gateway/middleware.go:385`	Spawning a new goroutine for each cache miss via `go t.refreshOrgSessionExpiry(orgid)` can lead to a large number of concurrent goroutines under high load for many different uncached organizations. This could cause resource contention or a thundering herd problem against the downstream service. 💡 Suggestion Consider using a worker pool or another concurrency-limiting pattern to manage the number of concurrent background refreshes.

✅ Dependency Check Passed

No dependency issues found – changes LGTM.

Connectivity Issues (2)

Severity	Location	Issue
🟠 Error	`gateway/middleware.go:394`	The `refreshOrgSessionExpiry` function is invoked in a new goroutine without a panic recovery mechanism. An unhandled panic within this goroutine, which contains an RPC call (`t.OrgSession`), will terminate the entire gateway process. This introduces a significant reliability risk, as a transient issue during the background RPC call could cause a service outage. 💡 Suggestion Add a `defer` block with `recover()` at the beginning of the `refreshOrgSessionExpiry` function. This will catch any potential panics within the goroutine, log the error, and prevent the gateway from crashing, thereby isolating the background task from the main application's stability.
🟡 Warning	`gateway/middleware.go:388`	The `OrgSessionExpiry` function spawns a new goroutine for every cache miss via `go t.refreshOrgSessionExpiry(orgid)`. While `singleflight` protects against multiple concurrent refreshes for the same organization ID, a large number of requests for many different organization IDs with expired caches could lead to an unbounded number of goroutines being created simultaneously. This could cause a 'thundering herd' of RPC calls to the downstream MDCB service, potentially overwhelming it, and could also lead to excessive resource consumption (memory, CPU) on the gateway itself under high load. 💡 Suggestion To mitigate this risk, use a worker pool to limit the number of concurrent background refresh operations. A worker pool with a configurable size would control the concurrency of the outgoing RPC calls, protecting both the gateway and the MDCB service from load spikes. A similar pattern is already used for analytics processing in the gateway and could be adapted here.

Powered by Visor from Probelabs

Last updated: 2025-10-27T12:07:07.477Z | Triggered by: synchronize | Commit: e9d61cf

💡 TIP: You can chat with Visor using /visor ask <your question>

github-actions · 2025-10-24T12:37:16Z

API Changes

no api changes detected

sonarqubecloud · 2025-10-27T12:12:40Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
90.3% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

github-actions bot added enhancement review/effort:3 labels Oct 23, 2025

Add refresh pattern and emergency mode awareness.

23024cd

imogenkraak force-pushed the TT-15954 branch from 7c9da48 to 23024cd Compare October 23, 2025 10:48

Address probe comments

b403371

imogenkraak force-pushed the TT-15954 branch from 97939c6 to b403371 Compare October 24, 2025 08:47

imogenkraak marked this pull request as ready for review October 24, 2025 10:48

Improve logging

6093d4c

imogenkraak added 3 commits October 27, 2025 08:29

Add error check back to fix lint warning

239f217

Add check

4f3af28

Remove duplicate check

e9d61cf

imogenkraak requested a review from andrei-tyk October 27, 2025 12:21

andrei-tyk approved these changes Oct 27, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[TT-15954] Make org session expiry check non-blocking #7480

[TT-15954] Make org session expiry check non-blocking #7480

Uh oh!

imogenkraak commented Oct 23, 2025 •

edited by github-actions bot

Loading

probelabs bot commented Oct 23, 2025 •

edited

Loading

probelabs bot commented Oct 23, 2025 •

edited

Loading

github-actions bot commented Oct 24, 2025

sonarqubecloud bot commented Oct 27, 2025

Labels

3 participants

[TT-15954] Make org session expiry check non-blocking #7480

Are you sure you want to change the base?

[TT-15954] Make org session expiry check non-blocking #7480

Uh oh!

Conversation

imogenkraak commented Oct 23, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Related Issue

Motivation and Context

How This Has Been Tested

Screenshots (if appropriate)

Types of changes

Checklist

Ticket Details

probelabs bot commented Oct 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔍 Code Analysis Results

Files Changed Analysis

Architecture & Impact Assessment

What this PR accomplishes

Key technical changes introduced

Affected system components

Flow Diagram

Scope Discovery & Context Expansion

probelabs bot commented Oct 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔍 Code Analysis Results

Security Issues (2)

Architecture Issues (1)

Performance Issues (1)

Quality Issues (2)

✅ Dependency Check Passed

Connectivity Issues (2)

github-actions bot commented Oct 24, 2025

sonarqubecloud bot commented Oct 27, 2025

Quality Gate passed

Labels

3 participants

imogenkraak commented Oct 23, 2025 •

edited by github-actions bot

Loading

probelabs bot commented Oct 23, 2025 •

edited

Loading

probelabs bot commented Oct 23, 2025 •

edited

Loading