Description
Context
AI providers periodically experience elevated error rates and partial outages, as reflected on their public status pages.
During these periods, applications relying heavily on AI can experience repeated timeouts and degraded performance.
The current failover implementation in the AI SDK is well designed.
It correctly:
- Attempts providers in order
- Fails over only for FailoverableException
- Emits failover events
This works well for handling individual request failures.
However, during sustained provider instability such as timeouts, 5xx errors, overloads, or rate limits, every request still begins with the primary provider. Even if the provider is clearly unhealthy, each request waits for failure before falling back.
For AI-intensive applications, especially those relying on real-time AI responses, this can lead to:
- Repeated timeouts across requests
- Slower user-facing responses
- Queue congestion under load
- Increased pressure on already unstable providers
The current design is reactive per request and does not track provider health across requests.
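To make the cost concrete, here is a rough, language-agnostic sketch in Python of per-request failover as described above. It is not the SDK's code: `FailoverableError`, `call_with_failover`, and the two fake providers are hypothetical stand-ins used only for illustration.

```python
class FailoverableError(Exception):
    """Illustrative stand-in for the SDK's FailoverableException."""


def call_with_failover(providers, prompt, on_failover=None):
    """Try each provider in order; move on only for failoverable errors."""
    last_error = None
    for name, call in providers:
        try:
            return name, call(prompt)
        except FailoverableError as error:
            last_error = error
            if on_failover is not None:
                on_failover(name, error)  # emit a failover event
    raise RuntimeError("all providers failed") from last_error


# Hypothetical providers: the primary always times out, the secondary works.
attempts = {"primary": 0, "secondary": 0}


def primary(prompt):
    attempts["primary"] += 1
    raise FailoverableError("timeout")


def secondary(prompt):
    attempts["secondary"] += 1
    return f"echo: {prompt}"


providers = [("primary", primary), ("secondary", secondary)]

# Five independent requests: no state is shared between them.
results = [call_with_failover(providers, f"request {i}") for i in range(5)]
```

In this sketch every one of the five requests is answered by the secondary, yet the failing primary is still attempted five times. With a real provider, each of those attempts would cost a full timeout before the fallback starts.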
Suggested Improvement
Introduce an optional circuit breaker layer that tracks provider failures over time and temporarily skips providers during sustained instability.
When enabled, the SDK could:
- Track failures in a rolling window
- Mark a provider as temporarily unhealthy after N failures
- Skip unhealthy providers immediately for fast failover
- Periodically allow probe calls to detect recovery
- Automatically restore providers when healthy
This would be:
- Fully opt-in
- Disabled by default
- Minimal in scope
- Backed by Laravel Cache
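As a starting point for discussion, the behaviour listed above could look roughly like the following Python sketch. Everything here is an assumption for illustration: the `CircuitBreaker` class, its parameter names (`threshold`, `window`, `cooldown`, `enabled`), and the plain dict standing in for a cache store such as Laravel Cache. It does not describe an existing SDK API.

```python
import time


class CircuitBreaker:
    """Hypothetical opt-in breaker that tracks provider health across requests."""

    def __init__(self, store=None, threshold=3, window=60.0, cooldown=30.0,
                 clock=time.monotonic, enabled=False):
        self.store = {} if store is None else store  # stand-in for a shared cache
        self.threshold = threshold  # failures within the window that open the circuit
        self.window = window        # rolling window length, in seconds
        self.cooldown = cooldown    # wait before a probe call is allowed
        self.clock = clock
        self.enabled = enabled      # disabled by default

    def allow(self, provider):
        """Return True if the provider should be attempted for this request."""
        if not self.enabled:
            return True
        state = self.store.get(provider)
        if state is None or state["opened_at"] is None:
            return True
        now = self.clock()
        if now - state["opened_at"] >= self.cooldown:
            # Let one probe through and restart the cooldown for everyone else.
            state["opened_at"] = now
            self.store[provider] = state
            return True
        return False

    def record_failure(self, provider):
        if not self.enabled:
            return
        now = self.clock()
        state = self.store.get(provider) or {"failures": [], "opened_at": None}
        # Keep only failures that are still inside the rolling window.
        recent = [t for t in state["failures"] if now - t < self.window]
        state["failures"] = recent + [now]
        if len(state["failures"]) >= self.threshold:
            state["opened_at"] = now  # mark the provider as temporarily unhealthy
        self.store[provider] = state

    def record_success(self, provider):
        # A successful call (including a probe) restores the provider.
        self.store.pop(provider, None)


# Usage with a fake clock so the timeline is deterministic.
now = [0.0]
breaker = CircuitBreaker(clock=lambda: now[0], enabled=True)
for _ in range(3):
    breaker.record_failure("primary")
skipped = breaker.allow("primary")       # circuit is open: skip immediately
now[0] = 31.0
probe = breaker.allow("primary")         # cooldown elapsed: one probe allowed
second = breaker.allow("primary")        # other requests still skip
breaker.record_success("primary")        # probe succeeded
restored = breaker.allow("primary")      # provider is back in rotation
```

The failover loop would call `allow()` before each provider attempt and `record_failure()` or `record_success()` afterwards. A real implementation on a shared cache would also need atomic updates so that concurrent workers do not all send probe calls at once; this sketch ignores that.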
Fallback protects a single request.
A circuit breaker protects the system.
Would this be worth discussing as an optional resilience enhancement for production AI workloads?