fix: treat HTTP 502/503/504 as failover-eligible (timeout reason) (#21017)

* fix: treat HTTP 502/503/504 as failover-eligible (timeout reason)

When a model API returns 502 Bad Gateway, 503 Service Unavailable, or
504 Gateway Timeout, the error object carries the status code directly.
resolveFailoverReasonFromError() only checked 402/429/401/403/408/400,
so 5xx server errors fell through to message-based classification which
requires the status code to appear at the start of the error message.

Many API SDKs (Google, Anthropic) set err.status = 503 without prefixing
the message with '503', so the message classifier never matched and
failover never triggered — the run retried the same broken model.

Add 502/503/504 to the status-code branch, returning 'timeout' (matching
the existing behavior of isTransientHttpError in the message classifier).

Fixes #20999

* Changelog: add failover 502/503/504 note with credits

* Failover: classify HTTP 504 as transient in message parser

* Changelog: credit taw0002 and vincentkoc for failover fix

---------

Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
This commit is contained in:
taw0002
2026-02-23 01:01:57 -07:00
committed by GitHub
parent 07edadfa8a
commit 3c57bf4c85
5 changed files with 7 additions and 3 deletions

View File

@@ -163,7 +163,7 @@ export function resolveFailoverReasonFromError(err: unknown): FailoverReason | n
if (status === 408) {
return "timeout";
}
if (status === 503) {
if (status === 502 || status === 503 || status === 504) {
return "timeout";
}
if (status === 400) {