fix: treat HTTP 502/503/504 as failover-eligible (timeout reason) (#21017)

* fix: treat HTTP 502/503/504 as failover-eligible (timeout reason)

When a model API returns 502 Bad Gateway, 503 Service Unavailable, or
504 Gateway Timeout, the error object carries the status code directly.
resolveFailoverReasonFromError() only checked 402/429/401/403/408/400,
so 5xx server errors fell through to message-based classification, which
requires the status code to appear at the start of the error message.

Many API SDKs (e.g. Google, Anthropic) set err.status = 503 without
prefixing the message with '503', so the message classifier never matched
and failover never triggered: the run retried the same broken model.

Add 502/503/504 to the status-code branch, returning 'timeout' (matching
the existing behavior of isTransientHttpError in the message classifier).
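
A minimal sketch of the status-code branch described above, assuming the error object exposes a numeric `status` property. The names `getStatus` and `reasonFromStatus` are hypothetical, and every reason label other than 'timeout' for 502/503/504 is an illustrative assumption, not the project's actual mapping.

```typescript
// Hypothetical reason labels. Only "timeout" for 502/503/504 is stated in
// this commit message; the other labels are assumptions for illustration.
type FailoverReason = "billing" | "rate_limit" | "auth" | "timeout" | "format";

// Reads a numeric `status` from an unknown error value, if one is present.
function getStatus(err: unknown): number | undefined {
  if (typeof err !== "object" || err === null) return undefined;
  const status = (err as { status?: unknown }).status;
  return typeof status === "number" ? status : undefined;
}

// Sketch of a status-code branch (not the project's actual implementation).
// Returns null when the status is missing or not failover-eligible, so a
// caller could fall back to message-based classification.
function reasonFromStatus(err: unknown): FailoverReason | null {
  switch (getStatus(err)) {
    case 402:
      return "billing"; // assumed label
    case 429:
      return "rate_limit"; // assumed label
    case 401:
    case 403:
      return "auth"; // assumed label
    case 400:
      return "format"; // assumed label
    case 408:
    case 502:
    case 503:
    case 504:
      return "timeout"; // 502/503/504 -> 'timeout' per this commit
    default:
      return null;
  }
}
```

With this shape, an SDK-style error such as `Object.assign(new Error("Service Unavailable"), { status: 503 })` resolves to 'timeout' even though its message does not start with '503'.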

Fixes #20999

* Changelog: add failover 502/503/504 note with credits

* Failover: classify HTTP 504 as transient in message parser

* Changelog: credit taw0002 and vincentkoc for failover fix

---------

Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
taw0002
2026-02-23 01:01:57 -07:00
committed by GitHub
parent 07edadfa8a
commit 3c57bf4c85
5 changed files with 7 additions and 3 deletions

@@ -270,12 +270,12 @@ describe("isTransientHttpError", () => {
expect(isTransientHttpError("500 Internal Server Error")).toBe(true);
expect(isTransientHttpError("502 Bad Gateway")).toBe(true);
expect(isTransientHttpError("503 Service Unavailable")).toBe(true);
+expect(isTransientHttpError("504 Gateway Timeout")).toBe(true);
expect(isTransientHttpError("521 <!DOCTYPE html><html></html>")).toBe(true);
expect(isTransientHttpError("529 Overloaded")).toBe(true);
});
it("returns false for non-retryable or non-http text", () => {
-expect(isTransientHttpError("504 Gateway Timeout")).toBe(false);
expect(isTransientHttpError("429 Too Many Requests")).toBe(false);
expect(isTransientHttpError("network timeout")).toBe(false);
});
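
The expectations in this hunk could be satisfied by a message parser along the following lines. This is only a sketch: `isTransientHttpErrorSketch` and `TRANSIENT_CODES` are hypothetical names, and the set of codes is assumed from the test cases shown here rather than taken from the real implementation.

```typescript
// Assumed set of transient status codes, inferred from the test cases above.
const TRANSIENT_CODES = new Set<number>([500, 502, 503, 504, 521, 529]);

// Treats a message as transient only when it starts with a three-digit
// status code that is in the assumed transient set.
function isTransientHttpErrorSketch(raw: string): boolean {
  // Require exactly three leading digits followed by a word boundary.
  const match = /^(\d{3})\b/.exec(raw.trim());
  if (!match) return false;
  return TRANSIENT_CODES.has(Number(match[1]));
}
```

Because the match is anchored to the start of the message, text such as "Service Unavailable" with no leading code is not classified as transient, which is why the status-code branch is needed for SDK errors that only set `err.status`.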