openclaw

mirror of https://github.com/openclaw/openclaw.git synced 2026-05-11 08:31:41 +00:00

Author	SHA1	Message	Date
Altay	6e962d8b9e	fix(agents): handle overloaded failover separately (#38301) * fix(agents): skip auth-profile failure on overload * fix(agents): note overload auth-profile fallback fix * fix(agents): classify overloaded failures separately * fix(agents): back off before overload failover * fix(agents): tighten overload probe and backoff state * fix(agents): persist overloaded cooldown across runs * fix(agents): tighten overloaded status handling * test(agents): add overload regression coverage * fix(agents): restore runner imports after rebase * test(agents): add overload fallback integration coverage * fix(agents): harden overloaded failover abort handling * test(agents): tighten overload classifier coverage * test(agents): cover all-overloaded fallback exhaustion * fix(cron): retry overloaded fallback summaries * fix(cron): treat HTTP 529 as overloaded retry	2026-03-07 01:42:11 +03:00
Xinhua Gu	01b20172b8	fix(failover): classify HTTP 402 as rate_limit when payload indicates usage limit (#30484) (#36802) * fix(failover): classify HTTP 402 as rate_limit when payload indicates usage limit (#30484) Some providers (notably Anthropic Claude Max plan) surface temporary usage/rate-limit failures as HTTP 402 instead of 429. Before this change, all 402s were unconditionally mapped to 'billing', which produced a misleading 'run out of credits' warning for Max plan users who simply hit their usage window. This follows the same pattern introduced for HTTP 400 in #36783: check the error message for an explicit rate-limit signal before falling back to the default status-code classification. - classifyFailoverReasonFromHttpStatus now returns 'rate_limit' for 402 when isRateLimitErrorMessage matches the payload text - Added regression tests covering both the rate-limit and billing paths on 402 * fix: narrow 402 rate-limit matcher to prevent billing misclassification The original implementation used isRateLimitErrorMessage(), which matches phrases like 'quota exceeded' that legitimately appear in billing errors. This commit replaces it with a narrow, 402-specific matcher that requires BOTH retry language (try again/retry/temporary/cooldown) AND limit terminology (usage limit/rate limit/organization usage). Prevents misclassification of errors like: 'HTTP 402: exceeded quota, please add credits' -> billing (not rate_limit) Added regression test for the ambiguous case. --------- Co-authored-by: Val Alexander <bunsthedev@gmail.com>	2026-03-06 03:45:36 -06:00
zhouhe-xydt	a65d70f84b	Fix failover for zhipuai 1310 Weekly/Monthly Limit Exhausted (#33813) Merged via squash. Prepared head SHA: `3dc441e58d` Co-authored-by: zhouhe-xydt <265407618+zhouhe-xydt@users.noreply.github.com> Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com> Reviewed-by: @altaywtf	2026-03-06 12:04:09 +03:00
Altay	49acb07f9f	fix(agents): classify insufficient_quota 400s as billing (#36783)	2026-03-06 01:17:48 +03:00
jiangnan	029c473727	fix(failover): narrow service-unavailable to require overload indicator (#32828) (#36646) Merged via squash. Prepared head SHA: `46fb430612` Co-authored-by: jnMetaCode <12096460+jnMetaCode@users.noreply.github.com> Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com> Reviewed-by: @altaywtf	2026-03-06 00:01:57 +03:00
Altay	f014e255df	refactor(agents): share failover HTTP status classification (#36615) * fix(agents): classify transient failover statuses consistently * fix(agents): preserve legacy failover status mapping	2026-03-05 23:50:36 +03:00
不做了睡大觉	8ac7ce73b3	fix: avoid false global rate-limit classification from generic cooldown text (#32972) Merged via squash. Prepared head SHA: `813c16f5af` Co-authored-by: stakeswky <64798754+stakeswky@users.noreply.github.com> Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com> Reviewed-by: @altaywtf	2026-03-05 22:58:21 +03:00
Kai	60a6d11116	fix(embedded): classify model_context_window_exceeded as context overflow, trigger compaction (#35934) Merged via squash. Prepared head SHA: `20fa77289c` Co-authored-by: RealKai42 <44634134+RealKai42@users.noreply.github.com> Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com> Reviewed-by: @jalehman	2026-03-05 11:30:24 -08:00
Tak Hoffman	9889c6da53	Runtime: stabilize tool/run state transitions under compaction and backpressure Synthesize runtime state transition fixes for compaction tool-use integrity and long-running handler backpressure. Sources: #33630, #33583 Co-authored-by: Kevin Shenghui <shenghuikevin@gmail.com> Co-authored-by: Theo Tarr <theodore@tarr.com>	2026-03-03 21:25:32 -06:00
Gustavo Madeira Santana	e4b4486a96	Agent: unify bootstrap truncation warning handling (#32769) Merged via squash. Prepared head SHA: `5d6d4ddfa6` Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com> Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com> Reviewed-by: @gumadeiras	2026-03-03 16:28:38 -05:00
Peter Steinberger	6472e03412	refactor(agents): share failover error matchers	2026-03-03 02:51:00 +00:00
AI南柯(KingMo)	30ab9b2068	fix(agents): recognize connection errors as retryable timeout failures (#31697) * fix(agents): recognize connection errors as retryable timeout failures ## Problem When a model endpoint becomes unreachable (e.g., local proxy down, relay server offline), the failover system fails to switch to the next candidate model. Errors like "Connection error." are not classified as retryable, causing the session to hang on a broken endpoint instead of falling back to healthy alternatives. ## Root Cause Connection/network errors are not recognized by the current failover classifier: - Text patterns like "Connection error.", "fetch failed", "network error" - Error codes like ECONNREFUSED, ENOTFOUND, EAI_AGAIN (in message text) While `failover-error.ts` handles these as error codes (err.code), it misses them when they appear as plain text in error messages. ## Solution Extend timeout error patterns to include connection/network failures: In `errors.ts` (ERROR_PATTERNS.timeout): - Text: "connection error", "network error", "fetch failed", etc. - Regex: /\beconn(?:refused|reset|aborted)\b/i, /\benotfound\b/i, /\beai_again\b/i In `failover-error.ts` (TIMEOUT_HINT_RE): - Same patterns for non-assistant error paths ## Testing Added test cases covering: - "Connection error." - "fetch failed" - "network error: ECONNREFUSED" - "ENOTFOUND" / "EAI_AGAIN" in message text ## Impact - Compatibility: High - only expands retryable error detection - Behavior: Connection failures now trigger automatic fallback - Risk: Low - changes are additive and well-tested * style: fix code formatting for test file	2026-03-03 02:37:23 +00:00
Peter Steinberger	1bd20dbdb6	fix(failover): treat stop reason error as timeout	2026-03-03 01:05:24 +00:00
Peter Steinberger	a2fdc3415f	fix(failover): handle unhandled stop reason error	2026-03-03 01:05:24 +00:00
bmendonca3	a6489ab5e9	fix(agents): cap openai-completions tool call ids to provider-safe format (#31947) Co-authored-by: bmendonca3 <bmendonca3@users.noreply.github.com>	2026-03-02 18:08:20 +00:00
Sid	40e078a567	fix(auth): classify permission_error as auth_permanent for profile fallback (#31324) When an OAuth auth profile returns HTTP 403 with permission_error (e.g. expired plan), the error was not matched by the authPermanent patterns. This caused the profile to receive only a short cooldown instead of being disabled, so the gateway kept retrying the same broken profile indefinitely. Add "permission_error" and "not allowed for this organization" to the authPermanent error patterns so these errors trigger the longer billing/auth_permanent disable window and proper profile rotation. Closes #31306 Made-with: Cursor Co-authored-by: Vincent Koc <vincentkoc@ieee.org>	2026-03-01 22:26:05 -08:00
Charles Dusek	92199ac129	fix(agents): unblock gpt-5.3-codex API-key routing and replay (#31083) * fix(agents): unblock gpt-5.3-codex API-key replay path * fix(agents): scope OpenAI replay ID rewrites per turn * test: fix nodes-tool mock typing and reformat telegram accounts	2026-03-02 03:45:12 +00:00
Frank Yang	ed86252aa5	fix: handle CLI session expired errors gracefully instead of crashing gateway (#31090) * fix: handle CLI session expired errors gracefully - Add session_expired to FailoverReason type - Add isCliSessionExpiredErrorMessage to detect expired CLI sessions - Modify runCliAgent to retry with new session when session expires - Update agentCommand to clear expired session IDs from session store - Add proper error handling to prevent gateway crashes on expired sessions Fixes #30986 * fix: add session_expired to AuthProfileFailureReason and missing log import * fix: type cli-runner usage field to match EmbeddedPiAgentMeta * fix: harden CLI session-expiry recovery handling * build: regenerate host env security policy swift --------- Co-authored-by: Peter Steinberger <steipete@gmail.com>	2026-03-02 01:11:05 +00:00
Peter Steinberger	81ca309ee6	fix(agents): land #31002 from @yfge Co-authored-by: yfge <geyunfei@gmail.com>	2026-03-02 01:08:58 +00:00
Peter Steinberger	250f9e15f5	fix(agents): land #31007 from @HOYALIM Co-authored-by: Ho Lim <subhoya@gmail.com>	2026-03-02 01:06:00 +00:00
Aleksandrs Tihenko	c0026274d9	fix(auth): distinguish revoked API keys from transient auth errors (#25754) Merged via /review-pr -> /prepare-pr -> /merge-pr. Prepared head SHA: `8f9c07a200` Co-authored-by: rrenamed <87486610+rrenamed@users.noreply.github.com> Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com> Reviewed-by: @gumadeiras	2026-02-25 19:47:16 -05:00
Peter Steinberger	d2597d5ecf	fix(agents): harden model fallback failover paths	2026-02-25 03:46:34 +00:00
Peter Steinberger	43f318cd9a	fix(agents): reduce billing false positives on long text (#25680) Land PR #25680 from @lairtonlelis. Retain explicit status/code/http 402 detection for oversized structured payloads. Co-authored-by: Ailton <lairton@telnyx.com>	2026-02-25 01:22:17 +00:00
Peter Machona	9ced64054f	fix(auth): classify missing OAuth scopes as auth failures (#24761)	2026-02-24 03:33:44 +00:00
Clawborn	544809b6f6	Add Chinese context overflow patterns to isContextOverflowError (#22855) Proxy providers returning Chinese error messages (e.g. Chinese LLM gateways) use patterns like '上下文过长' or '上下文超出' that are not matched by the existing English-only patterns in isContextOverflowError. This prevents auto-compaction from triggering, leaving the session stuck. Add the most common Chinese proxy patterns: - 上下文过长 (context too long) - 上下文超出 (context exceeded) - 上下文长度超 (context length exceeds) - 超出最大上下文 (exceeds maximum context) - 请压缩上下文 (please compress context) Chinese characters are unaffected by toLowerCase() so check the original message directly. Closes #22849	2026-02-23 10:54:24 -05:00
Vincent Koc	4f340b8812	fix(agents): avoid classifying reasoning-required errors as context overflow (#24593) * Agents: exclude reasoning-required errors from overflow detection * Tests: cover reasoning-required overflow classification guard * Tests: format reasoning-required endpoint errors	2026-02-23 10:38:49 -05:00
Alice Losasso	652099cd5c	fix: correctly identify Groq TPM limits as rate limits instead of context overflow (#16176) Co-authored-by: Howard <dddabtc@users.noreply.github.com>	2026-02-23 10:32:53 -05:00
青雲	69692d0d3a	fix: detect additional context overflow error patterns to prevent leak to user (#20539) * fix: detect additional context overflow error patterns to prevent leak to user Fixes #9951 The error 'input length and max_tokens exceed context limit: 170636 + 34048 > 200000' was not caught by isContextOverflowError() and leaked to users via formatAssistantErrorText()'s invalidRequest fallback. Add three new patterns to isContextOverflowError(): - 'exceed context limit' (direct match) - 'exceeds the model\'s maximum context' - max_tokens/input length + exceed + context (compound match) These are now rewritten to the friendly context overflow message. * Overflow: add regression tests and changelog credits * Update CHANGELOG.md * Update pi-embedded-helpers.isbillingerrormessage.test.ts --------- Co-authored-by: echoVic <AkiraVic@outlook.com> Co-authored-by: Vincent Koc <vincentkoc@ieee.org>	2026-02-23 10:03:56 -05:00
Peter Steinberger	9bd04849ed	fix(agents): detect Kimi model-token-limit overflows Co-authored-by: Danilo Falcão <danilo@falcao.org>	2026-02-23 12:44:23 +00:00
taw0002	3c57bf4c85	fix: treat HTTP 502/503/504 as failover-eligible (timeout reason) (#21017) * fix: treat HTTP 502/503/504 as failover-eligible (timeout reason) When a model API returns 502 Bad Gateway, 503 Service Unavailable, or 504 Gateway Timeout, the error object carries the status code directly. resolveFailoverReasonFromError() only checked 402/429/401/403/408/400, so 5xx server errors fell through to message-based classification which requires the status code to appear at the start of the error message. Many API SDKs (Google, Anthropic) set err.status = 503 without prefixing the message with '503', so the message classifier never matched and failover never triggered — the run retried the same broken model. Add 502/503/504 to the status-code branch, returning 'timeout' (matching the existing behavior of isTransientHttpError in the message classifier). Fixes #20999 * Changelog: add failover 502/503/504 note with credits * Failover: classify HTTP 504 as transient in message parser * Changelog: credit taw0002 and vincentkoc for failover fix --------- Co-authored-by: Vincent Koc <vincentkoc@ieee.org>	2026-02-23 03:01:57 -05:00
Peter Steinberger	382fe8009a	refactor!: remove google-antigravity provider support	2026-02-23 05:20:14 +01:00
青雲	3dfee78d72	fix: sanitize tool call IDs in agent loop for Mistral strict9 format (#23595) (#23698) * fix: sanitize tool call IDs in agent loop for Mistral strict9 format (#23595) Mistral requires tool call IDs to be exactly 9 alphanumeric characters ([a-zA-Z0-9]{9}). The existing sanitizeToolCallIdsForCloudCodeAssist mechanism only ran on historical messages at attempt start via sanitizeSessionHistory, but the pi-agent-core agent loop's internal tool call → tool result cycles bypassed that path entirely. Changes: - Wrap streamFn (like dropThinkingBlocks) so every outbound request sees sanitized tool call IDs when the transcript policy requires it - Replace call_${Date.now()} in pendingToolCalls with a 9-char hex ID generated from crypto.randomBytes - Add Mistral tool call ID error pattern to ERROR_PATTERNS.format so the error is correctly classified for retry/rotation * Changelog: document Mistral strict9 tool-call ID fix --------- Co-authored-by: echoVic <AkiraVic@outlook.com> Co-authored-by: Vincent Koc <vincentkoc@ieee.org>	2026-02-22 13:37:12 -05:00
Artale	51e9c54f09	fix(agents): skip bootstrap files with undefined path (#22698) * fix(agents): skip bootstrap files with undefined path buildBootstrapContextFiles() called file.path.replace() without checking that path was defined. If a hook pushed a bootstrap file using 'filePath' instead of 'path', the function threw TypeError and crashed every agent session — not just the misconfigured hook. Fix: add a null-guard before the path.replace() call. Files with undefined path are skipped with a warning so one bad hook can't take down all agents. Also adds a test covering the undefined-path case. Fixes #22693 * fix: harden bootstrap path validation and report guards (#22698) (thanks @arosstale) --------- Co-authored-by: Peter Steinberger <steipete@gmail.com>	2026-02-22 13:17:07 +01:00
Vignesh Natarajan	35fe33aa90	Agents: classify Anthropic api_error internal server failures for fallback	2026-02-21 19:22:16 -08:00
Harry Cui Kepler	ffa63173e0	refactor(agents): migrate console.warn/error/info to subsystem logger (#22906) Merged via /review-pr -> /prepare-pr -> /merge-pr. Prepared head SHA: `a806c4cb27` Co-authored-by: Kepler2024 <166882517+Kepler2024@users.noreply.github.com> Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com> Reviewed-by: @gumadeiras	2026-02-21 17:11:47 -05:00
niceysam	5e423b596c	fix: remove false-positive billing error rewrite on normal assistant text (openclaw#17834) thanks @niceysam Verified: - pnpm install --frozen-lockfile - pnpm build - pnpm check - pnpm test:macmini Co-authored-by: niceysam <256747835+niceysam@users.noreply.github.com> Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>	2026-02-21 12:17:39 -06:00
mudrii	7ecfc1d93c	fix(auth): bidirectional mode/type compat + sync OAuth to all agents (#12692) Merged via /review-pr -> /prepare-pr -> /merge-pr. Prepared head SHA: `2dee8e1174` Co-authored-by: mudrii <220262+mudrii@users.noreply.github.com> Co-authored-by: obviyus <22031114+obviyus@users.noreply.github.com> Reviewed-by: @obviyus	2026-02-20 16:01:09 +05:30
Protocol Zero	2af3415fac	fix: treat HTTP 503 as failover-eligible for LLM provider errors (#21086) * fix: treat HTTP 503 as failover-eligible for LLM provider errors When LLM SDKs wrap 503 responses, the leading "503" prefix is lost (e.g. Google Gemini returns "high demand" / "UNAVAILABLE" without a numeric prefix). The existing isTransientHttpError only matches messages starting with "503 ...", so these wrapped errors silently skip failover — no profile rotation, no model fallback. This patch closes that gap: - resolveFailoverReasonFromError: map HTTP status 503 → rate_limit (covers structured error objects with a status field) - ERROR_PATTERNS.overloaded: add /\b503\b/, "service unavailable", "high demand" (covers message-only classification when the leading status prefix is absent) Existing isTransientHttpError behavior is unchanged; these additions are complementary and only fire for errors that previously fell through unclassified. * fix: address review feedback — drop /\b503\b/ pattern, add test coverage - Remove `/\b503\b/` from ERROR_PATTERNS.overloaded to resolve the semantic inconsistency noted by reviewers: `isTransientHttpError` already handles messages prefixed with "503" (→ "timeout"), so a redundant overloaded pattern would classify the same class of errors differently depending on message formatting. - Keep "service unavailable" and "high demand" patterns — these are the real gap-fillers for SDK-rewritten messages that lack a numeric prefix. - Add test case for JSON-wrapped 503 error body containing "overloaded" to strengthen coverage. * fix: unify 503 classification — status 503 → timeout (consistent with isTransientHttpError) resolveFailoverReasonFromError previously mapped status 503 → "rate_limit", while the string-based isTransientHttpError mapped "503 ..." → "timeout". Align both paths: structured {status: 503} now also returns "timeout", matching the existing transient-error convention. Both reasons are failover-eligible, so runtime behavior is unchanged. --------- Co-authored-by: Vincent Koc <vincentkoc@ieee.org>	2026-02-19 12:45:09 -08:00
青雲	3d4ef56044	fix: include provider and model name in billing error message (#20510) Merged via /review-pr -> /prepare-pr -> /merge-pr. Prepared head SHA: `40dbdf62e8` Co-authored-by: echoVic <16428813+echoVic@users.noreply.github.com> Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com> Reviewed-by: @gumadeiras	2026-02-18 21:56:00 -05:00
Peter Steinberger	1934eebbf0	refactor(agents): dedupe lifecycle send assertions and stable payload stringify	2026-02-18 14:15:14 +00:00
Peter Steinberger	b8b43175c5	style: align formatting with oxfmt 0.33	2026-02-18 01:34:35 +00:00
Peter Steinberger	31f9be126c	style: run oxfmt and fix gate failures	2026-02-18 01:29:02 +00:00
Peter Steinberger	b05e89e5e6	fix(agents): make image sanitization dimension configurable	2026-02-18 00:54:20 +01:00
cpojer	d0cb8c19b2	chore: wtf.	2026-02-17 13:36:48 +09:00
Sebastian	ed11e93cf2	chore(format)	2026-02-16 23:20:16 -05:00
cpojer	90ef2d6bdf	chore: Update formatting.	2026-02-17 09:18:40 +09:00
Daniel Sauer	12ce358da5	fix(failover): recognize 'abort' stop reason as timeout for model fallback When streaming providers (GLM, OpenRouter, etc.) return 'stop reason: abort' due to stream interruption, OpenClaw's failover mechanism did not recognize this as a timeout condition. This prevented fallback models from being triggered, leaving users with failed requests instead of graceful failover. Changes: - Add abort patterns to ERROR_PATTERNS.timeout in pi-embedded-helpers/errors.ts - Extend TIMEOUT_HINT_RE regex to include abort patterns in failover-error.ts Fixes #18453 Co-authored-by: James <james@openclaw.ai>	2026-02-16 23:49:51 +01:00
Gustavo Madeira Santana	8a67016646	Agents: raise bootstrap total cap and warn on /context truncation (#18229) Merged via /review-pr -> /prepare-pr -> /merge-pr. Prepared head SHA: `f6620526df` Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com> Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com> Reviewed-by: @gumadeiras	2026-02-16 12:04:53 -05:00
the sun gif man	68ea063958	🤖 fix: preserve openai reasoning replay ids (#17792) What: - disable tool-call id sanitization for OpenAI/OpenAI Codex transcript policy - gate id sanitization in image sanitizer to full mode only - keep orphan reasoning downgrade scoped to OpenAI model-switch replay path - update transcript policy, session-history, sanitizer, and reasoning replay tests - document OpenAI model-switch orphan-reasoning cleanup behavior in transcript hygiene reference Why: - OpenAI Responses replay depends on canonical call_id|fc_id pairings for reasoning followers - strict id rewriting in OpenAI path breaks follower matching and triggers rs_* orphan 400s - limiting scope avoids behavior expansion while fixing the identified regression Tests: - pnpm vitest run src/agents/transcript-policy.test.ts src/agents/pi-embedded-runner.sanitize-session-history.test.ts src/agents/openai-responses.reasoning-replay.test.ts - pnpm vitest run --config vitest.e2e.config.ts src/agents/transcript-policy.e2e.test.ts src/agents/pi-embedded-runner.sanitize-session-history.e2e.test.ts src/agents/pi-embedded-helpers.sanitize-session-messages-images.removes-empty-assistant-text-blocks-but-preserves.e2e.test.ts src/agents/pi-embedded-helpers.sanitizeuserfacingtext.e2e.test.ts - pnpm lint - pnpm format:check - pnpm check:docs - pnpm test (fails in current macOS bash 3.2 env at test/git-hooks-pre-commit.integration.test.ts: mapfile not found)	2026-02-15 22:45:01 -08:00
Gustavo Madeira Santana	bd9d35c720	chore: remove defensive logic	2026-02-15 09:54:04 -05:00
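
Commit 01b20172b8 describes a narrow 402 matcher that requires both retry language and limit terminology before treating a 402 as a rate limit. The sketch below illustrates that two-signal rule only. The names `classify402`, `RETRY_HINT`, and `LIMIT_HINT` are hypothetical and not taken from the repository, and the actual patterns in the codebase may differ.

```typescript
// Hypothetical names; only the "both signals required" rule comes from the commit message.
type FailoverReason = "rate_limit" | "billing";

// Retry language listed in the commit: try again / retry / temporary / cooldown.
const RETRY_HINT = /\b(?:try again|retry|temporary|cooldown)\b/i;
// Limit terminology listed in the commit: usage limit / rate limit / organization usage.
const LIMIT_HINT = /\b(?:usage limit|rate limit|organization usage)\b/i;

function classify402(message: string): FailoverReason {
  // Require BOTH signals so billing text such as "exceeded quota" stays classified as billing.
  return RETRY_HINT.test(message) && LIMIT_HINT.test(message) ? "rate_limit" : "billing";
}

console.log(classify402("HTTP 402: exceeded quota, please add credits")); // billing
console.log(classify402("HTTP 402: usage limit reached, try again in 5 minutes")); // rate_limit
```

Requiring both signals trades recall for precision: a message with only one of the two hints falls back to the default billing classification.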

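Commit 30ab9b2068 extends the timeout patterns so that connection failures appearing as plain message text also trigger failover. A minimal sketch of that kind of check follows, using only the phrases and regexes quoted in the commit message. The helper name `looksLikeConnectionFailure` and the two constant names are assumptions for illustration, not identifiers from `errors.ts` or `failover-error.ts`.

```typescript
// Hypothetical helper; the phrases and regexes are the ones quoted in commit 30ab9b2068.
const CONNECTION_TEXT: string[] = ["connection error", "network error", "fetch failed"];
const CONNECTION_CODES: RegExp[] = [
  /\beconn(?:refused|reset|aborted)\b/i,
  /\benotfound\b/i,
  /\beai_again\b/i,
];

function looksLikeConnectionFailure(message: string): boolean {
  const lower = message.toLowerCase();
  // Match either a known phrase (case-insensitive) or an error code embedded in the text.
  return (
    CONNECTION_TEXT.some((phrase) => lower.includes(phrase)) ||
    CONNECTION_CODES.some((re) => re.test(message))
  );
}

console.log(looksLikeConnectionFailure("Connection error.")); // true
console.log(looksLikeConnectionFailure("getaddrinfo ENOTFOUND api.example.com")); // true
console.log(looksLikeConnectionFailure("invalid api key")); // false
```
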
113 Commits