fix(auth): auto-expire stale auth profile cooldowns and reset error count

When an auth profile hits a rate limit, `errorCount` is incremented and
`cooldownUntil` is set with exponential backoff. After the cooldown
expires, the time-based check correctly returns false — but `errorCount`
persists. The next transient failure immediately escalates to a much
longer cooldown because the backoff formula uses the stale count:

  60s × 5^(errorCount-1), max 1h

This creates a positive feedback loop where profiles appear permanently
stuck after rate limits, requiring manual JSON editing to recover.
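
The escalation can be illustrated with a minimal sketch of the backoff formula quoted above. The helper name `cooldownMs` and the constants are hypothetical; only the formula itself (60s × 5^(errorCount-1), capped at 1h) comes from this message.

```typescript
// Hypothetical helper: name and signature are assumptions for illustration.
// Only the formula (60s × 5^(errorCount-1), max 1h) is taken from the commit message.
const BASE_MS = 60_000; // 60 seconds
const MAX_MS = 60 * 60_000; // 1 hour cap

function cooldownMs(errorCount: number): number {
  const n = Math.max(1, errorCount); // treat anything below 1 as the first failure
  return Math.min(MAX_MS, BASE_MS * Math.pow(5, n - 1));
}

// Cooldown lengths for the 1st through 4th consecutive counted failure.
const schedule = [1, 2, 3, 4].map(cooldownMs);
// 60s, 300s (5 min), 1500s (25 min), then 7500s capped to 3600s (1 h)
```

With a stale `errorCount` of 3, a single new transient failure is counted as the fourth and lands directly on the one-hour cap instead of the 60-second first step.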

Add `clearExpiredCooldowns()`, which sweeps all profiles on every call to
`resolveAuthProfileOrder()` and clears expired `cooldownUntil` /
`disabledUntil` values. Once no unusable window remains active, it also
resets `errorCount` and `failureCounts`, giving the profile a fair retry
window (circuit-breaker half-open → closed transition).

Key design decisions:
- `cooldownUntil` and `disabledUntil` handled independently (a profile
  can have both; only the expired one is cleared)
- `errorCount` reset only when ALL unusable windows have expired
- `lastFailureAt` preserved for the existing failureWindowMs decay logic
- In-memory mutation; disk persistence happens lazily on the next store
  write, matching the existing save pattern
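
A rough sketch of the sweep under the decisions listed above, not the code in this commit. The type shapes, the `usageStats` field, the `Sketch` names, and the `<=` expiry boundary are all assumptions; only the field names `cooldownUntil`, `disabledUntil`, `errorCount`, `failureCounts` and `lastFailureAt` come from this message.

```typescript
// Assumed shapes for illustration; the real types live in ./types.js and may differ.
type ProfileStatsSketch = {
  cooldownUntil?: number;
  disabledUntil?: number;
  errorCount?: number;
  failureCounts?: Record<string, number>;
  lastFailureAt?: number;
};

type StoreSketch = { usageStats?: Record<string, ProfileStatsSketch> };

// Hypothetical stand-in for clearExpiredCooldowns(); returns true if anything changed.
function clearExpiredCooldownsSketch(store: StoreSketch, now: number): boolean {
  const all = store.usageStats ?? {};
  let mutated = false;
  for (const id of Object.keys(all)) {
    const stats = all[id];
    let cleared = false;
    // Each window is handled independently; only the expired one is cleared.
    if (stats.cooldownUntil !== undefined && stats.cooldownUntil <= now) {
      stats.cooldownUntil = undefined;
      cleared = true;
    }
    if (stats.disabledUntil !== undefined && stats.disabledUntil <= now) {
      stats.disabledUntil = undefined;
      cleared = true;
    }
    // Reset counters only when no unusable window remains; lastFailureAt is left alone.
    if (cleared && stats.cooldownUntil === undefined && stats.disabledUntil === undefined) {
      stats.errorCount = 0;
      stats.failureCounts = undefined;
    }
    if (cleared) mutated = true;
  }
  return mutated; // in-memory only; nothing is written to disk here
}

const demoStore: StoreSketch = {
  usageStats: {
    a: { cooldownUntil: 1_000, errorCount: 3, lastFailureAt: 900 },
    b: { cooldownUntil: 1_000, disabledUntil: 9_000, errorCount: 2 },
  },
};
const demoChanged = clearExpiredCooldownsSketch(demoStore, 5_000);
```

In this example profile `a` has its cooldown cleared and its count reset while `lastFailureAt` is kept. Profile `b` loses only the expired `cooldownUntil`; its `errorCount` stays at 2 because `disabledUntil` is still in the future.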

Fixes #3604
Related: #13623, #15851, #11972, #8434
nabbilkhan
2026-02-16 07:27:27 +00:00
committed by Shadow
parent d3707147c0
commit 03cadc4b7a
6 changed files with 507 additions and 1 deletions


@@ -2,7 +2,7 @@ import type { OpenClawConfig } from "../../config/config.js";
import type { AuthProfileStore } from "./types.js";
import { normalizeProviderId } from "../model-selection.js";
import { listProfilesForProvider } from "./profiles.js";
-import { isProfileInCooldown } from "./usage.js";
+import { clearExpiredCooldowns, isProfileInCooldown } from "./usage.js";
function resolveProfileUnusableUntil(stats: {
cooldownUntil?: number;
@@ -26,6 +26,11 @@ export function resolveAuthProfileOrder(params: {
const { cfg, store, provider, preferredProfile } = params;
const providerKey = normalizeProviderId(provider);
const now = Date.now();
+// Clear any cooldowns that have expired since the last check so profiles
+// get a fresh error count and are not immediately re-penalized on the
+// next transient failure. See #3604.
+clearExpiredCooldowns(store, now);
const storedOrder = (() => {
const order = store.order;
if (!order) {