fix(auth): auto-expire stale auth profile cooldowns and reset error count

When an auth profile hits a rate limit, `errorCount` is incremented and
`cooldownUntil` is set with exponential backoff. After the cooldown
expires, the time-based check correctly returns false — but `errorCount`
persists. The next transient failure immediately escalates to a much
longer cooldown because the backoff formula uses the stale count:

  60s × 5^(errorCount-1), max 1h

This creates a positive feedback loop where profiles appear permanently
stuck after rate limits, requiring manual JSON editing to recover.

Add `clearExpiredCooldowns()` which sweeps all profiles on every call to
`resolveAuthProfileOrder()` and clears expired `cooldownUntil` /
`disabledUntil` values along with resetting `errorCount` and
`failureCounts` — giving the profile a fair retry window (circuit-breaker
half-open → closed transition).

Key design decisions:
- `cooldownUntil` and `disabledUntil` handled independently (a profile
  can have both; only the expired one is cleared)
- `errorCount` reset only when ALL unusable windows have expired
- `lastFailureAt` preserved for the existing failureWindowMs decay logic
- In-memory mutation; disk persistence happens lazily on the next store
  write, matching the existing save pattern

Fixes #3604
Related: #13623, #15851, #11972, #8434
This commit is contained in:
nabbilkhan
2026-02-16 07:27:27 +00:00
committed by Shadow
parent d3707147c0
commit 03cadc4b7a
6 changed files with 507 additions and 1 deletions

View File

@@ -51,6 +51,77 @@ export function getSoonestCooldownExpiry(
return soonest;
}
/**
* Clear expired cooldowns from all profiles in the store.
*
* When `cooldownUntil` or `disabledUntil` has passed, the corresponding fields
* are removed and error counters are reset so the profile gets a fresh start
* (circuit-breaker half-open → closed). Without this, a stale `errorCount`
* causes the *next* transient failure to immediately escalate to a much longer
* cooldown — the root cause of profiles appearing "stuck" after rate limits.
*
* `cooldownUntil` and `disabledUntil` are handled independently: if a profile
* has both and only one has expired, only that field is cleared.
*
* Mutates the in-memory store; disk persistence happens lazily on the next
* store write (e.g. `markAuthProfileUsed` / `markAuthProfileFailure`), which
* matches the existing save pattern throughout the auth-profiles module.
*
* @returns `true` if any profile was modified.
*/
export function clearExpiredCooldowns(store: AuthProfileStore, now?: number): boolean {
const usageStats = store.usageStats;
if (!usageStats) {
return false;
}
const ts = now ?? Date.now();
let mutated = false;
for (const [profileId, stats] of Object.entries(usageStats)) {
if (!stats) {
continue;
}
let profileMutated = false;
const cooldownExpired =
typeof stats.cooldownUntil === "number" &&
Number.isFinite(stats.cooldownUntil) &&
stats.cooldownUntil > 0 &&
ts >= stats.cooldownUntil;
const disabledExpired =
typeof stats.disabledUntil === "number" &&
Number.isFinite(stats.disabledUntil) &&
stats.disabledUntil > 0 &&
ts >= stats.disabledUntil;
if (cooldownExpired) {
stats.cooldownUntil = undefined;
profileMutated = true;
}
if (disabledExpired) {
stats.disabledUntil = undefined;
stats.disabledReason = undefined;
profileMutated = true;
}
// Reset error counters when ALL cooldowns have expired so the profile gets
// a fair retry window. Preserves lastFailureAt for the failureWindowMs
// decay check in computeNextProfileUsageStats.
if (profileMutated && !resolveProfileUnusableUntil(stats)) {
stats.errorCount = 0;
stats.failureCounts = undefined;
}
if (profileMutated) {
usageStats[profileId] = stats;
mutated = true;
}
}
return mutated;
}
/**
* Mark a profile as successfully used. Resets error count and updates lastUsed.
* Uses store lock to avoid overwriting concurrent usage updates.