Exponential Backoff with Jitter in asyncio
A dependency hiccups for two seconds. Every client in your fleet fails at the same instant, waits the same fixed half-second, and retries at the same instant — re-overloading the service just as it was recovering. The retries themselves become the outage. This is the thundering-herd failure mode of naive fixed retries, and the fix is exponential backoff with jitter, bounded by a deadline so a struggling dependency never holds your callers past their latency budget.
This guide builds that retry from scratch: exponential delay, full jitter to de-synchronize clients, hard caps on delay and attempts, deadline-awareness via asyncio.timeout(), and a retryable-exception predicate so you only retry transient failures.
Prerequisites
- Python 3.11+, for asyncio.timeout() as the deadline mechanism.
- Familiarity with the broader trade-offs in Retry & Backoff Strategies (idempotency, retry budgets, circuit breakers).
- Comfort with the Resilience, Cancellation & Error Handling model, particularly that asyncio.sleep yields to the loop and is a cancellation point.
- The operation you retry must be idempotent. Retrying a non-idempotent write duplicates side effects; fix idempotency first.
1. Base exponential backoff
Start with the schedule: each retry waits base * 2 ** attempt. Attempt 0 waits base, attempt 1 waits 2 * base, and so on. This gives a degrading dependency exponentially more recovery time while keeping the first retry fast.
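As a minimal sketch of that schedule: the helper names backoff_delay and retry_exponential and the default values are illustrative assumptions, while factory, attempts, and base follow the names used in this guide. factory is assumed to be a zero-argument callable that returns a fresh awaitable on each call.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.5) -> float:
    """Deterministic schedule: base, 2 * base, 4 * base, ..."""
    return base * 2 ** attempt


async def retry_exponential(
    factory: Callable[[], Awaitable[T]],
    *,
    attempts: int = 5,
    base: float = 0.5,
) -> T:
    """Call factory() up to `attempts` times, doubling the wait each time."""
    attempt = 0
    while True:
        try:
            # factory builds a fresh awaitable per try; a coroutine object
            # cannot be awaited a second time.
            return await factory()
        except ConnectionError:
            if attempt + 1 >= attempts:
                raise  # out of attempts: surface the failure
            delay = backoff_delay(attempt, base)
        # asyncio.sleep yields to the loop and is a cancellation point.
        await asyncio.sleep(delay)
        attempt += 1
```

Passing a callable rather than a coroutine object is a deliberate choice here: each retry needs a new coroutine, so the helper has to be able to build one itself.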
Verify: log each delay and confirm it doubles per attempt. Note the problem we will fix next — with a fixed base, every client computes the same sequence.
2. Add full jitter
Treat the exponential value as a ceiling and sleep a uniformly random amount up to it: random.uniform(0, delay). This is AWS's "full jitter" scheme — it de-synchronizes clients completely and lowers the mean delay versus equal jitter.
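A small sketch of the two jitter variants side by side, assuming the same base * 2 ** attempt ceiling as step 1. The function names are hypothetical; only random.uniform comes from the standard library.

```python
import random


def exponential_ceiling(attempt: int, base: float = 0.5) -> float:
    """Upper bound for the delay before retry number `attempt`."""
    return base * 2 ** attempt


def full_jitter_delay(attempt: int, base: float = 0.5) -> float:
    """Full jitter: anywhere in [0, ceiling], so the mean is ceiling / 2."""
    return random.uniform(0, exponential_ceiling(attempt, base))


def equal_jitter_delay(attempt: int, base: float = 0.5) -> float:
    """Equal jitter, for comparison: half fixed plus half random,
    so the mean is 3/4 of the ceiling."""
    half = exponential_ceiling(attempt, base) / 2
    return half + random.uniform(0, half)
```

In the retry loop, swapping the deterministic delay for full_jitter_delay(attempt, base) is the only change this step needs.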
Verify: run several concurrent callers and log retry timestamps. They should now scatter across each window instead of landing together.
3. Cap the delay and attempt count
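Left unbounded, the ceiling grows fast: for example, with a base of 0.5 seconds, base * 2 ** attempt passes two minutes by attempt 8 (0.5 * 256 = 128 seconds). Clamp the ceiling with a cap, and keep the explicit attempts limit so a permanent failure surfaces instead of looping forever.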
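One way the clamp might look, as a sketch: capped_jitter_delay and retry_capped are hypothetical names, and the default of cap=5.0 is chosen only to match the verification step for this section.

```python
import asyncio
import random
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


def capped_jitter_delay(attempt: int, base: float = 0.5, cap: float = 5.0) -> float:
    """Full jitter under a clamped ceiling: never more than `cap` seconds."""
    ceiling = min(cap, base * 2 ** attempt)
    return random.uniform(0, ceiling)


async def retry_capped(
    factory: Callable[[], Awaitable[T]],
    *,
    attempts: int = 5,
    base: float = 0.5,
    cap: float = 5.0,
) -> T:
    attempt = 0
    while True:
        try:
            return await factory()
        except ConnectionError:
            if attempt + 1 >= attempts:
                raise  # permanent failure surfaces instead of looping
            delay = capped_jitter_delay(attempt, base, cap)
        await asyncio.sleep(delay)
        attempt += 1
```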
Verify: with cap=5.0, confirm no sampled delay exceeds 5 seconds even on attempt 5+, and that the loop raises after attempts failures rather than hanging.
4. Make it deadline-aware with asyncio.timeout
Attempt count doesn't bound wall-clock latency. Wrap the whole loop in asyncio.timeout() so the caller's deadline is enforced even mid-backoff: if the budget expires during an asyncio.sleep, the sleep is cancelled, the loop unwinds, and the context manager raises TimeoutError to the caller.
Verify: point factory at an always-failing call and confirm the helper raises TimeoutError at ~total_deadline, not after attempts or after a backoff overshoots.
5. Restrict to retryable exceptions
A bare except ConnectionError is a start, but production needs an explicit predicate so you retry only transient failures and let bugs and client errors propagate immediately. Critically, re-raise CancelledError (and let TimeoutError from the deadline escape) instead of treating them as retryable.
Verify: raise a ValueError from factory and confirm it propagates on the first attempt with no backoff; raise a ConnectionError and confirm it retries.
Verification
- Spread, not spikes. Launch 50 callers against a dependency that fails for ~1s then recovers. Log loop.time() at each retry and bucket the timestamps. With jitter the histogram is flat; remove jitter and you'll see sharp peaks at the fixed intervals: the herd.
- Retries stop at the deadline. Drive factory to always fail and assert the call raises TimeoutError within a few milliseconds of total_deadline, never seconds late. A late finish means a backoff slept past the budget; the deadline wrap is what prevents it.
- Non-retryable fast-fail. Assert a ValueError surfaces on attempt 1 with zero asyncio.sleep calls.
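The spread check can be sketched as a small harness. Everything here is an assumption made for illustration: the names call_with_retry and retry_histogram, the scaled-down 0.2-second outage (so the run finishes quickly), and the 10 ms bucket width. The retry loop is a stripped-down copy so the block stands alone.

```python
import asyncio
import random
from collections import Counter


async def call_with_retry(dependency, stamps, *, jitter, base=0.05, cap=0.4, attempts=20):
    """Minimal retry loop that records the loop time of every retry."""
    loop = asyncio.get_running_loop()
    attempt = 0
    while True:
        try:
            return await dependency()
        except ConnectionError:
            if attempt + 1 >= attempts:
                raise
            ceiling = min(cap, base * 2 ** attempt)
            delay = random.uniform(0, ceiling) if jitter else ceiling
        await asyncio.sleep(delay)
        stamps.append(loop.time())  # moment this client retries
        attempt += 1


async def retry_histogram(*, jitter, clients=50, outage=0.2, bucket=0.01):
    """Run `clients` callers through an outage; bucket their retry times."""
    loop = asyncio.get_running_loop()
    start = loop.time()
    stamps = []

    async def dependency():
        # Fails for `outage` seconds after start, then recovers.
        if loop.time() - start < outage:
            raise ConnectionError("dependency down")
        return "ok"

    await asyncio.gather(
        *(call_with_retry(dependency, stamps, jitter=jitter) for _ in range(clients))
    )
    return Counter(int((t - start) / bucket) for t in stamps)
```

Comparing asyncio.run(retry_histogram(jitter=False)) with the jitter=True run should show a few tall buckets in the first case and many short ones in the second.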
Pitfalls & edge cases
- Jitter omitted "for predictability." Deterministic backoff is the thundering herd. Always jitter fan-out retries; only a single-client cron job can safely skip it.
- Retrying CancelledError. Since Python 3.8, CancelledError derives from BaseException, so except Exception won't catch it, but except BaseException will, and retrying cancellation defeats your deadline and shutdown logic. Always re-raise it explicitly.
- Non-idempotent writes. This helper replays the call. If factory performs a non-idempotent POST/INSERT, retries duplicate it. Add an idempotency key or make the write conditional before retrying.
- Unbounded growth. Without cap, 2 ** attempt reaches absurd delays; without attempts or a deadline, a permanent failure loops forever and hides an outage as "slowness."
- Wall clock vs loop clock. Compute budgets from asyncio.get_running_loop().time() (monotonic), not time.time(). asyncio.timeout() already uses the loop clock, so don't mix in wall-clock arithmetic, which can jump on NTP corrections.
Frequently Asked Questions
Why use full jitter instead of equal jitter or no jitter?
No jitter synchronizes every client into a thundering herd that re-overloads a recovering dependency. Full jitter sleeps a uniformly random amount up to the exponential ceiling, which spreads retries across the window and also lowers the mean delay compared with equal jitter (which keeps half the delay fixed). Full jitter is the AWS-recommended default for fan-out clients.
How do I stop the retry from overrunning the caller's deadline?
Wrap the entire retry loop in async with asyncio.timeout(total_deadline). Because asyncio.sleep is a suspension point, the timeout fires even during a backoff, raising TimeoutError and unwinding the loop. This bounds total wall-clock latency regardless of attempt count, so a backoff can never sleep past the budget.
Why must I avoid retrying CancelledError?
CancelledError signals that the task or its deadline is being torn down. Retrying it defeats cancellation, deadline enforcement, and graceful shutdown. Since Python 3.8 it derives from BaseException rather than Exception, so except Exception won't catch it, but except BaseException will. Re-raise it explicitly before any retryable-exception handling.
Related
- Retry & Backoff Strategies: the overview covering retry budgets, circuit breakers, and the full pattern catalogue.
- Resilience, Cancellation & Error Handling: the parent reference tying retries to timeouts and cancellation.
- Timeouts & Deadlines: how asyncio.timeout() nests with per-attempt timeouts to bound total latency.