Retry & Backoff Strategies
Retrying is the cheapest reliability mechanism you can add to an async client, and the easiest one to get catastrophically wrong. A retry loop that ignores idempotency duplicates writes; one without jitter synchronizes every client in your fleet into a thundering herd; one that ignores the deadline turns a 200ms timeout into a 30-second hang. This guide covers retrying failed async operations safely under load: deciding what is retryable, how long to wait between attempts, when to stop, and how to stop hammering a dependency that has already fallen over.
This is a focused topic within Resilience, Cancellation & Error Handling. Retries do not exist in isolation — they share a budget with timeouts and deadlines, and every backoff sleep is a cancellation point governed by the cancellation patterns you adopt elsewhere.
Scope of this guide:
- Idempotency as the precondition for any retry
- Classifying retryable vs non-retryable failures
- Backoff schedules: fixed interval, exponential, and jittered
- Bounding retries by attempt count and by deadline
- Circuit breakers to short-circuit a dead dependency
Architectural principles
Before writing a single retry loop, internalize these. Most retry incidents are violations of one of them.
- Retry only idempotent operations. A retry replays a request whose outcome you never observed. If replaying it can change state twice (a non-idempotent POST that charges a card, an INSERT without a unique key), a retry is a correctness bug, not a resilience feature. Make writes idempotent (idempotency keys, conditional updates) before you make them retryable.
- Cap total time, not just attempts. "Retry 5 times" says nothing about how long the caller waits. Five attempts with exponential backoff can span 30+ seconds. Bound retries by a wall-clock deadline so the worst case is predictable.
- Add jitter or you synchronize a thundering herd. When a dependency blips, every client fails at roughly the same instant. Deterministic backoff makes them all retry at the same instant too, re-creating the exact load spike that caused the failure. Randomized delay spreads the retries out.
- A retry that ignores the deadline is a bug. If the caller's deadline is 2 seconds away, there is no point sleeping 4 seconds before the next attempt. Every backoff must be clipped to the remaining budget, and a retry must never be scheduled past the deadline.
- Stop hammering a dead dependency. Retries assume transient failure. When a dependency is hard-down, retries amplify load on something already on its knees. A circuit breaker detects sustained failure and fails fast until the dependency recovers.
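The "cap total time" principle is easier to see with numbers. The sketch below is illustrative only: worst_case_span and its parameters are hypothetical names, and it assumes every attempt runs into its per-attempt timeout and that one capped exponential sleep separates consecutive attempts.

```python
def worst_case_span(attempts: int, base: float, cap: float, attempt_timeout: float) -> float:
    """Upper bound on how long the caller waits if every attempt times out."""
    # There is one sleep between consecutive attempts, so attempts - 1 sleeps.
    sleeps = sum(min(cap, base * 2 ** i) for i in range(attempts - 1))
    return attempts * attempt_timeout + sleeps


# Five attempts, 3 s per-attempt timeout, sleeps of 1 + 2 + 4 + 8 seconds.
print(worst_case_span(attempts=5, base=1.0, cap=30.0, attempt_timeout=3.0))  # 30.0
```

An attempt count of five looks modest, yet under these assumed settings the caller can wait half a minute, which is why the deadline, not the count, should be the primary bound.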
How retries integrate with the event loop
Backoff is implemented with await asyncio.sleep(delay). This is the critical detail that makes async retries cheap: the sleep yields control back to the loop, so a coroutine waiting out a 4-second backoff consumes no thread and lets thousands of other tasks run. Contrast this with time.sleep() in a threaded client, where every backing-off worker holds a whole OS thread hostage. The resilience and error-handling overview covers the broader model for how the loop schedules these suspensions.
Because the backoff is a suspension point, it is also a cancellation point. If the enclosing task is cancelled, or an outer asyncio.timeout() fires, during the sleep, asyncio.sleep raises CancelledError and your retry loop unwinds. This is exactly what you want for deadline enforcement, but it means a try/except around the loop must never swallow CancelledError. Since Python 3.8 CancelledError derives from BaseException, so except Exception no longer catches it, but a bare except or except BaseException still does, and defensive code still gets this wrong. It also means retries compose naturally with deadlines: wrap the whole retry loop in one asyncio.timeout() and the budget is enforced for free, even mid-backoff.
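A minimal sketch of the cancellation behaviour described above. The flaky coroutine and the helper names are hypothetical; the outer handler is deliberately over-broad to show that except Exception does not intercept a cancellation delivered during the backoff sleep.

```python
import asyncio


async def flaky() -> None:
    # Hypothetical dependency that always fails transiently.
    raise ConnectionError("simulated transient failure")


async def attempt_then_backoff(delay: float) -> None:
    try:
        await flaky()
    except ConnectionError:
        # The backoff: a suspension point, and therefore a cancellation point.
        await asyncio.sleep(delay)


async def retry_forever(delay: float) -> None:
    while True:
        try:
            await attempt_then_backoff(delay)
        except Exception:
            # Over-broad on purpose. CancelledError derives from BaseException,
            # so it is not caught here and the loop unwinds on cancellation.
            continue


async def main() -> bool:
    task = asyncio.create_task(retry_forever(delay=10.0))
    await asyncio.sleep(0.01)  # let the task reach its first backoff sleep
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return task.cancelled()
```

Running asyncio.run(main()) returns promptly even though the backoff is ten seconds long. Replacing except Exception with a bare except would trap the cancellation and keep the loop spinning.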
The interaction with timeouts is bidirectional. A per-attempt timeout bounds a single slow call; a total deadline bounds the whole retry sequence. You almost always want both — see timeouts and deadlines for how the two nest.
Pattern catalogue
Each pattern below builds on the last. Start with the simplest schedule your dependency tolerates and add complexity only when load testing demands it.
Fixed-interval retry
The simplest schedule: wait a constant delay between attempts. Adequate for low-concurrency callers (a cron job, a single worker) where a thundering herd is impossible because there is only one client. Avoid it for anything fan-out: N clients on a fixed interval are the herd.
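A minimal sketch of a fixed-interval loop. The name retry_fixed, its defaults, and the choice of ConnectionError and TimeoutError as the retryable set are assumptions made for illustration.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_fixed(
    op: Callable[[], Awaitable[T]], *, attempts: int = 3, delay: float = 0.5
) -> T:
    for attempt in range(1, attempts + 1):
        try:
            return await op()
        except (ConnectionError, TimeoutError):
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(delay)  # constant wait between attempts
    raise RuntimeError("attempts must be >= 1")
```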
Exponential backoff
Double the delay after each failure (base * 2 ** attempt), capped at a maximum. This gives a struggling dependency exponentially more breathing room while keeping early retries snappy. It is the right default for fan-out clients — but on its own, with a deterministic schedule, it still synchronizes the herd. Cap the delay so a long failure streak doesn't push the next attempt minutes out.
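The schedule itself is one line. In this sketch the function name and the base and cap values are arbitrary illustrations, not recommendations.

```python
def exponential_delay(attempt: int, base: float = 0.5, cap: float = 10.0) -> float:
    """Delay before the retry that follows failed attempt number `attempt` (0-based)."""
    return min(cap, base * 2 ** attempt)


print([exponential_delay(n) for n in range(6)])
# [0.5, 1.0, 2.0, 4.0, 8.0, 10.0]
```

The last entry shows the cap taking over: the uncapped value would have been 16 seconds.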
Full-jitter backoff
Take the exponential delay as a ceiling and sleep a uniformly random amount up to it: random.uniform(0, min(cap, base * 2 ** attempt)). This is the AWS-recommended "full jitter" scheme. It de-synchronizes clients: two clients that failed at the same instant pick independent delays and spread their retries across the window. Compared with equal jitter, which keeps half the ceiling fixed and randomizes only the other half, full jitter spreads retries over the whole window and has a lower mean delay (half the ceiling instead of three quarters), so it is usually the better choice. The dedicated guide on exponential backoff with jitter in asyncio walks the full implementation and the jitter-variant trade-offs.
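A sketch of full jitter next to equal jitter for comparison. The function names, defaults, and the optional rng parameter (handy for seeding in tests) are assumptions; the equal-jitter formula here is the common "half fixed, half random" variant.

```python
import random
from typing import Optional


def full_jitter_delay(
    attempt: int, base: float = 0.5, cap: float = 10.0,
    rng: Optional[random.Random] = None,
) -> float:
    ceiling = min(cap, base * 2 ** attempt)
    # Anywhere in [0, ceiling]: mean delay is ceiling / 2.
    return (rng or random).uniform(0.0, ceiling)


def equal_jitter_delay(
    attempt: int, base: float = 0.5, cap: float = 10.0,
    rng: Optional[random.Random] = None,
) -> float:
    ceiling = min(cap, base * 2 ** attempt)
    # Half the ceiling is fixed, the other half random: mean is 3/4 of ceiling.
    return ceiling / 2 + (rng or random).uniform(0.0, ceiling / 2)


# Two clients that failed at the same instant draw independent delays.
client_a, client_b = random.Random(), random.Random()
print(full_jitter_delay(3, rng=client_a), full_jitter_delay(3, rng=client_b))
```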
Deadline-bounded retry
Attempt count alone makes worst-case latency unpredictable. Bind the entire sequence to a wall-clock deadline computed from loop.time() (the monotonic loop clock — immune to wall-clock jumps). Before each sleep, check how much budget remains; if the next backoff would overrun the deadline, stop now rather than sleeping into a guaranteed timeout.
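The budget check before each sleep might look like the sketch below. The name retry_until_deadline, its parameters, and the retryable exception set are assumptions. Note that this sketch does not bound an individual attempt, so a per-attempt timeout is still needed around op.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_until_deadline(
    op: Callable[[], Awaitable[T]], *, budget: float, base: float = 0.1, cap: float = 5.0
) -> T:
    loop = asyncio.get_running_loop()
    deadline = loop.time() + budget  # monotonic loop clock
    attempt = 0
    while True:
        try:
            return await op()
        except (ConnectionError, TimeoutError):
            delay = random.uniform(0.0, min(cap, base * 2 ** attempt))
            remaining = deadline - loop.time()
            if delay >= remaining:
                # Sleeping would overrun the deadline: stop now and surface
                # the last error instead of sleeping into a guaranteed timeout.
                raise
            await asyncio.sleep(delay)
            attempt += 1
```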
Circuit breaker wrapping retries
When a dependency is hard-down, retrying every call just piles load onto a corpse. A circuit breaker tracks recent failures; after a threshold it opens and fails fast (raising immediately without a network call) for a cool-down period, then half-opens to let a single probe through. If the probe succeeds it closes; if it fails it re-opens. Retries live inside a closed breaker; an open breaker stops them entirely.
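A compact sketch of the closed, open, and half-open cycle. The class and its parameters are hypothetical, and it simplifies "recent failures" to a count of consecutive failures; a production breaker would more likely use a rolling window or failure rate.

```python
import asyncio
import time
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised instead of making a network call while the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._cooldown = cooldown
        self._clock = clock
        self._failures = 0      # consecutive failures
        self._opened_at = 0.0
        self.state = "closed"   # "closed", "open" or "half_open"

    def _admit(self) -> bool:
        """Return True if this call is the half-open probe; raise to fail fast."""
        if self.state == "closed":
            return False
        if self.state == "open" and self._clock() - self._opened_at >= self._cooldown:
            self.state = "half_open"  # cool-down elapsed: let one probe through
            return True
        raise CircuitOpenError("circuit open; failing fast")

    async def call(self, op: Callable[[], Awaitable[T]]) -> T:
        is_probe = self._admit()
        try:
            result = await op()
        except asyncio.CancelledError:
            if is_probe:
                self.state = "open"  # probe never finished; allow a new probe
            raise
        except Exception:
            self._failures += 1
            if is_probe or self._failures >= self._threshold:
                self.state = "open"  # trip, or re-open after a failed probe
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self.state = "closed"
        return result
```

A retry loop would call breaker.call(op) for each attempt and treat CircuitOpenError as non-retryable, so an open breaker ends the sequence immediately.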
Resource boundaries under retry
Retries change the shape of your load, and the failure modes show up at the resource layer.
- Retry budgets, not just per-call caps. A per-call "retry 3 times" applied across a stampede still triples fleet-wide load on a struggling dependency. Adopt a retry budget — e.g. retries may not exceed 10% of total requests in a rolling window — so the system caps amplification globally even when every individual call is "allowed" to retry.
- Concurrency under retry storms. A retry holds its slot in the calling coroutine for the full backoff. With thousands of in-flight requests all backing off, you can exhaust an asyncio.Semaphore or worker pool with tasks that are sleeping, not working. Size concurrency limits for the backoff-inflated in-flight count, not the steady-state one.
- Connection-pool interaction. A retried request needs a connection from the pool on every attempt. Under a retry storm the pool churns and can saturate, turning transient downstream errors into pool-acquisition timeouts upstream. Keep retries and pool sizing co-designed; see connection pooling and keepalive and the patterns for async HTTP clients and servers, where most retry traffic actually lands.
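The retry budget from the first bullet can be sketched as a rolling-window counter. Everything here is an assumption for illustration: the class name, the per-process scope (a real fleet-wide budget needs shared or approximated state), and the 10% ratio taken from the example above.

```python
import time
from collections import deque
from typing import Callable


class RetryBudget:
    def __init__(self, ratio: float = 0.1, window: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ratio = ratio
        self._window = window
        self._clock = clock
        self._requests = deque()  # timestamps of first attempts
        self._retries = deque()   # timestamps of granted retries

    def _evict(self, now: float) -> None:
        # Drop timestamps that have fallen out of the rolling window.
        for q in (self._requests, self._retries):
            while q and now - q[0] > self._window:
                q.popleft()

    def record_request(self) -> None:
        now = self._clock()
        self._evict(now)
        self._requests.append(now)

    def try_acquire_retry(self) -> bool:
        """Grant a retry only if retries stay within ratio * requests."""
        now = self._clock()
        self._evict(now)
        if len(self._retries) + 1 <= self._ratio * len(self._requests):
            self._retries.append(now)
            return True
        return False
```

A caller would invoke record_request() once per logical call and check try_acquire_retry() before each backoff sleep, giving up when it returns False.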
Integrated production example
A generic retry_async helper combining everything: a retryable-exception predicate, exponential backoff with full jitter, an overall deadline, an optional circuit breaker, and a diagnostic hook that emits per-call telemetry.
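What follows is a sketch of such a helper, not the original implementation. The Breaker protocol, RetryStats, the on_done hook signature, and all defaults are hypothetical. It bounds each attempt by the remaining budget with asyncio.wait_for and counts only retryable failures against the breaker.

```python
import asyncio
import random
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional, Protocol, TypeVar

T = TypeVar("T")


class Breaker(Protocol):
    # Hypothetical interface; adapt it to the breaker you actually use.
    def allow(self) -> bool: ...
    def record_success(self) -> None: ...
    def record_failure(self) -> None: ...


class CircuitOpenError(Exception):
    """Raised when the breaker refuses the call."""


@dataclass
class RetryStats:
    attempts: int = 0
    time_in_backoff: float = 0.0
    succeeded: bool = False


def default_retryable(exc: BaseException) -> bool:
    return isinstance(exc, (ConnectionError, TimeoutError))


async def retry_async(
    op: Callable[[], Awaitable[T]],
    *,
    budget: float,
    max_attempts: int = 5,
    base: float = 0.1,
    cap: float = 5.0,
    retryable: Callable[[BaseException], bool] = default_retryable,
    breaker: Optional[Breaker] = None,
    on_done: Optional[Callable[[RetryStats], None]] = None,
) -> T:
    loop = asyncio.get_running_loop()
    deadline = loop.time() + budget
    stats = RetryStats()
    try:
        for attempt in range(max_attempts):
            if breaker is not None and not breaker.allow():
                raise CircuitOpenError("breaker is open; failing fast")
            stats.attempts += 1
            try:
                # Each attempt is clipped to whatever budget remains.
                result = await asyncio.wait_for(op(), timeout=deadline - loop.time())
            except Exception as exc:  # CancelledError is not caught here
                if not retryable(exc):
                    raise
                if breaker is not None:
                    breaker.record_failure()
                if attempt == max_attempts - 1:
                    raise
                delay = random.uniform(0.0, min(cap, base * 2 ** attempt))  # full jitter
                if delay >= deadline - loop.time():
                    raise  # the sleep would overrun the deadline
                stats.time_in_backoff += delay
                await asyncio.sleep(delay)
            else:
                if breaker is not None:
                    breaker.record_success()
                stats.succeeded = True
                return result
        raise RuntimeError("max_attempts must be >= 1")
    finally:
        if on_done is not None:
            on_done(stats)  # per-call telemetry, emitted on success and failure
```

A call such as await retry_async(fetch, budget=2.0, on_done=metrics.append) would record one RetryStats per logical call; fetch and metrics are placeholders here.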
Diagnostic Hook. Emit, per call: attempts (a p99 above 1 means a dependency is degrading), retry rate (retried calls ÷ total — the input to your retry budget; a sudden climb is an early warning), time-in-backoff (how much latency retries add to the request budget), and breaker state transitions (every CLOSED→OPEN edge is an incident signal). Tag these with the target dependency so you can see which downstream is forcing retries. If retry rate climbs while success rate doesn't recover, your retries are masking a hard failure — open the breaker faster.
Failure modes
| Failure mode | Root cause | Detection | Fix |
|---|---|---|---|
| Retry storm / thundering herd | Deterministic backoff with no jitter synchronizes every client | Downstream sees periodic load spikes aligned with retry intervals; request rate oscillates | Apply full jitter; add a fleet-wide retry budget |
| Duplicated side effects | Retrying a non-idempotent op (POST/charge/INSERT) | Duplicate rows, double charges, mismatched counts after a blip | Add idempotency keys / conditional writes; restrict retries to idempotent verbs |
| Retrying past the deadline | Backoff scheduled without checking remaining budget | Calls exceed their stated timeout; a sleep runs into a guaranteed TimeoutError | Clip each delay to deadline - loop.time(); wrap loop in asyncio.timeout() |
| Infinite / silent retries | Unbounded loop or over-broad except masks a permanent failure | Calls never surface errors; latency creeps; an outage looks like "slowness" | Cap attempts and deadline; only catch a retryable predicate; alert on retry rate |
| Concurrency / pool exhaustion under storm | Backing-off tasks hold slots/connections while sleeping | Semaphore or pool acquisition timeouts upstream during a downstream blip | Size limits for backoff-inflated in-flight count; co-design with pool sizing |
Frequently Asked Questions
Which exceptions should I actually retry?
Only transient failures: connection resets, read/connect timeouts, and server-side 5xx (especially 502/503/504) plus 429 with a Retry-After. Never retry other 4xx client errors (400, 401, 403, 404): they are deterministic and will fail identically on every attempt. Never retry programming errors (TypeError, ValueError) or CancelledError. Restrict retries to an explicit allowlist predicate rather than a broad except Exception, which silently retries bugs.
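An allowlist predicate along those lines could look like this sketch. HTTPStatusError is a hypothetical stand-in for whatever status-carrying exception your HTTP client raises, and honoring the Retry-After value is left out.

```python
import asyncio


class HTTPStatusError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def is_retryable(exc: BaseException) -> bool:
    if isinstance(exc, asyncio.CancelledError):
        return False  # cancellation must always propagate
    if isinstance(exc, HTTPStatusError):
        # 429 and server-side 5xx are transient; every other status is not.
        return exc.status == 429 or 500 <= exc.status <= 599
    # Connection resets and timeouts; anything else (TypeError, ValueError,
    # unknown exceptions) falls through to "do not retry".
    return isinstance(exc, (ConnectionError, TimeoutError))
```

Because the default answer is False, a new exception type is treated as non-retryable until someone deliberately adds it to the allowlist.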
Why does backoff need jitter?
When a dependency blips, many clients fail at almost the same instant. With a deterministic backoff schedule they all wait the same amount and retry at the same instant, re-creating the exact load spike that caused the failure — a thundering herd. Full jitter (sleep a uniformly random amount up to the exponential ceiling) gives each client an independent delay, spreading retries across the window so the recovering dependency sees smooth load instead of synchronized spikes.
How do retries interact with timeouts and cancellation?
Backoff is implemented with asyncio.sleep, which is a suspension and therefore a cancellation point. If an outer asyncio.timeout() fires or the task is cancelled during a backoff, asyncio.sleep raises CancelledError and the retry loop unwinds. Wrap the whole retry loop in one asyncio.timeout() to enforce a total deadline for free, and never let a try/except swallow CancelledError, or you defeat both cancellation and deadline enforcement.
When should I use a circuit breaker instead of just retrying?
Retries assume transient failure. When a dependency is hard-down, retrying every call amplifies load on something already failing and slows your own callers. A circuit breaker tracks recent failures; after a threshold it opens and fails fast without a network call for a cool-down period, then half-opens to let one probe through. Put retries inside a closed breaker; the open breaker stops them entirely until the dependency recovers.
Should I hand-roll retries or use tenacity?
tenacity is excellent for declarative policies — retry conditions, wait strategies (wait_exponential_jitter), stop conditions, and before/after hooks compose cleanly via decorators and it supports async. Hand-roll when you need tight control over deadline-aware clipping against loop.time(), a shared fleet-wide retry budget, or integration with a custom circuit breaker and metrics pipeline. Either way the principles are identical: idempotency, jitter, a deadline, and a retryable-exception predicate.
Related
- Resilience, Cancellation & Error Handling — up to the overview for the full reliability mental model.
- Timeouts & Deadlines — the budget every retry must respect; nest per-attempt and total deadlines.
- Cancellation Patterns — why backoff sleeps are cancellation points and how to unwind cleanly.
- Exponential Backoff with Jitter in asyncio — the step-by-step build of the jittered, deadline-aware schedule.
- Connection Pooling & Keepalive — how retry storms interact with and saturate connection pools.