Retry & Backoff Strategies

Retrying is the cheapest reliability mechanism you can add to an async client, and the easiest one to get catastrophically wrong. A retry loop that ignores idempotency duplicates writes; one without jitter synchronizes every client in your fleet into a thundering herd; one that ignores the deadline turns a 200ms timeout into a 30-second hang. This guide covers retrying failed async operations safely under load: deciding what is retryable, how long to wait between attempts, when to stop, and how to stop hammering a dependency that has already fallen over.

This is a focused topic within Resilience, Cancellation & Error Handling. Retries do not exist in isolation — they share a budget with timeouts and deadlines, and every backoff sleep is a cancellation point governed by the cancellation patterns you adopt elsewhere.

Scope of this guide:

Idempotency as the precondition for any retry
Classifying retryable vs non-retryable failures
Backoff schedules: fixed interval, exponential, and jittered
Bounding retries by attempt count and by deadline
Circuit breakers to short-circuit a dead dependency

Architectural principles

Before writing a single retry loop, internalize these. Most retry incidents are violations of one of them.

Retry only idempotent operations. A retry replays a request whose outcome you never observed. If replaying it can change state twice — a non-idempotent POST that charges a card, an INSERT without a unique key — a retry is a correctness bug, not a resilience feature. Make writes idempotent (idempotency keys, conditional updates) before you make them retryable.
Cap total time, not just attempts. "Retry 5 times" says nothing about how long the caller waits. With a 2-second base and no cap, the four sleeps between five attempts alone add up to 2 + 4 + 8 + 16 = 30 seconds, before counting the time spent in the calls themselves. Bound retries by a wall-clock deadline so the worst case is predictable.
Add jitter or you synchronize a thundering herd. When a dependency blips, every client fails at roughly the same instant. Deterministic backoff makes them all retry at the same instant too, re-creating the exact load spike that caused the failure. Randomized delay spreads the retries out.
A retry that ignores the deadline is a bug. If the caller's deadline is 2 seconds away, there is no point sleeping 4 seconds before the next attempt. Every backoff must be clipped to the remaining budget, and a retry must never be scheduled past the deadline.
Stop hammering a dead dependency. Retries assume transient failure. When a dependency is hard-down, retries amplify load on something already on its knees. A circuit breaker detects sustained failure and fails fast until the dependency recovers.
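
The first principle can be illustrated with a minimal sketch of server-side deduplication by idempotency key. The PaymentService class, its charge method, and the in-memory key store are hypothetical names invented for this example; a real service would keep the keys in durable storage with an expiry.

```python
import asyncio
import uuid


class PaymentService:
    """Hypothetical handler that deduplicates writes by idempotency key."""

    def __init__(self) -> None:
        self._results = {}          # idempotency key -> stored response
        self.charges_applied = 0    # how many times the side effect really ran

    async def charge(self, idempotency_key: str, amount_cents: int) -> str:
        # A replayed request with a known key returns the stored response
        # instead of applying the side effect a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_applied += 1
        receipt = f"receipt-{self.charges_applied}-{amount_cents}"
        self._results[idempotency_key] = receipt
        return receipt


async def demo():
    service = PaymentService()
    key = str(uuid.uuid4())   # generated once per logical operation, reused on retry
    first = await service.charge(key, 1999)
    # Simulate a retry after a lost response: same key, same request.
    second = await service.charge(key, 1999)
    return first, second, service.charges_applied
```

The key is generated once by the caller and reused on every attempt, so the retry returns the original receipt and the charge is applied a single time.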

How retries integrate with the event loop

Backoff is implemented with await asyncio.sleep(delay). This is the critical detail that makes async retries cheap: the sleep yields control back to the loop, so a coroutine waiting out a 4-second backoff consumes no thread and lets thousands of other tasks run. Contrast this with time.sleep() in a threaded client, where every backing-off worker holds a whole OS thread hostage. The resilience and error-handling overview covers the broader model for how the loop schedules these suspensions.

Because the backoff is a suspension point, it is also a cancellation point. If the enclosing task is cancelled, or an outer asyncio.timeout() (available from Python 3.11) fires, during the sleep, asyncio.sleep raises CancelledError and your retry loop unwinds. This is exactly what you want for deadline enforcement, but it means the loop must never swallow CancelledError. Since Python 3.8, CancelledError derives from BaseException rather than Exception, so a plain except Exception no longer catches it; a bare except: or except BaseException: still does, and defensive code still gets this wrong. It also means retries compose naturally with deadlines: wrap the whole retry loop in one asyncio.timeout() and the budget is enforced for free, even mid-backoff.
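
The cancellation behaviour described above can be observed with a small sketch. The backing_off and demo functions are illustrative names; the 10-second sleep stands in for a long backoff, and task.cancel() plays the role of an outer timeout firing.

```python
import asyncio


async def backing_off(log: list) -> None:
    try:
        await asyncio.sleep(10)     # stands in for a long backoff sleep
    except Exception:
        log.append("swallowed")     # not reached: CancelledError is a BaseException
    finally:
        log.append("unwound")       # cleanup still runs while the task unwinds


async def demo() -> list:
    log = []
    task = asyncio.create_task(backing_off(log))
    await asyncio.sleep(0)          # let the task start and suspend in its sleep
    task.cancel()                   # same effect as an outer deadline expiring
    try:
        await task
    except asyncio.CancelledError:
        log.append("cancelled")
    return log
```

Running demo() should record the cleanup step and the cancellation, but never the "swallowed" branch, because except Exception does not match CancelledError.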

The interaction with timeouts is bidirectional. A per-attempt timeout bounds a single slow call; a total deadline bounds the whole retry sequence. You almost always want both — see timeouts and deadlines for how the two nest.
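
One way to nest the two bounds is sketched below. The call_with_budgets helper and its parameters are hypothetical; it uses asyncio.wait_for rather than asyncio.timeout() so that it also runs on Python versions before 3.11, and it uses a short fixed pause instead of a real backoff schedule to keep the example small.

```python
import asyncio


async def call_with_budgets(coro_factory, *, per_attempt: float, total: float,
                            pause: float = 0.01):
    """Retry on per-attempt timeouts until the total budget runs out."""

    async def attempts():
        while True:
            try:
                # Inner bound: one slow call cannot consume the whole budget.
                return await asyncio.wait_for(coro_factory(), timeout=per_attempt)
            except asyncio.TimeoutError:
                await asyncio.sleep(pause)   # fixed pause, kept short for the sketch

    # Outer bound: cancels the loop even mid-attempt or mid-sleep and
    # surfaces as a TimeoutError to the caller.
    return await asyncio.wait_for(attempts(), timeout=total)
```

When the outer bound expires, the inner loop sees a cancellation rather than a timeout, so it does not mistake the end of the total budget for one more retryable failure.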

Pattern catalogue

Each pattern below builds on the last. Start with the simplest schedule your dependency tolerates and add complexity only when load testing demands it.

Fixed-interval retry

The simplest schedule: wait a constant delay between attempts. Adequate for low-concurrency callers (a cron job, a single worker) where a thundering herd is impossible because there is only one client. Avoid it for anything fan-out: N clients on a fixed interval are the herd.

import asyncio


async def fixed_retry(coro_factory, *, attempts: int = 3, delay: float = 0.5):
    """Retry a coroutine a fixed number of times with a constant gap."""
    last_exc: Exception | None = None
    for attempt in range(attempts):
        try:
            return await coro_factory()
        except (ConnectionError, TimeoutError) as exc:
            last_exc = exc
            if attempt < attempts - 1:
                await asyncio.sleep(delay)  # yields to the loop
    raise last_exc  # exhausted; surface the last transient failure

Exponential backoff

Double the delay after each failure (base * 2 ** attempt), capped at a maximum. This gives a struggling dependency exponentially more breathing room while keeping early retries snappy. It is the right default for fan-out clients — but on its own, with a deterministic schedule, it still synchronizes the herd. Cap the delay so a long failure streak doesn't push the next attempt minutes out.

import asyncio


async def exponential_retry(coro_factory, *, attempts: int = 5,
                            base: float = 0.2, cap: float = 10.0):
    last_exc: Exception | None = None
    for attempt in range(attempts):
        try:
            return await coro_factory()
        except (ConnectionError, TimeoutError) as exc:
            last_exc = exc
            if attempt < attempts - 1:
                delay = min(cap, base * 2 ** attempt)  # 0.2, 0.4, 0.8, ...
                await asyncio.sleep(delay)
    raise last_exc
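
To see what schedule those defaults produce, the delays can be computed without running any I/O. The backoff_schedule helper is a hypothetical name used only for this worked example; it applies the same min(cap, base * 2 ** attempt) formula as the loop above.

```python
def backoff_schedule(attempts: int, base: float = 0.2, cap: float = 10.0) -> list:
    """Delays slept between attempts: one fewer than the attempt count."""
    return [min(cap, base * 2 ** n) for n in range(attempts - 1)]


print(backoff_schedule(5))   # [0.2, 0.4, 0.8, 1.6]
```

With five attempts the loop sleeps four times, for about 3 seconds in total. The cap only starts to matter on longer streaks: from the seventh failure onward the uncapped delay would be 12.8 seconds, so the schedule flattens at 10.0.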

Full-jitter backoff

Take the exponential delay as a ceiling and sleep a uniformly random amount up to it: random.uniform(0, min(cap, base * 2 ** attempt)). This is the AWS-recommended "full jitter" scheme. Two clients that failed at the same instant pick independent delays, so their retries spread across the window instead of landing together. Compared with equal jitter (half the ceiling fixed, the other half random), full jitter spreads retries over the whole window and lowers the mean delay from three quarters of the ceiling to half of it, which is why it is usually the better default. The dedicated guide on exponential backoff with jitter in asyncio walks the full implementation and the jitter-variant trade-offs.

import asyncio
import random


async def jittered_retry(coro_factory, *, attempts: int = 5,
                         base: float = 0.2, cap: float = 10.0):
    last_exc: Exception | None = None
    for attempt in range(attempts):
        try:
            return await coro_factory()
        except (ConnectionError, TimeoutError) as exc:
            last_exc = exc
            if attempt < attempts - 1:
                ceiling = min(cap, base * 2 ** attempt)
                await asyncio.sleep(random.uniform(0, ceiling))  # full jitter
    raise last_exc
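
The difference between full and equal jitter can be checked numerically. This is a small simulation sketch, not a benchmark: full_jitter, equal_jitter, and mean_delay are illustrative names, and the equal-jitter formula (half the ceiling fixed, half random) is the common definition assumed here.

```python
import random


def full_jitter(ceiling: float, rng: random.Random) -> float:
    # Anywhere between zero and the exponential ceiling.
    return rng.uniform(0, ceiling)


def equal_jitter(ceiling: float, rng: random.Random) -> float:
    # Keep half the ceiling as a guaranteed floor, randomize the other half.
    return ceiling / 2 + rng.uniform(0, ceiling / 2)


def mean_delay(strategy, ceiling: float, samples: int = 20_000, seed: int = 0) -> float:
    rng = random.Random(seed)   # seeded so the comparison is repeatable
    return sum(strategy(ceiling, rng) for _ in range(samples)) / samples
```

For a ceiling of 1.6 seconds, mean_delay(full_jitter, 1.6) should land near 0.8 and mean_delay(equal_jitter, 1.6) near 1.2, matching the expected values of half and three quarters of the ceiling.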

Deadline-bounded retry

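Attempt count alone makes worst-case latency unpredictable. Bind the entire sequence to an overall deadline computed from loop.time() (the loop's monotonic clock, which is immune to system clock adjustments). Before each sleep, check how much budget remains; if the next backoff would overrun the deadline, stop now rather than sleeping into a guaranteed timeout. The check bounds only the sleeps: an attempt that is already in flight can still run past the deadline unless it has its own per-attempt timeout or the whole loop is wrapped in asyncio.timeout().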

import asyncio
import random


async def deadline_retry(coro_factory, *, total_deadline: float = 5.0,
                         base: float = 0.2, cap: float = 10.0):
    loop = asyncio.get_running_loop()
    deadline = loop.time() + total_deadline
    attempt = 0
    last_exc: Exception | None = None
    while True:
        try:
            return await coro_factory()
        except (ConnectionError, TimeoutError) as exc:
            last_exc = exc
            ceiling = min(cap, base * 2 ** attempt)
            delay = random.uniform(0, ceiling)
            remaining = deadline - loop.time()
            if delay >= remaining:   # don't sleep past the budget
                raise last_exc
            await asyncio.sleep(delay)
            attempt += 1
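
The loop above gives up as soon as the drawn delay exceeds the remaining budget. An alternative, closer to the "clip every backoff to the remaining budget" principle, is to shorten the sleep instead. The next_delay helper below is a hypothetical sketch of that variant; the reserve parameter (time held back for the attempt that follows the sleep) is an assumption of this example, not something the loop above defines.

```python
import random
from typing import Optional


def next_delay(attempt: int, remaining: float, *, base: float = 0.2,
               cap: float = 10.0, reserve: float = 0.05,
               rng: Optional[random.Random] = None) -> Optional[float]:
    """Return a full-jitter delay clipped to the budget, or None to give up."""
    rng = rng or random.Random()
    usable = remaining - reserve          # leave room for one more attempt
    if usable <= 0:
        return None                       # no budget left: surface the error
    ceiling = min(cap, base * 2 ** attempt)
    return min(rng.uniform(0, ceiling), usable)
```

A caller would compute remaining as deadline - loop.time() after each failure, re-raise the last exception when next_delay returns None, and otherwise await asyncio.sleep() on the returned value.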

Circuit breaker wrapping retries

When a dependency is hard-down, retrying every call just piles load onto a corpse. A circuit breaker tracks recent failures; after a threshold it opens and fails fast (raising immediately without a network call) for a cool-down period, then half-opens to let a single probe through. If the probe succeeds it closes; if it fails it re-opens. Retries live inside a closed breaker; an open breaker stops them entirely.

import asyncio


class CircuitBreaker:
    def __init__(self, *, fail_threshold: int = 5, reset_after: float = 30.0):
        self.fail_threshold = fail_threshold
        self.reset_after = reset_after
        self._failures = 0
        self._opened_at: float | None = None

    def _now(self) -> float:
        return asyncio.get_running_loop().time()

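    async def call(self, coro_factory):
        probing = False
        if self._opened_at is not None:
            if self._now() - self._opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            # Half-open: restart the cool-down clock so concurrent callers
            # keep failing fast while this one call acts as the probe.
            self._opened_at = self._now()
            probing = True
        try:
            result = await coro_factory()
        except Exception:
            self._failures += 1
            if probing or self._failures >= self.fail_threshold:
                self._opened_at = self._now()  # trip (or re-trip) open
            raise
        self._failures = 0  # any success resets the failure count
        if probing:
            self._opened_at = None  # probe succeeded: close the breaker
        return result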
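
The closed, open, and half-open states can also be made explicit, which helps when emitting state transitions as metrics. The sketch below is bookkeeping only, with no I/O: BreakerState, State, and the injectable clock parameter are hypothetical names for this example, and the caller is assumed to report each outcome through record_success or record_failure.

```python
import enum
import time
from typing import Callable


class State(enum.Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class BreakerState:
    """Tracks breaker state; the caller performs the call and reports back."""

    def __init__(self, fail_threshold: int = 5, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.fail_threshold = fail_threshold
        self.reset_after = reset_after
        self._clock = clock               # injectable so tests can fake time
        self._failures = 0
        self._opened_at = 0.0
        self.state = State.CLOSED
        self.transitions = []             # (from, to) pairs, e.g. for metrics

    def _move(self, new: State) -> None:
        if new is not self.state:
            self.transitions.append((self.state, new))
            self.state = new

    def allow(self) -> bool:
        """Return True if a call may proceed right now."""
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.reset_after:
                return False
            self._move(State.HALF_OPEN)   # cool-down over: admit one probe
            return True
        # HALF_OPEN means a probe is already in flight: reject everyone else.
        return self.state is State.CLOSED

    def record_success(self) -> None:
        self._failures = 0
        self._move(State.CLOSED)

    def record_failure(self) -> None:
        self._failures += 1
        if self.state is State.HALF_OPEN or self._failures >= self.fail_threshold:
            self._opened_at = self._clock()
            self._move(State.OPEN)
```

A production version would also need to handle a probe that is cancelled and never reports back, for example by timing out the half-open state; this sketch leaves that out.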

Resource boundaries under retry

Retries change the shape of your load, and the failure modes show up at the resource layer.

Retry budgets, not just per-call caps. A per-call "retry 3 times" applied across a stampede still triples fleet-wide load on a struggling dependency. Adopt a retry budget — e.g. retries may not exceed 10% of total requests in a rolling window — so the system caps amplification globally even when every individual call is "allowed" to retry.
Concurrency under retry storms. A retry holds its slot in the calling coroutine for the full backoff. With thousands of in-flight requests all backing off, you can exhaust an asyncio.Semaphore or worker pool with tasks that are sleeping, not working. Size concurrency limits for the backoff-inflated in-flight count, not the steady-state one.
Connection-pool interaction. A retried request needs a connection from the pool on every attempt. Under a retry storm the pool churns and can saturate, turning transient downstream errors into pool-acquisition timeouts upstream. Keep retries and pool sizing co-designed — see connection pooling and keepalive and the patterns for async HTTP clients and servers, where most retry traffic actually lands.
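
A retry budget of the kind described above can be sketched as a rolling-window counter. RetryBudget and its methods are hypothetical names; this version is process-local, so a truly fleet-wide budget would need shared state or a per-instance share of the ratio.

```python
import time
from collections import deque
from typing import Callable


class RetryBudget:
    """Allow retries only while they stay within `ratio` of recent requests."""

    def __init__(self, ratio: float = 0.1, window: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ratio = ratio
        self.window = window
        self._clock = clock               # injectable so tests can fake time
        self._requests = deque()          # timestamps of first attempts
        self._retries = deque()           # timestamps of granted retries

    def _expire(self, now: float) -> None:
        # Drop entries that have fallen out of the rolling window.
        for q in (self._requests, self._retries):
            while q and now - q[0] > self.window:
                q.popleft()

    def record_request(self) -> None:
        now = self._clock()
        self._expire(now)
        self._requests.append(now)

    def try_acquire_retry(self) -> bool:
        """Return True and count the retry if the budget has room."""
        now = self._clock()
        self._expire(now)
        if len(self._retries) + 1 > self.ratio * len(self._requests):
            return False
        self._retries.append(now)
        return True
```

A retry loop would call record_request() once per logical call and check try_acquire_retry() before each backoff sleep, surfacing the original error when the budget is spent.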

Integrated production example

A generic retry_async helper combining everything: a retryable-exception predicate, exponential backoff with full jitter, an overall deadline, an optional circuit breaker, and a diagnostic hook that emits per-call telemetry.

import asyncio
import logging
import random
from dataclasses import dataclass, field
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("retry")


@dataclass
class RetryStats:
    attempts: int = 0
    time_in_backoff: float = 0.0
    succeeded: bool = False


@dataclass
class RetryPolicy:
    base: float = 0.2
    cap: float = 10.0
    total_deadline: float = 5.0
    retryable: tuple[type[BaseException], ...] = (ConnectionError, TimeoutError)
    breaker: "CircuitBreaker | None" = field(default=None)


async def retry_async(
    factory: Callable[[], Awaitable[T]],
    policy: RetryPolicy,
) -> T:
    """Retry an idempotent coroutine with jittered backoff, bounded by a deadline."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + policy.total_deadline
    stats = RetryStats()
    attempt = 0
    last_exc: BaseException | None = None
    try:
        while True:
            stats.attempts += 1
            try:
                call = (policy.breaker.call(factory) if policy.breaker
                        else factory())
                result = await call
                stats.succeeded = True
                return result
            except asyncio.CancelledError:
                raise  # never swallow cancellation / deadline
            except policy.retryable as exc:
                last_exc = exc
                ceiling = min(policy.cap, policy.base * 2 ** attempt)
                delay = random.uniform(0, ceiling)  # full jitter
                remaining = deadline - loop.time()
                if delay >= remaining:
                    raise  # budget exhausted; surface the transient error
                stats.time_in_backoff += delay
                await asyncio.sleep(delay)
                attempt += 1
            # non-retryable exceptions propagate immediately (no except clause)
    finally:
        # --- Diagnostic Hook ---
        log.info(
            "retry_call attempts=%d backoff_s=%.3f outcome=%s breaker_open=%s",
            stats.attempts, stats.time_in_backoff,
            "ok" if stats.succeeded else "fail",
            policy.breaker._opened_at is not None if policy.breaker else False,
        )


# --- usage ---
async def main() -> None:
    async def flaky() -> str:
        # replace with a real idempotent call (GET, conditional PUT, ...)
        if random.random() < 0.6:
            raise ConnectionError("transient")
        return "ok"

    policy = RetryPolicy(total_deadline=3.0, breaker=CircuitBreaker())
    # wrap the whole thing in an outer timeout for a hard ceiling
    # (asyncio.timeout requires Python 3.11+)
    async with asyncio.timeout(policy.total_deadline + 0.5):
        print(await retry_async(flaky, policy))


if __name__ == "__main__":
    asyncio.run(main())

Diagnostic Hook. Emit, per call: attempts (a p99 above 1 means at least 1% of calls needed a retry, so watch the trend rather than the absolute value) and time-in-backoff (how much latency retries add to the request budget). Aggregate these into a retry rate (retried calls ÷ total), which is the input to your retry budget; a sudden climb is an early warning. Record breaker state transitions as well, since every CLOSED→OPEN edge is an incident signal. Tag all of these with the target dependency so you can see which downstream is forcing retries. If retry rate climbs while success rate doesn't recover, your retries are masking a hard failure and the breaker should open faster.

Failure modes

Failure mode	Root cause	Detection	Fix
Retry storm / thundering herd	Deterministic backoff with no jitter synchronizes every client	Downstream sees periodic load spikes aligned with retry intervals; request rate oscillates	Apply full jitter; add a fleet-wide retry budget
Duplicated side effects	Retrying a non-idempotent op (POST/charge/INSERT)	Duplicate rows, double charges, mismatched counts after a blip	Add idempotency keys / conditional writes; restrict retries to idempotent verbs
Retrying past the deadline	Backoff scheduled without checking remaining budget	Calls exceed their stated timeout; a sleep runs into a guaranteed `TimeoutError`	Clip each delay to `deadline - loop.time()`; wrap loop in `asyncio.timeout()`
Infinite / silent retries	Unbounded loop or over-broad `except` masks a permanent failure	Calls never surface errors; latency creeps; an outage looks like "slowness"	Cap attempts and deadline; only catch a retryable predicate; alert on retry rate
Concurrency / pool exhaustion under storm	Backing-off tasks hold slots/connections while sleeping	Semaphore or pool acquisition timeouts upstream during a downstream blip	Size limits for backoff-inflated in-flight count; co-design with pool sizing

Frequently Asked Questions

Which exceptions should I actually retry?

Only transient failures: connection resets, read/connect timeouts, and server-side 5xx (especially 502/503/504), plus 429 when the server sends a Retry-After. Do not retry other 4xx client errors (400, 401, 403, 404): they are deterministic and will fail identically on every attempt. Never retry programming errors (TypeError, ValueError) or CancelledError. Restrict retries to an explicit allowlist predicate rather than a broad except Exception, which silently retries bugs.
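
An allowlist predicate along these lines might look like the sketch below. The should_retry function, its parameters, and the exact status set are assumptions made for illustration: it is deliberately conservative, and it also folds in the idempotency precondition by refusing to retry non-idempotent methods unless the caller states that an idempotency key is attached.

```python
from typing import Optional

# Assumed allowlist for this sketch; tune it to what your dependency documents.
RETRYABLE_STATUS = frozenset({500, 502, 503, 504})
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "PUT", "DELETE", "OPTIONS"})


def should_retry(method: str, status: Optional[int] = None,
                 exc: Optional[BaseException] = None,
                 retry_after: Optional[str] = None,
                 has_idempotency_key: bool = False) -> bool:
    """Decide whether a failed HTTP call is safe and worthwhile to retry."""
    if method.upper() not in IDEMPOTENT_METHODS and not has_idempotency_key:
        return False                      # replaying could duplicate a write
    if exc is not None:
        # Only transport-level failures; programming errors fall through.
        return isinstance(exc, (ConnectionError, TimeoutError))
    if status == 429:
        return retry_after is not None    # only when the server says when
    return status in RETRYABLE_STATUS     # every other 4xx is deterministic
```

For example, should_retry("GET", status=503) is true, while should_retry("GET", status=404) and should_retry("POST", status=503) are both false.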

Why does backoff need jitter?

When a dependency blips, many clients fail at almost the same instant. With a deterministic backoff schedule they all wait the same amount and retry at the same instant, re-creating the exact load spike that caused the failure — a thundering herd. Full jitter (sleep a uniformly random amount up to the exponential ceiling) gives each client an independent delay, spreading retries across the window so the recovering dependency sees smooth load instead of synchronized spikes.

How do retries interact with timeouts and cancellation?

Backoff is implemented with asyncio.sleep, which is a suspension and therefore a cancellation point. If an outer asyncio.timeout() fires or the task is cancelled during a backoff, asyncio.sleep raises CancelledError and the retry loop unwinds. Wrap the whole retry loop in one asyncio.timeout() to enforce a total deadline for free, and never let a try/except swallow CancelledError, or you defeat both cancellation and deadline enforcement.

When should I use a circuit breaker instead of just retrying?

Retries assume transient failure. When a dependency is hard-down, retrying every call amplifies load on something already failing and slows your own callers. A circuit breaker tracks recent failures; after a threshold it opens and fails fast without a network call for a cool-down period, then half-opens to let one probe through. Put retries inside a closed breaker; the open breaker stops them entirely until the dependency recovers.

Should I hand-roll retries or use tenacity?

tenacity is excellent for declarative policies: retry conditions, wait strategies (wait_exponential_jitter), stop conditions, and before/after hooks compose cleanly via decorators, and it supports async functions. Hand-roll when you need tight control over deadline-aware clipping against loop.time(), a shared fleet-wide retry budget, or integration with a custom circuit breaker and metrics pipeline. Either way the principles are identical: idempotency, jitter, a deadline, and a retryable-exception predicate.

Resilience, Cancellation & Error Handling — up to the overview for the full reliability mental model.
Timeouts & Deadlines — the budget every retry must respect; nest per-attempt and total deadlines.
Cancellation Patterns — why backoff sleeps are cancellation points and how to unwind cleanly.
Exponential Backoff with Jitter in asyncio — the step-by-step build of the jittered, deadline-aware schedule.
Connection Pooling & Keepalive — how retry storms interact with and saturate connection pools.