
Exponential Backoff with Jitter in asyncio

A dependency hiccups for two seconds. Every client in your fleet fails at the same instant, waits the same fixed half-second, and retries at the same instant — re-overloading the service just as it was recovering. The retries themselves become the outage. This is the thundering-herd failure mode of naive fixed retries, and the fix is exponential backoff with jitter, bounded by a deadline so a struggling dependency never holds your callers past their latency budget.

This guide builds that retry from scratch: exponential delay, full jitter to de-synchronize clients, hard caps on delay and attempts, deadline-awareness via asyncio.timeout(), and a retryable-exception predicate so you only retry transient failures.

Prerequisites

  • Python 3.11+ — for asyncio.timeout() as the deadline mechanism.
  • Familiarity with the broader trade-offs in Retry & Backoff Strategies (idempotency, retry budgets, circuit breakers).
  • Comfort with the Resilience, Cancellation & Error Handling model — particularly that asyncio.sleep yields to the loop and is a cancellation point.
  • The operation you retry must be idempotent. Retrying a non-idempotent write duplicates side effects; fix idempotency first.

[Figure: Fixed vs jittered retries over time. Top timeline (fixed delay): clients synchronize and their retries land at the same moments, forming herd spikes. Bottom timeline (exponential backoff with full jitter): delays spread across a window that grows with each attempt.]

1. Base exponential backoff

Start with the schedule: each retry waits base * 2 ** attempt. Attempt 0 waits base, attempt 1 waits 2 * base, and so on. This gives a degrading dependency exponentially more recovery time while keeping the first retry fast.

import asyncio


async def call_with_backoff(factory, *, attempts: int = 5, base: float = 0.2):
    for attempt in range(attempts):
        try:
            return await factory()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            delay = base * 2 ** attempt  # 0.2, 0.4, 0.8, 1.6, ...
            await asyncio.sleep(delay)   # yields control to the event loop

Verify: log each delay and confirm it doubles per attempt. Note the problem we will fix next — with a fixed base, every client computes the same sequence.
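
To check the doubling without waiting on real sleeps, the schedule can be computed on its own. This is a minimal sketch; backoff_schedule is a hypothetical helper introduced here for illustration, not part of the retry function above.

```python
def backoff_schedule(base: float, attempts: int) -> list[float]:
    """Delays slept between attempts.

    The final attempt re-raises instead of sleeping, so there is one
    fewer delay than there are attempts.
    """
    return [base * 2 ** attempt for attempt in range(attempts - 1)]


if __name__ == "__main__":
    for attempt, delay in enumerate(backoff_schedule(0.2, 5)):
        print(f"after failed attempt {attempt}: sleep {delay:.1f}s")
```

With the defaults (base=0.2, attempts=5) the four delays sum to 3.0 seconds, which is the most time this version spends sleeping before it gives up.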

2. Add full jitter

Treat the exponential value as a ceiling and sleep a uniformly random amount up to it: random.uniform(0, delay). This is the scheme AWS calls "full jitter". It de-synchronizes clients, and its mean delay (half the ceiling) is lower than that of equal jitter (three quarters of the ceiling).

import asyncio
import random


async def call_with_jitter(factory, *, attempts: int = 5, base: float = 0.2):
    for attempt in range(attempts):
        try:
            return await factory()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            ceiling = base * 2 ** attempt
            await asyncio.sleep(random.uniform(0, ceiling))  # full jitter

Verify: run several concurrent callers and log retry timestamps. They should now scatter across each window instead of landing together.
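
The difference between full and equal jitter can be checked numerically before running real callers. The sketch below is illustrative only: full_jitter, equal_jitter, and mean_delay are hypothetical helpers, and a seeded random.Random is used so the comparison is repeatable.

```python
import random
import statistics


def full_jitter(ceiling: float, rng: random.Random) -> float:
    # Anywhere in [0, ceiling].
    return rng.uniform(0, ceiling)


def equal_jitter(ceiling: float, rng: random.Random) -> float:
    # Half the ceiling is fixed, the other half is random: [ceiling/2, ceiling].
    return ceiling / 2 + rng.uniform(0, ceiling / 2)


def mean_delay(strategy, ceiling: float, samples: int = 10_000, seed: int = 0) -> float:
    """Average sampled delay for one strategy at one ceiling."""
    rng = random.Random(seed)
    return statistics.fmean(strategy(ceiling, rng) for _ in range(samples))


if __name__ == "__main__":
    ceiling = 1.6  # the ceiling for attempt 3 with base=0.2
    print("full jitter mean :", round(mean_delay(full_jitter, ceiling), 2))
    print("equal jitter mean:", round(mean_delay(equal_jitter, ceiling), 2))
```

For a 1.6 second ceiling the sampled means should come out near 0.8 for full jitter and near 1.2 for equal jitter, matching the half and three-quarter expectations.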

3. Cap the delay and attempt count

Unbounded growth gets out of hand quickly: with base=0.2, the ceiling base * 2 ** attempt is already 204.8 seconds (over three minutes) at attempt 10. Clamp the ceiling with a cap, and keep the explicit attempts limit so a permanent failure surfaces instead of looping forever.

import asyncio
import random


async def call_capped(factory, *, attempts: int = 6,
                      base: float = 0.2, cap: float = 5.0):
    for attempt in range(attempts):
        try:
            return await factory()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            ceiling = min(cap, base * 2 ** attempt)  # never exceed cap seconds
            await asyncio.sleep(random.uniform(0, ceiling))

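Verify: with base=0.2 the cap first binds at attempt 5 (0.2 * 2 ** 5 = 6.4), which under the default attempts=6 is the final attempt and raises without sleeping. Raise attempts to 8 or more, then confirm no sampled delay exceeds cap=5.0 seconds and that the loop raises after attempts failures rather than hanging.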
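
To see where the cap starts to matter, the ceilings can be listed directly. This is a small sketch under the same base and cap values as the function above; capped_ceilings and first_capped_attempt are hypothetical helpers.

```python
def capped_ceilings(base: float, cap: float, attempts: int) -> list[float]:
    """Jitter ceiling for each attempt index, clamped to `cap`."""
    return [min(cap, base * 2 ** attempt) for attempt in range(attempts)]


def first_capped_attempt(base: float, cap: float) -> int:
    """Smallest attempt index whose uncapped ceiling reaches `cap`."""
    if base <= 0 or cap <= 0:
        raise ValueError("base and cap must be positive")
    attempt = 0
    while base * 2 ** attempt < cap:
        attempt += 1
    return attempt


if __name__ == "__main__":
    print(capped_ceilings(0.2, 5.0, 8))
    print("cap binds from attempt", first_capped_attempt(0.2, 5.0))
```

With base=0.2 and cap=5.0 the ceilings grow through 3.2 and then stay at 5.0 from attempt 5 onward, so a test of the cap needs enough attempts to reach that index.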

4. Make it deadline-aware with asyncio.timeout

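Attempt count doesn't bound wall-clock latency. Wrap the whole loop in asyncio.timeout() so the caller's deadline is enforced even mid-backoff: if the budget expires during an asyncio.sleep (or while factory() is suspended), the task is cancelled at that await, the loop unwinds, and asyncio.timeout() converts the cancellation into TimeoutError as the block exits. This version drops the attempts limit, so the deadline is the only bound on the loop.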

import asyncio
import random


async def call_with_deadline(factory, *, total_deadline: float = 4.0,
                             base: float = 0.2, cap: float = 5.0):
    async with asyncio.timeout(total_deadline):   # hard ceiling on everything
        attempt = 0
        while True:
            try:
                return await factory()
            except ConnectionError:
                ceiling = min(cap, base * 2 ** attempt)
                await asyncio.sleep(random.uniform(0, ceiling))
                attempt += 1

Verify: point factory at an always-failing call and confirm the helper raises TimeoutError at ~total_deadline, rather than finishing late because a backoff slept past the budget.

5. Restrict to retryable exceptions

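Catching only ConnectionError is a start, but production needs an explicit predicate so you retry only transient failures and let bugs and client errors propagate immediately. Critically, re-raise CancelledError instead of treating it as retryable. TimeoutError is listed as retryable to cover per-attempt timeouts raised inside factory; the deadline's own TimeoutError still escapes, because asyncio.timeout() raises it as the async with block exits, outside the try/except.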

import asyncio
import random

RETRYABLE = (ConnectionError, TimeoutError)


def is_retryable(exc: BaseException) -> bool:
    # e.g. also: getattr(exc, "status", None) in {502, 503, 504, 429}
    return isinstance(exc, RETRYABLE)


async def retry(factory, *, total_deadline: float = 4.0,
                base: float = 0.2, cap: float = 5.0):
    async with asyncio.timeout(total_deadline):
        attempt = 0
        while True:
            try:
                return await factory()
            except asyncio.CancelledError:
                raise                       # never retry cancellation
            except Exception as exc:
                if not is_retryable(exc):
                    raise                   # bugs / 4xx propagate at once
                ceiling = min(cap, base * 2 ** attempt)
                await asyncio.sleep(random.uniform(0, ceiling))
                attempt += 1

Verify: raise a ValueError from factory and confirm it propagates on the first attempt with no backoff; raise a ConnectionError and confirm it retries.

Verification

  • Spread, not spikes. Launch 50 callers against a dependency that fails for ~1s then recovers. Log loop.time() at each retry and bucket the timestamps. With jitter the histogram is spread out with no sharp peaks; remove jitter and you'll see sharp peaks at the fixed intervals, which is the herd.
  • Retries stop at the deadline. Drive factory to always fail and assert the call raises TimeoutError within a few milliseconds of total_deadline, never seconds late. A late finish means a backoff slept past the budget — the deadline wrap is what prevents it.
  • Non-retryable fast-fail. Assert a ValueError surfaces on attempt 1 with zero asyncio.sleep calls.
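
The first check can be approximated offline before wiring up real callers. The sketch below simulates retry timestamps arithmetically, assuming every client fails at t=0 and keeps failing; retry_times and peak_bucket are hypothetical helpers, and the 50 ms bucket width is an arbitrary choice.

```python
import random
from collections import Counter


def retry_times(clients: int, attempts: int, base: float,
                jitter: bool, seed: int = 0) -> list[float]:
    """Simulated retry timestamps for clients that all fail at t=0."""
    rng = random.Random(seed)
    times = []
    for _ in range(clients):
        t = 0.0
        for attempt in range(attempts):
            ceiling = base * 2 ** attempt
            t += rng.uniform(0, ceiling) if jitter else ceiling
            times.append(t)
    return times


def peak_bucket(times: list[float], width: float = 0.05) -> int:
    """Largest number of retries landing in one `width`-second bucket."""
    buckets = Counter(int(t / width) for t in times)
    return max(buckets.values())


if __name__ == "__main__":
    fixed = retry_times(50, 3, 0.2, jitter=False)
    spread = retry_times(50, 3, 0.2, jitter=True)
    print("peak without jitter:", peak_bucket(fixed))
    print("peak with jitter   :", peak_bucket(spread))
```

Without jitter all 50 clients land in the same bucket on every retry; with jitter the busiest bucket holds noticeably fewer. A simulation like this only sanity-checks the schedule and is not a substitute for measuring real callers.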

Pitfalls & edge cases

  • Jitter omitted "for predictability." Deterministic backoff is the thundering herd. Always jitter fan-out retries; only a single-client cron job can safely skip it.
  • Retrying CancelledError. Since Python 3.8, CancelledError derives from BaseException rather than Exception, so except Exception does not catch it, but except BaseException and a bare except: do. Retrying cancellation defeats your deadline and shutdown logic. Re-raise it explicitly, ahead of any broader handler.
  • Non-idempotent writes. This helper replays the call. If factory performs a non-idempotent POST/INSERT, retries duplicate it. Add an idempotency key or make the write conditional before retrying.
  • Unbounded growth. Without cap, 2 ** attempt reaches absurd delays; without attempts or a deadline, a permanent failure loops forever and hides an outage as "slowness."
  • Wall clock vs loop clock. Compute budgets from asyncio.get_running_loop().time() (monotonic), not time.time(). asyncio.timeout() already uses the loop clock — don't mix in wall-clock arithmetic, which can jump on NTP corrections.

Frequently Asked Questions

Why use full jitter instead of equal jitter or no jitter?

No jitter synchronizes every client into a thundering herd that re-overloads a recovering dependency. Full jitter sleeps a uniformly random amount up to the exponential ceiling, which spreads retries across the window and also lowers the mean delay compared with equal jitter (which keeps half the delay fixed). Full jitter is the AWS-recommended default for fan-out clients.

How do I stop the retry from overrunning the caller's deadline?

Wrap the entire retry loop in async with asyncio.timeout(total_deadline). Because asyncio.sleep is a suspension point, the timeout fires even during a backoff, raising TimeoutError and unwinding the loop. This bounds total wall-clock latency regardless of attempt count, so a backoff can never sleep past the budget.

Why must I avoid retrying CancelledError?

CancelledError signals that the task or its deadline is being torn down. Retrying it defeats cancellation, deadline enforcement, and graceful shutdown. Since Python 3.8 it derives from BaseException rather than Exception, so except Exception won't catch it, but except BaseException will. Re-raise it explicitly before any retryable-exception handling.
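
The class relationship is easy to confirm in an interpreter. A minimal check, where cancellation_is_exception is a hypothetical helper name:

```python
import asyncio


def cancellation_is_exception() -> bool:
    """True only if `except Exception` would catch CancelledError."""
    return issubclass(asyncio.CancelledError, Exception)


if __name__ == "__main__":
    # CancelledError sits beside Exception under BaseException, not beneath it.
    print("caught by except Exception:    ", cancellation_is_exception())
    print("caught by except BaseException:",
          issubclass(asyncio.CancelledError, BaseException))
```

On Python 3.8 and later the first line reports False and the second True, which is why a handler widened to BaseException needs the explicit re-raise.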