Resilience, Cancellation & Error Handling
In a cooperative concurrency model, failure handling is itself cooperative. The same property that makes asyncio efficient — a single thread that only switches context at await points — also dictates how a program reacts to trouble. Nothing preempts a running coroutine. Cancellation is not a signal that kills a task mid-statement; it is a CancelledError injected at the next suspension point, which the coroutine then chooses to honor or resist. A timeout is the same mechanism with a clock attached: when the deadline fires, the loop schedules a cancellation against the waiting task. And error propagation, historically the messiest part of concurrent Python, becomes deterministic once tasks are grouped under a structured parent that aggregates failures into an ExceptionGroup.
Understanding these three forces as variations on one theme — control delivered at await boundaries — is what separates a service that degrades gracefully from one that leaks tasks, hangs on a dead socket, or swallows the very signal that was supposed to shut it down. This reference covers the four pillars of asyncio resilience: cooperative cancellation patterns, timeouts and deadlines, retry and backoff strategies, and structured errors via exception groups and TaskGroups.
Why this matters for production systems: an HTTP gateway that cannot cancel a stalled upstream call will exhaust its connection pool. A worker that retries a non-idempotent write without a budget will amplify an outage into data corruption. A fan-out that ignores one child's exception will return partial results that look complete. Resilience in asyncio is not a library you bolt on; it is a discipline encoded in where you place await, what you do in finally, and how you scope your tasks.
The Conceptual Model: Control Delivered at Await Points
Every resilience primitive in asyncio reduces to one of three operations against a Task: requesting its cancellation, bounding its wall-clock lifetime, or collecting its outcome. The event loop never interrupts code between two synchronous statements — it can only act when the coroutine yields control. This is why a tight CPU loop with no await is invulnerable to both timeouts and cancellation, and why the canonical fix for an unkillable task is to add a yield point or offload the work to an executor.
Cancellation flows down a task tree. When you cancel a parent, structured constructs propagate the request to children; when a child raises, structured constructs cancel the siblings. CancelledError is special: since Python 3.8 it inherits from BaseException, not Exception, precisely so that a blanket except Exception does not accidentally swallow a shutdown request. Treat it as a control-flow signal that must reach the top of the task, not as an error to be handled and discarded.
The table below maps the primitives this reference covers. Each is a thin wrapper over the same loop machinery; choosing correctly is mostly about scope and semantics, not performance.
| Primitive | Added | Role | Key semantics |
|---|---|---|---|
| `asyncio.timeout()` | 3.11 | Deadline context manager | Cancels the body on expiry; re-raises as TimeoutError. Reschedulable. |
| `asyncio.timeout_at()` | 3.11 | Absolute-deadline context manager | Same as above but takes a loop.time() instant; ideal for shared deadlines. |
| `asyncio.wait_for()` | 3.4 | Deadline wrapper for one awaitable | Cancels and awaits the inner task; raises TimeoutError. Less composable than timeout(). |
| `Task.cancel()` | 3.4 | Request cancellation | Schedules a CancelledError at the task's next await; returns False if the task is already done, True otherwise. |
| `asyncio.CancelledError` | 3.8 (BaseException) | The cancellation signal | Must propagate to the top of the task; not caught by except Exception. |
| `asyncio.shield()` | 3.4 | Cancellation firewall | Protects an inner awaitable from outer cancellation; the outer scope still raises. |
| `Task.uncancel()` | 3.11 | Decrement cancel count | Lets a TaskGroup distinguish a swallowed cancel from a deliberate recovery. |
| `asyncio.TaskGroup` | 3.11 | Structured task scope | First child failure cancels siblings; exits raise an ExceptionGroup. |
| `ExceptionGroup` / `except*` | 3.11 | Aggregated errors | Carries multiple concurrent failures; except* filters by type. |
Cooperative Cancellation
Cancellation is the foundation of every other resilience pattern, because timeouts and structured groups both implement themselves by cancelling tasks. Calling task.cancel() does not stop the task immediately — it sets a flag and arranges for CancelledError to be raised inside the coroutine the next time it suspends at an await. Until then, the task keeps running. This is the cooperative contract: the runtime asks, and well-behaved code complies promptly by letting the exception travel upward.
Lifecycle. A cancellation request moves through three states. First, cancel() is recorded against the task and a cancellation is requested. Second, at the next suspension point the loop throws CancelledError into the coroutine; finally blocks and async with exits run as the exception unwinds the stack. Third, the task transitions to CANCELLED once the exception reaches its boundary — unless the coroutine suppressed it, in which case the task may complete normally and the cancellation is silently lost. The detailed task scheduling and lifecycle reference covers these state transitions in depth.
The single most important rule: CancelledError must propagate. Cleanup belongs in finally or in an except asyncio.CancelledError: block that re-raises. Catching it to "handle the error" turns a cooperative shutdown into a hang.
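A minimal sketch of the lifecycle and the propagation rule together. The worker and the events list are illustrative names, not part of any library. It shows that cancel() only records a request, that the exception arrives at the parked await, and that cleanup runs before the task settles as cancelled.

```python
import asyncio

events: list[str] = []


async def worker() -> None:
    events.append("acquired")
    try:
        await asyncio.sleep(10)          # suspension point where the cancel lands
    except asyncio.CancelledError:
        events.append("cancel observed")
        raise                            # always re-raise after cleanup
    finally:
        events.append("released")        # runs on every exit path


async def main() -> tuple[bool, bool, bool]:
    task = asyncio.create_task(worker())
    await asyncio.sleep(0)               # let the worker start and park at its await
    requested = task.cancel()            # records the request; nothing has stopped yet
    still_pending = not task.done()
    try:
        await task
    except asyncio.CancelledError:
        pass                             # expected: we cancelled this child ourselves
    return requested, still_pending, task.cancelled()


outcome = asyncio.run(main())
print(outcome, events)
```

Removing the raise in the except block would let the worker finish "normally", and task.cancelled() would report False: the silent loss described above.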
Common misuse. The anti-patterns cluster around two mistakes. The first is swallowing: a bare except: or except Exception that catches CancelledError (pre-3.8 habits die hard) or an explicit except asyncio.CancelledError: pass that never re-raises. The second is over-shielding. asyncio.shield() wraps an awaitable so that cancelling the outer scope does not cancel the inner operation — useful for a critical write that must finish atomically. But shield() does not make the outer await immune; the caller still receives CancelledError while the shielded coroutine keeps running in the background, now orphaned from its awaiter. Use it surgically, and always keep a reference so you can await the shielded task during shutdown. For recovery scenarios where a task legitimately absorbs a cancellation and continues, Task.uncancel() (3.11) decrements the internal cancel counter so an enclosing TaskGroup does not misread the state. Detailed recipes live in the cancellation patterns guide.
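One way to use shield() surgically is sketched below. The background set, the handler, and critical_write are hypothetical names chosen for illustration. The caller is cancelled as usual, the shielded write keeps running, and the tracked reference lets shutdown code drain it.

```python
import asyncio

log: list[str] = []
background: set[asyncio.Task] = set()    # strong references to shielded work


async def critical_write() -> None:
    await asyncio.sleep(0.05)            # stand-in for a commit that must finish
    log.append("committed")


async def handler() -> None:
    inner = asyncio.create_task(critical_write())
    background.add(inner)                        # track it so shutdown can drain it
    inner.add_done_callback(background.discard)
    await asyncio.shield(inner)          # outer cancellation stops here, not in inner


async def main() -> None:
    outer = asyncio.create_task(handler())
    await asyncio.sleep(0.01)
    outer.cancel()
    try:
        await outer
    except asyncio.CancelledError:
        log.append("handler cancelled")  # the awaiter still sees the cancellation
    # Graceful shutdown: wait for whatever shielded work is still in flight.
    await asyncio.gather(*background)


asyncio.run(main())
print(log)
```

Without the final gather, the loop could close while the write is still in flight, which is exactly the orphaned-task hazard described above.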
Timeouts & Deadlines
A timeout is cancellation with a clock. asyncio.timeout() (3.11) is the modern, composable form: an async context manager that arms a deadline relative to now, and on expiry cancels whatever is awaiting inside its block, then re-raises the cancellation as a TimeoutError at the async with boundary. Because it is just a context manager, it nests cleanly and you can reschedule its deadline mid-flight.
Lifecycle. On entry, timeout() schedules a loop.call_at() for the deadline. If the body finishes first, the timer is cancelled and nothing happens. If the deadline fires, the loop cancels the current task; the CancelledError unwinds through the body (running finally blocks), and at the boundary the context manager converts it to TimeoutError. This conversion is why you catch TimeoutError, not CancelledError, around a timeout() block — a frequent source of confusion. The asyncio.timeout_at(when) variant takes an absolute loop.time() instant, which is the correct tool for a shared deadline: compute one deadline for an entire request and pass the same instant to every downstream call so the total budget is honored rather than multiplied.
asyncio.wait_for() is the older single-awaitable form. It wraps one coroutine, cancels it on expiry, and raises TimeoutError. It is fine for simple cases but composes poorly: you cannot wrap multiple awaits in one budget without nesting, and historically it had subtle issues racing cancellation against completion. Prefer timeout()/timeout_at() on 3.11+; reserve wait_for() for protecting a single awaitable on older code or where its narrower scope is exactly what you want.
Common misuse. Two failure modes dominate. Mistuned timeouts — a deadline shorter than the realistic p99 latency turns transient slowness into a storm of retries, while one set to infinity defeats the purpose. Derive timeouts from measured latency percentiles, not round numbers. Stacked timeouts that multiply — wrapping each retry attempt in its own fresh timeout() without an overall budget means three retries at 2s each can block for 6s plus backoff, far longer than the caller expects. Always cap the total operation with an outer timeout_at() using a shared deadline. The timeouts and deadlines guide works through deriving budgets from histograms.
Retries & Backoff
A retry is simply re-awaiting an operation after it failed, gated by three policies: when to retry (which exceptions are transient), how long to wait between attempts (the backoff curve), and when to stop (the budget). Done well, retries absorb blips. Done naively, they convert a struggling dependency into a self-inflicted denial-of-service.
Idempotency first. Only retry operations that are safe to repeat. A GET or an idempotent PUT with a stable key can retry freely; a POST that creates a record cannot, unless you supply an idempotency key the server deduplicates on. Retrying a non-idempotent write after a timeout — when the first attempt may have actually succeeded — is how you get double charges and duplicate rows. Classify every retryable call before adding a loop around it.
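To make the idempotency-key idea concrete, here is a toy model. FakeLedger is entirely hypothetical and stands in for a server that deduplicates on the key; real services differ in how they store and expire keys. The point is only that the client generates the key once per logical operation and reuses it on every attempt.

```python
import uuid


class FakeLedger:
    """Hypothetical server that deduplicates writes on an idempotency key."""

    def __init__(self) -> None:
        self.charges: dict[str, int] = {}

    def charge(self, idempotency_key: str, amount: int) -> int:
        # setdefault stores the first write and returns it for every repeat.
        return self.charges.setdefault(idempotency_key, amount)


ledger = FakeLedger()
key = str(uuid.uuid4())      # one key per logical operation, generated once

ledger.charge(key, 500)      # first attempt succeeds, but imagine the reply is lost
ledger.charge(key, 500)      # retry with the same key: collapsed by the server

total_charged = sum(ledger.charges.values())
print(total_charged)
```

Generating a fresh key inside the retry loop would defeat the scheme, because each attempt would then look like a new operation.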
Backoff and jitter. Constant-interval retries from many clients synchronize into thundering herds. Exponential backoff (base * 2 ** attempt) spreads load over time; jitter — randomizing each delay within a range — spreads it across clients so they do not all retry on the same tick. The well-known "full jitter" policy picks each delay uniformly from [0, base * 2 ** attempt], which empirically minimizes contention. Cap the exponential growth so a high attempt count does not produce absurd sleeps.
Budgets. Bound retries by both a max attempt count and an overall deadline, and make the per-attempt timeout part of the shared budget rather than additive. A retry budget should also be aware of system-wide health: if a circuit-breaker is open, fail fast instead of queuing more doomed attempts. The retry and backoff strategies guide covers token-bucket retry budgets and integrating breakers. Cancellation must remain first-class inside a retry loop: never sleep on time.sleep() (it blocks the loop) and never catch CancelledError between attempts — a shutdown should abort the loop, not restart it.
Structured Errors with TaskGroup & ExceptionGroup
Before 3.11, concurrent error handling was a minefield: asyncio.gather() with default settings raises the first exception but leaves siblings running unobserved, and return_exceptions=True hides failures inside the result list where they are easy to ignore. asyncio.TaskGroup (3.11) makes propagation deterministic. It is an async context manager: you spawn children with tg.create_task(), and on exit the group awaits all of them. If any child raises, the group cancels the remaining siblings and, at the async with boundary, raises an ExceptionGroup bundling every non-cancellation error that occurred.
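The two gather() behaviours described above can be observed directly. This is an illustrative sketch with made-up coroutine names: the first call raises as soon as one child fails while the sibling keeps running, and the second tucks the failure into the result list.

```python
import asyncio

finished: list[str] = []


async def slow_ok() -> str:
    await asyncio.sleep(0.05)
    finished.append("slow_ok")
    return "ok"


async def fails() -> None:
    raise RuntimeError("boom")


async def main():
    try:
        await asyncio.gather(slow_ok(), fails())
    except RuntimeError:
        pass
    seen_at_failure = list(finished)     # the sibling has not finished yet
    await asyncio.sleep(0.1)             # but it keeps running, unobserved

    collected = await asyncio.gather(slow_ok(), fails(), return_exceptions=True)
    return seen_at_failure, collected


seen_at_failure, collected = asyncio.run(main())
print(seen_at_failure, finished, collected)
```

In the second call nothing is raised at all; the RuntimeError is just another element of the list, which is why it is easy to overlook.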
Lifecycle. The group has three phases. During the body, children run concurrently. At async with exit, the group blocks until all children finish. If all succeed, control passes through normally. If one or more fail, the first failure triggers cancellation of the siblings (via the cancellation machinery above — which is why your children must let CancelledError propagate), and the collected exceptions surface as an ExceptionGroup. This is structured concurrency: no task outlives its scope, and no failure is silently dropped. The structured concurrency with TaskGroup guide builds on the broader coroutine design patterns reference.
Handling the group. Use except* to filter an ExceptionGroup by member type — each matching except* block receives a sub-group of the matching exceptions, and unmatched members re-raise automatically:
Common misuse. The recurring mistakes: catching CancelledError inside a child (which prevents the group from cleanly cancelling siblings and can deadlock the exit); expecting a plain except SomeError to catch a single error from a group (it will not, because a group is not its members; use except*); and reaching for a TaskGroup when the children are independent and one failure should not cancel the rest, where gather(..., return_exceptions=True) is the better fit. The exception groups and TaskGroups guide details these. Structured groups also pair naturally with synchronization primitives when children share mutable state.
A Resilient Remote Call: Composing the Primitives
The production pattern combines everything above: a bounded retry loop, each attempt wrapped in a per-attempt timeout, the whole operation capped by a shared deadline via timeout_at(), full-jitter exponential backoff, cancellation-safe cleanup in finally, and a clean separation of retryable from fatal errors. The snippet below is stdlib-only and runnable; swap _simulate_call for a real aiohttp/httpx request in practice.
Note how the finally runs on every path — success, retryable failure, total-budget timeout, and external cancellation — guaranteeing the pooled resource is released. The inner TimeoutError is treated as retryable, while a cancellation from outside the function (shutdown, parent TaskGroup failure) is re-raised untouched.
Diagnostic Hook: Emit a counter per outcome label — success, retry, budget_exhausted, cancelled — and a histogram of attempt counts. A rising retry rate with stable success signals a degrading dependency before it fully fails; a spike in budget_exhausted means your total_budget is below the upstream's recovered latency, or your per-attempt timeout is too generous to leave room for retries. Track cancelled separately so shutdown-driven aborts never pollute your error rate.
Concurrency Control Under Deadlines
Resilience and concurrency limits are the same problem viewed from two angles: both bound the blast radius of a slow or failing dependency. A deadline caps time; a semaphore caps simultaneous load. Combine them and a single sick backend cannot both pile up unbounded in-flight requests and hang forever.
| Primitive | Use case | Trade-off |
|---|---|---|
| `asyncio.Semaphore(n)` | Cap concurrent calls to one dependency | Excess work waits in FIFO; a stuck holder starves the queue unless paired with a timeout |
| `asyncio.timeout_at(deadline)` | Enforce one shared budget across a fan-out | Requires computing the deadline once and threading it through; per-call timeouts are simpler but multiply |
| `asyncio.TaskGroup` | Structured fan-out with all-or-nothing failure | First failure cancels siblings; wrong choice when partial results are acceptable |
| `asyncio.shield()` | Protect a critical commit from outer cancellation | Orphans the inner task from its awaiter; must be tracked and drained on shutdown |
| `gather(return_exceptions=True)` | Independent fan-out, failures collected not propagated | Easy to ignore errors buried in the result list; no automatic sibling cancellation |
| Retry budget (token bucket) | Cap retry amplification system-wide | Adds shared state and contention; needs tuning against traffic |
The example below fans out under both a shared deadline and a concurrency cap. Every child inherits the same absolute deadline, so the entire batch — not each call — is bounded, and the semaphore ensures no more than limit requests touch the backend at once.
Catching the per-key error inside the child keeps one slow key from cancelling the whole group — a deliberate choice when partial results are acceptable. Omit the inner try and the TaskGroup reverts to all-or-nothing. These patterns sit at the boundary of the concurrent execution and worker patterns and network I/O and protocol handling references, which cover pool sizing and connection reuse.
Diagnostic Hook: Sample the semaphore's wait time, not just its value. Acquisition latency rising toward your deadline means the cap is too tight for the offered load — or the backend is slow and holders are not releasing. Pair it with the number of TimeoutErrors per batch to distinguish "we are throttling ourselves" from "the dependency is down."
Diagnostics & Tuning
Resilience bugs are quiet: a swallowed CancelledError shows up as a service that takes 30 seconds to shut down, not as a stack trace. Use this workflow to surface the three classic failures — swallowed cancellation, cancellation leaks, and mistuned timeouts.
1. Enable debug mode and slow-callback logging. Run with `PYTHONASYNCIODEBUG=1` or call `loop.set_debug(True)` and set `loop.slow_callback_duration = 0.05`. This flags coroutines that block the loop between awaits (the same code that is invulnerable to your timeouts) and surfaces `Task was destroyed but it is pending!` warnings that indicate leaked tasks.
2. Audit live tasks under load. Periodically dump `asyncio.all_tasks()` and, for any task older than expected, call `task.get_coro()` and `task.get_stack()` to see where it is parked. A task stuck at an `await` that should have been cancelled minutes ago is a swallowed-`CancelledError` smoking gun. A steadily growing count of tasks is a cancellation leak: shielded or fire-and-forget tasks that no one drains.
3. Trace cancellation delivery. Wrap suspect coroutines so they log on `CancelledError` entry and re-raise. If you see the cancel logged but the task never reaches `CANCELLED`, something downstream is suppressing it. On 3.11+, inspect `task.cancelling()` to see how many cancellation requests are outstanding versus honored.
4. Correlate timeouts with latency metrics. Plot your per-attempt and total-budget timeout values against the dependency's measured p50/p99. A timeout below p99 guarantees a baseline retry rate even when healthy; a timeout above the upstream's own deadline means you wait for failures the upstream already gave up on.
5. Export outcome metrics. Counter per outcome (`success`/`retry`/`timeout`/`cancelled`) plus a histogram of retry attempts and backoff sleep durations. Route to Prometheus/Grafana and alert on `retry` rate, not just error rate; retries are the early warning.
A minimal instrumentation hook for steps 2–3:
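One possible shape for such a hook, assuming Python 3.11+ for task.cancelling(). The report format and the deliberately buggy swallower coroutine are illustrative; only audit_pending_tasks is a name taken from the text.

```python
import asyncio


def audit_pending_tasks() -> list[dict]:
    """Snapshot every unfinished task: name, coroutine, cancel count, parked line."""
    report = []
    for task in asyncio.all_tasks():     # must be called from inside the running loop
        frames = task.get_stack(limit=1)
        parked_at = (
            f"{frames[-1].f_code.co_name}:{frames[-1].f_lineno}" if frames else "?"
        )
        coro = task.get_coro()
        report.append({
            "name": task.get_name(),
            "coro": getattr(coro, "__qualname__", repr(coro)),
            "cancelling": task.cancelling(),   # outstanding cancellation requests
            "parked_at": parked_at,
        })
    return report


async def swallower(stop: asyncio.Event) -> None:
    """Deliberately buggy: absorbs CancelledError and keeps looping."""
    while not stop.is_set():
        try:
            await asyncio.sleep(0.01)
        except asyncio.CancelledError:
            pass                         # BUG: the cancellation is lost here


async def main() -> list[dict]:
    stop = asyncio.Event()
    task = asyncio.create_task(swallower(stop), name="swallower")
    await asyncio.sleep(0)               # let it start and park
    task.cancel()
    await asyncio.sleep(0)               # let the cancel be delivered (and swallowed)
    report = audit_pending_tasks()
    stop.set()                           # let the demo task exit so the loop can close
    await task
    return report


report = asyncio.run(main())
suspects = [row["name"] for row in report if row["cancelling"] > 0]
print(suspects)
```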
Diagnostic Hook: Run audit_pending_tasks on a timer in non-production and on-demand (via a signal handler) in production. A task that appears in successive audits with cancelling > 0 but never disappears is definitively swallowing CancelledError — go fix its except block.
Common Pitfalls
| Anti-Pattern | Impact | Mitigation |
|---|---|---|
| Catching `CancelledError` and not re-raising | Hung shutdowns, zombie tasks, the loop never drains | Catch only to clean up, then raise; put unconditional cleanup in finally |
| `except Exception` wrapping an await | On 3.8+ this misses CancelledError, but a bare except: or except BaseException handler can still swallow it | Order handlers so CancelledError is re-raised first; never use bare except: |
| Per-retry timeouts with no overall budget | Total latency multiplies; caller's deadline is blown silently | Wrap the whole retry loop in timeout_at() with a shared deadline |
| Retrying non-idempotent operations | Duplicate writes, double charges, corrupted state after a timeout | Classify idempotency first; use server-side idempotency keys for writes |
| Constant-interval retries without jitter | Thundering herd synchronizes clients, amplifying the outage | Full-jitter exponential backoff with a capped ceiling |
| `time.sleep()` for backoff | Blocks the entire event loop; starves every other task | Always await asyncio.sleep(); it is cancellable and cooperative |
| Treating an `ExceptionGroup` like a single exception | except SomeError misses grouped failures; errors slip through | Use except* to filter group members by type |
| `shield()` without tracking the inner task | Orphaned coroutine runs unobserved after its awaiter is cancelled | Keep the task reference and await it during graceful shutdown |
Frequently Asked Questions
Why does my asyncio.timeout() block raise TimeoutError instead of CancelledError?
By design. asyncio.timeout() implements the deadline by cancelling the task inside its body, but at the async with boundary it converts that CancelledError into a TimeoutError so callers can distinguish a deadline from an external cancellation. Catch TimeoutError around the block. If you see CancelledError escape instead, the cancellation came from outside the timeout (a parent or shutdown), and you should let it propagate.
Is it ever safe to catch CancelledError?
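Yes, to run cleanup; never to suppress. The valid pattern is except asyncio.CancelledError: <clean up>; raise, or simply putting cleanup in a finally block. The only cases where you legitimately do not re-raise are deliberate recoveries. If you cancelled a sub-task yourself and are awaiting it, the CancelledError you catch belongs to that sub-task, and you can continue as long as asyncio.current_task().cancelling() is 0 (3.11+), meaning nobody has asked the current task to stop. If the current task absorbs a cancellation aimed at itself, call Task.uncancel() on 3.11+ so an enclosing TaskGroup reads the state correctly.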
When should I use asyncio.timeout() versus asyncio.wait_for()?
On Python 3.11+, prefer asyncio.timeout() (and timeout_at() for shared deadlines): it is a context manager, so it wraps an arbitrary block, nests cleanly, and can be rescheduled. Use wait_for() only for protecting a single awaitable on older runtimes or where its narrower one-coroutine scope is exactly the semantics you want.
How do I retry a POST request safely?
Only with an idempotency key the server deduplicates on. After a timeout you cannot know whether the first POST succeeded, so a blind retry risks a duplicate write. Generate a stable key per logical operation, send it on every attempt, and let the server collapse duplicates. Without that guarantee, do not retry non-idempotent writes — fail and surface the ambiguity.
Why does except ValueError not catch the error from my TaskGroup?
A TaskGroup raises an ExceptionGroup, which is not an instance of its members. Use except* ValueError to match grouped exceptions by type — each except* block receives a sub-group of matching errors, and anything unmatched re-raises automatically. A plain except ValueError will only fire if you have a lone, non-grouped error.
How do I detect a cancellation leak in production?
Periodically dump asyncio.all_tasks() and look for tasks that persist across audits with task.cancelling() > 0 but never reach the CANCELLED state — they are swallowing CancelledError. A monotonically growing task count points to undrained shield()ed or fire-and-forget tasks. Enable loop.set_debug(True) to also catch Task was destroyed but it is pending! warnings.
Related
- Asyncio Fundamentals & Event Loop Architecture: the loop mechanics and await semantics every resilience primitive builds on.
- Timeouts & Deadlines: deriving budgets from latency percentiles and propagating shared deadlines.
- Cancellation Patterns: cooperative shutdown, shielding, and cleanup recipes that keep `CancelledError` flowing.
- Retry & Backoff Strategies: idempotency, full-jitter backoff, and retry budgets with circuit breakers.
- Exception Groups & TaskGroups: structured fan-out, `except*` filtering, and aggregated failure handling.
- Concurrent Execution & Worker Patterns: pool sizing and backpressure for the bounded-concurrency patterns shown here.