Exception Groups & TaskGroups

This reference covers structured error handling for concurrent asyncio work — specifically how asyncio.TaskGroup (Python 3.11+) supervises a set of child tasks as a single unit and how the failures it collects arrive as an ExceptionGroup that you unpack with except*. The scope is narrow on purpose: not error handling in general, but the precise contract of a group of tasks that succeed together or fail together, the catalogue of patterns for reacting to one-or-many concurrent failures, and the resource boundaries that keep a group from over-committing the loop. Get this wrong and you see the signature symptoms of the 3.11 transition — a try/except SomeError that suddenly catches nothing because the error is now wrapped, sibling tasks that keep mutating state after their peer has already failed, or a non-matching subgroup that re-raises out of a handler you thought was exhaustive.

A TaskGroup is the asyncio realization of structured concurrency: every task it spawns has a lifetime bounded by the async with block, and the block cannot exit while any child is still running. Everything in this guide flows from that single invariant. Because no child can outlive the scope, the group can offer an all-or-nothing guarantee; because failures may happen in several children at once, it cannot surface them one at a time, so it aggregates them into a group exception. The new control-flow construct, except*, exists precisely to filter that aggregate by type.

Architectural principles

  • A group is all-or-nothing. The async with asyncio.TaskGroup() block does not return until every child task has finished — successfully, by raising, or by being cancelled. There is no path where the block exits with a child still in flight, which is what makes a group safe to reason about as a unit.
  • One failure cancels the group. The instant any child raises (anything other than CancelledError), the group cancels every other child and stops accepting new create_task() calls. Siblings stop at their next suspension point rather than running to completion behind a failure.
  • Errors arrive as a group, not singly. Because several children can fail in the same scheduling window, the group raises an ExceptionGroup (or BaseExceptionGroup) bundling all the non-cancellation errors. Even a single failure is delivered wrapped — a plain except SomeError will not catch it.
  • except* filters by type and re-raises the rest. Each except* clause peels the matching exceptions out of the group and runs once with a sub-group of just those. Anything no clause matched is automatically re-raised as a residual group, so unhandled error types are never silently swallowed.
  • Never start a task you do not await. A TaskGroup only supervises tasks created through its own tg.create_task(). A bare asyncio.create_task() spawned inside the block escapes the group's lifetime and cancellation guarantees entirely — it is a leak waiting to happen.

How a TaskGroup integrates with the loop

A TaskGroup is a thin supervisor built on top of the same primitives covered in task scheduling & lifecycle: loop.create_task to schedule each child and structured cancellation to retire them. When you call tg.create_task(coro), the group calls the loop's task factory exactly as a bare create_task() would — the child is appended to the ready queue and runs in discrete steps like any other task — but the group also registers a done-callback on it and holds a strong reference, so the child can neither vanish nor go unobserved. The async with block's __aexit__ is where the structure lives: it awaits an internal future that completes only when the count of unfinished children reaches zero.

The failure path is the interesting one. When a child's done-callback fires with an exception that is not CancelledError, the group records it, then calls cancel() on every other still-pending child. Those cancellations propagate through the normal cooperative mechanism — a CancelledError is scheduled to raise at each sibling's next await — which is why a sibling stuck in synchronous CPU work will not stop until it yields. Once all children have settled (the failed one, the cancelled ones, and any that finished before the cancel landed), __aexit__ collects every recorded non-cancellation exception and raises them as one ExceptionGroup. The CancelledErrors injected by the group itself are deliberately stripped out, so you do not see your own teardown noise in the result. For the broader framing of why fail-fast cancellation is the safe default and how to make cleanup honour it, this section sits under Resilience, Cancellation & Error Handling, and the cooperative mechanics it depends on are detailed in cancellation patterns.

Figure: TaskGroup failure propagation into an ExceptionGroup. A TaskGroup spawns three children; child B raises, the group cancels the surviving sibling, and the collected non-cancellation errors surface as an ExceptionGroup unpacked by except*. Diagram labels: async with TaskGroup() supervises children; child A returns ok; child B raises ValueError; child C is cancelled; the ExceptionGroup is handled by except* ValueError.

Pattern catalogue

Each pattern is a different answer to one question: how should one-or-many concurrent failures be surfaced and handled? Choose by the failure semantics you need first.

Basic TaskGroup fan-out

Use this as the default for concurrent work where every task must succeed. The block awaits all children; if all return, you read their results after the block. There is no return_exceptions knob — success is the only non-raising outcome.

import asyncio


async def fetch(endpoint: str) -> dict:
    await asyncio.sleep(0.05)  # the actual I/O
    return {"endpoint": endpoint, "status": "ok"}


async def main() -> list[dict]:
    endpoints = ["/users", "/orders", "/inventory"]
    async with asyncio.TaskGroup() as tg:
        tasks = [tg.create_task(fetch(e)) for e in endpoints]
    # Reached only if every child succeeded.
    return [t.result() for t in tasks]


print(asyncio.run(main()))

The trade-off is strictness: a single failure aborts the whole batch. That is exactly what you want when the results are interdependent and a partial set is useless.

Handling ExceptionGroup with except*

Use except* whenever a group can fail, which is any time you wrap a TaskGroup. Each clause receives a sub-group containing only the matching exception types and runs at most once. This is the canonical way to react to grouped failures; the dedicated walkthrough is handling ExceptionGroup from TaskGroup.

import asyncio


async def step(name: str, fail: bool) -> str:
    await asyncio.sleep(0.02)
    if fail:
        raise ConnectionError(f"{name} unreachable")
    return name


async def main() -> None:
    try:
        async with asyncio.TaskGroup() as tg:
            tg.create_task(step("cache", fail=True))
            tg.create_task(step("db", fail=False))
            tg.create_task(step("queue", fail=True))
    except* ConnectionError as eg:
        for exc in eg.exceptions:
            print("connection failure:", exc)
    except* ValueError as eg:
        print("validation failures:", len(eg.exceptions))


asyncio.run(main())

The trade-off versus a plain except is that you must enumerate the types you expect; anything you do not match is automatically re-raised, which is a feature — it prevents you from accidentally swallowing an error class you did not plan for.

gather vs TaskGroup trade-offs

Use gather(return_exceptions=True) when you genuinely want every task to run to completion regardless of failures and to receive a positional list of results-or-exceptions. Use TaskGroup when a failure should abort the rest. The semantic gap is the whole reason the two coexist.

import asyncio


async def task(i: int) -> int:
    await asyncio.sleep(0.01 * i)
    if i == 2:
        raise RuntimeError(f"task {i} failed")
    return i


async def with_gather() -> list:
    # All tasks run to completion; failures become result values.
    return await asyncio.gather(*(task(i) for i in range(4)),
                                return_exceptions=True)


async def with_taskgroup() -> None:
    # First failure cancels the rest; errors arrive grouped.
    async with asyncio.TaskGroup() as tg:
        for i in range(4):
            tg.create_task(task(i))


print(asyncio.run(with_gather()))
# [0, 1, RuntimeError('task 2 failed'), 3]

gather with return_exceptions=False (the default) propagates the first exception but leaves siblings running in the background — the asymmetry that motivates TaskGroup for new code. The contrast in scheduling semantics is laid out fully in task scheduling & lifecycle.
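The asymmetry can be observed directly. In this sketch (the names fails_fast and slow_sibling are illustrative), gather raises as soon as the first coroutine fails, yet the sibling still runs to completion afterwards.

```python
import asyncio


async def fails_fast() -> None:
    await asyncio.sleep(0.01)
    raise RuntimeError("first failure")


async def slow_sibling(log: list[str]) -> None:
    await asyncio.sleep(0.05)
    log.append("sibling ran to completion")


async def main() -> list[str]:
    log: list[str] = []
    try:
        await asyncio.gather(fails_fast(), slow_sibling(log))
    except RuntimeError as exc:
        log.append(f"gather raised: {exc}")
    await asyncio.sleep(0.1)  # give the unsupervised sibling time to finish
    return log


print(asyncio.run(main()))
```

With a TaskGroup in place of gather, the sibling would be cancelled at its sleep and the second log entry would not appear.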

Collecting partial results despite a sibling failure

Use this when you want both the all-or-nothing supervision of a group and whatever results managed to complete before the abort. Have each child write into shared structures rather than relying on the group's collective result, then inspect them in the handler.

import asyncio


async def fetch(shard: int, out: dict[int, int]) -> None:
    await asyncio.sleep(0.01 * shard)
    if shard == 3:
        raise TimeoutError(f"shard {shard} timed out")
    out[shard] = shard * 100  # record success before any sibling aborts


async def main() -> dict[int, int]:
    results: dict[int, int] = {}
    try:
        async with asyncio.TaskGroup() as tg:
            for s in range(6):
                tg.create_task(fetch(s, results))
    except* TimeoutError as eg:
        print(f"{len(eg.exceptions)} shard(s) failed; kept {len(results)}")
    return results  # the shards that completed before the cancel landed


print(asyncio.run(main()))

The trade-off is determinism: which siblings completed before cancellation propagated depends on scheduling order, so treat the partial set as best-effort, never as a guaranteed prefix.

TaskGroup under an outer asyncio.timeout

Use this to bound the wall-clock time of an entire fan-out. Wrapping the group in asyncio.timeout() cancels every child on deadline. A clean deadline surfaces as a plain TimeoutError; if a child has also failed in the same window, that child's error arrives as an ExceptionGroup instead.

import asyncio


async def slow(name: str, t: float) -> str:
    await asyncio.sleep(t)
    return name


async def main() -> None:
    try:
        try:
            async with asyncio.timeout(0.1):
                async with asyncio.TaskGroup() as tg:
                    tg.create_task(slow("fast", 0.02))
                    tg.create_task(slow("slow", 5.0))  # exceeds deadline
        except TimeoutError:
            # Plain except needs its own try: mixing it with except* is a SyntaxError.
            print("deadline hit; group cancelled")
    except* Exception as eg:
        print("group errors:", [type(e).__name__ for e in eg.exceptions])


asyncio.run(main())

Order matters: the deadline cancellation from asyncio.timeout is delivered by the context manager itself, so a clean timeout surfaces as a plain TimeoutError, not a group. Handle both cases, but in separate try statements: a plain except TimeoutError and an except* clause cannot share one try, so nest the plain handler inside the try that carries except*. The choice between asyncio.timeout and wait_for for this is covered in timeouts & deadlines.

Resource boundaries

Patterns decide how failures surface; boundaries decide how much the group runs at once. A TaskGroup imposes no concurrency cap of its own — tg.create_task() in a loop over a large input materializes one task per item immediately, exactly like a bare create_task() fan-out.

Bounding a group with a Semaphore. Gate the body of each child behind an asyncio.Semaphore so only N children hold a connection or run their critical section concurrently, while the group still supervises all of them:

import asyncio

sem = asyncio.Semaphore(10)  # at most 10 concurrent fetches


async def bounded_fetch(url: str, out: list) -> None:
    async with sem:
        await asyncio.sleep(0.05)  # the actual I/O
        out.append(url)


async def main() -> None:
    urls = [f"https://api/{i}" for i in range(500)]
    results: list = []
    async with asyncio.TaskGroup() as tg:
        for u in urls:
            tg.create_task(bounded_fetch(u, results))
    print("fetched", len(results))


asyncio.run(main())

The semaphore caps in-flight I/O without limiting how many tasks exist. For genuinely large or streaming inputs, do not create one task per item even inside a group — feed a bounded queue drained by a fixed worker pool so memory stays flat, as detailed in task scheduling & lifecycle. Note one consequence of the semaphore boundary on failure: when a child raises, the group cancels the others, but children still blocked waiting on sem.acquire() are cancelled cleanly there — they never enter their critical section, so no half-open connection leaks.

Integrated production example

The following ties the catalogue together: a fan-out that aggregates results into a shared structure, bounds concurrency with a semaphore, runs under an overall deadline, and surfaces multiple distinct failure types through except* with full diagnostic logging.

import asyncio
import logging
from dataclasses import dataclass, field

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("aggregator")


@dataclass
class Report:
    ok: dict[str, dict] = field(default_factory=dict)
    failures: list[tuple[str, str]] = field(default_factory=list)


async def call_service(name: str, sem: asyncio.Semaphore, report: Report) -> None:
    async with sem:                       # bound concurrent in-flight calls
        await asyncio.sleep(0.05)
        if name == "billing":
            raise ConnectionError("billing pool exhausted")
        if name == "fraud":
            raise ValueError("fraud model returned NaN score")
        report.ok[name] = {"status": "ok"}  # record before any sibling aborts


def log_group(eg: BaseException, depth: int = 0) -> None:
    """Walk the full ExceptionGroup tree so nothing is lost in logs."""
    pad = "  " * depth
    if isinstance(eg, BaseExceptionGroup):
        logger.error("%sgroup: %s (%d sub)", pad, eg.message, len(eg.exceptions))
        for sub in eg.exceptions:
            log_group(sub, depth + 1)
    else:
        logger.error("%sleaf: %r", pad, eg)


async def aggregate(services: list[str]) -> Report:
    report = Report()
    sem = asyncio.Semaphore(8)
    try:
        try:
            async with asyncio.timeout(2.0):           # overall wall-clock budget
                async with asyncio.TaskGroup() as tg:
                    for name in services:
                        tg.create_task(call_service(name, sem, report))
        except TimeoutError:
            # Plain except needs its own try: mixing it with except* is a SyntaxError.
            logger.error("aggregation exceeded deadline")
            report.failures.append(("*", "deadline"))
    except* ConnectionError as eg:
        log_group(eg)
        for e in eg.exceptions:
            report.failures.append(("connection", str(e)))
    except* ValueError as eg:
        log_group(eg)
        for e in eg.exceptions:
            report.failures.append(("validation", str(e)))
    return report


async def main() -> None:
    services = ["users", "billing", "inventory", "fraud", "shipping"]
    report = await aggregate(services)
    total = len(services)
    partial_rate = len(report.failures) / total if total else 0.0
    logger.info(
        "ok=%d failed=%d partial_failure_rate=%.2f",
        len(report.ok), len(report.failures), partial_rate,
    )


asyncio.run(main())

Diagnostic Hook: The log_group walker is the production-critical piece. An ExceptionGroup can nest — a child that is itself a TaskGroup raises a group, which the parent then nests inside its own — and a naive logger.error(eg) prints only the top message, hiding the leaf tracebacks that actually identify the fault. Recursively traversing .exceptions and logging every leaf with %r guarantees each underlying error is captured. Pair it with the partial_failure_rate metric: export it per fan-out and alert when it crosses a threshold, since a creeping rate is the leading indicator of a degrading downstream dependency long before the group fails outright.

Diagnostic Hook — group health metrics & flags

Instrument three signals around any TaskGroup:

  • Partial-failure rate: count len(eg.exceptions) over the number of children spawned and export it as a gauge per fan-out; a rising rate flags a degrading dependency.
  • Subgroup type histogram: tag each leaf in the group by type(exc).__name__ so you can see at a glance whether failures are timeouts, connection errors, or validation errors, since each demands different remediation.
  • Unhandled re-raise alarm: wrap the outermost call site in a broad except* Exception that logs the full tree before re-raising, and run with PYTHONASYNCIODEBUG=1 so that the "Task exception was never retrieved" report for any task created outside the group also shows where that task was created.

Failure modes

| Failure mode | Root cause | Detection | Fix |
| --- | --- | --- | --- |
| Only the first error is noticed | A single except caught (or appeared to catch) the group and read just one exception | Other concurrent failures never appear in logs or metrics | Use except* and iterate eg.exceptions; log every leaf by walking the tree |
| Handler raises unexpectedly | A non-matching subgroup is auto-re-raised because no except* clause matched its type | An ExceptionGroup escapes past handlers you believed were exhaustive | Add an except* Exception (or except* for each expected type) at the outermost scope; log and decide deliberately |
| try/except SomeError catches nothing | Group wraps even a single error, so a plain except SomeError no longer matches | Code that worked pre-3.11 around gather silently stops handling errors | Switch to except* SomeError; remember a plain except and except* cannot be mixed in one statement |
| Siblings keep mutating state after a failure | Expecting gather semantics where peers run on; or work done in a finally ignores the cancel | Inconsistent partial writes; effects from a task whose batch already failed | Treat the group as all-or-nothing; make side effects idempotent or transactional, and honour cancellation in cleanup |
| A spawned task leaks past the group | A bare asyncio.create_task() used inside the block instead of tg.create_task() | Task survives the async with; Task exception was never retrieved at GC | Always create children via tg.create_task(); never start a task you do not await within the group |
| CancelledError swallowed inside a child | A child catches CancelledError in cleanup and does not re-raise, so the group cannot retire it | The async with block hangs on exit; teardown stalls | Re-raise CancelledError after cleanup, per cancellation patterns |

Frequently Asked Questions

Why does my try/except stop catching errors after switching from gather to TaskGroup?

A TaskGroup never raises a single exception; it wraps every non-cancellation failure, even a lone one, in an ExceptionGroup. A plain except SomeError does not match an ExceptionGroup, so it catches nothing. Use except* SomeError, which peels matching exceptions out of the group and runs your handler with a sub-group of just those types.

What happens to the other tasks in a TaskGroup when one task raises?

The instant a child raises anything other than CancelledError, the group cancels every other still-pending child and stops accepting new create_task calls. Siblings stop at their next await point. Once all children have settled, the group raises an ExceptionGroup containing the non-cancellation errors; the cancellations it injected itself are stripped from the result.

How is asyncio.TaskGroup different from gather with return_exceptions=True?

gather(return_exceptions=True) runs every task to completion and returns a positional list mixing results and exception objects; no task is cancelled on failure. TaskGroup is all-or-nothing: the first failure cancels the remaining siblings and the errors surface together as an ExceptionGroup. Use gather when you want every result regardless of failures, and TaskGroup when a failure should abort the rest.

Can I collect partial results from a TaskGroup when one task fails?

Yes, but not from the group's collective result, which is unavailable once it raises. Have each child write its result into a shared dict or list as it completes, then read that structure inside the except* handler. Which siblings finished before cancellation propagated depends on scheduling order, so treat the partial set as best-effort.

How do I combine a TaskGroup with an overall timeout?

Wrap the TaskGroup in an outer async with asyncio.timeout(seconds) block. On deadline, asyncio.timeout cancels every child and raises a plain TimeoutError from the context manager, so handle it with a normal except TimeoutError. If a child also fails in the same window, the error may instead arrive inside the group, so keep an except* clause as well. Put the two handlers in separate, nested try statements, because a plain except and except* cannot be mixed in one try.