
How to properly configure asyncio event loops for production

A coroutine that runs cleanly under python app.py on a laptop will, unchanged, exhibit three specific failures in production: it swallows exceptions in fire-and-forget tasks until garbage collection logs them minutes late, it stalls every concurrent request the first time a synchronous database driver is called, and it drops in-flight connections when the orchestrator sends SIGTERM ahead of SIGKILL. None of these are bugs in your business logic — they are the consequences of running the event loop with default configuration. This guide is the concrete, ordered checklist that closes all three gaps: swap the backend, size the executor, install an error boundary with debug gated off, and wire signals into a deterministic shutdown. Each step ends with the exact command or assertion that proves it took effect.

[Figure: Asyncio production hardening pipeline, from default loop to hardened daemon. Four ordered steps, each with the failure it prevents and the check that confirms it.]
  1. Backend: uvloop via loop_factory. Stops slow dispatch. Check: __module__.
  2. Executor: sized pool + Semaphore. Stops loop stalls. Check: qsize bounded.
  3. Diagnostics: debug off + handler. Stops silent errors. Check: get_debug() is False.
  4. Shutdown: signals + deadline. Stops dropped requests. Check: clean SIGTERM.
All four run inside one asyncio.Runner(loop_factory=...) before the loop iterates. Skip any step and the matching production failure returns.

Prerequisites

  • Python 3.11+. The bootstrap uses asyncio.Runner(loop_factory=...), asyncio.TaskGroup, asyncio.timeout(), and exception groups (except*). These remove the need for the deprecated policy API.
  • uvloop for the fast backend on Linux/macOS: pip install uvloop. The code keeps a selector-loop fallback so Windows and minimal images still run.
  • Familiarity with the loop iteration model. This guide configures the stages described in the Event Loop Configuration overview, which itself sits under Asyncio Fundamentals & Event Loop Architecture. If terms like ready queue, selector poll, and slow callback are unfamiliar, read those first.
  • A process you can send signals to (a container, or a local run you can kill -TERM).

1. Select the loop backend with a safe fallback

The default backend is SelectorEventLoop on Unix and ProactorEventLoop on Windows. The pure-Python selector loop is correct but spends measurable time in Python-level dispatch under high file-descriptor counts. uvloop replaces the core with libuv and typically delivers 2–4x network throughput. Select it through loop_factory rather than the policy API, and always retain the selector loop as a fallback so a missing wheel does not crash the service.

# bootstrap.py
import asyncio
import logging

logger = logging.getLogger("service")


def make_loop_factory():
    """Prefer uvloop; fall back to the stdlib loop on Windows/Alpine."""
    try:
        import uvloop
        logger.info("loop backend: uvloop")
        return uvloop.new_event_loop
    except ImportError:
        logger.warning("uvloop unavailable; using selector loop")
        return asyncio.new_event_loop


async def main() -> None:
    loop = asyncio.get_running_loop()
    logger.info("active backend: %s", type(loop).__module__)
    await asyncio.sleep(0)


if __name__ == "__main__":
    with asyncio.Runner(loop_factory=make_loop_factory()) as runner:
        runner.run(main())

Verify: the log line should read active backend: uvloop in production. Assert it in a smoke test:

assert asyncio.get_running_loop().__class__.__module__ == "uvloop"

2. Replace and size the default executor

The loop runs on one thread, so any synchronous call blocks every coroutine until it returns. Route blocking work through run_in_executor, and replace the implicit default pool with an explicitly sized one. The default ThreadPoolExecutor is created lazily with min(32, CPU count + 4) workers, so its thread count is capped, but its work queue is unbounded. For I/O-bound blocking calls, min(32, (os.cpu_count() or 1) * 4) is a safe start; cap in-flight submissions with a Semaphore so that queue cannot grow without limit.

import asyncio
import os
from concurrent.futures import ThreadPoolExecutor


def configure_executor(loop: asyncio.AbstractEventLoop) -> ThreadPoolExecutor:
    max_workers = min(32, (os.cpu_count() or 1) * 4)
    executor = ThreadPoolExecutor(max_workers=max_workers, thread_name_prefix="io-worker")
    loop.set_default_executor(executor)
    return executor


async def call_blocking(sem: asyncio.Semaphore, fn, *args):
    async with sem:  # bound submissions to pool capacity
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, fn, *args)

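Verify: under load, the executor's internal counters should stay bounded. Log them periodically. Note that _work_queue and _threads are private CPython attributes, so treat them as diagnostics only and never as an API to build on: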

print("queue:", executor._work_queue.qsize(), "threads:", len(executor._threads))

A queue depth that climbs while threads is pinned at max_workers means callers are submitting faster than the pool drains — tighten the semaphore.
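The bounding behaviour described above can be checked in isolation. The following is a minimal sketch, not the guide's reference implementation: PeakCounter, blocking_io, run_burst, and the MAX_WORKERS value of 4 are hypothetical names and numbers chosen for the demo, and time.sleep stands in for a synchronous driver call. It fires a burst of 20 calls and records the highest number of blocking calls in flight and the deepest queue observed.

```python
import asyncio
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4  # illustrative; use the sizing formula from this step in a real service


class PeakCounter:
    """Thread-safe counter that remembers the highest concurrent value."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.current = 0
        self.peak = 0

    def __enter__(self) -> "PeakCounter":
        with self._lock:
            self.current += 1
            self.peak = max(self.peak, self.current)
        return self

    def __exit__(self, *exc) -> None:
        with self._lock:
            self.current -= 1


def blocking_io(counter: PeakCounter) -> None:
    with counter:
        time.sleep(0.02)  # stand-in for a synchronous driver call


async def run_burst(n_calls: int = 20) -> tuple[int, int]:
    loop = asyncio.get_running_loop()
    executor = ThreadPoolExecutor(max_workers=MAX_WORKERS, thread_name_prefix="io-worker")
    loop.set_default_executor(executor)
    sem = asyncio.Semaphore(MAX_WORKERS)  # one permit per worker thread
    counter = PeakCounter()
    max_queued = 0

    async def one_call() -> None:
        nonlocal max_queued
        async with sem:
            fut = loop.run_in_executor(None, blocking_io, counter)
            # _work_queue is a private CPython attribute; read it for diagnostics only.
            max_queued = max(max_queued, executor._work_queue.qsize())
            await fut

    await asyncio.gather(*(one_call() for _ in range(n_calls)))
    return counter.peak, max_queued


peak, max_queued = asyncio.run(run_burst())
print("peak concurrent blocking calls:", peak, "max queued:", max_queued)
```

Because the semaphore has as many permits as the pool has workers, neither number should exceed MAX_WORKERS even though 20 coroutines are waiting. Removing the semaphore leaves the peak unchanged but lets the queue depth grow with the burst size.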

3. Disable debug and install an exception boundary

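Debug mode (PYTHONASYNCIODEBUG=1 or loop.set_debug(True)) can add 10–30% per-tick latency and retains stack frames, so it must be off in production by default and gated behind a flag. Independently, install a loop exception handler. An exception in a detached task is reported only when the task object is garbage-collected without anyone having retrieved the exception, and by default that report goes to the asyncio logger. A custom handler does not make the report arrive earlier; it routes it, along with other unhandled loop errors, into your logging pipeline in a format you control. The handler below ignores CancelledError (expected during shutdown) and logs everything else.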

import asyncio
import logging
import os
import traceback
from typing import Any

logger = logging.getLogger("asyncio.errors")


def exception_handler(loop: asyncio.AbstractEventLoop, context: dict[str, Any]) -> None:
    exc = context.get("exception")
    if isinstance(exc, asyncio.CancelledError):
        return
    logger.error(
        "loop exception: %s | %s",
        context.get("message", "unhandled"),
        "".join(traceback.format_exception(exc)) if exc else "",
    )
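    # Do not also call loop.default_exception_handler(context) here: it would
    # log the same failure a second time through the "asyncio" logger.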


def configure_diagnostics(loop: asyncio.AbstractEventLoop) -> None:
    loop.set_debug(os.getenv("PYTHONASYNCIODEBUG") == "1")
    loop.slow_callback_duration = 0.1  # threshold for slow-callback warnings (100 ms); only reported in debug mode
    loop.set_exception_handler(exception_handler)

Verify: confirm the boundary actually catches a detached failure.

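import gc

async def _probe(loop):
    loop.create_task(asyncio.sleep(0, result=None))  # benign: finishes without error
    loop.create_task(_raises())                      # unreferenced: should hit handler
    await asyncio.sleep(0.05)
    gc.collect()  # the report fires when the failed task is collected; force it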

async def _raises():
    raise RuntimeError("probe")

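You should see one loop exception: ... RuntimeError: probe line, and loop.get_debug() should return False in production. The line appears when the failed, unreferenced task is collected, which CPython's reference counting normally does as soon as the task finishes.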

4. Wire signals into a deterministic shutdown

An orchestrator sends SIGTERM and waits terminationGracePeriodSeconds before SIGKILL. The service must intercept the signal in the loop thread, cancel in-flight tasks, await their cleanup inside a deadline shorter than the grace period, and let Runner drain async generators and close the loop. Use loop.add_signal_handler (loop-thread-safe), never signal.signal.

import asyncio
import logging
import signal

SHUTDOWN_GRACE = 25.0  # keep below the orchestrator's grace period


async def graceful_shutdown() -> None:
    tasks = [t for t in asyncio.all_tasks() if t is not asyncio.current_task()]
    for t in tasks:
        t.cancel()
    try:
        async with asyncio.timeout(SHUTDOWN_GRACE):
            await asyncio.gather(*tasks, return_exceptions=True)
    except TimeoutError:
        logging.error("shutdown exceeded %.0fs; tasks orphaned", SHUTDOWN_GRACE)


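_shutdown_task: asyncio.Task | None = None  # strong reference; the loop holds tasks only weakly


def _begin_shutdown() -> None:
    global _shutdown_task
    if _shutdown_task is None:  # a repeated signal must not start a second shutdown
        _shutdown_task = asyncio.ensure_future(graceful_shutdown())


def install_signals(loop: asyncio.AbstractEventLoop) -> None:
    for sig in (signal.SIGINT, signal.SIGTERM):
        try:
            loop.add_signal_handler(sig, _begin_shutdown)
        except NotImplementedError:
            logging.warning("signal %s unsupported on this platform", sig.name)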

Verify: run the service, send kill -TERM <pid>, and confirm it exits cleanly within the grace window with no Task was destroyed but it is pending warnings. Time the gap between signal and exit; if it approaches SHUTDOWN_GRACE, a task is not re-raising on cancel.

Verification

After composing steps 1–4 into one main() driven by asyncio.Runner(loop_factory=make_loop_factory()), a correctly hardened process satisfies all of the following:

  • Backend: assert asyncio.get_running_loop().__class__.__module__ == "uvloop" passes in production.
  • Debug off: loop.get_debug() is False; no 10–30% latency tax.
  • Error boundary live: a deliberately failing detached task produces exactly one log line as soon as the finished task is collected, which CPython does promptly for an unreferenced task.
  • Executor bounded: executor._work_queue.qsize() stays near zero under steady load; threads cap at max_workers.
  • Clean shutdown: kill -TERM exits within the grace period; on Linux, len(os.listdir('/proc/self/fd')) holds steady while the service runs under load, confirming no descriptor leak.

The full reference implementation that stitches these together lives in the integrated bootstrap on the Event Loop Configuration overview.

A quick end-to-end smoke test that exercises all four steps in one run looks like this:

import asyncio


async def main() -> None:
    loop = asyncio.get_running_loop()
    configure_executor(loop)
    configure_diagnostics(loop)
    install_signals(loop)
    assert loop.get_debug() is False, "debug must be off in prod"
    # backend assertion only holds where uvloop is installed:
    # assert loop.__class__.__module__ == "uvloop"
    print("bootstrap verified; awaiting SIGTERM")
    await asyncio.Event().wait()  # block until a signal cancels us


if __name__ == "__main__":
    with asyncio.Runner(loop_factory=make_loop_factory()) as runner:
        try:
            runner.run(main())
        except* asyncio.CancelledError:
            print("clean shutdown")

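Run it, send kill -TERM <pid>, and confirm a single clean shutdown line with no pending-task warnings. That observation confirms the bootstrap runs end to end and the signal path reaches a clean exit. The backend and executor still need their own checks from the list above, because this run neither asserts the backend nor submits blocking work.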

Pitfalls & edge cases

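  • Setting the backend after the loop exists. loop_factory and the policy API choose the loop class, so they only take effect before the loop is created; applied later, the runtime silently keeps the default and your config is logged but inert. set_debug, set_default_executor, and set_exception_handler do work on a running loop, but call them at the top of main(), before the first await, so that no task runs unconfigured.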
  • Leaving PYTHONASYNCIODEBUG=1 in the image. It survives into production as a 10–30% latency tax plus memory growth from retained frames. Gate it on an env var that defaults to off, and assert loop.get_debug() is False in a startup check.
  • An unbounded executor or unbounded task creation. The pool's work queue and create_task both accept unlimited backlog. Without a Semaphore or TaskGroup ceiling, a burst enqueues faster than workers drain and RSS climbs to the OOM killer.
  • Swallowing CancelledError in task cleanup. Catching CancelledError without re-raising defeats shutdown: the task keeps running past gather, leaving sockets and connection pools open. Always re-raise after cleanup.
  • SHUTDOWN_GRACE ≥ the orchestrator grace period. If your internal deadline is not strictly shorter than terminationGracePeriodSeconds, the orchestrator SIGKILLs mid-drain and you lose the deterministic teardown entirely. Keep a margin.
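The CancelledError pitfall above is easiest to see in a small example. The following is a minimal sketch, assuming a hypothetical FakeConnection class as a stand-in for a socket or pooled connection; serve and demo are likewise illustrative names. The task releases its resource when cancelled and then re-raises, so it ends in the cancelled state instead of continuing to run.

```python
import asyncio


class FakeConnection:
    """Hypothetical stand-in for a socket or pooled connection."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        await asyncio.sleep(0)  # pretend the close needs one loop iteration
        self.closed = True


async def serve(conn: FakeConnection) -> None:
    try:
        await asyncio.sleep(3600)  # stand-in for request handling
    except asyncio.CancelledError:
        await conn.close()  # release resources first
        raise  # then propagate so the task ends as cancelled


async def demo() -> tuple[bool, bool]:
    conn = FakeConnection()
    task = asyncio.create_task(serve(conn))
    await asyncio.sleep(0)  # let the task start and reach its first await
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return conn.closed, task.cancelled()


closed, cancelled = asyncio.run(demo())
print(closed, cancelled)  # → True True
```

When the cleanup is the same on every exit path, a try/finally block gives the same guarantee without naming CancelledError at all, which removes the chance of forgetting the re-raise.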

Frequently Asked Questions

Do I still need the policy API to install uvloop on Python 3.11+?

No. Pass loop_factory=uvloop.new_event_loop to asyncio.Runner (3.11+) or to asyncio.run(main(), loop_factory=...) (3.12+). The policy API is deprecated since 3.14 and slated for removal in 3.16, and loop_factory is the forward-compatible path that also keeps a clean selector-loop fallback.

How do I choose max_workers for the executor?

Start at min(32, (os.cpu_count() or 1) * 4) for I/O-bound blocking calls, then watch executor._work_queue.qsize() and len(executor._threads) under load. If the queue grows while threads are pinned, callers outpace the pool — tighten the bounding Semaphore rather than raising the cap, since the GIL limits useful concurrency for CPU-adjacent work.

What grace period should SHUTDOWN_GRACE use?

Strictly less than the orchestrator's kill deadline — terminationGracePeriodSeconds in Kubernetes, which defaults to 30s. Leave a few seconds of margin (e.g. 25s) so the cancel-and-gather completes and the loop closes before SIGKILL arrives.