How to properly configure asyncio event loops for production
A coroutine that runs cleanly under python app.py on a laptop will, unchanged, exhibit three specific failures in production: it swallows exceptions in fire-and-forget tasks until garbage collection logs them minutes late, it stalls every concurrent request the first time a synchronous database driver is called, and it drops in-flight connections when the orchestrator sends SIGTERM ahead of SIGKILL. None of these are bugs in your business logic — they are the consequences of running the event loop with default configuration. This guide is the concrete, ordered checklist that closes all three gaps: swap the backend, size the executor, install an error boundary with debug gated off, and wire signals into a deterministic shutdown. Each step ends with the exact command or assertion that proves it took effect.
Prerequisites
- Python 3.11+. The bootstrap uses asyncio.Runner(loop_factory=...), asyncio.TaskGroup, asyncio.timeout(), and exception groups (except*). These remove the need for the deprecated policy API.
- uvloop for the fast backend on Linux/macOS: pip install uvloop. The code keeps a selector-loop fallback so Windows and minimal images still run.
- Familiarity with the loop iteration model. This guide configures the stages described in the Event Loop Configuration overview, which itself sits under Asyncio Fundamentals & Event Loop Architecture. If terms like ready queue, selector poll, and slow callback are unfamiliar, read those first.
- A process you can send signals to (a container, or a local run you can kill -TERM).
1. Select the loop backend with a safe fallback
The default backend is SelectorEventLoop on Unix and ProactorEventLoop on Windows. The pure-Python selector loop is correct but spends measurable time in Python-level dispatch under high file-descriptor counts. uvloop replaces the core with libuv and typically delivers 2–4x network throughput. Select it through loop_factory rather than the policy API, and always retain the selector loop as a fallback so a missing wheel does not crash the service.
Verify: the log line should read active backend: uvloop in production. Assert it in a smoke test:
2. Replace and size the default executor
The loop runs on one thread, so any synchronous call blocks every coroutine until it returns. Route blocking work through run_in_executor, but replace the implicitly created default pool with an explicitly sized one. For I/O-bound blocking calls, min(32, (os.cpu_count() or 1) * 4) is a safe start; cap actual in-flight submissions with a Semaphore, because the pool's thread count is bounded but its work queue is not.
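A minimal sketch of that arrangement follows. BoundedExecutor, blocking_query, and demo are hypothetical names introduced for illustration; the sizing formula is the one given above.

```python
import asyncio
import os
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = min(32, (os.cpu_count() or 1) * 4)


class BoundedExecutor:
    """A sized thread pool plus a semaphore that caps in-flight submissions."""

    def __init__(self, max_workers: int = MAX_WORKERS) -> None:
        self.pool = ThreadPoolExecutor(
            max_workers=max_workers, thread_name_prefix="blocking"
        )
        self._slots = asyncio.Semaphore(max_workers)

    async def run(self, fn, *args):
        # Callers wait here rather than piling work onto the pool's queue.
        async with self._slots:
            loop = asyncio.get_running_loop()
            return await loop.run_in_executor(self.pool, fn, *args)


def blocking_query(n: int) -> int:
    time.sleep(0.01)  # stand-in for a synchronous driver call
    return n * 2


async def demo() -> list[int]:
    bounded = BoundedExecutor()
    # Also make it the default, so run_in_executor(None, ...) uses the sized pool.
    asyncio.get_running_loop().set_default_executor(bounded.pool)
    return await asyncio.gather(*(bounded.run(blocking_query, i) for i in range(8)))
```

Registering the pool as the default executor means asyncio.run or Runner shuts it down when the loop closes, so no separate teardown is needed in this sketch.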
Verify: under load, the executor's internal counters should stay bounded. Log them periodically:
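A sketch of such a reporter is below. Note that _work_queue, _threads, and _max_workers are private CPython attributes of ThreadPoolExecutor, not a public API, so treat this as diagnostic code that may need adjusting between Python versions. The function names, logger name, and interval are assumptions.

```python
import asyncio
import logging
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger("executor")  # logger name is an assumption


def executor_stats(executor: ThreadPoolExecutor) -> dict[str, int]:
    """Snapshot pool pressure using private CPython attributes."""
    return {
        "queue_depth": executor._work_queue.qsize(),  # submitted, not yet started
        "threads": len(executor._threads),            # worker threads spawned so far
        "max_workers": executor._max_workers,
    }


async def report_executor(executor: ThreadPoolExecutor, interval: float = 10.0) -> None:
    """Log the counters periodically until the task is cancelled."""
    while True:
        log.info("executor stats: %s", executor_stats(executor))
        await asyncio.sleep(interval)
```

Started with asyncio.create_task(report_executor(pool)) and cancelled during shutdown, this yields one log line per interval.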
A queue depth that climbs while the thread count is pinned at max_workers means callers are submitting faster than the pool drains; tighten the semaphore.
3. Disable debug and install an exception boundary
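Debug mode (PYTHONASYNCIODEBUG=1 or loop.set_debug(True)) adds 10–30% per-tick latency and retains stack frames, so it must be off in production by default and gated behind a flag. Independently, install a loop exception handler: without one, an unretrieved exception in a detached task reaches only the default asyncio logger, and only when the task object is finalized, which can be long after the failure if anything still holds a reference. The handler ignores CancelledError (expected during shutdown) and forwards everything else to your logging pipeline. For a report at completion time rather than finalization time, also attach a done callback to each fire-and-forget task.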
Verify: confirm the boundary actually catches a detached failure.
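A self-contained probe could look like this. The names probe_boundary and boom are hypothetical, and the recording handler here stands in for the one installed in step 3. The probe deliberately drops the task reference and forces a garbage-collection pass, because the loop reports an unretrieved task exception when the task object is finalized.

```python
import asyncio
import gc
import logging

log = logging.getLogger("loop")  # logger name is an assumption


async def probe_boundary() -> list[dict]:
    """Fail a detached task on purpose and return what the handler saw."""
    seen: list[dict] = []
    loop = asyncio.get_running_loop()

    def handler(_loop: asyncio.AbstractEventLoop, context: dict) -> None:
        seen.append(context)
        log.error(
            "loop exception: %s", context.get("message"),
            exc_info=context.get("exception"),
        )

    loop.set_exception_handler(handler)

    async def boom() -> None:
        raise RuntimeError("probe")

    asyncio.create_task(boom())  # deliberately detached: no reference kept
    await asyncio.sleep(0.05)    # let the task run and fail
    gc.collect()                 # make finalization deterministic for the probe
    log.info("debug enabled: %s", loop.get_debug())
    return seen
```

Running asyncio.run(probe_boundary()) in a staging build should return a single context whose "exception" entry is the RuntimeError.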
You should see one loop exception: ... RuntimeError: probe line and loop.get_debug() returning False in production.
4. Wire signals into a deterministic shutdown
An orchestrator sends SIGTERM and waits terminationGracePeriodSeconds before SIGKILL. The service must intercept the signal in the loop thread, cancel in-flight tasks, await their cleanup inside a deadline shorter than the grace period, and let Runner drain async generators and close the loop. Use loop.add_signal_handler, which runs the callback inside the loop and is available on Unix only, rather than signal.signal, whose handler cannot safely interact with the loop.
Verify: run the service, send kill -TERM <pid>, and confirm it exits cleanly within the grace window with no Task was destroyed but it is pending warnings. Time the gap between signal and exit; if it approaches SHUTDOWN_GRACE, a task is not re-raising on cancel.
Verification
After composing steps 1–4 into one main() driven by asyncio.Runner(loop_factory=make_loop_factory()), a correctly hardened process satisfies all of the following:
- Backend: assert asyncio.get_running_loop().__class__.__module__ == "uvloop" passes in production.
- Debug off: loop.get_debug() is False; no 10–30% latency tax.
- Error boundary live: a deliberately failing detached task produces exactly one log line immediately, not at GC.
- Executor bounded: executor._work_queue.qsize() stays near zero under steady load; threads cap at max_workers.
- Clean shutdown: kill -TERM exits within the grace period; os.listdir('/proc/self/fd') shows a stable count across restarts, confirming no descriptor leak.
The full reference implementation that stitches these together lives in the integrated bootstrap on the Event Loop Configuration overview.
A quick end-to-end smoke test that exercises all four steps in one run looks like this:
Run it, send kill -TERM <pid>, and confirm a single clean shutdown line with no pending-task warnings. That single observation proves the backend, executor, diagnostics, and shutdown path are all wired correctly.
Pitfalls & edge cases
- Setting the backend after the loop exists. loop_factory and the policy API only take effect when the loop is created; applied afterwards, the runtime silently keeps the default loop and your config is logged but inert. Select the backend inside make_loop_factory, and apply set_debug and the exception handler before the first await.
- Leaving PYTHONASYNCIODEBUG=1 in the image. It survives into production as a 10–30% latency tax plus memory growth from retained frames. Gate it on an env var that defaults to off, and assert loop.get_debug() is False in a startup check.
- An unbounded executor or unbounded task creation. The pool's work queue and create_task both accept unlimited backlog. Without a Semaphore or TaskGroup ceiling, a burst enqueues faster than workers drain and RSS climbs to the OOM killer.
- Swallowing CancelledError in task cleanup. Catching CancelledError without re-raising defeats shutdown: the task keeps running after cancel(), so gather waits on it until the deadline expires while its sockets and connection pools stay open. Always re-raise after cleanup.
- SHUTDOWN_GRACE ≥ the orchestrator grace period. If your internal deadline is not strictly shorter than terminationGracePeriodSeconds, the orchestrator SIGKILLs mid-drain and you lose the deterministic teardown entirely. Keep a margin.
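The CancelledError pitfall is the easiest of these to reproduce. The following sketch contrasts a worker that re-raises after cleanup with one that swallows the cancellation; all names are hypothetical, and the bad worker swallows only once so the demo itself can still terminate.

```python
import asyncio


async def good_worker(events: list[str]) -> None:
    try:
        await asyncio.sleep(3600)
    except asyncio.CancelledError:
        events.append("good: cleaned up")
        raise  # propagate so the task ends as cancelled


async def bad_worker(events: list[str]) -> None:
    try:
        await asyncio.sleep(3600)
    except asyncio.CancelledError:
        events.append("bad: swallowed cancel")  # no re-raise
    await asyncio.sleep(3600)  # keeps running after "cancellation"


async def demo() -> tuple[bool, bool, list[str]]:
    events: list[str] = []
    good = asyncio.create_task(good_worker(events))
    bad = asyncio.create_task(bad_worker(events))
    await asyncio.sleep(0)  # let both workers start

    good.cancel()
    bad.cancel()
    _done, pending = await asyncio.wait({good, bad}, timeout=0.1)
    still_running = bad in pending

    # A second cancel lands outside the try block, so the demo can exit.
    bad.cancel()
    await asyncio.gather(bad, return_exceptions=True)
    return good.cancelled(), still_running, events
```

In this sketch the well-behaved worker ends as cancelled within the wait window, while the swallowing worker is still pending after it, which is the condition that pushes a real shutdown up against SHUTDOWN_GRACE.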
Frequently Asked Questions
Do I still need the policy API to install uvloop on Python 3.11+?
No. Pass loop_factory=uvloop.new_event_loop to asyncio.Runner (available from 3.11) or to asyncio.run(main(), loop_factory=...) (available from 3.12). The policy API is deprecated since 3.14 and slated for removal in 3.16, and loop_factory is the forward-compatible path that also keeps a clean selector-loop fallback.
How do I choose max_workers for the executor?
Start at min(32, (os.cpu_count() or 1) * 4) for I/O-bound blocking calls, then watch executor._work_queue.qsize() and len(executor._threads) under load. If the queue grows while threads are pinned, callers outpace the pool — tighten the bounding Semaphore rather than raising the cap, since the GIL limits useful concurrency for CPU-adjacent work.
What grace period should SHUTDOWN_GRACE use?
Strictly less than the orchestrator's kill deadline — terminationGracePeriodSeconds in Kubernetes, which defaults to 30s. Leave a few seconds of margin (e.g. 25s) so the cancel-and-gather completes and the loop closes before SIGKILL arrives.
Related
- Event Loop Configuration — up to the overview with the pattern catalogue and failure-mode table this checklist operationalises.
- Asyncio Fundamentals & Event Loop Architecture — the loop-iteration model these steps tune.
- When to use asyncio.run vs loop.run_until_complete — choosing the entrypoint that drives this configured loop.