Concurrent Execution & Worker Patterns¶
A production-grade reference for selecting, building, and operating concurrent execution models in Python. The hard problem is rarely writing a coroutine or spawning a thread—it is choosing the right model for a given workload, then bounding it so it degrades predictably under load instead of collapsing into queue blow-ups, thread exhaustion, or event-loop starvation. This guide maps workload characteristics to runtimes, then walks the worker topologies—pools, executors, queues, and cross-runtime bridges—that turn a model choice into a deployable system.
Three forces dominate every decision on this page: the Global Interpreter Lock (GIL), the I/O-bound versus CPU-bound split, and the cost of coordination. The GIL serializes execution of Python bytecode within a single process, so threads give you concurrency for I/O waits but never parallelism for pure-Python computation. CPU-bound work needs separate processes (or native code that releases the GIL); I/O-bound work needs cheap suspension, which asyncio provides at thousands of concurrent operations per core. Coordination—queues, locks, IPC serialization—is the tax you pay to combine workers safely, and it is where most production incidents originate.
What this reference covers:
- The decision space: mapping workload type and scale to threads, processes, or
asyncio. - Worker pools and executors: sizing, lifecycle, and graceful shutdown.
- Queues and pipelines: bounded dispatch, backpressure, and dead-lettering.
- Hybrid and cross-runtime designs: bridging sync libraries into an event loop.
- Resource control and diagnostics: primitives, tuning workflows, and observability hooks.
The Concurrency Decision Space¶
Selecting an execution model is a classification problem before it is an engineering one. Misclassifying a workload—running CPU-bound transforms on a thread pool, or blocking an event loop with a synchronous database driver—is the single most common root cause of latency cliffs and resource exhaustion in Python services. Start by measuring: profile a representative request with cProfile or sample a live process with py-spy, and determine where wall-clock time actually goes. Time spent in read, recv, poll, or driver C-extensions waiting on sockets is I/O-bound; time spent in pure-Python loops, serialization, or numeric kernels is CPU-bound.
The conceptual model is a two-axis grid. One axis is what blocks (waiting on I/O versus burning CPU); the other is scale of concurrency (dozens versus tens of thousands of in-flight operations). Threads handle moderate I/O concurrency with the least code change but carry per-thread memory and context-switch costs that cap them in the low thousands. asyncio handles massive I/O concurrency cheaply but demands non-blocking libraries end to end. Processes deliver real parallelism for computation at the price of IPC serialization and startup latency. Most real systems land in a hybrid: an async core that delegates CPU-bound and legacy-sync work to executor-backed pools.
The reference table below is the fastest way to make a first cut. Each model has a characteristic failure mode—the thing that breaks first when you push it past its envelope—and recognizing that failure mode in production telemetry is often how you discover you picked the wrong model.
| Model | Best for | Per-unit overhead | Parallelism | Characteristic failure mode |
|---|---|---|---|---|
asyncio |
Massive I/O concurrency (sockets, HTTP, DB) | ~KB per task, near-zero switch cost | None (single thread) | One blocking call stalls every coroutine |
Threads (ThreadPoolExecutor) |
Moderate I/O, blocking libraries with no async port | ~8 MB stack + OS scheduling | None for Python bytecode (GIL) | Thread-pool exhaustion → unbounded queue → latency cliff |
Processes (ProcessPoolExecutor) |
CPU-bound compute, GIL-free parallelism | Process spawn + pickle/IPC per call | Full, across cores | BrokenProcessPool on worker crash; IPC serialization bottleneck |
| Hybrid (async core + executors) | Real services mixing all three | Sum of the above, per lane | I/O concurrent + CPU parallel | Cross-runtime exception loss; executor starvation |
For the full comparison with benchmarks and migration guidance, see threading vs multiprocessing vs asyncio. The remaining sections take each topology in turn. Because the right runtime depends on the rest of your stack, this overview connects to the sibling references on asyncio fundamentals and event-loop architecture for the scheduler internals, network I/O and protocol handling for the I/O layer these workers drive, and resilience, cancellation, and error handling for the timeout and failure semantics that bound every pattern below.
Concurrency Model Selection¶
The selection rule is mechanical once you have a profile. If the workload is dominated by socket waits and you control the libraries, asyncio wins: it scales to tens of thousands of concurrent connections on a single core because a suspended coroutine costs only its frame and a few kilobytes, not an OS thread. If the workload is I/O-bound but the only available client is synchronous (a legacy SDK, a blocking driver), a thread pool is the pragmatic choice—the GIL is released during the blocking syscall, so threads genuinely overlap I/O waits. If the workload is CPU-bound Python, no amount of threading helps; you need processes.
The trap is the middle ground. Engineers reach for threads because the code change is small, then discover that ThreadPoolExecutor(max_workers=200) does not give 200x throughput—it gives context-switch thrash and memory pressure, because the GIL still serializes the Python parts. The inverse trap is reaching for asyncio and then calling a synchronous library inside a coroutine, which blocks the entire loop. The conceptual model that prevents both: threads and processes are about where work runs; asyncio is about when work yields. They compose—an async loop can own thread and process pools—but they do not substitute for one another.
A practical decision sequence:
- Is the work pure-Python CPU? Use processes. Offload via CPU-bound task offloading so the event loop stays responsive.
- Is it I/O with an async client available? Use
asyncionatively. - Is it I/O with only a sync client? Wrap it in
asyncio.to_thread()orloop.run_in_executor()against a bounded thread pool. - Is it a mix at scale? Build a hybrid: async core, dedicated executors per work type, explicit limits on each.
Migrating an existing synchronous codebase rarely happens all at once; the incremental path is covered in the model-comparison guide, and the cross-runtime mechanics in hybrid concurrency models.
Worker Pools & Executors¶
A worker pool is a bounded set of execution units consuming work from a shared source. The bound is the entire point: an unbounded pool is a denial-of-service vector against your own infrastructure. Fixed pools should align CPU-bound sizing to os.cpu_count() (or the cgroup CPU quota in a container, which os.cpu_count() does not see—read /sys/fs/cgroup/cpu.max or use os.sched_getaffinity(0)), while I/O-bound pools size to the latency-bandwidth product: enough workers to keep the downstream saturated without piling up idle threads.
Long-running services also need a deterministic teardown contract. When SIGTERM arrives—as it does on every Kubernetes pod eviction—a worker pool must stop accepting new work, drain in-flight tasks within a bounded grace window, and forcibly cancel stragglers rather than hang. Skipping this produces the classic incident: orphaned processes, half-written records, and a deploy that "succeeds" while silently dropping the requests in flight at cutover. The full catalogue of pool designs—fixed, elastic, work-stealing—and their sizing math lives in worker pool implementations.
The production snippet below is a bounded async worker pool consuming a queue. It fixes the worker count, uses TaskGroup so a fatal worker error tears the group down cleanly, drains on shutdown with a deadline, and surfaces the metrics you actually need to operate it.
Diagnostic Hook: Export busy_workers, queue.qsize(), and p95() as gauges. The leading indicator of an undersized pool is busy_workers pinned at the worker count while max_qdepth climbs toward maxsize—the queue is absorbing load the workers cannot clear. If qsize() stays near maxsize for sustained windows, producers are being backpressured and you are shedding latency upstream; scale workers or shed load before the deadline timeouts begin firing.
Queues & Pipelines¶
Queues are the connective tissue between producers and the worker pool, and the most important property of a production queue is that it is bounded. An unbounded asyncio.Queue() is the textbook cause of out-of-memory crashes during traffic spikes: producers outrun consumers, the queue grows without limit, and the process is OOM-killed mid-request. Setting maxsize converts that failure into backpressure—await queue.put() suspends the producer when the queue is full, propagating the slowdown upstream where it can be handled deliberately (rate-limit, shed, or 503) instead of catastrophically.
Beyond bounding, real pipelines layer in priority and failure routing. Priority queues let latency-critical work jump ahead of bulk work sharing the same workers; the implementation details and the pitfalls of starvation are covered in implementing a priority queue with asyncio.Queue. Items that exhaust their retries should not be dropped silently—route them to a dead-letter queue for inspection and replay. The end-to-end patterns for bounded dispatch, backpressure, and dead-lettering live in async queue management.
The pipeline shape that scales is a chain of bounded stages, each with its own pool, connected by bounded queues. Backpressure then flows naturally backward through the chain: a slow final stage fills its inbound queue, which throttles the stage feeding it, and so on to the ingress. The key design rule is that every queue in the chain has a maxsize, and every stage handles task_done() exactly once per item—miscounting task_done() is the classic bug that makes queue.join() either hang forever or return early.
Hybrid & Cross-Runtime Models¶
Few real services live inside one model. A typical async gateway accepts thousands of concurrent connections on the event loop, but still has to call a synchronous vendor SDK, run a CPU-heavy serialization step, and write to a driver with no async port. The hybrid pattern isolates each of these into a dedicated lane: the async loop owns the request lifecycle, hands blocking I/O to a thread pool, and hands CPU-bound work to a process pool. The two bridges are asyncio.to_thread() for the common case and loop.run_in_executor() when you need to target a specific, sized executor—covered in depth in offloading CPU work with loop.run_in_executor.
The dangers are specific and easy to miss. First, the default executor created by run_in_executor(None, ...) is shared and small; saturating it blocks every other offloaded call in the process, so production code should pass explicit, separately-sized executors per work type. Second, exceptions and—critically—cancellation do not propagate cleanly across the boundary: cancelling the awaiting coroutine does not stop a thread already running blocking code, so the thread keeps consuming a slot until it finishes. Third, sharing mutable state between async tasks and threads reintroduces classic data races that the single-threaded loop normally protects you from. The safe patterns for shared state and lifetime management are detailed in how to safely share state between async tasks and threads, and the broader topic in hybrid concurrency models.
Diagnostic Hook: Track each executor's queue depth via the internal _work_queue.qsize() (thread pool) and watch for BrokenProcessPool exceptions on the CPU lane—the latter means a worker crashed (often an OOM kill) and the pool is now unusable until recreated. Alarm on offloaded-call latency diverging from the underlying call latency: a growing gap is executor saturation, not slower work.
Resource Control & Concurrency Limits¶
Bounding the pool and the queue bounds the steady state; concurrency primitives bound the transient—how many operations may touch a shared resource at any instant. The most common need is capping fan-out: when one request spawns many downstream calls, an asyncio.Semaphore limits how many run concurrently so you do not stampede a database or a rate-limited API. The primitives below are the ones that matter in worker topologies; the deeper treatment of each lives under asyncio fundamentals.
| Primitive | Use case | Trade-off |
|---|---|---|
asyncio.Semaphore(n) |
Cap concurrent fan-out to a shared resource (DB pool, API quota) | Too low starves throughput; too high defeats the purpose |
asyncio.Lock |
Serialize a critical section across coroutines | Holds the section single-file; a slow holder serializes everyone |
asyncio.Queue(maxsize) |
Backpressured handoff between stages | Bound trades memory for latency; full queue blocks producers |
ProcessPoolExecutor(max_workers) |
Cap parallel CPU jobs to core count | Each slot costs a process + IPC; oversubscription thrashes |
threading.BoundedSemaphore |
Limit blocking-I/O concurrency in a thread lane | Counts releases; over-release raises rather than corrupts |
A bounded fan-out is the canonical example. The semaphore here ensures no more than limit downstream calls are in flight regardless of how many items arrive, and TaskGroup aggregates failures into a single exception group:
If a downstream resource has its own pool (a database connection pool of size 20), set the semaphore at or below that bound—exceeding it just moves the queue from your code into the driver, where it is harder to observe. Timeout and cancellation semantics for these primitives are covered in resilience, cancellation, and error handling.
Diagnostics & Tuning¶
Operating a concurrent system is a measurement discipline. The following workflow takes a misbehaving worker topology from symptom to root cause:
- Confirm where time goes. Sample the live process with
py-spy dump --pid <pid>(no instrumentation, no restart). If threads sit inrecv/poll, you are I/O-bound; if they sit in Python frames, you are CPU-bound and threads are not helping. - Check for a blocked loop. Enable
loop.set_debug(True)and lowerloop.slow_callback_durationto0.05; any callback exceeding it logs a warning naming the blocking coroutine. A single synchronous call inside a coroutine will surface here. - Read the queues. Export
queue.qsize()for every stage. A queue pinned nearmaxsizeidentifies the bottleneck stage; a queue near zero with idle workers means the bottleneck is upstream. - Inspect pool saturation. Compare
busy_workersto pool size and offloaded-call latency to underlying latency. Saturation shows as busy-equals-size with growing queue depth. - Size by the math, then verify. For CPU pools, start at core count and measure throughput per added worker until it plateaus. For I/O pools, raise workers until downstream latency (not your queue) becomes the limit.
- Re-measure under synthetic load. Replay production traffic shape with
locustor a recorded trace; tune against tail latency (p95/p99), never the mean.
The snippet below is a lightweight event-loop lag probe—a background coroutine that schedules itself every interval and reports how late it ran. Sustained lag is the most direct signal that something is blocking the loop.
Diagnostic Hook: Run loop_lag_monitor in every async service and alert on p99 lag exceeding ~50 ms. Loop lag is leading; request-latency alarms are lagging. When lag spikes correlate with executor queue depth, the cause is offload saturation; when they correlate with nothing visible, you have a blocking call that bypassed your async clients—hunt it with py-spy and the slow-callback log.
Common Pitfalls¶
| Anti-Pattern | Impact | Mitigation |
|---|---|---|
| Blocking call inside a coroutine | One sync call stalls every coroutine on the loop; tail latency explodes | Offload via asyncio.to_thread(); catch it early with slow_callback_duration |
Unbounded asyncio.Queue() |
Producers outrun consumers → memory blow-up → OOM kill | Always set maxsize; let put() backpressure producers |
| Oversized thread pool for Python CPU work | GIL serializes anyway; context-switch thrash degrades throughput | Use processes for CPU work; size threads to I/O concurrency |
Sharing the default run_in_executor(None) pool |
Saturating it blocks every offloaded call process-wide | Pass explicit, per-work-type sized executors |
| No graceful-shutdown drain | SIGTERM drops in-flight work; corrupted state, lost records |
Stop intake, queue.join() under a deadline, then cancel |
Swallowing CancelledError |
Cancellation silently ignored; tasks never stop; shutdown hangs | Re-raise CancelledError; do cleanup in finally only |
Ignoring BrokenProcessPool |
A crashed worker poisons the pool; all subsequent calls fail | Catch it, recreate the pool, and treat as a fatal worker-health signal |
Frequently Asked Questions¶
When should I choose asyncio over multiprocessing for Python workloads?
Choose asyncio for high-concurrency I/O-bound work—network, database, and API calls—where you control the libraries and can keep every call non-blocking. A suspended coroutine costs a few kilobytes, so a single core handles tens of thousands of concurrent connections. Choose multiprocessing for CPU-bound, pure-Python computation that needs real parallelism across cores, since the GIL prevents threads from running Python bytecode in parallel. Most production services combine both: an async core that offloads CPU work to a process pool.
How many workers should a pool have?
For CPU-bound pools, start at the available core count—os.cpu_count() on bare metal, or the cgroup CPU quota inside a container, which os.cpu_count() does not report—and add workers only while measured throughput keeps rising. For I/O-bound thread or task pools, size to the latency-bandwidth product: enough workers to keep the downstream saturated without piling up idle ones. Always measure, never guess; raise the count until your queue depth stops growing or downstream latency (not your own queue) becomes the limit.
How do I prevent worker starvation and OOM in high-throughput pipelines?
Bound every queue with maxsize so producers are backpressured instead of allowed to exhaust memory. Use a Semaphore to cap concurrent fan-out to shared resources, route exhausted-retry items to a dead-letter queue rather than dropping them, and monitor the busy-to-idle worker ratio. When the queue stays near maxsize, shed or rate-limit at ingress before deadline timeouts begin cascading.
What is the safest way to share state between concurrent workers?
Prefer message-passing over shared mutable state. Hand work between stages through bounded queues, and for cross-process data use multiprocessing.Manager proxies for small structures or shared_memory/mmap for large arrays. When sharing between async tasks and threads is unavoidable, guard every access with a lock, because the single-threaded loop's race protection disappears the moment work runs on a thread.
How do I diagnose event-loop blocking in production?
Enable loop.set_debug(True) and lower loop.slow_callback_duration to about 50 ms so any over-budget callback logs the offending coroutine. Run a background loop-lag monitor that measures how late a self-scheduling sleep returns, and alert on p99 lag. For a running process with no instrumentation, py-spy dump shows exactly which frames the threads are stuck in—Python frames mean a blocking call slipped past your async clients.
Why doesn't cancelling a coroutine stop the thread it offloaded work to?
asyncio.to_thread() and run_in_executor() schedule blocking code on a separate thread, and Python has no safe way to interrupt arbitrary running thread code. Cancelling the awaiting coroutine raises CancelledError in the coroutine, but the thread keeps running until its function returns, holding a pool slot the whole time. Design offloaded functions to be short or to poll a cooperative stop flag, and size the pool to tolerate slots occupied past cancellation.
Related¶
- Threading vs multiprocessing vs asyncio — the head-to-head comparison and migration guidance behind the selection rule.
- Worker pool implementations — pool topologies, sizing math, and lifecycle in depth.
- Async queue management — bounded dispatch, backpressure, priority, and dead-lettering.
- CPU-bound task offloading — moving compute off the loop without starving it.
- Hybrid concurrency models — bridging sync libraries and processes into an async core.
- Resilience, cancellation & error handling — the timeout and failure semantics that bound every pattern here.