Threading vs Multiprocessing vs Asyncio: A Decision Guide

Python ships three concurrency models — threading, multiprocessing, and asyncio — and the choice between them is not a style preference. It is determined by where your workload spends its time, how CPython's Global Interpreter Lock (GIL) treats that time, and what your memory and failure-isolation budget allows. Pick the wrong model and you get sublinear scaling, multi-gigabyte resident sets, or an event loop that stalls every request behind one blocking call. This guide narrows the scope of the parent Concurrent Execution & Worker Patterns reference to a single question: given a concrete workload, which execution model do you reach for, and how do you combine them when one model is not enough.

The decision reduces to four axes — GIL behavior, I/O- vs CPU-bound classification, per-unit memory and overhead, and failure isolation — applied through a single matrix. The rest of this page works through each axis, catalogues the four patterns that cover almost every production case, and ends with an integrated example and a failure-mode table.

Architectural principles

These constraints govern correct selection. They are non-negotiable; every pattern below is a consequence of them.

  • Concurrency is not parallelism. Concurrency overlaps execution lifecycles; parallelism runs instructions simultaneously on multiple cores. threading and asyncio give you concurrency on one core; only multiprocessing (or a free-threaded build) gives you CPU parallelism.
  • The GIL serializes Python bytecode. Under the standard CPython build, exactly one thread executes Python bytecode at a time. Threads help only when they spend most of their life outside the interpreter — blocked on a syscall that releases the GIL. They never speed up pure-Python compute.
  • Asyncio depends on a cooperative contract. The single-threaded event loop only switches tasks at await points. One synchronous call — time.sleep, requests.get, a blocking DB driver — freezes every other task on the loop.
  • Isolation has a serialization cost. Processes eliminate lock contention and provide crash isolation by giving each worker its own address space, but every argument and result crossing that boundary is pickled. Large payloads can cost more to ship than to compute.
  • Profile before you choose. The CPU-time-to-wall-clock ratio of a representative task tells you the model. Measure it on real data before committing an architecture.
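The last principle can be made concrete with a small measurement helper. This is a minimal sketch, not a profiler: the names cpu_wall_ratio, io_like, and cpu_like are hypothetical, and the two sample tasks are stand-ins for a real workload. Note that time.process_time() counts CPU time for the current process only, so work done in child processes is not included.

```python
import time
from typing import Callable


def cpu_wall_ratio(task: Callable[[], object], repeats: int = 3) -> float:
    """Return CPU seconds divided by wall-clock seconds over `repeats` runs."""
    cpu_start = time.process_time()
    wall_start = time.perf_counter()
    for _ in range(repeats):
        task()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall if wall > 0 else 0.0


def io_like() -> None:
    # Hypothetical stand-in for a blocking network or disk call.
    time.sleep(0.05)


def cpu_like() -> int:
    # Hypothetical stand-in for pure-Python compute.
    return sum(i * i for i in range(200_000))
```

Calling cpu_wall_ratio(io_like) should give a value close to zero, while cpu_wall_ratio(cpu_like) should land much nearer to one on an otherwise idle machine. Run it against representative production inputs rather than toy data before drawing conclusions.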

Execution model: how each maps to the loop and the scheduler

All three models ultimately schedule work, but at different layers, and that layer determines their cost profile. Understanding where each sits relative to the OS scheduler and the asyncio event loop is what makes the worker topologies in the Concurrent Execution & Worker Patterns overview tractable.

threading hands scheduling to the OS: threads are preemptively switched, sharing one virtual address space, and the GIL is acquired/released around bytecode and I/O syscalls. multiprocessing spawns independent interpreters, each with its own GIL, scheduled as separate OS processes — true parallelism at the price of IPC. asyncio runs a single OS thread whose event loop multiplexes thousands of sockets through one epoll/kqueue/IOCP call, switching coroutines cooperatively at await. The hybrid model bridges them: the loop offloads blocking or CPU work to a thread or process pool via loop.run_in_executor() and asyncio.to_thread(), keeping the loop responsive while borrowing the other models' strengths.

The cost difference is mechanical, not stylistic. A preemptive OS thread switch is a kernel transition that saves and restores register state and may flush cache lines; doing it across hundreds of runnable threads burns measurable CPU. The event loop's switch is a Python-level resume of the next ready coroutine — no syscall, no kernel involvement — which is why one loop sustains tens of thousands of idle-but-open connections that a thread-per-connection design could never afford. A process switch is the most expensive of the three and carries a separate page table, so processes are justified only when the work they do dwarfs the cost of starting and feeding them. These relationships are why the decision below keys off where time is spent: the model that wins is the one whose scheduling cost is smallest relative to the work it enables. When no single model fits, the hybrid pattern lets each layer do what it is cheapest at — the loop for waiting, threads for blocking calls, processes for compute.
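To illustrate the bridge between the loop and a thread, the sketch below offloads a blocking call with asyncio.to_thread() while a second coroutine keeps ticking. The function names (blocking_call, heartbeat, demo) are hypothetical, and time.sleep stands in for a synchronous SDK call.

```python
import asyncio
import time


def blocking_call(seconds: float) -> str:
    # Stand-in for a synchronous SDK or driver call.
    time.sleep(seconds)
    return "done"


async def heartbeat(stop: asyncio.Event) -> int:
    """Count how often the loop gets to run while other work is in flight."""
    ticks = 0
    while not stop.is_set():
        ticks += 1
        await asyncio.sleep(0.01)
    return ticks


async def demo() -> tuple[str, int]:
    stop = asyncio.Event()
    beat = asyncio.create_task(heartbeat(stop))
    # Runs in the loop's default thread pool; the loop itself stays free.
    result = await asyncio.to_thread(blocking_call, 0.2)
    stop.set()
    return result, await beat
```

If blocking_call were invoked directly inside demo(), the heartbeat would not advance at all during the 0.2 s sleep. With the offload, the tick count grows for the whole duration of the call.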

[Figure: Workload to concurrency model decision matrix. A workload classifier, keyed on the CPU-time / wall-clock ratio, routes I/O-bound work to asyncio or threading, CPU-bound work to multiprocessing, and mixed work to a hybrid loop-plus-executor model, annotated with GIL and memory characteristics.]

Pattern catalogue

Four patterns cover the overwhelming majority of production workloads. Each names when to use it, its central trade-off, and a minimal example.

Asyncio for high fan-out I/O

When: thousands of concurrent network connections — API aggregation, crawlers, proxy gateways, fan-out RPC. Trade-off: flat memory and the highest connection density of any model, but every library on the hot path must be async-native or the loop stalls.

import asyncio
import aiohttp

async def fetch(session: aiohttp.ClientSession, url: str, sem: asyncio.Semaphore) -> int:
    async with sem:  # bound in-flight requests; see resource boundaries below
        async with session.get(url) as resp:
            await resp.read()
            return resp.status

async def main(urls: list[str], limit: int = 100) -> list[int]:
    sem = asyncio.Semaphore(limit)
    async with aiohttp.ClientSession() as session:
        async with asyncio.TaskGroup() as tg:
            tasks = [tg.create_task(fetch(session, u, sem)) for u in urls]
    return [t.result() for t in tasks]

A TaskGroup (3.11+) gives structured concurrency: if one fetch raises, siblings are cancelled and the error surfaces as an ExceptionGroup. For the head-to-head numbers behind this choice, see asyncio vs threading for 1000 concurrent HTTP requests.

Threads for blocking-call libraries

When: I/O-bound work whose only client is a synchronous library — psycopg2, boto3, a vendor SDK, requests — and rewriting to async is not yet possible. Trade-off: zero code rewrite and shared memory for cheap data access, but each thread costs a stack (commonly ~8 MB of reserved address space), and the GIL caps useful parallelism to whatever time threads spend blocked in syscalls.

from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

def fetch(url: str) -> int:
    return requests.get(url, timeout=10).status_code  # releases GIL while blocked on socket

def run(urls: list[str], workers: int = 32) -> list[int]:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch, u) for u in urls]
        return [f.result() for f in as_completed(futures)]  # completion order, not input order

Cap max_workers deliberately — a bounded pool is a worker pool, and unbounded thread creation thrashes the scheduler.
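Because as_completed() yields futures in completion order, callers that need to know which input produced which result usually keep a mapping from future to input. A minimal sketch, assuming unique keys; slow_lookup and run_keyed are hypothetical names and the sleep stands in for a blocking client call.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_lookup(key: str) -> int:
    # Stand-in for a blocking client call; longer keys take longer.
    time.sleep(0.01 * len(key))
    return len(key)


def run_keyed(keys: list[str], workers: int = 4) -> dict[str, int]:
    results: dict[str, int] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        future_to_key = {pool.submit(slow_lookup, k): k for k in keys}
        for fut in as_completed(future_to_key):
            # result() re-raises any exception raised inside the worker.
            results[future_to_key[fut]] = fut.result()
    return results
```

When input order is all that matters and per-item error handling is not needed, pool.map() is the simpler choice, since it returns results in the order the inputs were submitted.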

Processes for CPU-bound work

When: pure-Python or non-releasing C compute — serialization, parsing, image/crypto/ML transforms — that the GIL would otherwise serialize. Trade-off: real multi-core speedup, but pickle IPC and process-spawn latency dominate small tasks, so chunk work coarsely.

from concurrent.futures import ProcessPoolExecutor
import os

def transform(chunk: list[int]) -> int:
    return sum(x * x for x in chunk)  # CPU-bound: runs on its own core, own GIL

def run(data: list[int]) -> int:
    workers = os.cpu_count() or 1
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(transform, chunks))

For the full executor selection logic — IPC thresholds, shared-memory routing, BrokenProcessPool recovery — see choosing between ThreadPoolExecutor and ProcessPoolExecutor for data pipelines and the dedicated CPU-bound task offloading guide.
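One detail of coarse chunking is worth a sketch. Sizing chunks with floor division, as in len(data) // workers, can yield one more chunk than there are workers when the data does not divide evenly (10 items over 4 workers gives 5 chunks of 2). A ceiling-division helper keeps the chunk count at or below the worker count. The name split_evenly is hypothetical.

```python
def split_evenly(data: list[int], parts: int) -> list[list[int]]:
    """Split `data` into at most `parts` contiguous chunks of near-equal size."""
    if parts < 1:
        raise ValueError("parts must be >= 1")
    size = max(1, -(-len(data) // parts))  # ceiling division without math.ceil
    return [data[i:i + size] for i in range(0, len(data), size)]
```

With 10 items and 4 parts this produces chunks of sizes 3, 3, 3, and 1, so no worker has to pick up a straggler chunk after the others have finished.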

Hybrid: event loop plus executor

When: a service that is mostly async I/O but has a few blocking or CPU islands — an async API that calls one legacy SDK, or computes a checksum per request. Trade-off: keeps the loop responsive without a full rewrite, but you now own two scheduling domains and the synchronization between them.

import asyncio
from concurrent.futures import ProcessPoolExecutor

def cpu_heavy(n: int) -> int:
    return sum(i * i for i in range(n))  # would block the loop if run inline

async def handle(n: int, pool: ProcessPoolExecutor) -> int:
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(pool, cpu_heavy, n)  # offload, keep loop free

async def main() -> None:
    with ProcessPoolExecutor() as pool:
        async with asyncio.TaskGroup() as tg:
            for n in (10_000_000, 20_000_000):
                tg.create_task(handle(n, pool))

The deep treatment of this split lives in hybrid concurrency models; migrating a thread-based service into this shape is covered in migrating legacy threading code to asyncio without downtime.

The decision matrix

Map a representative task to a row, then read across. The CPU-time-to-wall-clock ratio (measured with time.process_time() over time.perf_counter()) is the single most reliable discriminator.

| Workload | CPU/wall ratio | Model | Memory boundary | Failure isolation | Why |
| --- | --- | --- | --- | --- | --- |
| High fan-out network I/O | < 0.1 | asyncio | Single process, flat | None: one crash kills the loop | Thousands of sockets multiplexed on one thread |
| Blocking-library I/O | < 0.4 | threading / ThreadPool | Shared, +stack per thread | None: shared address space | GIL released during syscalls; no rewrite |
| Pure-Python compute | > 0.7 | multiprocessing | Isolated per process | Strong: worker crash is contained | Bypasses the GIL for true parallelism |
| Mixed I/O + compute | 0.4–0.7 | Hybrid loop + executor | Partitioned | Process executor isolates CPU work | Loop owns I/O; executor absorbs blocking islands |
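The matrix can be read as a small lookup function. This sketch encodes only the 0.4 and 0.7 thresholds used in this guide; the name pick_model and the blocking_libs flag are hypothetical, and the cut-offs are rules of thumb rather than hard limits.

```python
def pick_model(ratio: float, *, blocking_libs: bool = False) -> str:
    """Map a measured CPU/wall ratio to a starting concurrency model."""
    if ratio < 0:
        raise ValueError("ratio must be non-negative")
    if ratio < 0.4:
        # I/O-bound: threads only when stuck with synchronous libraries.
        return "threading" if blocking_libs else "asyncio"
    if ratio > 0.7:
        return "multiprocessing"
    return "hybrid"
```

For example, pick_model(0.05) suggests asyncio, pick_model(0.2, blocking_libs=True) suggests threading, and pick_model(0.55) suggests the hybrid loop-plus-executor shape. Treat the answer as a first hypothesis to validate under load.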

Resource boundaries

Every model needs an explicit ceiling; the failure mode of an unbounded one is always memory exhaustion under load.

  • Asyncio: unbounded create_task() accumulates coroutine state without limit. Gate in-flight work with asyncio.Semaphore and feed work through a bounded queue — see async queue management for backpressure mechanics.
  • Threads: size for I/O concurrency, not cores. A reasonable starting ceiling is min(32, (os.cpu_count() or 1) * 4) for blocking I/O; past a few hundred threads, context-switch overhead and stack memory dominate.
  • Processes: start at os.cpu_count() for CPU-bound work, keeping in mind that it reports logical CPUs, which can exceed the physical core count on SMT hardware. More processes than physical cores only add context switching and per-process RSS. Watch for linear RSS growth with max_workers; that signals payloads should move through multiprocessing.shared_memory instead of pickle.

Integrated production example

A hybrid gateway that fetches records over async HTTP, offloads a CPU-bound transform to a process pool, and shuts down cleanly. It combines the asyncio, process-pool, and hybrid patterns and carries diagnostics.

import asyncio
import logging
import os
import time
from concurrent.futures import ProcessPoolExecutor

import aiohttp

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("gateway")


def transform(payload: bytes) -> int:
    """CPU-bound: runs in a worker process, off the event loop."""
    return sum(b * b for b in payload)


async def fetch_and_process(
    session: aiohttp.ClientSession,
    url: str,
    sem: asyncio.Semaphore,
    pool: ProcessPoolExecutor,
) -> int:
    loop = asyncio.get_running_loop()
    async with sem:  # bound concurrent I/O
        async with asyncio.timeout(15):  # 3.11+ deadline; raises TimeoutError on expiry
            async with session.get(url) as resp:
                resp.raise_for_status()
                body = await resp.read()
    # Offload the CPU island so the loop keeps servicing other sockets.
    return await loop.run_in_executor(pool, transform, body)


async def run(urls: list[str], io_limit: int = 100) -> dict[str, int | float]:
    sem = asyncio.Semaphore(io_limit)
    start = time.perf_counter()
    tasks: list[asyncio.Task[int]] = []
    try:
        with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
            async with aiohttp.ClientSession() as session:
                async with asyncio.TaskGroup() as tg:
                    for u in urls:
                        tasks.append(
                            tg.create_task(fetch_and_process(session, u, sem, pool))
                        )
    except* (aiohttp.ClientError, TimeoutError) as eg:
        # TaskGroup (3.11+) re-raises task failures as an ExceptionGroup. Handle it
        # here; otherwise the tally below is skipped whenever any request fails.
        log.error("request errors: %s", [repr(e) for e in eg.exceptions])
    # Siblings cancelled by the TaskGroup after the first failure count as failed.
    ok = sum(1 for t in tasks if not t.cancelled() and t.exception() is None)
    return {
        "ok": ok,
        "failed": len(tasks) - ok,
        "wall_s": round(time.perf_counter() - start, 3),
        "active_tasks": len(asyncio.all_tasks()),
    }


if __name__ == "__main__":
    sample = ["https://example.com"] * 50
    print(asyncio.run(run(sample)))
Diagnostic Hook: export three numbers per cycle and alert on each. (1) Event-loop lag: schedule await asyncio.sleep(0) and time it; sustained values above ~10 ms mean something synchronous is blocking the loop. (2) Executor saturation: ProcessPoolExecutor exposes no public queue-depth counter, so track submitted-minus-completed futures yourself; a backlog above 2 × max_workers means CPU work is the bottleneck, not I/O. (3) Thread/process census: log len(threading.enumerate()) and worker RSS to catch leaks and pickle-driven memory growth before OOM.

Diagnostic Hook: the three signals that decide if you chose right

If event-loop lag is high but CPU is idle, you have a blocking call on the loop — move it to an executor. If CPU is pinned but throughput is flat, the GIL is serializing thread work — move it to processes. If memory grows linearly with workers, your IPC payloads are too large — switch to shared_memory. Set PYTHONASYNCIODEBUG=1 and loop.slow_callback_duration = 0.1 in staging to surface accidental blocking early.
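The lag signal can be sketched as a background coroutine that sleeps for a short interval and records how far each wake-up overshoots. This is a minimal illustration, not a production monitor: measure_loop_lag, offender, and demo are hypothetical names, and the deliberate time.sleep simulates an accidental blocking call.

```python
import asyncio
import time


async def measure_loop_lag(duration: float, interval: float = 0.01) -> float:
    """Return the worst observed overshoot, in seconds, of asyncio.sleep(interval)."""
    worst = 0.0
    deadline = time.perf_counter() + duration
    while time.perf_counter() < deadline:
        start = time.perf_counter()
        await asyncio.sleep(interval)
        worst = max(worst, time.perf_counter() - start - interval)
    return worst


async def offender() -> None:
    await asyncio.sleep(0.05)
    time.sleep(0.2)  # deliberate bug: a blocking call inside a coroutine


async def demo() -> float:
    monitor = asyncio.create_task(measure_loop_lag(duration=0.4))
    await offender()
    return await monitor
```

With the offender in place the reported lag is roughly the length of the blocking call. In a real service the same loop would export the value to a metrics system on every iteration instead of returning only the maximum.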

Failure modes

| Failure mode | Root cause | Detection | Fix |
| --- | --- | --- | --- |
| Threads give no speedup on compute | GIL serializes Python bytecode | CPU% near one core despite N threads; cProfile shows pure-Python hotspots | Move work to ProcessPoolExecutor |
| Event loop freezes intermittently | Synchronous call (requests, time.sleep, blocking driver) in a coroutine | Loop lag spikes; slow_callback_duration warnings | Wrap in run_in_executor / use async-native lib |
| Process pool OOM under load | Large objects pickled across the boundary | RSS scales linearly with max_workers; IPC time > compute time | Use multiprocessing.shared_memory; chunk coarser |
| RuntimeError: asyncio.run() cannot be called from a running event loop | Nesting sync/async entry points | Stack trace on startup or inside a thread | Bridge via executors; never call asyncio.run() inside a loop |
| Zombie threads/processes on shutdown | Missing executor.shutdown(wait=True) / undrained tasks | Rising thread/process count after restart; ResourceWarning | Drain executors and await loop.shutdown_asyncgens() |
| Unbounded task/queue growth | No semaphore or maxsize cap | Steady RSS climb tracking request rate | Bound with Semaphore and asyncio.Queue(maxsize=N) |
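The nested-entry-point failure and one way to bridge it can be reproduced in a few lines. This sketch assumes synchronous code running in a worker thread needs a result from a coroutine; fetch_value, sync_helper, and demo are hypothetical names. It uses asyncio.run_coroutine_threadsafe() to hand the coroutine back to the already-running loop instead of starting a second one.

```python
import asyncio


async def fetch_value() -> int:
    await asyncio.sleep(0)
    return 42


def sync_helper(loop: asyncio.AbstractEventLoop) -> int:
    """Runs in a worker thread; schedules a coroutine on the loop and waits."""
    future = asyncio.run_coroutine_threadsafe(fetch_value(), loop)
    return future.result(timeout=5)


async def demo() -> tuple[str, int]:
    coro = fetch_value()
    error = ""
    try:
        asyncio.run(coro)  # wrong: a loop is already running in this thread
    except RuntimeError as exc:
        error = str(exc)
    finally:
        coro.close()  # avoid a "coroutine was never awaited" warning
    loop = asyncio.get_running_loop()
    # Right: push the sync code to a thread and let it call back into the loop.
    value = await asyncio.to_thread(sync_helper, loop)
    return error, value
```

The bridge only works because the loop stays free while the helper thread waits. Calling future.result() from the loop's own thread would deadlock.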

Frequently Asked Questions

Can asyncio replace multiprocessing for CPU-bound workloads?

No. asyncio runs on a single thread and multiplexes I/O; a CPU-bound task blocks the event loop and collapses concurrency. Use ProcessPoolExecutor, or offload from the loop via loop.run_in_executor() with a process pool, to get true multi-core parallelism.

Why does my ThreadPoolExecutor perform worse than a single-threaded loop?

Thread creation, context switching, and GIL contention add overhead that outweighs the benefit for lightweight or CPU-bound tasks. Threads only help when work spends most of its time blocked in syscalls that release the GIL. Profile with cProfile and confirm the work is genuinely I/O-bound before scaling thread counts.

How do I choose between threading, multiprocessing, and asyncio?

Measure the CPU-time-to-wall-clock ratio of a representative task. Below ~0.4 the work is I/O-bound: use asyncio for high fan-out or threads when stuck with blocking libraries. Above ~0.7 it is CPU-bound: use multiprocessing. In between, use a hybrid event loop plus executor.

What worker count should I use for each model?

For blocking-I/O threads, start at min(32, (os.cpu_count() or 1) * 4) and tune to I/O capacity; the "or 1" guards against os.cpu_count() returning None. For CPU-bound processes, start at os.cpu_count() and avoid exceeding physical cores. For asyncio, there is no thread count; bound in-flight work with a Semaphore and a maxsize queue instead.

How do I combine asyncio with threads or processes safely?

Run asyncio for the I/O path and offload blocking or CPU islands with loop.run_in_executor() (or asyncio.to_thread for blocking I/O). Use a ThreadPoolExecutor for blocking I/O and a ProcessPoolExecutor for CPU work. Avoid shared mutable state across the boundary, and drain executors with shutdown(wait=True) on exit.