Skip to content

CPU-Bound Task Offloading

A single CPU-heavy call inside a coroutine — a JSON-of-JSON parse, a regex over a megabyte, an image resize, a hash over a payload — does not yield to the event loop. While it runs, the loop runs nothing else: no timers fire, no sockets are read, no other coroutine advances. The fix is structural rather than clever: move the synchronous compute off the loop thread and into an executor, then await the result so the rest of your async flow continues unchanged. This guide covers the choices that actually matter in production — loop.run_in_executor with a ThreadPoolExecutor versus a ProcessPoolExecutor, the asyncio.to_thread shorthand, the serialization tax you pay to cross a process boundary, chunking large workloads, and bounding executor concurrency so offloading does not simply relocate the bottleneck.

Key architectural principles:

  • Threads help only when the GIL is released. Offloading to a ThreadPoolExecutor removes the work from the loop thread, but Python bytecode in that thread still holds the GIL. It is a real win for C-extension work (NumPy, hashlib, compression, image codecs) that releases the GIL during compute, and a non-win for pure-Python loops.
  • Processes give true parallelism at the cost of serialization. A ProcessPoolExecutor sidesteps the GIL entirely, but every argument and result crosses the boundary as pickled bytes. Below a threshold of compute-per-byte, the pickling and IPC cost exceeds the work you offloaded.
  • An offloaded task is still a resource. Unbounded run_in_executor calls queue against a fixed pool; the futures pile up and memory grows. Bound concurrency with a Semaphore or a queue so backpressure reaches the producer.
  • Detect the block before you fix it. loop.set_debug(True) plus slow_callback_duration turns a silent latency spike into an explicit warning that names the blocking callback.
  • The boundary defines the cost, not the function. The same function is nearly free to offload to a thread and expensive to offload to a process — the difference is entirely the data crossing the boundary. Design the offload around the data flow, not the function signature.

Why a CPU loop starves the scheduler

The asyncio event loop is a single-threaded cooperative scheduler. It runs one ready callback to completion, then picks the next; control returns to the loop only at an await that actually suspends. A synchronous CPU function contains no such suspension point. Once the loop calls into it — directly, or indirectly through a coroutine that forgot to offload — the loop is blocked for the entire duration of that computation. Every other task that was ready to run waits. This is the same scheduling model described across the Concurrent Execution & Worker Patterns overview, and it is why a 200 ms CPU call in a request handler does not cost one request 200 ms — it costs every in-flight request 200 ms of added tail latency.

Offloading breaks the coupling. loop.run_in_executor(executor, fn, *args) submits fn to a thread or process pool and immediately returns an asyncio.Future. The loop is free again the instant the function is submitted, not when it finishes. Your coroutine awaits the future; the loop services other tasks while the compute happens elsewhere, and resumes your coroutine when the result lands. The contract is narrow but strict: the function you submit must be a plain callable (not a coroutine), and the arguments and return value must survive the trip to wherever the work runs. For a thread that trip is free; for a process it is a pickle round-trip, which is the central trade-off below. The reason offloading even works for I/O-adjacent CPU is rooted in event-loop mechanics covered under event loop configuration: the loop's slow_callback_duration is the exact knob that surfaces a blocking callback, and run_in_executor is the exact tool that prevents one.

Offloading CPU work from the event loop The single-threaded event loop submits a CPU function to a thread or process pool, stays responsive for other tasks, and resumes when the future resolves. Offloading CPU work off the event loop Event loop (1 thread) cooperative scheduler await handler() run_in_executor(fn) returns immediately other tasks keep running submit + pickle args future resolves ThreadPoolExecutor GIL-releasing C work NumPy / hashlib / zlib no serialization ProcessPoolExecutor pure-Python CPU true multi-core parallelism pays pickling + IPC cost The loop blocks only for the submit, never for the computation.

Pattern catalogue

The five patterns below are ordered roughly by reach: start with to_thread, escalate to processes only when the workload is pure-Python CPU, and reach for a shared executor and explicit bounding once you offload at any real rate.

asyncio.to_thread for GIL-releasing libraries

asyncio.to_thread(fn, *args, **kwargs) (Python 3.9+) is the one-liner: it runs fn in the loop's default ThreadPoolExecutor and returns an awaitable. It is the right tool when the heavy work happens inside a C extension that releases the GIL for its hot loop — NumPy linear algebra, hashlib, zlib/gzip, Pillow, cryptography, most compiled parsers. While the C code runs without the GIL, the loop thread reacquires it and keeps scheduling other tasks.

import asyncio
import hashlib


def hash_payload(data: bytes) -> str:
    # hashlib releases the GIL for the digest loop, so a thread
    # genuinely runs this in parallel with the event loop.
    return hashlib.sha256(data).hexdigest()


async def handle(data: bytes) -> str:
    return await asyncio.to_thread(hash_payload, data)

Trade-off: zero serialization (the thread shares memory with the loop), but no help at all for pure-Python CPU — the GIL serializes bytecode, so a for loop summing a list will block other tasks almost as badly as if you never offloaded. Verify the win, do not assume it: see running blocking SDK calls with asyncio.to_thread for the I/O-bound counterpart where threads always pay off.

ProcessPoolExecutor for pure-Python CPU

When the work is Python bytecode — a hand-written parser, a graph traversal, a pure-Python numeric kernel — only separate processes give you parallelism, because each process has its own interpreter and its own GIL. Submit through run_in_executor with a ProcessPoolExecutor.

import asyncio
from concurrent.futures import ProcessPoolExecutor


def count_primes(n: int) -> int:
    # Pure-Python CPU: no C extension releases the GIL here.
    return sum(
        all(k % d for d in range(2, int(k ** 0.5) + 1))
        for k in range(2, n)
    )


async def main() -> None:
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = await asyncio.gather(*(
            loop.run_in_executor(pool, count_primes, 200_000)
            for _ in range(4)
        ))
    print(results)


asyncio.run(main())

Trade-off: true multi-core throughput, but the function and every argument must be importable and picklable, and results travel back as pickled bytes. A worker takes tens to hundreds of milliseconds to spawn the first time, and the function must be defined at module top level — a closure or a lambda raises PicklingError on submit. This is the pure-CPU end of the threading vs multiprocessing vs asyncio decision; if you find yourself offloading work that is mostly I/O with a CPU tail, route it through a hybrid model instead so the I/O stays on the loop.

run_in_executor with a shared, long-lived executor

Creating a pool per call is the most common offloading mistake: process spawn and thread allocation dominate, and you never amortize the cost. Build one executor at startup, hand it to run_in_executor, and shut it down at teardown.

import asyncio
from concurrent.futures import ProcessPoolExecutor
from contextlib import asynccontextmanager


@asynccontextmanager
async def shared_pool(max_workers: int = 4):
    loop = asyncio.get_running_loop()
    pool = ProcessPoolExecutor(max_workers=max_workers)
    try:
        yield pool
    finally:
        # shutdown joins workers; run it off the loop thread.
        await loop.run_in_executor(None, pool.shutdown, True)

Trade-off: one pool amortizes spawn cost across the process lifetime, but a shared pool is a shared queue — see bounding below, or it becomes the contention point.

Chunking large workloads

A single huge task does not parallelize; one worker chews on it while the others idle, and the pickled payload may dwarf the result. Split the input into chunks, submit each as its own task, and reduce the partial results on the loop side.

import asyncio
from concurrent.futures import ProcessPoolExecutor


def sum_squares(chunk: list[int]) -> int:
    return sum(x * x for x in chunk)


async def parallel_sum(data: list[int], pool, chunk: int = 50_000) -> int:
    loop = asyncio.get_running_loop()
    chunks = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    partials = await asyncio.gather(*(
        loop.run_in_executor(pool, sum_squares, c) for c in chunks
    ))
    return sum(partials)

Trade-off: chunking fills every worker and caps any single pickle payload, but chunks that are too small invert the ratio — IPC overhead per chunk swamps the compute. Size chunks so each is tens of milliseconds of work, not microseconds.

Bounding offload concurrency with a Semaphore

run_in_executor does not block when the pool is busy; it queues. Fire 10,000 offloads at a four-worker pool and you hold 10,000 pending futures (and their pickled arguments) in memory at once. An asyncio.Semaphore sized near the worker count turns that silent backlog into real backpressure on the producer.

import asyncio


async def bounded_offload(pool, fn, arg, sem: asyncio.Semaphore):
    loop = asyncio.get_running_loop()
    async with sem:                       # caps in-flight submissions
        return await loop.run_in_executor(pool, fn, arg)


async def run_all(pool, fn, items):
    sem = asyncio.Semaphore(8)            # ~2x workers for a 4-worker pool
    async with asyncio.TaskGroup() as tg:
        for item in items:
            tg.create_task(bounded_offload(pool, fn, item, sem))

Trade-off: bounded memory and a producer that slows to match the pool, at the cost of a tunable you must size against your pool. The same primitive choice — lock, semaphore, event — is covered under synchronization primitives; here a semaphore is right because you are limiting a count of concurrent offloads.


Resource boundaries

Two numbers govern whether offloading helps or hurts: pool size and pickling overhead.

Pool sizing. For a ProcessPoolExecutor doing CPU work, the ceiling is physical cores — more processes than cores just context-switch. A practical default is max(1, os.cpu_count() - 1), leaving a core for the loop thread, OS scheduling, and GC. For a ThreadPoolExecutor running GIL-releasing C work, you can exceed core count modestly, but the GIL still gates the Python glue between C calls, so returns flatten quickly past core count. Never size a pool from request concurrency — size it from cores, and use the semaphore above to handle request concurrency. In a containerized deployment, remember that os.cpu_count() reports the host's cores, not the cgroup quota; read len(os.sched_getaffinity(0)) on Linux to respect the CPU limit the orchestrator actually granted, or you will oversubscribe a 2-core container with a pool sized for a 32-core host.

Pickling overhead. Crossing a process boundary serializes arguments on submit and results on return. The decision rule is compute-per-byte: offloading a function whose result is a 50 MB pickled DataFrame but whose compute is 5 ms will be slower than running it inline, because you pay the copy twice. For large NumPy or array payloads, pass a multiprocessing.shared_memory handle (a small picklable name) instead of the array itself, so the data is mapped, not copied. For database work, note the inverse case: drivers like those covered under async database drivers are already non-blocking and must never be offloaded — wrapping an await in a thread is pure overhead.

Worker start method and warmth. On Linux the default fork start method clones the parent's memory cheaply, but it is unsafe to fork a process that already holds the asyncio loop and open sockets — prefer the spawn start method (ProcessPoolExecutor(mp_context=multiprocessing.get_context("spawn"))) for a clean interpreter per worker, and accept the higher first-call latency. Because spawn re-imports your module in each worker, keep module-level side effects minimal: a worker that opens a database connection or loads a model at import time multiplies that cost by max_workers. Use the executor's initializer argument to do that setup once per worker, not once per task.

Where the timeout actually applies. asyncio.timeout() wraps the await, so it bounds how long your coroutine waits for the future — it does not bound the worker. A timed-out process keeps a core busy until it returns or the pool is torn down. If you need hard limits, run one task per short-lived worker (max_tasks_per_child on the pool) or enforce limits inside the worker with resource.setrlimit, and treat the loop-side timeout purely as a responsiveness guard.


Integrated production example

A small offloading service: a shared ProcessPoolExecutor, a bounding semaphore, per-task timeouts, structured handling of a crashed pool, and the diagnostic flags that catch a regression. This is the shape to copy into a real handler.

import asyncio
import logging
import os
from concurrent.futures import BrokenExecutor, ProcessPoolExecutor
from contextlib import asynccontextmanager

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("offload")


def cpu_kernel(n: int) -> int:
    """Pure-Python CPU work; must be importable + picklable."""
    return sum(k * k % 97 for k in range(n))


class OffloadService:
    def __init__(self, max_workers: int | None = None, max_inflight: int = 16):
        self._workers = max_workers or max(1, (os.cpu_count() or 2) - 1)
        self._pool: ProcessPoolExecutor | None = None
        self._sem = asyncio.Semaphore(max_inflight)

    @asynccontextmanager
    async def lifespan(self):
        loop = asyncio.get_running_loop()
        self._pool = ProcessPoolExecutor(max_workers=self._workers)
        log.info("pool up: %d workers", self._workers)
        try:
            yield self
        finally:
            await loop.run_in_executor(None, self._pool.shutdown, True)
            log.info("pool down")

    async def run(self, n: int, timeout: float = 10.0) -> int:
        assert self._pool is not None, "use within lifespan()"
        loop = asyncio.get_running_loop()
        async with self._sem:                      # backpressure on producers
            fut = loop.run_in_executor(self._pool, cpu_kernel, n)
            try:
                async with asyncio.timeout(timeout):
                    return await fut
            except TimeoutError:
                # NOTE: cancelling the future does NOT kill the running
                # worker; the process keeps computing until it returns.
                fut.cancel()
                log.warning("offload timed out after %.1fs (n=%d)", timeout, n)
                raise
            except BrokenExecutor:
                # A worker died (OOM kill, segfault in a C ext). The pool
                # is now unusable; rebuild it rather than limping on.
                log.error("executor broken; rebuilding pool")
                self._pool = ProcessPoolExecutor(max_workers=self._workers)
                raise


async def main() -> None:
    loop = asyncio.get_running_loop()
    loop.set_debug(True)                 # enables slow-callback warnings
    loop.slow_callback_duration = 0.1    # warn on any 100 ms+ block

    svc = OffloadService()
    async with svc.lifespan():
        async with asyncio.TaskGroup() as tg:
            for n in (300_000, 600_000, 900_000, 1_200_000):
                tg.create_task(svc.run(n))
    # If any cpu_kernel were called inline here instead of offloaded,
    # set_debug would log "Executing <...> took 0.5xx seconds".


if __name__ == "__main__":
    asyncio.run(main())

Diagnostic Hook: Run with PYTHONASYNCIODEBUG=1 or loop.set_debug(True) and set loop.slow_callback_duration to your latency SLA (e.g. 0.05). Any callback that blocks the loop longer than that logs Executing <Handle ...> took N seconds with a traceback to the offending frame — the fastest way to find a CPU call that should have been offloaded but was not. In a service, export the count of those warnings as a metric and alert on any nonzero rate.


Diagnostic Hook — catch loop blocking with slow_callback_duration

The single most useful signal for this whole topic is slow_callback_duration. Set loop.set_debug(True) and lower the threshold (loop.slow_callback_duration = 0.1) in staging and load tests. A missed offload — a CPU call still running on the loop thread, or a to_thread wrapping pure-Python work that the GIL serializes back onto the loop's critical path — shows up immediately as a took N seconds log line pointing at the exact callback. No warnings under load is your green light that compute is genuinely off the loop.


Failure modes

Failure mode Root cause Detection Fix
Loop freezes during compute CPU function called inline in a coroutine, never offloaded slow_callback_duration warning naming the frame; rising request tail latency Wrap the call in to_thread or run_in_executor
to_thread doesn't help Wrapped work is pure-Python; GIL serializes it back onto the loop Slow-callback warnings persist after offloading; near-zero parallel speedup Move to ProcessPoolExecutor for true parallelism
Offloading slower than inline Pickling/IPC cost exceeds the compute (low compute-per-byte) Total time grows with payload size, not with n workers Pass shared_memory handles; chunk; or keep small work inline
Memory climbs under load Unbounded run_in_executor queues thousands of pending futures + pickled args RSS grows with request rate; pool queue depth unbounded Bound with a Semaphore sized near worker count
Pool wedged after a crash A worker died (OOM, segfault in C ext); pool raises BrokenExecutor BrokenProcessPool/BrokenExecutor on every subsequent submit Catch it, rebuild the executor, reject or retry in-flight work
Timeout fires but CPU keeps running future.cancel() cannot interrupt a running worker thread/process CPU stays pegged after a logged timeout Use process pools with hard per-task limits; treat cancellation as best-effort

Frequently Asked Questions

When should I use ProcessPoolExecutor instead of asyncio.to_thread for CPU work?

Use ProcessPoolExecutor for pure-Python CPU work that needs true multi-core parallelism, because each process has its own interpreter and GIL. asyncio.to_thread runs in a thread that still contends for the GIL, so it only helps when the heavy work is inside a C extension that releases the GIL (NumPy, hashlib, zlib, image codecs). Processes pay a pickling and IPC cost on every call; threads share memory and pay none.

Why is my offloaded function slower than running it inline?

Crossing a process boundary serializes the arguments on submit and the result on return. If the compute-per-byte is low — for example a 50 MB result computed in a few milliseconds — the two pickle copies cost more than the work you offloaded. Pass multiprocessing.shared_memory handles for large arrays instead of the arrays themselves, chunk the workload, or keep small computations inline.

How do I stop run_in_executor from exhausting memory under load?

run_in_executor does not block when the pool is busy; it queues, holding every pending future and its pickled arguments in memory. Bound the number of concurrent submissions with an asyncio.Semaphore sized near the worker count (roughly one to two times the workers). The semaphore turns an unbounded backlog into backpressure that slows the producer to match the pool.

How do I detect a CPU call that is blocking the event loop?

Enable debug mode with loop.set_debug(True) and lower loop.slow_callback_duration to your latency SLA (for example 0.1 seconds). Any callback that runs longer than that logs 'Executing took N seconds' with a traceback to the offending frame, pointing directly at the CPU call that should have been offloaded. Export the warning count as a metric and alert on any nonzero rate in production.