Optimizing worker pool sizes for mixed I/O and CPU workloads¶
A pool sized for I/O thrashes on CPU work, and a pool sized for CPU starves under I/O wait. When a single workload mixes both — say, fetching a payload over the network (I/O), then parsing and hashing it (CPU) — a single max_workers number is wrong for at least half the task. The symptom is recognizable: CPU pins at 100% across all cores while the queue keeps backing up, or cores sit idle while throughput stalls behind network latency. Both look like "the pool is too small" on a dashboard, but the fixes are opposite — add threads in the second case, remove them in the first — which is why guessing at max_workers rarely converges.
The root cause is that the two failure modes are governed by different physics. Thread concurrency helps only while threads are parked in I/O syscalls (where the GIL is released); stack more threads on CPU-bound work and they simply queue for the one interpreter lock, adding context-switch overhead for zero parallelism. Process concurrency gives true parallelism but caps usefully at core count and pays serialization cost on every task. So the right number depends entirely on where each task spends its time — and "each task" is rarely uniform in a real service. This guide walks through measuring the I/O-to-CPU ratio of your tasks, turning that ratio into concrete pool sizes with established concurrency formulas, and replacing static sizing with a feedback-driven controller that resizes safely at runtime.
Prerequisites¶
- Python 3.11+ for
asyncio.timeout()and modernTaskGroupidioms in the surrounding pool code. - Standard library only here:
concurrent.futures,multiprocessing,time,tracemalloc,sys. A profiler such aspy-spyorcProfileis useful for validation but not required to run the snippets. - Working knowledge of the pool mechanics this builds on — the worker-coroutine-plus-queue structure and the executor-backed variant from worker pool implementations, and the loop-blocking constraint from the parent Concurrent Execution & Worker Patterns overview.
Step 1 — Profile each task's I/O-to-CPU ratio¶
Before you can size anything, measure the split between time spent waiting on I/O and time spent burning CPU, per task category. The trick is comparing two clocks: time.perf_counter() measures wall-clock (which includes I/O wait), while time.process_time() measures only CPU time charged to the process. The difference is I/O wait. This is deliberately cheaper than a sampling profiler — you can leave it running in production behind a flag, accumulating a rolling mean per task type, because two clock reads per call cost essentially nothing.
The rolling-mean design matters: a single sample is noisy (one GC pause or one slow DNS lookup skews it), so the profiler below smooths over the last samples calls. Keep a separate profiler instance per task category (e.g. one for "fetch", one for "parse"), not one global instance, because the whole point is that categories have different ratios and averaging across them reproduces the original problem.
Verify: decorate a representative task, run it a few hundred times against real dependencies, then inspect profiler.w_ratio and profiler.c_ratio. A network-bound task should show w_ratio several times larger than c_ratio; a parsing/hashing task the reverse. If a "network" task shows high c_ratio, you have hidden CPU work (TLS handshakes, JSON deserialization, regex compilation) masquerading as I/O — confirm with cProfile.
Step 2 — Convert the ratio into pool sizes¶
With the ratio in hand, apply the standard concurrency formulas. Thread pools scale above core count because threads spend most of their time parked on I/O and yielding the GIL; process pools cap at core count because each process genuinely competes for a physical core and extra processes only add context-switch and memory overhead. The thread formula traces back to Brian Goetz's concurrency work: a thread that waits W and computes C keeps a core busy only C / (W + C) of the time, so to keep all cores busy you need cores × (W + C) / C = cores × (1 + W/C) threads. The formula is a starting point, not a final answer — it assumes the GIL is fully released during the wait, which holds for socket and disk I/O but not for CPU work mislabeled as I/O (see the pitfalls).
- Thread pool:
N_threads = cores × (1 + W/C)— the more a task waits on I/O relative to compute, the more threads you can profitably stack. - Process pool:
N_processes = min(cores, cores + 1)— capping at core count prevents thrashing; the+1only helps if one process is reliably I/O-bound. - Never mix: route I/O and CPU tasks to separate pools via a dispatcher. A single shared executor guarantees GIL contention drags down the I/O tasks.
Verify: feed the measured w_ratio / c_ratio into calculate_pool_sizes. For an I/O-heavy workload (ratio ~4 on 8 cores) you should get ~40 threads and 8 processes; for a CPU-heavy one (ratio ~0.2) ~9 threads and 8 processes. Sanity-check that threads is not absurd — a ratio implying hundreds of threads usually means the work belongs on the loop as coroutines instead.
Step 3 — Replace static sizing with an adaptive controller¶
Static sizes are correct only at the load you measured. Real traffic shifts network latency and arrival rate, so wrap the pool in a controller that tracks rolling latency and queue depth and resizes within bounds. Because concurrent.futures cannot hot-swap max_workers — the attribute is fixed at construction and the thread set is managed internally — resizing means draining and recreating the executor. That is an expensive operation (shutdown(wait=True) blocks until in-flight tasks complete), so it must be rare: gate it behind a cooldown so at most one resize happens per window, and use hysteresis (a scale-up threshold well above the scale-down threshold) so the controller cannot oscillate across a single boundary. The controller resizes on two signals together — latency and queue depth — because either alone is ambiguous: high latency with an empty queue is just slow work, not under-provisioning, and only the combination of high latency and a backed-up queue justifies adding workers.
Verify: drive synthetic load with rising latency and confirm worker count steps up by 2 only after the cooldown, then steps back toward base_workers when latency falls. Log every _resize call; in steady state you should see zero resizes — constant flapping means your up/down thresholds overlap. One subtlety to confirm: because submit() and _adjust() share self._lock, a resize cannot race a submission, but the shutdown(wait=True) inside _resize will hold that lock for the duration of the drain. If your tasks are long-running, that blocks new submissions for the whole drain — acceptable for occasional resizes, but a reason to keep the cooldown generous and the per-task duration bounded.
Verification¶
You cannot validate a pool size by reasoning alone — the GIL, the OS scheduler, and the downstream dependency all interact in ways the formulas approximate. Run a controlled load test and tie the three steps together, watching for these signals:
- Throughput plateau finds the knee. Sweep worker count upward; throughput rises, then flattens, then drops. The flat point just before the drop is your size. Past it, you are paying context-switch overhead for nothing.
- CPU vs queue depth disambiguates the bottleneck. Cores at ~100% with a growing queue means CPU-starved — shift work to processes or reduce threads. Cores idle with a growing queue means you are I/O-bound and can add threads.
- IPC handoff stays cheap. Benchmark per-task serialization for the process pool; overhead above ~15ms per task means payloads are too large — move to shared memory or coarser task granularity rather than more processes.
Diagnostic Hook: In production, export three series: average task latency (from record_latency), live queue_depth, and the controller's current _max_workers. The healthy picture is latency and depth fluctuating while worker count stays flat. A staircase of resizes correlated with steady latency means your thresholds are mis-tuned; a flat worker count with monotonically rising depth means you have hit max_workers and must scale out (more processes/hosts), not up. Periodically capture dump_thread_stacks() in staging — many threads parked on pthread_cond_wait/sem_wait confirm pool exhaustion or over-synchronization.
Pitfalls & edge cases¶
- Setting thread pool size to
os.cpu_count()for I/O-heavy work caps you far below capacity; those threads spend most of their life parked, socores × (1 + W/C)is the right target, notcores. - Running CPU-bound tasks in a thread pool serializes them on the GIL — they execute one at a time on a single core no matter how many threads you add. Route them to a
ProcessPoolExecutor. - Over-provisioning the process pool duplicates interpreter memory per process and adds OS context-switch overhead; past core count you lose throughput and risk OOM.
- Hidden CPU work inside "I/O" tasks (TLS, JSON parsing, regex compilation, decompression) inflates
c_ratioand silently makes a thread pool GIL-bound. Profile withcProfileand reclassify. - Resizing without a cooldown turns the adaptive controller into a thrash machine — each drain-and-recreate is expensive, and overlapping up/down thresholds make it oscillate forever. Always gap the thresholds and enforce the cooldown.
Frequently Asked Questions¶
Should I use threads or processes for mixed I/O and CPU workloads?
Route I/O-bound tasks to ThreadPoolExecutor and CPU-bound tasks to ProcessPoolExecutor. Mixing both in a single pool causes GIL contention and suboptimal utilization. Use a dispatcher that classifies tasks by their measured W/C ratio and routes accordingly.
How do I calculate the ideal thread pool size for high-latency network requests?
Use N = cores times (1 + W/C), where W/C is the ratio of average I/O wait time to CPU processing time. Start around 2 to 4 times cores for moderate latency and scale up as network round-trip time increases, validating against measured queue depth.
Can asyncio replace worker pools for mixed workloads?
asyncio excels at I/O but still runs CPU work on the loop thread, which blocks every other task. Offload CPU work to a separate process pool via loop.run_in_executor() and keep blocking CPU operations out of async handlers entirely.
Related¶
- Worker Pool Implementations — the parent guide on building the pools you are sizing here, including the executor-backed pattern.
- Concurrent Execution & Worker Patterns — up to the parent overview for the threads-vs-processes-vs-tasks decision behind these formulas.