
asyncio vs threading for 1000 concurrent HTTP requests

You need to fetch 1000 URLs as fast as the network allows, and the obvious ThreadPoolExecutor(max_workers=1000) either fails to start all of its threads, reserves gigabytes of address space for thread stacks, or spends more time context-switching than waiting on sockets. The question underneath is concrete: at 1000-way I/O fan-out, does threading + requests keep up, or does asyncio + aiohttp win, and if so by how much and on which axis? This page runs the head-to-head: a threaded baseline, an asyncio version with a bounded semaphore and a shared session, and the measurements that show where threads fall over. The answer is decisive for memory and connection density, and the numbers explain why.

Prerequisites

Both approaches are valid for I/O — threads release the GIL while blocked on a socket — so this is not a GIL-serialization story. It is a resource-density story: how much memory and scheduler overhead each model spends to keep 1000 requests in flight.

Figure: Threads vs asyncio at scale. Memory panel (RSS against concurrency, up to 1000): memory rises steeply for threads and stays flat for asyncio. Throughput panel (req/s against concurrency, up to 1000): throughput holds for asyncio but tails off for threads past a few hundred workers.

Step 1: The threaded baseline with ThreadPoolExecutor + requests

Start with the conventional approach: a pool of threads, each running a blocking requests.get(). The blocking call releases the GIL while it waits on the socket, so threads do overlap I/O — the cost is one OS thread (and its stack) per concurrent request.

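import time
import tracemalloc
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests
from requests.adapters import HTTPAdapter

def fetch(session: requests.Session, url: str) -> int:
    return session.get(url, timeout=10).status_code

def run_threaded(urls: list[str], workers: int) -> dict:
    tracemalloc.start()  # traces Python-heap allocations only, not thread stacks
    start = time.perf_counter()
    ok = 0
    with requests.Session() as session, ThreadPoolExecutor(max_workers=workers) as pool:
        # The default adapter keeps only 10 pooled connections per host; size the
        # pool to the worker count so threads reuse connections rather than drop them.
        adapter = HTTPAdapter(pool_connections=workers, pool_maxsize=workers)
        session.mount("http://", adapter)
        session.mount("https://", adapter)
        futures = [pool.submit(fetch, session, u) for u in urls]
        for f in as_completed(futures):
            try:
                if f.result() < 400:
                    ok += 1
            except requests.RequestException:
                pass  # count a failed request as not-ok instead of aborting the run
    wall = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"ok": ok, "wall_s": round(wall, 2),
            "peak_mb": round(peak / 1024**2, 1), "workers": workers}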

Verify: run with workers=200 against 1000 URLs and watch RSS in top or psutil; the peak_mb figure comes from tracemalloc, which counts Python-heap allocations only and does not see thread stacks. You cannot simply set workers=1000: each thread reserves a stack (commonly ~8 MB of address space), so 1000 threads threaten to reserve gigabytes, and the OS scheduler now juggles a thousand threads. The practical ceiling is a few hundred workers; beyond that, context-switch overhead grows faster than added concurrency buys you.
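The stack arithmetic above can be sketched in a few lines. This is a rough illustration, not a measurement: the 8 MiB default is the common Linux figure quoted in the text and is an assumption about your platform, and the helper names stack_reservation_gib and start_threads_with_stack are hypothetical. The second helper uses the standard threading.stack_size() call to request a smaller stack for threads created afterwards.

```python
import threading

def stack_reservation_gib(workers: int, stack_bytes: int = 8 * 1024**2) -> float:
    """Address space reserved for thread stacks, in GiB (reserved, not resident)."""
    return workers * stack_bytes / 1024**3

def start_threads_with_stack(n: int, stack_bytes: int = 256 * 1024) -> int:
    """Start and join n no-op threads using a smaller per-thread stack."""
    # The new size applies only to threads started after this call.
    previous = threading.stack_size(stack_bytes)
    try:
        threads = [threading.Thread(target=lambda: None) for _ in range(n)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return len(threads)
    finally:
        # Restore the earlier setting (0 means "platform default").
        threading.stack_size(previous)

print(stack_reservation_gib(200))   # 200 workers at 8 MiB each
print(stack_reservation_gib(1000))  # 1000 workers at 8 MiB each
```

Shrinking the stack reduces the reservation but does not remove the scheduler cost, and a stack that is too small can crash deeply recursive code, so treat it as a mitigation rather than a fix.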

Step 2: The asyncio version with a bounded Semaphore and a shared session

The asyncio version keeps all 1000 requests on one thread. An asyncio.Semaphore bounds how many are actually in flight, and a single shared aiohttp.ClientSession pools connections — both are essential.

import asyncio
import time
import tracemalloc
import aiohttp

async def fetch(session: aiohttp.ClientSession, url: str, sem: asyncio.Semaphore) -> int | None:
    async with sem:  # bound in-flight requests; not the total task count
        try:
            async with asyncio.timeout(10):  # Python 3.11+
                async with session.get(url) as resp:
                    await resp.read()
                    return resp.status
        except (aiohttp.ClientError, TimeoutError):
            # Swallow per-request failures: an exception escaping a task would
            # make the TaskGroup cancel every other in-flight request.
            return None

async def run_asyncio(urls: list[str], limit: int = 200) -> dict:
    tracemalloc.start()
    start = time.perf_counter()
    sem = asyncio.Semaphore(limit)
    # Match the connector cap to the semaphore; the default cap is 100 connections.
    connector = aiohttp.TCPConnector(limit=limit)
    async with aiohttp.ClientSession(connector=connector) as session:
        async with asyncio.TaskGroup() as tg:  # Python 3.11+
            tasks = [tg.create_task(fetch(session, u, sem)) for u in urls]
    # Every task has finished once the TaskGroup block exits.
    ok = sum(1 for t in tasks
             if (status := t.result()) is not None and status < 400)
    wall = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"ok": ok, "wall_s": round(wall, 2),
            "peak_mb": round(peak / 1024**2, 1), "limit": limit}

Verify: all 1000 tasks are created immediately, but only limit requests touch the network at once — the semaphore is the backpressure. Sharing one session is what makes this efficient; a session-per-request defeats connection pooling and TLS reuse. See reusing aiohttp.ClientSession across requests and the pool sizing in connection pooling and keepalive for why the session and its connector limit, not the task count, govern throughput.
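To see the backpressure without touching the network, the sketch below swaps the HTTP call for asyncio.sleep and records how many coroutines are inside the semaphore at once. The function name measure_peak_in_flight and the 10 ms sleep are illustrative assumptions; only asyncio.Semaphore and asyncio.gather come from the standard library.

```python
import asyncio

async def measure_peak_in_flight(total: int = 1000, limit: int = 200) -> int:
    """Run `total` fake requests and return the highest concurrent count seen."""
    sem = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def fake_fetch() -> None:
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the network wait
            in_flight -= 1

    # All tasks are created up front; the semaphore decides who proceeds.
    await asyncio.gather(*(fake_fetch() for _ in range(total)))
    return peak

if __name__ == "__main__":
    print(asyncio.run(measure_peak_in_flight()))
```

The returned peak never exceeds limit, however many tasks are created, which is the property the real fetch relies on.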

Step 3: Measure wall-time, memory, and peak threads side by side

Run both against the same 1000 URLs and capture the three axes that decide the choice: wall-clock time, peak memory, and OS thread count.

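import asyncio
import threading

def with_peak_threads(fn):
    """Run fn() while a sampler thread records the peak live thread count (sampler excluded)."""
    peak, done = threading.active_count(), threading.Event()
    def sample() -> None:
        nonlocal peak
        while not done.wait(0.01):  # sample every 10 ms until told to stop
            peak = max(peak, threading.active_count() - 1)
    sampler = threading.Thread(target=sample, daemon=True)
    sampler.start()
    result = fn()
    done.set()
    sampler.join()
    return result, peak

def bench(urls: list[str]) -> None:
    base_threads = threading.active_count()
    # Sample during each run: once the executor shuts down, its threads are gone.
    t, t_peak = with_peak_threads(lambda: run_threaded(urls, workers=200))
    print(f"threaded: {t} | peak_threads={t_peak}")
    a, a_peak = with_peak_threads(lambda: asyncio.run(run_asyncio(urls, limit=200)))
    print(f"asyncio:  {a} | peak_threads={a_peak}")
    print(f"baseline threads before runs: {base_threads}")

if __name__ == "__main__":
    urls = ["https://example.com"] * 1000
    bench(urls)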

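Verify: for I/O-bound fetches the two wall-clock times land close at matched in-flight concurrency, because both are bounded by network latency, not CPU. The divergence is in memory and threads. The threaded run's peak RSS scales with workers, and its thread count tracks the pool size while the pool is alive; once the executor's with block exits, the workers are joined and threading.active_count() falls back to the baseline, so the count has to be sampled during the run. The asyncio run stays near the single main thread with near-flat RSS regardless of the 1000 tasks. Note that tracemalloc's peak_mb covers Python-heap allocations only; thread stacks show up in RSS, not in that figure.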
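Since the comparison is stated in terms of RSS, one way to read it from inside the process is the standard-library resource module. This is a minimal sketch that assumes a Unix-like system (the module is not available on Windows); the helper name peak_rss_mb is hypothetical.

```python
import resource
import sys

def peak_rss_mb() -> float:
    """Peak resident set size of this process so far, in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports ru_maxrss in KiB; macOS reports it in bytes.
    divisor = 1024**2 if sys.platform == "darwin" else 1024
    return peak / divisor

if __name__ == "__main__":
    print(f"peak RSS so far: {peak_rss_mb():.1f} MiB")
```

ru_maxrss is a high-water mark for the whole process lifetime, so running both benchmarks in one process lets the first run's peak leak into the second reading. Running each model in its own process keeps the two figures separate.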

Step 4: Interpret the numbers

The measurements have a mechanical explanation:

  • Per-thread stack memory. Each OS thread reserves a stack — typically ~8 MB of address space, with resident pages growing as the stack is used. A thousand threads is a memory liability before any request payload. A coroutine is a few KB of heap object; 1000 of them are negligible.
  • Context-switch cost. Past a few hundred threads, the OS scheduler spends a rising fraction of CPU switching between runnable threads. The event loop switches coroutines in user space at await points with no syscall, so it scales to tens of thousands of concurrent sockets on one thread.
  • GIL release on I/O. Both models overlap I/O: the blocking socket call in requests releases the GIL while it waits, and the event loop's selector wait (epoll on Linux) does the same, although with a single thread there is nothing else contending for it. This is why threads are viable for I/O at all, and why this comparison is about resource density, not parallelism. For CPU-bound work the story flips entirely, as the parent Threading vs Multiprocessing vs Asyncio guide details.
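The per-coroutine cost in the first bullet can be checked directly. The sketch below parks n tasks on an asyncio.Event and uses tracemalloc to estimate the Python-heap bytes each suspended task holds. The function name bytes_per_parked_task is an illustrative assumption, and the figure will vary by Python version; real fetch coroutines also hold request and response buffers that this does not count.

```python
import asyncio
import tracemalloc

async def bytes_per_parked_task(n: int = 1000) -> float:
    """Estimate Python-heap bytes held per suspended task."""
    gate = asyncio.Event()

    async def parked() -> None:
        await gate.wait()  # suspend until the gate opens

    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    tasks = [asyncio.create_task(parked()) for _ in range(n)]
    await asyncio.sleep(0)  # yield once so every task runs to its first await
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    gate.set()  # release all tasks and let them finish cleanly
    await asyncio.gather(*tasks)
    return (after - before) / n

if __name__ == "__main__":
    print(f"{asyncio.run(bytes_per_parked_task()):.0f} bytes per parked task")
```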

Step 5: When threads still win

Asyncio is not the answer to every fan-out:

  • The client library has no async equivalent. A vendor SDK or a driver that only exposes a blocking API runs on threads (or asyncio.to_thread) regardless. Rewriting code you do not control to make it async is not worth it.
  • Concurrency is modest (tens, not thousands). At 50-way fan-out, a ThreadPoolExecutor is simpler, has no event-loop ceremony, and the memory difference is irrelevant. Reach for asyncio when density or connection count is the constraint.
  • The surrounding code is synchronous. Introducing a loop into an otherwise blocking service adds a bridging burden; threads may be the lower-risk choice until a broader migration.
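When the blocking client has to stay, asyncio.to_thread is the bridge mentioned in the first bullet. The sketch below uses time.sleep inside a hypothetical blocking_call as a stand-in for a vendor SDK call; the names blocking_call and fan_out are assumptions for illustration.

```python
import asyncio
import time

def blocking_call(delay: float) -> float:
    """Hypothetical stand-in for a blocking SDK call."""
    time.sleep(delay)
    return delay

async def fan_out(n: int = 4, delay: float = 0.2) -> float:
    """Run n blocking calls on worker threads and return the elapsed seconds."""
    start = time.perf_counter()
    # Each call runs on the loop's default thread pool, so the loop stays responsive.
    await asyncio.gather(*(asyncio.to_thread(blocking_call, delay) for _ in range(n)))
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"elapsed: {asyncio.run(fan_out()):.2f}s")
```

The calls overlap because each one occupies a worker thread, so this inherits the per-thread costs discussed earlier: the default executor is deliberately small, which suits modest fan-out rather than 1000-way concurrency.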

Verification

You have the right model when, at 1000-way fan-out:

  • Memory: asyncio peak RSS is near-flat as you raise the in-flight limit; threaded RSS climbs roughly with workers. A 1000-thread pool that survives at all will dwarf the asyncio process.
  • Threads: for asyncio, the thread count sampled during the run stays near 1, plus any executor threads aiohttp's resolver may use for DNS lookups; for the threaded run it tracks max_workers while the pool is alive.
  • Throughput: at matched in-flight concurrency the two are comparable, both pinned by network latency. Push concurrency past a few hundred and threaded throughput tails off from scheduler overhead while asyncio holds — the curves in the diagram above.

Pitfalls & edge cases

  • Unbounded asyncio tasks. Creating 1000 tasks without a Semaphore opens 1000 sockets at once — you will hit file-descriptor limits or the remote's rate limit. The semaphore, not the task count, is the real concurrency control.
  • A fresh session per request. aiohttp.ClientSession() per call discards connection pooling and re-does TLS each time, erasing asyncio's advantage. Create one session and share it.
  • requests inside a coroutine. Calling the blocking requests.get() directly in an async def freezes the loop and serializes everything. Use aiohttp/httpx, or asyncio.to_thread if you must keep the blocking client.
  • Comparing against a single endpoint. Hammering one host conflates client behavior with the server's connection limits. Spread across hosts (or a local mock) to measure the client, not the target.
  • Ignoring the connector limit. aiohttp's default TCPConnector caps total connections at 100. If that cap is below your semaphore limit, the connector, not your code, is the bottleneck. Tune them together, for example by passing TCPConnector(limit=...) with the same value as the semaphore.
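The file-descriptor pitfall can be guarded against before the run starts. This is a minimal sketch, assuming a Unix-like system where the standard resource module is available; the helper name clamp_in_flight and the reserve of 64 descriptors for files, logs, and pipes are illustrative assumptions, not recommended values.

```python
import resource

def clamp_in_flight(requested: int, reserve: int = 64) -> int:
    """Cap the in-flight limit so sockets stay below the soft descriptor limit."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return requested  # no descriptor cap to respect
    # Leave `reserve` descriptors for everything that is not a request socket.
    return max(1, min(requested, soft - reserve))

if __name__ == "__main__":
    print(clamp_in_flight(1000))
```

The clamped value can then feed both the semaphore and the connector limit so the two stay in step.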

Frequently Asked Questions

Is asyncio faster than threading for 1000 HTTP requests?

At matched in-flight concurrency the wall-clock times are comparable, because both are bounded by network latency, not CPU. Asyncio wins decisively on memory and connection density: it keeps all 1000 requests on one thread with near-flat memory, while threads cost a stack each and the scheduler tails off past a few hundred workers.

Why can't I just use ThreadPoolExecutor(max_workers=1000)?

Each OS thread reserves a stack, commonly around 8 MB of address space, so 1000 threads threaten gigabytes of reserved memory, and the scheduler must juggle a thousand runnable threads. Past a few hundred workers, context-switch overhead grows faster than the added concurrency helps.

Do I still need a Semaphore if asyncio is single-threaded?

Yes. Creating 1000 tasks without bounding them opens 1000 sockets at once, hitting file-descriptor limits or remote rate limits. The Semaphore caps in-flight requests, and the in-flight limit, not the task count, is the real concurrency control.

When should I keep using threads instead of asyncio?

Keep threads when the only client is a blocking library with no async equivalent, when concurrency is modest (tens of requests), or when the surrounding code is synchronous and adding an event loop introduces more bridging risk than the density gain is worth.