Tuning WebSocket ping/pong Heartbeats
A WebSocket can be dead long before TCP notices. The peer's machine sleeps, a load balancer silently severs an idle connection, or the network partitions — and your recv() simply blocks forever, holding an open file descriptor and a half-open socket the OS will not reap for hours. TCP's own keepalive defaults to two hours of idleness before it probes, which is useless for a real-time system; by the time the kernel gives up you have leaked thousands of dead connections and the memory behind them. The application-layer fix is the WebSocket protocol's own keepalive: the websockets library sends ping frames and expects pong frames in return, closing the connection when a pong is overdue.
The catch is that the two knobs that control this — ping_interval and ping_timeout — are easy to mis-tune, and the two failure modes pull in opposite directions. Set them too aggressively and you kill healthy peers whenever ordinary network jitter delays a pong, generating a storm of reconnects that looks exactly like an outage. Set them too loosely and dead connections leak past your idle-timeout proxy, accumulate as zombie file descriptors, and inflate every per-connection metric you have. There is no single correct value — only a value correct for your network's latency distribution and the idle timeouts of the proxies on the path. This guide walks through choosing it: understanding the mechanism, fitting it to your infrastructure, balancing detection speed against false positives, reconnecting cleanly when keepalive does fire, and falling back to an application heartbeat when proxies interfere with control frames.
Prerequisites
- Python 3.11+ (for asyncio.timeout() and TaskGroup used in the reconnect example).
- The websockets library.
- Familiarity with the WebSocket and real-time streams overview and the broader async network I/O reference. This guide assumes you already run a receive loop and reconnect on ConnectionClosed.
1. Understand the websockets keepalive
websockets runs an internal keepalive task per connection. Every ping_interval seconds it sends a ping frame; it then waits up to ping_timeout seconds for the matching pong. If the pong does not arrive, the connection is closed with a keepalive timeout and your next recv()/send() raises ConnectionClosed. The defaults are ping_interval=20 and ping_timeout=20.
The worst-case time to detect a dead peer is ping_interval + ping_timeout (the connection may go silent just after a ping was answered, so you wait a full interval before the next ping, then the full timeout for its pong). Note that in current releases of the library the schedule is fixed: the keepalive task sleeps for ping_interval between pings whether or not data frames are flowing, so ordinary traffic neither postpones nor replaces a ping. This matters for sizing: the detection bound holds for chatty and idle connections alike, and a mostly-idle one (a control channel, a rarely-updated dashboard) leans entirely on keepalive because no other traffic would expose a silent peer. Setting ping_interval=None disables pinging and ping_timeout=None disables the timeout half; both together turn keepalive off, which you should only ever do if the other end of the connection is responsible for liveness detection.
Verify: with defaults, a peer that vanishes is detected within roughly 40 seconds. Confirm by freezing the peer process (e.g. kill -STOP, which stops it without a clean close) and timing how long until your loop sees ConnectionClosed.
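The detection bound described above is simple arithmetic, and a small helper makes it easy to check a proposed configuration against a target. This is a minimal sketch: the function name worst_case_detection is hypothetical, and the commented connect() call only illustrates where the two library parameters are passed.

```python
from __future__ import annotations


def worst_case_detection(ping_interval: float | None,
                         ping_timeout: float | None) -> float | None:
    """Upper bound, in seconds, on how long a silent peer goes unnoticed.

    Returns None when keepalive cannot detect a dead peer at all, i.e. when
    pings are disabled or pongs are never waited for.
    """
    if ping_interval is None or ping_timeout is None:
        return None
    return ping_interval + ping_timeout


# The same two values are what you would pass to the library, for example:
#   websockets.connect(uri, ping_interval=20, ping_timeout=20)
print(worst_case_detection(20, 20))  # 40
```

If the printed bound exceeds your detection target, lower one of the two values, keeping the false-positive trade-off from step 3 in mind.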
2. Set values for your network and proxy idle limits
The binding constraint is usually not your server — it is the idle timeout of every hop between you and the peer. AWS ALB defaults to 60 s idle, many NGINX setups to 60 s, Cloudflare to 100 s. If ping_interval is longer than the smallest idle timeout on the path, the proxy closes the connection before your keepalive ever fires.
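One way to turn that constraint into a number is to take the smallest idle timeout on the path and stay well below it. The sketch below is an illustration only: pick_ping_interval is a hypothetical helper, the 0.5 safety factor and the 20-second ceiling are assumptions rather than library recommendations, and the timeouts are the defaults quoted above, which your own deployment may override.

```python
def pick_ping_interval(idle_timeouts_s, safety=0.5, ceiling_s=20.0):
    """Choose a ping interval comfortably below the tightest idle timeout.

    idle_timeouts_s: idle timeouts, in seconds, of every hop on the path.
    safety:          fraction of the tightest timeout to use (assumed 0.5).
    ceiling_s:       never ping less often than this (assumed 20 s).
    """
    timeouts = list(idle_timeouts_s)
    if not timeouts:
        raise ValueError("need at least one idle timeout")
    if not 0 < safety < 1:
        raise ValueError("safety must be between 0 and 1")
    return min(ceiling_s, min(timeouts) * safety)


# Defaults quoted in the text; replace with the values you actually run.
path_idle_timeouts = {"AWS ALB": 60.0, "NGINX": 60.0, "Cloudflare": 100.0}
print(pick_ping_interval(path_idle_timeouts.values()))  # 20.0
```

With these hops the library default already fits. A hop with a 10-second idle timeout would instead pull the result down to 5.0.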
Verify: open a connection through the real proxy, send nothing, and confirm it stays up past the proxy idle limit. If it drops at exactly the proxy's idle value, your ping_interval is too high.
3. Detect a dead peer fast vs avoid false positives
To detect death faster, lower ping_interval (ping sooner) and ping_timeout (give up sooner). But every reduction in ping_timeout raises the chance that ordinary network jitter — a GC pause, a mobile handover, a momentarily saturated link — delays a legitimate pong past the deadline and closes a healthy connection.
Choose ping_timeout from observed RTT, not from a guess: it should exceed the peer's worst realistic round trip plus a margin (e.g. p99 RTT plus the longest expected GC/scheduler pause). The trap is reasoning from the median: a link with a 20 ms median RTT can routinely spike to 2–3 seconds under congestion or a cellular handover, and a ping_timeout of 1 second will tear down a perfectly healthy connection during every such spike. The cost asymmetry is stark: detecting a dead peer five seconds later is almost always cheaper than falsely closing a live session and forcing a full reconnect, re-auth, and resubscribe. When in doubt, bias ping_timeout upward.
Verify: scrape ws.latency (the library's last measured ping/pong RTT) over a representative window across your real client population; set ping_timeout to several times the p99 you observe, never below it, and re-check after any infrastructure change that alters the path.
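The sizing rule above can be sketched as a small calculation over collected RTT samples. This assumes you have already gathered ws.latency readings into a list of seconds. The helper names, the 3x multiplier, the one-second pause allowance, and the five-second floor are all illustrative assumptions, not values taken from the library.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty sequence (pct as an integer)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def suggest_ping_timeout(rtts_s, multiplier=3.0, pause_allowance_s=1.0,
                         floor_s=5.0):
    """Several times the p99 RTT plus a pause margin, never below a floor."""
    p99 = percentile(rtts_s, 99)
    return max(floor_s, multiplier * p99 + pause_allowance_s)


# 98 quiet samples at 20 ms and two 2.5 s spikes: the median says 20 ms,
# but the tail is what a timeout has to survive.
samples = [0.02] * 98 + [2.5] * 2
print(percentile(samples, 50))        # 0.02
print(suggest_ping_timeout(samples))  # 8.5
```

A timeout derived from the median here would sit far below the 2.5-second spikes, which is the failure described in this step.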
4. Handle ConnectionClosed and reconnect with backoff
When keepalive fires, the connection closes and your loop must reconnect — not crash. Wrap the session in a retry loop that catches ConnectionClosed and transport errors, then backs off exponentially with jitter, exactly as in retry and backoff strategies.
Note the distinction between ConnectionClosedError and the clean ConnectionClosedOK: a keepalive timeout is not clean, so it surfaces as the error subclass and you can log its close code to separate "peer went silent" (keepalive/1011) from "peer said goodbye" (1000/1001) from "network dropped" (1006). That breakdown is exactly what tells you whether your ping_timeout is too tight: a rising 1011 rate on otherwise-healthy peers is the unmistakable signature of false positives.
Verify: kill the server mid-stream and confirm the client reconnects after a bounded, jittered delay and that backoff resets to 1.0 on the next successful connect. Then freeze (rather than kill) the server with kill -STOP and confirm the client closes with a keepalive timeout after ping_interval + ping_timeout and reconnects, proving that the heartbeat, not just clean shutdown, drives recovery.
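A reconnect loop along these lines might look like the sketch below. It is deliberately library-agnostic: connect is any zero-argument callable returning an async context manager, and retry_on names the exceptions that should trigger a retry, so the names run_forever, next_delay, and classify_close are hypothetical. The 1.0-second initial backoff matches the reset value mentioned above, while the 60-second cap and the full-jitter formula are assumptions.

```python
import asyncio
import random


def next_delay(backoff, cap=60.0):
    """Return (jittered sleep for this attempt, doubled backoff for the next)."""
    return random.uniform(0, backoff), min(backoff * 2, cap)


def classify_close(code):
    """Bucket a close code the way the text describes, for logging."""
    if code in (1000, 1001):
        return "peer said goodbye"
    if code == 1011:
        return "peer went silent"
    if code == 1006:
        return "network dropped"
    return "other"


async def run_forever(connect, handle, *, retry_on=(OSError,),
                      initial=1.0, cap=60.0):
    """Run handle(ws) over fresh connections until it returns normally."""
    backoff = initial
    while True:
        try:
            async with connect() as ws:
                backoff = initial  # successful connect: reset the backoff
                await handle(ws)
                return             # handler finished cleanly, stop retrying
        except retry_on:
            delay, backoff = next_delay(backoff, cap)
            await asyncio.sleep(delay)


# Example wiring (assumes the websockets library and a placeholder URI):
#   import websockets
#   asyncio.run(run_forever(
#       lambda: websockets.connect("wss://example.invalid/stream",
#                                  ping_interval=20, ping_timeout=20),
#       consume,  # your own coroutine taking the connection
#       retry_on=(websockets.ConnectionClosed, OSError),
#   ))
```

How the close code is read from the exception depends on the library version, so classify_close takes a plain integer and leaves the extraction to the caller.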
5. App-level heartbeat when behind proxies that strip control frames
Some proxies and gateways do not forward WebSocket control frames (ping/pong) end to end — they answer pings at the edge, so your library sees pongs even when the real backend is dead, or they strip pings entirely. When the protocol heartbeat is unreliable, add an application-level heartbeat using ordinary data frames that traverse the full path.
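As a rough illustration of such a heartbeat, the sketch below sends a small JSON data frame on a timer and closes the connection when no acknowledgement has been seen for too long. Everything about the wire format is an assumption: the {"type": "hb"} and {"type": "hb_ack"} messages are hypothetical, and the real backend (not the edge) must be written to answer them. The class only relies on the connection exposing send() and close(code, reason).

```python
import asyncio
import json
import time


class AppHeartbeat:
    """Application-level liveness check carried in ordinary data frames."""

    def __init__(self, ws, interval=15.0, timeout=30.0):
        self.ws = ws
        self.interval = interval  # seconds between heartbeat messages
        self.timeout = timeout    # max silence tolerated since the last ack
        self.last_ack = time.monotonic()

    def on_message(self, raw):
        """Call for every received frame; True means it was a heartbeat ack."""
        try:
            msg = json.loads(raw)
        except (ValueError, TypeError):
            return False
        if isinstance(msg, dict) and msg.get("type") == "hb_ack":
            self.last_ack = time.monotonic()
            return True
        return False

    async def run(self):
        """Send heartbeats until the backend stops acknowledging them."""
        while True:
            await self.ws.send(json.dumps({"type": "hb"}))
            await asyncio.sleep(self.interval)
            if time.monotonic() - self.last_ack > self.timeout:
                await self.ws.close(code=1011, reason="app heartbeat timeout")
                return
```

In use, run() would be started as a task next to the receive loop, and the receive loop would pass each frame to on_message() first, skipping the ones it consumes.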
An application heartbeat is strictly more expensive than the protocol one, because it crosses your handler and your serialization and competes with real traffic, so reach for it only when you have evidence that control frames are not making it end to end. The diagnostic is simple: if the protocol keepalive reports a connection as healthy (ws.latency stays low) while the actual backend is unreachable, something on the path is answering pings locally.
Verify: behind the real proxy, pause the backend (not the edge) and confirm the app heartbeat times out and closes, where the protocol ping/pong did not register the failure.
Verification
- Dead peers detected within target: kill the peer and measure the time to ConnectionClosed; it should fall at or below ping_interval + ping_timeout.
- No spurious closes: over a representative traffic window, the count of keepalive-timeout closes (close code 1011 / "keepalive ping timeout") on healthy peers should be effectively zero. If it is not, raise ping_timeout.
- Survives the proxy: an idle connection stays up well past the smallest proxy idle timeout on the path.
Pitfalls and edge cases
- ping_timeout too low on jittery links. A timeout below the peer's p99 RTT (plus its worst GC/scheduler pause) closes healthy connections during ordinary stalls. Derive it from measured ws.latency, not intuition.
- Proxies dropping idle connections. If ping_interval exceeds the smallest idle timeout on the route, the connection is gone before keepalive runs. Keep ping_interval well under that limit.
- Blocking the recv loop delays pongs. Pongs are received and processed on the event loop. A CPU-bound or blocking call in your handler delays pong handling and can trip ping_timeout on a perfectly healthy peer. Keep handlers non-blocking; offload heavy work.
- Server and client both pinging. Both ends can run keepalive; that is fine and redundant, but tune them independently: the client's ping_timeout must tolerate the server's latency and vice versa. Disabling one side (ping_interval=None) is only safe if the other side reliably detects death.
- Edge proxies that answer pings locally. Protocol ping/pong then proves only the edge is alive, masking a dead backend. Use an application-level heartbeat (step 5) when the path may terminate control frames.
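For the blocking-handler pitfall in the list above, one common remedy is to push CPU-bound or blocking work onto a worker thread so the event loop stays free to process pongs. The sketch below uses asyncio.to_thread from the standard library; the expensive function is a hypothetical stand-in for whatever heavy work your handler does.

```python
import asyncio
import hashlib


def expensive(payload: bytes) -> str:
    """Stand-in for CPU-heavy or blocking work done per message."""
    return hashlib.sha256(payload * 1000).hexdigest()


async def handle_message(payload: bytes) -> str:
    # Run the blocking call in a worker thread; the event loop keeps
    # servicing other tasks, including keepalive, in the meantime.
    return await asyncio.to_thread(expensive, payload)
```

Threads help most when the work releases the GIL or blocks on I/O. For sustained pure-Python computation a process pool may be the better fit, though that is a design choice outside the scope of this sketch.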
Frequently Asked Questions
What are good default values for ping_interval and ping_timeout?
The library defaults are ping_interval=20 and ping_timeout=20, giving roughly 40-second worst-case dead-peer detection. On a low-latency internal network you can lower both to ~5s; on jittery mobile or WAN links keep ping_interval at 20 and raise ping_timeout to 30-40 to avoid false disconnects. Always keep ping_interval below the smallest proxy idle timeout on the path.
Why do healthy WebSocket connections keep closing with a keepalive timeout?
ping_timeout is too low relative to real round-trip latency and jitter, or a blocking call in your handler is delaying pong processing on the event loop. Raise ping_timeout above the observed p99 RTT plus the worst expected GC or scheduler pause, and keep handlers non-blocking so pongs are processed promptly.
Should both the WebSocket client and server send pings?
They can; both ends may run keepalive independently and it is harmless redundancy. Tune each side for the latency it sees. Disabling keepalive on one side with ping_interval=None is only safe if the other side reliably detects a dead peer.
Do I still need an application-level heartbeat?
Only when proxies on the path may answer ping frames locally or strip control frames, which lets protocol pong frames return even though the real backend is dead. In that case add a heartbeat using ordinary data frames that traverse the entire path end to end.
Related
- WebSocket & Real-Time Streams — the overview this guide drills into: lifecycle, concurrent send/recv, and broadcast.
- Async Network I/O & Protocol Handling — up to the parent overview for the full transport mental model.
- Retry and Backoff Strategies — the reconnect discipline that pairs with keepalive-driven closes.