WebSocket Connection Drops
Troubleshooting WebSocket disconnects and implementing robust reconnection and resumption patterns.
WebSocket connections can drop for many reasons (network, proxies, server restarts, client sleep). A resilient client and a resume-friendly server greatly improve UX. This page covers common causes, detection, reconnection strategies, resumption, buffering, and monitoring.
- Sudden disconnects with no error code or message.
- Repeated 1006 (abnormal closure) on browsers.
- Long pauses (silent period) followed by reconnection attempts.
- High reconnect churn (clients reconnecting in bursts after infra restart).
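When triaging these symptoms, it can help to log every close event with a readable label. The following is a minimal sketch, assuming a browser-style close event with `code`, `reason`, and `wasClean` fields; `describeClose` is a hypothetical helper name, and the reconnect rule is an illustrative assumption rather than a standard.

```javascript
// A few close codes from RFC 6455 and the IANA WebSocket close code registry.
const CLOSE_CODES = {
  1000: 'normal closure',
  1001: 'going away (page unload or server shutdown)',
  1006: 'abnormal closure (no close frame received)',
  1011: 'internal server error',
  1012: 'service restart',
  1013: 'try again later',
}

// describeClose is a hypothetical helper, not part of any library.
function describeClose(event) {
  return {
    code: event.code,
    label: CLOSE_CODES[event.code] || 'unrecognized code',
    reason: event.reason || '(none)',
    clean: Boolean(event.wasClean),
    // Assumption: only a clean 1000 is intentional; anything else reconnects.
    shouldReconnect: !(event.code === 1000 && event.wasClean),
  }
}
```

A client could wire this up as `ws.onclose = (ev) => console.warn('ws closed', describeClose(ev))` and ship the result to its logging pipeline.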
Mobile network handoffs, NAT timeouts, Wi‑Fi sleep, or device Doze modes can drop sockets.
Idle timeouts on proxies (ALB/NGINX) can close idle sockets; missing TCP keepalives make timeouts more likely.
Server crashes, restarts, or worker recycling without graceful session handoff cause disconnects.
High memory or CPU usage can stall the event loop, and servers may explicitly disconnect clients on auth/permission changes.
Use exponential backoff with jitter and a maximum cap. Tight reconnect loops cause a thundering herd after an outage; randomized jitter per client spreads the retries out.
// Reconnect with full jitter (sketch; ws.connect() is assumed to return a promise)
let attempt = 0
const base = 500 // ms
const cap = 30000 // ms
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))
function getBackoff() {
  const max = Math.min(cap, base * 2 ** attempt)
  // full jitter: uniform random between 0 and max
  return Math.floor(Math.random() * max)
}
async function tryConnect() {
  // loop instead of recursing so retries don't pile up as unawaited promises
  while (true) {
    try {
      await ws.connect()
      attempt = 0
      return
    } catch (err) {
      attempt++
      await sleep(getBackoff())
    }
  }
}
Add a maximum total retry window and surface a UI state (offline / reconnecting) if attempts exceed thresholds.
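The retry window and UI state can be derived from the attempt counter and the time of the first failure. This is a sketch under assumed thresholds; `connectionState`, `shouldRetry`, and all three constants are hypothetical and should be tuned per product.

```javascript
// All thresholds are illustrative assumptions.
const RECONNECTING_AFTER_ATTEMPTS = 1
const OFFLINE_AFTER_ATTEMPTS = 8
const MAX_RETRY_WINDOW_MS = 5 * 60 * 1000

// Maps retry progress to a UI state: 'connected', 'reconnecting', or 'offline'.
function connectionState(attempt, firstFailureAt, now = Date.now()) {
  if (attempt < RECONNECTING_AFTER_ATTEMPTS) return 'connected'
  const elapsed = now - firstFailureAt
  if (attempt >= OFFLINE_AFTER_ATTEMPTS || elapsed >= MAX_RETRY_WINDOW_MS) {
    return 'offline'
  }
  return 'reconnecting'
}

// True while the client should keep scheduling automatic retries.
function shouldRetry(attempt, firstFailureAt, now = Date.now()) {
  return connectionState(attempt, firstFailureAt, now) !== 'offline'
}
```

Once the state reaches `offline`, the client would typically stop automatic retries and offer a manual "retry" action instead.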
Heartbeats detect dead connections faster than TCP timeouts. Implement periodic ping/pong or application-level heartbeats.
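// Client-side heartbeat (sketch)
const HEARTBEAT_INTERVAL = 30000 // 30s
let hbTimer
let lastPong = Date.now()
function startHeartbeat() {
  lastPong = Date.now()
  hbTimer = setInterval(() => {
    // assumed threshold: no pong for two intervals means the socket is dead
    if (Date.now() - lastPong > 2 * HEARTBEAT_INTERVAL) {
      clearInterval(hbTimer)
      ws.close() // let the close handler trigger reconnection
      return
    }
    ws.send(JSON.stringify({ type: 'ping', ts: Date.now() }))
  }, HEARTBEAT_INTERVAL)
}
ws.onmessage = (msg) => {
  const data = JSON.parse(msg.data)
  if (data.type === 'pong') {
    lastPong = Date.now() // connection is alive
  }
}
On the server, reply to pings quickly and close connections that miss N heartbeats to free resources.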
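The server-side rule (close connections that miss N heartbeats) could be tracked roughly as follows. This is a sketch, not a definitive implementation: `HeartbeatMonitor` is a hypothetical class, `socket` is assumed to expose only `close()`, and the default interval and miss count are illustrative.

```javascript
// HeartbeatMonitor is a hypothetical helper; sockets only need a close() method.
class HeartbeatMonitor {
  constructor({ intervalMs = 30000, maxMissed = 2 } = {}) {
    this.intervalMs = intervalMs
    this.maxMissed = maxMissed
    this.lastSeen = new Map() // socket -> timestamp of last ping
  }

  // Call on connect and whenever a ping arrives from the client.
  touch(socket, now = Date.now()) {
    this.lastSeen.set(socket, now)
  }

  // Call when a socket closes so it is no longer tracked.
  forget(socket) {
    this.lastSeen.delete(socket)
  }

  // Call periodically; closes and returns sockets that missed too many heartbeats.
  sweep(now = Date.now()) {
    const dead = []
    for (const [socket, seen] of this.lastSeen) {
      const missed = Math.floor((now - seen) / this.intervalMs)
      if (missed >= this.maxMissed) dead.push(socket)
    }
    for (const socket of dead) {
      this.lastSeen.delete(socket)
      socket.close()
    }
    return dead
  }
}
```

A server would call `touch` from its ping handler and run `sweep` on a timer, for example once per heartbeat interval.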
To avoid data loss, support resumption: clients reconnect with a resume token or last-received sequence number and request missed messages. The server should be able to replay (or provide last N messages) for a short retention window.
// Reconnect payload (client)
{
"type": "resume",
"sessionId": "s_abc123",
"lastSeq": 345
}
// Server reply: replay messages after seq 345, or request full state if gap too large
{
"type": "resume_ack",
"replayFrom": 346,
"fallback": false
}
The replay window must be documented (e.g., last 5 minutes or last 10k messages). If the gap is too large, require a full state sync.
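A bounded replay buffer is one way to implement this on the server. The sketch below is in-memory and hypothetical (`ReplayBuffer` and its option names are not from any library); a multi-worker deployment would likely need a shared store. Returning `replayFrom: null` on fallback is an assumption about the message shape.

```javascript
// ReplayBuffer is a hypothetical in-memory sketch of a bounded replay window.
class ReplayBuffer {
  constructor({ maxMessages = 10000, maxAgeMs = 5 * 60 * 1000 } = {}) {
    this.maxMessages = maxMessages
    this.maxAgeMs = maxAgeMs
    this.messages = [] // ordered by seq: { seq, payload, at }
    this.nextSeq = 1
  }

  // Assign the next sequence number, store the message, and trim old entries.
  append(payload, now = Date.now()) {
    const msg = { seq: this.nextSeq++, payload, at: now }
    this.messages.push(msg)
    this.trim(now)
    return msg
  }

  // Drop messages beyond the count limit or older than the age limit.
  trim(now = Date.now()) {
    while (
      this.messages.length > this.maxMessages ||
      (this.messages.length && now - this.messages[0].at > this.maxAgeMs)
    ) {
      this.messages.shift()
    }
  }

  // Build the resume_ack plus the messages to replay after lastSeq.
  resume(lastSeq, now = Date.now()) {
    this.trim(now)
    const oldest = this.messages.length ? this.messages[0].seq : this.nextSeq
    // Messages the client needs were already evicted: require a full state sync.
    if (lastSeq + 1 < oldest) {
      return { ack: { type: 'resume_ack', replayFrom: null, fallback: true }, replay: [] }
    }
    const replay = this.messages.filter((m) => m.seq > lastSeq)
    return { ack: { type: 'resume_ack', replayFrom: lastSeq + 1, fallback: false }, replay }
  }
}
```

The server would send `ack` first and then each entry of `replay` in order, so the client can tell a replay from a required full sync before processing messages.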
Buffer outbound messages while disconnected and retry when reconnected. Ensure server-side operations are idempotent or require client-generated idempotency keys so retries don't cause duplicates.
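On the server side, duplicate suppression by idempotency key could look roughly like this. It is a sketch with assumed names: `IdempotentHandler` and the `idempotencyKey` field are hypothetical, and the cache is in-memory with a simple size bound.

```javascript
// IdempotentHandler is a hypothetical sketch: it remembers results by
// client-generated key so a retried message is not applied twice.
class IdempotentHandler {
  constructor(apply, { maxKeys = 10000 } = {}) {
    this.apply = apply // function that performs the real operation
    this.maxKeys = maxKeys
    this.results = new Map() // idempotencyKey -> result (insertion-ordered)
  }

  handle(msg) {
    const key = msg.idempotencyKey
    if (key === undefined) throw new Error('missing idempotencyKey')
    if (this.results.has(key)) {
      // Retry of an operation we already applied: return the stored result.
      return { duplicate: true, result: this.results.get(key) }
    }
    const result = this.apply(msg)
    this.results.set(key, result)
    // Evict the oldest key once the cache is full (Map keeps insertion order).
    if (this.results.size > this.maxKeys) {
      this.results.delete(this.results.keys().next().value)
    }
    return { duplicate: false, result }
  }
}
```

Clients would attach a fresh key when a message is first created (for example with `crypto.randomUUID()` where the runtime provides it) and reuse the same key on every retry of that message.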
// Outbound queue (sketch)
const outQueue = []
function send(msg) {
  if (ws.readyState !== WebSocket.OPEN) {
    outQueue.push(msg)
    return
  }
  ws.send(JSON.stringify(msg))
}
ws.onopen = () => {
  while (outQueue.length) ws.send(JSON.stringify(outQueue.shift()))
}
- Use sticky sessions / consistent hashing if using multiple WebSocket workers behind a load balancer, or implement a shared session store for resumption.
- Configure proxy idle timeouts longer than your heartbeat interval (e.g., ALB idle timeout > 60s when using 30s heartbeats).
- Run health checks and drain connections gracefully before worker restarts; notify clients to reconnect with resume tokens if possible.
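Graceful draining before a worker restart might be sketched as below. `drainConnections` and the `server_draining` message type are hypothetical; each socket is assumed to expose `send(string)` and `close(code, reason)`, which matches the shape of server-side sockets in the Node `ws` library.

```javascript
// drainConnections is a hypothetical helper for use before a worker restart.
function drainConnections(sockets, { retryAfterMs = 1000 } = {}) {
  let notified = 0
  for (const socket of sockets) {
    try {
      // 'server_draining' is an assumed application-level message type that
      // tells the client to reconnect (with its resume token) after a delay.
      socket.send(JSON.stringify({ type: 'server_draining', retryAfterMs }))
      // 1012 is the registered "service restart" close code.
      socket.close(1012, 'service restart')
      notified++
    } catch (err) {
      // Socket is already gone; nothing to drain.
    }
  }
  return notified
}
```

Giving clients a `retryAfterMs` hint, combined with client-side jitter, helps avoid the reconnect burst that follows a fleet-wide restart.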
Track connection metrics and alert on anomalies:
- Active connections
- Disconnect rate (per minute)
- Reconnect attempts and backoff saturation
- Missed heartbeat count
- Replay/fallback occurrences (how often full state syncs are required)
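The metrics above can be prototyped with a small in-process counter before wiring up a real metrics backend. This is a sketch only; `ConnectionMetrics` and its method names are hypothetical.

```javascript
// ConnectionMetrics is a hypothetical in-process sketch; production systems
// would usually export these values to a metrics backend instead.
class ConnectionMetrics {
  constructor({ windowMs = 60000 } = {}) {
    this.windowMs = windowMs
    this.active = 0
    this.disconnects = [] // timestamps inside the sliding window
    this.missedHeartbeats = 0
    this.replays = 0
    this.fallbacks = 0
  }

  onConnect() { this.active++ }

  onDisconnect(now = Date.now()) {
    this.active = Math.max(0, this.active - 1)
    this.disconnects.push(now)
  }

  onMissedHeartbeat() { this.missedHeartbeats++ }

  // Record a resume; fallback=true means a full state sync was required.
  onResume(fallback) {
    if (fallback) this.fallbacks++
    else this.replays++
  }

  // Disconnects observed in the last windowMs (one minute by default).
  disconnectRate(now = Date.now()) {
    this.disconnects = this.disconnects.filter((t) => now - t < this.windowMs)
    return this.disconnects.length
  }

  // Share of resumes that needed a full state sync.
  fallbackRatio() {
    const total = this.replays + this.fallbacks
    return total === 0 ? 0 : this.fallbacks / total
  }
}
```

A rising `fallbackRatio()` suggests the replay window is too short for real-world disconnect durations.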
Simulate network conditions (packet loss, latency, disconnects) and run chaos tests (restart workers, kill connections) to validate reconnection and resume flows.
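Before running full chaos tests, retry logic can be exercised against a scripted test double. The sketch below is not a network simulator: `FlakyServer` and `connectWithRetry` are hypothetical names, and `sleep` is injected so tests run without real delays.

```javascript
// FlakyServer is a hypothetical test double: it rejects a scripted number of
// connection attempts before accepting one.
class FlakyServer {
  constructor(failuresBeforeSuccess) {
    this.remainingFailures = failuresBeforeSuccess
    this.attemptsSeen = 0
  }

  connect() {
    this.attemptsSeen++
    if (this.remainingFailures > 0) {
      this.remainingFailures--
      return Promise.reject(new Error('simulated drop'))
    }
    return Promise.resolve('connected')
  }
}

// Retry loop under test, using capped exponential backoff with full jitter.
async function connectWithRetry(server, { maxAttempts = 10, sleep = async () => {} } = {}) {
  const waits = []
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      await server.connect()
      return { connected: true, attempts: attempt + 1, waits }
    } catch (err) {
      const max = Math.min(30000, 500 * 2 ** attempt)
      const wait = Math.floor(Math.random() * max)
      waits.push(wait)
      await sleep(wait)
    }
  }
  return { connected: false, attempts: maxAttempts, waits }
}
```

A test can then assert on the number of attempts and on each recorded wait staying under its cap, without touching the network.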
// ws-client.js (sketch)
class WSClient {
  constructor(url, sessionId, opts = {}) {
    this.url = url
    this.sessionId = sessionId
    this.base = opts.base ?? 500 // ms
    this.cap = opts.cap ?? 30000 // ms
    this.attempt = 0
    this.outQueue = []
    this.lastSeq = 0
    this.stopped = false
    this.connect()
  }
  connect() {
    if (this.stopped) return
    const backoff = Math.min(this.cap, this.base * 2 ** this.attempt)
    setTimeout(() => {
      this.ws = new WebSocket(this.url)
      this.ws.onopen = () => {
        this.attempt = 0
        // try resume
        this.ws.send(JSON.stringify({ type: 'resume', sessionId: this.sessionId, lastSeq: this.lastSeq }))
        // flush outbound queue
        while (this.outQueue.length) this.ws.send(JSON.stringify(this.outQueue.shift()))
      }
      this.ws.onmessage = (ev) => {
        const msg = JSON.parse(ev.data)
        // typeof check so a seq of 0 is not skipped
        if (typeof msg.seq === 'number') this.lastSeq = Math.max(this.lastSeq, msg.seq)
        // handle app messages...
      }
      this.ws.onclose = () => {
        this.attempt++
        this.connect()
      }
      this.ws.onerror = () => {
        this.ws.close() // onclose then schedules the reconnect
      }
    }, Math.random() * backoff) // full jitter
  }
  send(obj) {
    if (this.ws && this.ws.readyState === WebSocket.OPEN) {
      this.ws.send(JSON.stringify(obj))
    } else {
      this.outQueue.push(obj) // buffer while offline
    }
  }
  // stop reconnecting and close the socket
  close() {
    this.stopped = true
    if (this.ws) this.ws.close()
  }
}
// server (pseudo)
on('connection', (socket) => {
  socket.on('message', (m) => {
    let msg
    try {
      msg = JSON.parse(m)
    } catch (err) {
      return // ignore malformed frames
    }
    if (msg.type === 'resume') {
      const { sessionId, lastSeq } = msg
      // fetchMessagesSince is a placeholder for your replay store lookup
      const replay = fetchMessagesSince(sessionId, lastSeq)
      replay.forEach(s => socket.send(JSON.stringify(s)))
    }
  })
})
Quick checklist
- Implement exponential backoff with full jitter for reconnection.
- Use heartbeat/ping-pong to detect dead sockets faster than TCP timeouts.
- Support session resumption with last-seq or resume tokens and document replay window.
- Buffer outbound messages client-side and use idempotency keys for server operations.
- Configure proxy/load balancer timeouts > heartbeat interval and enable TCP keepalive.
- Instrument disconnect/reconnect rates, missed heartbeats, and replay/fallbacks.
- Test under adverse network conditions and run chaos experiments on servers.