WebSocket Connection Drops

Troubleshooting WebSocket disconnects and implementing robust reconnection and resumption patterns.

Overview

WebSocket connections can drop for many reasons (network, proxies, server restarts, client sleep). A resilient client and a resume-friendly server greatly improve UX. This page covers common causes, detection, reconnection strategies, resumption, buffering, and monitoring.

Common symptoms
  • Sudden disconnects with no error code or message.
  • Repeated 1006 (abnormal closure) on browsers.
  • Long pauses (silent period) followed by reconnection attempts.
  • High reconnect churn (clients reconnecting in bursts after infra restart).

Root causes

Network & client

Mobile network handoffs, NAT timeouts, Wi‑Fi sleep, or device Doze modes can drop sockets.

Load balancer / proxy

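Idle timeouts on proxies and load balancers (ALB, NGINX) close sockets that carry no traffic for the timeout period. TCP keepalives may not count as traffic for these timers, so quiet connections without WebSocket pings or application-level heartbeats are the most likely to be closed.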

Server

Server crashes, restarts, or worker recycling without graceful session handoff cause disconnects.

Application

High memory or CPU usage can stall the event loop long enough for connections to time out. The server may also disconnect clients explicitly when auth or permissions change.

Reconnection strategy (client)

Use exponential backoff with a maximum cap, and randomize each client's delay (jitter) so that clients do not reconnect in lockstep after an outage (the thundering-herd problem). Never retry in a tight loop.

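// Reconnect with exponential backoff and full jitter
let attempt = 0
const base = 500 // ms
const cap = 30000 // ms

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

function getBackoff() {
  const max = Math.min(cap, base * 2 ** attempt)
  // full jitter: uniform random in [0, max)
  return Math.floor(Math.random() * max)
}

// Resolves with an open socket, or rejects if the connection fails before opening
function openSocket(url) {
  return new Promise((resolve, reject) => {
    const ws = new WebSocket(url)
    ws.onopen = () => resolve(ws)
    ws.onerror = () => reject(new Error('WebSocket connection failed'))
  })
}

// Loops instead of recursing, so failures cannot escape as unhandled rejections
async function tryConnect(url) {
  while (true) {
    try {
      const ws = await openSocket(url)
      attempt = 0 // reset backoff after a successful connection
      return ws
    } catch (err) {
      await sleep(getBackoff())
      attempt++
    }
  }
}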

Add a maximum total retry window and surface a UI state (offline / reconnecting) if attempts exceed thresholds.
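
The retry window and UI state described above can be tracked with a small helper. This is a minimal sketch, not a prescribed design: `RetryBudget`, `maxWindowMs`, `reconnectingAfter`, and the three state labels are hypothetical names, and the clock is injected so the logic can be tested without waiting.

```javascript
// Hypothetical helper: tracks how long the client has been retrying and
// maps that to a UI state. Names and thresholds are illustrative assumptions.
class RetryBudget {
  constructor({ maxWindowMs = 120000, reconnectingAfter = 2, now = Date.now } = {}) {
    this.maxWindowMs = maxWindowMs
    this.reconnectingAfter = reconnectingAfter
    this.now = now
    this.reset()
  }

  // Call after a successful connection.
  reset() {
    this.attempts = 0
    this.firstFailureAt = null
  }

  // Call after each failed attempt.
  recordFailure() {
    if (this.firstFailureAt === null) this.firstFailureAt = this.now()
    this.attempts++
  }

  // True once the total retry window has been used up.
  exhausted() {
    return this.firstFailureAt !== null && this.now() - this.firstFailureAt >= this.maxWindowMs
  }

  // Returns 'online', 'reconnecting', or 'offline'.
  uiState() {
    if (this.exhausted()) return 'offline'
    if (this.attempts >= this.reconnectingAfter) return 'reconnecting'
    return 'online'
  }
}
```

A reconnect loop would call `recordFailure()` in its catch branch, stop retrying once `exhausted()` returns true, and call `reset()` after a successful open.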

Keepalive & heartbeat

Heartbeats detect dead connections faster than TCP timeouts. Implement periodic ping/pong or application-level heartbeats.

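// Client-side heartbeat: send pings and close the socket if pongs stop arriving
const HEARTBEAT_INTERVAL = 30000 // 30s
const MAX_MISSED = 2 // unanswered pings tolerated before closing
let hbTimer
let missed = 0

// Call from ws.onopen
function startHeartbeat() {
  missed = 0
  hbTimer = setInterval(() => {
    if (missed >= MAX_MISSED) {
      stopHeartbeat()
      ws.close() // hands over to the reconnect logic
      return
    }
    missed++
    ws.send(JSON.stringify({ type: 'ping', ts: Date.now() }))
  }, HEARTBEAT_INTERVAL)
}

// Call from ws.onclose so the timer does not outlive the socket
function stopHeartbeat() {
  clearInterval(hbTimer)
}

ws.onmessage = (msg) => {
  const data = JSON.parse(msg.data)
  if (data.type === 'pong') missed = 0 // connection is alive
}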

On the server, reply to pings quickly and close connections that miss N heartbeats to free resources.
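
The server-side rule above (close connections that miss N heartbeats) can be sketched as a tracker that is independent of any particular WebSocket library. `HeartbeatMonitor` and its option names are hypothetical, and connection ids are assumed to be assigned by your server.

```javascript
// Hypothetical server-side tracker: records the last ping per connection and
// reports connections that have missed too many heartbeats.
class HeartbeatMonitor {
  constructor({ intervalMs = 30000, maxMissed = 2, now = Date.now } = {}) {
    this.intervalMs = intervalMs
    this.maxMissed = maxMissed
    this.now = now
    this.lastSeen = new Map() // connection id -> timestamp of last ping
  }

  // Call on connect and whenever a ping arrives.
  touch(id) {
    this.lastSeen.set(id, this.now())
  }

  // Call when a connection closes.
  remove(id) {
    this.lastSeen.delete(id)
  }

  // Returns ids whose last ping is older than maxMissed intervals.
  // The caller closes those sockets and then calls remove(id).
  findDead() {
    const cutoff = this.now() - this.intervalMs * this.maxMissed
    const dead = []
    for (const [id, ts] of this.lastSeen) {
      if (ts < cutoff) dead.push(id)
    }
    return dead
  }
}
```

Running `findDead()` on a timer (for example once per heartbeat interval) keeps the sweep cheap and avoids one timer per connection.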

Session resumption and ordering

To avoid data loss, support resumption: clients reconnect with a resume token or last-received sequence number and request missed messages. The server should be able to replay (or provide last N messages) for a short retention window.

// Reconnect payload (client)
{
  "type": "resume",
  "sessionId": "s_abc123",
  "lastSeq": 345
}

// Server reply: replay messages after seq 345, or request full state if gap too large
{
  "type": "resume_ack",
  "replayFrom": 346,
  "fallback": false
}

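Document the replay window (e.g., the last 5 minutes or the last 10k messages) so clients know how long a disconnect they can recover from. If the gap between the client's last sequence number and the oldest retained message is too large, require a full state sync.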
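
One way to implement a count-based replay window is a bounded buffer per session. The sketch below is an assumption about how such a store could look, not a description of any specific server: `ReplayBuffer`, `append`, and `since` are hypothetical names, and a time-based window would need timestamps in addition to sequence numbers.

```javascript
// Hypothetical per-session replay buffer keeping the last `maxMessages`
// messages. `since(lastSeq)` returns the missed messages, or null when the
// gap is too large and the client needs a full state sync.
class ReplayBuffer {
  constructor(maxMessages = 10000) {
    this.maxMessages = maxMessages
    this.messages = [] // ordered by seq, oldest first
    this.nextSeq = 1
  }

  // Assign the next sequence number and retain the message.
  append(payload) {
    const msg = { seq: this.nextSeq++, payload }
    this.messages.push(msg)
    if (this.messages.length > this.maxMessages) this.messages.shift()
    return msg
  }

  since(lastSeq) {
    const newest = this.nextSeq - 1
    if (lastSeq >= newest) return [] // client is already up to date
    const oldest = this.messages.length ? this.messages[0].seq : this.nextSeq
    // The first message the client needs (lastSeq + 1) has been evicted.
    if (lastSeq + 1 < oldest) return null
    return this.messages.filter((m) => m.seq > lastSeq)
  }
}
```

A `null` result maps to `"fallback": true` in the `resume_ack` reply; an array maps to a normal replay starting at `lastSeq + 1`.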

Client buffering & idempotency

Buffer outbound messages while disconnected and retry when reconnected. Ensure server-side operations are idempotent or require client-generated idempotency keys so retries don't cause duplicates.

// Outbound queue pseudo
const outQueue = []

function send(msg) {
  if (ws.readyState !== WebSocket.OPEN) {
    outQueue.push(msg)
    return
  }
  ws.send(JSON.stringify(msg))
}

ws.onopen = () => {
  while (outQueue.length) ws.send(JSON.stringify(outQueue.shift()))
}
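
Retrying buffered messages is only safe if the server can recognize duplicates. The following sketch shows one possible shape for client-generated idempotency keys and server-side deduplication. `makeKeyedSender`, `IdempotentHandler`, and the `idempotencyKey` field are hypothetical names chosen for illustration.

```javascript
// Client side: attach a key that stays the same across retries of one message.
// The `clientId-counter` format is an illustrative assumption; any unique
// client-generated value (for example a UUID) would work.
function makeKeyedSender(clientId) {
  let counter = 0
  return (msg) => ({ ...msg, idempotencyKey: `${clientId}-${++counter}` })
}

// Server side: remember results by key so a retried message is not applied twice.
// An in-memory Map is used here; a real deployment would likely need a shared
// store with expiry.
class IdempotentHandler {
  constructor(apply) {
    this.apply = apply // function that performs the operation
    this.results = new Map() // idempotencyKey -> result of first execution
  }

  handle(msg) {
    const key = msg.idempotencyKey
    if (this.results.has(key)) return this.results.get(key) // duplicate: no side effects
    const result = this.apply(msg)
    this.results.set(key, result)
    return result
  }
}
```

The key must be attached before the message enters the outbound queue, so that a message flushed twice carries the same key both times.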

Infrastructure & deployment tips
  • Use sticky sessions / consistent hashing if using multiple WebSocket workers behind a load balancer, or implement a shared session store for resumption.
  • Configure proxy idle timeouts longer than your heartbeat interval (e.g., ALB idle timeout > 60s when using 30s heartbeats).
  • Run health checks and drain connections gracefully before worker restarts; notify clients to reconnect with resume tokens if possible.
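
The graceful drain step in the last tip could look roughly like this. It is a sketch under assumptions: `drainConnections` and the `reconnect` message type are hypothetical, and `sockets` is assumed to be an iterable of objects exposing `send(string)` and `close(code, reason)`.

```javascript
// Hypothetical drain step run before a worker restarts.
function drainConnections(sockets, { retryAfterMs = 1000 } = {}) {
  let notified = 0
  for (const socket of sockets) {
    try {
      // Tell the client to reconnect (to another worker) and resume.
      socket.send(JSON.stringify({ type: 'reconnect', retryAfterMs }))
      // 1001 is the standard "going away" close code.
      socket.close(1001, 'server restarting')
      notified++
    } catch (err) {
      // Socket is already unusable; nothing to drain.
    }
  }
  return notified
}
```

Clients that receive the `reconnect` message can skip most of the backoff and resume with their last sequence number, which keeps a rolling restart from looking like an outage.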

Monitoring & metrics

Track connection metrics and alert on anomalies:

  • Active connections
  • Disconnect rate (per minute)
  • Reconnect attempts and backoff saturation
  • Missed heartbeat count
  • Replay/fallback occurrences (how often full state syncs are required)
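
As an illustration of how several of these metrics relate to connection events, here is a minimal in-process counter sketch (it does not cover reconnect attempts or backoff saturation, which are measured on the client). `ConnectionMetrics` and its method names are hypothetical; in practice the values would be exported to whatever metrics system you already run.

```javascript
// Hypothetical in-process counters for WebSocket connection health.
class ConnectionMetrics {
  constructor({ windowMs = 60000, now = Date.now } = {}) {
    this.windowMs = windowMs
    this.now = now
    this.active = 0
    this.disconnects = [] // timestamps of recent disconnects
    this.missedHeartbeats = 0
    this.resumes = 0
    this.fallbacks = 0
  }

  onConnect() {
    this.active++
  }

  onDisconnect() {
    this.active = Math.max(0, this.active - 1)
    this.disconnects.push(this.now())
  }

  onMissedHeartbeat() {
    this.missedHeartbeats++
  }

  // fallback = true when a full state sync was required instead of a replay
  onResume(fallback) {
    this.resumes++
    if (fallback) this.fallbacks++
  }

  // Disconnects seen within the last window (one minute by default).
  disconnectRate() {
    const cutoff = this.now() - this.windowMs
    this.disconnects = this.disconnects.filter((ts) => ts > cutoff)
    return this.disconnects.length
  }

  // Fraction of resumes that needed a full state sync.
  fallbackRatio() {
    return this.resumes === 0 ? 0 : this.fallbacks / this.resumes
  }
}
```

A rising `fallbackRatio()` suggests the replay window is too short for the disconnect durations clients actually experience.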

Testing & chaos

Simulate network conditions (packet loss, latency, disconnects) and run chaos tests (restart workers, kill connections) to validate reconnection and resume flows.
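
A chaos test that kills connections is easier to debug when it is repeatable. The sketch below closes a random fraction of sockets using a seeded generator, so a failing run can be replayed. `seededRandom` and `killRandomConnections` are hypothetical helpers, and `sockets` is assumed to be an iterable of objects with a `close()` method (fakes in unit tests, real server-side sockets in staging).

```javascript
// Simple linear congruential generator so a chaos run can be replayed from a seed.
function seededRandom(seed) {
  let state = seed >>> 0
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296
    return state / 4294967296 // value in [0, 1)
  }
}

// Closes roughly `fraction` of the given sockets and returns the ones closed.
function killRandomConnections(sockets, fraction, random = Math.random) {
  const killed = []
  for (const socket of sockets) {
    if (random() < fraction) {
      socket.close()
      killed.push(socket)
    }
  }
  return killed
}
```

After each kill round, the test can assert that every affected client reconnects, sends a `resume` message, and ends up with no gaps in its sequence numbers.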

Example implementations
Browser client — reconnect + resume (pseudo)
// ws-client.js (sketch)
class WSClient {
  constructor(url, sessionId, opts = {}) {
    this.url = url
    this.sessionId = sessionId
    this.attempt = 0
    this.outQueue = []
    this.lastSeq = 0
    this.stopped = false // set by close() to stop reconnecting
    this.connect()
  }

  connect() {
    // The first attempt connects immediately; retries wait with full jitter.
    const backoff = this.attempt === 0 ? 0 : Math.min(30000, 500 * 2 ** this.attempt)
    setTimeout(() => {
      if (this.stopped) return
      const ws = new WebSocket(this.url)
      this.ws = ws
      ws.onopen = () => {
        this.attempt = 0
        // ask the server to replay anything after lastSeq
        ws.send(JSON.stringify({ type: 'resume', sessionId: this.sessionId, lastSeq: this.lastSeq }))
        // flush messages buffered while offline
        while (this.outQueue.length) ws.send(JSON.stringify(this.outQueue.shift()))
      }
      ws.onmessage = (ev) => {
        const msg = JSON.parse(ev.data)
        // typeof check so that a seq of 0 is not ignored
        if (typeof msg.seq === 'number') this.lastSeq = Math.max(this.lastSeq, msg.seq)
        // handle app messages...
      }
      ws.onclose = () => {
        if (this.stopped) return
        this.attempt++
        this.connect()
      }
      // Closing here triggers onclose, which schedules the retry.
      ws.onerror = () => {
        ws.close()
      }
    }, Math.random() * backoff) // jitter
  }

  send(obj) {
    if (this.ws && this.ws.readyState === WebSocket.OPEN) {
      this.ws.send(JSON.stringify(obj))
    } else {
      this.outQueue.push(obj) // buffer while offline
    }
  }

  // Stop reconnecting and close the current socket.
  close() {
    this.stopped = true
    if (this.ws) this.ws.close()
  }
}
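
Server-side resume & replay (concept)
// server (pseudo): fetchMessagesSince is a placeholder for your message store.
// It is assumed to return null when lastSeq is older than the replay window.
on('connection', (socket) => {
  socket.on('message', (m) => {
    const msg = JSON.parse(m)
    if (msg.type === 'resume') {
      const { sessionId, lastSeq } = msg
      const replay = fetchMessagesSince(sessionId, lastSeq)
      if (replay === null) {
        // gap too large: ask the client for a full state sync
        socket.send(JSON.stringify({ type: 'resume_ack', fallback: true }))
        return
      }
      socket.send(JSON.stringify({ type: 'resume_ack', replayFrom: lastSeq + 1, fallback: false }))
      replay.forEach((s) => socket.send(JSON.stringify(s)))
    }
  })
})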

Quick checklist

  • Implement exponential backoff with full jitter for reconnection.
  • Use heartbeat/ping-pong to detect dead sockets faster than TCP timeouts.
  • Support session resumption with last-seq or resume tokens and document replay window.
  • Buffer outbound messages client-side and use idempotency keys for server operations.
  • Configure proxy/load balancer timeouts > heartbeat interval and enable TCP keepalive.
  • Instrument disconnect/reconnect rates, missed heartbeats, and replay/fallbacks.
  • Test under adverse network conditions and run chaos experiments on servers.

If you want, we can review your WebSocket architecture, suggest resume retention windows and help implement client libraries for reconnect + resume flows. Reach out via support or consult the deployment & monitoring guides.
