Long-running workflow that starts an async activity which establishes and keeps a WebSocket connection alive indefinitely

Hi! We’re building a WhatsApp Web simulation.

Each WhatsApp session is represented by a long-running workflow that starts an async activity which establishes and keeps a WebSocket connection to WhatsApp alive indefinitely.
We throw new CompleteAsyncError() so that the activity never completes, and rely on periodic heartbeats to keep it alive. The workflow handles reconnections, QR updates, newly received messages, and state changes via signals, and we use continueAsNew to avoid history limits.
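
To make the workflow side concrete, here is a minimal sketch of the per-session bookkeeping described above. It is plain TypeScript with no Temporal imports: the event shapes, the state fields, and the MAX_EVENTS_PER_RUN threshold are all assumptions for illustration, not our real code. In an actual workflow the reducer would be called from the signal handlers, and the carried-over state would be passed as the argument to continueAsNew.

```typescript
// Hypothetical signal payloads; these names are illustrative only.
type SessionEvent =
  | { type: "qrUpdated"; qr: string }
  | { type: "connected" }
  | { type: "disconnected"; reason: string }
  | { type: "messageReceived"; messageId: string };

interface SessionState {
  status: "pendingQr" | "connected" | "disconnected";
  lastQr: string | null;
  disconnects: number;   // how many times the socket dropped
  handledEvents: number; // events handled in the current workflow run
}

// Assumed threshold; tune it so a run stays well under the history limits.
const MAX_EVENTS_PER_RUN = 1000;

function initialState(): SessionState {
  return { status: "pendingQr", lastQr: null, disconnects: 0, handledEvents: 0 };
}

// Pure reducer: one signal in, one new state out.
function applyEvent(state: SessionState, event: SessionEvent): SessionState {
  const next: SessionState = { ...state, handledEvents: state.handledEvents + 1 };
  switch (event.type) {
    case "qrUpdated":
      return { ...next, status: "pendingQr", lastQr: event.qr };
    case "connected":
      return { ...next, status: "connected", lastQr: null };
    case "disconnected":
      return { ...next, status: "disconnected", disconnects: next.disconnects + 1 };
    case "messageReceived":
      return next;
  }
}

// True once the current run has handled enough events to roll over.
function shouldContinueAsNew(state: SessionState): boolean {
  return state.handledEvents >= MAX_EVENTS_PER_RUN;
}

// State handed to the next run: everything except the per-run counter.
function carryOver(state: SessionState): SessionState {
  return { ...state, handledEvents: 0 };
}
```

Keeping the reducer pure makes it deterministic, which is what workflow code needs, and easy to unit test outside Temporal.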

However, we’re running into issues:

  • Activities sometimes get marked as “completed” unexpectedly, killing the socket.
  • Activity heartbeat timeout.
  • The design feels fragile and expensive to scale (every session means an open workflow, an async activity, and more actions).

Is this architecture recommended? What are the risks and best practices for managing long-lived WebSocket connections like this with Temporal?

  • Activities sometimes get marked as “completed” unexpectedly, killing the socket.
  • Activity heartbeat timeout.

My hunch is that something is blocking your activity heartbeat. This could be CPU starvation or a heartbeat timeout that is too aggressive.
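
One way to test that hunch is to run the heartbeat on its own timer and log how late each tick fires. The sketch below is plain TypeScript with no Temporal imports: HeartbeatFn is a stand-in for whatever heartbeat function the activity has available (for example the heartbeat export of @temporalio/activity), and the "three beats per timeout window" rule is an assumed rule of thumb, not an official recommendation.

```typescript
// Stand-in for the activity's heartbeat function (assumption).
type HeartbeatFn = (details?: unknown) => void;

// How late a tick fired relative to its schedule, in milliseconds.
// Large values suggest the event loop was blocked (e.g. CPU starvation).
function tickLagMs(previousTickMs: number, nowMs: number, intervalMs: number): number {
  return Math.max(0, nowMs - previousTickMs - intervalMs);
}

// Assumed rule of thumb: send several heartbeats per timeout window.
function heartbeatIntervalFor(timeoutMs: number, beatsPerWindow: number = 3): number {
  return Math.floor(timeoutMs / beatsPerWindow);
}

// Sends a heartbeat every intervalMs and reports ticks that fire more than
// one full interval late. Returns a function that stops the loop.
function startHeartbeatLoop(
  heartbeat: HeartbeatFn,
  intervalMs: number,
  onLag: (lagMs: number) => void
): () => void {
  let previous = Date.now();
  const timer = setInterval(() => {
    const now = Date.now();
    const lag = tickLagMs(previous, now, intervalMs);
    previous = now;
    if (lag > intervalMs) {
      onLag(lag); // the loop was starved for at least one extra interval
    }
    heartbeat({ sentAt: now });
  }, intervalMs);
  return () => clearInterval(timer);
}
```

If onLag fires shortly before the heartbeat timeouts show up, the worker is probably starved. If it never fires, the configured heartbeat timeout is more likely too short for the interval being used.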

Is there any way to get notifications from WhatsApp without keeping a connection per session? The fragility comes from stateful connections, not from Temporal.

You are right, the fragility comes mainly from the fact that WhatsApp Web is fundamentally stateful and connection-oriented.

Unfortunately, there’s no official push mechanism from WhatsApp that allows us to receive events without maintaining a live connection per session. Each session (phone number) requires a dedicated WebSocket connection to WhatsApp to:

  • Receive new messages in real time
  • Maintain sync state
  • Handle QR code login flow
  • Persist session authentication keys (when reconnecting)

We’re leaning toward moving the socket outside Temporal (to a long-lived worker), and using Temporal only to orchestrate the lifecycle of each session. That seems more robust and scalable.

But if you or the team have seen other architectures or ideas to avoid keeping live WebSockets per session, we’d love to hear about them!

I think moving the WebSocket out of an activity is going to make managing the connections more complicated. For example, how do you know when to reconnect if the process that hosts the WebSocket is restarted?

Our current idea is:

  • Each socket process registers itself with a central Session Registry (probably Redis with TTL).
  • It periodically sends a heartbeat to indicate the session is alive.
  • A watchdog process (could be a Temporal scheduled workflow or a cron service) monitors those sessions:
    • If a session stops heartbeating for N seconds → trigger reconnectSessionWorkflow.
    • If the process crashes or is restarted → on boot, the socket worker can rehydrate state from persisted credentials and re-register itself.
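
The registry and watchdog idea above can be sketched roughly as follows. This is an in-memory stand-in, not a Redis client: SessionRegistry, runWatchdog, and the triggerReconnect callback are hypothetical names, and the callback is where a real system would start reconnectSessionWorkflow. Time is passed in explicitly so that the logic can be tested without waiting.

```typescript
interface SessionEntry {
  sessionId: string;
  workerId: string;
  lastHeartbeatMs: number;
}

// In-memory stand-in for the central Session Registry (assumption: the real
// one would live in Redis). Entries are kept rather than expired so that the
// watchdog can still see which sessions went silent.
class SessionRegistry {
  private entries = new Map<string, SessionEntry>();

  // Called by a socket process on registration and on every heartbeat.
  heartbeat(sessionId: string, workerId: string, nowMs: number): void {
    this.entries.set(sessionId, { sessionId, workerId, lastHeartbeatMs: nowMs });
  }

  // Sessions whose last heartbeat is older than maxSilenceMs.
  staleSessions(nowMs: number, maxSilenceMs: number): string[] {
    const stale: string[] = [];
    this.entries.forEach((entry) => {
      if (nowMs - entry.lastHeartbeatMs > maxSilenceMs) {
        stale.push(entry.sessionId);
      }
    });
    return stale;
  }
}

// One watchdog pass: find silent sessions and request a reconnect for each.
function runWatchdog(
  registry: SessionRegistry,
  nowMs: number,
  maxSilenceMs: number,
  triggerReconnect: (sessionId: string) => void
): string[] {
  const stale = registry.staleSessions(nowMs, maxSilenceMs);
  stale.forEach((sessionId) => triggerReconnect(sessionId));
  return stale;
}
```

One caveat with a pure Redis TTL approach is that an expired key simply disappears, so the watchdog would also need a separate list of the sessions it expects to be alive.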

IMHO, this is much more complicated and error-prone than a heartbeating activity that maintains the session.