Prevent traffic leak to a passive Temporal server connected to the same database

Use case

  • We are following an active/passive model.

  • The load balancer is configured to route traffic only to the active server. In case of a failure, it will be flipped to the passive server.

  • However, both servers will be running and connected to the same database.

Expected behaviour

  • Given that traffic won’t be routed to the passive node, the passive node should not participate in any transactions/workflows.

Observed behaviour

  • The passive server seems to be involved in processing some workflows occasionally.

Questions

  • Could this be because both the active and the passive server are connected to the same database?

A Temporal cluster doesn’t support the “active/passive” model you described. It is expected that all frontends receive traffic and all service hosts are always connected to the database.

Thanks maxim.

If that’s the case, then we will run only the active server and bring up the passive server when the active one goes down.

With that said, I’m interested in knowing why this model is not supported and what the possible implications are if we follow it.

Would you explain why you want to run a single server instead of two for availability?


We ran into issues with the active/active model.

  • The workers would move between the servers.
  • Sometimes all the workers would move to one server, leaving the other server with no workers.
  • Whenever a request went to the server without workers, it would get stuck waiting for a worker. In the case of queries, it would time out.

I believe this issue stems from the fact that load balancers aren’t smart enough to distribute the workers equally between the servers. This is because they see everything (requests, queries, workers) as just traffic.

Any suggestions?
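To put a rough number on that hypothesis, here is a minimal sketch. It assumes a simplified model that is not taken from Temporal's documentation: each worker's long-lived connection is pinned independently and uniformly at random to one server, and a server can only hand tasks to the workers connected to it. The function name `prob_some_server_idle` is made up for illustration.

```python
from fractions import Fraction
from math import comb


def prob_some_server_idle(servers: int, workers: int) -> Fraction:
    """Probability that at least one server ends up with zero workers.

    Assumed model (hypothetical, not Temporal's documented behaviour): a
    connection-level load balancer pins each worker connection independently
    and uniformly at random to one of `servers` backends.
    """
    if servers < 1 or workers < 0:
        raise ValueError("need servers >= 1 and workers >= 0")
    total = Fraction(0)
    # Inclusion-exclusion over the number j of servers forced to be empty.
    for j in range(1, servers + 1):
        sign = 1 if j % 2 == 1 else -1
        total += sign * comb(servers, j) * Fraction(servers - j, servers) ** workers
    return total


if __name__ == "__main__":
    # Print the chance of a worker-less server for a few small fleets.
    for workers in (2, 4, 8):
        p = prob_some_server_idle(2, workers)
        print(f"2 servers, {workers} workers: {p} (~{float(p):.3f})")
```

Under these assumptions, two servers and three workers give a 1 in 4 chance that one server has no workers at all, so the behaviour described above would not be surprising with a small worker fleet. Whether the model matches a given deployment depends on how the load balancer and the servers actually route polls.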

I don’t have specific suggestions besides fixing your load balancer setup. The problem is clearly not with Temporal itself, so fixing it by doing unnatural deployment hacks is not the way I would recommend.

In this example,

Each Pod is a Temporal server. The traffic to these Pods is load balanced by a standard Kubernetes Service.

Can it happen that all the workers long-poll one of the Pods (i.e. one Temporal server) at the same instant, and a query then goes to Pod3 and fails because no worker is available?

Also, does each Temporal server instance (Pod) have its own queue, or do all of them point to the same queue?
Meaning: if a query goes to Pod3 and it creates the task in the queue, can Pod1 fulfill the task?