The Temporal cluster doesn’t support the “active/passive” model you described: all frontends are expected to receive traffic, and all service hosts are always connected to the database.
Sometimes all the workers would move to one server, leaving the other server with no workers. Whenever a request went to the server without workers, it would get stuck waiting for a worker; in the case of queries, it would time out.
I believe this issue stems from the fact that load balancers aren’t smart enough to distribute the workers equally between the servers, because they see everything (requests, queries, workers) as just traffic.
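To see why “just traffic” balancing can starve a server, here is a minimal simulation. It assumes each worker’s long-poll connection lands on a uniformly random server, which is a deliberate simplification of real load-balancer behavior (round-robin or least-connections would behave somewhat differently), and the poller/server counts are illustrative:

```python
import random

def simulate(pollers: int, servers: int, trials: int = 100_000, seed: int = 42) -> float:
    """Estimate the probability that at least one server ends up with
    zero long-poll connections when each poller is assigned uniformly
    at random (a rough model of connection-level load balancing)."""
    rng = random.Random(seed)
    starved = 0
    for _ in range(trials):
        counts = [0] * servers
        for _ in range(pollers):
            counts[rng.randrange(servers)] += 1
        if 0 in counts:
            starved += 1
    return starved / trials

# With 2 servers and 4 pollers, analytically some server is left with no
# pollers in about 2 * (1/2)**4 = 12.5% of assignments.
print(round(simulate(pollers=4, servers=2), 3))
```

With a small number of pollers, a worker-less server is not a rare fluke; any request routed there then waits until a poller shows up or the query times out.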
I don’t have specific suggestions beyond fixing your load balancer setup. The problem is clearly not with Temporal itself, so working around it with unnatural deployment hacks is not something I would recommend.
Each pod is a Temporal server. Traffic to these pods is load balanced by a standard Kubernetes Service.
Could it happen that all the workers long-poll one of the pods (i.e., one Temporal server) at the same instant, while a query goes to Pod3 and fails with no worker available?
Also, does each Temporal server instance (pod) have its own queue, or do all of them point to the same queue?
Meaning, if a query reaches Pod3 and it creates a task in the queue, can Pod1 fulfill that task?
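For context on that last question: in Temporal, task queues live in the shared persistence layer, not inside an individual frontend pod, so a task enqueued through one pod can be matched to a worker polling through another. A toy model of such a shared queue (the pod names and functions here are hypothetical illustrations, not Temporal SDK calls):

```python
from collections import deque

# One logical queue shared by all "pods" -- standing in for Temporal's
# shared persistence layer, where task queues actually live.
task_queue = deque()

def enqueue_via(pod: str, task: str) -> None:
    """Any frontend pod writes into the same logical queue."""
    task_queue.append((pod, task))

def poll_via(pod: str):
    """A worker polling through any pod drains that same queue."""
    return task_queue.popleft() if task_queue else None

enqueue_via("pod3", "query-task")   # the query lands on Pod3
origin, task = poll_via("pod1")     # a worker connected via Pod1 still gets it
print(origin, task)                 # → pod3 query-task
```

The point of the sketch is only that the queue’s identity is its name, not the pod that received the request; which pod accepted the enqueue is irrelevant to which worker picks the task up.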