Incorrect task slots getting created in K8S multi pod deployment

Hi,

We have a K8S deployment where the Temporal server (Frontend: 3 pods, 4 cores/4 GB each; History: 5 pods, 8 cores/8 GB each; Matching: 3 pods, 4 cores/4 GB each; 4K shards; matching queue partitions set to 4) and our microservices (hosting the workers) run in the same cluster. We have 12 pods of our service and observed that temporal_worker_task_slots_available is emitted from only a few of those 12 pods. This behavior is intermittent, and even at times when the metric is not emitted from any of the pods, workflows still run successfully. We initialize the workers and all the startup code via a library with code similar to GitHub - applicaai/spring-boot-starter-temporal: The driver making it convenient to use Temporal with Spring Boot. Below are my queries:

  1. Has anyone faced a similar issue where the slots metric is emitted only intermittently, or any similar issue when using the above-mentioned starter project?
  2. Even when temporal_worker_task_slots_available was not being emitted from a single pod, how is it possible that workflows ran successfully without any errors?

We already asked this as part of another post (Seeing high latencies between two subsequent activity task executions - #15 by Yimin_Chen), but did not receive an answer to this specific question, hence asking it in a separate thread.

Please guide.

observed that temporal_worker_task_slots_available get emitted only from few pods

One maybe obvious thing to check is whether all your pods are running Java SDK 1.8.0 or greater.
Another is, when checking in Grafana, to see whether you are getting no results for any of these labels:
`worker_type="ActivityWorker"`

`worker_type="LocalActivityWorker"`

`worker_type="WorkflowWorker"`
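
Also worth double-checking that a metrics scope is actually attached to the service stubs in each pod — the SDK only reports worker metrics such as temporal_worker_task_slots_available when a scope is set. A minimal sketch, assuming Micrometer with a Prometheus registry (the target address is illustrative and factory method names vary slightly across SDK versions):

```java
import com.uber.m3.tally.RootScopeBuilder;
import com.uber.m3.tally.Scope;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;
import io.temporal.common.reporter.MicrometerClientStatsReporter;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;

public class MetricsWiring {
  public static WorkflowServiceStubs createStubs() {
    // Micrometer Prometheus registry (expose it via your app's /metrics or actuator endpoint)
    PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

    // Tally scope backed by the registry; the SDK reports worker metrics
    // (including temporal_worker_task_slots_available) through this scope
    Scope scope =
        new RootScopeBuilder()
            .reporter(new MicrometerClientStatsReporter(registry))
            .reportEvery(com.uber.m3.util.Duration.ofSeconds(10));

    return WorkflowServiceStubs.newInstance(
        WorkflowServiceStubsOptions.newBuilder()
            .setTarget("temporal-frontend:7233") // illustrative K8S service address
            .setMetricsScope(scope)
            .build());
  }
}
```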

still workflows run successfully.

For the workflows that are completing, can you check the “identity” value of their first WorkflowTaskStarted event in history and see if it is the identity of any of the workers in the pods that are not emitting this metric?
You can set a custom identity via WorkflowClientOptions if that makes it easier.
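
For example, a minimal sketch of setting a per-pod identity (using the K8S pod name from the HOSTNAME env var is just an illustrative choice):

```java
import io.temporal.client.WorkflowClient;
import io.temporal.client.WorkflowClientOptions;
import io.temporal.serviceclient.WorkflowServiceStubs;

public class ClientFactory {
  public static WorkflowClient createClient(WorkflowServiceStubs stubs) {
    // The identity is recorded on WorkflowTaskStarted events, which makes it
    // easy to map completed workflow tasks back to a specific pod
    WorkflowClientOptions options =
        WorkflowClientOptions.newBuilder()
            .setIdentity(System.getenv("HOSTNAME")) // K8S pod name (illustrative)
            .setNamespace("default")                // assumed namespace
            .build();
    return WorkflowClient.newInstance(stubs, options);
  }
}
```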

@tihomir ok, will check the identity part and confirm back on Monday.
Another thing to mention: when worker initialization is done via spring-boot-starter-temporal/src/main/java/ai/applica/spring/boot at master · applicaai/spring-boot-starter-temporal · GitHub, we see this kind of inconsistent worker creation when we scale up the worker pods in the K8S setup. If we do not use the above code and instead move the worker initialization into our Spring Boot application without the above annotations, then it works perfectly fine.

Below are the files doing the annotation processing and creating the worker threads:

Do you see any problem in the above worker-creation code? It would be really helpful if you could guide on the same.

It would be great if someone could shed light on this area.

then we see this kind of inconsistent worker creation when we scale up the worker pods in K8S setup

We cannot provide support for the mentioned Spring Boot starter, as we do not control the code and cannot fix possible issues/bugs. It is probably best to ask its maintainers, as this does not seem to be a Temporal-related issue.

We are working on Spring Boot integration within the Temporal SDK, see here. It will be evolved further in the future.

If we do not use the above code and instead move the worker initialization into our Spring Boot application without the above annotations, then it works perfectly fine.

OK, it sounds like this is the best way to go then, to bypass possible issues/bugs with the mentioned library/starter.
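
For reference, a minimal sketch of initializing workers directly with the SDK (e.g. from a CommandLineRunner or @PostConstruct in your Spring Boot app) instead of through the starter's annotations — MyWorkflowImpl, MyActivitiesImpl and the task queue name are placeholders:

```java
import io.temporal.client.WorkflowClient;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;

public class WorkerStartup {
  public static void startWorkers(WorkflowClient client) {
    WorkerFactory factory = WorkerFactory.newInstance(client);

    // One worker per task queue; register workflow and activity implementations
    // (MyWorkflowImpl / MyActivitiesImpl are placeholders for your own classes)
    Worker worker = factory.newWorker("MY_TASK_QUEUE");
    worker.registerWorkflowImplementationTypes(MyWorkflowImpl.class);
    worker.registerActivitiesImplementations(new MyActivitiesImpl());

    // Start polling; each started worker then reports its task slot metrics
    factory.start();
  }
}
```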

Thanks @tihomir for the clarification; will keep an eye on the Spring Boot support for Temporal.
Also, when I checked the identity part, I can see that the number of pollers in the Web UI equals the number of worker pods, but somehow the slots metric is not being emitted by all of those pods in Prometheus.