We have a Kubernetes deployment where the Temporal server (Frontend: 3 pods, 4 cores / 4 GB each; History: 5 pods, 8 cores / 8 GB each; Matching: 3 pods, 4 cores / 4 GB each; 4K shards; matching queue partitions set to 4) and our microservices (hosting the workers) run in the same cluster. We have 12 pods of our service and observed that temporal_worker_task_slots_available gets emitted from only a few of those 12. The behavior is intermittent, and at times when the metric is not being emitted from any of the pods, workflows still run successfully. We use code similar to GitHub - applicaai/spring-boot-starter-temporal (the driver making it convenient to use Temporal with Spring Boot) to initialize the workers, with all the startup code in a library. Below are my queries:
Has anyone faced a similar issue where slots are emitted intermittently, or any similar issue when using the above-mentioned starter project?
Even when temporal_worker_task_slots_available was not being emitted from any pod at all, how is it possible that workflows were running successfully without any errors?
observed that temporal_worker_task_slots_available get emitted only from few pods
One maybe obvious thing to check is whether all your pods are running Java SDK 1.8.0 or greater.
Another is, if you are checking with Grafana, to make sure your query isn't filtering on a label value that excludes some of the results:
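Since this metric is produced by the SDK's metrics scope, it can also be worth verifying that every pod actually wires a stats reporter into its service stubs; a pod with no metrics scope configured will run workflows fine but emit nothing. A minimal sketch, assuming Micrometer with a Prometheus registry (the pattern from the Java SDK metrics samples):

```java
import com.uber.m3.tally.RootScopeBuilder;
import com.uber.m3.tally.Scope;
import com.uber.m3.util.Duration;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;
import io.temporal.common.reporter.MicrometerClientStatsReporter;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;

public class MetricsWiring {
  public static WorkflowServiceStubs createStubs() {
    // Registry that Prometheus scrapes; expose registry.scrape() on an HTTP endpoint.
    PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

    // Tally scope that bridges SDK metrics (including
    // temporal_worker_task_slots_available) into Micrometer, reported every 10s.
    Scope scope = new RootScopeBuilder()
        .reporter(new MicrometerClientStatsReporter(registry))
        .reportEvery(Duration.ofSeconds(10));

    // Without setMetricsScope here, worker metrics are never emitted from this pod.
    return WorkflowServiceStubs.newServiceStubs(
        WorkflowServiceStubsOptions.newBuilder()
            .setMetricsScope(scope)
            .build());
  }
}
```

If the starter library constructs the stubs for you, check whether it exposes a hook to pass this scope in.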
still workflows run successfully.
For the workflows that are completing, can you check the "identity" value of their first WorkflowTaskStarted event in history and see if it's the identity of any of the workers in pods that are not emitting this metric?
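That check can also be scripted against the raw gRPC history API. A hedged sketch (namespace and workflow id are placeholders you'd supply):

```java
import io.temporal.api.common.v1.WorkflowExecution;
import io.temporal.api.enums.v1.EventType;
import io.temporal.api.history.v1.HistoryEvent;
import io.temporal.api.workflowservice.v1.GetWorkflowExecutionHistoryRequest;
import io.temporal.api.workflowservice.v1.GetWorkflowExecutionHistoryResponse;
import io.temporal.serviceclient.WorkflowServiceStubs;

public class IdentityCheck {
  // Prints the worker identity recorded on the first WorkflowTaskStarted event.
  public static void printFirstTaskIdentity(WorkflowServiceStubs service,
                                            String namespace, String workflowId) {
    GetWorkflowExecutionHistoryRequest request =
        GetWorkflowExecutionHistoryRequest.newBuilder()
            .setNamespace(namespace)
            .setExecution(WorkflowExecution.newBuilder()
                .setWorkflowId(workflowId)
                .build())
            .build();
    GetWorkflowExecutionHistoryResponse response =
        service.blockingStub().getWorkflowExecutionHistory(request);
    for (HistoryEvent event : response.getHistory().getEventsList()) {
      if (event.getEventType() == EventType.EVENT_TYPE_WORKFLOW_TASK_STARTED) {
        // This identity should match one of your worker pods.
        System.out.println(event.getWorkflowTaskStartedEventAttributes().getIdentity());
        return;
      }
    }
  }
}
```

The same information is visible in the web UI's event history view if you'd rather not write code for it.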
You can set a custom identity via WorkflowClientOptions if that makes it easier.
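For example, a short sketch of setting the identity when building the client; using the pod name as the identity is just an assumption here (in Kubernetes it could come from the HOSTNAME env var):

```java
import io.temporal.client.WorkflowClient;
import io.temporal.client.WorkflowClientOptions;
import io.temporal.serviceclient.WorkflowServiceStubs;

public class CustomIdentity {
  public static WorkflowClient createClient(WorkflowServiceStubs service, String podName) {
    // Tag every poller/worker started from this client with a recognizable
    // identity, e.g. the Kubernetes pod name, so it is easy to correlate the
    // WorkflowTaskStarted identity in history with a specific pod.
    WorkflowClientOptions options = WorkflowClientOptions.newBuilder()
        .setIdentity(podName)
        .build();
    return WorkflowClient.newInstance(service, options);
  }
}
```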
Thanks @tihomir for the clarification, and will keep an eye on the Spring Boot support for Temporal.
Also, when I checked on the identity part, I can see the number of pollers in the web UI equals the number of worker pods, but somehow the slots metric is not being emitted by all of those pods in Prometheus.