I’m facing an issue with workflow task distribution in my Kubernetes cluster, and I hope someone can help me understand what’s happening.
Context:
I have a large K8s cluster where many Pods are used for ML processing.
I dispatch thousands of workflows at one time to be processed by these Pods.
Problem:
While everything works as expected initially, the issue arises when a Pod restarts or when I introduce new Pods to the cluster. These Pods do not pick up any workflows from the task queue, essentially remaining idle. This behavior occurs for both new and restarted Pods.
Question:
Is this by design, where old workflows do not get reassigned to new or restarted worker Pods? If so, is there a way to circumvent this, so that the tasks in the queue are not “orphaned”?
Any help or insight on this matter would be greatly appreciated.
Thank you for your interest in my issue. To give you more context:
I have a client application that creates a batch of 3000 workflows. Within each of these workflows, there are three distinct activities. These activities have their own task queues as they correspond to different services. Additionally, there is one main task queue for the client’s workflow. The configuration looks like this:
Main workflow task-queue: dsp-poc
Task queue for Service-1: service-1
Task queue for Service-2: service-2
Task queue for Service-3: service-3
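To illustrate the wiring, here is a simplified sketch using the Temporal Go SDK. This is not my actual code; the workflow/activity names, inputs, and the timeout value are placeholders:

```go
package dsp

import (
	"context"
	"fmt"
	"time"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/workflow"
)

// ProcessingWorkflow runs on the main "dsp-poc" task queue and routes each
// activity to the task queue of the service that implements it.
func ProcessingWorkflow(ctx workflow.Context, input string) error {
	for _, taskQueue := range []string{"service-1", "service-2", "service-3"} {
		opts := workflow.ActivityOptions{
			TaskQueue:           taskQueue,
			StartToCloseTimeout: 2 * time.Hour, // placeholder value
		}
		actCtx := workflow.WithActivityOptions(ctx, opts)
		if err := workflow.ExecuteActivity(actCtx, "ProcessActivity", input).Get(actCtx, nil); err != nil {
			return err
		}
	}
	return nil
}

// dispatchBatch starts the batch of workflows on the main task queue.
func dispatchBatch(c client.Client) error {
	for i := 0; i < 3000; i++ {
		_, err := c.ExecuteWorkflow(context.Background(), client.StartWorkflowOptions{
			ID:        fmt.Sprintf("dsp-poc-%d", i),
			TaskQueue: "dsp-poc",
		}, ProcessingWorkflow, fmt.Sprintf("input-%d", i))
		if err != nil {
			return err
		}
	}
	return nil
}
```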
Initially, when I dispatch the 3000 workflows, everything works fine. I can see the workflows and their activities being processed on the Temporal dashboard. The existing Pods (about 10 for each different service) pick up tasks from their respective queues and process them.
However, the issue occurs when I scale the number of Pods or when a Pod restarts. These new or restarted Pods do not pick up any tasks from the task queues, essentially remaining idle. This behavior is consistent across all services and their corresponding task queues.
I hope this additional context helps clarify my situation.
However, I’ve created a new cluster with new Pods and workflows, so this may be inaccurate, as the old cluster has been deleted (it is very, very expensive to keep running).
What is the state of the workflows that don’t make progress? Do you see any activities in the pending activities view in the UI? What is the StartToClose timeout of these activities?
Thanks for your continued interest in helping me resolve this issue. @maxim @antonio.perez
To further investigate, I conducted a smaller-scale test by creating 12 workflows while Pods were already running. Those Pods processed the tasks; however, after restarting them, the new Pods did not pick up any tasks from the task queues, leaving them idle.
I also ran some diagnostic commands to dig deeper into the issue.
Temporal doesn’t detect worker process failure directly. It relies on activity timeouts to trigger retries. So, since you specified a StartToClose timeout of 2 days, a worker restart will be detected after 2 days, and the activities will be retried then.
Why do you need such a long timeout for a single activity execution attempt? If you need a long timeout, then you have to specify a heartbeat timeout and make sure that the activity heartbeats.
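For example, something along these lines in the Go SDK (the queue name and durations are just placeholders; adjust them to your case):

```go
package dsp

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// withLongActivityOptions pairs a long StartToCloseTimeout (the upper bound of
// a single attempt) with a short HeartbeatTimeout, so a dead worker is
// detected within about a minute instead of after two days.
func withLongActivityOptions(ctx workflow.Context) workflow.Context {
	return workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		TaskQueue:           "service-1",    // placeholder queue name
		StartToCloseTimeout: 48 * time.Hour, // max duration of one attempt
		HeartbeatTimeout:    time.Minute,    // worker loss detected within ~1 minute
	})
}
```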
One more update: when I dispatch a new workflow (after restarting the Pods), the newly dispatched workflows seem to be handled by the new Pods without any problems, but the old workflows are stuck at Running.
My main theory is that it may be related to the retry policy.
Thank you for your insights. It’s reassuring to hear that my findings align with my main theory after two days of debugging. The speed of ML processing has improved, so the workflows are now faster.
I do have a question that could further clarify how to handle pod restarts:
I have a workflow composed of three activities: DSP-1, DSP-3, and DSP-4. A single workflow currently takes almost 2 hours to complete. My idea is to set the startToCloseTimeout to 130 minutes to account for the time it takes to complete the workflow as well as any unexpected delays.
Would setting the startToCloseTimeout to 130 minutes ensure that, in case a pod restarts after 2 hours, the workflow would then be picked up by a new pod in its second attempt?
Your clarification on this would be greatly appreciated.
The StartToClose timeout applies to a single activity. It is not related to workflow duration, as a workflow can run much longer or shorter depending on the requirements.
What are these activities? Are they expected to be long running? If so, make sure that you specify a relatively short heartbeat timeout and that these activities heartbeat. Then a pod restart will be detected after the heartbeat timeout and the activity will be retried.
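A rough sketch of what the heartbeating side could look like in the Go SDK (the chunking helpers are hypothetical stubs standing in for your ML processing):

```go
package dsp

import (
	"context"

	"go.temporal.io/sdk/activity"
)

// ProcessActivity processes its input in chunks and records a heartbeat after
// each chunk. If the worker pod dies, the server stops receiving heartbeats,
// times the activity attempt out after HeartbeatTimeout, and retries it on
// another worker.
func ProcessActivity(ctx context.Context, input string) error {
	for chunk := 0; chunk < totalChunks(input); chunk++ {
		if err := processChunk(ctx, input, chunk); err != nil {
			return err
		}
		// Report liveness and progress; the chunk index can be read back on
		// the next attempt via activity.GetHeartbeatDetails to resume from it.
		activity.RecordHeartbeat(ctx, chunk)

		// Stop promptly if the server cancelled or timed out this attempt.
		if ctx.Err() != nil {
			return ctx.Err()
		}
	}
	return nil
}

// Hypothetical stubs representing the real ML processing.
func totalChunks(input string) int                                 { return 10 }
func processChunk(ctx context.Context, input string, i int) error { return nil }
```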