Restarted and New Pods Not Picking Up Old Workflows from Task Queue in K8s Cluster

Hello Temporal Community,

I’m facing an issue with workflow task distribution in my Kubernetes cluster, and I hope someone can help me understand what’s happening.

Context:

  • I have a large K8s cluster where many Pods are used for ML processing.
  • I dispatch thousands of workflows at one time to be processed by these Pods.

Problem:

While everything works as expected initially, the issue arises when a Pod restarts or when I introduce new Pods to the cluster. These Pods do not pick up any workflows from the task queue, essentially remaining idle. This behavior occurs for both new and restarted Pods.

Question:

Is this by design, where old workflows do not get reassigned to new or restarted worker Pods? If so, is there a way to circumvent this, so that the tasks in the queue are not “orphaned”?

Any help or insight on this matter would be greatly appreciated.

Best regards,

Hi @2bezzat

Do you have long-running activities? If so, do you heartbeat?

Can you check the current workers listening on each task queue?

temporal task-queue describe --task-queue <tqName> --task-queue-type <workflow or activity>

These Pods do not pick up any workflows from the task queue, essentially remaining idle

what is the state of the workflow executions?

Hello Antonio

It’s not that long actually; a workflow takes on average 40 minutes to finish. I dispatch 3000 workflows, and I have 30 Pods handling them.

No, I don’t heartbeat.

The state of the workflows is “Running”.

Thanks

If I understand you correctly,

you have a workflow that creates 3000 child workflows?

Where are those workflows stuck? Activities? The first workflow task? Could you share the event history of one of them?

Can you check the number of available slots (see Temporal SDK metrics reference | Temporal Documentation) for workflows and activities (worker_type)?
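
If it helps, here is a rough sketch of exposing those SDK metrics from a worker process so the slot gauges can be scraped (Python SDK shown for illustration; the addresses and port are arbitrary examples):

from temporalio.client import Client
from temporalio.runtime import PrometheusConfig, Runtime, TelemetryConfig

async def connect_with_metrics() -> Client:
    # Expose the SDK metrics (including the task-slot availability gauges, labelled by
    # worker_type) on a Prometheus scrape endpoint; the port is an arbitrary example.
    runtime = Runtime(
        telemetry=TelemetryConfig(metrics=PrometheusConfig(bind_address="0.0.0.0:9464"))
    )
    return await Client.connect("temporal-frontend:7233", runtime=runtime)  # address is an assumption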

Can you run the command I pasted above to double-check whether you have workers listening on the task queues your workflows and activities run on?

Thank you

Hello Antonio,

Thank you for your interest in my issue. To give you more context:

I have a client application that creates a batch of 3000 workflows. Within each of these workflows, there are three distinct activities. These activities have their own task queues as they correspond to different services. Additionally, there is one main task queue for the client’s workflow. The configuration looks like this (a rough sketch of the worker wiring follows the list):

  • Main workflow task-queue: dsp-poc
  • Task queue for Service-1: service-1
  • Task queue for Service-2: service-2
  • Task queue for Service-3: service-3
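
For reference, each service’s worker is registered roughly like this (a simplified sketch; Python SDK shown for illustration, and the module, activity, and address names are placeholders):

# worker_service_1.py - one service's worker process
import asyncio
from temporalio.client import Client
from temporalio.worker import Worker

from my_activities import service_1_activity    # hypothetical activity implementation

async def main() -> None:
    client = await Client.connect("temporal-frontend:7233")   # address is an assumption
    # Each service's Pods poll only their own queue; the main workflow worker does the
    # same on "dsp-poc", registering workflows=[...] instead of activities=[...].
    worker = Worker(client, task_queue="service-1", activities=[service_1_activity])
    await worker.run()

if __name__ == "__main__":
    asyncio.run(main())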

Initially, when I dispatch the 3000 workflows, everything works fine. I can see the workflows and their activities being processed on the Temporal dashboard. The existing Pods (about 10 for each different service) pick up tasks from their respective queues and process them.

However, the issue occurs when I scale the number of Pods or when a Pod restarts. These new or restarted Pods do not pick up any tasks from the task queues, essentially remaining idle. This behavior is consistent across all services and their corresponding task queues.

I hope this additional context helps clarify my situation.

For your commands, here’s the output:

but I’ve created a new cluster with new Pods and workflows, so it may be inaccurate, as the old cluster is deleted (very, very expensive to keep it running)

What is the state of the workflows that don’t make the progress? Do you see any activities in the pending activities view in the UI? What is the StartToClose timeout of these activities?

Thanks for your continued interest in helping me resolve this issue. @maxim @antonio.perez

To further investigate, I conducted a smaller-scale test by creating 12 workflows while Pods were already running. The Pods were processing the tasks; however, after restarting the Pods, the new Pods did not pick up any tasks from the task queues, leaving them idle.

I also ran some diagnostic commands to dig deeper into the issue.

For the main workflow task queue (I only have one):

For one of the service task queues (3 old Pods and 4 new Pods):

Also, from the Temporal UI, the workflows seem to be stuck at ActivityTaskScheduled.

The StartToClose timeout is 2 days and maximumAttempts is 3.
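
In code, the activity invocation is configured roughly like this (Python SDK shown for illustration; the workflow and activity names are placeholders):

from datetime import timedelta
from temporalio import workflow
from temporalio.common import RetryPolicy

@workflow.defn
class DspWorkflow:                                    # placeholder workflow name
    @workflow.run
    async def run(self, job_id: str) -> None:
        # Current settings: 2-day StartToClose, up to 3 attempts, no heartbeat timeout.
        await workflow.execute_activity(
            "service-1-activity",                     # placeholder activity name
            job_id,
            task_queue="service-1",
            start_to_close_timeout=timedelta(days=2),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )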

Temporal doesn’t detect worker process failure directly; it relies on activity timeouts for retries. So, since you specified a StartToClose timeout of 2 days, a worker restart will be detected after 2 days and the activities will be retried then.

Why do you need such a long timeout for a single activity execution attempt? If you need a long timeout then you have to specify a heartbeat timeout and make sure that the activity heartbeats.
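
For example, sticking with the hypothetical Python sketch from earlier in the thread, adding a heartbeat timeout to the activity options would look roughly like this:

# Replaces the execute_activity call in the earlier sketch: with a heartbeat timeout the
# server can detect a lost worker within minutes instead of waiting out the 2-day StartToClose.
await workflow.execute_activity(
    "service-1-activity",                             # placeholder activity name
    job_id,
    task_queue="service-1",
    start_to_close_timeout=timedelta(days=2),         # still bounds a single attempt
    heartbeat_timeout=timedelta(minutes=2),           # example value; tune to your workload
    retry_policy=RetryPolicy(maximum_attempts=3),
)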

One more update: when I dispatch new workflows (after restarting the Pods), the newly dispatched workflows seem to be handled by the new Pods without any problems, but the old workflows are stuck at Running.

My main theory is that it’s related to the retry policy.

My main activity looks like this:

Thank you for your insights. It’s reassuring to hear that my findings align with my main theory after two days of debugging. The speed of ML processing has improved, so the workflows are now faster.

I do have a question that could further clarify how to handle pod restarts:

I have a workflow composed of three activities: DSP-1, DSP-3, and DSP-4. A single workflow currently takes almost 2 hours to complete. My idea is to set the startToCloseTimeout to 130 minutes to account for the time it takes to complete the workflow as well as any unexpected delays.

Would setting the startToCloseTimeout to 130 minutes ensure that, in case a pod restarts after 2 hours, the workflow would then be picked up by a new pod in its second attempt?

Your clarification on this would be greatly appreciated.

StartToClose timeout applies to an activity. It is not related to workflow duration, as a workflow can run much longer or shorter depending on the requirements.

What are these activities? Are they expected to be long running? Then make sure that you specify a relatively short heartbeat timeout and that these activities heartbeat. Then the pod restart will be detected by the activity after the heartbeat timeout and the activity will be retried.
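
A minimal sketch of such a heartbeating activity (again Python SDK, placeholder names, and an arbitrary chunking of the work):

from temporalio import activity

def run_processing_chunk(job_id: str, chunk: int) -> None:
    """Placeholder for one slice of the actual ML work."""
    ...

@activity.defn(name="service-1-activity")            # placeholder name
async def process_job(job_id: str) -> str:
    for chunk in range(100):                          # arbitrary number of chunks
        run_processing_chunk(job_id, chunk)
        activity.heartbeat(chunk)                     # progress detail; also proves the worker is alive
    return job_id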

See The 4 Types of Activity Timeouts.
