WorkflowTaskTimedOut

Hi Maxim

Having dug further into this, we’ve discovered another dimension to the problem we’re facing. We have Temporal deployed on Kubernetes via the Helm chart. Now that we understand how to interpret the workflow history, we can see that the WorkflowTaskTimedOut events happen because the tasks have been put on a worker task queue named after a pod that no longer exists:

TaskQueue:{Name:workflows-764ccb7f89-g2j8b:6ee94956-d789-49df-b7fb-d8d26dce22ea

This means that these executions will always incur a workflow task timeout when they are signalled, because the worker that owned that queue is no longer there. We deploy new versions frequently, and each deployment changes the pod names; the pod name appears to be what determines the worker task queue name.
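To illustrate what we think is going on, here is a small sketch that treats the queue name as "<hostname>:<uuid>". That format is only our inference from the queue name quoted above, not something we have found documented, and in Kubernetes a pod's hostname defaults to its pod name. The function names and the pod name in main are hypothetical, purely for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// parseStickyQueue splits a worker task queue name of the ASSUMED form
// "<hostname>:<uuid>" into its two parts. The format is inferred from the
// queue name we observed, not from any documented contract.
func parseStickyQueue(name string) (host, id string, ok bool) {
	return strings.Cut(name, ":")
}

// isOrphaned (hypothetical helper) reports whether the queue's host part
// matches none of the currently running pods. livePods is a set of pod names,
// e.g. gathered by hand from the cluster.
func isOrphaned(queue string, livePods map[string]bool) bool {
	host, _, ok := parseStickyQueue(queue)
	if !ok {
		return false // not in the assumed format, so make no claim
	}
	return !livePods[host]
}

func main() {
	queue := "workflows-764ccb7f89-g2j8b:6ee94956-d789-49df-b7fb-d8d26dce22ea"
	// Hypothetical pod name after a redeploy: the ReplicaSet hash and pod
	// suffix have both changed, so no live worker owns the queue above.
	livePods := map[string]bool{"workflows-5d9f8c7b6d-x7k2p": true}
	fmt.Println(isOrphaned(queue, livePods))
}
```

Running it prints true: after the redeploy no pod carries the hostname embedded in the queue name, which matches the timeouts we see on signal.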

In our case I suspect we would actually benefit from disabling sticky execution: we have lots of executions with fairly short histories, and we need signals to be processed quickly. Once the timeout has fired and another worker picks up the workflow task, everything completes quickly, which suggests that replay cost is not an issue for us.

However, we can’t work out how to disable sticky execution. The worker’s DisableStickyExecution option is deprecated, and we can’t find anything in the docs explaining what to do instead. Is it possible?

Best wishes

Mark