WorkflowTaskTimedOut

Hi. I wonder if you can help us understand why we’re getting WorkflowTaskTimedOut events and how we might avoid them.

Our workflow in essence does this:

  • executes a set of activities
  • blocks until a signal is received, at which point it sets some search attributes

We have several instances of one worker service (using the go-sdk) which executes both the workflow tasks and activities. Some of the activities are very CPU-intensive and run for several seconds. The workers have a MaxConcurrentActivityExecutionSize set to 8.

When we have heavy load, we frequently see a 5 second delay before the workflow task arising from the signal is processed.

This is a problem for us, because the signal corresponds to a user picking up a task in a web UI and this means they sit for several seconds waiting for work to appear.

Here’s an example workflow history (note the 5 second gap between event 35 and 36):

34 2023-01-13T15:25:13Z WorkflowExecutionSignaled {SignalName:start_client_action:validation,
Input:[{“Timeout”:1800000000000,“Language”:“”}],
Identity:23@workflows-764ccb7f89-hz9jt@}
35 2023-01-13T15:25:13Z WorkflowTaskScheduled {TaskQueue:{Name:workflows-7dd6d6bf57-97t79:8ad97eb5-f4cc-4e11-962a-e1f8bdb41d04,
Kind:Sticky}, StartToCloseTimeout:10s, Attempt:1}
36 2023-01-13T15:25:18Z WorkflowTaskTimedOut {ScheduledEventId:35,
StartedEventId:0,
TimeoutType:ScheduleToStart}
37 2023-01-13T15:25:18Z WorkflowTaskScheduled {TaskQueue:{Name:ACTIVITY_TASK_QUEUE,
Kind:Normal},
StartToCloseTimeout:10s, Attempt:1}
38 2023-01-13T15:25:18Z WorkflowTaskStarted {ScheduledEventId:37,
Identity:23@workflows-764ccb7f89-hz9jt@,
RequestId:1e551c48-7ae4-4268-a808-388b0686fca0}
39 2023-01-13T15:25:19Z WorkflowTaskCompleted {ScheduledEventId:37, StartedEventId:38,
Identity:23@workflows-764ccb7f89-hz9jt@,
BinaryChecksum:fea742682a5e6d8375a0ce428f05fd55}
40 2023-01-13T15:25:19Z TimerStarted {TimerId:40,
StartToFireTimeout:30m0s,
WorkflowTaskCompletedEventId:39}
41 2023-01-13T15:25:19Z UpsertWorkflowSearchAttributes {WorkflowTaskCompletedEventId:39,
SearchAttributes:{IndexedFields:map{ClientAction:“validation”,
State:“pending:complete_client_action”}}}

Our original theory was that “stickiness” is enabled and the workflow task is initially trying to run on the worker it first ran on, but that this worker is too busy to pick it up, then runs on another. Is this a correct interpretation of the event history?

If this is correct, how might we avoid this? We’d considered having two services, one that just runs the workflow tasks (and not the CPU-intensive activities) so we can ensure the workflow tasks are executed quickly but we can’t see how to do this. Is it possible/a good idea?

We could disable stickiness, but we think we would still risk the function being picked up by a service that is busy with activities.

In general there is a large backlog of many workflow executions with CPU-intensive activity tasks to do, but we want to maintain high responsiveness to signals relating to our front-end.

Many thanks for any advice!

Mark

I believe your analysis is correct. The worker is busy and doesn’t pick up the sticky task for 5 seconds, causing its retry on a different worker.

I would recommend separating out workflow worker into a separate pool of processes to avoid such starvation. You can also increase the workflow sticky task ScheduleToStart timeout from 5 seconds as a stopgap measure.

Hi Maxim,

Thanks for your reply, that’s really helpful.

We will create some instances of the service that don’t register the activities and therefore only process workflow tasks.

I don’t follow why increasing StickyScheduleToStartTimeout would help - wouldn’t that mean that tasks waited longer before being queued for any worker to pick up?

Should we also be disabling sticky execution? I don’t think we are getting any particular value from it and it seems like we need to ensure that there is no attempt to schedule back to a specific worker.

It’s not clear to us how to do this now DisableStickyExecution is deprecated. Do we set worker.SetStickyWorkflowCacheSize(0) in all worker processes?

Thanks

Mark

I don’t follow why increasing StickyScheduleToStartTimeout would help - wouldn’t that mean that tasks waited longer before being queued for any worker to pick up?

It means the task would be allowed to sit in the sticky queue longer before the worker picks it up. It reduces the number of timed-out tasks if the worker is overloaded in spikes. The drawback is that if the worker dies, moving the task to another worker will take longer.

Should we also be disabling sticky execution? I don’t think we are getting any particular value from it and it seems like we need to ensure that there is no attempt to schedule back to a specific worker.

There are practically no use cases when disabling sticky execution improves performance. When it is disabled any workflow task requires replaying the whole workflow history from the beginning. The computational complexity of the workflow that has N history events is O(N) with sticky execution and is O(N^2) without it.

Hi Maxim

Having dug further into this, we’ve discovered another dimension to the problem we’re facing. We have temporal deployed on kubernetes, via the helm chart. Now we understand how to interpret the workflow history, we can see that WorkflowTaskTimedOut events are happening because the tasks have been put on a worker task queue named after a pod that no longer exists:

TaskQueue:{Name:workflows-764ccb7f89-g2j8b:6ee94956-d789-49df-b7fb-d8d26dce22ea

This means that these executions will always incur a timeout on the workflow task when they are signalled, because the worker owning that queue is not there any more. We’re frequently deploying new versions, which changes pod names and it looks like that’s what determines the worker task queue name.

In our case I suspect we actually would benefit from disabling sticky execution because we have lots of executions with quite short histories, and we need the signal to be processed quickly. Once the timeout has happened and another worker picks up the workflow task everything happens nice and quickly, which implies that the replay is not an issue for us.

But we can’t work out how to disable sticky execution. The worker DisableStickyExecution property is deprecated and we can’t find anything in the docs that explains how we should do this. Is it possible?

Best wishes

Mark

After the sticky worker is gone (for more than 10s), the server will push the task to its original task queue. This is fixed is available in v1.17. Are you using server version older than v1.17?

Hi. Thanks for the follow-up. Yes, we’re on 1.15 but planning an upgrade so this is another reason for us to do that soon! Thanks!

Mark

hello support, I can see the same issue i faced like mentioned here WorkflowTaskTimedOut - #5 by Mark_Wightman
and it took 4 minutes to remove deleted pod from the temporal task-queue and it is affecting the performance of workflows and I have the latest temporal version 1.22.5.

Please advise as it is very important and critical and I started to receive the below error
WARN temporal_sdk_core::worker::workflow: Task not found when completing error=status: NotFound, message: “Workflow task not found.”, details: [8, 5, 18, 24, 87, 111, 114, 107, 102, 108, 111, 119, 32, 116, 97, 115, 107, 32, 110, 111, 116, 32, 102, 111, 117, 110, 100, 46, 26, 66, 10, 64, 116, 121, 112, 101, 46, 103, 111, 111, 103, 108, 101, 97, 112, 105, 115, 46, 99, 111, 109, 47, 116, 101, 109, 112, 111, 114, 97, 108, 46, 97, 112, 105, 46, 101, 114, 114, 111, 114, 100, 101, 116, 97, 105, 108, 115, 46, 118, 49, 46, 78, 111, 116, 70, 111, 117, 110, 100, 70, 97, 105, 108, 117, 114, 101], metadata: MetadataMap { headers: {“content-type”: “application/grpc”} } run_id=“13e08f19-c999-496c-a8e6-9552b5e30e28”

HI @Mark_Wightman how did you fix it?

主要原因是因为task超时导致的,如果使用的是execute_child_workflow方法的话加上task_timeout参数增加多一些的超时时间,如果使用的是execute_activity增加start_to_close_timeout参数的超时时间