Hi. I wonder if you can help us understand why we’re getting WorkflowTaskTimedOut events and how we might avoid them.
Our workflow in essence does this:
- executes a set of activities
- blocks until a signal is received, at which point it sets some search attributes
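In code, the workflow is roughly the following (heavily simplified; the type, activity, and workflow names here are illustrative, not our real ones):

```go
package app

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// ClientActionInput mirrors the signal payload shown in the history below.
type ClientActionInput struct {
	Timeout  time.Duration
	Language string
}

// TaskWorkflow is a simplified sketch of our workflow.
func TaskWorkflow(ctx workflow.Context) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Minute,
	})

	// 1. Run the CPU-intensive activities.
	if err := workflow.ExecuteActivity(ctx, ProcessActivity).Get(ctx, nil); err != nil {
		return err
	}

	// 2. Block until the signal arrives from the web UI.
	var input ClientActionInput
	workflow.GetSignalChannel(ctx, "start_client_action:validation").Receive(ctx, &input)

	// 3. Record the new state in search attributes.
	return workflow.UpsertSearchAttributes(ctx, map[string]interface{}{
		"ClientAction": "validation",
		"State":        "pending:complete_client_action",
	})
}
```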
We have several instances of one worker service (using the go-sdk) which executes both workflow tasks and activities. Some of the activities are very CPU-intensive and run for several seconds. The workers have MaxConcurrentActivityExecutionSize set to 8.
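Our worker setup is roughly this (simplified; TaskWorkflow and ProcessActivity are placeholder names for our real workflow and activity):

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln(err)
	}
	defer c.Close()

	// One worker per service instance, handling both workflow tasks and
	// the CPU-intensive activities on the same task queue.
	w := worker.New(c, "ACTIVITY_TASK_QUEUE", worker.Options{
		MaxConcurrentActivityExecutionSize: 8,
	})
	w.RegisterWorkflow(TaskWorkflow)
	w.RegisterActivity(ProcessActivity)

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln(err)
	}
}
```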
Under heavy load, we frequently see a 5-second delay before the workflow task arising from the signal is processed.
This is a problem for us: the signal corresponds to a user picking up a task in a web UI, so they sit for several seconds waiting for work to appear.
Here’s an example workflow history (note the 5 second gap between event 35 and 36):
34 2023-01-13T15:25:13Z WorkflowExecutionSignaled {SignalName:start_client_action:validation,
Input:[{"Timeout":1800000000000,"Language":""}],
Identity:23@workflows-764ccb7f89-hz9jt@}
35 2023-01-13T15:25:13Z WorkflowTaskScheduled {TaskQueue:{Name:workflows-7dd6d6bf57-97t79:8ad97eb5-f4cc-4e11-962a-e1f8bdb41d04,
Kind:Sticky}, StartToCloseTimeout:10s, Attempt:1}
36 2023-01-13T15:25:18Z WorkflowTaskTimedOut {ScheduledEventId:35,
StartedEventId:0,
TimeoutType:ScheduleToStart}
37 2023-01-13T15:25:18Z WorkflowTaskScheduled {TaskQueue:{Name:ACTIVITY_TASK_QUEUE,
Kind:Normal},
StartToCloseTimeout:10s, Attempt:1}
38 2023-01-13T15:25:18Z WorkflowTaskStarted {ScheduledEventId:37,
Identity:23@workflows-764ccb7f89-hz9jt@,
RequestId:1e551c48-7ae4-4268-a808-388b0686fca0}
39 2023-01-13T15:25:19Z WorkflowTaskCompleted {ScheduledEventId:37, StartedEventId:38,
Identity:23@workflows-764ccb7f89-hz9jt@,
BinaryChecksum:fea742682a5e6d8375a0ce428f05fd55}
40 2023-01-13T15:25:19Z TimerStarted {TimerId:40,
StartToFireTimeout:30m0s,
WorkflowTaskCompletedEventId:39}
41 2023-01-13T15:25:19Z UpsertWorkflowSearchAttributes {WorkflowTaskCompletedEventId:39,
SearchAttributes:{IndexedFields:map{ClientAction:"validation",
State:"pending:complete_client_action"}}}
Our original theory: "stickiness" is enabled, so the workflow task is first scheduled to the sticky task queue of the worker that last ran the workflow (event 35, Kind:Sticky); that worker is too busy to pick it up within the sticky schedule-to-start timeout, so the task times out (event 36, TimeoutType:ScheduleToStart) and is rescheduled on the normal task queue (event 37, Kind:Normal), where another worker picks it up. Is this a correct interpretation of the event history?
If this is correct, how might we avoid it? We'd considered running two services, one of which executes only the workflow tasks (and not the CPU-intensive activities), so that workflow tasks are always picked up quickly, but we can't see how to do this. Is it possible, and is it a good idea?
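Concretely, we imagined something like the sketch below: activities dispatched to a second task queue that only the activity service polls. The task queue names are illustrative, and we haven't verified this is the intended pattern:

```go
package main

import (
	"time"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
	"go.temporal.io/sdk/workflow"
)

// In the workflow: dispatch the heavy activities to a dedicated task queue.
// Workflow tasks stay on the task queue the workflow was started on.
func withActivityQueue(ctx workflow.Context) workflow.Context {
	return workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		TaskQueue:           "cpu-activities", // hypothetical queue name
		StartToCloseTimeout: 10 * time.Minute,
	})
}

// Workflow-only service: registers workflows but no activities, so its
// pollers are never tied up by CPU-bound activity executions.
func runWorkflowWorker(c client.Client) error {
	w := worker.New(c, "workflows", worker.Options{})
	w.RegisterWorkflow(TaskWorkflow)
	return w.Run(worker.InterruptCh())
}

// Activity-only service: polls just the activity queue.
func runActivityWorker(c client.Client) error {
	w := worker.New(c, "cpu-activities", worker.Options{
		MaxConcurrentActivityExecutionSize: 8,
	})
	w.RegisterActivity(ProcessActivity)
	return w.Run(worker.InterruptCh())
}
```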
We could disable stickiness, but we think the workflow task would still risk being picked up by a worker that is busy with activities.
In general there is a large backlog of workflow executions with CPU-intensive activity tasks to run, but we want to maintain high responsiveness to signals relating to our front end.
Many thanks for any advice!
Mark