Documentation of Workflow Task Timeout

On the documentation page Workflows | Temporal Documentation, under the subsection for Workflow Task Timeout, it says the following:
The default value is 10 seconds. This timeout is primarily available to recognize whether a Worker has gone down so that the Workflow Execution can be recovered on a different Worker. The main reason for increasing the default value would be to accommodate a Workflow Execution that has a very long Workflow Execution History that could take longer than 10 seconds for the Worker to load.

Why would a long Workflow Execution History cause the Worker to take longer to load? The Worker is remote to the Temporal Cluster and only ever produces the next set of Commands from its local state, but the history itself is maintained by the cluster. I wanted to understand this statement better. Can you please provide some insight here? Thanks!


Hello @Himaja

As you mentioned above, if a worker crashes, the workflow execution is placed on a different worker. As part of the Workflow Task, that worker pulls the workflow history and replays the workflow execution in order to recreate the workflow state, and then continues from there.

the history itself is maintained by the cluster

That is true, but the history is also cached by the worker executing the workflow.
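For reference, the worker-side cache in question is the sticky workflow cache, and its size is a process-wide setting. Here is a minimal sketch, assuming the Go SDK and a hypothetical task queue name; the cache size value is just an illustration:

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// The sticky cache holds workflow executions (and their replayed state) in the
	// worker process. A larger cache reduces how often a workflow must be fully
	// replayed from history. This is process-wide and must be set before workers start.
	worker.SetStickyWorkflowCacheSize(4096)

	c, err := client.Dial(client.Options{}) // connects to localhost:7233 by default
	if err != nil {
		log.Fatalln("unable to create Temporal client", err)
	}
	defer c.Close()

	w := worker.New(c, "example-task-queue", worker.Options{})
	// w.RegisterWorkflow(...) and w.RegisterActivity(...) would go here.

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited", err)
	}
}
```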


Thank you, that helps.

Would another reason for replay be the workflow getting evicted from the Worker’s cache? @antonio.perez

Yes, a workflow that was pushed out of the cache needs to be recovered (replayed from history) the next time it needs to make progress.

Hi @maxim @antonio.perez, do you have a list of possible scenarios in which a workflow may be evicted from the worker’s cache? For example, could it happen if Workflow A spawns and executes several child workflows, say 20 of them? Then, on resuming execution of Workflow A, the worker may need to fetch and replay the full workflow history again?

To give context: in our use case we have 5-6 worker instances running under heavy load, and even so, in some cases we observe that a WorkflowTask StartToClose timeout is hit when a Workflow Task is started after executing multiple child workflows.

It looks to me like an obvious mitigation would be to increase the WorkflowTaskStartToCloseTimeout to 1 minute (as 1 minute is the maximum limit allowed by the Server, if I recall correctly?). But could there hypothetically be cases where even 1 minute is not enough?
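In case it helps others, this is the knob I mean. A minimal sketch, assuming the Go SDK, a hypothetical workflow ID and task queue, and a workflow type that is registered on the worker elsewhere:

```go
package main

import (
	"context"
	"log"
	"time"

	"go.temporal.io/sdk/client"
)

func main() {
	c, err := client.Dial(client.Options{}) // connects to localhost:7233 by default
	if err != nil {
		log.Fatalln("unable to create Temporal client", err)
	}
	defer c.Close()

	opts := client.StartWorkflowOptions{
		ID:                  "order-12345",        // hypothetical workflow ID
		TaskQueue:           "example-task-queue", // hypothetical task queue
		WorkflowTaskTimeout: time.Minute,          // default is 10 seconds
	}
	// Workflow type referenced by name here; replace with your registered workflow.
	run, err := c.ExecuteWorkflow(context.Background(), opts, "YourWorkflow", "some-input")
	if err != nil {
		log.Fatalln("unable to start workflow", err)
	}
	log.Println("started workflow", run.GetID(), "run", run.GetRunID())
}
```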

When the cluster and workers are provisioned correctly, you should never hit the 10-second timeout, even if the history is replayed. I recommend keeping the workflow history size under control and not passing large blobs as part of the workflow data.
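One common way to follow that advice is to pass a small reference (an object key, a database ID) through the workflow and let activities move the actual bytes to and from external storage, so only the reference ends up in history. A rough sketch, assuming the Go SDK; the BlobRef type and all storage/delivery helpers below are hypothetical placeholders:

```go
package blobref

import "context"

// Sketch: keep large payloads out of workflow history by passing a reference
// (here a bucket/key pair) between activities instead of the bytes themselves.
// Only the small BlobRef struct is recorded in history; the multi-MB payload
// lives in external storage.

type BlobRef struct {
	Bucket string
	Key    string
}

// GenerateReportActivity builds a large result, writes it to external storage,
// and returns only a reference for the workflow to pass along.
func GenerateReportActivity(ctx context.Context, orderID string) (BlobRef, error) {
	data := buildLargeReport(orderID) // could be many MB
	ref := BlobRef{Bucket: "reports", Key: "order/" + orderID}
	if err := putObject(ctx, ref, data); err != nil {
		return BlobRef{}, err
	}
	return ref, nil
}

// SendReportActivity re-reads the payload from storage using the reference.
func SendReportActivity(ctx context.Context, ref BlobRef) error {
	data, err := getObject(ctx, ref)
	if err != nil {
		return err
	}
	return deliverReport(ctx, data)
}

// Placeholder storage and delivery helpers; swap in S3, GCS, a database, etc.
func buildLargeReport(orderID string) []byte                     { return []byte("...") }
func putObject(ctx context.Context, ref BlobRef, b []byte) error { return nil }
func getObject(ctx context.Context, ref BlobRef) ([]byte, error) { return []byte("..."), nil }
func deliverReport(ctx context.Context, b []byte) error          { return nil }
```

Another option with a similar effect is a custom payload codec / DataConverter that offloads large payloads to external storage transparently; the reference-passing pattern above just keeps the behavior explicit in the workflow code.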

Hey @maxim, thanks for your response.

We’ve uncovered the root cause of our WorkflowTaskTimeout error. It appears that the PollWorkflowTaskQueue call is failing because the workflow history size exceeds the gRPC response payload limit of 4 MB, so our worker repeatedly attempts to poll for the task and the workflow history and fails each time. Looks like we’ll need to do some adjusting on our workflow/activity return structs.


Just to provide an update for future thread readers: we found that PollWorkflowTaskQueue was failing to obtain the full workflow history (around 9 MB) because of gRPC’s default response payload limit (4 MB). If you’re using the Temporal client to communicate directly with the Temporal Server, this limit is raised by Temporal to 128 MB, but if you have some sort of proxy in between, you might want to check whether your proxy has its default set to 4 MB as well.
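For anyone who also wants to rule out the client side: the 4 MB figure is gRPC’s default maximum receive message size. A minimal sketch of where that knob lives when dialing the Temporal client in the Go SDK (the host name is a placeholder and the dial options are an illustration only; in our case the SDK default was already fine and the real fix was the proxy’s limit):

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"google.golang.org/grpc"
)

func main() {
	// The Temporal Go SDK already raises the gRPC receive limit for its own
	// connection, so this is only relevant if you are supplying custom dial
	// options. Any proxy in between enforces its own limits separately.
	c, err := client.Dial(client.Options{
		HostPort: "my-proxy.internal:7233", // hypothetical proxy endpoint
		ConnectionOptions: client.ConnectionOptions{
			DialOptions: []grpc.DialOption{
				grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(128 * 1024 * 1024)),
			},
		},
	})
	if err != nil {
		log.Fatalln("unable to create Temporal client", err)
	}
	defer c.Close()
}
```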
