Documentation of Workflow Task Timeout

On the documentation page Workflows | Temporal Documentation, under the subsection for Workflow Task Timeout, it says the following:
The default value is 10 seconds. This timeout is primarily available to recognize whether a Worker has gone down so that the Workflow Execution can be recovered on a different Worker. The main reason for increasing the default value would be to accommodate a Workflow Execution that has a very long Workflow Execution History that could take longer than 10 seconds for the Worker to load.

Why would a long Workflow Execution History cause the Worker to take longer to load? The Worker is remote to the Temporal Cluster and only ever produces the next set of Commands from its local state, but the history itself is maintained by the cluster. I wanted to understand this statement better. Can you please provide some insight here? Thanks!


Hello @Himaja

As you mentioned above, if a worker crashes, the workflow execution is placed on a different worker. As part of the Workflow Task, that worker pulls the workflow history and replays the workflow execution in order to recreate the workflow state, and then continues from there.

the history itself is maintained by the cluster

That is true, but the history is also cached by the worker executing the workflow.
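For reference, the worker-side cache in question is the sticky workflow cache, and its size is a process-wide setting. Here is a minimal sketch, assuming the Go SDK and a hypothetical task queue name; the cache size value is just an illustration:

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// The sticky cache holds workflow executions (and their replayed state) in the
	// worker process. A larger cache reduces how often a workflow must be fully
	// replayed from history. This is process-wide and must be set before workers start.
	worker.SetStickyWorkflowCacheSize(4096)

	c, err := client.Dial(client.Options{}) // connects to localhost:7233 by default
	if err != nil {
		log.Fatalln("unable to create Temporal client", err)
	}
	defer c.Close()

	w := worker.New(c, "example-task-queue", worker.Options{})
	// w.RegisterWorkflow(...) and w.RegisterActivity(...) would go here.

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited", err)
	}
}
```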


Thank you, that helps.

Would another reason for replay be the workflow getting evicted from the Worker’s cache? @antonio.perez

Yes, a workflow that was pushed out of the cache needs to be recovered (replayed from history) the next time it needs to make progress.

Hi @maxim @antonio.perez, do you have a list of possible scenarios in which a workflow may be evicted from the worker’s cache? For example, could it happen if Workflow A spawns and executes several child workflows, say 20 of them? Then, on resuming execution of Workflow A, the worker may need to fetch and replay the full workflow history again?

To give context: in our use case we have 5-6 worker instances running under heavy load, and even so, in some cases we observe that a WorkflowTask StartToClose timeout is hit when a Workflow Task is started after executing multiple child workflows.

It looks to me like an obvious mitigation would be to increase the WorkflowTaskStartToCloseTimeout to 1 minute (as 1 minute is the maximum limit allowed by the Server, if I recall correctly?). But could there hypothetically be cases where even 1 minute is not enough?
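In case it helps others, this is the knob I mean. A minimal sketch, assuming the Go SDK, a hypothetical workflow ID and task queue, and a workflow type that is registered on the worker elsewhere:

```go
package main

import (
	"context"
	"log"
	"time"

	"go.temporal.io/sdk/client"
)

func main() {
	c, err := client.Dial(client.Options{}) // connects to localhost:7233 by default
	if err != nil {
		log.Fatalln("unable to create Temporal client", err)
	}
	defer c.Close()

	opts := client.StartWorkflowOptions{
		ID:                  "order-12345",        // hypothetical workflow ID
		TaskQueue:           "example-task-queue", // hypothetical task queue
		WorkflowTaskTimeout: time.Minute,          // default is 10 seconds
	}
	// Workflow type referenced by name here; replace with your registered workflow.
	run, err := c.ExecuteWorkflow(context.Background(), opts, "YourWorkflow", "some-input")
	if err != nil {
		log.Fatalln("unable to start workflow", err)
	}
	log.Println("started workflow", run.GetID(), "run", run.GetRunID())
}
```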

When the cluster and workers are provisioned correctly, you should never hit the 10-second timeout, even if the history is replayed. I recommend keeping the workflow history size under control and not passing large blobs as part of the workflow data.
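One common way to follow that advice is to pass a small reference (an object key, a database ID) through the workflow and let activities move the actual bytes to and from external storage, so only the reference ends up in history. A rough sketch, assuming the Go SDK; the BlobRef type and all storage/delivery helpers below are hypothetical placeholders:

```go
package blobref

import "context"

// Sketch: keep large payloads out of workflow history by passing a reference
// (here a bucket/key pair) between activities instead of the bytes themselves.
// Only the small BlobRef struct is recorded in history; the multi-MB payload
// lives in external storage.

type BlobRef struct {
	Bucket string
	Key    string
}

// GenerateReportActivity builds a large result, writes it to external storage,
// and returns only a reference for the workflow to pass along.
func GenerateReportActivity(ctx context.Context, orderID string) (BlobRef, error) {
	data := buildLargeReport(orderID) // could be many MB
	ref := BlobRef{Bucket: "reports", Key: "order/" + orderID}
	if err := putObject(ctx, ref, data); err != nil {
		return BlobRef{}, err
	}
	return ref, nil
}

// SendReportActivity re-reads the payload from storage using the reference.
func SendReportActivity(ctx context.Context, ref BlobRef) error {
	data, err := getObject(ctx, ref)
	if err != nil {
		return err
	}
	return deliverReport(ctx, data)
}

// Placeholder storage and delivery helpers; swap in S3, GCS, a database, etc.
func buildLargeReport(orderID string) []byte                     { return []byte("...") }
func putObject(ctx context.Context, ref BlobRef, b []byte) error { return nil }
func getObject(ctx context.Context, ref BlobRef) ([]byte, error) { return []byte("..."), nil }
func deliverReport(ctx context.Context, b []byte) error          { return nil }
```

Another option with a similar effect is a custom payload codec / DataConverter that offloads large payloads to external storage transparently; the reference-passing pattern above just keeps the behavior explicit in the workflow code.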

Hey @maxim, thanks for your response.

We’ve uncovered the root cause of our WorkflowTaskTimeout error. It appears that the PollWorkflowTaskQueue call is failing because the workflow history size exceeds the gRPC response payload limit of 4 MB, so our worker repeatedly attempts to poll for the task and the workflow history and fails each time. Looks like we’ll need to do some adjusting on our workflow/activity return structs.


Just to provide an update for future thread readers: we found that PollWorkflowTaskQueue was failing to obtain the full workflow history (around 9 MB) because of gRPC’s default response payload limit (4 MB). If you’re using the Temporal client to communicate directly with the Temporal Server, this limit is raised by Temporal to 128 MB, but if you have some sort of proxy in between, you might want to check whether your proxy has its default set to 4 MB as well.
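For anyone who also wants to rule out the client side: the 4 MB figure is gRPC’s default maximum receive message size. A minimal sketch of where that knob lives when dialing the Temporal client in the Go SDK (the host name is a placeholder and the dial options are an illustration only; in our case the SDK default was already fine and the real fix was the proxy’s limit):

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"google.golang.org/grpc"
)

func main() {
	// The Temporal Go SDK already raises the gRPC receive limit for its own
	// connection, so this is only relevant if you are supplying custom dial
	// options. Any proxy in between enforces its own limits separately.
	c, err := client.Dial(client.Options{
		HostPort: "my-proxy.internal:7233", // hypothetical proxy endpoint
		ConnectionOptions: client.ConnectionOptions{
			DialOptions: []grpc.DialOption{
				grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(128 * 1024 * 1024)),
			},
		},
	})
	if err != nil {
		log.Fatalln("unable to create Temporal client", err)
	}
	defer c.Close()
}
```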
