Workflow task timeout after activity is completed

Hi,

We’re using Temporal in production to manage our clients’ scheduled automations. Recently, I’ve noticed a rise in Workflow Task Timeout errors — from zero to 60 over the past two weeks.

To address this, I increased the timeout settings, but the issue persists. Here’s the current configuration:

  • Worker StickyScheduleToStartTimeout: 2m
  • Workflow WorkflowTaskTimeout: 2m
  • Activity ScheduleToStartTimeout: 2m
  • Activity StartToCloseTimeout: 1m

The Workflow Task Timeout seems to occur randomly — sometimes right after the first activity completes, sometimes after the second, and occasionally after a signal is received. It feels like the workflow is idling after completing a task, waiting for something that never happens.

Any insights on what might be causing this?

Check whether the timeout times line up with worker restarts or shard movement on the service side.
For restarts, you could check:
sum(rate(service_requests{service_name="frontend", operation="DescribeNamespace"}[1m]) or on () vector(0))
For shard movement:

sum(rate(sharditem_created_count{service_name="history"}[1m]))
sum(rate(sharditem_removed_count{service_name="history"}[1m]))
sum(rate(sharditem_closed_count{service_name="history"}[1m]))

From the event history, it looks like you also have a non-determinism issue (after the workflow task timeout). Compare the identity field of events 25 and 28 to see whether it's the same worker. Also, setting a workflow run/execution timeout is not really recommended (it's the reason your workflow execution timed out). What's the use case for setting it?

Thank you for the reply @tihomir

I’m using Temporal with PostgreSQL.

  • What are these queries?
  • Where and how can I run them?

I think I can remove some of the timeouts; they are a legacy of our first implementation.

The identities are different. It looks like Temporal moved the task to a different worker after the timeout and then hit an NDE. However, I have checked for possible NDEs with the workflow check tool, and there is no error in its output.

[TMPRL1100] unknown command CommandType: Activity, ID: c3958b8c-f295-4b72-9c86-30618c4e9593, possible causes are nondeterministic workflow definition code or incompatible change in the workflow definition

c3958b8c-f295-4b72-9c86-30618c4e959 is the first activity of the workflow.

Those are Grafana queries (against the server metrics). Is the first activity scheduled conditionally in your workflow code? Also check that all workers have the same activity code registered.

Found the reason for the NDE. In previous workflow versions I was, for some reason, generating a UUID for the activityID. The activityID in Temporal must be deterministic. I have removed it, and now I can replay the history successfully.
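To illustrate why a freshly generated ID breaks replay, here is a toy model in plain Go. It is not the Temporal SDK and not its actual replay logic: it assumes replay simply re-runs the workflow code and compares the activity IDs it produces against those recorded in history. All names (`randomID`, `sequenceID`, `runWorkflow`, `replayMatches`) are hypothetical.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// randomID mimics generating a fresh UUID-like activity ID on every execution.
func randomID() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// sequenceID derives the ID from the command's position in the workflow,
// so every execution of the same code yields the same ID.
func sequenceID(seq int) string {
	return fmt.Sprintf("activity-%d", seq)
}

// runWorkflow simulates workflow code scheduling n activities and
// returns the activity IDs it emitted, in order.
func runWorkflow(n int, newID func(seq int) string) []string {
	ids := make([]string, 0, n)
	for seq := 0; seq < n; seq++ {
		ids = append(ids, newID(seq))
	}
	return ids
}

// replayMatches re-runs the workflow, as a different worker would after a
// workflow task timeout, and reports whether the IDs match the history.
func replayMatches(history []string, newID func(seq int) string) bool {
	replayed := runWorkflow(len(history), newID)
	for i := range history {
		if history[i] != replayed[i] {
			return false // the toy equivalent of "unknown command ... ID"
		}
	}
	return true
}

func main() {
	random := func(int) string { return randomID() }
	fmt.Println("random IDs replay cleanly:", replayMatches(runWorkflow(2, random), random))
	fmt.Println("sequence IDs replay cleanly:", replayMatches(runWorkflow(2, sequenceID), sequenceID))
}
```

In this model the random variant fails on the very first activity, which is consistent with the error above pointing at the first activity of the workflow.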

I’m going to run a load test now that the activityID issue is fixed. I expect another worker to continue the task after a workflow task timeout on the previous worker.