Workflow task timeout after activity is completed

hamed_Yousefi1 · April 17, 2025, 1:21pm

Hi,

We’re using Temporal in production to manage our clients’ scheduled automations. Recently, I’ve noticed a rise in Workflow Task Timeout errors — from zero to 60 over the past two weeks.

To address this, I increased the timeout settings, but the issue persists. Here’s the current configuration:

Worker StickyScheduleToStartTimeout: 2m
Workflow WorkflowTaskTimeout: 2m
Activity ScheduleToStartTimeout: 2m
Activity StartToCloseTimeout: 1m

The Workflow Task Timeout seems to occur randomly — sometimes right after the first activity completes, sometimes after the second, and occasionally after a signal is received. It feels like the workflow is idling after completing a task, waiting for something that never happens.

Any insights on what might be causing this?

tihomir · April 20, 2025, 11:00am

Check whether the timeout times match your worker restarts or shard movement on the service side.
For restarts you could check:
sum(rate(service_requests{service_name="frontend", operation="DescribeNamespace"}[1m]) or on () vector(0))
For shard movement:

sum(rate(sharditem_created_count{service_name="history"}[1m]))
sum(rate(sharditem_removed_count{service_name="history"}[1m]))
sum(rate(sharditem_closed_count{service_name="history"}[1m]))

From the event history, it also looks like you have a non-determinism issue (after the workflow task timeout). Compare the identity field of events 25 and 28 to see whether it's the same worker. Also, setting a workflow run/execution timeout is not really recommended (it is the reason your workflow execution timed out). What's the use case for setting it?

hamed_Yousefi1 · April 21, 2025, 9:31am

Thank you for the reply @tihomir

I’m using Temporal with PostgreSQL.

What are these queries?
Where and how can I run them?

I think I can remove some of the timeouts, they are legacy from our first implementation.

hamed_Yousefi1 · April 21, 2025, 9:58am

The identities are different. It looks like Temporal tried to switch to another worker after the timeout, and that worker then got the NDE. But I have checked for possible NDEs with the workflow check tool and there is no error in the output.

[TMPRL1100] unknown command CommandType: Activity, ID: c3958b8c-f295-4b72-9c86-30618c4e9593, possible causes are nondeterministic workflow definition code or incompatible change in the workflow definition

c3958b8c-f295-4b72-9c86-30618c4e959 is the first activity of the workflow.

tihomir · April 21, 2025, 12:42pm

The queries are Grafana queries (server metrics). Is the first activity scheduled conditionally in your workflow code? Also check that all workers have the same activity code registered.

hamed_Yousefi1 · April 21, 2025, 6:28pm

Found the reason for the NDE. In previous workflow versions I was, for some reason, setting the activityID to a UUID. The activityID in Temporal must be deterministic. I have removed it, and now I can replay the history successfully.

I’m going to run a load test now that the activityID issue is fixed. I expect another worker to continue the task after a workflow task timeout on the previous worker.
