Temporal activity timeout issue

xwang · December 12, 2020, 12:32am

Hi,

We are seeing unexpected timeouts for a Temporal activity that happen seemingly at random. Here is the detailed context:

We have a scheduler service and an aggregator service, both built on Temporal. The scheduler service schedules aggregation workflows every 15 minutes, so every 15 minutes around 700 workflows arrive in a spike.

The aggregator service has 8 pods, and each pod runs 1 Temporal worker with 12 polling threads for both activities and workflows. Currently only 1 workflow type (containing 2 activities) is registered on these workers, and most of the time the workflow finishes within 1 minute.

However, occasionally one or two aggregation workflows time out (15 minutes) while stuck in the state PENDING_ACTIVITY_STATE_STARTED. It seems the activity is not picked up by any worker thread for the entire 15 minutes. I've attached a screenshot of the Temporal history.

Under what scenarios would such an activity not be picked up? Does our configuration look OK, or is this caused by some other issue?

Wenquan_Xing · December 12, 2020, 12:49am

Can you try making the activity idempotent and adding a retry policy?

Currently there is an issue: if the poller (SDK) times out before getting the activity task, the task can only end by timing out: https://github.com/temporalio/temporal/issues/1058
We are actively working on this issue. In the meantime, please add a retry policy to the activity.

xwang · December 12, 2020, 3:55am

Yes, these activities are indeed idempotent, and we will reconfigure the timeout to make sure the retry takes effect. In the meantime, please keep us posted on the results of the investigation. Thanks!

Wenquan_Xing · December 18, 2020, 1:03am

UPDATE:

However, occasionally one or two aggregation workflows time out (15 minutes) while stuck in the state PENDING_ACTIVITY_STATE_STARTED. It seems the activity is not picked up by any worker thread for the entire 15 minutes. I've attached a screenshot of the Temporal history.

PENDING_ACTIVITY_STATE_STARTED means that, from Temporal's point of view, the activity has already started: the server is waiting for the SDK to respond with either a heartbeat (if configured) or a completion. In other words, the activity has already been picked up by the SDK.
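
To make the distinction concrete, here is a toy Python model (not Temporal SDK code) of which activity timeouts can fire in each pending state. The enum values mirror the state names seen in the workflow history; the function name `applicable_timeouts` and the timeout labels are made up for this illustration.

```python
from enum import Enum


class PendingActivityState(Enum):
    # Task is waiting in the task queue; no worker has received it yet.
    SCHEDULED = "PENDING_ACTIVITY_STATE_SCHEDULED"
    # Task was handed to a worker; the server now waits for a
    # heartbeat (if configured) or a completion.
    STARTED = "PENDING_ACTIVITY_STATE_STARTED"


def applicable_timeouts(state):
    """Return the activity timeouts that can fire while in `state`.

    Hypothetical helper for illustration only.
    """
    # Schedule-to-close covers the whole lifetime of the activity.
    timeouts = ["schedule_to_close"]
    if state is PendingActivityState.SCHEDULED:
        timeouts.append("schedule_to_start")
    elif state is PendingActivityState.STARTED:
        timeouts += ["start_to_close", "heartbeat"]
    return timeouts


if __name__ == "__main__":
    for state in PendingActivityState:
        print(state.value, "->", applicable_timeouts(state))
```

Read this way, an activity stuck in the STARTED state until a timeout fires points at a worker that received the task and then went silent, rather than at a task nobody polled for.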

Since the SDK can experience network timeouts and restarts, the caller should (in general) make the activity implementation idempotent and configure the activity with a proper heartbeat timeout and retry policy.
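
As a rough sketch of why idempotency and retries go together, the plain-Python simulation below does not use the Temporal SDK. The names `aggregate`, `run_with_retry`, `LostWorker`, and the in-memory `results` store are all hypothetical; in a real deployment the retry would be driven by the server-side retry policy after a start-to-close or heartbeat timeout, and the store would be durable.

```python
class LostWorker(Exception):
    """Stands in for a worker that crashed or lost its connection."""


# Hypothetical result store keyed by an idempotency key.
results = {}


def aggregate(key, values, fail=False):
    """Idempotent activity: a repeat call with the same key reuses the result."""
    if key in results:
        return results[key]
    if fail:
        # Simulate the worker disappearing before the work is recorded.
        raise LostWorker(key)
    results[key] = sum(values)
    return results[key]


def run_with_retry(key, values, failures, max_attempts=3):
    """Retry loop standing in for a retry policy.

    The first `failures` attempts are made to fail. Returns the result
    together with the attempt number that produced it.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return aggregate(key, values, fail=attempt <= failures), attempt
        except LostWorker:
            continue
    raise RuntimeError("activity failed after %d attempts" % max_attempts)


if __name__ == "__main__":
    print(run_with_retry("wf-1", [1, 2, 3], failures=1))
    # A later duplicate execution returns the stored result.
    print(run_with_retry("wf-1", [1, 2, 3], failures=0))
```

Because the activity checks the store before doing any work, a retry after a lost worker, or a duplicate execution, yields the same result without repeating the aggregation.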

NOTE: The GitHub issue I mentioned above seems to be related to the activity schedule-to-start timeout being too short.

Wenquan_Xing · December 18, 2020, 1:36am

BTW, it also seems that we need to improve the naming. I was also tripped up by PENDING_ACTIVITY_STATE_STARTED: it means the activity is in the started state, not waiting to be started.
