Temporal activity timeout issue

Hi,

We are seeing unexpected timeouts for a Temporal activity that happen randomly. Here is the detailed context:

We have a scheduler service and an aggregator service, both on Temporal. The scheduler service schedules aggregation workflows every 15 minutes, so every 15 minutes around 700 workflows come in as a spike.

The aggregator service has 8 pods, and each pod has 1 Temporal worker with 12 polling threads for both activity and workflow tasks. Currently we have only 1 workflow type (the workflow contains 2 activities) registered on these workers, and most of the time the workflow finishes within 1 minute.

However, we occasionally see one or two aggregation workflows time out (15 minutes), stuck right at the state PENDING_ACTIVITY_STATE_STARTED. It seems that the activity is not picked up by any worker thread for the entire 15 minutes. I've attached a screenshot of the Temporal history.


We want to know under what scenarios such an activity won't be picked up. Does our configuration look OK? Or is this caused by some other issue?

Can you try making the activity idempotent and adding a retry policy?

Currently there is an issue: if the poller (SDK) times out before getting the activity task, the task will just time out: https://github.com/temporalio/temporal/issues/1058
We are actively working on this issue. In the meantime, please add a retry policy to the activity.

Yes, these activities are indeed idempotent, and we will reconfigure the timeout to make sure the retry takes effect. Please keep us posted on the investigation of this issue in the meantime. Thanks!
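To illustrate why the timeout needs reconfiguring before a retry can take effect, here is a rough back-of-the-envelope model. It is not Temporal SDK code: `attempts_within_budget` is a hypothetical helper, it assumes the 15-minute limit is the overall budget, and the defaults (1 second initial interval, backoff coefficient 2.0, maximum interval of 100x the initial interval) are my understanding of the default retry policy, so please check them against your SDK version.

```python
from datetime import timedelta


def attempts_within_budget(
    budget,
    start_to_close,
    initial_interval=timedelta(seconds=1),
    backoff_coefficient=2.0,
    maximum_interval=None,
):
    """Count how many attempts can start before `budget` elapses.

    Worst case: every attempt hangs for the full start-to-close timeout,
    then waits one backoff interval before the next attempt starts.
    """
    if maximum_interval is None:
        # Assumed default cap: 100x the initial interval.
        maximum_interval = initial_interval * 100
    elapsed = timedelta(0)
    interval = initial_interval
    attempts = 0
    while elapsed < budget:
        attempts += 1
        elapsed += start_to_close  # the attempt runs until it times out
        elapsed += interval        # backoff before the next attempt
        interval = min(interval * backoff_coefficient, maximum_interval)
    return attempts


budget = timedelta(minutes=15)
# Per-attempt timeout equal to the whole budget: no room for a retry.
print(attempts_within_budget(budget, timedelta(minutes=15)))  # 1
# A shorter per-attempt timeout leaves room for several attempts.
print(attempts_within_budget(budget, timedelta(minutes=2)))   # 7
```

Since the workflow normally finishes within 1 minute, a per-attempt timeout of a few minutes would leave room for retries inside the same 15-minute window.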

UPDATE:

However, we occasionally see one or two aggregation workflows time out (15 minutes), stuck right at the state PENDING_ACTIVITY_STATE_STARTED. It seems that the activity is not picked up by any worker thread for the entire 15 minutes. I've attached a screenshot of the Temporal history.

PENDING_ACTIVITY_STATE_STARTED means that, from Temporal's point of view, the activity has already started and the server is waiting for the SDK to respond with either a heartbeat (if configured) or a completion. In other words, the activity has already been picked up by the SDK.
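As a quick reference for reading the pending-activity state, here is a small lookup sketch. The state names are the ones I know from the server's pending activity enum (the list may not be exhaustive for newer versions), and `is_waiting_for_worker` is a hypothetical helper for illustration, not an SDK function.

```python
# Plain-language reading of each pending activity state.
PENDING_ACTIVITY_STATES = {
    "PENDING_ACTIVITY_STATE_SCHEDULED":
        "task is on the task queue; no worker has picked it up yet",
    "PENDING_ACTIVITY_STATE_STARTED":
        "a worker has picked it up; server awaits heartbeat or completion",
    "PENDING_ACTIVITY_STATE_CANCEL_REQUESTED":
        "cancellation was requested for the activity",
}


def is_waiting_for_worker(state):
    """True only when the activity has not been handed to a worker yet."""
    return state == "PENDING_ACTIVITY_STATE_SCHEDULED"


state = "PENDING_ACTIVITY_STATE_STARTED"
print(PENDING_ACTIVITY_STATES[state])
print(is_waiting_for_worker(state))  # False
```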

Since the SDK can experience network timeouts and restarts, the caller should (in general) make the activity implementation idempotent and configure the activity with a proper heartbeat timeout and retry policy.
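To show what the heartbeat timeout buys you, here is a simplified model of how long the server would wait before declaring a silently dead attempt timed out (after which the retry policy can schedule a new attempt). `detection_delay` is a hypothetical helper and the numbers are made up for illustration; it is not how the server is implemented.

```python
from datetime import timedelta


def detection_delay(start_to_close, heartbeat_timeout=None,
                    last_heartbeat_at=timedelta(0)):
    """Time from attempt start until the attempt is considered timed out,
    assuming the worker dies silently right after its last heartbeat."""
    if heartbeat_timeout is None:
        # Without heartbeats, only the start-to-close timeout applies.
        return start_to_close
    # With heartbeats, the attempt fails once a heartbeat is overdue.
    return min(start_to_close, last_heartbeat_at + heartbeat_timeout)


# No heartbeat configured: the failure is noticed only after 15 minutes.
print(detection_delay(timedelta(minutes=15)))
# 30 s heartbeat timeout, worker died after a heartbeat at t = 10 s.
print(detection_delay(timedelta(minutes=15),
                      heartbeat_timeout=timedelta(seconds=30),
                      last_heartbeat_at=timedelta(seconds=10)))
```

In this model the heartbeat timeout cuts the detection time from 15 minutes to 40 seconds, which is what leaves room for a retry.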

NOTE: The GitHub issue I mentioned above seems to be related to the activity schedule-to-start timeout being too short.

BTW, it also seems that we need to improve the naming. I was also tripped up by PENDING_ACTIVITY_STATE_STARTED: it means the activity is in the started state, not waiting to be started.