We are seeing unexpected timeouts for a Temporal activity that happen randomly. Here is the detailed context:
We have a scheduler service and an aggregator service, both on Temporal. The scheduler service schedules aggregation workflows every 15 minutes, so every 15 minutes around 700 workflows arrive in a spike.
The aggregator service has 8 pods, and each pod runs 1 Temporal worker with 12 polling threads for both activity and workflow tasks. Currently only 1 workflow type (containing 2 activities) is registered on these workers, and most of the time the workflow finishes within 1 minute.
However, occasionally one or two aggregation workflows time out (after 15 minutes) with the activity stuck in the state PENDING_ACTIVITY_STATE_STARTED. It seems that it is not picked up by any worker thread for the entire 15 minutes. I've attached a screenshot of the Temporal history.
We want to know under what scenarios such an activity would not be picked up. Does our configuration look OK, or is this caused by some other issue?
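
To put the load described above in numbers, here is a rough back-of-the-envelope sketch in Python. It assumes the 12 pollers per worker apply to activity tasks and that a spike is exactly 700 workflows; the function and variable names are only illustrative and are not part of any Temporal API. Note that pollers only fetch tasks, and how many activities actually run concurrently is governed by a separate worker setting that this sketch does not model.

```python
import math

# Figures taken from the description above; the spike size is approximate.
PODS = 8
POLLERS_PER_WORKER = 12        # assumed to apply to activity tasks
WORKFLOWS_PER_SPIKE = 700
ACTIVITIES_PER_WORKFLOW = 2


def spike_load(pods, pollers_per_worker, workflows, activities_per_workflow):
    """Return rough per-spike totals for activity pollers and activity tasks."""
    total_pollers = pods * pollers_per_worker
    activity_tasks = workflows * activities_per_workflow
    return {
        "activity_pollers": total_pollers,
        "activity_tasks": activity_tasks,
        # Upper bound on tasks a single poller would fetch if the
        # tasks were spread evenly across all pollers.
        "tasks_per_poller": math.ceil(activity_tasks / total_pollers),
    }


load = spike_load(PODS, POLLERS_PER_WORKER, WORKFLOWS_PER_SPIKE, ACTIVITIES_PER_WORKFLOW)
print(load)
```

So each spike is roughly 1400 activity tasks spread over 96 activity pollers across the 8 pods.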