For context, we have long-running activities in Temporal and we self-host the Temporal cluster.
We’re observing two cases where workflows intermittently get stuck and never progress.
In the first case, the workflow itself is stuck: the workflow execution has started and a workflow task has been scheduled, but the workflow stays in this state indefinitely.
In the metrics dashboard, we observe that the Matching Service never receives AddWorkflowTask from the History Service in this case, although I believe it should.
In the second case, the activities are stuck in PENDING_ACTIVITY_STATE_SCHEDULED.
In the metrics dashboard, we observe that the Matching Service receives AddWorkflowTask but doesn’t receive AddActivityTask from the History Service in this case. For a successful workflow, we observe that both requests are received. We believe this isn’t related to any specific activity code, as we see it happening across activities (both long and short running).
Note:
We checked the resource utilization of all components in the Temporal service; none of them exceeds 15%.
Running sum(rate(persistence_error_with_type[1m])) by (operation, service_name) shows persistence errors, specifically for the GetTaskQueueUserData and GetTaskQueue operations.
I am working with @inishchith on this and am adding more context that might be useful (this applies to workflows where the activity tasks are stuck but the workflow task has completed, i.e. the second case in the original message):
In the Temporal Server metrics, none of the failed workflows show an “AddActivityTask” event (however, we do see “AddWorkflowTask” and “RespondWorkflowTaskCompleted”).
In the history service dashboard, we don’t see any “TransferActiveTaskActivity” or “TimerActiveTaskActivityTimeout” events.
On further investigation with the following setup (1 master/read-write instance and 2 replicas/read-only), we found that this occurs quite frequently when Temporal is set up to connect through PgPool. It has yet to occur when Temporal is set up to connect directly to the postgres-master.
This suggests a requirement for strong consistency, with minimal to no read latency, but we would love to read further on this.
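To make the hypothesis above concrete, here is a toy simulation of read-after-write staleness. It is only a sketch under stated assumptions: it is not Temporal or PgPool code, and the names Primary, LaggingReplica, LoadBalancer and count_stale_reads are hypothetical. It assumes writes always go to the primary, reads are round-robined across the primary and two replicas, and the replicas have not yet caught up when the read happens.

```python
import itertools


class Primary:
    """Read-write node: every write is visible to its own reads immediately."""

    def __init__(self):
        self.rows = {}

    def write(self, key, value):
        self.rows[key] = value

    def read(self, key):
        return self.rows.get(key)


class LaggingReplica:
    """Read-only node that only sees writes after an explicit sync()."""

    def __init__(self, primary):
        self.primary = primary
        self.rows = {}

    def sync(self):
        # Stand-in for replication catching up (assumption: full copy).
        self.rows = dict(self.primary.rows)

    def read(self, key):
        return self.rows.get(key)


class LoadBalancer:
    """Sends writes to the primary and round-robins reads over all nodes."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._readers = itertools.cycle([primary, *replicas])

    def write(self, key, value):
        self.primary.write(key, value)

    def read(self, key):
        return next(self._readers).read(key)


def count_stale_reads(store, n_tasks):
    """Write a task row, read it straight back, and count reads that miss it."""
    stale = 0
    for i in range(n_tasks):
        store.write(f"task-{i}", "scheduled")
        if store.read(f"task-{i}") is None:
            stale += 1
    return stale


primary = Primary()
replicas = [LaggingReplica(primary), LaggingReplica(primary)]
via_balancer = count_stale_reads(LoadBalancer(primary, replicas), 9)
direct = count_stale_reads(Primary(), 9)
print(f"stale reads via balancer: {via_balancer}, direct to primary: {direct}")
```

In this toy model, any read that lands on a replica before it has synced misses the freshly written row, while reads that go straight to the primary never do. That matches the shape of what we observe, but it does not confirm that stale reads are the actual cause of the missing AddActivityTask requests.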