Why does my activity often hit StartToCloseTimeout?

Currently I’m running into a problem and need help!
Temporal-Server version: 1.19
Temporal-SDK: 1.20.0

In my workflow, certain activities often hit their StartToClose timeout.
Some of these activities are idempotent; with a retry policy they eventually succeed on the nth attempt (their timeouts range from 30 seconds to 5 minutes). But other activities are not idempotent and cannot use a retry policy.

I checked the activities' code. Some of them perform only a single database SELECT, yet the StartToClose timeout still occurs. Why?

I checked again whether the worker is alive and confirmed that it is still alive.

I traced the path of a timed-out activity through the history and matching services, as follows:
First, after the workflow task completes, the activity is added to the matching task queue via pushActivity() in the history service;
The matching service's logs confirm that it received the task;
Next, the history service reports "Fail to process task" and "context deadline" for the activity;
Then the history service automatically retries the activity, and the matching service's logs confirm that the activity was added a second time;
Shortly after the second add, the matching service reported:
"msg":"history client encountered error","service":"matching","error":"Activity task already started.","service-error-type":"serviceerror.TaskAlreadyStarted"
"msg":"Duplicated activity task","service":"matching","component":"matching-engine","wf-task-queue-name":"/_sys/AGENT_TASK_QUEUE_XXXX-5/2","queue-task-id":-137,"logging-call-at":"matchingEngine.go:520"

Finally, the activity's status on temporal-web remains ActivityTaskScheduled until it times out.

Is this a bug?

What should I do to determine the root cause of the problem?

First, I would check the pending activities info for this execution, either in the UI or via tctl/CLI, for example:

tctl wf desc -w <wfid> -r <runid>

That would give you an indication of whether this activity was dispatched to one of your activity workers.

On the worker SDK metrics side, I would also look at activity_execution_latency, as well as your worker CPU utilization during the times when you notice it happening.
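For example, assuming the Go SDK's Prometheus reporter with a `temporal_` prefix and seconds-unit histogram buckets (adjust the metric and label names to whatever your reporter actually emits), a p95 over activity_execution_latency could look like:

```promql
histogram_quantile(
  0.95,
  sum(rate(temporal_activity_execution_latency_seconds_bucket{namespace="$namespace"}[5m]))
  by (le, activity_type)
)
```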

In addition, look at the request_failure SDK metric; you can focus on RespondActivityTaskCompleted/Failed and the status_code field, so something like:

sum(rate(temporal_request_failure_total{namespace="$namespace", operation="RespondActivityTaskCompleted"}[1m])) by (namespace,operation,activity_type)

So basically, you need to check whether your workers pick up the activity task but have issues responding back to the service for some reason, which then causes the activity timeout that is managed by the service.

Regarding your description, the task processing issue you mentioned seems benign: a task in the transfer queue can be executed multiple times. If it happens a lot, it could indicate a real problem, but if it is intermittent, it typically happens during service or worker restarts.

Yes, I checked the Temporal UI, and the following is what it displays.

The next screenshots show some output I found from the history service and matching service. It seems there are some abnormalities (I am not sure).

The following two screenshots are the logs I added to the history service and matching service, which correspond to the log information mentioned earlier.

I checked repeatedly whether the worker is alive; every time, the result was that it is still alive.

Activity tasks can be lost in transit to the worker if matching or frontend nodes are restarted. Network failures can cause this as well.

Temporal guarantees that an activity will either execute or time out (which usually leads to a retry). But it cannot guarantee exactly-once execution.
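Given at-least-once delivery, a common way to make a "non-idempotent" activity retry-safe is to key the side effect on a stable identifier (e.g. workflow ID plus activity ID, which does not change across retries). A minimal in-memory Go sketch of the idea — a real implementation would use a unique-key constraint in your database rather than a map:

```go
package main

import (
	"fmt"
	"sync"
)

// idempotentExecutor runs a side effect at most once per key, so an
// at-least-once delivery of an activity cannot apply the effect twice.
// The map stands in for a database table with a unique key constraint.
type idempotentExecutor struct {
	mu   sync.Mutex
	done map[string]bool
}

func newIdempotentExecutor() *idempotentExecutor {
	return &idempotentExecutor{done: make(map[string]bool)}
}

// Execute runs fn only if key has not been seen before, and reports
// whether fn actually ran.
func (e *idempotentExecutor) Execute(key string, fn func()) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.done[key] {
		return false // duplicate delivery: skip the side effect
	}
	fn()
	e.done[key] = true
	return true
}

func main() {
	ex := newIdempotentExecutor()
	charge := func() { fmt.Println("charged customer") }
	// A natural key is workflowID + "/" + activityID, stable across retries.
	fmt.Println(ex.Execute("wf-1/charge", charge)) // first delivery runs
	fmt.Println(ex.Execute("wf-1/charge", charge)) // retry is a no-op
}
```

With this pattern, the activity can be given a normal retry policy even though its effect must happen only once.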

The matching, frontend, and history services were not restarted during execution.
Yes, the network is fine.

From the above logs you can clearly see that the activity was pushed twice by the history service, and the two corresponding AddActivityTask calls can also be seen in the matching service log.
However, I have a doubt.
After the first pushActivity from history, matching receives it and adds it to the task queue (AddActivityTask).
Then history reports "Fail to process task" and "context deadline" and automatically retries. (I understand this is normal processing.)
After the retry, matching adds the task to the queue again.

I understand that the worker's poller will then call PollActivityTaskQueue again.
During the execution of PollActivityTaskQueue, I see the following errors:
"msg":"history client encountered error","service":"matching","error":"Activity task already started.","service-error-type":"serviceerror.TaskAlreadyStarted"
"msg":"Duplicated activity task","service":"matching","component":"matching-engine","wf-task-queue-name":"/_sys/AGENT_TASK_QUEUE_XXXX-5/2","queue-task-id":-137,"logging-call-at":"matchingEngine.go:520"

Next, the task will execute task.finish(nil).

Is this the reason why the worker cannot receive the response normally?

There could be multiple reasons for failing to process the task. It could be CPU spike, DB slowdown, dropped connection, memory pressure, etc. By design, it is OK to lose a task assuming that it is started. This ensures that a StartToClose timer is created. Temporal never drops tasks without creating a corresponding timeout timer.

For 5-minute activities, I recommend setting a shorter heartbeat timeout and heartbeating from the activity. This allows much faster retries.

Yes, I also suspect a DB slowdown or memory pressure, but how can I prove the impact of the DB slowdown (or memory) on my workloads?

I would start from DB latency metrics, then request latency metrics, and then correlate them with CPU/memory metrics.

I use PostgreSQL as Temporal's persistence store, and I found that some commit requests exceed 2 seconds (or even 5 seconds). How much impact does this have on the history service?

That is exceptionally high latency. So, your DB is certainly overloaded.
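To quantify this from the server side, the history service records per-operation persistence latencies. Assuming Prometheus scraping and the server's persistence_latency histogram (the exact metric and label names may differ in your deployment and server version), a p95 breakdown by operation could look like:

```promql
histogram_quantile(
  0.95,
  sum(rate(persistence_latency_bucket{service_name="history"}[5m]))
  by (le, operation)
)
```

Correlating spikes in this query with your "Fail to process task" / "context deadline" log lines would confirm whether the DB is the bottleneck.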