Activity staying in pending

An activity got stuck (scheduled but not started) when I executed a workflow.
My worker was alive and reported no errors, but the matching service logged the following at that time:

{"level":"info","ts":"2023-07-15T19:29:12.986Z","msg":"history client encountered error","service":"matching","error":"Activity task already started.","service-error-type":"serviceerror.TaskAlreadyStarted","logging-call-at":"metric_client.go:90"}
{"level":"info","ts":"2023-07-15T16:33:55.400Z","msg":"history client encountered error","service":"matching","error":"Activity task already started.","service-error-type":"serviceerror.TaskAlreadyStarted","logging-call-at":"metric_client.go:90"}

Here is the screenshot: [image not included]

The server version is 1.20.3.
The java-sdk version is 1.20.0.

It does not reproduce consistently, but it does happen.

What are the activity timeouts? My guess is that a timeout is set too high, so the activity was never retried.

The activity StartToCloseTimeout is 1 day; ScheduleToStartTimeout and ScheduleToCloseTimeout are not set.
Should I set ScheduleToCloseTimeout to a short duration to avoid this situation?
If so, what is the root cause of the activity not starting?

So on any intermittent failure, it will take 1 day to detect it and retry. Can this activity really run for up to a day? If so, you have to set a much shorter heartbeat timeout and call heartbeat from the activity periodically.
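
For illustration, here is a minimal sketch of what that could look like with the Java SDK. The `WaitActivity` interface, its `waitFor` method, and the 30-second heartbeat timeout are hypothetical and chosen only for the example; the 1-day StartToCloseTimeout mirrors the value mentioned in this thread. Note that the heartbeat timeout only helps once the server considers the activity started.

```java
import io.temporal.activity.Activity;
import io.temporal.activity.ActivityExecutionContext;
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import java.time.Duration;

public class HeartbeatSketch {

  // Hypothetical activity interface, used only for illustration.
  @ActivityInterface
  public interface WaitActivity {
    void waitFor(long seconds);
  }

  public static class WaitActivityImpl implements WaitActivity {
    @Override
    public void waitFor(long seconds) {
      ActivityExecutionContext ctx = Activity.getExecutionContext();
      for (long elapsed = 0; elapsed < seconds; elapsed++) {
        try {
          Thread.sleep(1000);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw Activity.wrap(e);
        }
        // Report progress; if heartbeats stop arriving, the server can
        // time the attempt out after the heartbeat timeout elapses.
        ctx.heartbeat(elapsed);
      }
    }
  }

  // Options a workflow could pass to Workflow.newActivityStub(...).
  public static ActivityOptions options() {
    return ActivityOptions.newBuilder()
        .setStartToCloseTimeout(Duration.ofDays(1)) // value from this thread
        .setHeartbeatTimeout(Duration.ofSeconds(30)) // assumed value
        .build();
  }
}
```

With options like these, a started activity that stops heartbeating should be detected within roughly the heartbeat timeout rather than after the full StartToCloseTimeout.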

My activity contains a user-defined wait, which may be as short as a few seconds or as long as 30 minutes.
So even with StartToCloseTimeout set to 30 minutes, that is still too long to wait before a retry.
I can set the ScheduleToStartTimeout to 10 seconds, but the maximum number of attempts must be 1, because this activity is currently not idempotent.
With this configuration, if the ScheduleToStartTimeout is reached, will the activity be retried regardless of the maximum number of attempts?
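
For reference, the configuration described above could be expressed roughly as follows with the Java SDK. This is only a sketch of the settings in question, not a recommendation; the class and method names are hypothetical. My understanding is that a ScheduleToStart timeout is not retried by the server regardless of the retry policy, but that should be confirmed against the Temporal documentation for your server version.

```java
import io.temporal.activity.ActivityOptions;
import io.temporal.common.RetryOptions;
import java.time.Duration;

public class ScheduleToStartSketch {

  // Builds the options discussed in the question above.
  public static ActivityOptions options() {
    return ActivityOptions.newBuilder()
        // Longest user-defined wait mentioned in the thread.
        .setStartToCloseTimeout(Duration.ofMinutes(30))
        // Fail if no worker picks the task up within 10 seconds.
        .setScheduleToStartTimeout(Duration.ofSeconds(10))
        // A single attempt, because the activity is not idempotent.
        .setRetryOptions(RetryOptions.newBuilder().setMaximumAttempts(1).build())
        .build();
  }
}
```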

I am sure that the activity never reaches my activity code (while other activities running in parallel execute fine), because its initial log line is not printed.

How can I find out where the failure is and avoid it?
If that is not possible, how can I make the activity fail immediately when such a failure occurs, instead of waiting?

For an activity that can potentially run for a long time, use heartbeating to detect failures faster. I wouldn’t use ScheduleToStart in this case.