A worker can crash, or the activity task can be lost in transit due to networking issues. Temporal relies on timeouts to trigger retries, so with a 1000-minute timeout any intermittent issue means the activity is only retried after 1000 minutes. Why do you need such a long timeout? If it is genuinely needed, you also have to specify a heartbeat timeout and the activity has to heartbeat.
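A minimal sketch of what that could look like with the Go SDK (the workflow/activity names and the heartbeat interval are placeholders, not recommendations):

```go
package sample

import (
	"context"
	"time"

	"go.temporal.io/sdk/activity"
	"go.temporal.io/sdk/workflow"
)

// Workflow side: keep the long StartToClose timeout, but add a short
// HeartbeatTimeout so a crashed worker or a lost task is detected within
// seconds instead of after 1000 minutes.
func MyWorkflow(ctx workflow.Context) error {
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 1000 * time.Minute,
		HeartbeatTimeout:    30 * time.Second, // hypothetical value
	}
	ctx = workflow.WithActivityOptions(ctx, ao)
	return workflow.ExecuteActivity(ctx, LongRunningActivity).Get(ctx, nil)
}

// Activity side: record a heartbeat regularly so the server can fail the
// attempt quickly (and retry it) if heartbeats stop arriving.
func LongRunningActivity(ctx context.Context) error {
	for i := 0; i < 10000; i++ {
		activity.RecordHeartbeat(ctx, i) // progress detail is optional
		// ... do one chunk of work ...
	}
	return nil
}
```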
We sometimes encounter the same issues during periods of very high throughput. Do you have any thoughts on how to debug the actual reason? I thought the same — that some cache had reached its limit or something had happened to prevent the activity from starting.
Can you describe the load — roughly how many RPS and how many concurrent activities?
Are the activities in a pending or started state when you check the UI (or use temporal workflow describe...)?
If they are pending, your workers may be under-provisioned.
Check the schedule-to-start latency. If it is high, check slots_available for worker_type=ActivityWorkers and for the task_queue. If you still have slots available, then the poller count might be the issue.
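If it does turn out to be slots or pollers, here is a sketch of where those knobs live in the Go SDK worker options (task queue name and values are placeholders to tune against your metrics):

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create client", err)
	}
	defer c.Close()

	w := worker.New(c, "my-task-queue", worker.Options{
		// Activity task slots on this worker; slots_available dropping to
		// zero means this is the bottleneck.
		MaxConcurrentActivityExecutionSize: 200,
		// Pollers fetching activity tasks from the task queue; if slots are
		// free but schedule-to-start latency is high, consider raising this.
		MaxConcurrentActivityTaskPollers: 10,
	})

	// Register workflows and activities here, then run the worker.
	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("unable to start worker", err)
	}
}
```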
It can be related to the server configuration as well.