The activity is not executing; it stays in the 'Pending Activities' tab and there is no other error.

‘activityStartToCloseTimeInMinutes’ is set to 1000 minutes.

Here are the server-side pod resource stats:
NAME CPU(cores) MEMORY(bytes)
temporal-admintools-ddb9bcb76-ngsvd 9m 45Mi
temporal-frontend-89fb84f8d-b56wb 14m 92Mi
temporal-grafana-6f8777ccfb-6mf5d 10m 115Mi
temporal-history-85d5475d66-2rpmf 77m 774Mi
temporal-kube-state-metrics-5d6b488f69-cfdx4 10m 70Mi
temporal-matching-cfccf5465-tfzmn 10m 110Mi
temporal-prometheus-node-exporter-7qcmj 1m 11Mi
temporal-prometheus-node-exporter-qjnhd 4m 11Mi
temporal-prometheus-node-exporter-rx8c6 1m 9Mi
temporal-prometheus-node-exporter-xlfmg 4m 11Mi
temporal-prometheus-node-exporter-zrh29 1m 11Mi
temporal-prometheus-pushgateway-764cb87c99-djzfr 8m 63Mi
temporal-prometheus-server-855474d848-qk9wz 20m 901Mi
temporal-web-5699cccc5d-mqxpg 9m 88Mi
temporal-worker-8c4477cd4-xs5qp 6m 80Mi

Please let me know if any other info is needed, or point me to any resources that could help figure this out.

It looks like this activity is waiting for the StartToClose timeout.

One more piece of info:

When I reset the workflow to the most recent completed state (using the UI), the activity starts normally.

StartToCloseTimeout is 1000 minutes, and the activity has just started.

A worker can crash, or the activity task can be lost in transit due to networking issues. Temporal relies on timeouts for retries, so on any intermittent issue this activity is retried only after 1000 minutes. Why do you need such a long timeout? If it is indeed needed, you have to specify a heartbeat timeout and the activity has to heartbeat.
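For illustration, here is a minimal sketch of that advice using the Temporal Java SDK (the workflow, activity, and method names are made up, not from this thread): the activity heartbeats while it works, and the workflow pairs the long StartToCloseTimeout with a short HeartbeatTimeout so a crashed worker or a lost task is detected within seconds rather than after 1000 minutes.

```java
import io.temporal.activity.Activity;
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;
import java.time.Duration;

public class HeartbeatExample {

  @ActivityInterface
  public interface FileActivities {
    void processLargeFile(String path);
  }

  // Activity implementation: heartbeats regularly so the server notices a
  // crashed worker or a lost task within HeartbeatTimeout instead of
  // waiting out the full StartToCloseTimeout.
  public static class FileActivitiesImpl implements FileActivities {
    @Override
    public void processLargeFile(String path) {
      for (int chunk = 0; chunk < 1000; chunk++) {
        // ... process one chunk of the file ...
        Activity.getExecutionContext().heartbeat(chunk); // progress detail is optional
      }
    }
  }

  @WorkflowInterface
  public interface FileWorkflow {
    @WorkflowMethod
    void run(String path);
  }

  public static class FileWorkflowImpl implements FileWorkflow {
    private final FileActivities activities =
        Workflow.newActivityStub(
            FileActivities.class,
            ActivityOptions.newBuilder()
                .setStartToCloseTimeout(Duration.ofMinutes(1000)) // limit for a single attempt
                .setHeartbeatTimeout(Duration.ofSeconds(30))      // failure-detection interval
                .build());

    @Override
    public void run(String path) {
      activities.processLargeFile(path);
    }
  }
}
```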

I see the issue

I set this to 1000 minutes thinking it meant ‘the total execution time an activity is allowed to run for in its lifecycle’.

I did read the documentation while implementing, and I have re-read it just now, but I always thought that this is what the timeout meant.

The total time would be ScheduleToCloseTimeout. StartToCloseTimeout is the max time of a single attempt.
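To make the distinction concrete, here is a small sketch of ActivityOptions in the Temporal Java SDK (the values are examples only, not the configuration from this thread):

```java
import io.temporal.activity.ActivityOptions;
import java.time.Duration;

public class TimeoutExample {
  // Illustrates the two timeouts discussed above.
  static ActivityOptions activityOptions() {
    return ActivityOptions.newBuilder()
        // Total time allowed for the activity across all retry attempts.
        .setScheduleToCloseTimeout(Duration.ofMinutes(1000))
        // Maximum duration of a single attempt; when it expires, the
        // attempt fails and the activity is retried per the retry policy.
        .setStartToCloseTimeout(Duration.ofMinutes(15))
        .build();
  }
}
```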

We sometimes encounter the same issues during periods of very high throughput. Do you have any thoughts on how to debug the actual reason? I thought the same — that some cache had reached its limit or something had happened to prevent the activity from starting.

Are there any metrics / logs that could help?

@Piotr_Zawadzki

very high throughput.

Can you describe how many RPS or concurrent activities you have?

Are the activities pending or started when you check the UI (or use temporal workflow describe ...)?
If pending, maybe the workers are under-provisioned.
Check the schedule-to-start latency; if it is high, check slots_available for worker_type=ActivityWorkers and for the task_queue. If you have slots available, then the poller count might be the issue.
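If it does turn out to be worker provisioning, these are the knobs involved. A sketch using the Temporal Java SDK (assuming a recent SDK version; the task queue name and sizes are made up): the activity execution slot count is what backs the slots_available metric for the activity worker type, and the poller count controls how many concurrent long-polls fetch activity tasks from the task queue.

```java
import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerOptions;

public class WorkerTuningExample {
  public static void main(String[] args) {
    WorkflowServiceStubs service = WorkflowServiceStubs.newLocalServiceStubs();
    WorkflowClient client = WorkflowClient.newInstance(service);
    WorkerFactory factory = WorkerFactory.newInstance(client);

    WorkerOptions options =
        WorkerOptions.newBuilder()
            // Number of activity execution slots on this worker; free slots
            // show up in the slots_available metric.
            .setMaxConcurrentActivityExecutionSize(200)
            // Number of concurrent long-polls fetching activity tasks; if slots
            // are free but schedule-to-start latency is high, raise this.
            .setMaxConcurrentActivityTaskPollers(10)
            .build();

    Worker worker = factory.newWorker("MY_TASK_QUEUE", options); // hypothetical queue name
    // worker.registerWorkflowImplementationTypes(...);
    // worker.registerActivitiesImplementations(...);
    factory.start();
  }
}
```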

It can be related to the server configuration as well.

~7000 workflows per minute, ~100-120 workflows completed per second, about 2000 state transitions per second, 80,000 QPS, 40,000 TPS

Are the activities pending or started when you check the UI (or use temporal workflow describe ...)?

I am not sure, but I think they were started (I may be wrong).

StartToCloseTimeout is 15 min; after this timeout the activities are retried and then they all complete.

Activity workers are available; schedule-to-start latency is at most 4 sec at its highest point. (I think we are reaching DB limits.)