Hi,
We run an in-house Temporal cluster and frequently observe the following issue: a timer started by Workflow.await() fires much later than scheduled. Here are the details.
Cluster
Temporal version 1.19.1, deployed on Kubernetes with 3 history service pods, 2 frontend service pods, and 2 matching service pods. The number of shards configured is 4k, and we run 20 worker pods.
Workflow Details
The workflow runs local activity 1, then a timer (Workflow.await()), then local activity 2. Both local activities are very short-running.
Please find the attached screenshot of a workflow where a timer was started with a duration of 26 minutes but fired after approximately 2 hours. CPU usage, memory usage, etc. were within limits.
Approximately 1 million workflows would be running, so there would be approximately 1 million timers as well.
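For a rough sense of scale, the figures above can be worked through as below. This is only back-of-the-envelope arithmetic, and it assumes that timers and shard ownership are spread evenly across shards and history pods, which may not hold in practice. The class and method names are made up for illustration and are not part of any Temporal API.

```java
import java.time.Duration;

class TimerLoadEstimate {
    // Average number of timers per shard, assuming an even spread (an assumption).
    static long timersPerShard(long timers, int shards) {
        return timers / shards;
    }

    // Shards owned by each history pod, rounded up, assuming even ownership (an assumption).
    static int shardsPerHistoryPod(int shards, int historyPods) {
        return (shards + historyPods - 1) / historyPods;
    }

    // How much later than requested the timer fired.
    static Duration firingDelay(Duration requested, Duration observed) {
        return observed.minus(requested);
    }

    public static void main(String[] args) {
        // Figures taken from the post: ~1 million timers, 4k shards, 3 history pods.
        System.out.println("timers per shard: " + timersPerShard(1_000_000L, 4_000));
        System.out.println("shards per history pod: " + shardsPerHistoryPod(4_000, 3));
        // Timer requested for 26 minutes, fired after roughly 2 hours.
        Duration delay = firingDelay(Duration.ofMinutes(26), Duration.ofHours(2));
        System.out.println("extra delay in minutes: " + delay.toMinutes());
    }
}
```

Under those assumptions this gives about 250 timers per shard, about 1334 shards per history pod, and a firing delay of roughly 94 minutes beyond the requested 26.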
Could you please suggest possible causes and how we can debug this?
Thanks,
Suresh