What I found is that the workflows complete very slowly, even though the logic of each workflow is very simple.
When I look into the history of a random workflow, I can see Workflow Task Timed Out events, but they appear in the middle of a long run of Workflow Task Scheduled → Completed pairs.
I read the forum and found that I can potentially increase the sticky ScheduleToStart timeout to avoid the Timed Out events.
However, I wonder why there are so many Workflow Task Scheduled → Completed events in the history. What are they for, and what problem do they indicate?
My guess is that you have a long-running (or stuck in retry) local activity. These workflow task completions are the heartbeat mechanism the workflow worker uses to keep the workflow task open for a long time.
Make sure that you don’t use long-running local activities.
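To make that concrete, here is a minimal sketch of keeping a local activity tightly bounded, assuming the Java SDK; `MyWorkflow`, `CacheActivities`, and the method names are placeholders for your own interfaces, not anything from your code:

```java
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.LocalActivityOptions;
import io.temporal.common.RetryOptions;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;
import java.time.Duration;

@ActivityInterface
interface CacheActivities {
  String readFromCache(String key);
}

@WorkflowInterface
interface MyWorkflow {
  @WorkflowMethod
  String run(String key);
}

public class MyWorkflowImpl implements MyWorkflow {

  // Keep the local activity tightly bounded: a cache read should finish in well under
  // a second, so a short StartToClose plus a capped ScheduleToClose (which also limits
  // total retry time) means the local activity cannot keep the workflow task open long
  // enough to force the worker into workflow task heartbeating.
  private final CacheActivities activities =
      Workflow.newLocalActivityStub(
          CacheActivities.class,
          LocalActivityOptions.newBuilder()
              .setStartToCloseTimeout(Duration.ofSeconds(2))
              .setScheduleToCloseTimeout(Duration.ofSeconds(10))
              .setRetryOptions(RetryOptions.newBuilder().setMaximumAttempts(3).build())
              .build());

  @Override
  public String run(String key) {
    return activities.readFromCache(key);
  }
}
```

With a short StartToClose and a capped ScheduleToClose, a local activity that gets stuck in retries fails fast instead of keeping the workflow task open and heartbeating.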
I added some logs today to measure execution time, and all of them show 0ms. This is what I expected, because this local activity reads from a Caffeine local cache, which should complete very quickly.
I also increased the Sticky ScheduleToStart Timeout config to 2 minutes; the number of timeouts decreased, but I still see a long list of Workflow Task heartbeats.
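For context, the measurement is essentially this, inside the activity implementation (class and field names here are placeholders; the cache is a plain Caffeine `Cache`):

```java
import com.github.benmanes.caffeine.cache.Cache;
import java.util.concurrent.TimeUnit;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class CacheActivitiesImpl implements CacheActivities {

  private static final Logger log = LoggerFactory.getLogger(CacheActivitiesImpl.class);

  private final Cache<String, String> cache;

  public CacheActivitiesImpl(Cache<String, String> cache) {
    this.cache = cache;
  }

  @Override
  public String readFromCache(String key) {
    long start = System.nanoTime();
    // Caffeine in-memory lookup; expected to take well under a millisecond.
    String value = cache.getIfPresent(key);
    long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    log.info("local activity cache read for key={} took {} ms", key, elapsedMs);
    return value;
  }
}
```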
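For reference, this is roughly how the worker is configured. I'm on the Java SDK; the exact option name may differ by SDK version, so treat `WorkerOptions.Builder#setStickyQueueScheduleToStartTimeout` (and the task queue and class names) as assumptions for illustration:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerOptions;
import java.time.Duration;

public class WorkerStarter {
  public static void main(String[] args) {
    WorkflowServiceStubs service = WorkflowServiceStubs.newLocalServiceStubs();
    WorkflowClient client = WorkflowClient.newInstance(service);
    WorkerFactory factory = WorkerFactory.newInstance(client);

    // Give workflow tasks scheduled on the sticky (host-specific) task queue more time
    // to start before they time out and fall back to the normal task queue.
    WorkerOptions options =
        WorkerOptions.newBuilder()
            .setStickyQueueScheduleToStartTimeout(Duration.ofMinutes(2))
            .build();

    Worker worker = factory.newWorker("my-task-queue", options);
    Cache<String, String> cache = Caffeine.newBuilder().maximumSize(10_000).build();
    worker.registerWorkflowImplementationTypes(MyWorkflowImpl.class);
    worker.registerActivitiesImplementations(new CacheActivitiesImpl(cache));

    factory.start();
  }
}
```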
This is quite weird… If the logic is fast but the heartbeat keeps firing, could that mean there is contention on the Temporal cluster side? For example, the workflow cannot write history to the DB and therefore cannot proceed?
Upon further testing, we identified a consistent pattern.
When we started ~6k workflows, execution initially ran very quickly. Then, once roughly half of the workflows had completed, execution slowed down dramatically and got stuck at 2816 running workflows for quite a while before very slowly progressing again. At that point it could take a few minutes for a single workflow to reach the Completed state.
I'm wondering whether this picture means that Workflow Task Completed events are being published repeatedly while Workflow Task Started events drop off because the Completed events cannot be persisted?