Getting Workflow Task Timed Out in the middle of a long list of Workflow Task Scheduled → Completed

In my application, I started around 6k workflows to measure performance.

What I found is that the workflows complete very slowly, even though the logic of each workflow is very simple.

When I navigate into the history of a random workflow, I can see Workflow Task Timed Out events, but they appear in the middle of a long run of Workflow Task Scheduled → Completed pairs.

I read the forum and found out that I can potentially increase the sticky ScheduleToStart timeout duration to avoid the Timed Out error.

However, I wonder why there are a lot of Workflow Task Scheduled → Completed events in the history. What are they for? And what problem do they indicate?

I’d really appreciate some help.

My guess is that you have a long-running (or stuck in retry) local activity. These workflow task completions are the heartbeat mechanism the workflow worker uses to keep the workflow task open for a long time.

Make sure that you don’t use long-running local activities.
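For illustration, here is a minimal sketch of the difference (the interface names and timeout values are made up, not taken from your code). Anything that can be slow or can sit in retry should go through a regular activity stub; local activities should be reserved for operations that reliably finish in milliseconds:

import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.activity.LocalActivityOptions;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;
import java.time.Duration;

@ActivityInterface
interface PathActivities {
  String findExecutablePath(String nodeId);
}

@WorkflowInterface
interface SampleWorkflow {
  @WorkflowMethod
  void execute();
}

class SampleWorkflowImpl implements SampleWorkflow {

  // Regular activity stub: slow or retry-prone work belongs here, because it
  // does not keep the workflow task open while it runs.
  private final PathActivities slowPath =
      Workflow.newActivityStub(
          PathActivities.class,
          ActivityOptions.newBuilder()
              .setStartToCloseTimeout(Duration.ofSeconds(30))
              .build());

  // Local activity stub: reserve it for operations that reliably finish in
  // milliseconds. A local activity that runs (or retries) longer than the
  // workflow task timeout forces the worker to heartbeat the workflow task,
  // which appears in history as repeated Workflow Task Scheduled -> Completed.
  private final PathActivities fastPath =
      Workflow.newLocalActivityStub(
          PathActivities.class,
          LocalActivityOptions.newBuilder()
              .setStartToCloseTimeout(Duration.ofSeconds(2))
              .build());

  @Override
  public void execute() {
    fastPath.findExecutablePath("START"); // in-memory lookup, milliseconds
    slowPath.findExecutablePath("next");  // anything that may block or retry
  }
}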


Thank you for the quick reply, maxim.

You’re right: I have one local activity, FindExecutablePath, which shows up toward the end of the event history in my screenshot.

2025-02-13 16:02:38.413  INFO 1 --- [Local Activity Executor taskQueue="MarketingExecutionWorkflowTaskQueue", namespace="lifecycle-pre": 23867] t.p.m.w.d.MarketingWorkflowExecutingView --- [Local Activity Executor taskQueue="MarketingExecutionWorkflowTaskQueue", namespace="lifecycle-pre": 23867raceId]: MarketingWorkflowExecutingView findExecutablePath, node ID: 0194f83c-f6ff-787b-a2fa-a1a4a33b7635, taken: 0ms

2025-02-13 15:59:30.639  INFO 1 --- [Local Activity Executor taskQueue="MarketingExecutionWorkflowTaskQueue", namespace="lifecycle-pre": 23812] t.p.m.w.d.MarketingWorkflowExecutingView --- [Local Activity Executor taskQueue="MarketingExecutionWorkflowTaskQueue", namespace="lifecycle-pre": 23812raceId]: MarketingWorkflowExecutingView findExecutablePath, node ID: START, taken: 0ms

2025-02-13 15:58:50.414  INFO 1 --- [Local Activity Executor taskQueue="MarketingExecutionWorkflowTaskQueue", namespace="lifecycle-pre": 23794] t.p.m.w.d.MarketingWorkflowExecutingView --- [Local Activity Executor taskQueue="MarketingExecutionWorkflowTaskQueue", namespace="lifecycle-pre": 23794raceId]: MarketingWorkflowExecutingView findExecutablePath, node ID: 0194f83c-f6ff-787b-a2fa-a1a4a33b7635, taken: 0ms

I added some logs today to measure execution time, and all of them show 0ms. This is what I expected, because this local activity reads from a Caffeine in-memory cache and should complete almost instantly.
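For context, the lookup the local activity does is roughly of this shape (a simplified, illustrative sketch, not the actual class):

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.time.Duration;

class ExecutablePathCache {

  // In-process Caffeine cache; a hit is a pure in-memory lookup, which is why
  // the activity logs "taken: 0ms".
  private final Cache<String, String> pathsByNodeId =
      Caffeine.newBuilder()
          .maximumSize(10_000)
          .expireAfterWrite(Duration.ofMinutes(10))
          .build();

  String findExecutablePath(String nodeId) {
    // Returns null on a cache miss; the real code would then fall back to a
    // (potentially slower) load.
    return pathsByNodeId.getIfPresent(nodeId);
  }
}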

I also increased the sticky ScheduleToStart timeout to 2 minutes. The number of timeouts decreased, but I still see a long list of Workflow Task heartbeats.
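Roughly, the knob I changed corresponds to this worker option (a sketch using the plain Java SDK setup with the task queue name from the logs above; depending on the SDK version, the equivalent setting may live on WorkerFactoryOptions instead):

import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerOptions;
import java.time.Duration;

public class WorkerStarter {
  public static void main(String[] args) {
    WorkflowServiceStubs service = WorkflowServiceStubs.newLocalServiceStubs();
    WorkflowClient client = WorkflowClient.newInstance(service);
    WorkerFactory factory = WorkerFactory.newInstance(client);

    // Allow a workflow task sitting on the sticky (host-specific) queue up to
    // 2 minutes to start before the server times it out and reschedules it on
    // the normal task queue.
    Worker worker =
        factory.newWorker(
            "MarketingExecutionWorkflowTaskQueue",
            WorkerOptions.newBuilder()
                .setStickyQueueScheduleToStartTimeout(Duration.ofMinutes(2))
                .build());

    // Register workflow and activity implementations on `worker` here, then:
    factory.start();
  }
}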

This is quite weird… If the logic is fast but the heartbeating keeps going, does that mean there is contention on the Temporal cluster side? For example, could the workflows be unable to write history to the database and therefore unable to proceed?

After further testing, we identified a consistent pattern.

When we started ~6k workflows, execution initially ran very quickly. Then, when roughly half of the workflows had completed, execution slowed down dramatically and got stuck at 2816 running workflows for quite a while before progressing again very slowly. At that point it could take a few minutes for a single workflow to reach the Completed state.

Toward the end, execution suddenly became very fast again and the remaining workflows reached the Completed state in under 20 seconds.

I’m sure there is no resource contention from other features on the server, because this is a test server we use purely to evaluate workflow performance.

I’m wondering whether this chart means that Workflow Task Completed events are being recorded repeatedly while Workflow Task Started counts drop, because the Completed events cannot be persisted?

More statistics from around the same time are below.



Look at your service resource-exhausted metrics and see if you find any SystemOverloaded cause during the same time window:

sum(rate(service_errors_resource_exhausted{}[1m])) by (operation, resource_exhausted_cause)


Thank you for stepping in to help 🙂

Below are the metrics for the last 24 hours. While we were running the test, the Service Errors Breakdown chart stayed flat the entire time.


Do you think the problem came from elsewhere?