Workflows getting stuck after some N workflows with timers

lucasmls · May 3, 2021, 2:50pm

Experts.

My application is going to receive heavy load in the coming weeks, so we are load testing it.
The first thing we want to measure is StartWorkflow throughput. We created a simple workflow that only sleeps for 30 min and began starting instances of it (about 10k/min so far). The first ones started complete fine, but the later ones get stuck waiting after DecisionTaskScheduled.

I’ve tried to set up the Scalable Tasklist feature, but we didn’t see any performance improvement. And if I look at the hardware usage of the workers, CPU usage barely reaches 15%.

Is this behavior expected? Or am I doing something wrong with the timers?
How can I validate that the Scalable Tasklist was set up correctly?

8521 is the number of timers that were started. 4047 is the number of workflows that completed. The rest of the workflows did not even start.

I just ran another load test, this time using a new workflow that simply runs through a loop, and everything works great.

maxim · May 3, 2021, 4:16pm

Stuck in DecisionTaskScheduled usually means that your workflow workers cannot keep up with the load. Try increasing the number of poller threads in the workers.
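
A back-of-the-envelope sketch of why the poller count can be the limit even when worker CPU is low. It is plain arithmetic (Little's law), not SDK code. The assumptions are: two workflow tasks per workflow (one to start the timer, one after it fires) and a per-task poller cycle time (poll round trip plus processing) that you would need to measure yourself; the 10/50/100 ms values are made up for illustration.

```python
import math


def workflow_task_rate(starts_per_min: float, tasks_per_workflow: int) -> float:
    """Steady-state workflow tasks per second that the workers must complete."""
    return starts_per_min / 60.0 * tasks_per_workflow


def pollers_needed(task_rate_per_s: float, cycle_seconds: float) -> int:
    """Little's law: busy pollers = task arrival rate x time one poller spends per task."""
    return math.ceil(task_rate_per_s * cycle_seconds)


# 10k starts/min from the original post; 2 tasks per workflow is an assumption.
rate = workflow_task_rate(10_000, 2)

# Hypothetical per-task cycle times in milliseconds.
for cycle_ms in (10, 50, 100):
    print(f"{cycle_ms} ms per task -> {pollers_needed(rate, cycle_ms / 1000)} pollers")
```

With these assumed numbers the estimate is 4, 17, and 34 concurrent pollers across all workers. A poller spends most of its cycle waiting on the network, so a fleet can run out of pollers while CPU stays mostly idle.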

lucasmls · May 4, 2021, 12:41am

What do you mean by “cannot keep up with the load”? Does it mean that my task list has some limit and I have reached it? Even though my workflows have only started timers and my workers still have spare hardware capacity?

By the way, I will try increasing the number of poller threads and report back here as soon as I have results.

maxim · May 4, 2021, 1:17am

What is the DB CPU usage? In a well-configured cluster the DB CPU is always a bottleneck.

lucasmls · May 4, 2021, 9:44am

We are running three Cassandra replicas on three n1-standard-4 nodes on GKE.
When I start sending the load, the CPU usage does get high, but it barely reaches 60%. After the initial load, it drops to 10%.

maxim · May 4, 2021, 3:52pm

One possibility is that workflow tasks are constantly failing or timing out. This results in WorkflowTaskScheduled (the newer name for DecisionTaskScheduled) being the last event in the history. Could you check the workflow task completion and error rates?

lucasmls · May 4, 2021, 4:34pm

I have just increased the number of poller threads, and it helped a lot! Thanks!
Sorry for the delay in testing this config.
