Workflows get stuck after starting N workflows with timers


My application is going to receive heavy load in the coming weeks, so we are load testing it.
The first thing we want to measure is StartWorkflow throughput. We created a simple workflow that only sleeps for 30 minutes and began starting instances of it (about 10k/min so far). The first ones to start do their job fine, but the later ones get stuck waiting after DecisionTaskScheduled.

I tried setting up the Scalable Tasklist feature, but we didn’t see any increase in performance. And if I look at the hardware usage of the workers, CPU usage barely reaches 15%.

Is this behavior expected? Or am I doing something badly wrong with the timers?
How can I validate that the Scalable Tasklist was set up correctly?

8521 is the number of timers that were started. 4047 is the number of workflows that completed. The rest of the workflows did not even start.

I just ran another load test, this time using a new workflow that simply runs through a loop, and everything works great.

Stuck in DecisionTaskScheduled usually means that your workflow workers cannot keep up with the load. Try increasing the number of poller threads in the workers.
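
As a rough illustration of why idle CPU does not rule this out: if each poller thread fetches and completes one decision task at a time, throughput is bounded by poller count divided by the time per task, regardless of how much CPU is free. The sketch below is only back-of-the-envelope arithmetic, not Cadence code. The 10k/min start rate comes from this thread; the function names and the 50 ms per-task latency are assumptions made up for the example.

```python
import math


def max_task_rate(pollers: int, seconds_per_task: float) -> float:
    """Upper bound on decision tasks per second when each poller thread
    fetches and completes one task at a time."""
    if pollers <= 0 or seconds_per_task <= 0:
        raise ValueError("pollers and seconds_per_task must be positive")
    return pollers / seconds_per_task


def pollers_needed(tasks_per_second: float, seconds_per_task: float) -> int:
    """Smallest total poller count whose upper bound covers the task rate."""
    if tasks_per_second < 0 or seconds_per_task <= 0:
        raise ValueError("rate must be non-negative and latency positive")
    return math.ceil(tasks_per_second * seconds_per_task)


START_RATE = 10_000 / 60   # workflows started per second (figure from the thread)
SECONDS_PER_TASK = 0.05    # ASSUMPTION: 50 ms per poll + decision round trip

# Each sleep-only workflow needs one decision task at start (to schedule the
# timer) and one more when the timer fires, so once timers begin firing the
# decision task rate is roughly double the start rate.
steady_rate = 2 * START_RATE

print("start-only load:", pollers_needed(START_RATE, SECONDS_PER_TASK), "pollers")
print("steady state:   ", pollers_needed(steady_rate, SECONDS_PER_TASK), "pollers")
```

The poller count here is the total across all worker processes polling the task list. The latency value is a placeholder, so measure the real poll-to-completion time before drawing conclusions. The point is only that the bound depends on latency and poller count, which is why workers can fall behind while their CPU sits mostly idle.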

What do you mean by “cannot keep up with the load”? Does it mean that my task list has some limit and I have reached it? Even though my workflows have only started timers and my workers still have hardware to spare?

By the way, I will try increasing the number of poller threads. As soon as I have news, I will report back here.

What is the DB CPU usage? In a well-configured cluster, the DB CPU is always the bottleneck.

We are running three Cassandra replicas on three n1-standard-4 nodes on GKE.
When I start sending the load, the CPU usage does get high, but it barely reaches 60%. After the initial load, the CPU usage drops to 10%.

One possibility is that workflow tasks are constantly failing or timing out. This results in WorkflowTaskScheduled being the last event in the history. Could you check the workflow task completion and error rates?

I have just increased the number of poller threads, and it helped a lot! Thanks!
Sorry for the delay in testing this config.