My application is going to receive some heavy load in the coming weeks, so we are doing some load testing on it.
The first thing we want to measure is StartWorkflow throughput. We created a simple workflow that only sleeps for 30 minutes and began starting instances of it (about 10k/min so far). The first ones to start do their job well, but the later ones get stuck waiting after DecisionTaskScheduled.
I’ve tried to set up the Scalable Tasklist feature, but we didn’t see any increase in performance. And if I look at the hardware usage of the workers, CPU usage barely reaches 15%.
Is this behavior expected? Or am I doing something badly wrong with the timers?
How can I validate that the Scalable Tasklist was set up correctly?
Stuck in DecisionTaskScheduled usually means that your workflow workers cannot keep up with the load. Try increasing the number of poller threads in the workers.
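To get a feel for why low CPU usage does not rule this out, here is a rough back-of-the-envelope sketch based on Little's law. It is not a measurement of your setup: the per-task round-trip times are made-up values, `pollers_needed` is a hypothetical helper, and it assumes each poller thread handles one decision task at a time and mostly waits on the network rather than burning CPU.

```python
import math


def pollers_needed(tasks_per_second: float, seconds_per_task: float) -> int:
    """Little's law: tasks in flight = arrival rate * time each task occupies a poller."""
    return math.ceil(tasks_per_second * seconds_per_task)


# 10k workflow starts per minute, as in the question. Each start produces
# at least one decision task, so this is a lower bound on the task rate.
start_rate = 10_000 / 60

# Hypothetical round-trip times (poll + decide + respond) per decision task.
for round_trip in (0.02, 0.05, 0.2):
    count = pollers_needed(start_rate, round_trip)
    print(f"{round_trip * 1000:.0f} ms per task -> at least {count} pollers")
```

If the total number of poller threads across your workers is below an estimate like this, decision tasks will queue up on the task list while the worker CPUs stay mostly idle. Keep in mind that each sleeping workflow produces a second decision task when its timer fires, so the task rate goes up again 30 minutes into the test.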
What do you mean by “cannot keep up with the load”? Does it mean that my task list has some limit and I have reached it? Even though my workflows have only started timers and my workers still have hardware to spare?
By the way, I will try increasing the number of poller threads. As soon as I have news, I will come back here.
We are running three Cassandra replicas on three n1-standard-4 nodes on GKE.
When I start sending the load, CPU usage does go up, but it barely reaches 60%. After the initial load, CPU usage drops to 10%.
One possibility is that workflow tasks are constantly failing or timing out. This results in WorkflowTaskScheduled being the last event in the history. Could you check the workflow task completion and error rates?