Temporal workflows piling up under high load and not completing quickly enough

We have noticed that our implementation completes workflows very slowly under high load, and we would like to know how we can increase throughput.

A key feature of our implementation is the use of recursion to spawn child workflows. We have an API that triggers one parent workflow, which spawns child workflows for the data input. Each child workflow may then produce its own data inputs for additional child workflows, which the parent executes. A single input to the parent workflow may generate 50+ workflows by the end of the entire process. Each workflow completes relatively quickly (30 seconds to 5 minutes). We are using the asynchronous Python SDK, and our Temporal cluster uses Postgres for persistence.
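To make the pattern concrete, here is a stripped-down sketch of the fan-out. This is not our actual code; the workflow names are placeholders and the activity calls are omitted:

```python
import asyncio

from temporalio import workflow


@workflow.defn
class ChildWorkflow:
    @workflow.run
    async def run(self, item: str) -> list[str]:
        # Runs the activities for one data input and returns any new
        # inputs it discovers (activity calls omitted for brevity).
        return []


@workflow.defn
class ParentWorkflow:
    @workflow.run
    async def run(self, items: list[str]) -> None:
        pending = list(items)
        while pending:
            batch, pending = pending, []
            # Fan out one child workflow per input and wait for the whole batch.
            results = await asyncio.gather(
                *(
                    workflow.execute_child_workflow(ChildWorkflow.run, item)
                    for item in batch
                )
            )
            # Each child may return new inputs, which become the next wave
            # of children executed by the same parent.
            for new_items in results:
                pending.extend(new_items)
```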

We noticed that several hundred inputs eventually produce thousands of workflows, which is expected. The unexpected part is that workflow completion grinds to a halt: roughly 10,000 running workflows take about 4 hours to fully complete.

We noticed that the worker containers running the recursion nearly max out the host machine's memory, and CPU usage spikes but not excessively. Interestingly, the database is underutilized: it goes from about 400 TPS to 1,500 TPS, and increases a bit when we scale the workers up, but the bottleneck appears to be somewhere else.

Our initial idea is that the workers or task queues have some configuration we haven't adjusted yet, because throwing more containers at the problem helps somewhat, but that isn't sustainable (we can't scale horizontally forever, and we have already scaled vertically as far as possible).

So far we have increased the DB connection limits from the defaults, adjusted the workflow and activity concurrency options in the SDK, scaled the number of workers to 8+, and scaled the frontend, history, matching, and worker services to 2+ instances each.
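For reference, the SDK-side tuning we've been doing is roughly along these lines. This is a simplified sketch using the Python SDK's Worker options; the task queue name and values are illustrative, it assumes the workflow classes from the sketch above, and parameter names may differ across SDK versions:

```python
import asyncio

from temporalio.client import Client
from temporalio.worker import Worker


async def main() -> None:
    client = await Client.connect("localhost:7233")  # our server address differs
    worker = Worker(
        client,
        task_queue="recursive-pipeline",         # placeholder task queue name
        workflows=[ParentWorkflow, ChildWorkflow],  # classes from the sketch above
        activities=[],                            # our activities registered here
        # Concurrency options we've been adjusting; values are illustrative.
        max_concurrent_workflow_tasks=200,
        max_concurrent_activities=200,
        # Sticky workflow cache size: trades worker memory for fewer replays.
        max_cached_workflows=1000,
        # Poller counts, which limit how fast a worker pulls tasks off the queue.
        max_concurrent_workflow_task_polls=10,
        max_concurrent_activity_task_polls=10,
    )
    await worker.run()


if __name__ == "__main__":
    asyncio.run(main())
```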

We’ve read that other teams’ implementations have no problem handling thousands of sustained workflows, so we suspect we either have a resource limitation or some configs that aren’t tuned properly.

I know this is a wide-open problem that may be specific to our implementation, but I’d appreciate any feedback or ideas that can help us increase throughput. Thank you very much.

I’ve read here in the forums that the number of events per second a single workflow instance can process is limited, perhaps something like 10 per second depending on how your Temporal server is tuned. So a parent workflow waiting on 50 child workflows might take roughly 5 seconds just to process all the child-workflow-completed events. I don’t know whether your parent workflow is waiting on its child workflows, and I don’t know how you’d get from 5 seconds to 4 hours :man_shrugging:, but it might be one thing to check to see whether any single workflow is a bottleneck.
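One quick way to sanity-check that (a sketch, assuming a recent Python SDK version; the workflow ID is a placeholder) is to pull the history of one of your slow parent workflows and see how many events it's churning through:

```python
import asyncio

from temporalio.client import Client


async def main() -> None:
    client = await Client.connect("localhost:7233")
    # Hypothetical workflow ID; substitute one of your slow parent workflows.
    handle = client.get_workflow_handle("parent-workflow-id")
    history = await handle.fetch_history()
    print(f"{len(history.events)} events in history")


if __name__ == "__main__":
    asyncio.run(main())
```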