We have noticed that our implementation completes very slowly under high load, and we would like to know how we can increase throughput.
A key feature of our implementation is the use of recursion to spawn child workflows. We have an API that triggers one parent workflow, which spawns child workflows for the input data. Each child workflow may then produce its own data inputs for additional child workflows that the parent executes. A single input to the parent workflow may generate 50+ workflows by the end of the entire process. Each workflow completes relatively quickly (30 seconds to 5 minutes). We are using asynchronous Python, and our Temporal cluster uses Postgres for persistence.
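To make the shape of the recursion concrete, here is a minimal sketch of the pattern using the Temporal Python SDK (workflow names, IDs, and timeouts are illustrative, not our actual code):

```python
import asyncio
from datetime import timedelta
from temporalio import workflow


@workflow.defn
class ChildWorkflow:
    @workflow.run
    async def run(self, item: str) -> list[str]:
        # Do the actual work for this item (30 s - 5 min) and return any
        # newly discovered inputs so the parent can spawn more children.
        return []


@workflow.defn
class ParentWorkflow:
    @workflow.run
    async def run(self, inputs: list[str]) -> None:
        pending = list(inputs)
        while pending:
            batch, pending = pending, []
            # Fan out one child workflow per input and wait for all of them.
            results = await asyncio.gather(
                *(
                    workflow.execute_child_workflow(
                        ChildWorkflow.run,
                        item,
                        id=f"child-{workflow.uuid4()}",
                        execution_timeout=timedelta(minutes=10),
                    )
                    for item in batch
                )
            )
            # Each child may have produced further inputs; loop until done.
            for new_inputs in results:
                pending.extend(new_inputs)
```

A single input can easily fan out to 50+ child executions this way, which is why several hundred API calls quickly turn into thousands of concurrent workflows.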
We noticed that if we receive several hundred inputs, this eventually produces thousands of workflows, which is expected. The unexpected part is that workflow completion grinds to a halt: 10,000 running workflows eventually take about 4 hours to fully complete.
The worker containers running the recursion pretty much max out the host machine's memory; CPU also spikes, but not excessively. Interestingly, the database appears underutilized: it goes from about 400 TPS to 1,500 TPS and increases a bit when we scale the workers up, but the bottleneck seems to be elsewhere.
Our initial idea is that the workers or their task queues have some configuration we haven't adjusted yet, since throwing more containers at the problem helps somewhat, but that isn't sustainable (we can't scale horizontally forever, and we have already scaled vertically as far as possible).
So far we have changed the default configs to increase DB connections, adjusted the workflow and activity concurrency options in the SDK, scaled the number of workers to 8+, and scaled the frontend, history, matching, and worker services to 2+ instances each.
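For reference, this is roughly where we have been turning the SDK knobs; it is a sketch with example values, not our production worker, and parameter names may differ slightly between SDK versions:

```python
from temporalio.client import Client
from temporalio.worker import Worker


async def run_worker() -> None:
    client = await Client.connect("temporal-frontend:7233")  # our frontend address
    worker = Worker(
        client,
        task_queue="recursive-pipeline",            # illustrative queue name
        workflows=[ParentWorkflow, ChildWorkflow],
        activities=[],                               # our activities registered here
        # Concurrency limits we adjusted (example values):
        max_concurrent_workflow_tasks=512,
        max_concurrent_activities=512,
        # Poller counts; raising the max_concurrent_* limits without also
        # raising these can leave the worker starved for tasks.
        max_concurrent_workflow_task_polls=16,
        max_concurrent_activity_task_polls=16,
        # Sticky workflow cache size; relevant to the memory pressure we see.
        max_cached_workflows=1000,
    )
    await worker.run()
```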
We've read that other teams' implementations have no problem sustaining thousands of concurrent workflows, so either we have a resource limitation or some configs that aren't tuned properly.
I know it's a wide-open problem that may be specific to our implementation, but I'd appreciate any feedback or ideas that can help us increase throughput. Thank you very much.