Hello, I’ve recently started using Temporal for a new project at my company, where we’ll use it to run tasks scheduled by our users. We already have a scheduling service in another project that handles ETL operations; it currently handles about 500 scheduled executions during peak hours. Although the new Temporal service won’t be used for the same kind of operations, I’m using that number as an upper bound for what I’ll have to handle.
I’ve been testing workers by scheduling multiple workflows at the same time to see how they cope, and found that Temporal workers do not run a workflow from start to finish; instead they interleave work across workflows. This causes problems when I scale up the number of workflows submitted at once. Here’s an example:
1 worker running 10 workflows → All complete (~1m/workflow)
10 workers running 100 workflows → All complete, but I get some warnings (~1m/workflow)
50 workers running 500 workflows → A lot of warnings; workflows that previously completed in 1 minute hang for over 20m with no progress, and some start failing after that.
At first I thought it would make sense that scaling the number of workers linearly would also linearly increase the number of workflows I can handle, but the way workers execute workflows means it takes too long for a worker to start a workflow or get back to one.
As I increase the number of workflows executed, I start getting this warning:
[TMPRL1101] Potential deadlock detected: workflow didn’t yield within 2 second(s)
Looking at the workflows, it seems that the time workflows wait to start running, plus the time workers take to get back to a workflow that has already started, is causing Workflow Task Timeout errors.
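As a rough analogy for the interleaving and the "didn’t yield" warning above, here is a plain asyncio sketch. It is not Temporal code: `fake_workflow`, `run_interleaved`, `max_loop_gap` and all timings are hypothetical names and values chosen for illustration. It only shows that coroutines sharing one event loop take turns at each `await`, and that a step which blocks without awaiting holds up everything else on that loop.

```python
import asyncio
import time
from typing import List


async def fake_workflow(name: str, steps: int, log: List[str]) -> None:
    """Hypothetical stand-in for a workflow: record a step, then yield."""
    for step in range(steps):
        log.append(f"{name}:{step}")
        await asyncio.sleep(0)  # hand control back to the event loop


async def run_interleaved() -> List[str]:
    """Run two fake workflows on one loop and return the order of their steps."""
    log: List[str] = []
    await asyncio.gather(
        fake_workflow("wf-a", 2, log),
        fake_workflow("wf-b", 2, log),
    )
    return log


async def max_loop_gap(block_seconds: float) -> float:
    """Longest pause seen by a coroutine that tries to wake up every 10 ms
    while another coroutine blocks the loop for block_seconds."""
    gaps: List[float] = []

    async def ticker() -> None:
        last = time.monotonic()
        for _ in range(10):
            await asyncio.sleep(0.01)
            now = time.monotonic()
            gaps.append(now - last)
            last = now

    async def blocker() -> None:
        await asyncio.sleep(0.02)
        time.sleep(block_seconds)  # blocking call: never yields to the loop

    await asyncio.gather(ticker(), blocker())
    return max(gaps)


if __name__ == "__main__":
    # Steps from the two fake workflows alternate instead of running back to back.
    print(asyncio.run(run_interleaved()))
    # The ticker's longest pause is at least as long as the blocking call.
    print(f"longest pause: {asyncio.run(max_loop_gap(0.2)):.2f}s")
```

In this sketch the ticker cannot run while `time.sleep` holds the loop, so its longest pause grows with the length of the blocking step. Whether something similar is happening in the real workers is an open question here, not a diagnosis.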
When scaling up to 500 workflows scheduled to run at once, workflows constantly emit these errors and workers go from one workflow to the next without ever completing any. After waiting long enough, some of them fail with an Activity Task Timeout error.
I’ve tried adjusting some worker parameters, such as max_concurrent_workflow_tasks, max_concurrent_activities, max_concurrent_activity_task_polls and max_concurrent_workflow_task_polls, but it didn’t help. To be honest, I didn’t notice them affecting the executions in any way.
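For context, a minimal sketch of how these four options could be collected and passed to a Python SDK worker. The `WorkerTuning` and `worker_kwargs` helpers, the numeric values and the task queue name are illustrative assumptions, not the configuration actually used in these tests. The Temporal call itself is left as a comment so the snippet runs without a server.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class WorkerTuning:
    """The concurrency options named above; the values are placeholders."""

    max_concurrent_workflow_tasks: int = 20
    max_concurrent_activities: int = 20
    max_concurrent_workflow_task_polls: int = 5
    max_concurrent_activity_task_polls: int = 5


def worker_kwargs(tuning: WorkerTuning) -> Dict[str, Any]:
    """Check that every option is positive and return them as keyword arguments."""
    kwargs = asdict(tuning)
    for name, value in kwargs.items():
        if value < 1:
            raise ValueError(f"{name} must be a positive integer, got {value}")
    return kwargs


if __name__ == "__main__":
    print(worker_kwargs(WorkerTuning()))
    # With the Temporal Python SDK installed and a connected client, the
    # dictionary could be passed along these lines (not executed here;
    # "example-task-queue" is a made-up name):
    #
    #   from temporalio.worker import Worker
    #   worker = Worker(
    #       client,
    #       task_queue="example-task-queue",
    #       workflows=[...],
    #       activities=[...],
    #       **worker_kwargs(WorkerTuning()),
    #   )
```

Keeping the values in one place like this makes it easier to change a single option between test runs and compare the results.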
I would appreciate some help figuring out what I’m doing wrong here.
