TMPRL1101 “deadlock” error under high parallelism causes worker hang

We encountered a TMPRL1101 deadlock error in one of our Python worker deployments. On encountering the error, the worker process hung and no subsequent workflows were processed. The worker had to be brought back to a healthy state by restarting it (by recreating the k8s pod).

Environment Details

  • Workflow Structure
initial set of activities
...
...
a group of 4 parallel activities that run for (n) different inputs in parallel
...
...
final set of activities
  • Resource Utilization
    • When we encountered the deadlock error, CPU utilization was at 100% of the k8s limit and memory utilization was at 70%
  • Workflow State
    • The number of parallel inputs (n) was ~120
    • When we encountered the deadlock error, the in-flight activities belonged to the parallel group
  • Temporal Python SDK version: v1.6.0

What we’ve tried

  1. To pin down the root cause (high concurrency / high resource utilization), we ran two experiments: a) increased pod resources to avoid hitting the resource limits; b) limited the number of parallel inputs to ~30. In both cases we encountered fewer WorkflowTaskTimedOut and TMPRL1101 errors, and the worker continued processing even after encountering the deadlock error.
  2. Since TMPRL1101 is caused by a workflow task executing for too long (> 2s), we moved the activity and workflow workers into separate processes and were able to avoid the timeout and deadlock errors.

Questions for the community

  1. Expected behavior understanding: With higher resources (or lower concurrency), the worker was able to recover after encountering the deadlock error, but it hung in the other case. What is the expected recovery behaviour of a worker that encounters the deadlock error?

  2. Prevention: Is there a recommended config or SDK option to disable or mitigate TMPRL1101 entirely?

  3. High-parallelism best practices: For hundreds of concurrent activities in one workflow, what patterns (e.g. child workflows, worker tuning / separation) does the community recommend?

Thanks in advance for any tips or pointers!

What is the expected recovery behaviour of the worker if it encounters the deadlock error?

The worker reports the workflow task failure to the service, and the service keeps retrying the workflow task. The workflow execution remains in the Running status.

Is there a recommended config or SDK option to disable or mitigate TMPRL1101 entirely?

There are ways, but they are not really recommended in production: set debug_mode to True when creating the worker (you can also set the TEMPORAL_DEBUG environment variable to 1), which disables the deadlock detector.
It is better to tune the worker options, especially max_concurrent_activities, to a number where your worker CPU does not go over ~70%, then scale the number of worker pods to achieve the needed activity execution parallelism.
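As a configuration sketch of the two options above (the task queue name and the `MyWorkflow` / `my_activity` names are hypothetical placeholders, and the server address is assumed to be a local default):

```python
from temporalio.client import Client
from temporalio.worker import Worker


async def run_worker():
    client = await Client.connect("localhost:7233")  # assumed address
    worker = Worker(
        client,
        task_queue="my-task-queue",   # hypothetical name
        workflows=[MyWorkflow],       # your workflow classes
        activities=[my_activity],     # your activity functions
        # Cap concurrent activities so CPU stays below ~70%, then scale
        # out with more worker pods for additional parallelism.
        max_concurrent_activities=30,
        # debug_mode=True disables the deadlock detector -- useful for
        # local debugging only, not recommended in production:
        # debug_mode=True,
    )
    await worker.run()
```

The cap on `max_concurrent_activities` is the knob that directly limits how much activity work competes with workflow task processing on a single worker.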

For hundreds of concurrent activities in one workflow, what patterns (e.g. child workflows, worker tuning / separation) does the community recommend?

It depends on your workers and their capacity in terms of compute resources, and on your desired parallelism.
One way to limit load is to run smaller batches of parallel activities, wait for each batch to complete, then start the next batch.
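The batching pattern above can be sketched as plain asyncio control flow. In a real workflow, `run_one` would wrap a `workflow.execute_activity(...)` call; here a trivial coroutine (`double`, hypothetical) stands in so the example is self-contained:

```python
import asyncio


async def process_in_batches(inputs, batch_size, run_one):
    """Run `run_one` for all inputs, at most `batch_size` at a time."""
    results = []
    for i in range(0, len(inputs), batch_size):
        batch = inputs[i:i + batch_size]
        # Wait for the whole batch before starting the next one,
        # capping how many activities are in flight at once.
        results.extend(await asyncio.gather(*(run_one(x) for x in batch)))
    return results


# Stand-in for an activity call (hypothetical):
async def double(x):
    await asyncio.sleep(0)
    return 2 * x


print(asyncio.run(process_in_batches(list(range(10)), 4, double)))
# → [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

A semaphore-based variant keeps a steady number of activities in flight instead of draining each batch fully, but the batch version is simpler to reason about in workflow code.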


One thing we found on further investigation is that within this activity we call an AWS service via boto3. This is a synchronous, I/O-bound operation inside an asynchronous activity definition. Could this be an issue?
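It can be: a synchronous call inside an `async def` activity blocks the worker's shared event loop, which can delay workflow task processing on the same process and surface as timeouts. One mitigation (a sketch, with `fetch_blocking` as a hypothetical stand-in for the boto3 call) is to offload the blocking call to a thread so the loop stays free:

```python
import asyncio
import time


# Stand-in for a blocking boto3 call (hypothetical):
def fetch_blocking(key):
    time.sleep(0.01)  # simulates synchronous network I/O
    return f"value-for-{key}"


# Inside an async activity (decorated with @activity.defn in real code),
# run the blocking call in a worker thread instead of on the event loop:
async def fetch_activity(key):
    return await asyncio.to_thread(fetch_blocking, key)


print(asyncio.run(fetch_activity("a")))  # → value-for-a
```

Alternatively, define the activity as a plain `def` and pass an `activity_executor` (e.g. a `concurrent.futures.ThreadPoolExecutor`) to the Worker, so the SDK runs synchronous activities off the event loop for you.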