We encountered a TMPRL1101 deadlock error in one of our Python worker deployments. When the error occurred, the worker process hung and no subsequent workflows were processed. We had to restart the worker (via its k8s pod) to bring it back to a healthy state.
Environment Details
- Workflow Structure
  - initial set of activities
  - ...
  - ...
  - a group of 4 activities that runs in parallel across (n) different inputs
  - ...
  - ...
  - final set of activities
- Resource Utilization
  - When we encountered the deadlock error, CPU utilization was at 100% of the k8s limit and memory utilization was at 70%
- Workflow State
  - The number of parallel inputs (n) was ~120
  - When we encountered the deadlock error, the in-flight activities belonged to the parallel group
- Temporal Python SDK version: v1.6.0
What we’ve tried
- To pin down the root cause (high concurrency vs. high resource utilization), we ran two experiments: a) increased pod resources to avoid hitting the resource limits; b) limited the number of parallel inputs to ~30. In both cases we encountered fewer WorkflowTaskTimedOut and TMPRL1101 errors, and the worker continued running even after encountering the deadlock error.
- Since TMPRL1101 is caused by workflow tasks executing for too long (> 2s), we separated the workers for activity and workflow tasks into separate processes and were able to avoid the timeout and deadlock errors.
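Our understanding of the underlying mechanism (which is why separating the workers helped): the Python SDK runs workflow code on an asyncio event loop, so when activity execution starves the process for CPU, workflow coroutines can't be scheduled in time and the deadlock detector fires. The starvation effect is easy to reproduce with stdlib-only code; this toy example (no Temporal, timings are illustrative) shows one synchronous step stalling every other coroutine on the loop:

```python
# Stdlib-only illustration of event-loop starvation: a synchronous
# (blocking) step prevents a 10 ms "heartbeat" coroutine from running
# on schedule, stretching the gap between its ticks.
import asyncio
import time


async def heartbeat(ticks: list[float]) -> None:
    # Should tick every 10 ms if the loop is healthy.
    for _ in range(5):
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def blocking_step() -> None:
    # Synchronous work: nothing else on the loop runs until it returns.
    time.sleep(0.2)


async def main() -> float:
    ticks: list[float] = []
    hb = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0)  # let the heartbeat record its first tick
    await blocking_step()
    await hb
    # Largest gap between consecutive ticks; the blocking step
    # stretches it far beyond the intended 10 ms interval.
    return max(b - a for a, b in zip(ticks, ticks[1:]))


worst_gap = asyncio.run(main())
print(f"worst heartbeat gap: {worst_gap:.3f}s")
```

In the SDK's case the "heartbeat" is workflow task processing, and a gap beyond the 2s deadlock-detection threshold surfaces as TMPRL1101.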
Questions for the community
- Expected behavior understanding: With higher resources (or lower concurrency) the worker was able to recover after encountering the deadlock error, but it hung in the other case. What is the expected recovery behavior of a worker that encounters the deadlock error?
- Prevention: Is there a recommended config or SDK option to disable or mitigate TMPRL1101 entirely?
- High-parallelism best practices: For hundreds of concurrent activities in one workflow, what patterns (e.g. child workflows, worker tuning/separation) does the community recommend?
Thanks in advance for any tips or pointers!