Cause of Temporal deadlock errors

Hi, we are having issues with sporadic Deadlock errors occurring at different points in our workflow.

Looking at other posts, it seems like the most common causes of Deadlock errors are external API calls that block, loops that never yield, or blocking function calls that are not part of the Temporal SDK APIs. Looking through the workflow code, I can’t identify any such examples (a sketch of the kind of pattern I was checking for is at the end of this post). The most confusing example I have seen is one where, in production, a deadlock error occurred after the last activity, right before WorkflowExecutionCompleted.

We directly return the response of that activity to complete the workflow, so there is no workflow code in between to investigate. I would love to hear some insight into how this is possible and what the next steps to investigate would be.
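
For reference, this is the kind of blocking anti-pattern I was scanning for (a rough, illustrative sketch with made-up names, not our actual code):

```python
from datetime import timedelta

from temporalio import activity, workflow

# Third-party imports used only by activities are best passed through the
# workflow sandbox rather than re-imported on every workflow run.
with workflow.unsafe.imports_passed_through():
    import requests


@activity.defn
def fetch_report(url: str) -> str:
    # Blocking I/O belongs in an activity, not in workflow code. (Sync
    # activities like this one need an activity_executor, e.g. a
    # ThreadPoolExecutor, configured on the Worker.)
    return requests.get(url, timeout=10).text


@workflow.defn
class ReportWorkflow:
    @workflow.run
    async def run(self, url: str) -> str:
        # BAD: calling requests.get(url) directly here would hold the
        # workflow thread and can trip the deadlock detector after ~2s.

        # GOOD: offload the blocking call to the activity instead.
        return await workflow.execute_activity(
            fetch_report,
            url,
            start_to_close_timeout=timedelta(seconds=30),
        )
```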

This can also be caused by overloaded workers, by other blocking or CPU-heavy code in the workflow, or by your data converter (payload converter or payload codec) doing slow work. Can you reliably replicate it? If so, can you reduce it down to a small, easy-to-run standalone replication?
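
For example, a minimal standalone replication can be a single script along these lines (a sketch assuming a local dev server on localhost:7233 and made-up names; the busy loop just stands in for whatever is holding up your real workflow):

```python
import asyncio

from temporalio import workflow
from temporalio.client import Client
from temporalio.worker import Worker


@workflow.defn
class ReproWorkflow:
    @workflow.run
    async def run(self) -> str:
        # Anything that keeps the workflow thread busy for more than ~2s
        # triggers the "potential deadlock" workflow task failure. The loop
        # duration is machine-dependent; bump the range if needed.
        total = 0
        for i in range(100_000_000):
            total += i
        return f"done: {total}"


async def main() -> None:
    # Assumes a local dev server, e.g. `temporal server start-dev`.
    client = await Client.connect("localhost:7233")
    async with Worker(client, task_queue="deadlock-repro", workflows=[ReproWorkflow]):
        await client.start_workflow(
            ReproWorkflow.run,
            id="deadlock-repro-wf",
            task_queue="deadlock-repro",
        )
        # The deadlock surfaces as a workflow task failure, so watch the
        # worker output and the Web UI rather than awaiting the result.
        await asyncio.sleep(15)


if __name__ == "__main__":
    asyncio.run(main())
```

If something like that reproduces the same failure you see in production, you can then swap the busy loop for pieces of your real workflow until it stops reproducing.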

It’s not reliably replicable; I have not managed to replicate it myself and have only noticed it in my production dashboards.

“This can also be caused by overloaded workers or other blocking/CPU heavy code in the workflow or if your data converter (payload converter or payload codec) is doing slow work.” Do you have any tips on how to further investigate this?

Not specifically, besides checking the resource consumption of the workers around that time. What this deadlock means is that a Python workflow thread (not to be confused with an activity thread) took more than 2 seconds to run code before yielding back to asyncio/Temporal. Workflow code usually executes in just a few milliseconds before yielding back. So somehow the workflow code (which can include payload conversion, but not codecs) in its own thread took over 2s before yielding back to Temporal. This is usually due to an overloaded worker or busy/expensive in-workflow code.
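
If it helps while you dig, the Python worker has a couple of knobs around this detector. A sketch with a placeholder workflow; note that, as far as I know, deadlock_detection_timeout is only available in more recent SDK versions, so verify it against the temporalio version you are running:

```python
from datetime import timedelta

from temporalio import workflow
from temporalio.client import Client
from temporalio.worker import Worker


@workflow.defn
class MyWorkflow:  # placeholder for your real workflow
    @workflow.run
    async def run(self) -> None:
        pass


async def run_worker() -> None:
    client = await Client.connect("localhost:7233")
    worker = Worker(
        client,
        task_queue="my-task-queue",
        workflows=[MyWorkflow],
        # Give slow-but-legitimate workflow code more headroom while you
        # investigate; the default is 2 seconds. Verify this parameter exists
        # in your temporalio version before relying on it.
        deadlock_detection_timeout=timedelta(seconds=5),
        # debug_mode=True also disables the detector, but it is meant for
        # local debugging with breakpoints rather than production use.
    )
    await worker.run()
```

Neither of these is a fix, but widening the window can help you tell whether you are dealing with genuinely slow workflow code or a starved worker.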