Cause of Temporal deadlock errors

Hi, we are having issues with sporadic Deadlock errors occurring at different points in our workflow.

Looking at other posts, it seems like the most common causes of Deadlock errors are external API calls that block, loops that never yield, or blocking function calls that are not part of the Temporal SDK APIs. Looking through the workflow code, I can’t identify any such examples (a sketch of the kind of pattern I was checking for is at the end of this post). The most confusing example I have seen is one where, in production, a deadlock error occurred after the last activity, right before WorkflowExecutionCompleted.

We directly return the response of that activity to complete the workflow, so there is no workflow code in between to investigate. I would love to hear some insight into how this is possible and what the next steps to investigate would be.
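
For reference, this is the kind of blocking anti-pattern I was scanning for (a rough, illustrative sketch with made-up names, not our actual code):

```python
from datetime import timedelta

from temporalio import activity, workflow

# Third-party imports used only by activities are best passed through the
# workflow sandbox rather than re-imported on every workflow run.
with workflow.unsafe.imports_passed_through():
    import requests


@activity.defn
def fetch_report(url: str) -> str:
    # Blocking I/O belongs in an activity, not in workflow code. (Sync
    # activities like this one need an activity_executor, e.g. a
    # ThreadPoolExecutor, configured on the Worker.)
    return requests.get(url, timeout=10).text


@workflow.defn
class ReportWorkflow:
    @workflow.run
    async def run(self, url: str) -> str:
        # BAD: calling requests.get(url) directly here would hold the
        # workflow thread and can trip the deadlock detector after ~2s.

        # GOOD: offload the blocking call to the activity instead.
        return await workflow.execute_activity(
            fetch_report,
            url,
            start_to_close_timeout=timedelta(seconds=30),
        )
```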

This can also be caused by overloaded workers, by other blocking or CPU-heavy code in the workflow, or by your data converter (payload converter or payload codec) doing slow work. Can you reliably replicate it? If so, can you reduce it down to a small, easy-to-run standalone replication?
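
For example, a minimal standalone replication can be a single script along these lines (a sketch assuming a local dev server on localhost:7233 and made-up names; the busy loop just stands in for whatever is holding up your real workflow):

```python
import asyncio

from temporalio import workflow
from temporalio.client import Client
from temporalio.worker import Worker


@workflow.defn
class ReproWorkflow:
    @workflow.run
    async def run(self) -> str:
        # Anything that keeps the workflow thread busy for more than ~2s
        # triggers the "potential deadlock" workflow task failure. The loop
        # duration is machine-dependent; bump the range if needed.
        total = 0
        for i in range(100_000_000):
            total += i
        return f"done: {total}"


async def main() -> None:
    # Assumes a local dev server, e.g. `temporal server start-dev`.
    client = await Client.connect("localhost:7233")
    async with Worker(client, task_queue="deadlock-repro", workflows=[ReproWorkflow]):
        await client.start_workflow(
            ReproWorkflow.run,
            id="deadlock-repro-wf",
            task_queue="deadlock-repro",
        )
        # The deadlock surfaces as a workflow task failure, so watch the
        # worker output and the Web UI rather than awaiting the result.
        await asyncio.sleep(15)


if __name__ == "__main__":
    asyncio.run(main())
```

If something like that reproduces the same failure you see in production, you can then swap the busy loop for pieces of your real workflow until it stops reproducing.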

It’s not reliably replicable; I have not managed to replicate it myself and have only noticed it in my production dashboards.

“This can also be caused by overloaded workers or other blocking/CPU heavy code in the workflow or if your data converter (payload converter or payload codec) is doing slow work.” Do you have any tips on how to further investigate this?

Not specifically, besides checking the resource consumption of the workers around that time. What this deadlock means is that a Python workflow thread (not to be confused with an activity thread) took more than 2 seconds to run code before yielding back to asyncio/Temporal. Workflow code usually executes in just a few milliseconds before yielding back. So somehow the workflow code (which can include payload conversion, but not codecs) in its own thread took over 2s before yielding back to Temporal. This is usually due to an overloaded worker or busy/expensive in-workflow code.
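
If it helps while you dig, the Python worker has a couple of knobs around this detector. A sketch with a placeholder workflow; note that, as far as I know, deadlock_detection_timeout is only available in more recent SDK versions, so verify it against the temporalio version you are running:

```python
from datetime import timedelta

from temporalio import workflow
from temporalio.client import Client
from temporalio.worker import Worker


@workflow.defn
class MyWorkflow:  # placeholder for your real workflow
    @workflow.run
    async def run(self) -> None:
        pass


async def run_worker() -> None:
    client = await Client.connect("localhost:7233")
    worker = Worker(
        client,
        task_queue="my-task-queue",
        workflows=[MyWorkflow],
        # Give slow-but-legitimate workflow code more headroom while you
        # investigate; the default is 2 seconds. Verify this parameter exists
        # in your temporalio version before relying on it.
        deadlock_detection_timeout=timedelta(seconds=5),
        # debug_mode=True also disables the detector, but it is meant for
        # local debugging with breakpoints rather than production use.
    )
    await worker.run()
```

Neither of these is a fix, but widening the window can help you tell whether you are dealing with genuinely slow workflow code or a starved worker.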