Hello, I am working on a system that only has 1 Worker per site environment, with a Jenkins job that triggers a number of Workflows on an hourly basis from an external script.
This means that if one of the Workflows in that script encounters a DeadlockError from not yielding within 2s, it hangs indefinitely, since there are no other Workers to pick up the task (the TMPRL1101 "deadlock" error, which under high parallelism causes the Worker to hang).
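For context on the "not yielding" part, the mechanism can be reproduced outside Temporal entirely: any blocking call inside an asyncio event loop stalls every other coroutine sharing that loop, which is what the SDK's deadlock detector flags after roughly 2s. A minimal standalone sketch (no Temporal involved; the task names and timings are just for illustration):

```python
import asyncio
import time


async def ticker(ticks: list) -> None:
    # Cooperative task that wants to append a timestamp every 50 ms
    for _ in range(3):
        await asyncio.sleep(0.05)
        ticks.append(time.monotonic())


async def main() -> float:
    ticks: list = []
    task = asyncio.create_task(ticker(ticks))
    await asyncio.sleep(0)  # yield once so the ticker starts waiting
    time.sleep(0.3)         # blocking call: the ticker cannot run during this
    await task
    return ticks[0]


start = time.monotonic()
first_tick = asyncio.run(main())
# The ticker's first 50 ms tick is delayed the full 0.3 s by the blocking
# sleep -- the same starvation the deadlock detector watches for.
delay = first_tick - start
```

In a Workflow the equivalent bug is calling something like `time.sleep` or a synchronous network client directly in Workflow code instead of awaiting an Activity.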
I’m not so interested in avoiding the DeadlockError, but rather how to query in Python whether or not a Workflow has encountered a DeadlockError so that it can be terminated externally.
So far I have something like the following, which checks the history events of every Running Workflow (specifically `history_record.workflow_task_failed_event_attributes.failure.application_failure_info.type`):
from temporalio.client import Client
from temporalio.api.enums.v1 import EventType


async def terminate_deadlocked_workflows(client: Client) -> None:
    # Deadlocked Workflows should still be stuck in a 'Running' state.
    async for running_workflow in client.list_workflows('ExecutionStatus = "Running"'):
        # Get the specific Workflow handle and fetch its history
        workflow_handle = client.get_workflow_handle(running_workflow.id)
        async for history_record in workflow_handle.fetch_history_events():
            # Only WorkflowTaskFailed events carry a failure payload,
            # so guard on the event type before reading its attributes
            if (
                history_record.event_type == EventType.EVENT_TYPE_WORKFLOW_TASK_FAILED
                and history_record.workflow_task_failed_event_attributes.failure.application_failure_info.type
                == "_DeadlockError"
            ):
                await workflow_handle.terminate(
                    reason="Manually terminated due to DeadlockError"
                )
                break
This works as far as I can tell, but is this the proper way to detect DeadlockErrors in Workflows, or is there a better way with the Python SDK?