Hi, I’m trying to understand why the following exception is triggered for a workflow that is try to spawn a lot of ChildWorkflow’s (more than 3k child workflows need to be spwaned). Is it because the EventHistory for this workflow is getting too large?

io.temporal.internal.worker.UnableToAcquireLockException Workflow lock for the run id hasn't been released by one of previous execution attempts, consider increasing workflow task timeout.

the suggested remediation is to increase the taskTimeout in the workflow options. I’m trying to understand why this timeout is getting triggered?

Thank you.

Error means that workflow task timed out (default 10s), service then dispatched it again to same worker while the previous one (that timed out) still didnt finish on same worker and when it does you get this error.
Question is yes why did your workflow task time out?

Sdk provides two metrics: workflow_task_execution_latency, workflow_task_replay_latency.
If your executions have a large number of events in their history would check workflow_task_replay_latency closer to see if that could be reason for workflow task timeout.

Would not increase default workflow task timeout in most cases. Would first try to find out cause of why its timing out first and go from there.

Thank you for getting back. I see that the workflow_task_execution_latency is around 16 seconds. Would that be triggering the timeout?

Also, what would cause the workflow_task_execution_latency to be so high?

If you use default workflow task timeout of 10s (you can look at your event history json in first WorkflowExecutionStarted event) then yes, that would cause service to time out this workflow task.

what would cause the workflow_task_execution_latency to be so high?

When a workflow task starts worker has to

  • get event history for this execution if its not in its in-memory cache and replay it against your workflow code (workflow_task_replay_latency). Note for very large event histories this can mean multiple calls to service to get full history (get history api is paginated)
  • deserialize payloads using your data converter/codec (also check if you have any remote codecs)
  • run your workflow code until a blocking point (like sync activity invocation, workflow sleep etc) to complete the workflow task and respond back to service with all accumulated commands from your workflow code.

All this has to complete within the workflow task timeout managed by service.

So would look at replay latency first, if that is well under 10s then maybe try looking at your data converters if they are taking longer than expected, would also look at your worker process cpu/mem utilization see if their are overwhelmed during this time.
Workflow code such as for example invoking some apis or doing some heavy weight operations in workflow code could also be a possible culprit.
So can network latencies between your workers and service. This is probably one scenario where you might want to increase workflow task timeout in case there is nothing that can be done about high
network latencies.

You could also try to reproduce and debug this via WorkflowReplayer in test.

Thank you for the detailed explanation.

  • workflow_task_replay_latency is well under 10s
  • the DataCoverter’s also seems to respond well within the time frame.
  • I do see that there is a place in our workflow code where we spawn thousands of ChildWorkflow’s and use the Promise::get to await on them. I’m not sure if that could cause the WorkflowTaskTimeout to be triggered.

I do see that there is a place in our workflow code where we spawn thousands of ChildWorkflow

Could you share the event history json for this execution (could send in dm if you want).

we spawn thousands of ChildWorkflow

How many? In your service logs do you see “history size exceeds error limit.”?

I don’t think I can send the event history. But, here the flow.

  1. Get a list of items from an activity
  2. For each item in the list we spawn a new child workflow and collect all the promises into a new list
  3. And then we call promises.forEach(promise::get).

In this case the List<T> returned from the activity contains about 3,000 objects so we are spawn 3,000 ChildWorkflows from the parent. I suspect that while trying to spawn so many workflows we are exceeding the timeout. But, I don’t see a single workflow being registered in the temporal UI.

Do you see the WorkflowTaskTimedOut in event history? Do you see WorkflowTaskFailed before it?
Don’t think that the number of child workflows is reason. Would try to debug with WorkflowReplayer and maybe focus on that loop in workflow code where you create the promises, see if there is any way it could loop “forever” or block on something (non-temporal apis)

In your event history, after WorkflowTaskTimedOut, do you see your execution progress or is this the last event in history?

WorkflowTaskTimedOut is the last event in the history of that workflow.

Also, in the loop I don’t see a place where we can block “forever”. It’s really trivial we just .map() the object to a different type and start a new childworkflow.

Got it, hard to pinpoint to specifics without more info, so would try to go through the list of possible causes of this that were mentioned and see if can reproduce while debugging.