Hello everyone, I was running some tests with the Bank example (Java client), and to slow down the workflows I tried adding a Thread.sleep(5)… I guess this is not right, since the worker and the server can no longer communicate, and the result is a WorkflowTaskFailed:
io.temporal.internal.sync.PotentialDeadlockException: Potential deadlock detected: workflow thread “workflow-method” didn’t yield control for over a second.
That makes sense… But if I stop the worker during the sleep and then restart it, the server records WorkflowTaskTimedOut and keeps retrying the pending tasks instead of recording WorkflowTaskFailed… Is this normal, or is it a bug? Can someone explain a bit more about it? Thanks for your help
Workflow code is executed only when a worker process receives a so-called workflow task. The workflow task has a timeout (the default is 10 seconds). See this post that explains the workflow task.
In your case, when a thread is blocked using a prohibited method (Thread.sleep instead of the supported Workflow.sleep), the deadlock detector aborts the workflow task. When a task is aborted, WorkflowTaskFailed is recorded in the history. When you kill the worker, nothing is reported back to the service, so the workflow task timeout fires and WorkflowTaskTimedOut is recorded in the history. In both cases, the task is retried to ensure that the workflow continues execution.
Hey @Lorenzo_Di_Giacomo, first let me try to explain why Thread.sleep(5) results in a WorkflowTaskFailed error. Temporal workflows have a restriction: workflow code needs to be deterministic and avoid any side effects. This restriction is needed because Temporal replays the entire workflow execution from history to recover state on a different host. Because of this restriction, there is no reason to have blocking user code (like Thread.sleep) in a workflow implementation. The SDK has a deadlock detector that fails a workflow task if the user code blocks for over a second without relinquishing control back to the pump. You can disable this behavior by setting the TEMPORAL_DEBUG environment variable to true. This is useful when you are stepping through your workflow implementation in a debugger.
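To make the deadlock detector idea more concrete, here is a minimal sketch in plain Java. It is not the SDK's implementation: the class name DeadlockDetectorSketch, the PotentialDeadlock exception, and the runStep/isFlagged helpers are all hypothetical and only illustrate the general technique of running user code on its own thread and failing the step when it does not return control within a time limit.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; the real SDK detector works differently inside.
class DeadlockDetectorSketch {

    /** Hypothetical stand-in for the SDK's PotentialDeadlockException. */
    static class PotentialDeadlock extends RuntimeException {
        PotentialDeadlock(String message) {
            super(message);
        }
    }

    /**
     * Runs one step of "workflow code" on its own thread and waits up to
     * limitMillis for it to return control. Throws PotentialDeadlock if the
     * step is still running when the limit expires.
     */
    static void runStep(Runnable step, long limitMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<?> result = executor.submit(step);
            result.get(limitMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new PotentialDeadlock(
                "step didn't yield control for over " + limitMillis + " ms");
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            // Interrupts the step's thread if it is still blocked.
            executor.shutdownNow();
        }
    }

    /** Returns true if the step was flagged as a potential deadlock. */
    static boolean isFlagged(Runnable step, long limitMillis) {
        try {
            runStep(step, limitMillis);
            return false;
        } catch (PotentialDeadlock e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Blocks the thread, like Thread.sleep inside workflow code.
        Runnable blocking = () -> {
            try {
                Thread.sleep(2000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        // Returns control immediately.
        Runnable quick = () -> { };

        System.out.println("blocking flagged: " + isFlagged(blocking, 200));
        System.out.println("quick flagged: " + isFlagged(quick, 200));
    }
}
```

In this sketch the blocking step is the one that gets flagged, which mirrors why Thread.sleep in workflow code leads to a failed workflow task while Workflow.sleep does not block the thread in the same way.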
The second issue you raised is that, with Thread.sleep in your workflow implementation, the first workflow task dispatched to the worker fails with a WorkflowTaskFailed response, but subsequent workflow tasks fail with WorkflowTaskTimedOut. Putting Thread.sleep in your workflow implementation is equivalent to introducing a bug: your workflow executions cannot make progress beyond that point, for the reason I explained above. The first time the workflow task is dispatched to a host, the resulting application failure is reported back to the service as WorkflowTaskFailed, with the details of the error, so users can easily debug issues with their implementation. Temporal immediately retries a failed workflow task, since some other worker that does not have the same problem might pick it up. If successive workflow tasks fail again, the client SDK does not report the same failure again; it just drops the workflow task, which eventually results in WorkflowTaskTimedOut. This behavior serves two purposes. First, it prevents the execution history from growing unbounded with repeated failures. Second, it prevents tight spins caused by the workflow task being dispatched immediately after each failure.
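The reporting policy described above can be pictured with a small toy model in plain Java. Everything here is an assumption made for illustration: the FailureReportingSketch class, the Outcome enum, and the handleAttempt, historyEvent, and simulate helpers are hypothetical names, and the model only encodes what is stated in this thread (the first failure is reported, later consecutive failures are dropped and surface as timeouts).

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the behavior described in this thread, not SDK code.
class FailureReportingSketch {

    /** What the (hypothetical) worker does with one workflow task attempt. */
    enum Outcome { COMPLETED, FAILURE_REPORTED, DROPPED }

    /**
     * Assumed policy: report only the first failure back to the service;
     * drop later consecutive failures so they time out on the service side.
     */
    static Outcome handleAttempt(int attempt, boolean taskSucceeded) {
        if (taskSucceeded) {
            return Outcome.COMPLETED;
        }
        return attempt == 1 ? Outcome.FAILURE_REPORTED : Outcome.DROPPED;
    }

    /** Maps the worker's action to the event the service would record. */
    static String historyEvent(Outcome outcome) {
        switch (outcome) {
            case COMPLETED:
                return "WorkflowTaskCompleted";
            case FAILURE_REPORTED:
                return "WorkflowTaskFailed";
            default:
                // Nothing was reported, so the workflow task timeout fires.
                return "WorkflowTaskTimedOut";
        }
    }

    /**
     * Simulates up to `attempts` tries of one workflow task whose code is
     * broken until attempt number `firstGoodAttempt` (e.g. a fix is deployed).
     */
    static List<String> simulate(int attempts, int firstGoodAttempt) {
        List<String> events = new ArrayList<>();
        for (int attempt = 1; attempt <= attempts; attempt++) {
            boolean succeeded = attempt >= firstGoodAttempt;
            events.add(historyEvent(handleAttempt(attempt, succeeded)));
            if (succeeded) {
                break;
            }
        }
        return events;
    }

    public static void main(String[] args) {
        // One reported failure, then timeouts, then success once the code is fixed.
        System.out.println(simulate(4, 4));
    }
}
```

The point of the model is that the history gains a single WorkflowTaskFailed entry with the error details, while later attempts are paced by the workflow task timeout rather than retried in a tight loop.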
For your situation, if you remove Thread.sleep from your workflow code, your workflow executions will start making forward progress again.