Configure retrying workflow execution after a Temporal worker outage

Hi,
Recently our DevOps team was updating infrastructure (terminating nodes), and the Temporal worker was restarted/redeployed as well. One workflow failed with:

Caused By: java.lang.RuntimeException: WorkflowTask: failure executing SCHEDULED->WORKFLOW_TASK_STARTED, transition history is [CREATED->WORKFLOW_TASK_SCHEDULED]
Potential deadlock detected: workflow thread “workflow-root” didn’t yield control for over a second.

If I understand the flow correctly,

  • Temporal has scheduled a task
  • the worker took it from a task queue but did not have enough time to execute any activity
  • the worker went down due to an infrastructure event
  • Temporal saw a timeout, recognized that as a deadlock, marked the workflow as failed, and did not retry it because of RetryPolicyNotSet

I’m looking for a way to configure retries only for infrastructure-related events. For example, it would be great to retry the workflow when the worker was not able to start executing it (e.g. because the worker was being redeployed at that moment), without retrying when an activity or the workflow business logic fails.

Here is a screenshot and exception of this case:

Event #5 WorkflowExecutionFailed - failure
io.temporal.internal.replay.InternalWorkflowTaskException: Failure handling event 3 of ‘EVENT_TYPE_WORKFLOW_TASK_STARTED’ type. IsReplaying=false, PreviousStartedEventId=3, workflowTaskStartedEventId=3, Currently Processing StartedEventId=3
io.temporal.internal.statemachines.WorkflowStateMachines.createEventProcessingException(WorkflowStateMachines.java:221)
io.temporal.internal.statemachines.WorkflowStateMachines.handleEventsBatch(WorkflowStateMachines.java:201)
io.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:175)
io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTaskImpl(ReplayWorkflowRunTaskHandler.java:177)
io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTask(ReplayWorkflowRunTaskHandler.java:146)
io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskWithEmbeddedQuery(ReplayWorkflowTaskHandler.java:201)
io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:114)
io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:319)
io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:279)
io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:73)
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
java.base/java.lang.Thread.run(Thread.java:833)

Caused By: java.lang.RuntimeException: WorkflowTask: failure executing SCHEDULED->WORKFLOW_TASK_STARTED, transition history is [CREATED->WORKFLOW_TASK_SCHEDULED]
io.temporal.internal.statemachines.StateMachine.executeTransition(StateMachine.java:151)
io.temporal.internal.statemachines.StateMachine.handleHistoryEvent(StateMachine.java:101)
io.temporal.internal.statemachines.EntityStateMachineBase.handleEvent(EntityStateMachineBase.java:67)
io.temporal.internal.statemachines.WorkflowStateMachines.handleSingleEvent(WorkflowStateMachines.java:233)
io.temporal.internal.statemachines.WorkflowStateMachines.handleEventsBatch(WorkflowStateMachines.java:199)
io.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:175)
io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTaskImpl(ReplayWorkflowRunTaskHandler.java:177)
io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTask(ReplayWorkflowRunTaskHandler.java:146)
io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskWithEmbeddedQuery(ReplayWorkflowTaskHandler.java:201)
io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:114)
io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:319)
io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:279)
io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:73)
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
java.base/java.lang.Thread.run(Thread.java:833)

Caused By: io.temporal.internal.sync.PotentialDeadlockException: Potential deadlock detected: workflow thread “workflow-root” didn’t yield control for over a second. Other workflow threads:

java.base@17.0.5/jdk.internal.misc.Unsafe.park(Native Method)
java.base@17.0.5/java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
java.base@17.0.5/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:715)
java.base@17.0.5/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:938)
java.base@17.0.5/java.util.concurrent.locks.ReentrantLock$Sync.lock(ReentrantLock.java:153)
java.base@17.0.5/java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:322)
io.temporal.internal.sync.WorkflowThreadContext.setStatus(WorkflowThreadContext.java:172)
io.temporal.internal.sync.WorkflowThreadImpl$RunnableWrapper.run(WorkflowThreadImpl.java:128)
java.base@17.0.5/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
java.base@17.0.5/java.util.concurrent.FutureTask.run(FutureTask.java:264)
java.base@17.0.5/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
java.base@17.0.5/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
java.base@17.0.5/java.lang.Thread.run(Thread.java:833)

Hi, do you set WorkflowImplementationOptions->setFailWorkflowExceptionTypes to Throwable.class when you register your workflow implementations with the worker?
That would cause your workflow to actually fail. Otherwise, this error would only block the workflow task (not fail the workflow) and allow you to fix the workflow code.
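
For reference, here is a rough sketch of what that registration looks like, in case it helps to compare against your worker setup. It is a configuration fragment, not your actual code: the OrderWorkflow interface, its implementation, the task queue name, and the local service connection are all made up for illustration, and it assumes a recent Java SDK where WorkflowServiceStubs.newLocalServiceStubs() is available.

```java
import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkflowImplementationOptions;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;

public class WorkerRegistration {

    // Hypothetical workflow, only here so the registration compiles.
    @WorkflowInterface
    public interface OrderWorkflow {
        @WorkflowMethod
        void run();
    }

    public static class OrderWorkflowImpl implements OrderWorkflow {
        @Override
        public void run() {
            // business logic would go here
        }
    }

    public static void main(String[] args) {
        // Assumption: a Temporal service reachable on the default local address.
        WorkflowServiceStubs service = WorkflowServiceStubs.newLocalServiceStubs();
        WorkflowClient client = WorkflowClient.newInstance(service);
        WorkerFactory factory = WorkerFactory.newInstance(client);
        Worker worker = factory.newWorker("order-task-queue"); // hypothetical queue name

        // With Throwable.class listed here, any exception thrown while handling
        // a workflow task fails the workflow execution itself.
        WorkflowImplementationOptions options =
                WorkflowImplementationOptions.newBuilder()
                        .setFailWorkflowExceptionTypes(Throwable.class)
                        .build();

        // Registering without these options keeps the default behaviour, where
        // such errors fail only the workflow task and the task is retried.
        worker.registerWorkflowImplementationTypes(options, OrderWorkflowImpl.class);

        factory.start();
    }
}
```

If your registration looks like this, removing the setFailWorkflowExceptionTypes call (or narrowing it to specific business exception types) should leave the workflow open when a workflow task fails.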

Caused By: java.lang.RuntimeException: WorkflowTask: failure executing SCHEDULED->WORKFLOW_TASK_STARTED, transition history is [CREATED->WORKFLOW_TASK_SCHEDULED]

Caused By: io.temporal.internal.sync.PotentialDeadlockException: Potential deadlock detected: workflow thread “workflow-root” didn’t yield control for over a second. Other workflow threads:

This error typically points to an issue in the workflow code itself (I don’t think it’s related to workers failing or restarting in your case). It can happen when you have:

  • A loop in your workflow code that spins forever
  • External API calls made directly in your workflow code (not via an activity), or a data converter that blocks for over a second (for example, on workflow input data)
  • Non-Temporal APIs used for things like synchronization, or blocking calls such as Thread.sleep
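
To make "didn’t yield control for over a second" a bit more concrete, here is a simplified, plain-Java model of the idea. It is not how the Temporal SDK implements its detector; YieldWatchdog, exceedsDeadline, and blockFor are hypothetical names invented for this sketch. It only shows that a task which blocks its thread past a deadline gets flagged, while a task that returns promptly does not.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class YieldWatchdog {

    /** Returns true if the task has not finished within deadlineMillis. */
    static boolean exceedsDeadline(Runnable task, long deadlineMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> future = executor.submit(task);
        try {
            future.get(deadlineMillis, TimeUnit.MILLISECONDS);
            return false; // task returned control in time
        } catch (TimeoutException e) {
            return true; // task is still holding its thread past the deadline
        } catch (ExecutionException e) {
            return false; // task threw an exception, but it did not block
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            executor.shutdownNow(); // interrupt the task if it is still running
        }
    }

    /** Stand-in for blocking code such as Thread.sleep or a slow external call. */
    static Runnable blockFor(long millis) {
        return () -> {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
    }

    public static void main(String[] args) {
        // A task that returns immediately is not flagged.
        System.out.println(exceedsDeadline(() -> {}, 500));
        // A task that blocks for longer than the deadline is flagged.
        System.out.println(exceedsDeadline(blockFor(500), 100));
    }
}
```

In real workflow code, the usual fix is to replace Thread.sleep with Workflow.sleep and to move external calls into activities, so the workflow thread hands control back to the SDK instead of blocking.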

Feel free to share your workflow code and we can take a look.