Potential deadlock detected

Hi @tihomir, @maxim,
With Temporal in prod, we are facing an issue where our Kubernetes pods are getting restarted suddenly.
The errors below are seen when they restart.
A few points:

  • We are using Spring Boot.
  • We instantiate workflows like this:

    WorkflowStub stub = workflowClient.newUntypedWorkflowStub(workflowName, options);
    stub.start(arg);
    // blocks the caller until the workflow completes
    return (T) stub.getResult(clazz);

  • Some of our activities call external APIs and are slow.
  • We have a timeout of 60 seconds, and at times activities do time out.
  • It might be possible (not sure) that we are running fewer instances of Temporal services like history and matching than the load requires.

Do you think any of the above can cause the pods to restart?

2022-02-09 09:30:25,529 [Workflow Executor taskQueue="queue_default", namespace="default": 957] ERROR i.t.i.r.ReplayWorkflowTaskHandler - Workflow task failure. startedEventId=3, WorkflowId=WF_ID_93bdf8b3-f43d-4158-9d73-8c85eb742a11_93bdf8b3-f43d-4158-9d73-8c85eb742a11_1644399016547, RunId=add8ee41-16fe-45e2-9072-b2ca1bacf92c. If seen continuously the workflow might be stuck.
io.temporal.internal.replay.InternalWorkflowTaskException: Failure handling event 3 of 'EVENT_TYPE_WORKFLOW_TASK_STARTED' type. IsReplaying=false, PreviousStartedEventId=3, workflowTaskStartedEventId=3, Currently Processing StartedEventId=3
	at io.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:188)
	at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleEvent(ReplayWorkflowRunTaskHandler.java:140)
	at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTaskImpl(ReplayWorkflowRunTaskHandler.java:180)
	at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTask(ReplayWorkflowRunTaskHandler.java:150)
	at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskWithEmbeddedQuery(ReplayWorkflowTaskHandler.java:201)
	at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:114)
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:319)
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:279)
	at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:73)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: WorkflowTask: failure executing SCHEDULED->WORKFLOW_TASK_STARTED, transition history is [CREATED->WORKFLOW_TASK_SCHEDULED]
	at io.temporal.internal.statemachines.StateMachine.executeTransition(StateMachine.java:140)
	at io.temporal.internal.statemachines.StateMachine.handleHistoryEvent(StateMachine.java:91)
	at io.temporal.internal.statemachines.EntityStateMachineBase.handleEvent(EntityStateMachineBase.java:63)
	at io.temporal.internal.statemachines.WorkflowStateMachines.handleEventImpl(WorkflowStateMachines.java:205)
	at io.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:173)
	... 11 common frames omitted
Caused by: io.temporal.internal.sync.PotentialDeadlockException: Potential deadlock detected: workflow thread "workflow-root" didn't yield control for over a second. Other workflow threads:



	at java.lang.String.toCharArray(String.java:2899)
	at sun.reflect.generics.parser.SignatureParser.parseTypeSig(SignatureParser.java:187)
	at sun.reflect.annotation.AnnotationParser.parseSig(AnnotationParser.java:436)
	at sun.reflect.annotation.AnnotationParser.parseAnnotation2(AnnotationParser.java:241)
	at sun.reflect.annotation.AnnotationParser.parseAnnotations2(AnnotationParser.java:120)
	at sun.reflect.annotation.AnnotationParser.parseAnnotations(AnnotationParser.java:72)
	at java.lang.reflect.Executable.declaredAnnotations(Executable.java:602)
	at java.lang.reflect.Executable.declaredAnnotations(Executable.java:600)
	at java.lang.reflect.Executable.getAnnotation(Executable.java:570)
	at java.lang.reflect.Method.getAnnotation(Method.java:622)
	at io.temporal.common.metadata.POJOActivityMethodMetadata.<init>(POJOActivityMethodMetadata.java:38)
	at io.temporal.common.metadata.POJOActivityInterfaceMetadata.getActivityInterfaceMethods(POJOActivityInterfaceMetadata.java:174)
	at io.temporal.common.metadata.POJOActivityInterfaceMetadata.<init>(POJOActivityInterfaceMetadata.java:117)
	at io.temporal.common.metadata.POJOActivityInterfaceMetadata.newInstance(POJOActivityInterfaceMetadata.java:92)
	at io.temporal.internal.sync.ActivityInvocationHandlerBase.init(ActivityInvocationHandlerBase.java:54)
	at io.temporal.internal.sync.ActivityInvocationHandler.<init>(ActivityInvocationHandler.java:57)
	at io.temporal.internal.sync.ActivityInvocationHandler.newInstance(ActivityInvocationHandler.java:45)
	at io.temporal.internal.sync.WorkflowInternal.newActivityStub(WorkflowInternal.java:231)
	at io.temporal.workflow.Workflow.newActivityStub(Workflow.java:400)

The error shown says that the workflow thread did not yield control via a Temporal API call for over a second.
This can be caused by things like:

  • A loop in your workflow code that spins forever
  • External API calls in your workflow code or data converter that block for over a second
  • Using non-Temporal APIs for things like synchronization (see the sketch after this list)
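
For illustration, a minimal sketch of the last two cases and how to avoid them (httpClient, activities, and callExternalApi are hypothetical names, not something from your code):

// Anti-pattern: blocks the workflow thread without yielding through a Temporal API.
public String process(String input) {
    String response = httpClient.call(input);  // external call made directly from workflow code
    try {
        Thread.sleep(5_000);                   // non-Temporal sleep also blocks the workflow thread
    } catch (InterruptedException e) {
        throw new RuntimeException(e);
    }
    return response;
}

// Better: delegate the external call to an activity and sleep via the Temporal API.
public String process(String input) {
    String response = activities.callExternalApi(input); // stub created with Workflow.newActivityStub
    Workflow.sleep(Duration.ofSeconds(5));               // yields control back to the framework
    return response;
}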

It would help to look at your workflow code and history to see what could be causing this.

I am not sure that these would trigger a pod restart, though.

We tried replicating this error locally by reducing the deadlock detection timeout to 2 ms:
WorkerOptions.newBuilder().setDefaultDeadlockDetectionTimeout(2).build()
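
For context, a minimal sketch of how we apply these options when creating the worker (the task queue name and MyWorkflowImpl are placeholders; WorkerFactory, Worker, and WorkerOptions come from io.temporal.worker, and the timeout is in milliseconds):

WorkerFactory factory = WorkerFactory.newInstance(workflowClient);
WorkerOptions workerOptions = WorkerOptions.newBuilder()
        .setDefaultDeadlockDetectionTimeout(2) // 2 ms, only to force the deadlock detector to fire locally
        .build();
Worker worker = factory.newWorker("queue_default", workerOptions);
worker.registerWorkflowImplementationTypes(MyWorkflowImpl.class);
factory.start();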

We could see the deadlock error for the first request, but the workflow then recovered and completed.
After that, subsequent workflows completed as expected.

Can this error have any impact on workflows? Even after this error, we see the workflows complete.

I am facing a similar problem, in that the error log shows

"WorkflowTask: failure executing SCHEDULED->WORKFLOW_TASK_STARTED, transition history is [CREATED->WORKFLOW_TASK_SCHEDULED]",

and

"Potential deadlock detected. Workflow thread \"workflow-method-1-2-16699178275836899328-fdac1f97-01ce-4837-9e32-2b975dcda099\" didn't yield control for over a second. "

If we introduce a Workflow.sleep(1) within the suspect (long-running) method, could this help satisfy the 1-second constraint and mitigate the problem? We have a longish method that performs parsing and expression evaluations and may take more than a second under some conditions.

I would recommend moving such CPU-intensive logic into an activity.
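
For example, a rough sketch of what that could look like (ParsingActivities, parseAndEvaluate, and the timeout value are hypothetical, not an existing API):

// Hypothetical activity interface for the parsing / expression-evaluation work.
@ActivityInterface
public interface ParsingActivities {
    String parseAndEvaluate(String input);
}

// In the workflow implementation, call it through an activity stub so the heavy
// work runs on an activity worker thread instead of the workflow thread.
private final ParsingActivities activities =
        Workflow.newActivityStub(
                ParsingActivities.class,
                ActivityOptions.newBuilder()
                        .setStartToCloseTimeout(Duration.ofSeconds(60))
                        .build());

// Inside the workflow method:
String result = activities.parseAndEvaluate(expression);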
