Workflow task worker error: failure executing SCHEDULED->WORKFLOW_TASK_STARTED

Seeing the following error in my logs on the workflow worker. Could you someone tell me what could be causing this?

io.temporal.internal.replay.ReplayWorkflowTaskHandler: Workflow task failure. startedEventId=9, WorkflowId=484e0fbb-e463-4c8b-af37-464408112fac, RunId=ce2ed254-8777-4c46-9f29-e375155a6ed2. If see continuously the workflow might be stuck. io.temporal.internal.replay.InternalWorkflowTaskException: Failure handling event 9 of 'EVENT_TYPE_WORKFLOW_TASK_STARTED' type. IsReplaying=false, PreviousStartedEventId=9, workflowTaskStartedEventId=9, Currently Processing StartedEventId=9
	at io.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:193) ~[pservice-1.0.jar:?]
	at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleEvent(ReplayWorkflowRunTaskHandler.java:140) ~[pservice-1.0.jar:?]
	at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTaskImpl(ReplayWorkflowRunTaskHandler.java:180) ~[pservice-1.0.jar:?]
	at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTask(ReplayWorkflowRunTaskHandler.java:150) ~[pservice-1.0.jar:?]
	at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskWithEmbeddedQuery(ReplayWorkflowTaskHandler.java:204) ~[pservice-1.0.jar:?]
	at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:114) ~[pservice-1.0.jar:?]
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:309) ~[pservice-1.0.jar:?]
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:275) ~[pservice-1.0.jar:?]
	at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:73) ~[pservice-1.0.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.lang.RuntimeException: WorkflowTask: failure executing SCHEDULED->WORKFLOW_TASK_STARTED, transition history is [CREATED->WORKFLOW_TASK_SCHEDULED]
	at io.temporal.internal.statemachines.StateMachine.executeTransition(StateMachine.java:140) ~[pservice-1.0.jar:?]
	at io.temporal.internal.statemachines.StateMachine.handleHistoryEvent(StateMachine.java:91) ~[pservice-1.0.jar:?]
	at io.temporal.internal.statemachines.EntityStateMachineBase.handleEvent(EntityStateMachineBase.java:63) ~[pservice-1.0.jar:?]
	at io.temporal.internal.statemachines.WorkflowStateMachines.handleEventImpl(WorkflowStateMachines.java:210) ~[pservice-1.0.jar:?]
	at io.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:178) ~[pservice-1.0.jar:?]
	... 11 more
Caused by: io.temporal.internal.sync.PotentialDeadlockException: Potential deadlock detected: workflow thread blocked for over a second
	at jdk.internal.misc.Unsafe.park(Native Method) ~[?:?]
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:194) ~[?:?]
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885) ~[?:?]
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:917) ~[?:?]
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1240) ~[?:?]
	at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:267) ~[?:?]
	at io.temporal.internal.sync.WorkflowThreadContext.getStatus(WorkflowThreadContext.java:153) ~[pservice-1.0.jar:?]
	at io.temporal.internal.sync.WorkflowThreadContext.initialYield(WorkflowThreadContext.java:59) ~[pservice-1.0.jar:?]
	at io.temporal.internal.sync.WorkflowThreadImpl$RunnableWrapper.run(WorkflowThreadImpl.java:106) ~[pservice-1.0.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	... 3 more

Looks like an SDK bug.
Could you reproduce it under a debugger? We need to find out which thread holds the lock that getStatus is waiting on.

This exception is there to indicate problems with the workflow code.
Is it possible that you are running something slow in your workflow tasks? (e.g. calling external APIs, using concurrency primitives, sleep, etc?)
All workflow tasks should be very lightweight and execute in the order of 10s of milliseconds, 1 second is the hard limit set in the SDK.
Also could you clarify which version of the SDK you are using? Must be 1.0.3-SNAPSHOT?

@maxim This happens very infrequently, maybe once a day and not exactly sure how, So it is going to be difficult to replicate and debug.

@Vitaly My workflow is pretty lightweight, the only this it does it call activities and branch based on activity response. The workflow also does Workflow.await() and waits for a signal.

I am using the latest sdk version: io.temporal:temporal-sdk:1.0.3

Is it possible that something like garbage collection occurs and causes this problem once in a while?
From my understanding if it’s very infrequent it shouldn’t be causing problems as workflow should be able to proceed up on the retry.
Do you have a way to view client side metrics, such as workflow task latency?

I do have client side metrics set up, i will try to post them here the next time this occurs

Sounds good, thanks for your cooperation and let us know if you find something interesting!

Deadlock detection is a new feature that we’ve added recently and we might consider adjusting its sensitivity in future. Seeing 1-off errors of this type is not concerning (especially if we can confirm that it’s a result of GC) as it shouldn’t be impacting your processing flow and transient errors should be addressed upon the retry, but if it happens regularly we would need to investigate further.

Hello Vitaly, I am getting consistent deadlock exceptions in Debugger. Ping me if you need repro steps, I can do screen sharing, will not be able send you any code.

Hello Vitaly, I am getting consistent deadlock exceptions in Debugger.

It is a known issue and @Vitaly is working on adding support for disabling the deadlock detector for debugging purposes.

This PR adds debug mode to java sdk, which can be enabled by setting environment variable TEMPORAL_DEBUG. It should be available in the next release.

If this is a blocking you right now, you can use our latest snapshot build (1.0.4-SNAPSHOT) for local debugging, which includes this feature already.

To use snapshot builds add the following repository to your gradle file:

repositories {
  ...
    maven {
        url "https://oss.sonatype.org/content/repositories/snapshots/"
    }
    ...
}

This is not a blocker for now! Thank you for getting fix into next release.