Workflow task worker error: failure executing SCHEDULED->WORKFLOW_TASK_STARTED

nithin · December 9, 2020, 6:10pm

Seeing the following error in my logs on the workflow worker. Could you someone tell me what could be causing this?

io.temporal.internal.replay.ReplayWorkflowTaskHandler: Workflow task failure. startedEventId=9, WorkflowId=484e0fbb-e463-4c8b-af37-464408112fac, RunId=ce2ed254-8777-4c46-9f29-e375155a6ed2. If see continuously the workflow might be stuck. io.temporal.internal.replay.InternalWorkflowTaskException: Failure handling event 9 of 'EVENT_TYPE_WORKFLOW_TASK_STARTED' type. IsReplaying=false, PreviousStartedEventId=9, workflowTaskStartedEventId=9, Currently Processing StartedEventId=9
	at io.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:193) ~[pservice-1.0.jar:?]
	at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleEvent(ReplayWorkflowRunTaskHandler.java:140) ~[pservice-1.0.jar:?]
	at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTaskImpl(ReplayWorkflowRunTaskHandler.java:180) ~[pservice-1.0.jar:?]
	at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTask(ReplayWorkflowRunTaskHandler.java:150) ~[pservice-1.0.jar:?]
	at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskWithEmbeddedQuery(ReplayWorkflowTaskHandler.java:204) ~[pservice-1.0.jar:?]
	at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:114) ~[pservice-1.0.jar:?]
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:309) ~[pservice-1.0.jar:?]
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:275) ~[pservice-1.0.jar:?]
	at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:73) ~[pservice-1.0.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.lang.RuntimeException: WorkflowTask: failure executing SCHEDULED->WORKFLOW_TASK_STARTED, transition history is [CREATED->WORKFLOW_TASK_SCHEDULED]
	at io.temporal.internal.statemachines.StateMachine.executeTransition(StateMachine.java:140) ~[pservice-1.0.jar:?]
	at io.temporal.internal.statemachines.StateMachine.handleHistoryEvent(StateMachine.java:91) ~[pservice-1.0.jar:?]
	at io.temporal.internal.statemachines.EntityStateMachineBase.handleEvent(EntityStateMachineBase.java:63) ~[pservice-1.0.jar:?]
	at io.temporal.internal.statemachines.WorkflowStateMachines.handleEventImpl(WorkflowStateMachines.java:210) ~[pservice-1.0.jar:?]
	at io.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:178) ~[pservice-1.0.jar:?]
	... 11 more
Caused by: io.temporal.internal.sync.PotentialDeadlockException: Potential deadlock detected: workflow thread blocked for over a second
	at jdk.internal.misc.Unsafe.park(Native Method) ~[?:?]
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:194) ~[?:?]
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885) ~[?:?]
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:917) ~[?:?]
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1240) ~[?:?]
	at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:267) ~[?:?]
	at io.temporal.internal.sync.WorkflowThreadContext.getStatus(WorkflowThreadContext.java:153) ~[pservice-1.0.jar:?]
	at io.temporal.internal.sync.WorkflowThreadContext.initialYield(WorkflowThreadContext.java:59) ~[pservice-1.0.jar:?]
	at io.temporal.internal.sync.WorkflowThreadImpl$RunnableWrapper.run(WorkflowThreadImpl.java:106) ~[pservice-1.0.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	... 3 more

maxim · December 9, 2020, 6:31pm

Looks like an SDK bug.
Could you reproduce it under a debugger? We need to find out which thread holds the lock that getStatus is waiting on.

Vitaly · December 9, 2020, 7:16pm

This exception is there to indicate problems with the workflow code.
Is it possible that you are running something slow in your workflow tasks? (e.g. calling external APIs, using concurrency primitives, sleep, etc?)
All workflow tasks should be very lightweight and execute in the order of 10s of milliseconds, 1 second is the hard limit set in the SDK.
Also could you clarify which version of the SDK you are using? Must be 1.0.3-SNAPSHOT?

nithin · December 9, 2020, 7:53pm

@maxim This happens very infrequently, maybe once a day and not exactly sure how, So it is going to be difficult to replicate and debug.

@Vitaly My workflow is pretty lightweight, the only this it does it call activities and branch based on activity response. The workflow also does Workflow.await() and waits for a signal.

I am using the latest sdk version: io.temporal:temporal-sdk:1.0.3

Vitaly · December 10, 2020, 6:14am

Is it possible that something like garbage collection occurs and causes this problem once in a while?
From my understanding if it’s very infrequent it shouldn’t be causing problems as workflow should be able to proceed up on the retry.
Do you have a way to view client side metrics, such as workflow task latency?

nithin · December 10, 2020, 9:22pm

I do have client side metrics set up, i will try to post them here the next time this occurs

Vitaly · December 11, 2020, 6:35am

Sounds good, thanks for your cooperation and let us know if you find something interesting!

Deadlock detection is a new feature that we’ve added recently and we might consider adjusting its sensitivity in future. Seeing 1-off errors of this type is not concerning (especially if we can confirm that it’s a result of GC) as it shouldn’t be impacting your processing flow and transient errors should be addressed upon the retry, but if it happens regularly we would need to investigate further.

abelokry · January 7, 2021, 6:33pm

Hello Vitaly, I am getting consistent deadlock exceptions in Debugger. Ping me if you need repro steps, I can do screen sharing, will not be able send you any code.

maxim · January 7, 2021, 6:46pm

Hello Vitaly, I am getting consistent deadlock exceptions in Debugger.

It is a known issue and @Vitaly is working on adding support for disabling the deadlock detector for debugging purposes.

Vitaly · January 8, 2021, 5:03am

This PR adds debug mode to java sdk, which can be enabled by setting environment variable TEMPORAL_DEBUG. It should be available in the next release.

If this is a blocking you right now, you can use our latest snapshot build (1.0.4-SNAPSHOT) for local debugging, which includes this feature already.

maxim · January 8, 2021, 4:56pm

To use snapshot builds add the following repository to your gradle file:

repositories {
  ...
    maven {
        url "https://oss.sonatype.org/content/repositories/snapshots/"
    }
    ...
}

abelokry · January 14, 2021, 10:30pm

This is not a blocker for now! Thank you for getting fix into next release.

Topic		Replies	Views
PotentialDeadlockException, when does this happen? Community Support java-sdk	7	794	September 28, 2024
Potential deadlock detected. Workflow thread and didn't yield control for over a second Community Support	3	734	January 9, 2024
Configure Retrying workflow execution after temporal-worker outage Community Support java-sdk , retries	1	857	November 2, 2022
Sleep in workflow Community Support java-sdk , cassandra	2	7523	March 5, 2021
Facing issues with temporal schedule Community Support	5	584	March 7, 2024

Workflow task worker error: failure executing SCHEDULED->WORKFLOW_TASK_STARTED

Related topics