Timeout issues after upgrading to 1.6.3 temporal and 1.0.5 java client

Hi Recently i upgraded to 1.0.5 fo java-sdk and 1.6.3 of temporal server, after this upgrade i see most of my workflows timing out after certain time.

If i start my worker (jvm ) things work fine, but eventually most workflows get struck when i opent he temporal ui i see Timeout event every where

Could this be because of 1.0.5 upgrade?

can the below cause this behaviour ?

or could it be because of 1.6.1 patch which contains Bugfix: activity heartbeat timeout by wxing1292 · Pull Request #1220 · temporalio/temporal · GitHub

What i have observed is most of my workflows are long running workflows, and this timeout happens immediately after reciving a signal

the 1.6.1 patch fix a bug which activity heartbeat is not correctly handled.
in your case the issue has something related to workflow task handling, i.e. SDK received the workflow task, SDK cannot send back the result of workflow task within timeout.

usually, there shall not be any workflow task failed event at the very beginning, e.g. event ID 5.

  • could you plz check you implementation, e.g. if the workflow input matches the implementation (type / position / number of inputs)
  • take a try of 1.0.4 java SDK

also, please provide the error message of event id 5

is there any thing in log which i can check to confirm that?

let us begin with please provide the error message of event id 5 (within the UI, assuming the error message is not truncated)

It’s very unlikely that changes which you’ve mentioned above have anything to do with what you are seeing.

Are there any errors in the worker log? Do you have access to worker/server metrics?
Can you also post full stack trace of the first workflow task failure?

yes see null pointer issue, in workflow, but i am what i am wondering is why is that object null?
the object i am trying to access is the input which i passed to workflow (which iam anyway able to see it in workflow ui)

Blockquote
{
“eventTime”: “2021-01-27T11:27:54.000Z”,
“eventType”: “WorkflowTaskFailed”,
“eventId”: “5”,
“details”: {
“scheduledEventId”: “2”,
“startedEventId”: “4”,
“cause”: “WORKFLOW_TASK_FAILED_CAUSE_UNSPECIFIED”,
“failure”: {
“message”: “Failure handling event 4 of ‘EVENT_TYPE_WORKFLOW_TASK_STARTED’ type. IsReplaying=false, PreviousStartedEventId=4, workflowTaskStartedEventId=4, Currently Processing StartedEventId=4”,
“source”: “JavaSDK”,
“stackTrace”: “io.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:193)\nio.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleEvent(ReplayWorkflowRunTaskHandler.java:140)\nio.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTaskImpl(ReplayWorkflowRunTaskHandler.java:180)\nio.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTask(ReplayWorkflowRunTaskHandler.java:150)\nio.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskWithEmbeddedQuery(ReplayWorkflowTaskHandler.java:201)\nio.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:114)\nio.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:314)\nio.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:280)\nio.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:73)\njava.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\njava.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\njava.lang.Thread.run(Thread.java:748)\n”,
“cause”: {
“message”: “WorkflowTask: failure executing SCHEDULED->WORKFLOW_TASK_STARTED, transition history is [CREATED->WORKFLOW_TASK_SCHEDULED]”,
“source”: “JavaSDK”,
“stackTrace”: “io.temporal.internal.statemachines.StateMachine.executeTransition(StateMachine.java:140)\nio.temporal.internal.statemachines.StateMachine.handleHistoryEvent(StateMachine.java:91)\nio.temporal.internal.statemachines.EntityStateMachineBase.handleEvent(EntityStateMachineBase.java:63)\nio.temporal.internal.statemachines.WorkflowStateMachines.handleEventImpl(WorkflowStateMachines.java:210)\nio.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:178)\nio.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleEvent(ReplayWorkflowRunTaskHandler.java:140)\nio.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTaskImpl(ReplayWorkflowRunTaskHandler.java:180)\nio.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTask(ReplayWorkflowRunTaskHandler.java:150)\nio.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskWithEmbeddedQuery(ReplayWorkflowTaskHandler.java:201)\nio.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:114)\nio.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:314)\nio.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:280)\nio.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:73)\njava.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\njava.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\njava.lang.Thread.run(Thread.java:748)\n”,
“cause”: {
“message”: “”,
“source”: “JavaSDK”,
“stackTrace”: "MyWorkflowImpl.myMethod(“MyWorkflowImpl.java:393)\nsun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\nsun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\nsun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\njava.lang.reflect.Method.invoke(Method.java:498)\nio.temporal.internal.sync.WorkflowInternal.lambda$registerListener$155fbe99$1(WorkflowInternal.java:141)\nio.temporal.internal.sync.SyncWorkflowContext.lambda$registerSignal$3ac69c8f$1(SyncWorkflowContext.java:679)\nio.temporal.internal.sync.SyncWorkflowContext.signal(SyncWorkflowContext.java:642)\nio.temporal.internal.sync.WorkflowExecuteRunnable.processSignal(WorkflowExecuteRunnable.java:71)\nio.temporal.internal.sync.SyncWorkflow.lambda$handleSignal$2(SyncWorkflow.java:136)\nio.temporal.internal.sync.CancellationScopeImpl.run(CancellationScopeImpl.java:101)\nio.temporal.internal.sync.WorkflowThreadImpl$RunnableWrapper.run(WorkflowThreadImpl.java:107)\njava.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\njava.util.concurrent.FutureTask.run(FutureTask.java:266)\njava.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\njava.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\njava.lang.Thread.run(Thread.java:748)\n”,
“cause”: null,
“applicationFailureInfo”: {
“type”: “java.lang.NullPointerException”,
“nonRetryable”: false,
“details”: null
},
“failureInfo”: “applicationFailureInfo”
},
“applicationFailureInfo”: {
“type”: “java.lang.RuntimeException”,
“nonRetryable”: false,
“details”: null
},
“failureInfo”: “applicationFailureInfo”
},
“applicationFailureInfo”: {
“type”: “io.temporal.internal.replay.InternalWorkflowTaskException”,
“nonRetryable”: false,
“details”: null
},
“failureInfo”: “applicationFailureInfo”
},
“identity”: “18073@box-i-0496636ce335c14bb”,
“baseRunId”: “”,
“newRunId”: “”,
“forkEventVersion”: “0”,
“binaryChecksum”: “”,
“eventId”: “5”,
“eventType”: “WorkflowTaskFailed”,

Can you find a stack trace for the NPE?

Blockquote
ERROR i.t.internal.worker.PollerOptions - uncaught exception
java.lang.RuntimeException: Failure processing workflow task. WorkflowId=jan20212701-wf, RunId=650e7cbe-5cbe-489e-92fa-1ae85fc005c8
at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.wrapFailure(WorkflowWorker.java:342)
at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.wrapFailure(WorkflowWorker.java:280)
at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:79)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: io.temporal.internal.replay.InternalWorkflowTaskException: Failure handling event 11 of ‘EVENT_TYPE_WORKFLOW_TASK_STARTED’ type. IsReplaying=false, PreviousStartedEventId=11, workflowTaskStartedEventId=11, Currently Processing StartedEventId=11
at io.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:193)
at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleEvent(ReplayWorkflowRunTaskHandler.java:140)
at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTaskImpl(ReplayWorkflowRunTaskHandler.java:180)
at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTask(ReplayWorkflowRunTaskHandler.java:150)
at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskWithEmbeddedQuery(ReplayWorkflowTaskHandler.java:201)
at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:114)
at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:314)
at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:280)
at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:73)
… 3 common frames omitted
Caused by: java.lang.RuntimeException: WorkflowTask: failure executing SCHEDULED->WORKFLOW_TASK_STARTED, transition history is [CREATED->WORKFLOW_TASK_SCHEDULED]
at io.temporal.internal.statemachines.StateMachine.executeTransition(StateMachine.java:140)
at io.temporal.internal.statemachines.StateMachine.handleHistoryEvent(StateMachine.java:91)
at io.temporal.internal.statemachines.EntityStateMachineBase.handleEvent(EntityStateMachineBase.java:63)
at io.temporal.internal.statemachines.WorkflowStateMachines.handleEventImpl(WorkflowStateMachines.java:210)
at io.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:178)
… 11 common frames omitted
Caused by: java.lang.NullPointerException: null
at MyWorkflowImpl.mySignalMethod(MyWorkflowImpl.java:437)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at io.temporal.internal.sync.WorkflowInternal.lambda$registerListener$155fbe99$1(WorkflowInternal.java:162)
at io.temporal.internal.sync.SignalDispatcher.handleInterceptedSignal(SignalDispatcher.java:100)
at io.temporal.internal.sync.SyncWorkflowContext.handleInterceptedSignal(SyncWorkflowContext.java:168)
at io.temporal.internal.sync.POJOWorkflowImplementationFactory$POJOWorkflowImplementation$RootWorkflowInboundCallsInterceptor.handleSignal(POJOWorkflowImplementationFactory.java:374)
at io.temporal.internal.sync.SignalDispatcher.handleSignal(SignalDispatcher.java:124)
at io.temporal.internal.sync.SyncWorkflowContext.handleSignal(SyncWorkflowContext.java:172)
at io.temporal.internal.sync.WorkflowExecuteRunnable.handleSignal(WorkflowExecuteRunnable.java:72)
at io.temporal.internal.sync.SyncWorkflow.lambda$handleSignal$2(SyncWorkflow.java:136)
at io.temporal.internal.sync.CancellationScopeImpl.run(CancellationScopeImpl.java:101)
at io.temporal.internal.sync.WorkflowThreadImpl$RunnableWrapper.run(WorkflowThreadImpl.java:107)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
… 3 common frames omitted

This excepton was on receiving a signal method. In many of my other workflows too i see timeout immediately after the signal is received.

Also I see that the the workflowhandler goes down in some time usually in half hour
I am not sure if this all in related!

in that signal, i am trying to log an attribute /field of the workflow input object, don know why the input object is null

roughtly the workflow code looks like this

Myworkflow {
Input input;
boolean siganlReceived=false;

workflowMethod(input i){
   input =il
   Workflow.await(signalRecieved);
   .....
} 

MySignal() {
  log.info("processing signal for  input  {}", input.getId()); ---> null pointer here
   signalReceived=true;

}

}

Is it possible that you perform SignalWithStart to start the workflow?
If so, signal is getting applied before workflow run method is called. (Applying signals before running workflow tasks is the way how we guarantee that signals are not getting lost)

WorkflowExecution workflowExecution = WorkflowClient.start(myworkflow::workflowMethd,input)

how wever with in the workflow i have

a) continueAsNewMyWorkflow.workflowMethod(input)
b) Workflow.retry(retryOptions, Optional.empty(), () → doJob(input.getId()));

Whatever is causing it, it looks like you are getting a signal at the same time when the workflow gets started. Can you rewrite your signal processing method to not require input being set?

Sure , i shalll dig and see if thats possible, by my fundamental question is can one rouge workflow bring entire stuff down?

  • I see signals queued up/timed out across workflows
  • I see queries queued up as reporetd here
  • I see the workflow handler not heard beating as posted in above screen shot…

My larger worry is if “am i missing something, or doing something wrong”

When a workflow constantly throws a NullPointerException its execution gets blocked. So it cannot process both queries and signals in this state. So fix the bug and it is going to work as expected.

thanks @maxim will check that out.

But what i notice is other workflows are also affected, and i am not able to query other workflows too . also any idea why the Workflow Task handler heart beat stops ?

What is the history of other affected workflow?