Nondeterministic error when temporal server down and being restart

Meli_Lee · August 27, 2022, 6:04am

Hi All

I would like to ask more regarding nondeterministic error that was being throw after temporal server restart (possibly because of the replay function), that makes the services print spam log, we need to terminate the workflow, if we don’t want to be spam with the error log.

For the introduction
This is our current flow in diagram

I would explain the chronology in here:

Workflow ID start to run at Aug 25th 11:47:51 am

image1490×475 25.4 KB
Workflow continue to run the flow :
a. Signal 1
b. Signal 2
c. Activity 1
→ Status is NOT OK
d. Activity 2
Workflow then continue to wait for Signal 1 again.
When timer for signal 1 started it’s at Aug 25th 13:19pm, those if we add 24 hr,
timer action should be triggered on Aug 26th 13:19 pm.
if within 24 hr no signal being received we gonna end this workflow.
All of the workflow is being limit using timer mechanism, so that all of this need to be finish within 24 hour.
At step number 2 while we wait Signal 1, there was an accident on our temporal server that it being shutdown at Aug 26th 09:00 am - 13:31 pm
When the temporal server being up, temporal seem to fire all the timer that actually need to be executed at 13:19 pm, and it’s actually continue to end the workflow and continue next activity

image1082×650 81.3 KB

image1087×681 36.1 KB

We have some miss configuration in here, but we don’t know if this cause non deterministic
scheduleToCloseTimeout 5s
scheduleToStartTimeout 5s
startToCloseTimeout 5s

and we think it seem to cause the activity time out,
when getting this activity error we actually have handle by checking the error and do some other activity which we wait for signal for another action

At 6.49 pm while we try to restart the server again, it caused nondeterministic error

PanicError: unknown command CommandType: Timer, ID: 25, possible causes are nondeterministic workflow definition code or incompatible change in the workflow definition
process event for QUEUE [panic]:
go.temporal.io/sdk/internal.panicIllegalState(...)
/build/vendor/go.temporal.io/sdk/internal/internal_decision_state_machine.go:409
go.temporal.io/sdk/internal.(*commandsHelper).getCommand(0xc041719f90, 0xc000000004, 0xc04261104a, 0x2, 0x0, 0x0)
/build/vendor/go.temporal.io/sdk/internal/internal_decision_state_machine.go:881 +0x185
go.temporal.io/sdk/internal.(*commandsHelper).handleTimerClosed(0xc041719f90, 0xc04261104a, 0x2, 0xc0422ae180, 0xc042898d80)
/build/vendor/go.temporal.io/sdk/internal/internal_decision_state_machine.go:1298 +0x4b
go.temporal.io/sdk/internal.(*workflowExecutionEventHandlerImpl).handleTimerFired(0xc00054a078, 0xc041df64c0)
/build/vendor/go.temporal.io/sdk/internal/internal_event_handlers.go:1097 +0x7d
go.temporal.io/sdk/internal.(*workflowExecutionEventHandlerImpl).ProcessEvent(0xc00054a078, 0xc041df64c0, 0x0, 0x0, 0x0)
/build/vendor/go.temporal.io/sdk/internal/internal_event_handlers.go:862 +0x648
go.temporal.io/sdk/internal.(*workflowExecutionContextImpl).ProcessWorkflowTask(0xc04223a600, 0xc042384a50, 0x1b06d80, 0xc00026e000, 0xc04223a600, 0x0)
/build/vendor/go.temporal.io/sdk/internal/internal_task_handlers.go:902 +0x7bf
go.temporal.io/sdk/internal.(*workflowTaskHandlerImpl).ProcessWorkflowTask(0xc00012e370, 0xc042384a50, 0xc042384ed0, 0x0, 0x0, 0x0, 0x0)
/build/vendor/go.temporal.io/sdk/internal/internal_task_handlers.go:749 +0x6d9
go.temporal.io/sdk/internal.(*workflowTaskPoller).processWorkflowTask(0xc000494820, 0xc042384a50, 0x0, 0x0)
/build/vendor/go.temporal.io/sdk/internal/internal_task_pollers.go:284 +0x4ae
go.temporal.io/sdk/internal.(*workflowTaskPoller).ProcessTask(0xc000494820, 0x109c560, 0xc042384a50, 0xc0006ab798, 0xfec75e)
/build/vendor/go.temporal.io/sdk/internal/internal_task_pollers.go:255 +0x85
go.temporal.io/sdk/internal.(*baseWorker).processTask(0xc000198780, 0x109c120, 0xc0003c5ce0)
/build/vendor/go.temporal.io/sdk/internal/internal_worker_base.go:400 +0x115
created by go.temporal.io/sdk/internal.(*baseWorker).runTaskDispatcher
/build/vendor/go.temporal.io/sdk/internal/internal_worker_base.go:305 +0xff

i will post example code in bellow thread

once again thank you

Meli_Lee · August 27, 2022, 6:19am

this is the current code example
any feedback given?

 cbOptions := workflow.ActivityOptions{
 			StartToCloseTimeout:    wfConfig.StartToCloseTimeoutInSecond,
 			ScheduleToCloseTimeout: wfConfig.ScheduleToCloseTimeoutInMinutes,
 			RetryPolicy:            retryPolicy,
 		}
 		signal1CbCtx := workflow.WithActivityOptions(ctx, cbOptions)
 		exit24Timer := false
 		isStatusOK := false
 		receiveSignalOne := false
 		duration := config.Hour24
 		endDateTime := workflow.Now(signal1CbCtx).Local().Add(duration)
 		var receiveFirstSignalAtTime time.Time

		for !exit24Timer && !isStatusOK {
			timerCtx, timerCancel := workflow.WithCancel(signal1CbCtx)
			timer := workflow.NewTimer(signal1CbCtx, duration)
			selector := workflow.NewSelector(timerCtx)
			selector.AddFuture(timer, func(f workflow.Future) {
				if signal1CbCtx.Err() != nil {
					logger.Error("Error occurred.", "error", signal1CbCtx.Err())
				} else {
					exit24Timer = true
					return
				}
			}).AddReceive(workflow.GetSignalChannel(timerCtx, utility.FirstSignal), func(c workflow.ReceiveChannel, more bool) {
				c.Receive(timerCtx, &firstReq)
				timerCancel()
				receiveFirstSignalAtTime = workflow.Now(timerCtx).Local()
				receiveSignalOne = true
			}).Select(timerCtx)

			if !exit24Timer {
				duration = endDateTime.Sub(receiveFirstSignalAtTime)
				timer2 := workflow.NewTimer(signal1CbCtx, duration)
				receiveSignalTwo := false
				for !receiveSignalTwo && !exit24Timer {
					timer2Ctx, timerCancel := workflow.WithCancel(signal1CbCtx)
					selectorTwo := workflow.NewSelector(timer2Ctx)
					selectorTwo.AddFuture(timer2, func(f workflow.Future) {
						err1 := f.Get(timer2Ctx, nil)
						if err1 == nil {
							exit24Timer = true

						} else if signal1CbCtx.Err() != nil {
							logger.Error("Error occurred.", "error", signal1CbCtx.Err())
						}
					}).AddReceive(workflow.GetSignalChannel(timer2Ctx, utility.SecondSignal), func(c workflow.ReceiveChannel, more bool) {
						c.Receive(timer2Ctx, &receiveSignalTwoReq)
						timerCancel()
						receiveSignalTwo = true
					}).Select(timer2Ctx)
				}
				if receiveSignalTwo {
					 
					isStatusOK=//assign value from executeActivity1 can be OK/NOT OK
				}
			}
		}

		//-------------------------====================================Wait Until 24 hours====================================----------------------------------------------------
		if exit24Timer && receiveSignalOne {

			isStatusOK =//assign value from outside diagram
		}

Meli_Lee · August 27, 2022, 6:26am

After we get activity timeout error, it should trigger this function bellow

we just make a function to receive signal and block the workflow to continue, by using this function :

func GetRetrySignal(ctx workflow.Context ) signal.SignalPayload {
var retrySignal signal.SignalPayload
ch := workflow.GetSignalChannel(ctx, RetryWorkflow)
ch.Receive(ctx, &retrySignal)

return retrySignal
}

Seem the cause of non deterministic error also happened on this block of code

maxim · August 27, 2022, 8:14pm

Are you on the latest version of Go SDK?

Meli_Lee · August 28, 2022, 9:28am

this is the information for the go sdk version
go.temporal.io/api v1.8.0
go.temporal.io/sdk v1.15.0
go version 1.16

Chad_Retz · August 28, 2022, 8:33pm

I will look into replicating with the code given. Is there anything besides the code I need to know to replicate? Can you reliably replicate this situation?

Technically using Time.Local is non-deterministic if this code executes on workers with different time zones. Can you confirm that all of your workers have the exact same time zone since this workflow started? I’ll look for other non-determinisms in the code too though it is possible there is an SDK bug.

Meli_Lee · August 29, 2022, 4:19am

I am running this in one site timezone, there is no need to run this in multiple timezone.
Should we provide the setup config from temporal server / db cassandra ?

Chad_Retz · August 29, 2022, 12:48pm

Not necessary as replication should be able to happen on any server. Can you reliably replicate? I am going to try and replicate from what you have given, but ideally if you can reliably replicate, if you could provide an entire runnable workflow that’d really help us debug.

Chad_Retz · August 29, 2022, 2:23pm

I am struggling to replicate. What we’d want to do is use the replayer. If you can download the JSON history from tctl or the UI, then we can run it through the replayer. You may need to take off the last history event in JSON that has the panic to try and replicate. Running through the replayer should hopefully replicate the issue.

If you want, I can also try to replicate this but I’ll need that full history JSON and the code. If this is acceptable, you can DM me here or DM me on our public Slack or send to chad@temporal.io.

Meli_Lee · August 29, 2022, 3:58pm

Sure, let me reach you through slack first, and we will post the update of our discission to this forum

Topic		Replies	Views
A workflow containing a timer encounters an error after the worker restarting Community Support go-sdk	3	109	October 24, 2024
Unknown NonDeterministicError Community Support go-sdk , activity	0	16	July 19, 2024
A sub-process with workflow.Sleep() encountered non-deterministic error after Cassandra server migration Community Support go-sdk , cassandra , timeout , server	20	1091	February 8, 2023
Workflow.Sleep caused Non-deterministic error Community Support go-sdk , general-impl	10	694	September 27, 2023
Go SDK Troubleshooting Developer Corner go-sdk	0	3153	April 14, 2022

Nondeterministic error when temporal server down and being restart

Related topics