Nondeterministic error when Temporal server goes down and is restarted

Hi all,

I would like to ask about a nondeterministic error that was thrown after our Temporal server restarted (possibly from the replay function). It makes our services spam the error log, and we have to terminate the workflow if we do not want to be spammed.

As an introduction, this is our current flow in a diagram.

I will explain the chronology here:

  1. The workflow started running on Aug 25th at 11:47:51 AM.
  2. The workflow continued through the flow:
    a. Signal 1
    b. Signal 2
    c. Activity 1 → status is NOT OK
    d. Activity 2
    The workflow then continues to wait for Signal 1 again.
    The timer for Signal 1 started at Aug 25th 13:19, so adding 24 hours,
    the timer action should be triggered on Aug 26th at 13:19.
    If no signal is received within those 24 hours, we end the workflow.
  3. The whole workflow is bounded by this timer mechanism, so everything needs to finish within 24 hours.
  4. While we were waiting for Signal 1 in step 2, there was an accident on our Temporal server: it was shut down from Aug 26th 09:00 until 13:31.
    When the Temporal server came back up, Temporal seemed to fire all the timers that should have executed at 13:19, and it continued to end the workflow and run the next activity.

We also have a misconfiguration here, but we do not know whether it causes the nondeterminism:

scheduleToCloseTimeout: 5s
scheduleToStartTimeout: 5s
startToCloseTimeout: 5s
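In Go those three timeouts map onto `workflow.ActivityOptions` roughly like this (a configuration sketch only — the field names are from the Go SDK, the 5-second values are our misconfigured ones):

```go
ao := workflow.ActivityOptions{
	// Total time allowed, including queueing and all retry attempts.
	ScheduleToCloseTimeout: 5 * time.Second,
	// Maximum time the task may wait in the task queue before a worker picks it up.
	ScheduleToStartTimeout: 5 * time.Second,
	// Maximum duration of a single activity attempt.
	StartToCloseTimeout: 5 * time.Second,
}
ctx = workflow.WithActivityOptions(ctx, ao)
```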

We think this configuration causes the activity to time out.
When we get that activity timeout error, we handle it by checking the error and running another activity, in which we wait for a signal before taking further action.
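Concretely, the error check looks roughly like this (a sketch, not our exact code; `MyActivity` and `handleTimeoutFallback` are placeholder names). The Go SDK wraps activity timeouts in `*temporal.TimeoutError`, which `errors.As` can unwrap:

```go
err := workflow.ExecuteActivity(ctx, MyActivity).Get(ctx, nil)
var timeoutErr *temporal.TimeoutError
if errors.As(err, &timeoutErr) {
	// On timeout, branch to a recovery path that waits for a retry signal.
	handleTimeoutFallback(ctx)
}
```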

At 6:49 PM, while we were trying to restart the server again, a nondeterministic error occurred:

PanicError: unknown command CommandType: Timer, ID: 25, possible causes are nondeterministic workflow definition code or incompatible change in the workflow definition
process event for QUEUE [panic]:
go.temporal.io/sdk/internal.panicIllegalState(...)
/build/vendor/go.temporal.io/sdk/internal/internal_decision_state_machine.go:409
go.temporal.io/sdk/internal.(*commandsHelper).getCommand(0xc041719f90, 0xc000000004, 0xc04261104a, 0x2, 0x0, 0x0)
/build/vendor/go.temporal.io/sdk/internal/internal_decision_state_machine.go:881 +0x185
go.temporal.io/sdk/internal.(*commandsHelper).handleTimerClosed(0xc041719f90, 0xc04261104a, 0x2, 0xc0422ae180, 0xc042898d80)
/build/vendor/go.temporal.io/sdk/internal/internal_decision_state_machine.go:1298 +0x4b
go.temporal.io/sdk/internal.(*workflowExecutionEventHandlerImpl).handleTimerFired(0xc00054a078, 0xc041df64c0)
/build/vendor/go.temporal.io/sdk/internal/internal_event_handlers.go:1097 +0x7d
go.temporal.io/sdk/internal.(*workflowExecutionEventHandlerImpl).ProcessEvent(0xc00054a078, 0xc041df64c0, 0x0, 0x0, 0x0)
/build/vendor/go.temporal.io/sdk/internal/internal_event_handlers.go:862 +0x648
go.temporal.io/sdk/internal.(*workflowExecutionContextImpl).ProcessWorkflowTask(0xc04223a600, 0xc042384a50, 0x1b06d80, 0xc00026e000, 0xc04223a600, 0x0)
/build/vendor/go.temporal.io/sdk/internal/internal_task_handlers.go:902 +0x7bf
go.temporal.io/sdk/internal.(*workflowTaskHandlerImpl).ProcessWorkflowTask(0xc00012e370, 0xc042384a50, 0xc042384ed0, 0x0, 0x0, 0x0, 0x0)
/build/vendor/go.temporal.io/sdk/internal/internal_task_handlers.go:749 +0x6d9
go.temporal.io/sdk/internal.(*workflowTaskPoller).processWorkflowTask(0xc000494820, 0xc042384a50, 0x0, 0x0)
/build/vendor/go.temporal.io/sdk/internal/internal_task_pollers.go:284 +0x4ae
go.temporal.io/sdk/internal.(*workflowTaskPoller).ProcessTask(0xc000494820, 0x109c560, 0xc042384a50, 0xc0006ab798, 0xfec75e)
/build/vendor/go.temporal.io/sdk/internal/internal_task_pollers.go:255 +0x85
go.temporal.io/sdk/internal.(*baseWorker).processTask(0xc000198780, 0x109c120, 0xc0003c5ce0)
/build/vendor/go.temporal.io/sdk/internal/internal_worker_base.go:400 +0x115
created by go.temporal.io/sdk/internal.(*baseWorker).runTaskDispatcher
/build/vendor/go.temporal.io/sdk/internal/internal_worker_base.go:305 +0xff

I will post example code in the thread below.

Once again, thank you.

This is the current code example. Any feedback is appreciated.

		cbOptions := workflow.ActivityOptions{
			StartToCloseTimeout:    wfConfig.StartToCloseTimeoutInSecond,
			ScheduleToCloseTimeout: wfConfig.ScheduleToCloseTimeoutInMinutes,
			RetryPolicy:            retryPolicy,
		}
		signal1CbCtx := workflow.WithActivityOptions(ctx, cbOptions)
		exit24Timer := false
		isStatusOK := false
		receiveSignalOne := false
		duration := config.Hour24
		endDateTime := workflow.Now(signal1CbCtx).Local().Add(duration)
		var receiveFirstSignalAtTime time.Time

		for !exit24Timer && !isStatusOK {
			timerCtx, timerCancel := workflow.WithCancel(signal1CbCtx)
			timer := workflow.NewTimer(signal1CbCtx, duration)
			selector := workflow.NewSelector(timerCtx)
			selector.AddFuture(timer, func(f workflow.Future) {
				if signal1CbCtx.Err() != nil {
					logger.Error("Error occurred.", "error", signal1CbCtx.Err())
				} else {
					// The 24-hour timer fired without receiving Signal 1.
					exit24Timer = true
				}
			}).AddReceive(workflow.GetSignalChannel(timerCtx, utility.FirstSignal), func(c workflow.ReceiveChannel, more bool) {
				c.Receive(timerCtx, &firstReq)
				timerCancel()
				receiveFirstSignalAtTime = workflow.Now(timerCtx).Local()
				receiveSignalOne = true
			}).Select(timerCtx)

			if !exit24Timer {
				// Shrink the remaining window and wait for Signal 2.
				duration = endDateTime.Sub(receiveFirstSignalAtTime)
				timer2 := workflow.NewTimer(signal1CbCtx, duration)
				receiveSignalTwo := false
				for !receiveSignalTwo && !exit24Timer {
					timer2Ctx, timerCancel := workflow.WithCancel(signal1CbCtx)
					selectorTwo := workflow.NewSelector(timer2Ctx)
					selectorTwo.AddFuture(timer2, func(f workflow.Future) {
						err1 := f.Get(timer2Ctx, nil)
						if err1 == nil {
							exit24Timer = true
						} else if signal1CbCtx.Err() != nil {
							logger.Error("Error occurred.", "error", signal1CbCtx.Err())
						}
					}).AddReceive(workflow.GetSignalChannel(timer2Ctx, utility.SecondSignal), func(c workflow.ReceiveChannel, more bool) {
						c.Receive(timer2Ctx, &receiveSignalTwoReq)
						timerCancel()
						receiveSignalTwo = true
					}).Select(timer2Ctx)
				}
				if receiveSignalTwo {
					// isStatusOK = result of executeActivity1 (can be OK / NOT OK)
				}
			}
		}

		// ==================== Wait until 24 hours ====================
		if exit24Timer && receiveSignalOne {
			// isStatusOK = value assigned from outside the diagram
		}

After we get the activity timeout error, it should trigger the function below.

We made a helper that receives a signal and blocks the workflow from continuing:

func GetRetrySignal(ctx workflow.Context) signal.SignalPayload {
	var retrySignal signal.SignalPayload
	ch := workflow.GetSignalChannel(ctx, RetryWorkflow)
	ch.Receive(ctx, &retrySignal)
	return retrySignal
}
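For illustration, the timeout-handling branch blocks on this helper roughly like so (a sketch; the surrounding retry logic is a placeholder, not our actual code):

```go
// Block until an operator sends the RetryWorkflow signal.
retryPayload := GetRetrySignal(ctx)
logger.Info("Retry signal received", "payload", retryPayload)
// ...then re-run the failed activity from here...
```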

It seems the nondeterministic error also originated in this block of code.

Are you on the latest version of the Go SDK?

This is the Go SDK version information:

go.temporal.io/api v1.8.0
go.temporal.io/sdk v1.15.0
Go 1.16

I will look into replicating with the code given. Is there anything besides the code I need to know to replicate? Can you reliably replicate this situation?

Technically using Time.Local is non-deterministic if this code executes on workers with different time zones. Can you confirm that all of your workers have the exact same time zone since this workflow started? I’ll look for other non-determinisms in the code too though it is possible there is an SDK bug.

We run this at a single site in one timezone; there is no need to run it in multiple timezones.
Should we provide the setup config for the Temporal server / Cassandra DB?

Not necessary, as replication should be possible on any server. Can you reliably replicate? I am going to try to replicate from what you have given, but if you can reliably replicate, providing an entire runnable workflow would really help us debug.

I am struggling to replicate. What we’d want to do is use the replayer. If you can download the JSON history from tctl or the UI, then we can run it through the replayer. You may need to take off the last history event in JSON that has the panic to try and replicate. Running through the replayer should hopefully replicate the issue.
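A minimal replayer harness looks something like this (a sketch; `YourWorkflow` and the `history.json` filename are placeholders for your workflow function and the downloaded history):

```go
package main

import (
	"log"

	"go.temporal.io/sdk/worker"
)

func main() {
	// Replays the recorded event history against the current workflow code;
	// a nondeterminism error here reproduces the panic without a server.
	replayer := worker.NewWorkflowReplayer()
	replayer.RegisterWorkflow(YourWorkflow)
	if err := replayer.ReplayWorkflowHistoryFromJSONFile(nil, "history.json"); err != nil {
		log.Fatal(err)
	}
}
```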

If you want, I can also try to replicate this but I’ll need that full history JSON and the code. If this is acceptable, you can DM me here or DM me on our public Slack or send to chad@temporal.io.

Sure, let me reach out to you through Slack first, and we will post an update from our discussion back to this forum.