Workflow.Sleep caused Non-deterministic error

Hi, our workflow is currently hitting a non-determinism error. The code uses workflow.Sleep() to implement the following logic:

It sleeps for 6 hours, executes an activity, and then sleeps again if the success condition is not met.

After 6 hours, the timer fires, and the function works as expected.

A sleep timer started at Aug 10th 6:30:00 pm. It was expected to fire 6 hours later, at Aug 11th 12:30:00 am. However, a problem occurred while we were switching the DB (where the Cassandra data is stored) to another site. The switchover happened during the sleep period. After it completed, the timer fired at Aug 11th 1:02:46 am (about 32 minutes late).

The workflow then got a WorkflowTaskTimedOut error.

The next timer started at Aug 11th 1:02:59 am and fired at Aug 11th 7:02:59 am. The workflow then got WorkflowTaskFailed:

PanicError: unknown command CommandType: Timer, ID: 384, possible causes are nondeterministic workflow definition code or incompatible change in the workflow definition
process event for PSB_WEB_WF_QUEUE [panic]:
go.temporal.io/sdk/internal.panicIllegalState(...)
/build/vendor/go.temporal.io/sdk/internal/internal_decision_state_machine.go:409
go.temporal.io/sdk/internal.(*commandsHelper).getCommand(0xc040de3360, 0xc000000004, 0xc0005b7028, 0x3, 0x0, 0x0)

So there was no code change, only the DB migration, which delayed the sleep timer and was followed by the non-determinism error (NDE).
My question is: in this scenario, can the transaction still continue? Or is it lost, so that it can only be terminated?

Thank you.

How do you calculate the delay? You have to use workflow.Now(…) to get the current time.

Hi Maxim. The delay is calculated using subtraction. Below is the implementation:

deliveryHasIncident := false
subProcessDurationLeft := 48 * time.Hour // total budget: 2 days
sleepDuration := 6 * time.Hour           // interval between attempts

for {
  // executeActivity stands in for the real activity call.
  status = executeActivity()

  if status == SUCCESS {
    break
  }
  if subProcessDurationLeft <= 0 {
    // Budget exhausted without success.
    deliveryHasIncident = true
    break
  }

  // Sleep for the full interval, or for whatever budget is left if that is shorter.
  if subProcessDurationLeft < sleepDuration {
    workflow.Sleep(ctx, subProcessDurationLeft)
  } else {
    workflow.Sleep(ctx, sleepDuration)
  }

  subProcessDurationLeft -= sleepDuration
}

Do you have any suggestion on how to implement the sleep function?

Could you share the full history of the workflow? Given your description of the workflow, I don’t understand what events occurred between event IDs 384 and 391.

Below is the history:

Thank you for sharing the history. It looks like your history is corrupted: the TimerFired event at event ID 389 says its timer was started at event ID 384, but event 384 is a WorkflowTaskScheduled event, not a timer event. That bad TimerFired event was written by the server. Was the DB you migrated the one that backs your Temporal cluster?
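
The kind of inconsistency described here can be checked mechanically. The sketch below is only an illustration: the Event struct is a simplified, hypothetical stand-in and not the Temporal API type, and findBadTimerFired is an invented helper name. Only event IDs 384 and 389 and their types come from this thread.

```go
package main

import "fmt"

// Event is a simplified, hypothetical history event (not the Temporal API type).
type Event struct {
	ID             int64
	Type           string
	StartedEventID int64 // only meaningful for TimerFired events
}

// findBadTimerFired returns the IDs of TimerFired events whose
// StartedEventID does not point at a TimerStarted event.
func findBadTimerFired(history []Event) []int64 {
	typeByID := make(map[int64]string, len(history))
	for _, e := range history {
		typeByID[e.ID] = e.Type
	}
	var bad []int64
	for _, e := range history {
		if e.Type == "TimerFired" && typeByID[e.StartedEventID] != "TimerStarted" {
			bad = append(bad, e.ID)
		}
	}
	return bad
}

func main() {
	// The two events named in the thread.
	history := []Event{
		{ID: 384, Type: "WorkflowTaskScheduled"},
		{ID: 389, Type: "TimerFired", StartedEventID: 384},
	}
	fmt.Println(findBadTimerFired(history))
}
```

A worker replaying this history hits the same mismatch: it finds a TimerFired event for a timer its own state machine never recorded at that ID, which is consistent with the "unknown command CommandType: Timer, ID: 384" panic above.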

Yes, that is correct. Is the history corrupted because of the workflow.Sleep() implementation? Without it, would the history be fine, with Temporal recovering automatically once the DB is up again?

If a transaction history is corrupted, can it still be recovered?

My guess is that it is corrupted because the DB was not consistent.

"switching the DB to another site"

Cassandra cross-cluster replication is known not to be consistent and should not be used with Temporal.

I see. So in my case, this transaction can no longer be saved and can only be terminated?

You can reset this workflow to the point before the corruption.
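
As a rough sketch of how a reset point might be picked, assuming the reset is pointed at a WorkflowTaskCompleted event that precedes the first corrupted event: the Event struct and the helper lastCompletedTaskBefore are hypothetical, and the event IDs in main are invented for illustration and are not taken from the history in this thread.

```go
package main

import "fmt"

// Event is a simplified, hypothetical history event (not the Temporal API type).
type Event struct {
	ID   int64
	Type string
}

// lastCompletedTaskBefore returns the ID of the latest WorkflowTaskCompleted
// event whose ID is lower than badID. ok is false if there is none.
func lastCompletedTaskBefore(history []Event, badID int64) (id int64, ok bool) {
	for _, e := range history {
		if e.Type == "WorkflowTaskCompleted" && e.ID < badID && e.ID > id {
			id, ok = e.ID, true
		}
	}
	return id, ok
}

func main() {
	// Invented history; event 7 plays the role of the corrupted event.
	history := []Event{
		{ID: 1, Type: "WorkflowExecutionStarted"},
		{ID: 2, Type: "WorkflowTaskScheduled"},
		{ID: 3, Type: "WorkflowTaskStarted"},
		{ID: 4, Type: "WorkflowTaskCompleted"},
		{ID: 5, Type: "TimerStarted"},
		{ID: 6, Type: "WorkflowTaskScheduled"},
		{ID: 7, Type: "TimerFired"},
	}
	id, ok := lastCompletedTaskBefore(history, 7)
	fmt.Println(id, ok)
}
```

The resulting event ID is what would be supplied to the reset operation. Everything after that point is replayed in a new run, so any activities after it execute again.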

I see. I just realized that there is a reset feature. Thank you.