Hi, our workflow is currently hitting a non-determinism error (NDE). The code uses workflow.Sleep() to implement the following logic:
It sleeps for 6 hours, then executes an activity, then sleeps again if the success condition is not met.
Normally, the timer fires after 6 hours and the function works as expected.
A sleep timer started on Aug 10th at 6:30:00 pm. It was expected to fire 6 hours later, at 12:30:00 am on Aug 11th. However, a problem occurred while we were switching the DB (where the Cassandra data is stored) to another site. The switchover happened during the sleep period. After it completed, the timer fired at 1:02:46 am on Aug 11th (about 32 minutes late).
Then the workflow got a WorkflowTaskTimedOut error.
The next timer started at 1:02:59 am on Aug 11th and fired at 7:02:59 am the same day. The workflow then got WorkflowTaskFailed:
PanicError: unknown command CommandType: Timer, ID: 384, possible causes are nondeterministic workflow definition code or incompatible change in the workflow definition
process event for PSB_WEB_WF_QUEUE [panic]:
go.temporal.io/sdk/internal.panicIllegalState(...)
/build/vendor/go.temporal.io/sdk/internal/internal_decision_state_machine.go:409
go.temporal.io/sdk/internal.(*commandsHelper).getCommand(0xc040de3360, 0xc000000004, 0xc0005b7028, 0x3, 0x0, 0x0)
So there was no code change, only a DB migration, which delayed the sleep timer and was followed by the NDE.
My question is: in this scenario, can the transaction still continue? Or is it lost, so that it can only be terminated?
Thank you.